Deploy optimized transformer-based models on NVIDIA Triton server

Overview

🤗 Hugging Face Transformer sub-millisecond inference 🤯 and deployment on NVIDIA Triton server

Yes, you can perform inference with a transformer-based model in less than 1ms on the cheapest GPU available on Amazon (the T4)!

The commands below have been tested on an AWS g4dn instance with the Deep Learning Base AMI (Ubuntu 18.04) Version 44.0. They may require small adaptations to run on another Linux distribution.

You can find explanations of how it works in the article Hugging Face Transformer inference UNDER 1 millisecond latency.

Baseline set by Hugging Face Infinity demo

Hugging Face Infinity demo video

  • AWS virtual machine: g4dn.xlarge (T4 GPU)
  • model: "philschmid/MiniLM-L6-H384-uncased-sst2" (Hugging Face Hub model)
  • experiment 1: batch size 1, seq len 16 tokens -> 1.7ms
  • experiment 2: batch size 1, seq len 128 tokens -> 2.5ms

Install dependencies

These dependencies have to be installed directly on the remote machine (not in a container).

git clone [email protected]:ELS-RD/triton_transformers.git
pip3 install -r requirements.txt

Generate optimized models

We generate the models from a Docker image so that we can also get measurements for TensorRT + ONNX Runtime.

cd triton_transformers
DOCKER_BUILDKIT=1 docker build --tag onnxruntime-trt:latest -f Dockerfile .
docker run -it --rm --gpus all -v $PWD:/project onnxruntime-trt bash -c "cd /project && python convert_onnx.py"

⚠️ WARNING ⚠️ : if you run the conversion outside the Docker container, you may get very different timings, and TensorRT won't work

It should produce something like this:

10/31/2021 11:35:08 INFO     inference done on Tesla T4
10/31/2021 11:35:08 INFO     timing [[TensorrtExecutionProvider] ./onnx_models/model-shape.onnx]: mean=0.61ms, sd=0.11ms, min=0.52ms, max=0.92ms, median=0.54ms, 95p=0.88ms, 99p=0.90ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model.onnx]: mean=1.10ms, sd=0.10ms, min=1.04ms, max=3.44ms, median=1.07ms, 95p=1.29ms, 99p=1.36ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model-optimized.onnx]: mean=0.63ms, sd=0.05ms, min=0.60ms, max=0.84ms, median=0.61ms, 95p=0.77ms, 99p=0.79ms
10/31/2021 11:35:08 INFO     timing [Pytorch_32]: mean=5.09ms, sd=0.16ms, min=4.88ms, max=6.11ms, median=5.07ms, 95p=5.28ms, 99p=5.35ms
10/31/2021 11:35:08 INFO     timing [Pytorch_FP16]: mean=6.04ms, sd=0.74ms, min=5.77ms, max=28.79ms, median=6.05ms, 95p=6.19ms, 99p=6.29ms

TensorRT and optimized ONNX Runtime provide very similar results on short sequences. In the following steps, we will continue with the ONNX Runtime model because its dynamic axes are easier to work with than TensorRT's.
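To make the dynamic axes point concrete, here is a minimal sketch (an illustration, not one of the repo scripts) that pushes two different sequence lengths through the same ONNX Runtime session without rebuilding anything; the model path and tokenizer match the conversion step above, and the input-name filtering is an assumption about how the model was exported:

# minimal sketch: one ONNX Runtime session, two sequence lengths, no engine rebuild
from transformers import AutoTokenizer
import onnxruntime as ort

tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")
session = ort.InferenceSession(
    "./onnx_models/model-optimized.onnx", providers=["CUDAExecutionProvider"]
)
expected_inputs = {i.name for i in session.get_inputs()}

for text in ["short sentence", "a much longer sentence " * 8]:
    tokens = tokenizer(text, return_tensors="np")
    # keep only the tensors the exported graph actually declares (export-dependent)
    feed = {name: array for name, array in tokens.items() if name in expected_inputs}
    logits = session.run(None, feed)[0]
    print(len(tokens["input_ids"][0]), "tokens ->", logits.shape)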

The Docker build is very slow on a g4dn instance, be patient... The Docker image is only required for TensorRT support inside ONNX Runtime (and to measure the difference, if any, with plain ONNX Runtime).

FastAPI server

This is our baseline, easy to run, but not very performant.

# launch server, disable logging for best performances
python3 -m uvicorn --log-level warning server_onnx:app --port 8000 --host 0.0.0.0
# alternative: a single worker for best latency (it's also not a good idea to load the same model several times on a single GPU):
python3 -m gunicorn -w 1 -k uvicorn.workers.UvicornWorker --log-level warning server_onnx:app --bind 0.0.0.0:8000

# simple inference timing
time curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict
# slightly more serious measure
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo perf stat -r 50 -d curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict -s > /dev/null

The perf stat run should produce something like:

Performance counter stats for 'curl -G --data-urlencode query=This live event is great. I will sign-up for Infinity. localhost:8000/predict' (50 runs):

              6.14 msec task-clock                #    0.494 CPUs utilized            ( +-  0.59% )
                 3      context-switches          #    0.462 K/sec                    ( +-  1.84% )
                 0      cpu-migrations            #    0.000 K/sec                  
               577      page-faults               #    0.094 M/sec                    ( +-  0.06% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

         0.0124429 +- 0.0000547 seconds time elapsed  ( +-  0.44% )
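For reference, the baseline endpoint can be as small as the sketch below; this is an assumption of what server_onnx.py roughly does (names and paths are illustrative), not a copy of the repo file:

# hypothetical minimal server_onnx.py-style app (run with uvicorn as shown above)
from fastapi import FastAPI
from transformers import AutoTokenizer
import onnxruntime as ort

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")
session = ort.InferenceSession(
    "./onnx_models/model-optimized.onnx", providers=["CUDAExecutionProvider"]
)
expected_inputs = {i.name for i in session.get_inputs()}

@app.get("/predict")
def predict(query: str):
    # tokenize the query string and feed only the inputs the graph declares
    tokens = tokenizer(query, return_tensors="np")
    feed = {name: array for name, array in tokens.items() if name in expected_inputs}
    logits = session.run(None, feed)[0]
    return {"logits": logits.tolist()}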

Triton server

Copy the ONNX model generated in the first step to the Triton model folder, then launch the Triton image. As you can see, we install Transformers and then launch the server itself in a single command. This is of course bad practice; you should build your own two-line Dockerfile with Transformers installed.

# copy the generated model to triton model folder
cp ./onnx_models/model-optimized.onnx ./triton_models/sts/1/model.onnx
# install transformers (and its tokenizer) and launch server in a single line, ugly but good enough for our demo
# --shm-size 256m -> to allow several Python backend instances at the same time
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

Triton server perf analysis

You need to edit the source code of triton_transformers.py to switch between the 16 and 128 token sequences (both texts are already included). A minimal Python client sketch is shown after the timing outputs below.

  • 16 tokens:
ubuntu@ip-172-31-31-84:~/triton_transformers$ python3 triton_transformers.py 
10/31/2021 12:09:34 INFO     timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
[[-3.4355469  3.2753906]]
  • 128 tokens:
ubuntu@ip-XXX:~/triton_transformers$ python3 triton_transformers.py 
10/31/2021 12:12:00 INFO     timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
[[-3.4589844  3.3027344]]
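For reference, a minimal version of such a client could look like the sketch below; it assumes the ensemble model is called transformers and takes a single BYTES input named TEXT (as in the perf_analyzer command further down), and the output tensor name is an assumption:

# illustrative Triton HTTP client call (the repo script does the real timing loop)
import numpy as np
import tritonclient.http

triton_client = tritonclient.http.InferenceServerClient(url="127.0.0.1:8000", verbose=False)
text = np.asarray(["This live event is great. I will sign-up for Infinity."], dtype=object)
query = tritonclient.http.InferInput(name="TEXT", shape=(1,), datatype="BYTES")
query.set_data_from_numpy(text)
# "output" is an assumed name, check the ensemble config.pbtxt for the real one
output = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)
response = triton_client.infer(model_name="transformers", inputs=[query], outputs=[output])
print(response.as_numpy("output"))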

There is also a more serious performance analysis tool called perf_analyzer (it takes care of checking that measures are stable, etc.); see its documentation. The tool needs to be run on Ubuntu >= 20.04 (it won't work on the Ubuntu 18.04 used by the official AWS deep learning image). It can also take measures on TorchServe and TensorFlow.

# perf_analyzer needs this dependency
sudo apt install libb64-dev
# add -a for async measures, and -i grpc to use that protocol instead of http 
~/.local/bin/perf_analyzer -m transformers --percentile=95 --input-data perf_data.json --shape TEXT:1 # -i grpc -a
# just test the model part (easier to get random input)
~/.local/bin/perf_analyzer --input-data zero -m sts --shape input_ids:1,16 --shape attention_mask:1,16 #-i grpc -a

Call Triton HTTP API directly

If you don't want to use the tritonclient API, you can call the Triton server in one of these ways:

# if you like Python requests library
python3 triton_requests.py

# if you want a generic HTTP call, the @ tells curl to send the file content as-is (no data conversion)
curl -X POST  http://localhost:8000/v2/models/transformers/versions/1/infer \
  --data-binary "@query_body.bin" \
  --header "Inference-Header-Content-Length: 160"

Use TensorRT model in Triton server (instead of ONNX)

To use the TensorRT model instead of the ONNX Runtime one:

  • we need to convert the ONNX model to a TensorRT engine
  • we need to update the configuration: TensorRT takes int32 inputs instead of int64
# we use a Docker container to guarantee the use of the right trtexec version (otherwise you will get a deserialization error)
# it's a basic conversion; in real life you want to provide at least minimum, optimum and maximum shapes
# it may take a few minutes...
docker run -it --rm --gpus all -v $PWD/onnx_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
    /usr/src/tensorrt/bin/trtexec \
    --onnx=/models/model.onnx \
    --best \
    --minShapes=input_ids:1x16,attention_mask:1x16 \
    --optShapes=input_ids:1x16,attention_mask:1x16 \
    --maxShapes=input_ids:32x16,attention_mask:32x16  \
    --saveEngine="/models/model.plan" \
    --workspace=6000 \
    --useCudaGraph

# move to triton model folder
cp ./onnx_models/model.plan ./triton_models/sts/1/model.plan

You then need to update your config.pbtxt files in the sts and tokenizer folders: replace all TYPE_INT64 tensor types with TYPE_INT32. In the sts configuration file, also replace platform: "onnxruntime_onnx" with platform: "tensorrt_plan". Finally, convert the numpy tensors to int32 in the tokenizer Python code, as below (notice the astype()):

input_ids = pb_utils.Tensor("INPUT_IDS", tokens['input_ids'].astype(np.int32))
attention = pb_utils.Tensor("ATTENTION", tokens['attention_mask'].astype(np.int32))
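For context, those two lines live in the tokenizer's Python backend model.py; below is a hedged sketch of what the surrounding execute() method can look like (illustrative; the input/output names follow the snippet above and the perf_analyzer example):

# illustrative Triton Python backend tokenizer, int32 outputs for the TensorRT engine
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")

    def execute(self, requests):
        responses = []
        for request in requests:
            # decode the raw BYTES input into Python strings
            query = [
                t.decode("utf-8")
                for t in pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy().tolist()
            ]
            tokens = self.tokenizer(query, return_tensors="np")
            # TensorRT expects int32 tensors, hence the astype()
            input_ids = pb_utils.Tensor("INPUT_IDS", tokens["input_ids"].astype(np.int32))
            attention = pb_utils.Tensor("ATTENTION", tokens["attention_mask"].astype(np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[input_ids, attention]))
        return responses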

And you are done!

Comments
  • Support for large models (external data format)

    Support for large models (external data format)

    This PR closes #59.

    Changelog:

    • Refactored dockerfile and fixed dependencies to cope with: python3: /root/gpgpu/MachineLearning/myelin/src/compiler/optimizer/reshape_ppg.cpp:950: void myelin::ir::reshape_ppg_t::transform_op(myelin::ir::bb_t*, myelin::ir::operation_t*): Assertion `op->outputs()[0]->dimensions().size() == 3' failed.
    • Bumped patch version
    • Added --fast argument to skip the fp16 conversion (saving GPU memory)
    • Updated logging (increased default verbosity for better understandability)
    • Added external data path for tensorrt to cope with models > 2G
    • Moved ONNX export post Pytorch benchmark to do conversion on CPU only (for larger models)
    bug documentation 
    opened by oborchers 16
  • Dynamic batching does not give better latency for Roberta running on TensorRT.

    Dynamic batching does not give better latency for Roberta running on TensorRT.

    Hi, I used your build_engine API to convert the Roberta model. While building, if I use a constant batch size for the input shapes, i.e. (min, optimal, max) -> (1, 1, 1) or (4, 4, 4), the model yields good results (faster than ORT and torch).

    But when I convert it with a dynamic batch size, i.e. (min, optimal, max) -> (1, 4, 4), the model performs really slowly compared to ORT or torch.

    code to understand the problem better:

    # fast inference but constrained to use always 4 batches during inferencing
    tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))
    
    # slow inference
    tensor_shapes = list(zip([1, 4, 4], [1, 128, 128]))
    
    engine: ICudaEngine = build_engine(
        runtime=runtime,
        onnx_file_path=onnx_model_path,
        logger=trt_logger,
        min_shape=tensor_shapes[0],
        optimal_shape=tensor_shapes[1],
        max_shape=tensor_shapes[2],
        workspace_size=workspace_size * 1024**3,
        fp16=not quantization,
        int8=quantization,
        profiling=True,
    )
    
    save_engine(engine=engine, engine_file_path=tensorrt_path)
    

    The complete build and inference logs for the slow inference case (when converting with a dynamic batch):

    [06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +312, GPU +0, now: CPU 3789, GPU 2470 (MiB)
    [06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3790, GPU 2470 (MiB)
    [06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 3790 MiB, GPU 2470 MiB
    [06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 3924 MiB, GPU 2504 MiB
    [06/02/2022-03:19:09] [TRT] [I] parsing TensorRT model
    [libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
    [libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1418322027
    [06/02/2022-03:19:22] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
    [06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
    [06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
    [06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +512, GPU +226, now: CPU 5802, GPU 2730 (MiB)
    [06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 5918, GPU 2782 (MiB)
    [06/02/2022-03:19:43] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
    [06/02/2022-03:19:43] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
    [06/02/2022-03:19:43] [TRT] [W]  (# 1 (SHAPE input_ids))
    [06/02/2022-03:19:43] [TRT] [W]  (# 0 (SHAPE attention_mask))
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    [06/02/2022-03:25:32] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
    [06/02/2022-03:25:32] [TRT] [W]  (# 1 (SHAPE input_ids))
    [06/02/2022-03:25:32] [TRT] [W]  (# 0 (SHAPE attention_mask))
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
    [06/02/2022-03:30:10] [TRT] [I] Detected 2 inputs and 1 output network tensors.
    [06/02/2022-03:30:10] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
    [06/02/2022-03:30:10] [TRT] [W]  (# 1 (SHAPE input_ids))
    [06/02/2022-03:30:10] [TRT] [W]  (# 0 (SHAPE attention_mask))
    [06/02/2022-03:30:32] [TRT] [I] Total Host Persistent Memory: 208
    [06/02/2022-03:30:32] [TRT] [I] Total Device Persistent Memory: 0
    [06/02/2022-03:30:32] [TRT] [I] Total Scratch Memory: 442827264
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 774 MiB, GPU 2058 MiB
    [06/02/2022-03:30:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.038945ms to assign 4 blocks to 4 nodes requiring 443041280 bytes.
    [06/02/2022-03:30:32] [TRT] [I] Total Activation Memory: 443041280
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5993, GPU 4298 (MiB)
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5993, GPU 4306 (MiB)
    [06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +1353, now: CPU 0, GPU 1353 (MiB)
    [06/02/2022-03:30:33] [TRT] [I] Loaded engine size: 1364 MiB
    [06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7354, GPU 4282 (MiB)
    [06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7355, GPU 4290 (MiB)
    [06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 1352 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] Loaded engine size: 1364 MiB
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7366, GPU 5636 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7367, GPU 5644 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 2704 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 6002, GPU 5636 (MiB)
    [06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6002, GPU 5644 (MiB)
    [06/02/2022-03:30:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +423, now: CPU 0, GPU 3127 (MiB)
    
    latencies in ms
    --------------------------------------------------
    Pytorch 
    --------------------------------------------------
    [93.5968, 94.0308, 94.8224, 93.6746, 94.5972, 94.0188, 92.3105, 93.6535, 92.4908, 91.4413]
    --------------------------------------------------
    Onnxruntime 
     --------------------------------------------------
    [81.445, 81.3684, 80.2145, 81.5339, 82.9578, 83.6845, 83.6738, 82.6652, 81.5462, 82.8237]
    --------------------------------------------------
    TensorRT (FP16) 
     --------------------------------------------------
    [426.353, 425.1992, 426.0317, 425.8226, 426.8828, 428.0485, 426.3119, 426.4556, 425.4863, 426.0393]
    --------------------------------------------------
    

    Is this the expected behavior?

    I want to convert the model to use dynamic batches. When inferencing, the model should be able to handle a variable batch size and perform faster. How can I achieve that?

    Any help would be greatly appreciated, thank you in advance.

    bug 
    opened by Ki6an 12
  • Optimizations for T0

    Optimizations for T0

    I'm trying to replicate the T5 ONNX optimization notebook (the latest version, on the feat/t5_3b branch) but for T0_3B, which is itself a derivative of T5 but with a slightly different config and no tie_word_embeddings.

    I installed ONNX runtime from source as described in the notebook.

    The only changes I made to the notebook are replacing "t5-3b" with "bigscience/T0_3B", and commenting out out_dec["last_hidden_state"] = out_dec["last_hidden_state"] * (pytorch_model.model_dim**-0.5) in the ExportT5 class, as T0 does not use tie word embeddings.

    However, the notebook fails on dec_if_ort_model = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3), with the error: Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from ./test-dec-if/model.onnx failed:This is an invalid model. Error: the graph is not acyclic.

    Shouldn't T0 work because it is essentially T5? Your help would be greatly appreciated @pommedeterresautee. Thanks!

    opened by michaelroyzen 10
  • WIP - Support Token Classification

    WIP - Support Token Classification

    I think this makes it so TD can handle TokenClassification.

    However, the model I picked to test with does not seem to convert well: both the ONNX and TensorRT conversions fail on the assert np.allclose check, and I am not sure what this means...

    I am testing it with

    $ python src/transformer_deploy/convert.py -m dslim/bert-large-NER --backend onnx --seq-len 8 128 256 --batch-size 1 1 1 --task=TokenClassification --verbose

    Also, sorry for all the commits, I can squash them on my fork and make it clean later, I just wanted to know if you had any idea why this was failing.

    opened by sam-writer 9
  • got error in optimize onnx when ran gpt2 file from demo/generative-model

    got error in optimize onnx when ran gpt2 file from demo/generative-model

    Getting an error when running this code part:

    logging.basicConfig()
    logging.getLogger().setLevel(logging.INFO)
    num_attention_heads, hidden_size = get_model_size(path=model_name)
    optimize_onnx(
        onnx_path="test-gpt2.onnx",
        onnx_optim_model_path="test-gpt2-opt.onnx",
        fp16=True,
        use_cuda=True,
        num_attention_heads=num_attention_heads,
        hidden_size=hidden_size,
        architecture='gpt2'
    )

    INFO:fusion_base:Fused LayerNormalization count: 25
    INFO:fusion_base:Fused FastGelu count: 12

    failed in shape inference <class 'AssertionError'>
    failed in shape inference <class 'AssertionError'>
    failed in shape inference <class 'AssertionError'>

    INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 720 nodes are removed
    INFO:onnx_model_gpt2:postprocess: remove Reshape count:72
    INFO:fusion_base:Fused FastGelu(add bias) count: 12
    INFO:onnx_model_bert:opset verion: 13


    AssertionError                            Traceback (most recent call last)
    in ()
          9     num_attention_heads=num_attention_heads,
         10     hidden_size=hidden_size,
    ---> 11     architecture='gpt2'
         12 )

    7 frames

    /usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply)
        209
        210     def _add_suggested_merge(self, symbols, apply=False):
    --> 211         assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
        212         symbols = set(symbols)
        213         for k, v in self.suggested_merge.items():

    AssertionError:

    bug 
    opened by rohitmishra94 8
  • Support other tasks/architectures?

    Support other tasks/architectures?

    First off: thank you! This is a great project, I'm really grateful you released it publically.

    From what I can tell, this supports encoder-only architectures, and the Sequence Classification task (ex). Am I correct? If so, are there plans to support, or interest in supporting, other architectures (encoder/decoder, decoder-only) and/or tasks (Token Classification and Masked token prediction for encoder-only architectures, or Seq2SeqLM for the other architectures)?

    opened by sam-writer 7
  • GPT2 has slow inference

    GPT2 has slow inference

    Hello,

    Your wrapper for gpt2 does not support past_key_values as Hugging Face transformers natively does. I've seen your measurements in the gpt2 demo, and at least for PyTorch they are not really correct: instead of simply calling the model with always the same input, you should call the generate method.

    I tried to run gpt2 in pytorch both on cpu and gpu (GPU: TESLA T4) with your sample text: "Here is some text to encode Hello World"

    Here are my results (vanilla PyTorch):

    GPU, no cache: 14 s/sequence
    GPU, cache: 3.6 s/sequence
    CPU, no cache: 114 s/sequence
    CPU, cache: 9.8 s/sequence

    For every measurement, the result is the average over ten runs of the generate method; I used num_beams=5.

    When running greedy search, the difference is not as big, but still:
    CPU, no cache: 29 s
    CPU, cache: 4.8 s

    CPU: Intel(R) Xeon(R) Platinum 8259CL CPU
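    A sketch of the kind of measurement described (plain Hugging Face generate() with and without cache; illustrative code, max_length is an arbitrary choice, not the repo's benchmark):

    # illustrative timing of generate() with and without the past_key_values cache
    import time
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    inputs = tokenizer("Here is some text to encode Hello World", return_tensors="pt")

    for use_cache in (True, False):
        start = time.perf_counter()
        with torch.inference_mode():
            # beam search with 5 beams, matching the setup above
            model.generate(**inputs, num_beams=5, max_length=256, use_cache=use_cache)
        print(f"use_cache={use_cache}: {time.perf_counter() - start:.1f}s")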

    opened by kobzaond 6
  • Calibration failure occurred with no scaling factors detected

    Calibration failure occurred with no scaling factors detected

    Hey,

    first of all, thanks a lot for your great work. This repo was already a great help to me.

    With your quantization update for INT8, however, I ran into a problem. As soon as I activate --quantization, I get the following error:

    [01/14/2022-11:18:37] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
    [01/14/2022-11:18:37] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1402] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
    [01/14/2022-11:18:37] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
    
    Traceback (most recent call last):
      File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 326, in <module>
        entrypoint()
      File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 322, in entrypoint
        main(commands=args)
      File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 216, in main
        engine: ICudaEngine = build_engine(
      File "/data/repos/transformer-deploy/src/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
        engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
    TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
        1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine
    
    Invoked with: <tensorrt.tensorrt.Runtime object at 0x7feb14128e30>, None
    

    The problem in the traceback is then just that the trt_engine will be None. I don't get any other warnings or errors, so I'm a bit at a loss. I've tried with distilroberta-base and also with bert-base-uncased, but I get the same error each time. Did you, by any chance, run into the same problem at some point in time or do you see what the issue may be?

    Thanks a lot in advance!

    opened by v1nc3nt27 6
  • Failed to load private model

    Failed to load private model

    Hi,

    I tried to convert a private model of sentence-transformer on the Hugging Face Hub:

    docker run -it --rm --gpus all \
        -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
        bash -c "cd /project && \
        convert_model -m \"Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2\" \
        --backend tensorrt onnx \
        --task embedding \
        --seq-len 16 128 128 \
        --auth-token XXX"
    

    However, the download of config.json file failed with the following message:

    =============================
    == Triton Inference Server ==
    =============================
    
    NVIDIA Release 22.01 (build 31237563)
    
    Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
    
    Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
    
    This container image and its contents are governed by the NVIDIA Deep Learning Container License.
    By pulling and using the container, you accept the terms and conditions of this license:
    https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
    
    Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 451/451 [00:00<00:00, 650kB/s]
    Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.83M/4.83M [00:02<00:00, 2.48MB/s]
    Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16.3M/16.3M [00:07<00:00, 2.41MB/s]
    Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 280/280 [00:00<00:00, 425kB/s]
    401 Client Error: Unauthorized for url: https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2/resolve/main/config.json
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 585, in _get_config_dict
        resolved_config_file = cached_path(
      File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1846, in cached_path
        output_path = get_from_cache(
      File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 2050, in get_from_cache
        _raise_for_status(r)
      File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1977, in _raise_for_status
        request.raise_for_status()
      File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 960, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2/resolve/main/config.json
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/bin/convert_model", line 8, in <module>
        sys.exit(entrypoint())
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 357, in entrypoint
        main(commands=args)
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 152, in main
        model_config: PretrainedConfig = AutoConfig.from_pretrained(pretrained_model_name_or_path=commands.model)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/configuration_auto.py", line 612, in from_pretrained
        config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 537, in get_config_dict
        config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 618, in _get_config_dict
        raise EnvironmentError(
    OSError: We couldn't connect to 'https://huggingface.co/' to load this model and it looks like Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2 is not the path to a directory conaining a config.json file.
    Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
    

    Any advice?

    Thanks!

    opened by Matthieu-Tinycoaching 5
  • 'assert num_heads > 0' error with DistilBert

    'assert num_heads > 0' error with DistilBert

    I get the following error when I try to optimize distilbert:

    AssertionError                            Traceback (most recent call last)
    <timed eval> in <module>
    
    /opt/conda/lib/python3.7/site-packages/transformer_deploy/convert.py in main(input_args)
        245             onnx_path=onnx_model_path,
        246             onnx_optim_fp16_path=onnx_optim_fp16_path,
    --> 247             use_cuda=True,
        248         )
        249         onnx_model = create_model_for_provider(path=onnx_optim_fp16_path, provider_to_use="CUDAExecutionProvider")
    
    /opt/conda/lib/python3.7/site-packages/transformer_deploy/backends/ort_utils.py in optimize_onnx(onnx_path, onnx_optim_fp16_path, use_cuda)
         72         num_heads=0,  # automatic detection don't work with opset 13
         73         hidden_size=0,  # automatic detection
    ---> 74         optimization_options=optimization_options,
         75     )
         76 
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
        289 
        290     if not only_onnxruntime:
    --> 291         optimizer.optimize(optimization_options)
        292 
        293     # Remove the temporary model.
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
        317             if options is not None:
        318                 self.attention_mask.set_mask_format(options.attention_mask_format)
    --> 319             self.fuse_attention()
        320 
        321         self.fuse_shape()
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in fuse_attention(self)
         52 
         53     def fuse_attention(self):
    ---> 54         self.attention_fusion.apply()
         55 
         56     def fuse_gelu(self):
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_base.py in apply(self)
         41                     raise Exception("Can not find node in any graphs")
         42                 self.this_graph_name = graph.name
    ---> 43                 self.fuse(node, input_name_to_nodes, output_name_to_node)
         44 
         45         op_list = [node.op_type for node in self.nodes_to_add]
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in fuse(self, normalize_node, input_name_to_nodes, output_name_to_node)
        444             new_node = self.create_attention_node(mask_index, matmul_q, matmul_k, matmul_v, add_q, add_k, add_v,
        445                                                   q_num_heads, self.hidden_size, root_input,
    --> 446                                                   attention_last_node.output[0], add_qk_str)
        447             if new_node is None:
        448                 return
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in create_attention_node(self, mask_index, q_matmul, k_matmul, v_matmul, q_add, k_add, v_add, num_heads, hidden_size, input, output, add_qk_str)
        161             Union[NodeProto, None]: the node created or None if failed.
        162         """
    --> 163         assert num_heads > 0
        164 
        165         if hidden_size > 0 and (hidden_size % num_heads) != 0:
    
    AssertionError: 
    

    While trying to resolve the issue, I observed that it did not occur when optimizer from onnxruntime-tools was used with opt_level 99 (instead of the one in onnxruntime.transformers). But the code then threw Exceptions due to some skip layer normalization issues.

    opened by vishalsrao 5
  • Unable to install transformer-deploy module

    Unable to install transformer-deploy module

    Any support would be appreciated:

    When running demo/torchdynamo/benchmark.ipynb, specifically this cell (pasted code), I run into the error below.

    from typing import Dict
    
    import numpy as np
    import torch
    from onnxruntime import GraphOptimizationLevel
    
    from transformers import AutoModel, PreTrainedModel
    from transformer_deploy.backends.ort_utils import convert_fp16
    from transformer_deploy.backends.onnx_utils import save_onnx
    
    from dynamo_utils import (
        benchmark,
        check_output,
        get_dynamo_optimizer,
        get_onnx_inference,
        get_pytorch_inference,
        get_pytorch_input,
        plot_benchmarks,
        print_pytorch_profile,
        get_tensorrt_inference,
        seq_lengths,
    )
    
    import gc
    import tensorrt as trt
    from tensorrt.tensorrt import ICudaEngine, Logger, Runtime
    import onnx
    from transformer_deploy.backends.trt_utils import build_engine, save_engine
    
    
    ModuleNotFoundError                       Traceback (most recent call last)
    Cell In [3], line 8
          5 from onnxruntime import GraphOptimizationLevel
          7 from transformers import AutoModel, PreTrainedModel
    ----> 8 from transformer_deploy.backends.ort_utils import convert_fp16
          9 from transformer_deploy.backends.onnx_utils import save_onnx
         11 from dynamo_utils import (
         12     benchmark,
         13     check_output,
        (...)
         21     seq_lengths,
         22 )

    ModuleNotFoundError: No module named 'transformer_deploy'
    
    opened by elvinagam 4
  • Question-Answering example not working for batch_size > 1

    Question-Answering example not working for batch_size > 1

    I'm running demo/question-answering/triton_client.py from the examples directory. The script returns expected result with batch_size=1. However, if you make the batch_size > 1 in this line, it outputs only the result of the first element in the batch and other elements are ignored.

    I saw #84 and #106 about the question-answering example and batch_size but I don't think they are related to this. The triton server does not yield in any errors.

    Am I missing something here?

    opened by lakshaykc 0
  • Support for constrained beam-search in T5

    Support for constrained beam-search in T5

    HF T5 model (actually seq2seq model in general) supports complex decoding schemes such as constrained beam search https://huggingface.co/blog/constrained-beam-search. In my use case, I just really need the simplest constrained beam search where decoded sequences have to belong to a pre-defined Trie. This can be done via https://huggingface.co/docs/transformers/internal/generation_utils#transformers.PrefixConstrainedLogitsProcessor

    Is this possible for transformer-deploy ?

    opened by junwang-wish 0
  • Attempting to run T5 ORT model in Triton inference server

    Attempting to run T5 ORT model in Triton inference server

    Hi there,

    Thanks again for this library!

    We're trying to convert a fine-tuned T5 model to ONNX and run it in Triton. We've managed to convert the model to ONNX and use the T5 notebook guide to run the model just fine in python.

    But trying to get it to run in Triton has been a challenge. In particular, we're not sure how to get past_key_values to be passed through in Triton. We have the decoder config as follows:

    name: "t5-dec-if-node_onnx_model"
    max_batch_size: 0
    platform: "onnxruntime_onnx"
    default_model_filename: "model.bin"
    input [
        {
            name: "input_ids"
            data_type: TYPE_INT32
            dims: [ -1, -1 ]
        },
        {
            name: "encoder_hidden_states"
            data_type: TYPE_FP32
            dims: [ -1, -1, 2048 ]
        },
        {
            name: "enable_cache"
            data_type: TYPE_BOOL
            dims: [ 1 ]
        },
        
            {
                name: "past_key_values.0.decoder.key"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            },
            {
                name: "past_key_values.0.decoder.value"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            },
            {
                name: "past_key_values.0.encoder.key"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            },
            {
                name: "past_key_values.0.encoder.value"
                data_type: TYPE_FP32
                dims: [-1, 32, -1, 64]
            }
         ...
    ]
    output [
        {
            name: "logits"
            data_type: TYPE_FP32
            dims: [ -1, -1, 32128 ]
        }
    ]
    instance_group [
        {
          count: 1
          kind: KIND_GPU
        }
    ]
    

    And when we do the following:

    input_1 = tritonclient.http.InferInput(name="input_ids", shape=(1, 24), datatype="INT32")
    input_2 = tritonclient.http.InferInput(name="encoder_hidden_states", shape=(1, 24, 2048), datatype="FP32")
    input_3 = tritonclient.http.InferInput(name="enable_cache", shape=(1, ), datatype="BOOL")
    
    input_1.set_data_from_numpy(input_ids)
    input_2.set_data_from_numpy(encoder_hidden_states)
    input_3.set_data_from_numpy(np.asarray([True]))
    
    result = triton_client.infer(
        model_name='t5-dec-if-node_onnx_model', 
        inputs=[input_1, input_2, input_3], 
        outputs=[tritonclient.http.InferRequestedOutput(name="logits", binary_data=False)]
    )
    

    We get this error:

    InferenceServerException: [request id: <id_unknown>] expected 99 inputs but got 3 inputs for model 't5-dec-if-node_onnx_model'
    

    Any idea how we can fix this?

    opened by samiur 1
  • Two GPU are slower than one

    Two GPU are slower than one

    Hi, I run the Triton server on two NVIDIA RTX 3090 Ti GPUs with --shm-size 20g. When I do inference, I get a time of about 1.56s. But if I run the server with only one GPU (--gpus '"device=0"'), I get about 860ms. The input sequence length was 256 tokens. I optimized GPT2-medium with your script:

    convert_model -m gpt2-medium \
        --backend tensorrt onnx \
        --seq-len 32 512 512 \
        --task text-generation --atol=2"
    
    opened by OleksandrKorovii 0
  • Tensorrt engine

    Tensorrt engine

    I tried running TRT based-off three methods:

    1. python src/transformer-deploy/convert.py
    2. exisiting docker image
    3. build docker image from repo

    In all three instances, I got back the same response while running TRT backend.

    The command I have been trying to run (docker for example):

    docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:latest bash -c "cd /project && \
      convert_model -m \"sentence-transformers/multi-qa-distilbert-cos-v1\" \
      --backend tensorrt onnx \
      --seq-len 128 128 256 \
      --batch-size 1 32 300"
    
    

    When I pass only 'onnx' as the backend param, everything runs pretty smoothly, but I face issues with the 'tensorrt' backend.

    [11/29/2022-10:58:17] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::2945] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
    [11/29/2022-10:58:17] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
    Traceback (most recent call last):
      File "/usr/local/bin/convert_model", line 8, in <module>
        sys.exit(entrypoint())
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 417, in entrypoint
        main(commands=args)
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 308, in main
        engine: ICudaEngine = build_engine(
      File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 206, in build_engine
        engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
    TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
        1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine
    
    Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f88c85c12b0>, None
    free(): invalid pointer
    
    

    Would be great if I could have a workaround for this.

    Versions: Python 3.8.15, transformers-deploy 0.5.3, TensorRT 8.4.1.5, onnxruntime (GPU) 1.12.0, transformers 4.24.0

    opened by imsiddhant07 1
  • Token type ids bug

    Token type ids bug

    Some models don't use token_type_ids in the forward pass. E.g. deberta has type_vocab_size=0 as a default value.

    What happens is the model ignores token_type_ids (https://github.com/huggingface/transformers/blob/bac2d29a802803a7f2db8e8597a2ec81730afcc9/src/transformers/models/deberta/modeling_deberta.py#L810)

    However, the tokenizer doesn't know about this, and token_type_ids is still in tokenizer.model_input_names.
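    A quick way to see the mismatch (illustrative):

    # the config disables token type embeddings, but the tokenizer still emits token_type_ids
    from transformers import AutoConfig, AutoTokenizer

    config = AutoConfig.from_pretrained("microsoft/deberta-base-mnli")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base-mnli")
    print(config.type_vocab_size)       # 0 by default for deberta -> ignored by the model
    print(tokenizer.model_input_names)  # still contains 'token_type_ids'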

    This mismatch leads to

    docker run -it --rm --gpus all \
      -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.3 \
      bash -c "cd /project && \
        convert_model -m \"microsoft/deberta-base-mnli\" \
        --backend onnx \
        --seq-len 16 128 128"
    
    docker run -itd --rm --gpus '"device=3"' -p8000:8000 --shm-size 256m \
      -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
      bash -c "pip install transformers && tritonserver --model-repository=/models"
    

    And the triton inference server fails with

    I1123 13:49:09.821427 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbf36000000' with size 268435456
    I1123 13:49:09.821983 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
    I1123 13:49:09.828017 1 model_repository_manager.cc:1206] loading: transformer_onnx_tokenize:1
    I1123 13:49:09.828058 1 model_repository_manager.cc:1206] loading: transformer_onnx_model:1
    I1123 13:49:09.830743 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
    I1123 13:49:09.830786 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
    I1123 13:49:09.830804 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
    I1123 13:49:09.830814 1 onnxruntime.cc:2504] backend configuration:
    {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
    I1123 13:49:09.846110 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: transformer_onnx_model (version 1)
    I1123 13:49:09.847111 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'transformer_onnx_model': inputs and outputs already specified
    I1123 13:49:09.851839 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_model_0 (GPU device 0)
    I1123 13:49:12.063610 1 onnxruntime.cc:2637] TRITONBACKEND_ModelInstanceFinalize: delete instance state
    I1123 13:49:12.063688 1 onnxruntime.cc:2583] TRITONBACKEND_ModelFinalize: delete model state
    E1123 13:49:12.063708 1 model_repository_manager.cc:1355] failed to load 'transformer_onnx_model' version 1: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 inputs, model provides 2
    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
    I1123 13:49:13.744756 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_tokenize_0 (GPU device 0)
    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
    I1123 13:49:14.298233 1 model_repository_manager.cc:1352] successfully loaded 'transformer_onnx_tokenize' version 1
    E1123 13:49:14.298380 1 model_repository_manager.cc:1559] Invalid argument: ensemble 'transformer_onnx_inference' depends on 'transformer_onnx_model' which has no loaded version
    I1123 13:49:14.298438 1 server.cc:559]
    +------------------+------+
    | Repository Agent | Path |
    +------------------+------+
    +------------------+------+
    
    I1123 13:49:14.298487 1 server.cc:586]
    +-------------+----------------------------------------------------------------+----------------------------------------------------------------+
    | Backend     | Path                                                           | Config                                                         |
    +-------------+----------------------------------------------------------------+----------------------------------------------------------------+
    | onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.s | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
    |             | o                                                              | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
    |             |                                                                | s","default-max-batch-size":"4"}}                              |
    |             |                                                                |                                                                |
    | python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
    |             |                                                                | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
    |             |                                                                | s","default-max-batch-size":"4"}}                              |
    +-------------+----------------------------------------------------------------+----------------------------------------------------------------+
    
    I1123 13:49:14.298549 1 server.cc:629]
    +---------------------------+---------+---------------------------------------------------------------------------------------------------------+
    | Model                     | Version | Status                                                                                                  |
    +---------------------------+---------+---------------------------------------------------------------------------------------------------------+
    | transformer_onnx_model    | 1       | UNAVAILABLE: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 i |
    |                           |         | nputs, model provides 2                                                                                 |
    | transformer_onnx_tokenize | 1       | READY                                                                                                   |
    +---------------------------+---------+---------------------------------------------------------------------------------------------------------+
    
    I1123 13:49:14.351997 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
    I1123 13:49:14.352405 1 tritonserver.cc:2176]
    +----------------------------------+------------------------------------------------------------------------------------------------------------+
    | Option                           | Value                                                                                                      |
    +----------------------------------+------------------------------------------------------------------------------------------------------------+
    | server_id                        | triton                                                                                                     |
    | server_version                   | 2.24.0                                                                                                     |
    | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configu |
    |                                  | ration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                         |
    | model_repository_path[0]         | /models                                                                                                    |
    | model_control_mode               | MODE_NONE                                                                                                  |
    | strict_model_config              | 0                                                                                                          |
    | rate_limit                       | OFF                                                                                                        |
    | pinned_memory_pool_byte_size     | 268435456                                                                                                  |
    | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                   |
    | response_cache_byte_size         | 0                                                                                                          |
    | min_supported_compute_capability | 6.0                                                                                                        |
    | strict_readiness                 | 1                                                                                                          |
    | exit_timeout                     | 30                                                                                                         |
    +----------------------------------+------------------------------------------------------------------------------------------------------------+
    
    I1123 13:49:14.352443 1 server.cc:260] Waiting for in-flight requests to complete.
    I1123 13:49:14.352453 1 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
    I1123 13:49:14.352460 1 model_repository_manager.cc:1230] unloading: transformer_onnx_tokenize:1
    I1123 13:49:14.352525 1 server.cc:291] All models are stopped, unloading models
    I1123 13:49:14.352534 1 server.cc:298] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
    I1123 13:49:15.352620 1 server.cc:298] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
    I1123 13:49:15.444143 1 model_repository_manager.cc:1335] successfully unloaded 'transformer_onnx_tokenize' version 1
    I1123 13:49:16.352790 1 server.cc:298] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
    error: creating server: Internal - failed to load all models
    

    The proposed solution fixes this bug

    opened by fursovia 2
Releases(v0.4.0)
  • v0.4.0(Feb 8, 2022)

    • add support for decoder based model (GPT-2) on both ONNX Runtime and TensorRT
    • refactor triton configuration generation (simplification)
    • add GPT-2 model documentation (notebook)
    • fix CPU quantization benchmark (was not using the quant model)
    • fix sentence transformers bug
  • v0.3.0(Dec 28, 2021)

    What's Changed

    • Update requirements_gpu.txt by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/22
    • refactoring by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/27
    • add CPU inference support by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/28
    • Add QAT support to more models by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/29

    Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.2.0...v0.3.0

  • v0.2.0(Dec 8, 2021)

    • support int-8 GPU quantization
    • add a tuto to perform quantization end to end
    • add QDQRoberta model
    • switch to ONNX opset 13
    • refactoring in the TensorRT engine creation
    • fix bugs
    • add auth token (for private HF repo)

    What's Changed

    • Update triton by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/11
    • fix README.md by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/13
    • Fix install errors by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/20
    • Add auth token by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/19
    • Support GPU INT-8 quantization by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/15

    New Contributors

    • @sam-writer made their first contribution in https://github.com/ELS-RD/transformer-deploy/pull/20

    Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.1.1...v0.2.0

  • v0.1.1(Nov 24, 2021)

  • v0.1.0(Nov 23, 2021)

    • switch from a proof of concept to a library
    • add support for TensorRT Python API (for best performances)
    • improve documentation (separate Hugging Face Infinity thing from the doc, add benchmark, etc.)
    • fix issues with mixed precision
    • add license
    • add tests, Github actions, Makefile
    • change the way the Docker image is built
  • v0.0.1(Nov 8, 2021)

Owner
Lefebvre Sarrut Services
R&D department of the Lefebvre Sarrut group