⚡ Boost the inference speed of T5 models by up to 5x and reduce the model size by 3x using fastT5.

Overview


Reduce T5 model size by 3X and increase the inference speed up to 5X.



T5 models can be used for several NLP tasks such as summarization, QA, QG, translation, text generation, and more. Sequential text generation is inherently slow, and for larger T5 models it gets even slower. fastT5 speeds up T5 inference by running the model on onnxruntime, and it also reduces the model size through quantization.

The fastT5 library lets you convert a pretrained T5 model to ONNX, quantize it, and get back a model that runs on onnxruntime, all in a single line of code. You can also customize the whole process.


Install

You can install fastT5 from PyPI:

 pip install fastt5

If you want to build from source:

git clone https://github.com/Ki6an/fastT5
cd fastT5
pip3 install -e .

Usage

The export_and_get_onnx_model() method exports the given pretrained T5 model to ONNX, quantizes it, and runs it on onnxruntime with default settings. The returned model supports Hugging Face's generate() method.

If you don't want to quantize the model, pass quantized=False to the method.

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().
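For example, a minimal sketch of both options (the quantized=False flag is described above; the exact arguments accepted by get_onnx_model(), and where it looks for the exported files, are assumptions to verify against the function's docstring):

from fastT5 import export_and_get_onnx_model, get_onnx_model

model_name = 't5-small'

# Export to onnx without quantization (pass quantized=False as described above).
unquantized_model = export_and_get_onnx_model(model_name, quantized=False)

# On a later run, load the already exported model instead of exporting again
# (assumes the exported files are in fastT5's default output location).
model = get_onnx_model(model_name)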

You can also customize the whole pipeline, as shown in the code example below:

from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 't5-small'

# Step 1. convert huggingface's T5 model to onnx
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
quant_model_paths = quantize(onnx_model_paths)

# Step 3. set up the onnx runtime sessions
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. get the onnx model
model = OnnxT5(model_or_model_path, model_sessions)

                      ...
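The resulting model behaves like the one returned by export_and_get_onnx_model(), so inference works exactly as in the earlier usage example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_or_model_path)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
                        attention_mask=token['attention_mask'],
                        num_beams=2)

print(tokenizer.decode(tokens.squeeze(), skip_special_tokens=True))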

Details

T5 is a seq2seq (encoder-decoder) model. Because the decoder runs repeatedly during inference, we can't export the whole model to ONNX directly; the encoder and decoder need to be exported separately.

past_key_values contain the pre-computed hidden states (keys and values in the self-attention and cross-attention blocks) that can be used to speed up sequential decoding.

Models can only be exported with a fixed set of inputs, but the decoder takes no past_key_values on the first step while it does on every subsequent step. To get around this, we create two decoders: one for the first step that does not take past_key_values, and another for the remaining steps that does.
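For illustration only (this is not fastT5's actual implementation), the sketch below shows how a generation loop could dispatch between the two decoder graphs. The session objects are assumed to be onnxruntime InferenceSessions for the exported init-decoder and decoder; the input names input_ids, encoder_attention_mask, and encoder_hidden_states follow the exported graphs, and feeding the past_key_values tensors by input order is an assumption of this sketch.

# Conceptual sketch only; fastT5 wires this up for you internally.
def run_decoder_step(init_dec_sess, dec_sess, input_ids, enc_mask, enc_hidden, past=None):
    feed = {
        "input_ids": input_ids,               # current decoder token(s)
        "encoder_attention_mask": enc_mask,   # attention mask from the encoder pass
        "encoder_hidden_states": enc_hidden,  # encoder output
    }
    if past is None:
        # First step: the init-decoder graph takes no past_key_values inputs.
        outputs = init_dec_sess.run(None, feed)
    else:
        # Later steps: the decoder graph also expects the cached past_key_values,
        # paired with its remaining inputs by order (an assumption in this sketch).
        pkv_names = [i.name for i in dec_sess.get_inputs() if i.name not in feed]
        feed.update(dict(zip(pkv_names, past)))
        outputs = dec_sess.run(None, feed)
    logits, new_past = outputs[0], outputs[1:]
    return logits, new_past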

Next, we export all three models (encoder, decoder, init-decoder) and quantize them. Quantizing from 32-bit to 8-bit would normally give a 4x size reduction, but since there is an extra decoder the overall model size shrinks by about 3x.

Finally, we'll run the quantized model on onnx runtime.

Inference is simple because the model supports Hugging Face's generate() method.

Functionalities

  • Export any pretrained T5 model to ONNX easily (with past_key_values).
  • The exported model supports beam search, greedy search, and more via the generate() method.
  • Reduce the model size by 3X using quantization.
  • Up to 5X speedup compared to PyTorch execution for greedy search and 3-4X for beam search.

Benchmarks

The benchmarks were obtained with the T5-base model on English to French translation.
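A rough sketch of how such a CPU latency comparison could be reproduced (the prompt, number of runs, and timing loop are illustrative assumptions, not the exact setup behind the numbers below):

import time
from transformers import AutoTokenizer, T5ForConditionalGeneration
from fastT5 import export_and_get_onnx_model

model_name = 't5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("translate English to French: The universe is a dark forest.",
                   return_tensors='pt')

pt_model = T5ForConditionalGeneration.from_pretrained(model_name)
onnx_model = export_and_get_onnx_model(model_name)  # quantized by default

def mean_latency(model, runs=10, **gen_kwargs):
    # Average wall-clock time of generate() over several runs.
    start = time.perf_counter()
    for _ in range(runs):
        model.generate(input_ids=inputs['input_ids'],
                       attention_mask=inputs['attention_mask'],
                       **gen_kwargs)
    return (time.perf_counter() - start) / runs

print('pytorch, num_beams=3:', mean_latency(pt_model, num_beams=3))
print('onnx,    num_beams=3:', mean_latency(onnx_model, num_beams=3))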

Onnx model

The following graph shows the latency of the quantized onnx model vs. the PyTorch model for beam sizes from 1 to 9. The latencies shown are averaged over sequence lengths up to 130.

t5-base

The following heat map shows the speedup factor, i.e. the ratio of PyTorch latency to onnx latency. The onnx model wins in most cases; however, the speedup drops for longer sequence lengths.

t5-base-hist

Quantized onnx model

As mentioned earlier, quantized models are lightweight and have almost the same accuracy as the original model (quantized model scores are listed in the next section). Quantized onnx models have the lowest latency of the three (quantized onnx, plain onnx, and PyTorch).

t5-base-quant

On average, the quantized model outperforms the PyTorch model by 5.7x for greedy search and 3-4x for beam search.

t5-base-quant-hist

Note: The results were generated on an AMD EPYC 7B12 and may vary from device to device. Onnx models usually perform best on high-end CPUs with more cores.

Quantized model scores

The scores were measured on English to French translation with a beam size of 3 (a sketch of how such scores could be computed follows the table).

Model                 Bleu_4     METEOR     ROUGE_L
t5-small (quant)      0.240769   0.282342   0.468817
t5-small (pytorch)    0.254601   0.295172   0.492749
t5-base (quant)       0.267606   0.306019   0.499188
t5-base (pytorch)     0.268346   0.304969   0.503306
t5-large (quant)      0.286726   0.316845   0.503585
t5-large (pytorch)    0.294015   0.315774   0.508677
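For reference, the snippet below is a minimal sketch of computing a corpus BLEU score with sacrebleu; the sentences are placeholders, the exact evaluation data and the METEOR/ROUGE_L tooling used for the table are not specified here, and sacrebleu reports BLEU on a 0-100 scale rather than the 0-1 scale shown above.

import sacrebleu

# Placeholder data: model translations and their reference translations.
hypotheses = ["L'univers est une forêt sombre."]
references = [["L'univers est une forêt sombre."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 0-100 scale; divide by 100 to compare with the table above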

Further improvements

  • Currently, fastT5 supports only the CPU version of onnxruntime; a GPU implementation still needs to be done.
  • Graph optimization of the onnx model would further reduce latency (see the sketch below).
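For instance, a minimal sketch of enabling onnxruntime's built-in graph optimizations when creating a session; the file paths are illustrative, and fastT5's get_onnx_runtime_sessions may already configure session options differently:

import onnxruntime as ort

options = ort.SessionOptions()
# Enable all graph-level optimizations (basic + extended + layout).
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally persist the optimized graph for inspection or reuse.
options.optimized_model_filepath = 't5-small-encoder-opt.onnx'  # illustrative path

session = ort.InferenceSession('t5-small-encoder-quantized.onnx',  # illustrative path
                               options,
                               providers=['CPUExecutionProvider'])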

Get Help

Acknowledgements

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}
Comments
  • forward() got an unexpected keyword argument 'cross_attn_head_mask'


    ----> 1 paraphrase_t5("Kyle Lowry scored 33 points and Norman Powell added 23 to lift the Toronto Raptors to a 122-125 victory over the Boston Celtics on Wednesday night.")
    
    4 frames
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'
    
    opened by mrm8488 20
  • Observing difference in outputs from decoder with IO bindings.


    Hi @Ki6an, I was trying to implement IO bindings for the decoder part of the model. I used the same code from your repo to convert the model to ONNX. After loading the model and making predictions using the decoder session directly, the output appears to be fine, but with the inputs bound the result comes out different.

    Below is the code for the IO bindings:

    def dec_pred_with_io_bindings(input_ids, attention_mask, encoder_output, past_key_values_dict,dec_session):
      dec_io_binding = dec_session.io_binding()
      dec_io_binding.bind_input(name="input_ids",
                              device_type="cuda",
                              device_id=0,
                              element_type=np.longlong,
                              shape=list(input_ids.shape),
                              buffer_ptr=input_ids.data_ptr())
      dec_io_binding.bind_input(name="encoder_attention_mask",
                              device_type="cuda",
                              device_id=0,
                              element_type=np.longlong,
                              shape=list(attention_mask.shape),
                              buffer_ptr=attention_mask.data_ptr())
                            
      dec_io_binding.bind_input(name="encoder_hidden_states",
                              device_type="cuda",
                              device_id=0,
                              element_type=np.float32,
                              shape=list(encoder_output.shape),
                              buffer_ptr=encoder_output.data_ptr())
      
    
      for key,val in past_key_values_dict.items():
        dec_io_binding.bind_input(name=key,
                                          device_type="cuda",
                                          device_id=0,
                                          element_type=np.float32,
                                          shape=list(val.shape),
                                          buffer_ptr=val.data_ptr())
      
      #Bind outputs.
      for arg in self.decoder.get_outputs():
        dec_io_binding.bind_output(arg.name, "cuda")
        
      dec_session.run_with_iobinding(dec_io_binding)
      ort_output = dec_io_binding.get_outputs()
    
      logits=ort_output[0]
    
      list_pkv = tuple(torch.from_numpy(x.numpy()).cuda() for x in ort_output[1:])
    
      # creates a tuple of tuples of shape 6x4 from the above tuple
      out_past_key_values = tuple(
          list_pkv[i : i + 4] for i in range(0, len(list_pkv), 4)
      )
    
    
      return torch.from_numpy(logits.numpy()).cuda(),out_past_key_values
    
    opened by VikasOjha666 10
  • Errors when loading saved onnx files


    We have an issue with saving and loading onnx files. When passing the generated quant_model_paths to get_onnx_runtime_sessions everything works okay but if I save the file and then run get_onnx_runtime_sessions on the loaded quantized files the model throws an error:

    File "/Users/itai/Code/email-cleaner/.venv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 184, in run
        raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
    ValueError: Model requires 3 inputs. Input Feed contains 2
    

    This doesn't seem to happen on SageMaker but it happens on mac and also on linux containerized environment.

    opened by itaim 10
  • Updating fastT5?


    Hello @Ki6an.

    Thanks a lot for fastT5! One question: are you planning to update it to the latest versions of ONNX, ONNX Runtime and Transformers?

    Thank you.

    opened by piegu 6
  • Conversion of decoder with past_key_values to float16.


    Hi @Ki6an. With the converted ONNX model generated, I was trying to convert the decoder_init and decoder to float16. I did the quantization with onnxruntime's transformer optimizer. I was able to convert the decoder_init to fp16, but while converting the decoder with past_key_values I am getting the following issue:

    AssertionError                            Traceback (most recent call last)
    in ()
         25 ,
         26 )
    ---> 27 optimized_model.convert_float_to_float16() # FP32 -> FP16
         28 optimized_model.save_model_to_file('/content/optimized_models/t5-base-qa-qg-hl-decoder.onnx')
         29

    6 frames
    /usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply)
        216
        217     def _add_suggested_merge(self, symbols, apply=False):
    --> 218         assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
        219         symbols = set(symbols)
        220         for k, v in self.suggested_merge.items():

    AssertionError:

    Code used:

    from onnxruntime.transformers import optimizer

    # Decoder
    optimized_model = optimizer.optimize_model(
        input='/content/models/t5-base-qa-qg-hl-decoder.onnx',
        use_gpu=True,
        opt_level=1,
        only_onnxruntime=True,
    )
    optimized_model.convert_float_to_float16()  # FP32 -> FP16
    optimized_model.save_model_to_file('/content/optimized_models/t5-base-qa-qg-hl-decoder.onnx')

    opened by VikasOjha666 5
  • t5-11b out of memory/FileNotFoundError


    First of all, this seems like a great repo that I was super excited to find!

    When testing with t5-small everything works correctly. But when trying with my custom t5-11b I get out of memory issues.

    I was running this with a t5-11b as model: onnx_model_paths = generate_onnx_representation("t5-11b",model=model)

    And at first I got this error:

    RuntimeError: Exporting model exceed maximum protobuf size of 2GB. Please call torch.onnx.export with use_external_data_format=True.

    So I simply added use_external_data_format=True to all of the three torch.onnx.export in onnx_exporter.py in fastT5.

    Then I can run onnx_model_paths = generate_onnx_representation(model_name,model=model), and get no error (First time I posted I got an error but it seems like I made an error and only had 100 GB disk memory, when trying 200 GB it worked).

    Then when running quant_model_paths = quantize(onnx_model_paths) I get the error:

    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-7-3a782b6d5a25> in <module>
          8 
          9 # Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
    ---> 10 quant_model_paths = quantize(onnx_model_paths)
         11 
         12 # step 3. setup onnx runtime
    
    ~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
        273             activation_type=QuantType.QUInt8,
        274             weight_type=QuantType.QUInt8,
    --> 275             optimize_model=False,
        276         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
        277         quant_model_paths.append(output_model_name)
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
        266         op_types_to_quantize = list(IntegerOpsRegistry.keys())
        267 
    --> 268     model = load_model(Path(model_input), optimize_model)
        269     quantizer = ONNXQuantizer(
        270         model,
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in load_model(model_path, optimize)
         51         return onnx_model.model
         52 
    ---> 53     return onnx.load(Path(model_path))
         54 
         55 
    
    /opt/conda/lib/python3.7/site-packages/onnx/__init__.py in load_model(f, format, load_external_data)
        125         if model_filepath:
        126             base_dir = os.path.dirname(model_filepath)
    --> 127             load_external_data_for_model(model, base_dir)
        128 
        129     return model
    
    /opt/conda/lib/python3.7/site-packages/onnx/external_data_helper.py in load_external_data_for_model(model, base_dir)
         69     for tensor in _get_all_tensors(model):
         70         if uses_external_data(tensor):
    ---> 71             load_external_data_for_tensor(tensor, base_dir)
         72             # After loading raw_data from external_data, change the state of tensors
         73             tensor.data_location = TensorProto.DEFAULT
    
    /opt/conda/lib/python3.7/site-packages/onnx/external_data_helper.py in load_external_data_for_tensor(tensor, base_dir)
         48     external_data_file_path = os.path.join(base_dir, file_location)
         49 
    ---> 50     with open(external_data_file_path, 'rb') as data_file:
         51 
         52         if info.offset:
    
    FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyter/encoder.embed_tokens.weight'
    

    Has anyone successfully exported the t5-11b version and knows how to solve this?

    Update:

    I tried changing the working directory to /home/jupyter/models instead of /home/jupyter/, which seems to solve the FileNotFoundError. But then again I get problems with the size:

    ValueError                                Traceback (most recent call last)
    <ipython-input-10-032d95bca1c8> in <module>
          1 os.chdir(r'/home/jupyter/models/')
    ----> 2 quant_model_paths = quantize(onnx_model_paths)
    
    ~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
        273             activation_type=QuantType.QUInt8,
        274             weight_type=QuantType.QUInt8,
    --> 275             optimize_model=False,
        276         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
        277         quant_model_paths.append(output_model_name)
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
        278         nodes_to_quantize,
        279         nodes_to_exclude,
    --> 280         op_types_to_quantize)
        281 
        282     quantizer.quantize_model()
    
    /opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in __init__(self, model, per_channel, reduce_range, mode, static, weight_qType, input_qType, tensors_range, nodes_to_quantize, nodes_to_exclude, op_types_to_quantize)
         30 
         31         # run shape inference on the model
    ---> 32         model = onnx.shape_inference.infer_shapes(model)
         33         self.value_infos = {vi.name: vi for vi in model.graph.value_info}
         34         self.value_infos.update({ot.name: ot for ot in model.graph.output})
    
    /opt/conda/lib/python3.7/site-packages/onnx/shape_inference.py in infer_shapes(model, check_type, strict_mode)
         34 def infer_shapes(model, check_type=False, strict_mode=False):  # type: (ModelProto, bool, bool) -> ModelProto
         35     if isinstance(model, ModelProto):
    ---> 36         model_str = model.SerializeToString()
         37         inferred_model_str = C.infer_shapes(model_str, check_type, strict_mode)
         38         return onnx.load_from_string(inferred_model_str)
    
    ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 19459248612
    
    opened by ViktorThink 5
  • small onnx optimizations


    A few small tweaks:

    • Set transformers version to >4.6.1 — this is the minimum necessary requirement
    • make sequence length configurable, so ORT can use this information when optimizing. Also set batch_size to 1, since this is all CPU inference, ORT should optimize for batch size 1
    • use U8S8 per ONNX docs and add reduce_range to help with saturation issue
    • make opset version configurable for users who want to experiment with more recent ONNX versions
    opened by sam-writer 4
  • Problems with T5 & onnxruntime


    Hello there. My purpose is to speed up a T5-small (fine-tuned) both on CPU and GPU. So, I am trying to transform the net through fastT5. However, using the quantized model on CPU, I get similar performances with respect to the initial model (T5-small, on CPU), without any significant improvement. Am I missing something?

    Moreover, I have problems with onnxruntime-gpu. I have read from the other issues that I can't use onnxruntime-gpu with quantization, is that correct?

    Also, I am trying to convert the T5-small model into a non-quantized onnx model, in order to be able to use it on the GPU with onnxruntime-gpu and obtain some improvements. In this case, I get the errors:

    [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_98' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 14 by 16

    or

    [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Add node. Name:'Add_98' Status Message: Add_98: right operand cannot broadcast on dim 2 LeftShape: {1,8,85,85}, RightShape: {1,8,14,85}

    The code is reported below. It fails in the generate method when I try to run it with onnxruntime-gpu.

        t_input = 'translate {} to SQL: '.format(languages[lang]) + original_text
        tokenizer = AutoTokenizer.from_pretrained(model_directory)
        token = tokenizer(t_input, return_tensors='pt')
        tokens = model.generate(input_ids=token['input_ids'], attention_mask=token['attention_mask'], num_beams=3)
    

    Thank you.

    opened by GenVr 4
  • Getting runtime error.


    Hi @Ki6an, it's great work. But while executing the code below

    from fastT5 import export_and_get_onnx_model
    from transformers import AutoTokenizer
    
    model_name = 't5-small'
    model = export_and_get_onnx_model(model_name)
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    t_input = "translate English to French: The universe is a dark forest."
    token = tokenizer(t_input, return_tensors='pt')
    
    tokens = model.generate(input_ids=token['input_ids'],
                   attention_mask=token['attention_mask'],
                   num_beams=2)
    
    output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
    print(output)
    

    I'm getting this error.

    RuntimeError: output with shape [5, 12, 1, 2] doesn't match the broadcast shape [5, 12, 2, 2]
    
    
    
    good first issue 
    opened by abhinavsp0730 4
  • Error from export_and_get_onnx_model()


    Hi,

    I am getting the following error when I run the test snippet below, taken from the readme.md file. However, I can see that 3 onnx files are created: one for the encoder, one for the decoder, and one for the init decoder.

    Environment:
    fastT5 - 0.0.7
    macOS - Big Sur (11.2.3)
    Python (conda) - 3.7.6

    Error:

    Exporting to onnx... |################################| 3/3
    [libprotobuf ERROR google/protobuf/descriptor_database.cc:394] Invalid file descriptor data passed to EncodedDescriptorDatabase::Add().
    [libprotobuf FATAL google/protobuf/descriptor.cc:1356] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
    libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):

    Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

    Test code snippet

    from fastT5 import export_and_get_onnx_model
    from transformers import AutoTokenizer
    
    model_name = 't5-small'
    model = export_and_get_onnx_model(model_name)
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    t_input = "translate English to French: The universe is a dark forest."
    token = tokenizer(t_input, return_tensors='pt')
    
    tokens = model.generate(input_ids=token['input_ids'],
                            attention_mask=token['attention_mask'],
                            num_beams=2)
    
    output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
    print(output)
    
    opened by PhaneendraGunda 3
  • Thank You and Demo Running in the Browser


    I just wanted to thank you for this library. I learned a lot from it and will be using it in my own projects.

    I was able to use it to create an end-to-end pipeline (transformers-js) to run T5 in the browser. I think it's pretty cool and wanted to show you: https://transformers-js.praeclarum.org

    Thanks again!

    opened by praeclarum 2
  • fastt5 not working with FastAPI gunicorn and docker


    I am using fastt5 with the FastAPI web framework and gunicorn as a server within a docker container. The server doesn't start up completely, i.e. it hangs during the startup process. Command to start the server: gunicorn app.my_app:app --bind 0.0.0.0:${PORT} --reload --timeout 120 --access-logfile -

    requirements.txt:

    anyio==3.6.2
    certifi==2022.9.24
    charset-normalizer==2.1.1
    click==8.1.3
    fastapi==0.85.1
    filelock==3.8.0
    h11==0.14.0
    huggingface-hub==0.10.1
    idna==3.4
    numpy==1.23.4
    omegaconf==2.2.3
    packaging==21.3
    pydantic==1.10.2
    pyparsing==3.0.9
    PyYAML==6.0
    regex==2022.9.13
    requests==2.28.1
    sentencepiece==0.1.97
    sniffio==1.3.0
    starlette==0.20.4
    tokenizers==0.13.1
    torch==1.12.1
    tqdm==4.64.1
    transformers==4.23.1
    typing_extensions==4.4.0
    urllib3==1.26.12
    uvicorn==0.19.0
    gunicorn==20.1.0
    httptools==0.5.0
    python-dotenv==0.21.0
    uvloop==0.17.0
    watchfiles==0.18.0
    websockets==10.4
    fastt5==0.1.4
    six==1.16.0
    

    There is NO error in the output during the start-up. It just hangs.

    [2022-11-17 14:19:38 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
    [2022-11-17 14:19:38 +0000] [7] [INFO] Using worker: sync
    [2022-11-17 14:19:38 +0000] [8] [INFO] Booting worker with pid: 8
    Downloading: 100%|██████████| 1.20k/1.20k [00:00<00:00, 473kB/s]
    Downloading: 100%|██████████| 242M/242M [00:28<00:00, 8.41MB/s] 
    In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
    In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
    Exporting to onnx... |################################| 3/3
    Quantizing... |################################| 3/3
    Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]Setting up onnx model...
    Done!
    Downloading: 100%|██████████| 792k/792k [00:00<00:00, 1.41MB/s] 
    

    Could you please have a look into this?

    opened by kklivil 1
  • M2M100 to ONNX


    I am currently trying to, with slight modifications, apply fastT5 to M2M100. While the conversion itself is working like a charm, I am getting a lot of

    Ignore MatMul due to non constant B: /[MatMul_(insert int here)], e.g. Ignore MatMul due to non constant B: /[MatMul_2256]

    errors during quantization. After some research I dug up multiple colabs that are simply ignoring this warning. Is any treatment necessary and are there known ways to handle it?

    Apart from that: do you plan on expanding this repo to different models (for example in different branches) or do you rather want forks that provide support for other models?

    opened by LuckiestOne23 0
  • Mt5 model loading fails


    Hello, I have an MT5 pretrained model and I am using the fastt5 approach to convert it to onnx. The conversion of the model works fine, but creating the decoder session at decoder_sess = InferenceSession(str(path_to_decoder)) fails; more specifically, it fails at

    # initialize the C++ InferenceSession
    sess.initialize_session(providers, provider_options, disabled_optimizers)
    

    It fails without any error, just: Process finished with exit code 135 (interrupted by signal 7: SIGEMT). Loading the encoder model works, but not the decoder model.

    I am using the latest version, fastt5==0.1.4. Any ideas on how to create the session?

    opened by OriAlpha 11
  • Fails to convert T0-3B


    T0-3B is just a finetune of T5v1.1_3B_LMadapt (according to their paper), and in HF it loads via just the standard T5 code.

    I am able to successfully ONNX-convert other T5 models on my computer.

    But when I try on T0:

    ❯ MODEL=~/Downloads/PT_T0_3B poetry run test1_onnx
    Loading model from /home/user/Downloads/PT_T0_3B
    In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
    In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
    Exporting to onnx... |################################| 3/3
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 63, in main
        typer.run(bean_cli)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 864, in run
        app()
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
        return get_command(self)(*args, **kwargs)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
        return callback(**use_params)  # type: ignore
      File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 45, in bean_cli
        models = load_torch_models(os.environ.get("MODEL"))
      File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 17, in load_torch_models
        model = export_and_get_onnx_model(model_path, onnx_model_output_path)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/fastT5/onnx_models.py", line 219, in export_and_get_onnx_model
        quant_model_paths = quantize(onnx_model_paths)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/fastT5/onnx_exporter.py", line 280, in quantize
        quantize_dynamic(
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnxruntime/quantization/quantize.py", line 308, in quantize_dynamic
        model = load_model(Path(model_input), optimize_model)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnxruntime/quantization/quantize.py", line 53, in load_model
        return onnx.load(Path(model_path))
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/__init__.py", line 127, in load_model
        load_external_data_for_model(model, base_dir)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/external_data_helper.py", line 69, in load_external_data_for_model
        load_external_data_for_tensor(tensor, base_dir)
      File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/external_data_helper.py", line 48, in load_external_data_for_tensor
        with open(external_data_file_path, 'rb') as data_file:
    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/Dev/Proj/codet5_tests/encoder.embed_tokens.weight'
    
    opened by redthing1 1
  • Unable to retrieve hidden_states


    I converted a locally saved T5 checkpoint to ONNX using FastT5:

    >>> from fastT5 import export_and_get_onnx_model
    >>> from transformers import AutoTokenizer
    
    >>> model_checkpoint = "path/to/checkpoint"
    >>> model = export_and_get_onnx_model(model_name)
    

    I tested it for inference:

    >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    >>> token = tokenizer(input_terms, max_length=512 * 2, padding=True, truncation=True, return_tensors='pt')
    
    >>> out = model.generate(input_ids=token['input_ids'].to('cpu'),
                                attention_mask=token['attention_mask'].to('cpu'),
                                return_dict_in_generate=True,
                                max_length=512 * 2,
                                num_beams=1,
                                output_scores=True,
                                output_hidden_states=True)
    
    >>> out.encoder_hidden_states
    >>> out.decoder_hidden_states
    (None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
     None,
    ...
    
    >>> out
    GreedySearchEncoderDecoderOutput(sequences=tensor([[  0, 119, 114, 102, 108, 111, 108, 125, 120, 112, 100, 101,  35,  53, ...
    ...
    ), , encoder_attentions=None, encoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, decoder_hidden_states=(None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None))
    

    The hidden states are all None.

    Is there any way that I can retrieve the hidden states for both encoder and decoder?

    opened by vsoesanto 2