⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x using fastT5.

Kiran R

Last update: Jan 5, 2023

Related tags

Text Data & NLP python nlp fast translation question-answering quantization onnx t5 onnxruntime fastt5 quantized-onnx-models inference-speed

Overview

Reduce T5 model size by 3X and increase the inference speed up to 5X.

Install
Usage
Details
Functionalities
Benchmarks
- Onnx model
- Quantized onnx model
Quantized model scores
further improvements
License
Get Help
Acknowledgements

T5 models can be used for several NLP tasks such as summarization, QA, QG, translation, text generation, and more. Sequential text generation is naturally slow, and for larger T5 models it gets even slower. fastT5 speeds up T5 inference by running the model on onnxruntime, and it reduces the model size by quantizing it.

The fastT5 library lets you convert a pretrained T5 model to ONNX, quantize it, and get back a model that runs on onnxruntime, all in a single line of code. You can also customize this whole process.

Install

You can install fastT5 from PyPI:

 pip install fastt5

If you want to build from source:

git clone https://github.com/Ki6an/fastT5
cd fastT5
pip3 install -e .

Usage

The export_and_get_onnx_model() method exports the given pretrained T5 model to ONNX, quantizes it, and runs it on onnxruntime with default settings. The model it returns supports the generate() method of huggingface.

If you don't want to quantize the model, pass quantized=False to the method.

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().

You can customize the whole pipeline as shown in the code example below:

from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 't5-small'

# Step 1. Convert the huggingface T5 model to ONNX.
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (Recommended) Quantize the converted model for faster inference and a smaller model size.
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up the onnxruntime sessions.
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Get the ONNX model.
model = OnnxT5(model_or_model_path, model_sessions)

                      ...

Details

T5 is a seq2seq (encoder-decoder) model. Because it uses the decoder repeatedly during inference, we can't export the whole model to ONNX directly; we need to export the encoder and decoder separately.

past_key_values contains pre-computed hidden states (the keys and values in the self-attention and cross-attention blocks) that can be used to speed up sequential decoding.
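As a rough illustration of why cached keys and values help, the toy sketch below decodes a short sequence twice: once recomputing every key and value at each step, and once appending only the newest pair to a cache. The project() function is a hypothetical stand-in for the learned projections and does not reflect the actual T5 layers; only the caching pattern is the point.

```python
import math

def project(vector, scale):
    """Hypothetical stand-in for a learned key/value projection."""
    return [scale * v for v in vector]

def attend(query, keys, values):
    """Softmax attention of one query over all keys; returns the weighted value."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(t, 0.5) for t in prefix]
        values = [project(t, 2.0) for t in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Keep past keys and values; project only the newest token at each step."""
    outputs, projections = [], 0
    past_keys, past_values = [], []
    for token in tokens:
        past_keys.append(project(token, 0.5))
        past_values.append(project(token, 2.0))
        projections += 2
        outputs.append(attend(token, past_keys, past_values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_cost = decode_without_cache(tokens)
fast_out, fast_cost = decode_with_cache(tokens)
print(slow_cost, fast_cost)
```

Both loops produce the same outputs, but the cached version performs a number of projections that grows linearly with the sequence length instead of quadratically.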

Models can only be exported with a constant number of inputs. The decoder, however, does not take past_key_values at the first step but does at every later step. To get around this, we create two decoders: one for the first step that does not take past_key_values, and another for the remaining steps that uses them.
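The dispatch between the two decoders can be sketched as follows. This is a minimal stand-alone illustration with stub functions, not fastT5's actual code: init_decoder and decoder here are hypothetical placeholders that return a fake next token id and a growing cache, so that the control flow is visible.

```python
calls = []  # records which stub handled each step

def init_decoder(input_ids, encoder_hidden_states):
    """Stub for the first-step decoder: fixed inputs, no past_key_values."""
    calls.append("init_decoder")
    past_key_values = (input_ids[-1],)
    return input_ids[-1] + 1, past_key_values

def decoder(input_ids, encoder_hidden_states, past_key_values):
    """Stub for the later-step decoder: always receives past_key_values."""
    calls.append("decoder")
    past_key_values = past_key_values + (input_ids[-1],)
    return input_ids[-1] + 1, past_key_values

def decode_step(input_ids, encoder_hidden_states, past_key_values=None):
    """Route to the graph whose fixed input signature matches this step."""
    if past_key_values is None:
        return init_decoder(input_ids, encoder_hidden_states)
    # With a cache available, only the newest token needs to be fed in.
    return decoder(input_ids[-1:], encoder_hidden_states, past_key_values)

def greedy_decode(encoder_hidden_states, start_id=0, max_new_tokens=4):
    ids, past = [start_id], None
    for _ in range(max_new_tokens):
        next_id, past = decode_step(ids, encoder_hidden_states, past)
        ids.append(next_id)
    return ids

generated = greedy_decode(encoder_hidden_states="encoder-output-placeholder")
print(generated, calls)
```

Each stub has a constant set of inputs, which is what makes both of them exportable; the wrapper hides the switch from the caller.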

Next, we export all three models (encoder, decoder, init_decoder) and quantize them. Quantizing from 32-bit to 8-bit should give a 4x memory reduction, but since there is an extra decoder, the overall model size reduces by about 3x.
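The size arithmetic can be checked with a back-of-the-envelope sketch. It assumes, for simplicity, that every weight is quantized and that the init_decoder is the same size as the decoder; the megabyte figures are hypothetical and are not measurements of any T5 checkpoint.

```python
def quantized_export_size(encoder_mb, decoder_mb, bits_before=32, bits_after=8):
    """Estimate the size before and after export plus quantization.

    Returns (original_mb, quantized_mb, reduction_factor).
    """
    original = encoder_mb + decoder_mb
    # The export contains the encoder, the decoder, and the extra init_decoder.
    exported = encoder_mb + 2 * decoder_mb
    quantized = exported * bits_after / bits_before
    return original, quantized, original / quantized

# Hypothetical split: encoder and decoder take 100 MB each at 32-bit.
original, quantized, factor = quantized_export_size(100.0, 100.0)
print(f"{original:.0f} MB -> {quantized:.0f} MB ({factor:.2f}x smaller)")
```

Under these assumptions the reduction lands a little below 3x; the exact factor depends on how the weights are split between the encoder and the decoder.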

Finally, we'll run the quantized model on onnx runtime.

Inference is simple because the model supports the generate() method of huggingface.

Functionalities

Export any pretrained T5 model to ONNX easily (with past_key_values).
The exported model supports beam search, greedy search, and more via the generate() method.
Reduce the model size by 3X using quantization.
Up to 5X speedup compared to PyTorch execution for greedy search and 3-4X for beam search.

Benchmarks

The benchmarks are the result of the T5-base model tested on English to French translation.

Onnx model

The following graph shows the latency of the quantized onnx model vs the PyTorch model for beam numbers varying from 1 to 9. The latencies shown here are the mean over sequence lengths up to 130.

The following heat map shows how many times faster the onnx model is, i.e. the ratio of the PyTorch model's latency to the onnx model's latency. The onnx model outperforms PyTorch in most cases; however, its speed advantage drops for longer sequence lengths.
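The "X times faster" figure is simply a ratio of latencies. The sketch below shows one way such a grid could be computed; the helper names and the latency values are hypothetical placeholders and are not the benchmark results.

```python
import statistics
import time

def mean_latency(fn, inputs, repeats=3):
    """Average wall-clock time of fn over all inputs, in seconds."""
    timings = []
    for item in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(item)
            timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

def speedup(baseline_latency, candidate_latency):
    """How many times faster the candidate is than the baseline."""
    return baseline_latency / candidate_latency

def speedup_grid(baseline, candidate):
    """Both arguments map (sequence_length, num_beams) to a latency."""
    return {key: speedup(baseline[key], candidate[key]) for key in baseline}

# Hypothetical latencies in milliseconds, keyed by (sequence_length, num_beams).
pytorch_ms = {(10, 1): 500, (10, 3): 900}
onnx_ms = {(10, 1): 100, (10, 3): 300}
grid = speedup_grid(pytorch_ms, onnx_ms)
print(grid)
```

A value above 1 in a cell means the ONNX model was faster for that sequence length and beam count.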

Quantized onnx model

Quantized models are lightweight, as mentioned earlier, and have almost the same accuracy as the original model (quantized model scores are given in the next section). Quantized onnx models have the lowest latency compared to both the onnx and PyTorch models.
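To make the "smaller but almost as accurate" idea concrete, here is a toy 8-bit affine quantization of a handful of weights. It is only an illustration of the general technique and is not how onnxruntime implements quantization; the weights are made up.

```python
from array import array

def quantize_uint8(weights):
    """Map floats onto 0..255 using a scale and an integer zero point."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant inputs
    zero_point = round(-lo / scale)
    quantized = array(
        "B",
        (max(0, min(255, round(w / scale) + zero_point)) for w in weights),
    )
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit codes."""
    return [(q - zero_point) * scale for q in quantized]

weights = array("f", [-1.0, -0.25, 0.0, 0.4, 1.0])  # 32-bit floats
codes, scale, zero_point = quantize_uint8(weights)
restored = dequantize(codes, scale, zero_point)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
size_ratio = (weights.itemsize * len(weights)) / (codes.itemsize * len(codes))
print(f"storage ratio: {size_ratio:.0f}x, max error: {max_error:.4f}")
```

Each weight shrinks from four bytes to one, while the round trip stays within about one quantization step of the original value.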

The model outperforms the PyTorch model by 5.7X for greedy search on average and 3-4X for beam search.

Note: The results were generated on an AMD EPYC 7B12; they may vary from device to device. The Onnx models usually perform well on high-end CPUs with more cores.

Quantized model scores

The results were tested for English to French translation using beam search with 3 beams.

Model               Bleu_4    METEOR    ROUGE_L
t5-small (quant)    0.240769  0.282342  0.468817
t5-small (pytorch)  0.254601  0.295172  0.492749
t5-base (quant)     0.267606  0.306019  0.499188
t5-base (pytorch)   0.268346  0.304969  0.503306
t5-large (quant)    0.286726  0.316845  0.503585
t5-large (pytorch)  0.294015  0.315774  0.508677
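To see how close the quantized scores are to the PyTorch ones, the relative change per metric can be computed directly from the table. The snippet below only re-uses the numbers listed above; the helper name is illustrative.

```python
# Scores copied from the table: (Bleu_4, METEOR, ROUGE_L).
METRICS = ("Bleu_4", "METEOR", "ROUGE_L")
SCORES = {
    "t5-small": {"quant": (0.240769, 0.282342, 0.468817),
                 "pytorch": (0.254601, 0.295172, 0.492749)},
    "t5-base": {"quant": (0.267606, 0.306019, 0.499188),
                "pytorch": (0.268346, 0.304969, 0.503306)},
    "t5-large": {"quant": (0.286726, 0.316845, 0.503585),
                 "pytorch": (0.294015, 0.315774, 0.508677)},
}

def relative_change_percent(quant, pytorch):
    """Percentage change of the quantized score relative to the PyTorch score."""
    return (quant - pytorch) / pytorch * 100.0

changes = {
    model: {
        metric: relative_change_percent(q, p)
        for metric, q, p in zip(METRICS, rows["quant"], rows["pytorch"])
    }
    for model, rows in SCORES.items()
}

for model, per_metric in changes.items():
    summary = ", ".join(f"{m}: {v:+.2f}%" for m, v in per_metric.items())
    print(f"{model}: {summary}")
```

Negative values mean the quantized model scored lower; the largest drop is for t5-small, and for METEOR on t5-base and t5-large the quantized model is marginally ahead.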

further improvements

Currently the fastT5 library supports only the CPU version of onnxruntime; GPU support still needs to be implemented.
Graph optimization of the onnx model should further reduce the latency.

Get Help

Contact me at [email protected]
If appropriate, open an issue on GitHub

Acknowledgements

original T5 paper
transformers by huggingface
onnx
onnxruntime by microsoft
onnxt5

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

Comments

forward() got an unexpected keyword argument 'cross_attn_head_mask'

----> 1 paraphrase_t5("Kyle Lowry scored 33 points and Norman Powell added 23 to lift the Toronto Raptors to a 122-125 victory over the Boston Celtics on Wednesday night.")

4 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'

opened by mrm8488 20

Observing difference in outputs from decoder with IO bindings.

Hi @Ki6an, I was trying to implement IO bindings for the decoder part of the model. I used the same code from your repo to convert the model to ONNX. After loading the model and making predictions using the decoder session directly, the output appears to be fine, but with the inputs bound the result is different.

Below is the code for the IO bindings:

import numpy as np
import torch

def dec_pred_with_io_bindings(input_ids, attention_mask, encoder_output, past_key_values_dict, dec_session):
  dec_io_binding = dec_session.io_binding()
  dec_io_binding.bind_input(name="input_ids",
                          device_type="cuda",
                          device_id=0,
                          element_type=np.longlong,
                          shape=list(input_ids.shape),
                          buffer_ptr=input_ids.data_ptr())
  dec_io_binding.bind_input(name="encoder_attention_mask",
                          device_type="cuda",
                          device_id=0,
                          element_type=np.longlong,
                          shape=list(attention_mask.shape),
                          buffer_ptr=attention_mask.data_ptr())
                        
  dec_io_binding.bind_input(name="encoder_hidden_states",
                          device_type="cuda",
                          device_id=0,
                          element_type=np.float32,
                          shape=list(encoder_output.shape),
                          buffer_ptr=encoder_output.data_ptr())
  

  for key,val in past_key_values_dict.items():
    dec_io_binding.bind_input(name=key,
                                      device_type="cuda",
                                      device_id=0,
                                      element_type=np.float32,
                                      shape=list(val.shape),
                                      buffer_ptr=val.data_ptr())
  
  # Bind outputs (use the session passed in; there is no self in this function).
  for arg in dec_session.get_outputs():
    dec_io_binding.bind_output(arg.name, "cuda")
    
  dec_session.run_with_iobinding(dec_io_binding)
  ort_output = dec_io_binding.get_outputs()

  logits=ort_output[0]

  list_pkv = tuple(torch.from_numpy(x.numpy()).cuda() for x in ort_output[1:])

  # creates a tuple of tuples of shape 6x4 from the above tuple
  out_past_key_values = tuple(
      list_pkv[i : i + 4] for i in range(0, len(list_pkv), 4)
  )


  return torch.from_numpy(logits.numpy()).cuda(),out_past_key_values

opened by VikasOjha666 10

Errors when loading saved onnx files
We have an issue with saving and loading onnx files. When passing the generated quant_model_paths to get_onnx_runtime_sessions everything works okay but if I save the file and then run get_onnx_runtime_sessions on the loaded quantized files the model throws an error:

File "/Users/itai/Code/email-cleaner/.venv/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 184, in run raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs)) ValueError: Model requires 3 inputs. Input Feed contains 2

This doesn't seem to happen on SageMaker but it happens on mac and also on linux containerized environment.
opened by itaim 10

Updating fastT5?

Hello @Ki6an.

Thanks a lot for fastT5! One question: are you planning to update it to the latest versions of ONNX, ONNX Runtime and Transformers?

Thank you.

opened by piegu 6

Conversion of decoder with past_key_values to float16.

Hi @Ki6an. With the converted ONNX model generated, I was trying to convert the decoder_init and decoder to float16. I did the quantization with onnxruntime's transformer optimizer. I was able to convert the decoder_init to fp16, but while converting the decoder with past_key_values I am getting the following issue:

AssertionError Traceback (most recent call last) in ()
     25 ,
     26 )
---> 27 optimized_model.convert_float_to_float16() # FP32 -> FP16
     28 optimized_model.save_model_to_file('/content/optimized_models/t5-base-qa-qg-hl-decoder.onnx')
     29

6 frames
/usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply)
    216
    217 def add_suggested_merge(self, symbols, apply=False):
--> 218     assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
    219     symbols = set(symbols)
    220     for k, v in self.suggested_merge.items():

AssertionError:

Code used:

from onnxruntime.transformers import optimizer

# Decoder.
optimized_model = optimizer.optimize_model(
    input='/content/models/t5-base-qa-qg-hl-decoder.onnx',
    use_gpu=True,
    opt_level=1,
    only_onnxruntime=True,
)
optimized_model.convert_float_to_float16()  # FP32 -> FP16
optimized_model.save_model_to_file('/content/optimized_models/t5-base-qa-qg-hl-decoder.onnx')

opened by VikasOjha666 5

t5-11b out of memory/FileNotFoundError

First of all, this seems like a great repo that I was super excited to find!

When testing with t5-small everything works correctly. But when trying with my custom t5-11b I get out of memory issues.

I was running this with a t5-11b as model: onnx_model_paths = generate_onnx_representation("t5-11b",model=model)

And at first I got this error:

RuntimeError: Exporting model exceed maximum protobuf size of 2GB. Please call torch.onnx.export with use_external_data_format=True.

So I simply added use_external_data_format=True to all of the three torch.onnx.export in onnx_exporter.py in fastT5.

Then I can run onnx_model_paths = generate_onnx_representation(model_name,model=model), and get no error (First time I posted I got an error but it seems like I made an error and only had 100 GB disk memory, when trying 200 GB it worked).

Then when running quant_model_paths = quantize(onnx_model_paths) I get the error:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-7-3a782b6d5a25> in <module>
      8 
      9 # Step 2. (recommended) quantize the converted model for fast inference and to reduce model size.
---> 10 quant_model_paths = quantize(onnx_model_paths)
     11 
     12 # step 3. setup onnx runtime

~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
    273             activation_type=QuantType.QUInt8,
    274             weight_type=QuantType.QUInt8,
--> 275             optimize_model=False,
    276         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
    277         quant_model_paths.append(output_model_name)

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
    266         op_types_to_quantize = list(IntegerOpsRegistry.keys())
    267 
--> 268     model = load_model(Path(model_input), optimize_model)
    269     quantizer = ONNXQuantizer(
    270         model,

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in load_model(model_path, optimize)
     51         return onnx_model.model
     52 
---> 53     return onnx.load(Path(model_path))
     54 
     55 

/opt/conda/lib/python3.7/site-packages/onnx/__init__.py in load_model(f, format, load_external_data)
    125         if model_filepath:
    126             base_dir = os.path.dirname(model_filepath)
--> 127             load_external_data_for_model(model, base_dir)
    128 
    129     return model

/opt/conda/lib/python3.7/site-packages/onnx/external_data_helper.py in load_external_data_for_model(model, base_dir)
     69     for tensor in _get_all_tensors(model):
     70         if uses_external_data(tensor):
---> 71             load_external_data_for_tensor(tensor, base_dir)
     72             # After loading raw_data from external_data, change the state of tensors
     73             tensor.data_location = TensorProto.DEFAULT

/opt/conda/lib/python3.7/site-packages/onnx/external_data_helper.py in load_external_data_for_tensor(tensor, base_dir)
     48     external_data_file_path = os.path.join(base_dir, file_location)
     49 
---> 50     with open(external_data_file_path, 'rb') as data_file:
     51 
     52         if info.offset:

FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyter/encoder.embed_tokens.weight'

Has anyone successfully exported the t5-11b version and knows how to solve this?

Update:

I tried changing the working directory to /home/jupyter/models instead of /home/jupyter/, which seems to solve the FileNotFoundError. But then again I get problems with the size:

ValueError                                Traceback (most recent call last)
<ipython-input-10-032d95bca1c8> in <module>
      1 os.chdir(r'/home/jupyter/models/')
----> 2 quant_model_paths = quantize(onnx_model_paths)

~/fastT5/fastT5/onnx_exporter.py in quantize(models_name_or_path)
    273             activation_type=QuantType.QUInt8,
    274             weight_type=QuantType.QUInt8,
--> 275             optimize_model=False,
    276         )  # op_types_to_quantize=['MatMul', 'Relu', 'Add', 'Mul' ],
    277         quant_model_paths.append(output_model_name)

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_dynamic(model_input, model_output, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format)
    278         nodes_to_quantize,
    279         nodes_to_exclude,
--> 280         op_types_to_quantize)
    281 
    282     quantizer.quantize_model()

/opt/conda/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in __init__(self, model, per_channel, reduce_range, mode, static, weight_qType, input_qType, tensors_range, nodes_to_quantize, nodes_to_exclude, op_types_to_quantize)
     30 
     31         # run shape inference on the model
---> 32         model = onnx.shape_inference.infer_shapes(model)
     33         self.value_infos = {vi.name: vi for vi in model.graph.value_info}
     34         self.value_infos.update({ot.name: ot for ot in model.graph.output})

/opt/conda/lib/python3.7/site-packages/onnx/shape_inference.py in infer_shapes(model, check_type, strict_mode)
     34 def infer_shapes(model, check_type=False, strict_mode=False):  # type: (ModelProto, bool, bool) -> ModelProto
     35     if isinstance(model, ModelProto):
---> 36         model_str = model.SerializeToString()
     37         inferred_model_str = C.infer_shapes(model_str, check_type, strict_mode)
     38         return onnx.load_from_string(inferred_model_str)

ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 19459248612

opened by ViktorThink 5

small onnx optimizations
A few small tweaks:

Set transformers version to >4.6.1 — this is the minimum necessary requirement

make sequence length configurable, so ORT can use this information when optimizing. Also set batch_size to 1, since this is all CPU inference, ORT should optimize for batch size 1

use U8S8 per ONNX docs and add reduce_range to help with saturation issue

make opset version configurable for users who want to experiment with more recent ONNX versions
opened by sam-writer 4

Problems with T5 & onnxruntime
Hello there. My purpose is to speed up a T5-small (fine-tuned) both on CPU and GPU. So, I am trying to transform the net through fastT5. However, using the quantized model on CPU, I get similar performances with respect to the initial model (T5-small, on CPU), without any significant improvement. Am I missing something?

Moreover, I have problems with onnxruntime-gpu. I have read from the other issues that I can't use onnxruntime-gpu with quantization, is that correct?

Also, I am trying to transform the T5-small model into a non-quantized onnx model, in order to be able to use it on the GPU with onnxruntime-gpu, to obtain some improvements. In this case, i get the errors:

[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_98' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 14 by 16

or

[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Add node. Name:'Add_98' Status Message: Add_98: right operand cannot broadcast on dim 2 LeftShape: {1,8,85,85}, RightShape: {1,8,14,85}

The code is reported below. It fails in the generate method when I try to run it with onnxruntime-gpu.

t_input = 'translate {} to SQL: '.format(languages[lang]) + original_text
tokenizer = AutoTokenizer.from_pretrained(model_directory)
token = tokenizer(t_input, return_tensors='pt')
tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=3)

Thank you.
opened by GenVr 4

Getting runtime error.

Hi, @Ki6an it's great work. But while executing below code

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

I'm getting this error.

RuntimeError: output with shape [5, 12, 1, 2] doesn't match the broadcast shape [5, 12, 2, 2]

good first issue

opened by abhinavsp0730 4

Error from export_and_get_onnx_model()
Hi,

I am getting the following error when I run the following test snippet provided in the readme.md file. But I can see 3 onnx files are created one for encoder, one for decoder and one for init decoder.

Environment
fastT5 - 0.0.7
MacOS - Big Sur (11.2.3)
Python conda - 3.7.6

Error
Exporting to onnx... |################################| 3/3
[libprotobuf ERROR google/protobuf/descriptor_database.cc:394] Invalid file descriptor data passed to EncodedDescriptorDatabase::Add().
[libprotobuf FATAL google/protobuf/descriptor.cc:1356] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

Test code snippet

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "translate English to French: The universe is a dark forest."
token = tokenizer(t_input, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=2)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

opened by PhaneendraGunda 3

Thank You and Demo Running in the Browser

I just wanted to thank you for this library. I learned a lot from it and will be using it in my own projects.

I was able to use it to create an end-to-end pipeline (transformers-js) to run T5 in the browser. I think it's pretty cool and wanted to show you: https://transformers-js.praeclarum.org

Thanks again!

opened by praeclarum 2

fastt5 not working with FastAPI gunicorn and docker

I am using fastt5 with the FastAPI web framework and gunicorn as the server, inside a docker container. The server doesn't start up completely, i.e. it hangs during the startup process. Command to start the server:

gunicorn app.my_app:app --bind 0.0.0.0:${PORT} --reload --timeout 120 --access-logfile -

requirements.txt:

anyio==3.6.2
certifi==2022.9.24
charset-normalizer==2.1.1
click==8.1.3
fastapi==0.85.1
filelock==3.8.0
h11==0.14.0
huggingface-hub==0.10.1
idna==3.4
numpy==1.23.4
omegaconf==2.2.3
packaging==21.3
pydantic==1.10.2
pyparsing==3.0.9
PyYAML==6.0
regex==2022.9.13
requests==2.28.1
sentencepiece==0.1.97
sniffio==1.3.0
starlette==0.20.4
tokenizers==0.13.1
torch==1.12.1
tqdm==4.64.1
transformers==4.23.1
typing_extensions==4.4.0
urllib3==1.26.12
uvicorn==0.19.0
gunicorn==20.1.0
httptools==0.5.0
python-dotenv==0.21.0
uvloop==0.17.0
watchfiles==0.18.0
websockets==10.4
fastt5==0.1.4
six==1.16.0

There is NO error in the output during the start-up. It just hangs.

[2022-11-17 14:19:38 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
[2022-11-17 14:19:38 +0000] [7] [INFO] Using worker: sync
[2022-11-17 14:19:38 +0000] [8] [INFO] Booting worker with pid: 8
Downloading: 100%|██████████| 1.20k/1.20k [00:00<00:00, 473kB/s]
Downloading: 100%|██████████| 242M/242M [00:28<00:00, 8.41MB/s] 
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
Exporting to onnx... |################################| 3/3
Quantizing... |################################| 3/3
Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]Setting up onnx model...
Done!
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 1.41MB/s]

Could you please have a look into this?

opened by kklivil 1

M2M100 to ONNX

I am currently trying to, with slight modifications, apply fastT5 to M2M100. While the conversion itself is working like a charm, I am getting a lot of

Ignore MatMul due to non constant B: /[MatMul_(insert int here)]

e.g. Ignore MatMul due to non constant B: /[MatMul_2256]

errors during quantization. After some research I dug up multiple colabs that are simply ignoring this warning. Is any treatment necessary and are there known ways to handle it?

Apart from that: do you plan on expanding this repo to different models (for example in different branches) or do you rather want forks that provide support for other models?

opened by LuckiestOne23 0

Mt5 model loading fails

Hello, I have a pretrained MT5 model and am using the fastt5 approach to convert it to onnx. The conversion of the model works fine, but it fails when creating the decoder session at decoder_sess = InferenceSession(str(path_to_decoder)); more specifically, it fails at

# initialize the C++ InferenceSession
sess.initialize_session(providers, provider_options, disabled_optimizers)

It fails without any error message: Process finished with exit code 135 (interrupted by signal 7: SIGEMT). Loading the encoder model works, but not the decoder model.

I am using the latest version, fastt5==0.1.4. Any ideas on how to create the session?
opened by OriAlpha 11

Fails to convert T0-3B

T0-3B is just a finetune of T5v1.1_3B_LMadapt (according to their paper), and in HF it loads via just the standard T5 code.

I am able to successfully ONNX-convert other T5 models on my computer.

But when I try on T0:

❯ MODEL=~/Downloads/PT_T0_3B poetry run test1_onnx
Loading model from /home/user/Downloads/PT_T0_3B
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
Exporting to onnx... |################################| 3/3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 63, in main
    typer.run(bean_cli)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 864, in run
    app()
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 45, in bean_cli
    models = load_torch_models(os.environ.get("MODEL"))
  File "/home/user/Dev/Proj/codet5_tests/codet5_tests/test1_onnx.py", line 17, in load_torch_models
    model = export_and_get_onnx_model(model_path, onnx_model_output_path)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/fastT5/onnx_models.py", line 219, in export_and_get_onnx_model
    quant_model_paths = quantize(onnx_model_paths)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/fastT5/onnx_exporter.py", line 280, in quantize
    quantize_dynamic(
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnxruntime/quantization/quantize.py", line 308, in quantize_dynamic
    model = load_model(Path(model_input), optimize_model)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnxruntime/quantization/quantize.py", line 53, in load_model
    return onnx.load(Path(model_path))
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/__init__.py", line 127, in load_model
    load_external_data_for_model(model, base_dir)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/external_data_helper.py", line 69, in load_external_data_for_model
    load_external_data_for_tensor(tensor, base_dir)
  File "/home/user/.cache/pypoetry/virtualenvs/codet5-tests-LNmARMSb-py3.9/lib/python3.9/site-packages/onnx/external_data_helper.py", line 48, in load_external_data_for_tensor
    with open(external_data_file_path, 'rb') as data_file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/Dev/Proj/codet5_tests/encoder.embed_tokens.weight'

opened by redthing1 1

Unable to retrieve hidden_states

I converted a locally saved T5 checkpoint to ONNX using FastT5:

>>> from fastT5 import export_and_get_onnx_model
>>> from transformers import AutoTokenizer

>>> model_checkpoint = "path/to/checkpoint"
>>> model = export_and_get_onnx_model(model_name)

I tested it for inference:

>>> tokenizer = AutoTokenizer.from_pretrained(model_name)

>>> token = tokenizer(input_terms, max_length=512 * 2, padding=True, truncation=True, return_tensors='pt')

>>> out = model.generate(input_ids=token['input_ids'].to('cpu'),
                            attention_mask=token['attention_mask'].to('cpu'),
                            return_dict_in_generate=True,
                            max_length=512 * 2,
                            num_beams=1,
                            output_scores=True,
                            output_hidden_states=True)

>>> out.encoder_hidden_states
>>> out.decoder_hidden_states
(None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
...

>>> out
GreedySearchEncoderDecoderOutput(sequences=tensor([[  0, 119, 114, 102, 108, 111, 108, 125, 120, 112, 100, 101,  35,  53, ...
...
), , encoder_attentions=None, encoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, decoder_hidden_states=(None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None))

The hidden states are all None.

Is there any way that I can retrieve the hidden states for both encoder and decoder?

opened by vsoesanto 2

