Convert BART models to ONNX with quantization: a 3x reduction in model size and up to a 3x boost in inference speed.

Overview

fast-Bart

Reduces BART model size by 3x and boosts inference speed by up to 3x.

A BART counterpart to the fastT5 library (https://github.com/Ki6an/fastT5).

PyTorch model -> ONNX model -> Quantized ONNX model


Install

Install using the requirements.txt file:

git clone https://github.com/siddharth-sharma7/fast-Bart
cd fast-Bart
pip install -r requirements.txt

Usage

The export_and_get_onnx_model() method exports the given pretrained BART model to ONNX, quantizes it, and runs it on onnxruntime with default settings. The returned model supports Hugging Face's generate() method.

If you don't wish to quantize the model, pass quantized=False to the method.
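For example (a minimal sketch; by default the method quantizes):

from fastBart import export_and_get_onnx_model

# export to ONNX but skip the quantization step
model = export_and_get_onnx_model('facebook/bart-base', quantized=False)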

from fastBart import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'facebook/bart-base'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "This is a very long sentence and needs to be summarized."
inputs = tokenizer(input_text, return_tensors='pt')

# beam search decoding, exactly as with a regular Hugging Face model
tokens = model.generate(input_ids=inputs['input_ids'],
                        attention_mask=inputs['attention_mask'],
                        num_beams=3)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().
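For example (a minimal sketch, assuming the model was previously exported to the default models-bart folder):

from fastBart import get_onnx_model

model = get_onnx_model('facebook/bart-base')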

You can customize the whole pipeline, as shown in the code example below:

from fastBart import (OnnxBart, get_onnx_runtime_sessions,
                      generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 'facebook/bart-base'

# Step 1. Convert the Hugging Face BART model to ONNX.
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (recommended) Quantize the converted model for faster inference
# and a smaller model size. This step is slow for the decoder and
# init-decoder ONNX files (it can take up to 15 minutes).
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up the onnxruntime sessions.
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Get the ONNX model.
model = OnnxBart(model_or_model_path, model_sessions)

                      ...
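From here the model behaves like the one returned by export_and_get_onnx_model(); a minimal continuation sketch, mirroring the first example:

tokenizer = AutoTokenizer.from_pretrained(model_or_model_path)
inputs = tokenizer("This is a very long sentence and needs to be summarized.",
                   return_tensors='pt')
tokens = model.generate(input_ids=inputs['input_ids'],
                        attention_mask=inputs['attention_mask'],
                        num_beams=3)
print(tokenizer.decode(tokens.squeeze(), skip_special_tokens=True))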
Custom output paths

By default, fastBart creates a models-bart folder in the current directory and stores all exported models there. You can provide a custom folder path in which to store the exported models. To run already exported models stored in a custom folder, use get_onnx_model(onnx_models_path="/path/to/custom/folder/").

from fastBart import export_and_get_onnx_model, get_onnx_model

model_name = "facebook/bart-base"
custom_output_path = "/path/to/custom/folder/"

# 1. export and store the models to custom_output_path
model = export_and_get_onnx_model(model_name, custom_output_path)

# 2. run already exported models stored in the custom path
# model = get_onnx_model(model_name, custom_output_path)

Functionalities

  • Export any pretrained BART model to ONNX easily.
  • The exported model supports beam search, greedy search, and more via the generate() method (see the sketch after this list).
  • Reduce model size by 3x using quantization.
  • Up to a 3x speedup over PyTorch execution for greedy search, and 2-3x for beam search.
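A minimal sketch of the two decoding modes (num_beams=1 corresponds to greedy search in Hugging Face's generate(); model and inputs are assumed from the usage example above):

# greedy search
greedy_ids = model.generate(input_ids=inputs['input_ids'],
                            attention_mask=inputs['attention_mask'],
                            num_beams=1)

# beam search with 3 beams
beam_ids = model.generate(input_ids=inputs['input_ids'],
                          attention_mask=inputs['attention_mask'],
                          num_beams=3)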

Comments
  • AttributeError: 'BartEncoder' object has no attribute 'main_input_name'

    Whenever I run the model (facebook/bart-large-cnn), or even the bart-base model, to try to summarize text using the code below, I keep running into this error. I'm honestly stuck, so any help would be greatly appreciated.

    Code:

    from fastBart import OnnxBart
    from fastBart.ort_settings import get_onnx_runtime_sessions
    from fastBart.onnx_exporter import generate_onnx_representation
    from transformers import AutoTokenizer
    import os

    model_name = 'facebook/bart-large-cnn'

    # generate the ONNX model files
    model_paths = tuple(generate_onnx_representation(model_name))

    # after the model has been exported once, the paths can be rebuilt
    # instead of re-exporting:
    # name = 'bart-large-cnn'
    # custom_output_path = './models-bart/'
    # model_paths = tuple(os.path.join(custom_output_path, name + "-" + x)
    #                     for x in ['encoder.onnx', 'decoder.onnx', 'init-decoder.onnx'])

    model_sessions = get_onnx_runtime_sessions(model_paths)
    model = OnnxBart(model_name, model_sessions)
    
    
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    input = """Since being shared on Reddit, the 21-second video has garnered more than 35,000 upvotes. People also filled the comment section with adorable comments.
    A Reddit user said that he liked the little cat staring at the mama dog for a moment. Another liked the calmness and patience of the mama dog.
    “This is soooo cute,” commented a third Reddit user. Many users also added that humans should learn kindness from these animals.
    Well, this is not the first time that a kitten was feeding on a nursing dog.Last year, a similar heartwarming video had surfaced 
    from a remote village in Nigeria that showed a little feline feeding on milk from a mama dog. The clip, which had gone viral, has 
    so far accumulated over 1 million views. Twitter users showered love for the dog and the little kitten. The 32-second clip was shared 
    by Reuters on its official Twitter handle with the caption, “It's a most unusual sight: a kitten was spotted feeding on milk from a nursing dog in a remote village in Nigeria.”
    """
    
    token = tokenizer(input, return_tensors='pt',  truncation=True)
    tokens = model.generate(input_ids=token['input_ids'],attention_mask=token['attention_mask'],num_beams=4)
    
    output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
    print(output)
    

    Error:

    Traceback (most recent call last):
    File "/Users/zeke/Documents/Github/SumanyWeb/sumany/summarize/fast-Bart/bartoptimize.py", line 34, in <module>
        tokens = model.generate(input_ids=token['input_ids'],attention_mask=token['attention_mask'],num_beams=4)
    File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
    File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1162, in generate
        inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(inputs, bos_token_id, model_kwargs)
    File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 412, in _prepare_model_inputs
        and self.encoder.main_input_name != self.main_input_name
    File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'BartEncoder' object has no attribute 'main_input_name'
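
    A possible workaround (an assumption based on the traceback, not a confirmed fix): newer transformers releases look up a main_input_name attribute on the encoder inside generate(), so patching it onto the wrapped encoder may unblock this; pinning transformers to the version in this repo's requirements.txt is the safer route.

    # hypothetical patch: give the wrapped encoder the attribute that
    # newer transformers versions expect during generate()
    model.encoder.main_input_name = "input_ids"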
    
    opened by zeke-john 1
  • Inference slower than PyTorch model for long sequence length

    Hi @siddharth-sharma7

    Thank you for providing fast-bart. It has made my life much easier.

    I find the BART ONNX quantized model 2-3x faster than the PyTorch model. However, when the sequence length is long (~500 tokens), the ONNX-based model is 1.5-2x slower.

    I also found a similar problem with the T5 ONNX model, discussed at https://github.com/microsoft/onnxruntime/issues/6835.

    Just wondering if we're facing the same issue here.
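
    For anyone trying to reproduce this, a minimal timing sketch (pt_model and onnx_model are hypothetical names for the loaded PyTorch and ONNX models; inputs is a tokenized batch of ~500 tokens):

    import time

    def time_generate(m, inputs, n_runs=5):
        # average wall-clock seconds per generate() call
        start = time.perf_counter()
        for _ in range(n_runs):
            m.generate(input_ids=inputs['input_ids'],
                       attention_mask=inputs['attention_mask'],
                       num_beams=3)
        return (time.perf_counter() - start) / n_runs

    print("pytorch:", time_generate(pt_model, inputs))
    print("onnx   :", time_generate(onnx_model, inputs))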

    opened by jasontian6666 1
  • Unable to use CUDA provider for inferencing

    Hi @siddharth-sharma7, your package is great and very easy to use, but I'm unable to figure out how to actually use CUDAExecutionProvider and run inference on the GPU. Whenever I pass providers=['CUDAExecutionProvider'], the model is still not loaded onto the GPU and inference still happens on the CPU.
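
    A minimal sketch of forcing the provider with onnxruntime directly (an assumption; this bypasses get_onnx_runtime_sessions and requires the onnxruntime-gpu package instead of onnxruntime):

    import onnxruntime as ort

    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    sessions = tuple(ort.InferenceSession(str(p), providers=providers)
                     for p in quant_model_paths)
    print(sessions[0].get_providers())  # verify CUDA is actually active

    # the sessions tuple can then be passed to OnnxBart in place of the
    # result of get_onnx_runtime_sessions(quant_model_paths)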

    opened by girishnadiger-gep 2
  • Deploy ONNX model to TensorRT

    Thank you very much for your work; it's very helpful! I can now convert a BART model to an ONNX model, and the outputs of the two are consistent. Have you tried deploying the ONNX model to TensorRT? So far I have been able to run it on TensorRT, but the TensorRT deployment's results are not consistent with those of the ONNX model. :(

    opened by will-wiki 1
Owner

Siddharth Sharma
Machine learning | NLP | Computer Vision