Convert BART models to ONNX with quantization: a 3x reduction in model size and up to a 3x boost in inference speed.

Overview

fast-Bart

Reduces BART model size by 3x and boosts inference speed by up to 3x.

A BART counterpart to the fastT5 library (https://github.com/Ki6an/fastT5).

PyTorch model -> ONNX model -> Quantized ONNX model


Install

Install using the requirements.txt file:

git clone https://github.com/siddharth-sharma7/fast-Bart
cd fast-Bart
pip install -r requirements.txt

Usage

The export_and_get_onnx_model() method exports the given pretrained BART model to ONNX, quantizes it, and runs it on onnxruntime with default settings. The returned model supports Hugging Face's generate() method.

If you don't wish to quantize the model, pass quantized=False to the method.
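For example (a minimal sketch; by default the method quantizes):

from fastBart import export_and_get_onnx_model

# export to ONNX but skip the quantization step
model = export_and_get_onnx_model('facebook/bart-base', quantized=False)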

from fastBart import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'facebook/bart-base'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "This is a very long sentence and needs to be summarized."
inputs = tokenizer(input_text, return_tensors='pt')

# beam search decoding, exactly as with a regular Hugging Face model
tokens = model.generate(input_ids=inputs['input_ids'],
                        attention_mask=inputs['attention_mask'],
                        num_beams=3)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().
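For example (a minimal sketch, assuming the model was previously exported to the default models-bart folder):

from fastBart import get_onnx_model

model = get_onnx_model('facebook/bart-base')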

You can customize the whole pipeline, as shown in the code example below:

from fastBart import (OnnxBart, get_onnx_runtime_sessions,
                      generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 'facebook/bart-base'

# Step 1. Convert the Hugging Face BART model to ONNX.
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (recommended) Quantize the converted model for faster inference
# and a smaller model size. This step is slow for the decoder and
# init-decoder ONNX files (it can take up to 15 minutes).
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up the onnxruntime sessions.
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Get the ONNX model.
model = OnnxBart(model_or_model_path, model_sessions)

                      ...
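From here the model behaves like the one returned by export_and_get_onnx_model(); a minimal continuation sketch, mirroring the first example:

tokenizer = AutoTokenizer.from_pretrained(model_or_model_path)
inputs = tokenizer("This is a very long sentence and needs to be summarized.",
                   return_tensors='pt')
tokens = model.generate(input_ids=inputs['input_ids'],
                        attention_mask=inputs['attention_mask'],
                        num_beams=3)
print(tokenizer.decode(tokens.squeeze(), skip_special_tokens=True))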
Custom output paths

By default, fastBart creates a models-bart folder in the current directory and stores all exported models there. You can provide a custom folder path in which to store the exported models. To run already exported models stored in a custom folder, use get_onnx_model(onnx_models_path="/path/to/custom/folder/").

from fastBart import export_and_get_onnx_model, get_onnx_model

model_name = "facebook/bart-base"
custom_output_path = "/path/to/custom/folder/"

# 1. export and store the models to custom_output_path
model = export_and_get_onnx_model(model_name, custom_output_path)

# 2. run already exported models stored in the custom path
# model = get_onnx_model(model_name, custom_output_path)

Functionalities

  • Export any pretrained BART model to ONNX easily.
  • The exported model supports beam search, greedy search, and more via the generate() method (see the sketch after this list).
  • Reduce model size by 3x using quantization.
  • Up to a 3x speedup over PyTorch execution for greedy search, and 2-3x for beam search.
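A minimal sketch of the two decoding modes (num_beams=1 corresponds to greedy search in Hugging Face's generate(); model and inputs are assumed from the usage example above):

# greedy search
greedy_ids = model.generate(input_ids=inputs['input_ids'],
                            attention_mask=inputs['attention_mask'],
                            num_beams=1)

# beam search with 3 beams
beam_ids = model.generate(input_ids=inputs['input_ids'],
                          attention_mask=inputs['attention_mask'],
                          num_beams=3)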

Comments
  • AttributeError: 'BartEncoder' object has no attribute 'main_input_name'

    Whenever I run the model (facebook/bart-large-cnn), or even the bart-base model, to try to summarize text using the code below, I keep running into this error. I'm honestly stuck, so any help would be greatly appreciated.

    Code:

    from fastBart import OnnxBart
    from fastBart.ort_settings import get_onnx_runtime_sessions
    from fastBart.onnx_exporter import generate_onnx_representation
    from transformers import AutoTokenizer
    import os

    model_name = 'facebook/bart-large-cnn'

    # generate the ONNX model files
    model_paths = tuple(generate_onnx_representation(model_name))

    # after the model has been exported once, the paths can be rebuilt
    # instead of re-exporting:
    # name = 'bart-large-cnn'
    # custom_output_path = './models-bart/'
    # model_paths = tuple(os.path.join(custom_output_path, name + "-" + x)
    #                     for x in ['encoder.onnx', 'decoder.onnx', 'init-decoder.onnx'])

    model_sessions = get_onnx_runtime_sessions(model_paths)
    model = OnnxBart(model_name, model_sessions)
    
    
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    input = """Since being shared on Reddit, the 21-second video has garnered more than 35,000 upvotes. People also filled the comment section with adorable comments.
    A Reddit user said that he liked the little cat staring at the mama dog for a moment. Another liked the calmness and patience of the mama dog.
    “This is soooo cute,” commented a third Reddit user. Many users also added that humans should learn kindness from these animals.
    Well, this is not the first time that a kitten was feeding on a nursing dog.Last year, a similar heartwarming video had surfaced 
    from a remote village in Nigeria that showed a little feline feeding on milk from a mama dog. The clip, which had gone viral, has 
    so far accumulated over 1 million views. Twitter users showered love for the dog and the little kitten. The 32-second clip was shared 
    by Reuters on its official Twitter handle with the caption, “It's a most unusual sight: a kitten was spotted feeding on milk from a nursing dog in a remote village in Nigeria.”
    """
    
    token = tokenizer(input, return_tensors='pt',  truncation=True)
    tokens = model.generate(input_ids=token['input_ids'],attention_mask=token['attention_mask'],num_beams=4)
    
    output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
    print(output)
    

    Error:

    Traceback (most recent call last):
    File "/Users/zeke/Documents/Github/SumanyWeb/sumany/summarize/fast-Bart/bartoptimize.py", line 34, in <module>
        tokens = model.generate(input_ids=token['input_ids'],attention_mask=token['attention_mask'],num_beams=4)
    File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
    File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1162, in generate
        inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(inputs, bos_token_id, model_kwargs)
    File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 412, in _prepare_model_inputs
        and self.encoder.main_input_name != self.main_input_name
    File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'BartEncoder' object has no attribute 'main_input_name'
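
    A possible workaround (an assumption based on the traceback, not a confirmed fix): newer transformers releases look up a main_input_name attribute on the encoder inside generate(), so patching it onto the wrapped encoder may unblock this; pinning transformers to the version in this repo's requirements.txt is the safer route.

    # hypothetical patch: give the wrapped encoder the attribute that
    # newer transformers versions expect during generate()
    model.encoder.main_input_name = "input_ids"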
    
    opened by zeke-john 1
  • Inference slower than PyTorch model for long sequence length

    Hi @siddharth-sharma7

    Thank you for providing fast-bart. It has made my life much easier.

    I find the BART ONNX quantized model 2-3x faster than the PyTorch model. However, when the sequence length is long (~500 tokens), the ONNX-based model is 1.5-2x slower.

    I also found a similar problem with the T5 ONNX model, discussed at https://github.com/microsoft/onnxruntime/issues/6835.

    Just wondering if we're facing the same issue here.
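
    For anyone trying to reproduce this, a minimal timing sketch (pt_model and onnx_model are hypothetical names for the loaded PyTorch and ONNX models; inputs is a tokenized batch of ~500 tokens):

    import time

    def time_generate(m, inputs, n_runs=5):
        # average wall-clock seconds per generate() call
        start = time.perf_counter()
        for _ in range(n_runs):
            m.generate(input_ids=inputs['input_ids'],
                       attention_mask=inputs['attention_mask'],
                       num_beams=3)
        return (time.perf_counter() - start) / n_runs

    print("pytorch:", time_generate(pt_model, inputs))
    print("onnx   :", time_generate(onnx_model, inputs))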

    opened by jasontian6666 1
  • Unable to use CUDA provider for inferencing

    Hi @siddharth-sharma7, your package is great and very easy to use, but I'm unable to figure out how to actually use CUDAExecutionProvider and run inference on the GPU. Whenever I pass providers=['CUDAExecutionProvider'], the model is still not loaded onto the GPU and inference still happens on the CPU.
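
    A minimal sketch of forcing the provider with onnxruntime directly (an assumption; this bypasses get_onnx_runtime_sessions and requires the onnxruntime-gpu package instead of onnxruntime):

    import onnxruntime as ort

    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    sessions = tuple(ort.InferenceSession(str(p), providers=providers)
                     for p in quant_model_paths)
    print(sessions[0].get_providers())  # verify CUDA is actually active

    # the sessions tuple can then be passed to OnnxBart in place of the
    # result of get_onnx_runtime_sessions(quant_model_paths)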

    opened by girishnadiger-gep 2
  • Deploy ONNX model to TensorRT

    Thank you very much for your work; it's very helpful! I can now convert a BART model to an ONNX model, and the outputs of the two are consistent. Have you tried deploying the ONNX model to TensorRT? So far I have been able to run it on TensorRT, but the TensorRT deployment's results are not consistent with those of the ONNX model. :(

    opened by will-wiki 1
Owner

Siddharth Sharma
Machine learning | NLP | Computer Vision