Hugging Face Optimum
The AI ecosystem evolves quickly, and more specialized hardware, each with its own optimizations, emerges every day. Optimum enables users to efficiently use any of these platforms with the same ease inherent to transformers.
Integration with Hardware Partners
To achieve this, we are collaborating with the following hardware manufacturers in order to provide the best transformers integration:
- Graphcore IPUs - IPUs are a completely new kind of massively parallel processor to accelerate machine intelligence. More information here.
- Habana Gaudi Processor (HPU) - HPUs are designed to maximize training throughput and efficiency. More information here.
- More to come soon! ⭐
Optimizing models towards inference
Along with supporting dedicated AI hardware for training, Optimum also provides inference optimizations for various frameworks and platforms.
We currently support ONNX Runtime along with Intel Neural Compressor (INC).
Features | ONNX Runtime | Intel Neural Compressor |
---|---|---|
Post-training Dynamic Quantization | ✔️ | ✔️ |
Post-training Static Quantization | ✔️ | ✔️ |
Quantization Aware Training (QAT) | Stay tuned! | ✔️ |
Pruning | N/A | ✔️ |
Installation
Optimum can be installed using pip as follows:
python -m pip install optimum
If you'd like to use the accelerator-specific features of Optimum, you can install the required dependencies according to the table below:
Accelerator | Installation |
---|---|
ONNX Runtime | python -m pip install optimum[onnxruntime] |
Intel Neural Compressor (INC) | python -m pip install optimum[intel] |
Graphcore IPU | python -m pip install optimum[graphcore] |
Habana Gaudi Processor (HPU) | python -m pip install optimum[habana] |
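If you are unsure which of these extras are already present in your environment, a quick check along the following lines may help. This is only a sketch: the mapping from each extra to a top-level module name (`neural_compressor`, `poptorch`, `habana_frameworks`) is an assumption on our part rather than something Optimum exposes, and `available_backends` is a hypothetical helper.

```python
from importlib.util import find_spec

# Hypothetical mapping from Optimum extras to the top-level module each backend
# is assumed to expose; adjust the names to match your installation.
BACKEND_MODULES = {
    "onnxruntime": "onnxruntime",
    "intel": "neural_compressor",
    "graphcore": "poptorch",
    "habana": "habana_frameworks",
}


def available_backends(modules=BACKEND_MODULES):
    """Return {extra_name: bool}, True when the backend module can be found."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, installed in available_backends().items():
        hint = "installed" if installed else f"missing (python -m pip install optimum[{extra}])"
        print(f"{extra}: {hint}")
```

`find_spec` only looks the module up without importing it, so the check stays cheap even for heavy backends.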
If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you can install the base library from source as follows:
python -m pip install git+https://github.com/huggingface/optimum.git
For the accelerator-specific features, you can install them by appending #egg=optimum[accelerator_type] to the pip command, e.g.
python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime]
Quickstart
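At its core, Optimum uses configuration objects to define the parameters for optimization on different accelerators. These objects are then used to instantiate dedicated optimizers and quantizers.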
Quantization
For example, here's how you can apply dynamic quantization with ONNX Runtime:
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer
# The model we wish to quantize
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")
# Quantize the model!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)
In this example, we've quantized a model from the Hugging Face Hub, but it could also be a path to a local model directory. The feature argument in the from_pretrained() method corresponds to the type of task that we wish to quantize the model for. The result from applying the export() method is a model-quantized.onnx file that can be used to run inference. Here's an example of how to load an ONNX Runtime model and generate predictions with it:
from functools import partial
from datasets import Dataset
from optimum.onnxruntime.model import ORTModel
# Load quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)
# Create a dataset or load one from the Hub
ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
# Tokenize the inputs
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])
tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=quantizer.tokenizer))
ort_outputs = ort_model.evaluation_loop(tokenized_ds)
# Extract logits!
ort_outputs.predictions
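The logits are raw scores, so a common next step is to turn them into a label and a probability. The sketch below does this in plain Python with a softmax. The label order (index 0 is NEGATIVE, index 1 is POSITIVE) is an assumption about this sentiment checkpoint, the logits are made-up values standing in for ort_outputs.predictions, and `softmax` and `logits_to_labels` are hypothetical helpers.

```python
import math

# Assumed label order for this sentiment checkpoint: 0 = NEGATIVE, 1 = POSITIVE.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_labels(batch_logits, id2label=ID2LABEL):
    """Map each row of logits to a (label, probability) pair."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results


# Made-up logits standing in for ort_outputs.predictions
example_logits = [[-4.0, 4.0], [2.0, -1.0]]
print(logits_to_labels(example_logits))
```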
To apply static quantization instead, set is_static to True when instantiating the QuantizationConfig object:
qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
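Static quantization relies on feeding batches of data through the model to estimate the activation quantization parameters ahead of inference time. To support this, Optimum allows you to provide a calibration dataset. The calibration dataset can be a simple Dataset object from the Datasets library, or any dataset hosted on the Hugging Face Hub. For this example, we'll pick the sst2 dataset that the model was originally trained on: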
from optimum.onnxruntime.configuration import AutoCalibrationConfig
# Create the calibration dataset
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=quantizer.tokenizer),
    num_samples=50,
    dataset_split="train",
)
# Create the calibration configuration containing the parameters related to calibration.
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
# Perform the calibration step: computes the activations quantization ranges
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    onnx_model_path="model.onnx",
    operators_to_quantize=qconfig.operators_to_quantize,
)
# Quantize the same way we did for dynamic quantization!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)
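To build some intuition for what the calibration step estimates, here is a minimal, framework-free sketch of min-max calibration followed by affine int8 quantization. It is a conceptual illustration only, not how Optimum or ONNX Runtime implement it: all function names are hypothetical, and the activations are made-up numbers rather than real model outputs.

```python
def minmax_range(batches):
    """Track the smallest and largest activation seen across calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def affine_params(lo, hi, qmin=-128, qmax=127):
    """Derive an int8 scale and zero point covering [lo, hi], widened to include 0."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes, clamping anything outside the calibrated range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Made-up activations from two calibration batches
lo, hi = minmax_range([[0.2, -1.0], [3.0, 0.5]])
scale, zero_point = affine_params(lo, hi)
codes = quantize([-1.0, 0.0, 0.5, 3.0], scale, zero_point)
print(codes, dequantize(codes, scale, zero_point))
```

With static quantization, the range is fixed ahead of time from the calibration data, as in this sketch. With dynamic quantization, the activation ranges are computed on the fly at inference time instead.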
Graph optimization
Next, let's take a look at applying graph optimization techniques such as operator fusion and constant folding. As before, we load a configuration object, but this time by setting the optimization level instead of the quantization approach:
from optimum.onnxruntime.configuration import OptimizationConfig
# optimization_level=99 enables all available graph optimizations
optimization_config = OptimizationConfig(optimization_level=99)
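If constant folding is unfamiliar, the toy sketch below may help: any sub-expression whose inputs are all known ahead of time is computed once, so it no longer has to be evaluated at inference time. The tuple-based graph format and the `fold_constants` helper are made up for illustration and are unrelated to how ONNX Runtime represents or optimizes graphs.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(node):
    """Recursively replace operator nodes whose inputs are all constants.

    A node is a number (constant), a string (named runtime input),
    or a tuple of the form (op_name, left, right).
    """
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # computed once, ahead of inference
    return (op, left, right)


# x * (2 + 3) + (4 * 0.5), where "x" is only known at runtime
graph = ("add", ("mul", "x", ("add", 2, 3)), ("mul", 4, 0.5))
print(fold_constants(graph))
```

Only the parts that depend on the runtime input `x` remain as operator nodes after folding.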
Next, we load an optimizer to apply these optimizations to our model:
from optimum.onnxruntime import ORTOptimizer
optimizer = ORTOptimizer.from_pretrained(
    model_checkpoint,
    feature="sequence-classification",
)
# Export the optimized model
optimizer.export(
    onnx_model_path="model.onnx",
    onnx_optimized_model_output_path="model-optimized.onnx",
    optimization_config=optimization_config,
)
And that's it - the model is now optimized and ready for inference!
As you can see, the process is similar in each case:
- Define the optimization / quantization strategies via an OptimizationConfig / QuantizationConfig object
- Instantiate an ORTQuantizer or ORTOptimizer class
- Apply the export() method
- Run inference
Training
Besides supporting ONNX Runtime inference, Optimum also supports ONNX Runtime training through the ORTTrainer class, which behaves similarly to the Trainer of Transformers:
-from transformers import Trainer
+from optimum.onnxruntime import ORTTrainer
# Step 1: Create your ONNX Runtime Trainer
-trainer = Trainer(
+trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    feature="sequence-classification",
)
# Step 2: Use ONNX Runtime for training and evaluation!🤗
train_result = trainer.train()
eval_metrics = trainer.evaluate()
By replacing Trainer with ORTTrainer, you will be able to leverage ONNX Runtime for fine-tuning tasks.
Check out the examples directory for more sophisticated usage.
Happy optimizing