🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.

Overview




🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, and then share them with the community on our model hub. At the same time, each Python module defining an architecture can be used as a standalone module and modified to enable quick research experiments.

🤗 Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with a seamless integration between them, allowing you to train your models with one, then load them for inference with the other.

Online demos

You can test most of our models directly on their pages from the model hub. We also offer private model hosting, versioning, & an inference API to use those models.

Here are a few examples:

Write With Transformer, built by the Hugging Face team, is the official demo of this repo’s text generation capabilities.

Quick tour

To immediately use a model on a given text, we provide the pipeline API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:

>>> from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to include pipeline into the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]

The second line of code downloads and caches the pretrained model used by the pipeline, while the third line evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%.

Here is another example, this time using a pipeline to extract an answer to a question from some context:

>>> from transformers import pipeline

# Allocate a pipeline for question-answering
>>> question_answerer = pipeline('question-answering')
>>> question_answerer({
...     'question': 'What is the name of the repository ?',
...     'context': 'Pipeline have been included in the huggingface/transformers repository'
... })
{'score': 0.5135612454720828, 'start': 35, 'end': 59, 'answer': 'huggingface/transformers'}

In addition to the answer, the pretrained model used here returns its confidence score, along with the start and end positions of the answer in the tokenized sentence. You can learn more about the tasks supported by the pipeline API in this tutorial.

To download and use any of the pretrained models on your given task, you just need these three lines of code (PyTorch version):

>>> from transformers import AutoTokenizer, AutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> model = AutoModel.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)

or for TensorFlow:

>>> from transformers import AutoTokenizer, TFAutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> model = TFAutoModel.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("Hello world!", return_tensors="tf")
>>> outputs = model(**inputs)

The tokenizer is responsible for all the preprocessing the pretrained model expects, and can be called directly on a single text (or a list of texts), as shown on the fourth line of both code examples. It will output a dictionary that you can pass directly to your model (which is done on the fifth line).
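
For instance, continuing the PyTorch example above, you can inspect that dictionary directly (a quick sketch; the exact tokenization depends on the checkpoint's vocabulary):

>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> list(inputs.keys())
['input_ids', 'token_type_ids', 'attention_mask']
>>> inputs["input_ids"].shape  # [CLS] hello world ! [SEP] -> 5 token IDs
torch.Size([1, 5])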

The model itself is a regular PyTorch nn.Module or a TensorFlow tf.keras.Model (depending on your backend) which you can use normally. For instance, this tutorial explains how to integrate such a model in a classic PyTorch or TensorFlow training loop, or how to use our Trainer API to quickly fine-tune it on a new dataset.
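
As a minimal sketch of the Trainer route (assuming a task-specific model such as AutoModelForSequenceClassification in place of the plain AutoModel above, and a tokenized train_dataset of your own; the hyperparameters below are placeholders):

>>> from transformers import Trainer, TrainingArguments

>>> training_args = TrainingArguments(
...     output_dir="./results",            # where checkpoints are written
...     num_train_epochs=3,                # placeholder hyperparameters
...     per_device_train_batch_size=8,
... )
>>> trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
>>> trainer.train()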

Why should I use transformers?

  1. Easy-to-use state-of-the-art models:

    • High performance on NLU and NLG tasks.
    • Low barrier to entry for educators and practitioners.
    • Few user-facing abstractions with just three classes to learn.
    • A unified API for using all our pretrained models.
  2. Lower compute costs, smaller carbon footprint:

    • Researchers can share trained models instead of always retraining.
    • Practitioners can reduce compute time and production costs.
    • Dozens of architectures with over 2,000 pretrained models, some in more than 100 languages.
  3. Choose the right framework for every part of a model's lifetime:

    • Train state-of-the-art models in 3 lines of code.
    • Move a single model between TF2.0/PyTorch frameworks at will (see the sketch after this list).
    • Seamlessly pick the right framework for training, evaluation, production.
  4. Easily customize a model or an example to your needs:

    • Examples for each architecture to reproduce the results published by the official authors of said architecture.
    • Expose the models' internals as consistently as possible.
    • Model files can be used independently of the library for quick experiments.
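
For example, a checkpoint saved with one framework can be reloaded with the other (a minimal sketch; the local path is illustrative):

>>> from transformers import AutoModel, TFAutoModel

>>> pt_model = AutoModel.from_pretrained("bert-base-uncased")
>>> pt_model.save_pretrained("./my-bert")                              # saves PyTorch weights + config
>>> tf_model = TFAutoModel.from_pretrained("./my-bert", from_pt=True)  # reload them in TensorFlow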

Why shouldn't I use transformers?

  • This library is not a modular toolbox of building blocks for neural nets. The code in the model files is deliberately not refactored with additional abstractions, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
  • The training API is not intended to work on any model but is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library.
  • While we strive to present as many use cases as possible, the scripts in our examples folder are just that: examples. It is expected that they won't work out of the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs.

Installation

With pip

This repository is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for examples) and TensorFlow 2.0.

You should install 🤗 Transformers in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

First, create a virtual environment with the version of Python you're going to use and activate it.

Then, you will need to install at least one of TensorFlow 2.0, PyTorch, or Flax. Please refer to the TensorFlow installation page, the PyTorch installation page, and/or the Flax installation page for the specific install command for your platform.

Once at least one of those backends has been installed, 🤗 Transformers can be installed using pip as follows:

pip install transformers
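
You can then quickly check the installation by running a pipeline on a short example, for instance:

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

This should download a small pretrained model and print a positive label with a high score.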

If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must install the library from source.

With conda

Since Transformers version v4.0.0, we now have a conda channel: huggingface.

🤗 Transformers can be installed using conda as follows:

conda install -c huggingface transformers

Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.

Model architectures

All the model checkpoints provided by 🤗 Transformers are seamlessly integrated from the huggingface.co model hub where they are uploaded directly by users and organizations.

🤗 Transformers currently provides the following architectures (see here for a high-level summary of each of them):

  1. ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  2. BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
  3. BARThez (from École polytechnique) released with the paper BARThez: a Skilled Pretrained French Sequence-to-Sequence Model by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
  4. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  5. BERT For Sequence Generation (from Google) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
  6. Blenderbot (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  7. BlenderbotSmall (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  8. BORT (from Alexa) released with the paper Optimal Subarchitecture Extraction For BERT by Adrian de Wynter and Daniel J. Perry.
  9. CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
  10. ConvBERT (from YituTech) released with the paper ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
  11. CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
  12. DeBERTa (from Microsoft Research) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
  13. DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
  14. DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.
  15. DPR (from Facebook) released with the paper Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
  16. ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
  17. FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
  18. Funnel Transformer (from CMU/Google Brain) released with the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
  19. GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  20. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  21. LayoutLM (from Microsoft Research Asia) released with the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
  22. LED (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  23. Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  24. LXMERT (from UNC Chapel Hill) released with the paper LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering by Hao Tan and Mohit Bansal.
  25. MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.
  26. MBart (from Facebook) released with the paper Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
  27. MPNet (from Microsoft Research) released with the paper MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
  28. MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
  29. Pegasus (from Google) released with the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
  30. ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
  31. Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
  32. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  33. SqueezeBERT released with the paper SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
  34. T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  35. TAPAS (from Google AI) released with the paper TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
  36. Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
  37. Wav2Vec2 (from Facebook AI) released with the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
  38. XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  39. XLM-ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
  40. XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
  41. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
  42. Want to contribute a new model? We have added a detailed guide and templates to guide you through the process of adding a new model. You can find them in the templates folder of the repository. Be sure to check the contributing guidelines and contact the maintainers or open an issue to collect feedback before starting your PR.

To check whether each model has an implementation in PyTorch/TensorFlow/Flax or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to this table.

These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations. You can find more details on performance in the Examples section of the documentation.

Learn more

Section - Description
Documentation - Full API documentation and tutorials
Task summary - Tasks supported by 🤗 Transformers
Preprocessing tutorial - Using the Tokenizer class to prepare data for the models
Training and fine-tuning - Using the models provided by 🤗 Transformers in a PyTorch/TensorFlow training loop and the Trainer API
Quick tour: Fine-tuning/usage scripts - Example scripts for fine-tuning models on a wide range of tasks
Model sharing and uploading - Upload and share your fine-tuned models with the community
Migration - Migrate to 🤗 Transformers from pytorch-transformers or pytorch-pretrained-bert

Citation

We now have a paper you can cite for the 🤗 Transformers library:

@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
    pages = "38--45"
}
Issues
  • How to use BERT for finding similar sentences or similar news?

    I have used the BERT NextSentencePredictor to find similar sentences or similar news; however, it's super slow, even on a Tesla V100, currently the fastest GPU. It takes around 10 seconds for a query title against around 3,000 articles. Is there a better way to use BERT for finding similar sentences or similar news given a corpus of news articles?
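
    One commonly suggested alternative, sketched below rather than taken from this issue, is to precompute a pooled BERT embedding for each article once and then rank articles by cosine similarity against the query embedding:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(texts):
        # Mean-pool the last hidden states into one fixed-size vector per text
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state           # (batch, seq_len, hidden)
        mask = enc["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    corpus_emb = embed(["article title one", "article title two"])   # computed once, offline
    query_emb = embed(["query title"])
    scores = torch.nn.functional.cosine_similarity(query_emb, corpus_emb)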

    opened by Raghavendra15 161
  • Summarization Fine Tuning

    ❓ Questions & Help

    Details

    I tried using T5 and BART, but the abstractive summarization on scientific texts does not seem to give the results I want, since I think they are both trained on news corpora. I have scraped all of the free PMC articles and I am thinking about fine-tuning a seq2seq model between the articles and their abstracts to make an abstractive summarizer for scientific texts. This Medium article (https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8) provides a bit of an introduction to how to approach this but does not quite go into detail, so I am wondering how to approach this.
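
    A rough sketch of such a fine-tuning setup, assuming `articles` and `abstracts` are parallel lists of strings (placeholder data, not part of this issue), could look like this:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    articles = ["full text of a PMC article ..."]     # placeholder data
    abstracts = ["its abstract ..."]

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    enc = tokenizer(articles, truncation=True, padding=True, return_tensors="pt")
    labels = tokenizer(abstracts, truncation=True, padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss

    outputs = model(**enc, labels=labels)
    loss = outputs.loss   # minimize this with any PyTorch optimizer or the Trainer API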

    I'm not really asking for help being stuck but I just don't really know how to approach this problem.

    A link to original question on Stack Overflow: https://stackoverflow.com/questions/61826443/train-custom-seq2seq-transformers-model

    Discussion wontfix 
    opened by kevinlu1248 79
  • GPT-J-6B

    What does this PR do?

    Introduces the long-awaited GPT J model class to HuggingFace! Concurrently with this PR being merged, I will make a GPT J 6B checkpoint public on the EleutherAI HF page for people to use. The model has been evaluated as being within error tolerances of the GPT J 6B model we released in Jax two months ago.

    @patil-suraj was very helpful in assisting me to understand HF philosophy and how to make this PR most in line with the rest of the codebase. Other than that, the major design consideration was to make the configs compatible with GPT-2 rather than GPT-Neo. GPT-Neo has some usability limitations due to its configs having names unrelated to GPT-2's (see #12183 for details). Given those problems and my hope that GPT-Neo will have its configs updated in the future, it seemed like a clear choice to align GPT J with GPT-2.

    Shout-outs to @finetuneanon, whose implementation this one is based on, as well as @kumuruz for assistance with optimizing and debugging.

    Supersedes #12243 #13010 #13022

    Closes #12098

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [X] Did you read the contributor guideline, Pull Request section?
    • [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. It was discussed in Slack with @patil-suraj
    • [X] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [X] Did you write any new necessary tests?

    Who can review?

    • gpt2: @patrickvonplaten, @LysandreJik, @patil-suraj
    opened by StellaAthena 75
  • [DeepSpeed] [success] trained t5-11b on 1x 40GB gpu

    Managed to train t5-11b on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB)

    Thank you, @PeterAJansen for letting me use your hardware!

    Thank you, @jeffra and @samyam, for not believing that it is impossible to train t5-11b on 1x 40GB gpu w/ DeepSpeed, and for supporting me, which led me to find a few bugs in the integration.

    Sharing details for those who need.

    If you want to try this at home, please make sure you use transformers master, as some bug fixes were just merged in.

    Well, it's similar to the t5-3b on 24GB success reported here and here. But this time t5-11b on 1x 40GB gpu (or 4x if you wanted things faster)

    As someone asked me before you need a huge amount of general RAM to use ZeRO-Offload for a huge model:

    • for t5-3b on 1x 24GB gpu: ~71GB RAM
    • for t5-11b on 1x 40GB gpu: ~234GB RAM

    I was using /usr/bin/time -v program to get the peak memory measurement - it's the Maximum resident set size entry in the final report.

    Question: I don't think /usr/bin/time does the right thing for multi-process - I think it only measures the parent process. e.g. with 4x gpus it reported only 102GB RAM, but I clearly saw in top that it was around 240GB. If you have an easy way to measure peak memory that takes into account forked processes, I'm all ears.

    Batch sizes on one gpu:

    • with buffers of 5e8 I was able to run BS=2, which might be too small for training,
    • but with 2e8 I managed to squeeze in BS=10 for training, but OOMed on prediction

    I'm referring to these batch sizes in ds_config.json:

            "allgather_bucket_size": 2e8,
            "reduce_bucket_size": 2e8,
    

    And I tested for 2x and 4x DDP as well, BS=16 OOMed, BS=8 was good so I used that - but could probably squeeze some more.

    edit1: later tests show that my test was too short and wasn't letting the CPU Adam optimizer kick in, as it skips the first 20 or so steps because of the overflow. So once it kicks in it takes more GPU memory, so the practical BS is much smaller - I think around 2 on this setup. So most likely you will need to use BS=2 for real work, until things get optimized even more.

    edit2: things are getting re-shuffled in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent home. It's now at examples/tests/deepspeed/ds_config.json, so you will need to adjust the command line to reflect this new location or simply copy it over to where the old one used to be.

    here is the full benchmark:

    # 1 gpu: 
    # only training fits with this BS, eval needs a smaller BS
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 31.0897, 'train_samples_per_second': 0.257, 'epoch': 1.0}
    
    # 2 gpus:
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=2 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 17.9026, 'train_samples_per_second': 0.223, 'epoch': 1.0}
    
    # 4 gpus
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 10.4404, 'train_samples_per_second': 0.192, 'epoch': 1.0}
    

    Checkpointing should allow making even bigger batch sizes.

    DeepSpeed 
    opened by stas00 64
  • FP16 overflow with GPT-Neo when using sequence lengths of 2048.

    Environment info

    • transformers version: 4.5.0.dev0
    • Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
    • Python version: 3.8.5
    • PyTorch version (GPU?): 1.8.0+cu111
    • Tensorflow version (GPU?): N/A
    • Using GPU in script?: Yes
    • Using distributed or parallel set-up in script?: No

    Who can help

    @stas00

    Models:

    • GPT-Neo 1.3b

    Library:

    • deepspeed: @stas00

    Information

    Model I am using (Bert, XLNet ...):

    The problem arises when using:

    • [ ] the official example scripts: (give details below)
    • [x] my own modified scripts: (give details below)

    The tasks I am working on is:

    • [ ] an official GLUE/SQUaD task: (give the name)
    • [x] my own task or dataset: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    1. Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices. It does not matter what the data is, as long as the attention mask spans all 2048 tokens.
    2. Enable FP16 and set max_length to 2048.
    3. Observe that all losses reported are NaN.

    Also reproducible using AMP or DeepSpeed. It seems like there is code to circumvent this outlined in the GPT-Neo implementation, where q, k, v are cast to fp32 in the attention block.

    When the max_length is shorter (512) this overflow does not occur.

    Expected behavior

    I expected no overflows.

    Aside

    I'm reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.

    opened by LouisCastricato 60
  • How to use fine-tuned BART for prediction?

    ❓ Questions & Help

    Details

    I fine-tuned the BART model on a custom summarization dataset using the transformers/examples/summarization/bart/finetune.py and transformers/examples/summarization/bart/run_train.sh files in the repository for training (which generated three checkpointepoch=*.ckpt files) and prediction (which generated a .txt file with the test loss scores).

    I have two questions on using this model for prediction:

    • How can I modify finetune.py to generate predictions for the test set, in addition to the loss scores? I see some test functions in finetune.py, but I'm not sure how to use these for generating a .txt file with the predictions.

    • How can I load the generated .ckpt files into BartForConditionalGeneration()? A config.json file was not generated along with the checkpoint files; there doesn't seem to be a TFBartForConditionalGeneration; and the convert_tf_checkpoint_to_pytorch.py script in the repo doesn't seem to support BART yet.

    Thank you for your time!

    Discussion wontfix 
    opened by riacheruvu 56
  • Pegasus finetuning: OOM

    Epoch 0: 91% 5747/6331 [39:52<04:03, 2.40it/s, loss=75.765, v_num=2]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler. warnings.warn(SAVE_STATE_WARNING, UserWarning) tcmalloc: large alloc 1083260928 bytes == 0x1aece0000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 1354080256 bytes == 0x21e5c000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 1692606464 bytes == 0x7f10651ce000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 2115764224 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 2644705280 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 3305881600 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 4132356096 bytes == 0x7f0e530f2000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 5165449216 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 ./finetune_pegasus_xsum.sh: line 15: 876 Killed

    I appreciate any help. Thank you.

    opened by laibamehnaz 47
  • Sharded DDP training fails with seq2seq models

    Information

    Model I am using (Bert, XLNet ...): T5/BART/mBART/Marian

    The problem arises when using:

    • [x] the official example scripts: (give details below)
    • [ ] my own modified scripts: (give details below)

    The tasks I am working on is:

    • [x] an official GLUE/SQUaD task: seq2seq
    • [ ] my own task or dataset: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    Run

    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
    --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
    ~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
    --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
    --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
    --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
    --n_train 500 --sharded_ddp
    

    will fail with

    Traceback (most recent call last):
    File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
    main()
    File "examples/seq2seq/finetune_trainer.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
    File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
    File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
    File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
    KeyError: Parameter containing:
    tensor([[-0.0296,  0.0038],
    [ 0.0000,  0.0000],
    [ 0.0298,  0.0385],
    ...,
    [-0.0161, -0.0024],
    [ 0.0022, -0.0576],
    [ 0.0053,  0.0256]], device='cuda:1')
    0%|   
    

    Using FP16 also fails.

    Expected behavior

    The script should run to completion.

    opened by sgugger 46
  • Feature extraction for sequential labelling

    Hi, I have a question about using BERT for a sequence labeling task. Please correct me if I'm wrong. My understanding is:

    1. Use BertModel loaded with pretrained weights instead of MaskedBertModel.
    2. In that case, given a sequence of tokens as input, BertModel outputs a list of hidden states, and I only use the top-layer hidden states as the embedding for that sequence.
    3. Then, to fine-tune the model, add a fully connected linear layer and a softmax to make the final decision (see the sketch after this list).
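
    A minimal sketch of that setup (the label count and example sentence below are arbitrary; this is not the library's built-in token-classification head):

    import torch
    from torch import nn
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    classifier = nn.Linear(bert.config.hidden_size, 5)       # 5 = number of tags, arbitrary here

    enc = tokenizer("HuggingFace is based in New York", return_tensors="pt")
    hidden = bert(**enc).last_hidden_state                   # top-layer hidden states (1, seq_len, hidden)
    logits = classifier(hidden)                               # one score per token per tag
    predictions = logits.argmax(dim=-1)                       # fine-tune end-to-end with a cross-entropy loss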

    Is this entire process correct? I followed this procedure but could not get any results.

    Thank you!

    Discussion wontfix 
    opened by zhaoxy92 46
  • What to do about this warning message: "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification"

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    

    returns this warning message:

    Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
    - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    

    This just started popping up with v3, so I'm not sure what the recommended action is here. Please advise if you can. Basically, any of my code using AutoModelFor<X> is throwing this warning now.

    Thanks.

    opened by ohmeow 42
  • Does the word embedding matrix of GPT2 load from the checkpoint during the fine-tuning?

    Is the word embedding matrix of GPT-2 loaded from the checkpoint during fine-tuning? (transformer.wte.weight ⇒ shared token embedding layer, shared-weights logic)

    I tried to do the following:

    model = TFGPT2LMHeadModel.from_pretrained("gpt2-medium")
    model.get_output_embeddings().weight

    But the following error occurred, probably because the parameters of wte.weight were not initialized.

    tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found. (0) Failed precondition: Error while reading resource variable tfgp_t2lm_head_model/transformer/wte/weight from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/tfgp_t2lm_head_model/transformer/wte/weight) [[node tfgp_t2lm_head_model/transformer/wte/weight/Read/ReadVariableOp (defined at home/nak/cho/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [[tfgp_t2lm_head_model/transformer/wte/weight/Read/ReadVariableOp/_1]] (1) Failed precondition: Error while reading resource variable tfgp_t2lm_head_model/transformer/wte/weight from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/tfgp_t2lm_head_model/transformer/wte/weight) [[node tfgp_t2lm_head_model/transformer/wte/weight/Read/ReadVariableOp (defined at home/nak/cho/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]

    I am very confused. Shouldn't the parameters of the word embedding table be restored directly from the checkpoint during fine-tuning? Why are they not initialized?

    Moreover, when adding special tokens through the following operations, the number of tokens in the dictionary changes, which means the dimension of the word embedding table is changed. Does this mean that the wte.weight needs to be reinitialized? Will the fine-tuning effect be good in this case?

    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    Many thanks!

    opened by Choitsugun 0
  • [Flax] token-classification model steps enumerate start from 1

    What does this PR do?

    Model saving at the last step was skipped due to enumerate starting at 0.

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [ ] Did you read the contributor guideline, Pull Request section?
    • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
    • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [ ] Did you write any new necessary tests?

    Who can review?

    Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

    @sgugger @patil-suraj

    opened by kamalkraj 0
  • Fix a Bug, trainer_seq2seq.py, in the else branch at Line 172, generation_inputs should be a dict

    Fixing Bug

    Fixes # (issue)

    In trainer_seq2seq.py / Seq2SeqTrainer / prediction_step, Line 174 reads:

    generated_tokens = self.model.generate(
        **generation_inputs,
        **gen_kwargs,
    )
    

    which requires generation_inputs to be a dict. However, in the else branch at Line 171, generation_inputs is created as a Tensor object, which will cause a problem.

    Fix this by creating generation_inputs as a dict and adding a key called input_ids.
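
    A possible shape of that change, sketched from the description above rather than taken from the merged diff:

    # Wrap the bare Tensor in a dict so that ** unpacking works in both branches
    if not isinstance(generation_inputs, dict):
        generation_inputs = {"input_ids": generation_inputs}

    generated_tokens = self.model.generate(
        **generation_inputs,
        **gen_kwargs,
    )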

    opened by TranSirius 0
  • TAPAS tokenizer is unable to handle indexed dataframes

    Running the following code will result in an iloc error:

    from transformers import TapasTokenizer
    import pandas as pd
    
    model_name = 'google/tapas-base'
    tokenizer = TapasTokenizer.from_pretrained(model_name)
    
    data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
    queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
    answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
    answer_text = [["Brad Pitt"], ["69"], ["209"]]
    table = pd.DataFrame.from_dict(data)
    
    # Let's add random years - this will break the tokenizer
    table.index = [2000, 2010, 2020]
    
    inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
    inputs
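
    A possible workaround, offered only as an untested sketch, is to restore a default positional index before tokenizing:

    # Hypothetical workaround: drop the custom index so rows are addressed by position again
    table = table.reset_index(drop=True)
    inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates,
                       answer_text=answer_text, padding='max_length', return_tensors='pt')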
    
    opened by xhlulu 1
  • TAPAS tanh activation on the pooling layer

    I noticed the following in the TAPAS pooling layer:

    https://github.com/huggingface/transformers/blob/d83b0e0c079f0826d186270a86622ff5f1efd9c1/src/transformers/models/tapas/modeling_tapas.py#L696-L709

    I'm curious about the use of nn.Tanh(). I wasn't able to find more information about that activation in the paper. Is it possible to know where it comes from? Thanks!

    opened by xhlulu 1
  • CUDA OOM at `self.optimizer.consolidate_state_dict()` in Trainer when using sharded_ddp

    Environment info

    • transformers version: 4.12.3
    • Platform: Linux-5.4.0-1057-aws-x86_64-with-debian-buster-sid
    • Python version: 3.7.10
    • PyTorch version (GPU?): 1.7.1 (True)
    • Tensorflow version (GPU?): not installed (NA)
    • Flax version (CPU?/GPU?/TPU?): not installed (NA)
    • Jax version: not installed
    • JaxLib version: not installed
    • Using GPU in script?: 8 GPUs
    • Using distributed or parallel set-up in script?: sharded_ddp (fairscale 0.4.2)

    Who can help

    @sgugger

    Information

    Model I am using (Bert, XLNet ...): BART-base

    The problem arises when using:

    • my own modified scripts: (give details below)
    • I'm using my own code which is mainly modified from run_mlm.py(https://github.com/huggingface/transformers/blob/v4.12.3/examples/pytorch/language-modeling/run_mlm.py) for pretraining with huggingface trainer

    The tasks I am working on is:

    • my own task or dataset: (give details below)
    • I'm using wikipedia corpus.

    To reproduce

    Steps to reproduce the behavior:

    1. run the script run_mlm.py(https://github.com/huggingface/transformers/blob/v4.12.3/examples/pytorch/language-modeling/run_mlm.py)
    2. run the script with the following command line
    python -m torch.distributed.launch --nproc_per_node=8 --master_port=10000 run_mlm.py \
        --model_name_or_path roberta-base \
        --dataset_name wikitext \
        --dataset_config_name wikitext-2-raw-v1 \
        --do_train \
        --do_eval \
        --cache_dir /tmp/test-mlm \
        --output_dir /tmp/test-mlm \
        --sharded_ddp simple \
        --overwrite_output_dir \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 4
    

    Traceback (most recent call last): File "run_mlm.py", line 538, in main() File "run_mlm.py", line 487, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/transformers/trainer.py", line 1383, in train self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/transformers/trainer.py", line 1495, in _maybe_log_save_evaluate self._save_checkpoint(model, trial, metrics=metrics) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint self.optimizer.consolidate_state_dict() File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/fairscale/optim/oss.py", line 358, in consolidate_state_dict obj_list, src=self._local_to_global_rank[rank], group=self.group, File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1403, in broadcast_object_list object_list[i] = _tensor_to_object(obj_view, obj_size) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1187, in _tensor_to_object out = pickle.loads(buf) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/storage.py", line 141, in _load_from_bytes return torch.load(io.BytesIO(b)) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/serialization.py", line 595, in load return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/serialization.py", line 774, in _legacy_load result = unpickler.load() File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/serialization.py", line 730, in persistent_load deserialized_objects[root_key] = restore_location(obj, location) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location result = fn(storage, location) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/serialization.py", line 155, in _cuda_deserialize return storage_type(obj.size()) File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/cuda/init.py", line 462, in _lazy_new return super(_CudaBase, cls).new(cls, *args, **kwargs) RuntimeError: CUDA error: out of memory

    Expected behavior

    Could you please tell me how to fix this issue?

    opened by yana-xuyan 0
  • Out of the box GPT-2 CLM hits out of memory on an AWS 8x Nvidia A100 VM

    Running the command located here https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling

    python run_clm.py \
        --model_name_or_path gpt2 \
        --dataset_name wikitext \
        --dataset_config_name wikitext-2-raw-v1 \
        --do_train \
        --do_eval \
        --output_dir /tmp/test-clm

    Produces an OOM error on the AWS p4d.24xlarge VM. Surely this should run out of the box on a VM like this without me having to fiddle with the training params...

    See stack trace below:

    (pytorch_latest_p37) [email protected]:~/transformers/examples/pytorch/language-modeling$ python run_clm.py \
    >     --model_name_or_path gpt2 \
    >     --dataset_name wikitext \
    >     --dataset_config_name wikitext-2-raw-v1 \
    >     --do_train \
    >     --do_eval \
    >     --output_dir /tmp/test-clm
    
    11/26/2021 16:35:03 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 8distributed training: False, 16-bits training: False
    11/26/2021 16:35:03 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
    _n_gpu=8,
    adafactor=False,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    dataloader_drop_last=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=True,
    ddp_find_unused_parameters=None,
    debug=[],
    deepspeed=None,
    disable_tqdm=False,
    do_eval=True,
    do_predict=False,
    do_train=True,
    eval_accumulation_steps=None,
    eval_steps=None,
    evaluation_strategy=IntervalStrategy.NO,
    fp16=False,
    fp16_backend=auto,
    fp16_full_eval=False,
    fp16_opt_level=O1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    greater_is_better=None,
    group_by_length=False,
    hub_model_id=None,
    hub_strategy=HubStrategy.EVERY_SAVE,
    hub_token=<HUB_TOKEN>,
    ignore_data_skip=False,
    label_names=None,
    label_smoothing_factor=0.0,
    learning_rate=5e-05,
    length_column_name=length,
    load_best_model_at_end=False,
    local_rank=-1,
    log_level=-1,
    log_level_replica=-1,
    log_on_each_node=True,
    logging_dir=/tmp/test-clm/runs/Nov26_16-35-03_ip-172-31-20-157,
    logging_first_step=False,
    logging_nan_inf_filter=True,
    logging_steps=500,
    logging_strategy=IntervalStrategy.STEPS,
    lr_scheduler_type=SchedulerType.LINEAR,
    max_grad_norm=1.0,
    max_steps=-1,
    metric_for_best_model=None,
    mp_parameters=,
    no_cuda=False,
    num_train_epochs=3.0,
    output_dir=/tmp/test-clm,
    overwrite_output_dir=False,
    past_index=-1,
    per_device_eval_batch_size=8,
    per_device_train_batch_size=8,
    prediction_loss_only=False,
    push_to_hub=False,
    push_to_hub_model_id=None,
    push_to_hub_organization=None,
    push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
    remove_unused_columns=True,
    report_to=[],
    resume_from_checkpoint=None,
    run_name=/tmp/test-clm,
    save_on_each_node=False,
    save_steps=500,
    save_strategy=IntervalStrategy.STEPS,
    save_total_limit=None,
    seed=42,
    sharded_ddp=[],
    skip_memory_metrics=True,
    tpu_metrics_debug=False,
    tpu_num_cores=None,
    use_legacy_prediction_loop=False,
    warmup_ratio=0.0,
    warmup_steps=0,
    weight_decay=0.0,
    xpu_backend=None,
    )
    11/26/2021 16:35:04 - INFO - datasets.info - Loading Dataset Infos from /home/ubuntu/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
    11/26/2021 16:35:04 - INFO - datasets.builder - Overwrite dataset info from restored data version.
    11/26/2021 16:35:04 - INFO - datasets.info - Loading Dataset info from /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
    11/26/2021 16:35:04 - WARNING - datasets.builder - Reusing dataset wikitext (/home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
    11/26/2021 16:35:04 - INFO - datasets.info - Loading Dataset info from /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
    100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 802.17it/s]
    [INFO|configuration_utils.py:602] 2021-11-26 16:35:04,918 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
    [INFO|configuration_utils.py:639] 2021-11-26 16:35:04,919 >> Model config GPT2Config {
      "_name_or_path": "gpt2",
      "activation_function": "gelu_new",
      "architectures": [
        "GPT2LMHeadModel"
      ],
      "attn_pdrop": 0.1,
      "bos_token_id": 50256,
      "embd_pdrop": 0.1,
      "eos_token_id": 50256,
      "initializer_range": 0.02,
      "layer_norm_epsilon": 1e-05,
      "model_type": "gpt2",
      "n_ctx": 1024,
      "n_embd": 768,
      "n_head": 12,
      "n_inner": null,
      "n_layer": 12,
      "n_positions": 1024,
      "reorder_and_upcast_attn": false,
      "resid_pdrop": 0.1,
      "scale_attn_by_inverse_layer_idx": false,
      "scale_attn_weights": true,
      "summary_activation": null,
      "summary_first_dropout": 0.1,
      "summary_proj_to_labels": true,
      "summary_type": "cls_index",
      "summary_use_proj": true,
      "task_specific_params": {
        "text-generation": {
          "do_sample": true,
          "max_length": 50
        }
      },
      "transformers_version": "4.13.0.dev0",
      "use_cache": true,
      "vocab_size": 50257
    }
    
    [INFO|tokenization_auto.py:344] 2021-11-26 16:35:05,205 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
    [INFO|configuration_utils.py:602] 2021-11-26 16:35:05,783 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
    [INFO|configuration_utils.py:639] 2021-11-26 16:35:05,784 >> Model config GPT2Config { ... }  (same configuration as printed above)
    
    [INFO|tokenization_utils_base.py:1742] 2021-11-26 16:35:07,791 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /home/ubuntu/.cache/huggingface/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
    [INFO|tokenization_utils_base.py:1742] 2021-11-26 16:35:07,791 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /home/ubuntu/.cache/huggingface/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
    [INFO|tokenization_utils_base.py:1742] 2021-11-26 16:35:07,791 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /home/ubuntu/.cache/huggingface/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
    [INFO|tokenization_utils_base.py:1742] 2021-11-26 16:35:07,792 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
    [INFO|tokenization_utils_base.py:1742] 2021-11-26 16:35:07,792 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
    [INFO|tokenization_utils_base.py:1742] 2021-11-26 16:35:07,792 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
    [INFO|configuration_utils.py:602] 2021-11-26 16:35:08,370 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
    [INFO|configuration_utils.py:639] 2021-11-26 16:35:08,370 >> Model config GPT2Config { ... }  (same configuration as printed above)
    
    [INFO|modeling_utils.py:1352] 2021-11-26 16:35:08,733 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /home/ubuntu/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
    [INFO|modeling_utils.py:1619] 2021-11-26 16:35:11,081 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
    
    [INFO|modeling_utils.py:1628] 2021-11-26 16:35:11,082 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
    If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
    11/26/2021 16:35:11 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-87a8d9e859906bd4.arrow
    11/26/2021 16:35:11 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-7e63090e8c713f4a.arrow
    11/26/2021 16:35:11 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-fec244142b111ff5.arrow
    11/26/2021 16:35:11 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-d3b2bb82e64457e6.arrow
    11/26/2021 16:35:11 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-8badde48a528711e.arrow
    11/26/2021 16:35:11 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-4c30c758881dc770.arrow
    [INFO|trainer.py:1196] 2021-11-26 16:35:15,018 >> ***** Running training *****
    [INFO|trainer.py:1197] 2021-11-26 16:35:15,018 >>   Num examples = 2318
    [INFO|trainer.py:1198] 2021-11-26 16:35:15,018 >>   Num Epochs = 3
    [INFO|trainer.py:1199] 2021-11-26 16:35:15,018 >>   Instantaneous batch size per device = 8
    [INFO|trainer.py:1200] 2021-11-26 16:35:15,018 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
    [INFO|trainer.py:1201] 2021-11-26 16:35:15,018 >>   Gradient Accumulation steps = 1
    [INFO|trainer.py:1202] 2021-11-26 16:35:15,018 >>   Total optimization steps = 111
      0%|                                                                               | 0/111 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
      warnings.warn('Was asked to gather along dimension 0, but all '
    Traceback (most recent call last):
      File "run_clm.py", line 526, in <module>
        main()
      File "run_clm.py", line 474, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/home/ubuntu/transformers/src/transformers/trainer.py", line 1317, in train
        tr_loss_step = self.training_step(model, inputs)
      File "/home/ubuntu/transformers/src/transformers/trainer.py", line 1857, in training_step
        loss = self.compute_loss(model, inputs)
      File "/home/ubuntu/transformers/src/transformers/trainer.py", line 1889, in compute_loss
        outputs = model(**inputs)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
        return self.gather(outputs, self.output_device)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 180, in gather
        return gather(outputs, output_device, dim=self.dim)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 76, in gather
        res = gather_map(outputs)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 70, in gather_map
        for k in out))
      File "<string>", line 9, in __init__
      File "/home/ubuntu/transformers/src/transformers/file_utils.py", line 2027, in __post_init__
        for element in iterator:
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 70, in <genexpr>
        for k in out))
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 72, in forward
        return comm.gather(inputs, ctx.dim, ctx.target_device)
      File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 235, in gather
        return torch._C._gather(tensors, dim, destination)
    RuntimeError: CUDA out of memory. Tried to allocate 12.27 GiB (GPU 0; 39.59 GiB total capacity; 28.63 GiB already allocated; 9.00 GiB free; 28.71 GiB reserved in total by PyTorch)
      0%|                                                                               | 0/111 [00:27<?, ?it/s]
    (pytorch_latest_p37) [email protected]:~/transformers/examples/pytorch/language-modeling$
    
    
    opened by oscar-p-wiz 0
  • cannot import name 'SpeechEncoderDecoder' from 'transformers' - wav2vec2-xls-r-2b-22-to-16

    cannot import name 'SpeechEncoderDecoder' from 'transformers' - wav2vec2-xls-r-2b-22-to-16

    Hi, I am currently trying to run this model: facebook/wav2vec2-xls-r-2b-22-to-16 (https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)

    The example code using the pipeline gives significantly different results compared to the API hosted on Hugging Face. I recorded the same audio and sent it to the API through the website, and also ran the sample code on Colab. The outputs are quite different.

    I tried running it with the second, step-by-step method too; it fails with "cannot import name 'SpeechEncoderDecoder' from 'transformers'".

    I tried with the latest transformers library as well as 4.11.3. Could you check what could be wrong? I can share my Colab if needed.

    Thanks for your help in advance.

    opened by programmeddeath1 1
  • Two bugs in AdamW

    Two bugs in AdamW

    Environment info

    • transformers version: 4.13.0.dev0
    • Platform: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.17
    • Python version: 3.9.7
    • PyTorch version (GPU?): 1.10.0+cu113 (True)
    • Tensorflow version (GPU?): 2.7.0 (False)
    • Using GPU in script?: No
    • Using distributed or parallel set-up in script?: No

    Who can help

    @thomwolf and @stas00 should be able to help based on git blame

    Information

    There are two bugs in the implementation of AdamW.

    Here's the current code https://github.com/manuelciosici/transformers/blob/04683c0659aacf31a1e1df8aa2e6cf7b447a6f12/src/transformers/optimization.py#L324-L371

    Weight decay bug

    Look at lines 369-370. The weight decay is multiplied with p.data, which no longer corresponds to theta_{t-1}, since p.data was already modified by the Adam update in line 369. Below is a picture of Algorithm 2 from the original AdamW paper, which shows on line 12 that the weight decay should be multiplied with the previous step's parameters (i.e., theta_{t-1}).

    [Screenshot: Algorithm 2 from the AdamW paper]

    From what I can tell, this is a regression, since the original AdamW implementation in transformers applied weight decay properly. Here's the commit that introduced the bug: https://github.com/HuggingFace/transformers/commit/ec07cf5a660926833d6f5208b58730e4af8d1178#diff-40c6163602943c11431f1ec360299a7646bb436c691a646b9f54b2284f556ce0

    For confirmation that weight decay is currently buggy, see the original AdamW implementation, where, on line 74, the weight decay is multiplied with the old parameters as opposed to the new parameters that are calculated on line 71.
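
    For illustration, here is a minimal sketch (not the library code) of the ordering difference described above, using scalar stand-ins for the parameter and the precomputed Adam step:

    # Minimal sketch of the ordering issue; `theta_prev` stands for theta_{t-1}
    # and `update` for the already bias-corrected Adam step.
    lr, weight_decay = 1e-3, 0.01
    theta_prev, update = 1.0, 0.5

    # Decoupled weight decay (Algorithm 2, line 12): decay uses theta_{t-1}.
    theta_correct = theta_prev - lr * update - lr * weight_decay * theta_prev

    # Ordering described in the report: the parameter is updated first, and the
    # decay is then applied to the already-updated value instead of theta_{t-1}.
    theta_buggy = theta_prev - lr * update
    theta_buggy = theta_buggy - lr * weight_decay * theta_buggy

    print(theta_correct, theta_buggy)  # the two results differ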

    Denominator computation bug

    The second bug appears in the computation of the denominator corresponding to line 10 in Algorithm 2 above. In the current code (see the link in the Information section), on line 351, the denominator excludes the division by math.sqrt(bias_correction2). On line 357, division by math.sqrt(bias_correction2) appears, but, by this time, eps has already been added to denom, making the division not equivalent to line 10 in Algorithm 2.

    From what I can tell, this bug was also introduced as part of commit https://github.com/HuggingFace/transformers/commit/ec07cf5a660926833d6f5208b58730e4af8d1178#diff-40c6163602943c11431f1ec360299a7646bb436c691a646b9f54b2284f556ce0. The previous line update = next_m / (next_v.sqrt() + group['e']) was correct.

    For confirmation that the denominator is not properly calculated, see the original AdamW implementation, where, on line 64 the denominator is computed.
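
    A minimal sketch (not the library code) of the denominator difference, using scalar stand-ins for the second-moment estimate, the bias correction, and eps:

    import math

    # Scalar stand-ins: `v` is the second-moment estimate, `bias_correction2`
    # and `eps` play the same roles as in the optimizer.
    v, bias_correction2, eps = 0.04, 0.999, 1e-8

    # Line 10 of Algorithm 2: divide by sqrt(bias_correction2) before adding eps.
    denom_correct = math.sqrt(v) / math.sqrt(bias_correction2) + eps

    # Ordering described in the report: eps is added first, so the later
    # division by sqrt(bias_correction2) rescales eps as well.
    denom_buggy = (math.sqrt(v) + eps) / math.sqrt(bias_correction2)

    print(denom_correct, denom_buggy)  # close, but not equal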

    To reproduce

    Steps to reproduce the behavior:

    1. Check out the branch at https://github.com/manuelciosici/transformers/tree/reveal_broken_adamw
    2. Run the unit tests in tests/test_optimization.py
    3. Tests test_compare_adamw_no_weight_decay and test_compare_adamw_with_weight_decay should fail (see the attached failed_tests.txt)

    Expected behavior

    The two implementations of AdamW should match their parameter updates.

    Proposed fix

    Check out the branch at https://github.com/manuelciosici/transformers/tree/fix_adamw . It contains both the unit tests above and a fix for both bugs described above.

    I can make a PR once we agree on the two bugs and the fix.

    opened by manuelciosici 0
  • Is the attention_mask in BertSelfAttention applied correctly?

    Is the attention_mask in BertSelfAttention applied correctly?

    https://github.com/huggingface/transformers/blob/69511cdcaec8c1c7f0d7f378964eca0ce74ed5a8/src/transformers/models/bert/modeling_bert.py#L325-L327

    Relevant Models:

    • BERT: @LysandreJik

    I was just working on adjusting Bert to my custom architecture, and when editing the BertSelfAttention module, I have noticed a very strange couple of lines (see the linked code). Shouldn't the masking be applied multiplicatively instead of additively? :thinking:

    I'm happy to be proven wrong and learn a new thing, but it seemed worth bringing up.
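
    For context, here is a minimal illustration (not the Transformers code itself, and with made-up tensors) of why the additive form behaves like a multiplicative mask once the scores go through the softmax: adding a large negative number to a masked position drives its softmax weight to effectively zero, whereas multiplying the raw score by zero would leave a score of 0 that still receives non-negligible weight after the softmax.

    import torch

    scores = torch.tensor([[2.0, 1.0, 0.5]])  # made-up raw attention scores
    mask = torch.tensor([[1.0, 1.0, 0.0]])    # 1 = attend, 0 = masked out

    # Additive masking: masked positions get a large negative offset, so their
    # probability after the softmax collapses to (effectively) zero.
    additive = (1.0 - mask) * -10000.0
    probs = torch.softmax(scores + additive, dim=-1)
    print(probs)  # the last position receives ~0 probability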

    opened by avolny 3
Releases(v4.12.5)
  • v4.12.5(Nov 17, 2021)

  • v4.12.4(Nov 16, 2021)

    • Fix gradient_checkpointing backward compatibility (#14408)
    • [Wav2Vec2] Make sure that gradient checkpointing is only run if needed (#14407)
    • Experimenting with adding proper get_config() and from_config() methods (#14361)
    • enhance rewrite state_dict missing _metadata (#14348)
    • Support for TF >= 2.7 (#14345)
    • improve rewrite state_dict missing _metadata (#14276)
    • Fix of issue #13327: Wrong weight initialization for TF t5 model (#14241)
    Source code(tar.gz)
    Source code(zip)
  • v4.12.3(Nov 3, 2021)

  • v4.12.2(Oct 29, 2021)

  • v4.12.1(Oct 29, 2021)

  • v4.12.0(Oct 28, 2021)

    TrOCR and VisionEncoderDecoderModel

    One new model is released as part of the TrOCR implementation: TrOCRForCausalLM, in PyTorch. It comes along with a new VisionEncoderDecoderModel class, which allows you to mix and match any vision Transformer encoder with any text Transformer as decoder, similar to the existing SpeechEncoderDecoderModel class.

    The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.

    The TrOCR model consists of an image transformer encoder and an autoregressive text transformer to perform optical character recognition in an end-to-end manner.

    • Add TrOCR + VisionEncoderDecoderModel by @NielsRogge in https://github.com/huggingface/transformers/pull/13874

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=trocr
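
    As a rough sketch of the mix-and-match idea (the checkpoint names below are only illustrative), a vision encoder and a text decoder can be combined like this:

    from transformers import VisionEncoderDecoderModel

    # Sketch only: pair a ViT encoder with a GPT-2 decoder; any compatible
    # vision encoder / autoregressive text decoder checkpoints should work.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        "google/vit-base-patch16-224-in21k", "gpt2"
    )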

    SEW & SEW-D

    SEW and SEW-D (Squeezed and Efficient Wav2Vec) were proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

    SEW and SEW-D models use a Wav2Vec-style feature encoder and introduce temporal downsampling to reduce the length of the transformer encoder. SEW-D additionally replaces the transformer encoder with a DeBERTa one. Both models achieve significant inference speedups without sacrificing the speech recognition quality.

    • Add the SEW and SEW-D speech models by @anton-l in https://github.com/huggingface/transformers/pull/13962
    • Add SEW CTC models by @anton-l in https://github.com/huggingface/transformers/pull/14158

    Compatible checkpoints are available on the Hub: https://huggingface.co/models?other=sew and https://huggingface.co/models?other=sew-d

    DistilHuBERT

    DistilHuBERT was proposed in DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT, by Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee.

    DistilHuBERT is a distilled version of the HuBERT model. Using only two transformer layers, the model scores competitively on the SUPERB benchmark tasks.

    A compatible checkpoint is available on the Hub: https://huggingface.co/ntu-spml/distilhubert

    TensorFlow improvements

    Several bug fixes and UX improvements for TensorFlow

    Keras callback

    Introduction of a Keras callback to push to the hub each epoch, or after a given number of steps:

    • Keras callback to push to hub each epoch, or after N steps by @Rocketknight1 in https://github.com/huggingface/transformers/pull/13773
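
    A minimal sketch of how the callback can be wired into Keras training (argument names follow the PR above and may differ slightly by version; the output directory and repository id are placeholders):

    from transformers import PushToHubCallback

    # Push the model to the Hub once per epoch; placeholder paths and repo id.
    push_cb = PushToHubCallback(
        output_dir="./model_checkpoints",
        save_strategy="epoch",
        hub_model_id="my-user/my-finetuned-model",
    )
    # model.fit(train_dataset, callbacks=[push_cb])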

    Updates on the encoder-decoder framework

    The encoder-decoder framework is now available in TensorFlow, allowing mixing and matching different encoders and decoders together into a single encoder-decoder architecture!

    • Add TFEncoderDecoderModel + Add cross-attention to some TF models by @ydshieh in https://github.com/huggingface/transformers/pull/13222

    Besides this, the EncoderDecoderModel classes have been updated to work similarly to models like BART and T5. From now on, users no longer need to pass decoder_input_ids to the model themselves. Instead, they are created automatically based on the labels (namely by shifting them one position to the right, replacing -100 with the pad_token_id and prepending the decoder_start_token_id). Note that this may result in training discrepancies when fine-tuning a model trained with versions prior to 4.12.0 that set decoder_input_ids = labels.

    • Fix EncoderDecoderModel classes to be more like BART and T5 by @NielsRogge in https://github.com/huggingface/transformers/pull/14139
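
    A rough sketch (not the library internals) of the shifting described above: the labels are shifted one position to the right, -100 placeholders are swapped for the pad token, and the decoder start token is prepended.

    # Rough sketch of how decoder_input_ids can be derived from labels.
    def shift_right(labels, pad_token_id, decoder_start_token_id):
        shifted = [decoder_start_token_id] + labels[:-1]
        # -100 is the ignore index used for the loss; replace it in the inputs.
        return [pad_token_id if tok == -100 else tok for tok in shifted]

    print(shift_right([42, 43, 44, -100, -100], pad_token_id=0, decoder_start_token_id=2))
    # [2, 42, 43, 44, 0]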

    Speech improvements

    • Add DistilHuBERT by @anton-l in https://github.com/huggingface/transformers/pull/14174
    • [Speech Examples] Add pytorch speech pretraining by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13877
    • [Speech Examples] Add new audio feature by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14027
    • Add ASR colabs by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14067
    • [ASR] Make speech recognition example more general to load any tokenizer by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14079
    • [Examples] Add an official audio classification example by @anton-l in https://github.com/huggingface/transformers/pull/13722
    • [Examples] Use Audio feature in speech classification by @anton-l in https://github.com/huggingface/transformers/pull/14052

    Auto-model API

    To make it easier to extend the Transformers library, every Auto class now has a register method that allows you to register your own custom models, configurations, or tokenizers. See more in the documentation.

    • Add an API to register objects to Auto classes by @sgugger in https://github.com/huggingface/transformers/pull/13989
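
    A short sketch of the registration API (MyConfig and MyModel are hypothetical user-defined classes):

    from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel

    class MyConfig(PretrainedConfig):
        model_type = "my-model"

    class MyModel(PreTrainedModel):
        config_class = MyConfig

    # Register the custom classes so AutoConfig/AutoModel can resolve them.
    AutoConfig.register("my-model", MyConfig)
    AutoModel.register(MyConfig, MyModel)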

    Bug fixes and improvements

    • Fix filtering in test fetcher utils by @sgugger in https://github.com/huggingface/transformers/pull/13766
    • Fix warning for gradient_checkpointing by @sgugger in https://github.com/huggingface/transformers/pull/13767
    • Implement len in IterableDatasetShard by @sgugger in https://github.com/huggingface/transformers/pull/13780
    • [Wav2Vec2] Better error message by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13777
    • Fix LayoutLM ONNX test error by @nishprabhu in https://github.com/huggingface/transformers/pull/13710
    • Enable readme link synchronization by @qqaatw in https://github.com/huggingface/transformers/pull/13785
    • Fix length of IterableDatasetShard and add test by @sgugger in https://github.com/huggingface/transformers/pull/13792
    • [docs/gpt-j] addd instructions for how minimize CPU RAM usage by @patil-suraj in https://github.com/huggingface/transformers/pull/13795
    • [examples run_glue.py] missing requirements scipy, sklearn by @stas00 in https://github.com/huggingface/transformers/pull/13768
    • [examples/flax] use Repository API for push_to_hub by @patil-suraj in https://github.com/huggingface/transformers/pull/13672
    • Fix gather for TPU by @sgugger in https://github.com/huggingface/transformers/pull/13813
    • [testing] auto-replay captured streams by @stas00 in https://github.com/huggingface/transformers/pull/13803
    • Add MultiBERTs conversion script by @gchhablani in https://github.com/huggingface/transformers/pull/13077
    • [Examples] Improve mapping in accelerate examples by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13810
    • [DPR] Correct init by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13796
    • skip gptj slow generate tests by @patil-suraj in https://github.com/huggingface/transformers/pull/13809
    • Fix warning situation: UserWarning: max_length is ignored when padding=True" by @shirayu in https://github.com/huggingface/transformers/pull/13829
    • Updating CITATION.cff to fix GitHub citation prompt BibTeX output. by @arfon in https://github.com/huggingface/transformers/pull/13833
    • Add TF notebooks by @Rocketknight1 in https://github.com/huggingface/transformers/pull/13793
    • Bart: check if decoder_inputs_embeds is set by @silviu-oprea in https://github.com/huggingface/transformers/pull/13800
    • include megatron_gpt2 in installed modules by @stas00 in https://github.com/huggingface/transformers/pull/13834
    • Delete MultiBERTs conversion script by @gchhablani in https://github.com/huggingface/transformers/pull/13852
    • Remove a duplicated bullet point in the GPT-J doc by @yaserabdelaziz in https://github.com/huggingface/transformers/pull/13851
    • Add Mistral GPT-2 Stability Tweaks by @siddk in https://github.com/huggingface/transformers/pull/13573
    • Fix broken link to distill models in docs by @Randl in https://github.com/huggingface/transformers/pull/13848
    • :sparkles: update image classification example by @nateraw in https://github.com/huggingface/transformers/pull/13824
    • Update no_* argument (HfArgumentParser) by @BramVanroy in https://github.com/huggingface/transformers/pull/13865
    • Update Tatoeba conversion by @Traubert in https://github.com/huggingface/transformers/pull/13757
    • Fixing 1-length special tokens cut. by @Narsil in https://github.com/huggingface/transformers/pull/13862
    • Fix flax summarization example: save checkpoint after each epoch and push checkpoint to the hub by @ydshieh in https://github.com/huggingface/transformers/pull/13872
    • Fixing empty prompts for text-generation when BOS exists. by @Narsil in https://github.com/huggingface/transformers/pull/13859
    • Improve error message when loading models from Hub by @aphedges in https://github.com/huggingface/transformers/pull/13836
    • Initial support for symbolic tracing with torch.fx allowing dynamic axes by @michaelbenayoun in https://github.com/huggingface/transformers/pull/13579
    • Allow dataset to be an optional argument for (Distributed)LengthGroupedSampler by @ZhaofengWu in https://github.com/huggingface/transformers/pull/13820
    • Fixing question-answering with long contexts by @Narsil in https://github.com/huggingface/transformers/pull/13873
    • fix(integrations): consider test metrics by @borisdayma in https://github.com/huggingface/transformers/pull/13888
    • fix: replace asserts by value error by @m5l14i11 in https://github.com/huggingface/transformers/pull/13894
    • Update parallelism.md by @hyunwoongko in https://github.com/huggingface/transformers/pull/13892
    • Autodocument the list of ONNX-supported models by @sgugger in https://github.com/huggingface/transformers/pull/13884
    • Fixing GPU for token-classification in a better way. by @Narsil in https://github.com/huggingface/transformers/pull/13856
    • Update FSNER code in examples->research_projects->fsner by @sayef in https://github.com/huggingface/transformers/pull/13864
    • Replace assert statements with exceptions by @ddrm86 in https://github.com/huggingface/transformers/pull/13871
    • Fixing Backward compatiblity for zero-shot by @Narsil in https://github.com/huggingface/transformers/pull/13855
    • Update run_qa.py - CorrectTypo by @akulagrawal in https://github.com/huggingface/transformers/pull/13857
    • T5ForConditionalGeneration: enabling using past_key_values and labels in training by @yssjtu in https://github.com/huggingface/transformers/pull/13805
    • Fix trainer logging_nan_inf_filter in torch_xla mode by @ymwangg in https://github.com/huggingface/transformers/pull/13896
    • Fix hp search for non sigopt backends by @sgugger in https://github.com/huggingface/transformers/pull/13897
    • [Trainer] Fix nan-loss condition by @anton-l in https://github.com/huggingface/transformers/pull/13911
    • Raise exceptions instead of asserts in utils/download_glue_data by @hirotasoshu in https://github.com/huggingface/transformers/pull/13907
    • Add an example of exporting BartModel + BeamSearch to ONNX module. by @fatcat-z in https://github.com/huggingface/transformers/pull/13765
    • #12789 Replace assert statements with exceptions by @djroxx2000 in https://github.com/huggingface/transformers/pull/13909
    • Add missing whitespace to multiline strings by @aphedges in https://github.com/huggingface/transformers/pull/13916
    • [Wav2Vec2] Fix mask_feature_prob by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13921
    • Fixes a minor doc issue (missing character) by @mishig25 in https://github.com/huggingface/transformers/pull/13922
    • Fix LED by @Rocketknight1 in https://github.com/huggingface/transformers/pull/13882
    • Add BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese by @datquocnguyen in https://github.com/huggingface/transformers/pull/13788
    • [trainer] memory metrics: add memory at the start report by @stas00 in https://github.com/huggingface/transformers/pull/13915
    • Image Segmentation pipeline by @mishig25 in https://github.com/huggingface/transformers/pull/13828
    • Adding support for tokens being suffixes or part of each other. by @Narsil in https://github.com/huggingface/transformers/pull/13918
    • Adds PreTrainedModel.framework attribute by @StellaAthena in https://github.com/huggingface/transformers/pull/13817
    • Fixed typo: herBERT -> HerBERT by @adamjankaczmarek in https://github.com/huggingface/transformers/pull/13936
    • [Generation] Fix max_new_tokens by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13919
    • Fix typo in README.md by @fullyz in https://github.com/huggingface/transformers/pull/13883
    • Update bug-report.md by @LysandreJik in https://github.com/huggingface/transformers/pull/13934
    • fix issue #13904 -attribute does not exist- by @oraby8 in https://github.com/huggingface/transformers/pull/13942
    • Raise ValueError instead of asserts in src/transformers/benchmark/benchmark.py by @AkechiShiro in https://github.com/huggingface/transformers/pull/13951
    • Honor existing attention mask in tokenzier.pad by @sgugger in https://github.com/huggingface/transformers/pull/13926
    • [Gradient checkpoining] Correct disabling find_unused_parameters in Trainer when gradient checkpointing is enabled by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13961
    • Change DataCollatorForSeq2Seq to pad labels to a multiple of pad_to_multiple_of by @affjljoo3581 in https://github.com/huggingface/transformers/pull/13949
    • Replace assert with unittest assertions by @LuisFerTR in https://github.com/huggingface/transformers/pull/13957
    • Raise exceptions instead of asserts in src/transformers/data/processors/xnli.py by @midhun1998 in https://github.com/huggingface/transformers/pull/13945
    • Make username optional in hub_model_id by @sgugger in https://github.com/huggingface/transformers/pull/13940
    • Raise exceptions instead of asserts in src/transformers/data/processors/utils.py by @killazz67 in https://github.com/huggingface/transformers/pull/13938
    • Replace assert by ValueError of src/transformers/models/electra/modeling_{electra,tf_electra}.py and all other models that had copies by @AkechiShiro in https://github.com/huggingface/transformers/pull/13955
    • Fix missing tpu variable in benchmark_args_tf.py by @hardianlawi in https://github.com/huggingface/transformers/pull/13968
    • Specify im-seg mask greyscole mode by @mishig25 in https://github.com/huggingface/transformers/pull/13974
    • [Wav2Vec2] Make sure tensors are always bool for mask_indices by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13977
    • Fixing the lecture values by making sure defaults are not changed by @Narsil in https://github.com/huggingface/transformers/pull/13976
    • [parallel doc] dealing with layers larger than one gpu by @stas00 in https://github.com/huggingface/transformers/pull/13980
    • Remove wrong model_args supplied by @qqaatw in https://github.com/huggingface/transformers/pull/13937
    • Allow single byte decoding by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13988
    • Replace assertion with ValueError exception by @ddrm86 in https://github.com/huggingface/transformers/pull/14006
    • Add strong test for configuration attributes by @sgugger in https://github.com/huggingface/transformers/pull/14000
    • Fix FNet tokenizer tests by @LysandreJik in https://github.com/huggingface/transformers/pull/13995
    • [Testing] Move speech datasets to hf-internal testing ... by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14008
    • Raise exceptions instead of asserts in src/transformers/models/bart/modeling_flax_[bart, marian, mbart, pegasus].py by @killazz67 in https://github.com/huggingface/transformers/pull/13939
    • Scatter dummies + skip pipeline tests by @LysandreJik in https://github.com/huggingface/transformers/pull/13996
    • Fixed horizon_length for PPLM by @jacksukk in https://github.com/huggingface/transformers/pull/13886
    • Fix: replace assert statements with exceptions in file src/transformers/models/lxmert/modeling_lxmert.py by @murilo-goncalves in https://github.com/huggingface/transformers/pull/14029
    • [Docs] More general docstrings by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14028
    • [CLIP] minor fixes by @patil-suraj in https://github.com/huggingface/transformers/pull/14026
    • Don't duplicate the elements in dir by @sgugger in https://github.com/huggingface/transformers/pull/14023
    • Replace assertions with ValueError exceptions by @ddrm86 in https://github.com/huggingface/transformers/pull/14018
    • Fixes typo in modeling_speech_to_text by @mishig25 in https://github.com/huggingface/transformers/pull/14044
    • [Speech] Move all examples to new audio feature by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14045
    • Update SEW integration test tolerance by @anton-l in https://github.com/huggingface/transformers/pull/14048
    • [Flax] Clip fix test by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14046
    • Fix save when laod_best_model_at_end=True by @sgugger in https://github.com/huggingface/transformers/pull/14054
    • [Speech] Refactor Examples by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14040
    • fix typo by @yyy-Apple in https://github.com/huggingface/transformers/pull/14049
    • Fix typo by @ihoromi4 in https://github.com/huggingface/transformers/pull/14056
    • [FX] Fix passing None as concrete args when tracing by @thomasw21 in https://github.com/huggingface/transformers/pull/14022
    • TF Model train and eval step metrics for seq2seq models. by @pedro-r-marques in https://github.com/huggingface/transformers/pull/14009
    • update to_py_obj to support np.number by @PrettyMeng in https://github.com/huggingface/transformers/pull/14064
    • Trainer._load_rng_state() path fix (#14069) by @tlby in https://github.com/huggingface/transformers/pull/14071
    • replace assert with exception in src/transformers/utils/model_pararallel_utils.py by @skpig in https://github.com/huggingface/transformers/pull/14072
    • Add missing autocast() in Trainer.prediction_step() by @juice500ml in https://github.com/huggingface/transformers/pull/14075
    • Fix assert in src/transformers/data/datasets/language_modeling.py by @skpig in https://github.com/huggingface/transformers/pull/14077
    • Fix label attribution in token classification examples by @sgugger in https://github.com/huggingface/transformers/pull/14055
    • Context managers by @lvwerra in https://github.com/huggingface/transformers/pull/13900
    • Fix broken link in the translation section of task summaries by @h4iku in https://github.com/huggingface/transformers/pull/14087
    • [ASR] Small fix model card creation by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14093
    • Change asserts in src/transformers/models/xlnet/ to raise ValueError by @WestonKing-Leatham in https://github.com/huggingface/transformers/pull/14088
    • Replace assertions with ValueError exceptions by @ddrm86 in https://github.com/huggingface/transformers/pull/14061
    • [Typo] Replace "Masked" with "Causal" in TF CLM script by @cakiki in https://github.com/huggingface/transformers/pull/14014
    • [Examples] Add audio classification notebooks by @anton-l in https://github.com/huggingface/transformers/pull/14099
    • Fix ignore_mismatched_sizes by @qqaatw in https://github.com/huggingface/transformers/pull/14085
    • Fix typo in comment by @stalkermustang in https://github.com/huggingface/transformers/pull/14102
    • Replace assertion with ValueError exception by @ddrm86 in https://github.com/huggingface/transformers/pull/14098
    • fix typo in license docstring by @21jun in https://github.com/huggingface/transformers/pull/14094
    • Fix a typo in preprocessing docs by @h4iku in https://github.com/huggingface/transformers/pull/14108
    • Replace assertions with ValueError exceptions by @iDeepverma in https://github.com/huggingface/transformers/pull/14091
    • [tests] fix hubert test sort by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14116
    • Replace assert statements with exceptions (#13871) by @ddrm86 in https://github.com/huggingface/transformers/pull/13901
    • Translate README.md to Korean by @yeounyi in https://github.com/huggingface/transformers/pull/14015
    • Replace assertions with valueError Exeptions by @jyshdewangan in https://github.com/huggingface/transformers/pull/14117
    • Fix assertion in models by @skpig in https://github.com/huggingface/transformers/pull/14090
    • [wav2vec2] Add missing --validation_split_percentage data arg by @falcaopetri in https://github.com/huggingface/transformers/pull/14119
    • Rename variables with unclear naming by @qqaatw in https://github.com/huggingface/transformers/pull/14122
    • Update TP parallel GEMM image by @hyunwoongko in https://github.com/huggingface/transformers/pull/14112
    • Fix some typos in the docs by @h4iku in https://github.com/huggingface/transformers/pull/14126
    • Supporting Seq2Seq model for question answering task by @karthikrangasai in https://github.com/huggingface/transformers/pull/13432
    • Fix rendering of examples version links by @h4iku in https://github.com/huggingface/transformers/pull/14134
    • Fix some writing issues in the docs by @h4iku in https://github.com/huggingface/transformers/pull/14136
    • BartEnocder add set_input_embeddings by @Liangtaiwan in https://github.com/huggingface/transformers/pull/13960
    • Remove unneeded to_tensor() in TF inline example by @Rocketknight1 in https://github.com/huggingface/transformers/pull/14140
    • Enable DefaultDataCollator class by @Rocketknight1 in https://github.com/huggingface/transformers/pull/14141
    • Fix lazy init to stop hiding errors in import by @sgugger in https://github.com/huggingface/transformers/pull/14124
    • Add TF<>PT and Flax<>PT everywhere by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14047
    • Add Camembert to models exportable with ONNX by @ChainYo in https://github.com/huggingface/transformers/pull/14059
    • [Speech Recognition CTC] Add auth token to fine-tune private models by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14154
    • Add vision_encoder_decoder to models/init.py by @ydshieh in https://github.com/huggingface/transformers/pull/14151
    • [Speech Recognition] - Distributed training: Make sure vocab file removal and creation don't interfer by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14161
    • Include Keras tensor in the allowed types by @sergiovalmac in https://github.com/huggingface/transformers/pull/14155
    • [megatron_gpt2] dynamic gelu, add tokenizer, save config by @stas00 in https://github.com/huggingface/transformers/pull/13928
    • Add Unispeech & Unispeech-SAT by @patrickvonplaten in https://github.com/huggingface/transformers/pull/13963
    • [ONNX] Add symbolic function for XSoftmax op for exporting to ONNX. by @fatcat-z in https://github.com/huggingface/transformers/pull/14013
    • Typo on ner accelerate example code by @monologg in https://github.com/huggingface/transformers/pull/14150
    • fix typos in error messages in speech recognition example and modelcard.py by @mgoldey in https://github.com/huggingface/transformers/pull/14166
    • Replace assertions with ValueError exception by @huberemanuel in https://github.com/huggingface/transformers/pull/14142
    • switch to inference_mode from no_gard by @kamalkraj in https://github.com/huggingface/transformers/pull/13667
    • Fix gelu test for torch 1.10 by @LysandreJik in https://github.com/huggingface/transformers/pull/14167
    • [Gradient checkpointing] Enable for Deberta + DebertaV2 + SEW-D by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14175
    • [Pipelines] Fix ASR model types check by @anton-l in https://github.com/huggingface/transformers/pull/14178
    • Replace assert of data/data_collator.py by ValueError by @AkechiShiro in https://github.com/huggingface/transformers/pull/14131
    • [TPU tests] Enable first TPU examples pytorch by @patrickvonplaten in https://github.com/huggingface/transformers/pull/14121
    • [modeling_utils] respect original dtype in _get_resized_lm_head by @stas00 in https://github.com/huggingface/transformers/pull/14181

    New Contributors

    • @arfon made their first contribution in https://github.com/huggingface/transformers/pull/13833
    • @silviu-oprea made their first contribution in https://github.com/huggingface/transformers/pull/13800
    • @yaserabdelaziz made their first contribution in https://github.com/huggingface/transformers/pull/13851
    • @Randl made their first contribution in https://github.com/huggingface/transformers/pull/13848
    • @Traubert made their first contribution in https://github.com/huggingface/transformers/pull/13757
    • @ZhaofengWu made their first contribution in https://github.com/huggingface/transformers/pull/13820
    • @m5l14i11 made their first contribution in https://github.com/huggingface/transformers/pull/13894
    • @hyunwoongko made their first contribution in https://github.com/huggingface/transformers/pull/13892
    • @ddrm86 made their first contribution in https://github.com/huggingface/transformers/pull/13871
    • @akulagrawal made their first contribution in https://github.com/huggingface/transformers/pull/13857
    • @yssjtu made their first contribution in https://github.com/huggingface/transformers/pull/13805
    • @ymwangg made their first contribution in https://github.com/huggingface/transformers/pull/13896
    • @hirotasoshu made their first contribution in https://github.com/huggingface/transformers/pull/13907
    • @fatcat-z made their first contribution in https://github.com/huggingface/transformers/pull/13765
    • @djroxx2000 made their first contribution in https://github.com/huggingface/transformers/pull/13909
    • @adamjankaczmarek made their first contribution in https://github.com/huggingface/transformers/pull/13936
    • @oraby8 made their first contribution in https://github.com/huggingface/transformers/pull/13942
    • @AkechiShiro made their first contribution in https://github.com/huggingface/transformers/pull/13951
    • @affjljoo3581 made their first contribution in https://github.com/huggingface/transformers/pull/13949
    • @LuisFerTR made their first contribution in https://github.com/huggingface/transformers/pull/13957
    • @midhun1998 made their first contribution in https://github.com/huggingface/transformers/pull/13945
    • @killazz67 made their first contribution in https://github.com/huggingface/transformers/pull/13938
    • @hardianlawi made their first contribution in https://github.com/huggingface/transformers/pull/13968
    • @jacksukk made their first contribution in https://github.com/huggingface/transformers/pull/13886
    • @murilo-goncalves made their first contribution in https://github.com/huggingface/transformers/pull/14029
    • @yyy-Apple made their first contribution in https://github.com/huggingface/transformers/pull/14049
    • @ihoromi4 made their first contribution in https://github.com/huggingface/transformers/pull/14056
    • @thomasw21 made their first contribution in https://github.com/huggingface/transformers/pull/14022
    • @pedro-r-marques made their first contribution in https://github.com/huggingface/transformers/pull/14009
    • @PrettyMeng made their first contribution in https://github.com/huggingface/transformers/pull/14064
    • @tlby made their first contribution in https://github.com/huggingface/transformers/pull/14071
    • @skpig made their first contribution in https://github.com/huggingface/transformers/pull/14072
    • @juice500ml made their first contribution in https://github.com/huggingface/transformers/pull/14075
    • @h4iku made their first contribution in https://github.com/huggingface/transformers/pull/14087
    • @WestonKing-Leatham made their first contribution in https://github.com/huggingface/transformers/pull/14088
    • @cakiki made their first contribution in https://github.com/huggingface/transformers/pull/14014
    • @stalkermustang made their first contribution in https://github.com/huggingface/transformers/pull/14102
    • @iDeepverma made their first contribution in https://github.com/huggingface/transformers/pull/14091
    • @yeounyi made their first contribution in https://github.com/huggingface/transformers/pull/14015
    • @jyshdewangan made their first contribution in https://github.com/huggingface/transformers/pull/14117
    • @karthikrangasai made their first contribution in https://github.com/huggingface/transformers/pull/13432
    • @ChainYo made their first contribution in https://github.com/huggingface/transformers/pull/14059
    • @sergiovalmac made their first contribution in https://github.com/huggingface/transformers/pull/14155
    • @huberemanuel made their first contribution in https://github.com/huggingface/transformers/pull/14142

    Full Changelog: https://github.com/huggingface/transformers/compare/v4.11.0...v4.12.0

    Source code(tar.gz)
    Source code(zip)
  • v4.11.3(Oct 6, 2021)

    v4.11.3: Patch release

    This patch release fixes a few issues encountered since the release of v4.11.2:

    • [DPR] Correct init (#13796)
    • Fix warning situation: UserWarning: max_length is ignored when padding=True" (#13829)
    • Bart: check if decoder_inputs_embeds is set (#13800)
    • include megatron_gpt2 in installed modules (#13834)
    • Fixing 1-length special tokens cut. (#13862)
    • Fixing empty prompts for text-generation when BOS exists. (#13859)
    • Fixing question-answering with long contexts (#13873)
    • Fixing GPU for token-classification in a better way. (#13856)
    • Fixing Backward compatiblity for zero-shot (#13855)
    • Fix hp search for non sigopt backends (#13897)
    • Fix trainer logging_nan_inf_filter in torch_xla mode #13896 (@ymwangg)
    • [Trainer] Fix nan-loss condition #13911 (@anton-l)
    Source code(tar.gz)
    Source code(zip)
  • v4.11.2(Sep 30, 2021)

  • v4.11.1(Sep 29, 2021)

    v4.11.1: Patch release

    Patch release with a few bug fixes:

    • [Wav2Vec2] Better error message (#13777)
    • Fix LayoutLM ONNX test error (#13710)
    • Fix warning for gradient_checkpointing (#13767)
    • Implement len in IterableDatasetShard (#13780)
    • Fix length of IterableDatasetShard and add test (#13792)
    Source code(tar.gz)
    Source code(zip)
  • v4.11.0(Sep 27, 2021)

    v4.11.0: GPT-J, Speech2Text2, FNet, Pipeline GPU utilization, dynamic model code loading

    GPT-J

    Three new models are released as part of the GPT-J implementation: GPTJModel, GPTJForCausalLM, GPTJForSequenceClassification, in PyTorch.

    The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.

    It was contributed by @StellaAthena, @kurumuz, @EricHallahan, and @leogao2.

    • GPT-J-6B #13022 (@StellaAthena)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gptj

    SpeechEncoderDecoder & Speech2Text2

    One new model is released as part of the Speech2Text2 implementation: Speech2Text2ForCausalLM, in PyTorch.

    The Speech2Text2 model is used together with Wav2Vec2 for Speech Translation models proposed in Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.

    Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only model, such as Wav2Vec2 or HuBERT, for speech-to-text tasks. Please refer to the SpeechEncoderDecoder class for how to combine Speech2Text2 with any speech encoder-only model.

    • Add SpeechEncoderDecoder & Speech2Text2 #13186 (@patrickvonplaten)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=speech2text2

    FNet

    Eight new models are released as part of the FNet implementation: FNetModel, FNetForPreTraining, FNetForMaskedLM, FNetForNextSentencePrediction, FNetForSequenceClassification, FNetForMultipleChoice, FNetForTokenClassification, FNetForQuestionAnswering, in PyTorch.

    The FNet model was proposed in FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT model with a Fourier transform, which returns only the real parts of the transform. The model is significantly faster than the BERT model because it has fewer parameters and is more memory efficient. It achieves about 92-97% of the accuracy of its BERT counterparts on the GLUE benchmark, and trains much faster than the BERT model.

    • Add FNet #13045 (@gchhablani)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=fnet

    TensorFlow improvements

    Several bug fixes and UX improvements for TensorFlow:

    • Users should notice much fewer unnecessary warnings and less 'console spam' in general while using Transformers with TensorFlow.
    • TensorFlow models should be less picky about the specific integer dtypes (int32/int64) that are passed as input

    Changes to compile() and train_step()

    • You can now compile our TensorFlow models without passing a loss argument! If you do so, the model will compute the loss internally during the forward pass and then use that value to fit() on. This makes it much more convenient to get the right loss, particularly since many models have unique losses for certain tasks that are easy to overlook and annoying to reimplement. Remember to pass your labels as the "labels" key of your input dict when doing this, so that they're accessible to the model during the forward pass. There is no change to the behavior if you pass a loss argument, so all old code should remain unaffected by this change. A short sketch of the new workflow follows the PR list below.

    Associated PRs:

    • Modified TF train_step #13678 (@Rocketknight1)
    • Fix Tensorflow T5 with int64 input #13479 (@Rocketknight1)
    • MarianMT int dtype fix #13496 (@Rocketknight1)
    • Removed console spam from misfiring warnings #13625 (@Rocketknight1)
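
    Here is the sketch of the compile-without-loss workflow referenced above (the checkpoint name is only an example, and tf_train_dataset is assumed to yield dicts that contain a "labels" key):

    import tensorflow as tf
    from transformers import TFAutoModelForSequenceClassification

    model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.compile(optimizer=tf.keras.optimizers.Adam(3e-5))  # note: no loss argument
    # model.fit(tf_train_dataset, epochs=3)  # loss is computed internally from "labels"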

    Pipelines

    Pipeline refactor

    The pipelines underwent a large refactor that should make contributing pipelines much simpler, and much less error-prone. As part of this refactor, PyTorch-based pipelines are now optimized for GPU performance based on PyTorch's Datasets and DataLoaders.

    See below for an example leveraging the superb dataset.

    pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
    dataset = datasets.load_dataset("superb", name="asr", split="test")
    
    # KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item
    # as we're not interested in the `target` part of the dataset.
    for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
        print(out)
        # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
        # {"text": ....}
        # ....
    
    • [Large PR] Entire rework of pipelines. #13308 (@Narsil)

    Audio classification pipeline

    Additionally, a new pipeline is available for audio classification.

    • Add the AudioClassificationPipeline #13342 (@anton-l)
    • Enabling automatic loading of tokenizer with pipeline for audio-classification. #13376 (@Narsil)
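
    A minimal sketch of using the new pipeline (with no model specified it falls back to a default audio-classification checkpoint; "sample.wav" is a placeholder local file):

    from transformers import pipeline

    classifier = pipeline("audio-classification")
    print(classifier("sample.wav"))  # e.g. [{'label': ..., 'score': ...}, ...]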

    Setters for common properties

    Version v4.11.0 introduces setters for common configuration properties. Different configurations have different properties, as they come from different implementations.

    One such example is the BertConfig having the hidden_size attribute, while the GPT2Config has the n_embd attribute, which is essentially the same thing.

    The newly introduced setters allow setting such properties through a standardized naming scheme, even on configuration objects that do not have them by default.

    See the following code sample for an example:

    from transformers import GPT2Config
    config = GPT2Config()

    config.hidden_size = 4  # Failed previously
    config = GPT2Config(hidden_size=4)  # Failed previously

    config.n_embd  # returns 4
    config.hidden_size  # returns 4
    
    • Update model configs - Allow setters for common properties #13026 (@nreimers)

    Dynamic model code loading

    An experimental feature adding support for model files hosted on the hub is added as part of this release. A walkthrough is available in the PR description.

    :warning: This means that code files will be fetched from the hub to be executed locally. An additional argument, trust_remote_code, is required when instantiating the model from the hub. We heavily encourage you to also specify a revision if using code from another user's or organization's repository.

    • Dynamically load model code from the Hub #13467 (@sgugger)
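
    A minimal sketch of what loading such a model looks like ("some-user/custom-model" is a placeholder repository that ships its own modeling code):

    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        "some-user/custom-model",
        trust_remote_code=True,
        revision="main",  # ideally pin a specific commit from a trusted repository
    )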

    Trainer

    The Trainer has received several new features, the main one being that models are uploaded to the Hub each time you save them locally (you can specify another strategy). This push is asynchronous, so training continues normally without interruption.

    Also:

    • The SigOpt optimization framework is now integrated in the Trainer API as an opt-in component.
    • The Trainer API now supports fine-tuning on distributed CPUs.

    Associated PRs:

    • Push to hub when saving checkpoints #13503 (@sgugger)
    • Add SigOpt HPO to transformers trainer api #13572 (@kding1)
    • Add cpu distributed fine-tuning support for transformers Trainer API #13574 (@kding1)
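
    A minimal sketch of enabling the push-to-hub behavior through TrainingArguments (the output directory and repository id are placeholders; argument names follow the PRs above):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="clm-finetuned",
        push_to_hub=True,
        hub_model_id="my-user/my-finetuned-model",
        hub_strategy="every_save",  # push every time a checkpoint is saved
    )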

    Model size CPU memory usage reduction

    Loading a model in memory using PyTorch's torch.load requires twice the amount of memory the model actually needs. An experimental feature that loads a model while requiring only as much memory as the model's size is out in version v4.11.0.

    It can be used by using the low_cpu_mem_usage=True argument with PyTorch pretrained models.

    • 1x model size CPU memory usage for from_pretrained #13466 (@stas00)
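
    A one-line sketch of the flag (PyTorch models only):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2", low_cpu_mem_usage=True)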

    GPT-Neo: simplified local attention

    The GPT-Neo local attention was greatly simplified with no loss of performance.

    • [GPT-Neo] Simplify local attention #13491 (@finetuneanon, @patil-suraj)

    Breaking changes

    We strive for no breaking changes between releases - however, some bugs are not discovered for long periods of time, and users may eventually rely on such bugs. We document here such changes that may affect users when updating to a recent version.

    Order of overflowing tokens

    The overflowing tokens returned by the slow tokenizers were returned in the wrong order. This is changed in the PR below.

    • Correct order of overflowing_tokens for slow tokenizer #13179 (@Apoorvgarg-creator)

    Non-prefixed tokens for token classification pipeline

    Updates the behavior of aggregation_strategy to more closely mimic the deprecated grouped_entities pipeline argument.

    • Fixing backward compatiblity for non prefixed tokens (B-, I-). #13493 (@Narsil)

    Inputs normalization for Wav2Vec2 feature extractor

    The changes in v4.10 (#12804) introduced a bug in inputs normalization for non-padded tensors that affected Wav2Vec2 fine-tuning. This is fixed in the PR below.

    • [Wav2Vec2] Fix normalization for non-padded tensors #13512 (@patrickvonplaten)

    General bug fixes and improvements

    • Fixes for the documentation #13361 (@sgugger)
    • fix wrong 'cls' masking for bigbird qa model output #13143 (@donggyukimc)
    • Improve T5 docs #13240 (@NielsRogge)
    • Fix tokenizer saving during training with Trainer #12806 (@SaulLu)
    • Fix DINO #13369 (@NielsRogge)
    • Properly register missing submodules in main init #13372 (@sgugger)
    • Add Hubert to the AutoFeatureExtractor #13366 (@anton-l)
    • Add missing feature extractors #13374 (@LysandreJik)
    • Fix RemBERT tokenizer initialization #13375 (@LysandreJik)
    • [Flax] Fix BigBird #13380 (@patrickvonplaten)
    • [GPU Tests] Fix SpeechEncoderDecoder GPU tests #13383 (@patrickvonplaten)
    • Fix name and get_class method in AutoFeatureExtractor #13385 (@sgugger)
    • [Flax/run_hybrid_clip] Fix duplicating images when captions_per_image exceeds the number of captions, enable truncation #12752 (@edugp)
    • Move Flax self-push to test machine #13364 (@patrickvonplaten)
    • Torchscript test #13350 (@LysandreJik)
    • Torchscript test for DistilBERT #13351 (@LysandreJik)
    • Torchscript test for ConvBERT #13352 (@LysandreJik)
    • Torchscript test for Flaubert #13353 (@LysandreJik)
    • Fix GPT-J _CHECKPOINT_FOR_DOC typo #13368 (@LysandreJik)
    • Update clip loss calculation #13217 (@sachinruk)
    • Add LayoutXLM tokenizer docs #13373 (@NielsRogge)
    • [doc] fix mBART example #13387 (@patil-suraj)
    • [docs] Update perplexity.rst to use negative log likelihood #13386 (@madaan)
    • [Tests] Fix SpeechEncoderDecoder tests #13395 (@patrickvonplaten)
    • [SpeechEncoderDecoder] Fix final test #13396 (@patrickvonplaten)
    • ✨ Add PyTorch image classification example #13134 (@nateraw)
    • Fix tests without any real effect in EncoderDecoderMixin #13406 (@ydshieh)
    • Fix scheduled tests for SpeechEncoderDecoderModel #13422 (@anton-l)
    • add torchvision in example test requirements #13438 (@patil-suraj)
    • [EncoderDecoder] Fix torch device in tests #13448 (@patrickvonplaten)
    • Adding a test for multibytes unicode. #13447 (@Narsil)
    • skip image classification example test #13451 (@patil-suraj)
    • Add TAPAS MLM-only models #13408 (@NielsRogge)
    • Fix scheduled TF Speech tests #13403 (@anton-l)
    • Update version of packaging package #13454 (@shivdhar)
    • Update setup.py #13421 (@anukaal)
    • Fix img classification tests #13456 (@nateraw)
    • Making it raise real errors on ByT5. #13449 (@Narsil)
    • Optimized bad word ids #13433 (@guillaume-be)
    • Use powers of 2 in download size calculations #13468 (@anton-l)
    • [docs] update dead quickstart link on resuing past for GPT2 #13455 (@shabie)
    • fix CLIP conversion script. #13474 (@patil-suraj)
    • Deprecate Mirror #13470 (@JetRunner)
    • [CLIP] fix logit_scale init #13436 (@patil-suraj)
    • Don't modify labels inplace in LabelSmoother #13464 (@sgugger)
    • Enable automated model list copying for localized READMEs #13465 (@qqaatw)
    • Better error raised when cloned without lfs #13401 (@LysandreJik)
    • Throw ValueError for mirror downloads #13478 (@JetRunner)
    • Fix Tensorflow T5 with int64 input #13479 (@Rocketknight1)
    • Object detection pipeline #12886 (@mishig25)
    • Typo in "end_of_word_suffix" #13477 (@KoichiYasuoka)
    • Fixed the MultilabelTrainer document, which would cause a potential bug when executing the code originally documented. #13414 (@Mohan-Zhang-u)
    • Fix integration tests for TFWav2Vec2 and TFHubert #13480 (@anton-l)
    • Fix typo in deepspeed documentation #13482 (@apohllo)
    • flax ner example #13365 (@kamalkraj)
    • Fix typo in documentation #13494 (@apohllo)
    • MarianMT int dtype fix #13496 (@Rocketknight1)
    • [Tentative] Moving slow tokenizer to the Trie world. #13220 (@Narsil)
    • Refactor internals for Trainer push_to_hub #13486 (@sgugger)
    • examples: minor fixes in flax example readme #13502 (@stefan-it)
    • [Wav2Vec2] Fix normalization for non-padded tensors #13512 (@patrickvonplaten)
    • TF multiple choice loss fix #13513 (@Rocketknight1)
    • [Wav2Vec2] Fix dtype 64 bug #13517 (@patrickvonplaten)
    • fix PhophetNet 'use_cache' assignment of no effect #13532 (@holazzer)
    • Ignore past_key_values during GPT-Neo inference #13521 (@aphedges)
    • Fix attention mask size checking for CLIP #13535 (@Renovamen)
    • [Speech2Text2] Skip newly added tokenizer test #13536 (@patrickvonplaten)
    • [Speech2Text] Give feature extraction higher tolerance #13538 (@patrickvonplaten)
    • [tokenizer] use use_auth_token for config #13523 (@stas00)
    • Small changes in perplexity.rst to make the notebook executable on google collaboratory #13541 (@SaulLu)
    • [Feature Extractors] Return attention mask always in int32 #13543 (@patrickvonplaten)
    • Nightly torch ci #13550 (@LysandreJik)
    • Add long overdue link to the Google TRC project #13501 (@avital)
    • Fixing #13381 #13400 (@Narsil)
    • fixing BC in fill-mask (wasn't tested in theses test suites apparently). #13540 (@Narsil)
    • add flax mbart in auto seq2seq lm #13560 (@patil-suraj)
    • [Flax] Addition of FlaxPegasus #13420 (@bhadreshpsavani)
    • Add checks to build cleaner model cards #13542 (@sgugger)
    • separate model card git push from the rest #13514 (@elishowk)
    • Fix test_fetcher when setup is updated #13566 (@sgugger)
    • [Flax] Fixes typo in Bart based Flax Models #13565 (@bhadreshpsavani)
    • Fix GPTNeo onnx export #13524 (@patil-suraj)
    • upgrade sentencepiece version #13564 (@elishowk)
    • [Pretrained Model] Add resize_position_embeddings #13559 (@patrickvonplaten)
    • [ci] nightly: add deepspeed master #13589 (@stas00)
    • [Tests] Disable flaky s2t test #13585 (@patrickvonplaten)
    • Correct device when resizing position embeddings #13593 (@patrickvonplaten)
    • Fix DataCollatorForSeq2Seq when labels are supplied as Numpy array instead of list #13582 (@Rocketknight1)
    • Fix a pipeline test with the newly updated weights #13608 (@LysandreJik)
    • Fix make fix-copies with type annotations #13586 (@sgugger)
    • DataCollatorForTokenClassification numpy fix #13609 (@Rocketknight1)
    • Feature Extractor: Wav2Vec2 & Speech2Text - Allow truncation + padding=longest #13600 (@patrickvonplaten)
    • [deepspeed] replaced deprecated init arg #13587 (@stas00)
    • Properly use test_fetcher for examples #13604 (@sgugger)
    • XLMR tokenizer is fully picklable #13577 (@ben-davidson-6)
    • Optimize Token Classification models for TPU #13096 (@ibraheem-moosa)
    • [Trainer] Add nan/inf logging filter #13619 (@patrickvonplaten)
    • Fix special tokens not correctly tokenized #13489 (@qqaatw)
    • Removed console spam from misfiring warnings #13625 (@Rocketknight1)
    • Use config_dict_or_path for deepspeed.zero.Init #13614 (@aphedges)
    • Fixes issues with backward pass in LED/Longformer Self-attention #13613 (@aleSuglia)
    • fix some docstring in encoder-decoder models #13611 (@ydshieh)
    • Updated tiny distilbert models #13631 (@LysandreJik)
    • Fix GPT2Config parameters in GPT2ModelTester #13630 (@calpt)
    • [run_summarization] fix typo #13647 (@patil-suraj)
    • [Fix]Make sure the args tb_writer passed to the TensorBoardCallback works #13636 (@iamlockelightning)
    • Fix mT5 documentation #13639 (@ayaka14732)
    • Update modeling_tf_deberta.py #13654 (@kamalkraj)
    • [megatron_gpt2] checkpoint v3 #13508 (@stas00)
    • Change https:/ to https:// to dataset GitHub repo #13644 (@flozi00)
    • fix research_projects/mlm_wwm readme.md examples #13646 (@LowinLi)
    • Fix typo distilbert doc to code link #13643 (@flozi00)
    • Add Speech AutoModels #13655 (@patrickvonplaten)
    • beit-flax #13515 (@kamalkraj)
    • [FLAX] Question Answering Example #13649 (@kamalkraj)
    • Typo "UNKWOWN" -> "UNKNOWN" #13675 (@kamalkraj)
    • [SequenceFeatureExtractor] Rewrite padding logic from pure python to numpy #13650 (@anton-l)
    • [SinusoidalPositionalEmbedding] incorrect dtype when resizing in forward #13665 (@stas00)
    • Add push_to_hub to no_trainer examples #13659 (@sgugger)
    • Layoutlm onnx support (Issue #13300) #13562 (@nishprabhu)
    • Update modeling_flax_wav2vec2.py #13680 (@kamalkraj)
    • [FlaxWav2Vec2] Revive Test #13688 (@patrickvonplaten)
    • [AutoTokenizer] Allow creation of tokenizers by tokenizer type #13668 (@patrickvonplaten)
    • [Wav2Vec2FeatureExtractor] Fix extractor.pad() dtype backwards compatibility #13693 (@anton-l)
    • Make gradient_checkpointing a training argument #13657 (@sgugger)
    • Assertions to exceptions #13692 (@MocktaiLEngineer)
    • Fix non-negligible difference between GPT2 and TFGP2 #13679 (@ydshieh)
    • Allow only textual inputs to VisualBert #13687 (@gchhablani)
    • Patch training arguments issue #13699 (@LysandreJik)
    • Patch training arguments issue #13700 (@LysandreJik)
    • [GPT-J] Use the float16 checkpoints in integration tests #13676 (@anton-l)
    • [docs/gpt-j] add a note about tokenizer #13696 (@patil-suraj)
    • Fix FNet reference to tpu short seq length #13686 (@gchhablani)
    • Add BlenderBot small tokenizer to the init #13367 (@LysandreJik)
    • Fix typo in torchscript tests #13701 (@LysandreJik)
    • Handle UnicodeDecodeError when loading config file #13717 (@qqaatw)
    • Add FSNER example in research_projects #13712 (@sayef)
    • Replace torch.set_grad_enabled by torch.no_grad #13703 (@LysandreJik)
    • [ASR] Add official ASR CTC example to examples/pytorch/speech-recognition #13620 (@patrickvonplaten)
    • Make assertions only if actually chunking forward #13598 (@joshdevins)
    • Use torch.unique_consecutive to check elements are same #13637 (@oToToT)
    • Fixing zero-shot backward compatiblity #13725 (@Narsil)
    • [Tests] FNetTokenizer #13729 (@patrickvonplaten)
    • Warn for unexpected argument combinations #13509 (@shirayu)
    • Add model card creation snippet to example scripts #13730 (@gchhablani)
    • [Examples] speech recognition - remove gradient checkpointing #13733 (@patrickvonplaten)
    • Update test dependence for torch examples #13738 (@sgugger)
    • [Tests] Add decorator to FlaxBeit #13743 (@patrickvonplaten)
    • Update requirements for speech example #13745 (@sgugger)
    • [Trainer] Make sure shown loss in distributed training is correctly averaged over all workers #13681 (@patrickvonplaten)
    • [megatron gpt checkpoint conversion] causal mask requires pos_embed dimension #13735 (@stas00)
    • [Tests] Cast Hubert model tests to fp16 #13755 (@anton-l)
    • Fix type annotations for distributed_concat() #13746 (@Renovamen)
    • Fix loss computation in Trainer #13760 (@sgugger)
    • Silence warning in gradient checkpointing when it's False #13734 (@sgugger)
  • v4.10.3(Sep 22, 2021)

  • v4.10.2(Sep 10, 2021)

  • v4.10.1(Sep 10, 2021)

    • [Wav2Vec2] Fix normalization for non-padded tensors #13512 (@patrickvonplaten)
    • Fixing backward compatiblity for non prefixed tokens (B-, I-). #13493 (@Narsil)
    • Fixing #13381 #13400 (@Narsil)
  • v4.10.0(Aug 31, 2021)

    v4.10.0: LayoutLM-v2, LayoutXLM, BEiT

    LayoutLM-v2 and LayoutXLM

    Four new models are released as part of the LayoutLMv2 implementation: LayoutLMv2ForSequenceClassification, LayoutLMv2Model, LayoutLMv2ForTokenClassification and LayoutLMv2ForQuestionAnswering, in PyTorch.

    The LayoutLMV2 model was proposed in LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves LayoutLM to obtain state-of-the-art results across several document image understanding benchmarks:

    • Add LayoutLMv2 + LayoutXLM #12604 (@NielsRogge)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=layoutlmv2
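
    A minimal loading sketch (not taken from the release notes); note that LayoutLMv2's visual backbone requires detectron2 to be installed:

    from transformers import LayoutLMv2Tokenizer, LayoutLMv2Model

    # The model additionally expects bounding boxes and a document image at inference time.
    tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
    model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")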

    BEiT

    Three new models are released as part of the BEiT implementation: BeitModel, BeitForMaskedImageModeling, and BeitForImageClassification, in PyTorch.

    The BEiT model was proposed in BEiT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class of an image (as done in the original ViT paper), BEiT models are pre-trained to predict visual tokens from the codebook of OpenAI’s DALL-E model given masked patches.

    • Add BEiT #12994 (@NielsRogge)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=beit
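
    A short image-classification sketch, assuming the microsoft/beit-base-patch16-224 checkpoint and a local image file (the path is a placeholder):

    from PIL import Image
    from transformers import BeitFeatureExtractor, BeitForImageClassification

    feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
    model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

    image = Image.open("my_image.jpg")  # hypothetical local file
    inputs = feature_extractor(images=image, return_tensors="pt")
    logits = model(**inputs).logits
    print(model.config.id2label[logits.argmax(-1).item()])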

    Speech improvements

    The Wav2Vec2 and HuBERT models now have a sequence classification head available.

    • Add Wav2Vec2 & Hubert ForSequenceClassification #13153 (@anton-l)
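
    A rough wiring sketch: the classification head loaded on top of the pretrained facebook/wav2vec2-base weights starts out randomly initialized, so this shows the API only, not a ready-to-use classifier:

    import numpy as np
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base", num_labels=8)

    waveform = np.zeros(16000, dtype=np.float32)  # placeholder: one second of 16 kHz audio
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    logits = model(**inputs).logits  # shape: (1, num_labels)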

    DeBERTa in TensorFlow (@kamalkraj)

    The DeBERTa and DeBERTa-v2 models have been converted from PyTorch to TensorFlow.

    • Deberta tf #12972 (@kamalkraj)
    • Deberta_v2 tf #13120 (@kamalkraj)

    Flax model additions

    EncoderDecoder, DistilBERT, and ALBERT now have support in Flax!

    • FlaxEncoderDecoder allowing Bert2Bert and Bert2GPT2 in Flax #13008 (@ydshieh)
    • FlaxDistilBERT #13324 (@kamalkraj)
    • FlaxAlBERT #13294 (@kamalkraj)

    TensorFlow examples

    A new example has been added in TensorFlow: multiple choice! Data collators are now framework-agnostic and work with TensorFlow and NumPy in addition to PyTorch; see the sketch after the PRs below.

    • Add TF multiple choice example #12865 (@Rocketknight1)
    • TF/Numpy variants for all DataCollator classes #13105 (@Rocketknight1)
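
    A small sketch of the new return_tensors switch on the data collators (accepted values are "pt", "tf" and "np"):

    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    features = [tokenizer("Hello world!"), tokenizer("A slightly longer sentence.")]

    # The same collator class now produces NumPy (or TensorFlow) batches as well as PyTorch ones.
    collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="np")
    batch = collator(features)
    print({name: array.shape for name, array in batch.items()})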

    Auto API refactor

    The Auto APIs have been disentangled from all the other model modules of the Transformers library, so you can now safely import the Auto classes without importing all the models (and possibly getting errors if your setup is not compatible with one specific model). The actual model classes are only imported when needed.

    • Disentangle auto modules from other modeling files #13023 (@sgugger)
    • Fix AutoTokenizer when no fast tokenizer is available #13336 (@sgugger)

    Slight breaking change

    When loading some kinds of corrupted state dictionaries of models, the PreTrainedModel.from_pretrained method was sometimes silently ignoring weights. This has now become a real error.

    • Fix from_pretrained with corrupted state_dict #12939 (@sgugger)

    General improvements and bugfixes

    • Improving pipeline tests #12784 (@Narsil)

    • Pin git python to <3.1.19 #12858 (@patrickvonplaten)

    • [tests] fix logging_steps requirements #12860 (@stas00)

    • [Sequence Feature Extraction] Add truncation #12804 (@patrickvonplaten)

    • add classifier_dropout to classification heads #12794 (@PhilipMay)

    • Fix barrier for SM distributed #12853 (@sgugger)

    • Add possibility to ignore imports in test_fecther #12801 (@sgugger)

    • Add accelerate to examples requirements #12888 (@sgugger)

    • Fix documentation of BigBird tokenizer #12889 (@sgugger)

    • Better heuristic for token-classification pipeline. #12611 (@Narsil)

    • Fix push_to_hub for TPUs #12895 (@sgugger)

    • Seq2SeqTrainer set max_length and num_beams only when non None #12899 (@cchen-dialpad)

    • [FLAX] Minor fixes in CLM example #12914 (@stefan-it)

    • Correct validation_split_percentage argument from int (ex:5) to float (0.05) #12897 (@Elysium1436)

    • Fix typo in the example of MobileBertForPreTraining #12919 (@buddhics)

    • Add option to set max_len in run_ner #12929 (@sgugger)

    • Fix QA examples for roberta tokenizer #12928 (@sgugger)

    • Print defaults when using --help for scripts #12930 (@sgugger)

    • Fix StoppingCriteria ABC signature #12918 (@willfrey)

    • Add missing @classmethod decorators #12927 (@willfrey)

    • fix distiller.py #12910 (@chutaklee)

    • Update generation_logits_process.py #12901 (@willfrey)

    • Update generation_logits_process.py #12900 (@willfrey)

    • Update tokenization_auto.py #12896 (@willfrey)

    • Fix docstring typo in tokenization_auto.py #12891 (@willfrey)

    • [Flax] Correctly Add MT5 #12988 (@patrickvonplaten)

    • ONNX v2 raises an Exception when using PyTorch < 1.8.0 #12933 (@mfuntowicz)

    • Moving feature-extraction pipeline to new testing scheme #12843 (@Narsil)

    • Add CpmTokenizerFast #12938 (@JetRunner)

    • fix typo in gradient_checkpointing arg #12855 (@21jun)

    • Log Azure ML metrics only for rank 0 #12766 (@harshithapv)

    • Add substep end callback method #12951 (@wulu473)

    • Add multilingual documentation support #12952 (@JetRunner)

    • Fix division by zero in NotebookProgressPar #12953 (@sgugger)

    • [FLAX] Minor fixes in LM example #12947 (@stefan-it)

    • Prevent Trainer.evaluate() crash when using only tensorboardX #12963 (@aphedges)

    • Fix typo in example of DPRReader #12954 (@tadejsv)

    • Place BigBirdTokenizer in sentencepiece-only objects #12975 (@sgugger)

    • fix typo in example/text-classification README #12974 (@fullyz)

    • Fix template for inputs docstrings #12976 (@sgugger)

    • fix Trainer.train(resume_from_checkpoint=False) is causing an exception #12981 (@PhilipMay)

    • Cast logits from bf16 to fp32 at the end of TF_T5 #12332 (@szutenberg)

    • Update CANINE test #12453 (@NielsRogge)

    • pad_to_multiple_of added to DataCollatorForWholeWordMask #12999 (@Aktsvigun)

    • [Flax] Align jax flax device name #12987 (@patrickvonplaten)

    • [Flax] Correct flax docs #12782 (@patrickvonplaten)

    • T5: Create position related tensors directly on device instead of CPU #12846 (@armancohan)

    • Skip ProphetNet test #12462 (@LysandreJik)

    • Create perplexity.rst #13004 (@sashavor)

    • GPT-Neo ONNX export #12911 (@michaelbenayoun)

    • Update generate method - Fix floor_divide warning #13013 (@nreimers)

    • [Flax] Correct pt to flax conversion if from base to head #13006 (@patrickvonplaten)

    • [Flax T5] Speed up t5 training #13012 (@patrickvonplaten)

    • FX submodule naming fix #13016 (@michaelbenayoun)

    • T5 with past ONNX export #13014 (@michaelbenayoun)

    • Fix ONNX test: Put smaller ALBERT model #13028 (@LysandreJik)

    • Tpu tie weights #13030 (@sgugger)

    • Use min version for huggingface-hub dependency #12961 (@lewtun)

    • tfhub.de -> tfhub.dev #12565 (@abhishekkrthakur)

    • [Flax] Refactor gpt2 & bert example docs #13024 (@patrickvonplaten)

    • Add MBART to models exportable with ONNX #13049 (@LysandreJik)

    • Add to ONNX docs #13048 (@LysandreJik)

    • Fix small typo in M2M100 doc #13061 (@SaulLu)

    • Add try-except for torch_scatter #13040 (@JetRunner)

    • docs: add HuggingArtists to community notebooks #13050 (@AlekseyKorshuk)

    • Fix ModelOutput instantiation form dictionaries #13067 (@sgugger)

    • Roll out the test fetcher on push tests #13055 (@sgugger)

    • Fix fallback of test_fetcher #13071 (@sgugger)

    • Revert to all tests whil we debug what's wrong #13072 (@sgugger)

    • Use original key for label in DataCollatorForTokenClassification #13057 (@ibraheem-moosa)

    • [Doctest] Setup, quicktour and task_summary #13078 (@sgugger)

    • Add VisualBERT demo notebook #12263 (@gchhablani)

    • Install git #13091 (@LysandreJik)

    • Fix classifier dropout in AlbertForMultipleChoice #13087 (@ibraheem-moosa)

    • Doctests job #13088 (@LysandreJik)

    • Fix VisualBert Embeddings #13017 (@gchhablani)

    • Proper import for unittest.mock.patch #13085 (@sgugger)

    • Reactive test fecthers on scheduled test with proper git install #13097 (@sgugger)

    • Change a parameter name in FlaxBartForConditionalGeneration.decode() #13074 (@ydshieh)

    • [Flax/JAX] Run jitted tests at every commit #13090 (@patrickvonplaten)

    • Rely on huggingface_hub for common tools #13100 (@sgugger)

    • [FlaxCLIP] allow passing params to image and text feature methods #13099 (@patil-suraj)

    • Ci last fix #13103 (@sgugger)

    • Improve type checker performance #13094 (@bschnurr)

    • Fix VisualBERT docs #13106 (@gchhablani)

    • Fix CircleCI nightly tests #13113 (@sgugger)

    • Create py.typed #12893 (@willfrey)

    • Fix flax gpt2 hidden states #13109 (@ydshieh)

    • Moving fill-mask pipeline to new testing scheme #12943 (@Narsil)

    • Fix omitted lazy import for xlm-prophetnet #13052 (@minwhoo)

    • Fix classifier dropout in bertForMultipleChoice #13129 (@mandelbrot-walker)

    • Fix frameworks table so it's alphabetical #13118 (@osanseviero)

    • [Feature Processing Sequence] Remove duplicated code #13051 (@patrickvonplaten)

    • Ci continue through smi failure #13140 (@LysandreJik)

    • Fix missing seq_len in electra model when inputs_embeds is used. #13128 (@sararb)

    • Optimizes ByT5 tokenizer #13119 (@Narsil)

    • Add splinter #12955 (@oriram)

    • [AutoFeatureExtractor] Fix loading of local folders if config.json exists #13166 (@patrickvonplaten)

    • Fix generation docstrings regarding input_ids=None #12823 (@jvamvas)

    • Update namespaces inside torch.utils.data to the latest. #13167 (@qqaatw)

    • Fix the loss calculation of ProphetNet #13132 (@StevenTang1998)

    • Fix LUKE tests #13183 (@NielsRogge)

    • Add min and max question length options to TapasTokenizer #12803 (@NielsRogge)

    • SageMaker: Fix sagemaker DDP & metric logs #13181 (@philschmid)

    • correcting group beam search function output score bug #13211 (@sourabh112)

    • Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account #13056 (@SaulLu)

    • remove unwanted control-flow code from DeBERTa-V2 #13145 (@kamalkraj)

    • Fix load_tf_weights alias. #13159 (@qqaatw)

    • Add RemBert to AutoTokenizer #13224 (@LysandreJik)

    • Allow local_files_only for fast pretrained tokenizers #13225 (@BramVanroy)

    • fix AutoModel.from_pretrained(..., torch_dtype=...) #13209 (@stas00)

    • Fix broken links in Splinter documentation #13237 (@oriram)

    • Custom errors and BatchSizeError #13184 (@AmbiTyga)

    • Bump notebook from 6.1.5 to 6.4.1 in /examples/research_projects/lxmert #13226 (@dependabot[bot])

    • Update generation_logits_process.py #12671 (@willfrey)

    • Remove side effects of disabling gradient computaiton #13257 (@LysandreJik)

    • Replace assert statement with if condition and raise ValueError #13263 (@nishprabhu)

    • Better notification service #13267 (@LysandreJik)

    • Fix failing Hubert test #13261 (@LysandreJik)

    • Add CLIP tokenizer to AutoTokenizer #13258 (@LysandreJik)

    • Some model_types cannot be in the mapping #13259 (@LysandreJik)

    • Add require flax to MT5 Flax test #13260 (@LysandreJik)

    • Migrating conversational pipeline tests to new testing format #13114 (@Narsil)

    • fix tokenizer_class_from_name for models with - in the name #13251 (@stas00)

    • Add error message concerning revision #13266 (@BramVanroy)

    • Move image-classification pipeline to new testing #13272 (@Narsil)

    • [Hotfix] Fixing the test (warnings was incorrect.) #13278 (@Narsil)

    • Moving question_answering tests to the new testing scheme. Had to tweak a little some ModelTesterConfig for pipelines. #13277 (@Narsil)

    • Moving summarization pipeline to new testing format. #13279 (@Narsil)

    • Moving table-question-answering pipeline to new testing. #13280 (@Narsil)

    • Moving table-question-answering pipeline to new testing #13281 (@Narsil)

    • Hotfixing master tests. #13282 (@Narsil)

    • Moving text2text-generation to new pipeline testing mecanism #13283 (@Narsil)

    • Add DINO conversion script #13265 (@NielsRogge)

    • Moving text-generation pipeline to new testing framework. #13285 (@Narsil)

    • Moving token-classification pipeline to new testing. #13286 (@Narsil)

    • examples: add keep_linebreaks option to CLM examples #13150 (@stefan-it)

    • Moving translation pipeline to new testing scheme. #13297 (@Narsil)

    • Fix BeitForMaskedImageModeling #13275 (@NielsRogge)

    • Moving zero-shot-classification pipeline to new testing. #13299 (@Narsil)

    • Fixing mbart50 with return_tensors argument too. #13301 (@Narsil)

    • [Flax] Correct all return tensors to numpy #13307 (@patrickvonplaten)

    • examples: only use keep_linebreaks when reading TXT files #13320 (@stefan-it)

    • Slow tests - run rag token in half precision #13304 (@patrickvonplaten)

    • [Slow tests] Disable Wav2Vec2 pretraining test for now #13303 (@patrickvonplaten)

    • Announcing the default model used by the pipeline (with a link). #13276 (@Narsil)

    • use float 16 in causal mask and masked bias #13194 (@hwijeen)

    • ✨ add citation file #13214 (@flaxel)

    • Improve documentation of pooler_output in ModelOutput #13228 (@navjotts)

    • fix: typo spelling grammar #13212 (@slowy07)

    • Check None before going through iteration #13250 (@qqaatw)

    • Use existing functionality for #13251 #13333 (@sgugger)

    • neptune.ai logger: add ability to connect to a neptune.ai run #13319 (@fcakyon)

    • Update label2id in the model config for run_glue #13334 (@sgugger)

    • :bug: fix small model card bugs #13310 (@nateraw)

    • Fall back to observed_batch_size when the dataloader does not know the batch_size. #13188 (@mbforbes)

    • Fixes #12941 where use_auth_token not been set up early enough #13205 (@bennimmo)

    • Correct wrong function signatures on the docs website #13198 (@qqaatw)

    • Fix release utils #13337 (@sgugger)

    • Add missing module spec #13321 (@laurahanu)

    • Use DS callable API to allow hf_scheduler + ds_optimizer #13216 (@tjruwase)

    • Tests fetcher tests #13340 (@sgugger)

    • [Testing] Add Flax Tests on GPU, Add Speech and Vision to Flax & TF tests #13313 (@patrickvonplaten)

    • Fixing a typo in the data_collator documentation #13309 (@Serhiy-Shekhovtsov)

    • Add GPT2ForTokenClassification #13290 (@tucan9389)

    • Doc mismatch fixed #13345 (@Apoorvgarg-creator)

    • Handle nested dict/lists of tensors as inputs in the Trainer #13338 (@sgugger)

    • [doc] correct TP implementation resources #13248 (@stas00)

    • Fix minor typo in parallelism doc #13289 (@jaketae)

    • Set missing seq_length variable when using inputs_embeds with ALBERT & Remove code duplication #13152 (@olenmg)

    • TF CLM example fix typo #13002 (@Rocketknight1)

    • Add generate kwargs to Seq2SeqTrainingArguments #13339 (@sgugger)

  • v4.9.2(Aug 9, 2021)

    v4.9.2: Patch release

    • Tpu tie weights #13030 (@sgugger)
    • ONNX fixes & examples: #13048, #13049, #13028, #13014, #12911, (@mfuntowicz, @michaelbenayoun, @LysandreJik)
    • Fix push_to_hub for TPUs #12895 (@sgugger)
  • v4.9.1(Jul 26, 2021)

  • v4.9.0(Jul 22, 2021)

    v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework

    ONNX rework

    This version introduces a new package, transformers.onnx, which can be used to export models to ONNX. Contrary to the previous implementation, this approach is meant as an easily extendable package where users may define their own ONNX configurations and export the models they need.

    python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/
    
    Validating ONNX model...
            -[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'}
            - Validating ONNX Model output "last_hidden_state":
                    -[✓] (2, 8, 768) matchs (2, 8, 768)
                    -[✓] all values close (atol: 0.0001)
            - Validating ONNX Model output "pooler_output":
                    -[✓] (2, 768) matchs (2, 768)
                    -[✓] all values close (atol: 0.0001)
    All good, model saved at: onnx/bert-base-cased/model.onnx
    
    • [RFC] Laying down building stone for more flexible ONNX export capabilities #11786 (@mfuntowicz)
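
    Once exported, the model can be loaded with any ONNX runtime; a minimal check with onnxruntime (a separate dependency, not part of transformers.onnx) might look like this:

    from onnxruntime import InferenceSession
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    session = InferenceSession("onnx/bert-base-cased/model.onnx")

    # ONNX Runtime expects NumPy arrays as inputs.
    inputs = tokenizer("Using BERT through ONNX Runtime", return_tensors="np")
    outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))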

    CANINE model

    Four new models are released as part of the CANINE implementation: CanineForSequenceClassification, CanineForMultipleChoice, CanineForTokenClassification and CanineForQuestionAnswering, in PyTorch.

    The CANINE model was proposed in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It’s among the first papers that train a Transformer without using an explicit tokenization step (such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece). Instead, the model is trained directly at a Unicode character level. Training at a character level inevitably comes with a longer sequence length, which CANINE solves with an efficient downsampling strategy, before applying a deep Transformer encoder.

    • Add CANINE #12024 (@NielsRogge)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=canine
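
    A quick sketch with the google/canine-s checkpoint; since CANINE works directly on characters, the tokenizer essentially converts the string into Unicode code points:

    from transformers import CanineTokenizer, CanineModel

    tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
    model = CanineModel.from_pretrained("google/canine-s")

    inputs = tokenizer("CANINE is tokenization-free.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)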

    Tokenizer training

    This version introduces a new method to train a tokenizer from scratch based on an existing tokenizer's configuration.

    from datasets import load_dataset
    from transformers import AutoTokenizer
    
    dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
    # We train on batches of texts, 1000 at a time here.
    batch_size = 1000
    corpus = (dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size))
    
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=20000)
    
    • Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) #12420 (@SaulLu)
    • Easily train a new fast tokenizer from a given one #12361 (@sgugger)

    TensorFlow examples

    The TFTrainer is now deprecated and replaced by Keras. Version v4.9.0 marks the end of a long rework of the TensorFlow examples, making them more Keras-idiomatic, clearer, and more robust.

    • NER example for Tensorflow #12469 (@Rocketknight1)
    • TF summarization example #12617 (@Rocketknight1)
    • Adding TF translation example #12667 (@Rocketknight1)
    • Deprecate TFTrainer #12706 (@Rocketknight1)

    TensorFlow implementations

    HuBERT is now implemented in TensorFlow:

    • Add TFHubertModel #12206 (@will-rice)

    Breaking changes

    When load_best_model_at_end was set to True in the TrainingArguments, having different save_strategy and eval_strategy values was accepted, but the save_strategy was silently overwritten by the eval_strategy (keeping track of the best model requires an evaluation at each save). This caused a lot of confusion, with users not understanding why their script was not doing what it was told. The situation now raises an error indicating that save_strategy and eval_strategy must be set to the same value, and, when that value is "steps", that save_steps must be a round multiple of eval_steps. A configuration sketch follows.
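
    A configuration sketch that satisfies the new check (note that the actual TrainingArguments parameter is named evaluation_strategy):

    from transformers import TrainingArguments

    # Save and evaluation strategies must match when load_best_model_at_end=True,
    # and save_steps must be a round multiple of eval_steps for the "steps" strategy.
    args = TrainingArguments(
        output_dir="out",
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=500,
        save_steps=1000,
        load_best_model_at_end=True,
    )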

    General improvements and bugfixes

    • UpdateDescription of TrainingArgs param save_strategy #12328 (@sam-qordoba)
    • [Deepspeed] new docs #12077 (@stas00)
    • [ray] try fixing import error #12338 (@richardliaw)
    • [examples/Flax] move the examples table up #12341 (@patil-suraj)
    • Fix torchscript tests #12336 (@LysandreJik)
    • Add flax/jax quickstart #12342 (@marcvanzee)
    • Fixed a typo in readme #12356 (@MichalPitr)
    • Fix exception in prediction loop occurring for certain batch sizes #12350 (@jglaser)
    • Add FlaxBigBird QuestionAnswering script #12233 (@vasudevgupta7)
    • Replace NotebookProgressReporter by ProgressReporter in Ray Tune run #12357 (@krfricke)
    • [examples] remove extra white space from log format #12360 (@stas00)
    • fixed multiplechoice tokenization #12362 (@cronoik)
    • [trainer] add main_process_first context manager #12351 (@stas00)
    • [Examples] Replicates the new --log_level feature to all trainer-based pytorch #12359 (@bhadreshpsavani)
    • [Examples] Update Example Template for --log_level feature #12365 (@bhadreshpsavani)
    • [Examples] Replace print statement with logger.info in QA example utils #12368 (@bhadreshpsavani)
    • Onnx export v2 fixes #12388 (@LysandreJik)
    • [Documentation] Warn that DataCollatorForWholeWordMask is limited to BertTokenizer-like tokenizers #12371 (@ionicsolutions)
    • Update run_mlm.py #12344 (@TahaAslani)
    • Add possibility to maintain full copies of files #12312 (@sgugger)
    • [CI] add dependency table sync verification #12364 (@stas00)
    • [Examples] Added context manager to datasets map #12367 (@bhadreshpsavani)
    • [Flax community event] Add more description to readme #12398 (@patrickvonplaten)
    • Remove the need for einsum in Albert's attention computation #12394 (@mfuntowicz)
    • [Flax] Adapt flax examples to include push_to_hub #12391 (@patrickvonplaten)
    • Tensorflow LM examples #12358 (@Rocketknight1)
    • [Deepspeed] match the trainer log level #12401 (@stas00)
    • [Flax] Add T5 pretraining script #12355 (@patrickvonplaten)
    • [models] respect dtype of the model when instantiating it #12316 (@stas00)
    • Rename detr targets to labels #12280 (@NielsRogge)
    • Add out of vocabulary error to ASR models #12288 (@will-rice)
    • Fix TFWav2Vec2 SpecAugment #12289 (@will-rice)
    • [example/flax] add summarization readme #12393 (@patil-suraj)
    • [Flax] Example scripts - correct weight decay #12409 (@patrickvonplaten)
    • fix ids_to_tokens naming error in tokenizer of deberta v2 #12412 (@hjptriplebee)
    • Minor fixes in original RAG training script #12395 (@shamanez)
    • Added talks #12415 (@suzana-ilic)
    • [modelcard] fix #12422 (@stas00)
    • Add option to save on each training node #12421 (@sgugger)
    • Added to talks section #12433 (@suzana-ilic)
    • Fix default bool in argparser #12424 (@sgugger)
    • Add default bos_token and eos_token for tokenizer of deberta_v2 #12429 (@hjptriplebee)
    • fix typo in mt5 configuration docstring #12432 (@fcakyon)
    • Add to talks section #12442 (@suzana-ilic)
    • [JAX/Flax readme] add philosophy doc #12419 (@patil-suraj)
    • [Flax] Add wav2vec2 #12271 (@patrickvonplaten)
    • Add test for a WordLevel tokenizer model #12437 (@SaulLu)
    • [Flax community event] How to use hub during training #12447 (@patrickvonplaten)
    • [Wav2Vec2, Hubert] Fix ctc loss test #12458 (@patrickvonplaten)
    • Comment fast GPU TF tests #12452 (@LysandreJik)
    • Fix training_args.py barrier for torch_xla #12464 (@jysohn23)
    • Added talk details #12465 (@suzana-ilic)
    • Add TPU README #12463 (@patrickvonplaten)
    • Import check_inits handling of duplicate definitions. #12467 (@Iwontbecreative)
    • Validation split added: custom data files @sgugger, @patil-suraj #12407 (@Souvic)
    • Fixing bug with param count without embeddings #12461 (@TevenLeScao)
    • [roberta] fix lm_head.decoder.weight ignore_key handling #12446 (@stas00)
    • Rework notebooks and move them to the Notebooks repo #12471 (@sgugger)
    • fixed typo in flax-projects readme #12466 (@mplemay)
    • Fix TAPAS test uncovered by #12446 #12480 (@LysandreJik)
    • Add guide on how to build demos for the Flax sprint #12468 (@osanseviero)
    • Add Repository import to the FLAX example script #12501 (@LysandreJik)
    • [examples/flax] clip style image-text training example #12491 (@patil-suraj)
    • [Flax] Fix wav2vec2 pretrain arguments #12498 (@Wikidepia)
    • [Flax] ViT training example #12300 (@patil-suraj)
    • Fix order of state and input in Flax Quickstart README #12510 (@navjotts)
    • [Flax] Dataset streaming example #12470 (@patrickvonplaten)
    • [Flax] Correct flax training scripts #12514 (@patrickvonplaten)
    • [Flax] Correct logging steps flax #12515 (@patrickvonplaten)
    • [Flax] Fix another bug in logging steps #12516 (@patrickvonplaten)
    • [Wav2Vec2] Flax - Adapt wav2vec2 script #12520 (@patrickvonplaten)
    • [Flax] Fix hybrid clip #12519 (@patil-suraj)
    • [RoFormer] Fix some issues #12397 (@JunnYu)
    • FlaxGPTNeo #12493 (@patil-suraj)
    • Updated README #12540 (@suzana-ilic)
    • Edit readme #12541 (@SaulLu)
    • implementing tflxmertmodel integration test #12497 (@sadakmed)
    • [Flax] Adapt examples to be able to use eval_steps and save_steps #12543 (@patrickvonplaten)
    • [examples/flax] add adafactor optimizer #12544 (@patil-suraj)
    • [Flax] Add FlaxMBart #12236 (@stancld)
    • Add a warning for broken ProphetNet fine-tuning #12511 (@JetRunner)
    • [trainer] add option to ignore keys for the train function too (#11719) #12551 (@shabie)
    • MLM training fails with no validation file(same as #12406 for pytorch now) #12517 (@Souvic)
    • [Flax] Allow retraining from save checkpoint #12559 (@patrickvonplaten)
    • Adding prepare_decoder_input_ids_from_labels methods to all TF ConditionalGeneration models #12560 (@Rocketknight1)
    • Remove tf.roll wherever not needed #12512 (@szutenberg)
    • Double check for attribute num_examples #12562 (@sgugger)
    • [examples/hybrid_clip] fix loading clip vision model #12566 (@patil-suraj)
    • Remove logging of GPU count etc from run_t5_mlm_flax.py #12569 (@ibraheem-moosa)
    • raise exception when arguments to pipeline are incomplete #12548 (@hwijeen)
    • Init pickle #12567 (@sgugger)
    • Fix group_lengths for short datasets #12558 (@sgugger)
    • Don't stop at num_epochs when using IterableDataset #12561 (@sgugger)
    • Fixing the pipeline optimization by reindexing targets (V2) #12330 (@Narsil)
    • Fix MT5 init #12591 (@sgugger)
    • [model.from_pretrained] raise exception early on failed load #12574 (@stas00)
    • [doc] fix broken ref #12597 (@stas00)
    • Add Flax sprint project evaluation section #12592 (@osanseviero)
    • This will reduce "Already borrowed error": #12550 (@Narsil)
    • [Flax] Add flax marian #12595 (@patrickvonplaten)
    • [Flax] Fix cur step flax examples #12608 (@patrickvonplaten)
    • Simplify unk token #12582 (@sgugger)
    • Fix arg count for partial functions #12609 (@sgugger)
    • Pass model_kwargs when loading a model in pipeline() #12449 (@aphedges)
    • [Flax] Fix mt5 auto #12612 (@patrickvonplaten)
    • [Flax Marian] Add marian flax example #12614 (@patrickvonplaten)
    • [FLax] Fix marian docs 2 #12615 (@patrickvonplaten)
    • [debugging utils] minor doc improvements #12525 (@stas00)
    • [doc] DP/PP/TP/etc parallelism #12524 (@stas00)
    • [doc] fix anchor #12620 (@stas00)
    • [Examples][Flax] added test file in summarization example #12630 (@bhadreshpsavani)
    • Point to the right file for hybrid CLIP #12599 (@edugp)
    • [flax]fix jax array type check #12638 (@patil-suraj)
    • Add tokenizer_file parameter to PreTrainedTokenizerFast docstring #12624 (@lewisbails)
    • Skip TestMarian_MT_EN #12649 (@LysandreJik)
    • The extended trainer tests should require torch #12650 (@LysandreJik)
    • Pickle auto models #12654 (@sgugger)
    • Pipeline should be agnostic #12656 (@LysandreJik)
    • Fix transfo xl integration test #12652 (@LysandreJik)
    • Remove SageMaker documentation #12657 (@philschmid)
    • Fixed docs #12646 (@KickItLikeShika)
    • fix typo in modeling_t5.py docstring #12640 (@PhilipMay)
    • Translate README.md to Simplified Chinese #12596 (@JetRunner)
    • Fix typo in README_zh-hans.md #12663 (@JetRunner)
    • Updates timeline for project evaluation #12660 (@osanseviero)
    • [WIP] Patch BigBird tokenization test #12653 (@LysandreJik)
    • **encode_plus() shouldn't run for W2V2CTC #12655 (@LysandreJik)
    • Add ByT5 option to example run_t5_mlm_flax.py #12634 (@mapmeld)
    • Wrong model is used in example, should be character instead of subword model #12676 (@jsteggink)
    • [Blenderbot] Fix docs #12227 (@patrickvonplaten)
    • Add option to load a pretrained model with mismatched shapes #12664 (@sgugger)
    • Fix minor docstring typos. #12682 (@qqaatw)
    • [tokenizer.prepare_seq2seq_batch] change deprecation to be easily actionable #12669 (@stas00)
    • [Flax Generation] Correct inconsistencies PyTorch/Flax #12662 (@patrickvonplaten)
    • [Deepspeed] adapt multiple models, add zero_to_fp32 tests #12477 (@stas00)
    • Add timeout to CI. #12684 (@LysandreJik)
    • Fix Tensorflow Bart-like positional encoding #11897 (@JunnYu)
    • [Deepspeed] non-native optimizers are mostly ok with zero-offload #12690 (@stas00)
    • Fix multiple choice doc examples #12679 (@sgugger)
    • Provide mask_time_indices to _mask_hidden_states to avoid double masking #12692 (@mfuntowicz)
    • Update TF examples README #12703 (@Rocketknight1)
    • Fix uninitialized variables when config.mask_feature_prob > 0 #12705 (@mfuntowicz)
    • Only test the files impacted by changes in the diff #12644 (@sgugger)
    • flax model parallel training #12590 (@patil-suraj)
    • [test] split test into 4 sub-tests to avoid timeout #12710 (@stas00)
    • [trainer] release tmp memory in checkpoint load #12718 (@stas00)
    • [Flax] Correct shift labels for seq2seq models in Flax #12720 (@patrickvonplaten)
    • Fix typo in Speech2TextForConditionalGeneration example #12716 (@will-rice)
    • Init adds its own files as impacted #12709 (@sgugger)
    • LXMERT integration test typo #12736 (@LysandreJik)
    • Fix AutoModel tests #12733 (@LysandreJik)
    • Skip test while the model is not available #12739 (@LysandreJik)
    • Skip test while the model is not available #12740 (@LysandreJik)
    • Translate README.md to Traditional Chinese #12701 (@qqaatw)
    • Fix MBart failing test #12737 (@LysandreJik)
    • Patch T5 device test #12742 (@LysandreJik)
    • Fix DETR integration test #12734 (@LysandreJik)
    • Fix led torchscript #12735 (@LysandreJik)
    • Remove framework mention #12731 (@LysandreJik)
    • [doc] parallelism: Which Strategy To Use When #12712 (@stas00)
    • [doc] performance: batch sizes #12725 (@stas00)
    • Replace specific tokenizer in log message by AutoTokenizer #12745 (@SaulLu)
    • [Wav2Vec2] Correctly pad mask indices for PreTraining #12748 (@patrickvonplaten)
    • [doc] testing: how to trigger a self-push workflow #12724 (@stas00)
    • add intel-tensorflow-avx512 to the candidates #12751 (@zzhou612)
    • [flax/model_parallel] fix typos #12757 (@patil-suraj)
    • Turn on eval mode when exporting to ONNX #12758 (@mfuntowicz)
    • Preserve list type of additional_special_tokens in special_token_map #12759 (@SaulLu)
    • [Wav2Vec2] Padded vectors should not allowed to be sampled #12764 (@patrickvonplaten)
    • Add tokenizers class mismatch detection between cls and checkpoint #12619 (@europeanplaice)
    • Fix push_to_hub docstring and make it appear in doc #12770 (@sgugger)
    • [ray] Fix datasets_modules ImportError with Ray Tune #12749 (@Yard1)
    • Longer timeout for slow tests #12779 (@LysandreJik)
    • Enforce eval and save strategies are compatible when --load_best_model_at_end #12786 (@sgugger)
    • [CIs] add troubleshooting docs #12791 (@stas00)
    • Fix Padded Batch Error 12282 #12487 (@will-rice)
    • Flax MLM: Allow validation split when loading dataset from local file #12689 (@fgaim)
    • [Longformer] Correct longformer docs #12809 (@patrickvonplaten)
    • [CLIP/docs] add and fix examples #12810 (@patil-suraj)
    • [trainer] sanity checks for save_steps=0|None and logging_steps=0 #12796 (@stas00)
    • Expose get_config() on ModelTesters #12812 (@LysandreJik)
    • Refactor slow sentencepiece tokenizers. #11716 (@PhilipMay)
    • Refer warmup_ratio when setting warmup_num_steps. #12818 (@tsuchm)
    • Add versioning system to fast tokenizer files #12713 (@sgugger)
    • Add _CHECKPOINT_FOR_DOC to all models #12811 (@LysandreJik)
  • v4.8.2(Jun 30, 2021)

    Patch release: v4.8.2

    • Rename detr targets to labels #12280 (@NielsRogge)
    • fix ids_to_tokens naming error in tokenizer of deberta v2 #12412 (@hjptriplebee)
    • Add option to save on each training node #12421 (@sgugger)
  • v4.8.1(Jun 24, 2021)

  • v4.8.0(Jun 23, 2021)

    v4.8.0 Integration with the Hub and Flax/JAX support

    Integration with the Hub

    Our example scripts and Trainer are now optimized for publishing your model on the Hugging Face Hub, with Tensorboard training metrics, and an automatically authored model card which contains all the relevant metadata, including evaluation results.

    Trainer Hub integration

    Use --push_to_hub to create a model repo for your training; the model will be saved there with all relevant metadata at the end of training.

    Other flags are:

    • push_to_hub_model_id to control the repo name
    • push_to_hub_organization to specify an organization
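
    A sketch of how these flags map onto TrainingArguments when not using the example scripts (repository and organization names are placeholders, and pushing assumes you are logged in via huggingface-cli login):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="my-finetuned-model",
        push_to_hub=True,
        push_to_hub_model_id="my-finetuned-model",  # controls the repo name
        push_to_hub_organization="my-organization",  # optional organization
    )
    # After training, trainer.push_to_hub() uploads the model, metrics and model card.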

    Visualizing Training metrics on huggingface.co (based on Tensorboard)

    If you have tensorboard installed, the training scripts will use it for logging by default, and the logging traces folder is conveniently located inside your model output directory, so the traces can be pushed to your model repo along with the model.

    Any model repo that contains Tensorboard traces will spawn a Tensorboard server, which makes it very convenient to see how the training went! This Hub feature is in beta, so let us know if anything looks weird :)

    See this model repo

    Model card generation

    The model card contains info about the datasets used, the eval results, ...

    Many users were already adding their eval results to their model cards in markdown format, but this is a more structured way of adding them which will make it easier to parse and e.g. represent in leaderboards such as the ones on Papers With Code!

    We use a format specified in collaboration with PapersWithCode (https://github.com/huggingface/huggingface_hub/blame/main/modelcard.md); see also this repo.

    Model, tokenizer and configurations

    All models, tokenizers and configurations now have a revamped push_to_hub() method, as well as a push_to_hub argument in their save_pretrained() method. The workflow of this method has changed a bit to be more git-like, with a local clone of the repo kept in a folder of the working directory to make it easier to apply patches (use use_temp_dir=True to clone into temporary folders for the same behavior as the experimental API).

    • Clean push to hub API #12187 (@sgugger)
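
    A minimal sketch of the two entry points (the repository name is a placeholder, and pushing assumes you are logged in via huggingface-cli login):

    from transformers import AutoModel

    model = AutoModel.from_pretrained("bert-base-cased")

    # Either push explicitly...
    model.push_to_hub("my-fine-tuned-bert", use_temp_dir=True)

    # ...or push as part of saving.
    model.save_pretrained("my-fine-tuned-bert", push_to_hub=True)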

    Flax/JAX support

    Flax/JAX is becoming a fully supported backend of the Transformers library, with more and more models having an implementation in it. BART, CLIP and T5 join the already existing models; find the whole list here.

    • [Flax] FlaxAutoModelForSeq2SeqLM #12228 (@patil-suraj)
    • [FlaxBart] few small fixes #12247 (@patil-suraj)
    • [FlaxClip] fix test from/save pretrained test #12284 (@patil-suraj)
    • [Flax] [WIP] allow loading head model with base model weights #12255 (@patil-suraj)
    • [Flax] Fix flax test save pretrained #12256 (@patrickvonplaten)
    • [Flax] Add jax flax to env command #12251 (@patrickvonplaten)
    • add FlaxAutoModelForImageClassification in main init #12298 (@patil-suraj)
    • Flax T5 #12150 (@vasudevgupta7)
    • [Flax T5] Fix weight initialization and fix docs #12327 (@patrickvonplaten)
    • Flax summarization script #12230 (@patil-suraj)
    • FlaxBartPretrainedModel -> FlaxBartPreTrainedModel #12313 (@sgugger)

    General improvements and bug fixes

    • AutoTokenizer: infer the class from the tokenizer config if possible #12208 (@sgugger)
    • update desc for map in all examples #12226 (@bhavitvyamalik)
    • Depreciate pythonic Mish and support PyTorch 1.9 version of Mish #12240 (@digantamisra98)
    • [t5 doc] make the example work out of the box #12239 (@stas00)
    • Better CI feedback #12279 (@LysandreJik)
    • Fix for making student ProphetNet for Seq2Seq Distillation #12130 (@vishal-burman)
    • [DeepSpeed] don't ignore --adafactor #12257 (@stas00)
    • Tensorflow QA example #12252 (@Rocketknight1)
    • [tests] reset report_to to none, avoid deprecation warning #12293 (@stas00)
    • [trainer + examples] set log level from CLI #12276 (@stas00)
    • [tests] multiple improvements #12294 (@stas00)
    • Trainer: adjust wandb installation example #12291 (@stefan-it)
    • Fix and improve documentation for LEDForConditionalGeneration #12303 (@ionicsolutions)
    • [Flax] Main doc for event orga #12305 (@patrickvonplaten)
    • [trainer] 2 bug fixes and a rename #12309 (@stas00)
    • [docs] performance #12258 (@stas00)
    • Add CodeCarbon Integration #12304 (@JetRunner)
    • Optimizing away the fill-mask pipeline. #12113 (@Narsil)
    • Add output in a dictionary for TF generate method #12139 (@stancld)
    • Rewrite ProphetNet to adapt converting ONNX friendly #11981 (@jiafatom)
    • Add mention of the huggingface_hub methods for offline mode #12320 (@LysandreJik)
    • [Flax/JAX] Add how to propose projects markdown #12311 (@patrickvonplaten)
    • [TFWav2Vec2] Fix docs #12283 (@chenht2010)
    • Add all XxxPreTrainedModel to the main init #12314 (@sgugger)
    • Conda build #12323 (@LysandreJik)
    • Changed modeling_fx_utils.py to utils/fx.py for clarity #12326 (@michaelbenayoun)
  • v4.7.0(Jun 17, 2021)

    v4.7.0: DETR, RoFormer, ByT5, Hubert, support for torch 1.9.0

    DETR (@NielsRogge)

    Three new models are released as part of the DETR implementation: DetrModel, DetrForObjectDetection and DetrForSegmentation, in PyTorch.

    DETR consists of a convolutional backbone followed by an encoder-decoder Transformer which can be trained end-to-end for object detection. It removes much of the complexity of models like Faster R-CNN and Mask R-CNN, which rely on region proposals, a non-maximum suppression procedure, and anchor generation. Moreover, DETR can be naturally extended to perform panoptic segmentation, by simply adding a mask head on top of the decoder outputs.

    DETR can support any timm backbone.

    The DETR model was proposed in End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.

    • Add DETR #11653 (@NielsRogge)
    • Improve DETR #12147 (@NielsRogge)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=detr
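
    An inference sketch with the facebook/detr-resnet-50 checkpoint (the timm library must be installed, and the image path is a placeholder):

    from PIL import Image
    from transformers import DetrFeatureExtractor, DetrForObjectDetection

    feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

    image = Image.open("street_scene.jpg")  # hypothetical local image
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    # outputs.logits holds per-query class scores, outputs.pred_boxes the normalized boxes.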

    ByT5 (@patrickvonplaten)

    A new tokenizer is released as part of the ByT5 implementation: ByT5Tokenizer. It can be used with the T5 family of models.

    The ByT5 model was presented in ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.

    • ByT5 model #11971 (@patrickvonplaten)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?search=byt5
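
    Because ByT5 operates on raw UTF-8 bytes, the tokenizer needs no vocabulary file; a small sketch with the google/byt5-small checkpoint:

    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")  # resolves to ByT5Tokenizer
    model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    encoded = tokenizer("Life is like a box of chocolates.", return_tensors="pt")
    # Each byte of the string becomes one token id (plus the special tokens).
    print(encoded.input_ids.shape)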

    RoFormer (@JunnYu)

    14 new models are released as part of the RoFormer implementation: RoFormerModel, RoFormerForCausalLM, RoFormerForMaskedLM, RoFormerForSequenceClassification, RoFormerForTokenClassification, RoFormerForQuestionAnswering and RoFormerForMultipleChoice, TFRoFormerModel, TFRoFormerForCausalLM, TFRoFormerForMaskedLM, TFRoFormerForSequenceClassification, TFRoFormerForTokenClassification, TFRoFormerForQuestionAnswering and TFRoFormerForMultipleChoice, in PyTorch and TensorFlow.

    RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown improved performance on classification tasks with long texts. The RoFormer model was proposed in RoFormer: Enhanced Transformer with Rotary Position Embedding by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.

    • Add new model RoFormer (use rotary position embedding ) #11684 (@JunnYu)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=roformer

    HuBERT (@patrickvonplaten)

    HuBERT is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.

    HuBERT was proposed in HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.

    Two new models are released as part of the HuBERT implementation: HubertModel and HubertForCTC, in PyTorch.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=hubert

    • Hubert #11889 (@patrickvonplaten)
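
    An ASR sketch with the fine-tuned facebook/hubert-large-ls960-ft checkpoint, where the raw waveform is assumed to be a 16 kHz float array:

    import numpy as np
    import torch
    from transformers import Wav2Vec2Processor, HubertForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
    model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

    speech = np.zeros(16000, dtype=np.float32)  # placeholder: one second of silence
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print(processor.batch_decode(torch.argmax(logits, dim=-1)))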

    Hugging Face Course - Part 1

    On Monday, June 14th, 2021, we released the first part of the Hugging Face Course. The course is focused on the Hugging Face ecosystem, including transformers. Most of the material in the course is now linked from the transformers documentation, which also includes videos explaining individual concepts.

    • Add video links to the documentation #12162 (@sgugger)
    • Add link to the course #12229 (@sgugger)

    TensorFlow additions

    The Wav2Vec2 model can now be used in TensorFlow:

    • Adding TFWav2Vec2Model #11617 (@will-rice)

    PyTorch 1.9 support

    • Add support for torch 1.9.0 #12224 (@LysandreJik )
    • fix pt-1.9.0 add_ deprecation #12217 (@stas00)

    Notebooks

    • @NielsRogge has contributed five tutorials on the usage of BERT in his repository: Transformers-Tutorials
    • [Community Notebooks] Add Emotion Speech Noteboook #11900 (@patrickvonplaten)

    General improvements and bugfixes

    • Vit deit fixes #11309 (@NielsRogge)
    • Enable option for subword regularization in more tokenizers. #11417 (@PhilipMay)
    • Fix gpt-2 warnings #11709 (@LysandreJik)
    • [Flax] Fix BERT initialization & token_type_ids default #11695 (@patrickvonplaten)
    • BigBird on TPU #11651 (@vasudevgupta7)
    • [T5] Add 3D attention mask to T5 model (2) (#9643) #11197 (@lexhuismans)
    • Fix loading the best model on the last stage of training #11718 (@vbyno)
    • Fix T5 beam search when using parallelize #11717 (@OyvindTafjord)
    • [Flax] Correct example script #11726 (@patrickvonplaten)
    • Add Cloud details to README #11706 (@marcvanzee)
    • Experimental symbolic tracing feature with torch.fx for BERT, ELECTRA and T5 #11475 (@michaelbenayoun)
    • Improvements to Flax finetuning script #11727 (@marcvanzee)
    • Remove tapas model card #11739 (@julien-c)
    • Add visual + link to Premium Support webpage #11740 (@julien-c)
    • Issue with symbolic tracing for T5 #11742 (@michaelbenayoun)
    • [BigBird Pegasus] Make tests faster #11744 (@patrickvonplaten)
    • Use new evaluation loop in TrainerQA #11746 (@sgugger)
    • Flax BERT fix token type init #11750 (@patrickvonplaten)
    • [TokenClassification] Label realignment for subword aggregation #11680 (@Narsil)
    • Fix checkpoint deletion #11748 (@sgugger)
    • Fix incorrect newline in #11650 #11757 (@oToToT)
    • Add more subsections to main doc #11758 (@patrickvonplaten)
    • Fixed: Better names for nlp variables in pipelines' tests and docs. #11752 (@01-vyom)
    • add dataset_name to data_args and added accuracy metric #11760 (@philschmid)
    • Add Flax Examples and Cloud TPU README #11753 (@avital)
    • Fix a bug in summarization example which did not load model from config properly #11762 (@tomy0000000)
    • FlaxGPT2 #11556 (@patil-suraj)
    • Fix usage of head masks by PT encoder-decoder models' generate() function #11621 (@stancld)
    • [T5 failing CI] Fix generate test #11770 (@patrickvonplaten)
    • [Flax MLM] Refactor run mlm with optax #11745 (@patrickvonplaten)
    • Add DOI badge to README #11771 (@albertvillanova)
    • Deprecate commands from the transformers-cli that are in the hf-cli #11779 (@LysandreJik)
    • Fix release utilpattern in conf.py #11784 (@sgugger)
    • Fix regression in regression #11785 (@sgugger)
    • A cleaner and more scalable implementation of symbolic tracing #11763 (@michaelbenayoun)
    • Fix failing test on Windows Platform #11589 (@Lynx1820)
    • [Flax] Align GLUE training script with mlm training script #11778 (@patrickvonplaten)
    • Patch recursive import #11812 (@LysandreJik)
    • fix roformer config doc #11813 (@JunnYu)
    • [Flax] Small fixes in run_flax_glue.py #11820 (@patrickvonplaten)
    • [Deepspeed] support zero.Init in from_config #11805 (@stas00)
    • Add flax text class colab #11824 (@patrickvonplaten)
    • Faster list concat for trainer_pt_utils.get_length_grouped_indices() #11825 (@ctheodoris)
    • Replace double occurrences as the last step #11367 (@LysandreJik)
    • [Flax] Fix PyTorch import error #11839 (@patrickvonplaten)
    • Fix reference to XLNet #11846 (@sgugger)
    • Switch mem metrics flag #11851 (@sgugger)
    • Fix flos single node #11844 (@TevenLeScao)
    • Fix two typos in docs #11852 (@nickls)
    • [Trainer] Report both steps and num samples per second #11818 (@sgugger)
    • Add some tests to the slow suite #11860 (@LysandreJik)
    • Enable memory metrics in tests that need it #11859 (@LysandreJik)
    • fixed a small typo in the CONTRIBUTING doc #11856 (@stsuchi)
    • typo #11858 (@WrRan)
    • Add option to log only once in multinode training #11819 (@sgugger)
    • [Wav2Vec2] SpecAugment Fast #11764 (@patrickvonplaten)
    • [lm examples] fix overflow in perplexity calc #11855 (@stas00)
    • [Examples] create model with custom config on the fly #11798 (@stas00)
    • [Wav2Vec2ForCTC] example typo fixed #11878 (@madprogramer)
    • [AutomaticSpeechRecognitionPipeline] Ensure input tensors are on device #11874 (@francescorubbo)
    • Fix usage of head masks by TF encoder-decoder models' generate() function #11775 (@stancld)
    • Correcting comments in T5Stack to reflect correct tuple order #11330 (@talkhaldi)
    • [Flax] Allow dataclasses to be jitted #11886 (@patrickvonplaten)
    • changing find_batch_size to work with tokenizer outputs #11890 (@joerenner)
    • Link official Cloud TPU JAX docs #11892 (@avital)
    • Flax Generate #11777 (@patrickvonplaten)
    • Update deepspeed config to reflect hyperparameter search parameters #11896 (@Mindful)
    • Adding new argument max_new_tokens for generate. #11476 (@Narsil)
    • Added Sequence Classification class in GPTNeo #11906 (@bhadreshpsavani)
    • [Flax] Return Attention from BERT, ELECTRA, RoBERTa and GPT2 #11918 (@jayendra13)
    • Test optuna and ray #11924 (@LysandreJik)
    • Use self.assertEqual instead of assert in deberta v2 test. #11935 (@PhilipMay)
    • Remove redundant nn.log_softmax in run_flax_glue.py #11920 (@n2cholas)
    • Add MT5ForConditionalGeneration as supported arch. to summarization README #11961 (@PhilipMay)
    • Add FlaxCLIP #11883 (@patil-suraj)
    • RAG-2nd2end-revamp #11893 (@shamanez)
    • modify qa-trainer #11872 (@zhangfanTJU)
    • get_ordinal(local=True) replaced with get_local_ordinal() in training_args.py #11922 (@BassaniRiccardo)
    • reinitialize wandb config for each hyperparameter search run #11945 (@Mindful)
    • Add regression tests for slow sentencepiece tokenizers. #11737 (@PhilipMay)
    • Authorize args when instantiating an AutoModel #11956 (@LysandreJik)
    • Neptune.ai integration #11937 (@vbyno)
    • [deepspeed] docs #11940 (@stas00)
    • typo correction #11973 (@JminJ)
    • Typo in usage example, changed to device instead of torch_device #11979 (@albertovilla)
    • [DeepSpeed] decouple DeepSpeedConfigHF from Trainer #11966 (@stas00)
    • [Trainer] add train loss and flops metrics reports #11980 (@stas00)
    • Bump urllib3 from 1.25.8 to 1.26.5 in /examples/research_projects/lxmert #11983 (@dependabot[bot])
    • [RAG] Fix rag from pretrained question encoder generator behavior #11962 (@patrickvonplaten)
    • Fix examples in VisualBERT docs #11990 (@gchhablani)
    • [docs] fix xref to PreTrainedModel.generate #11049 (@stas00)
    • Update return introduction of forward method #11976 (@kouyk)
    • [deepspeed] Move code and doc into standalone files #11984 (@stas00)
    • [deepspeed] add nvme test skip rule #11997 (@stas00)
    • Fix weight decay masking in run_flax_glue.py #11964 (@n2cholas)
    • [Flax] Refactor MLM #12013 (@patrickvonplaten)
    • [Deepspeed] Assert on mismatches between ds and hf args #12021 (@stas00)
    • [TrainerArguments] format and sort repr, add str #12018 (@stas00)
    • Fixed Typo in modeling_bart.py #12035 (@ceevaaa)
    • Fix deberta 2 Tokenizer Integration Test #12017 (@PhilipMay)
    • fix past_key_values docs #12049 (@patil-suraj)
    • [JAX] Bump jax lib #12053 (@patrickvonplaten)
    • Fixes bug that appears when using QA bert and distilation. #12026 (@madlag)
    • Extend pipelines for automodel tupels #12025 (@Narsil)
    • Add optional grouped parsers description to HfArgumentParser #12042 (@peteriz)
    • adds metric prefix. #12057 (@riklopfer)
    • [CI] skip failing test #12059 (@stas00)
    • Fix LUKE integration tests #12066 (@NielsRogge)
    • Fix tapas issue #12063 (@NielsRogge)
    • updated the original RAG implementation to be compatible with latest Pytorch-Lightning #11806 (@shamanez)
    • Replace legacy tensor.Tensor with torch.tensor/torch.empty #12027 (@mariosasko)
    • Add torch to requirements.txt in language-modeling #12040 (@cdleong)
    • Properly indent block_size #12070 (@sgugger)
    • [Deepspeed] various fixes #12058 (@stas00)
    • [Deepspeed Wav2vec2] integration #11638 (@stas00)
    • Update run_ner.py with id2label config #12001 (@KoichiYasuoka)
    • [wav2vec2 / Deepspeed] sync LayerDrop for Wav2Vec2Encoder + tests #12076 (@stas00)
    • [test] support more than 2 gpus #12074 (@stas00)
    • Wav2Vec2 Pretraining #11306 (@anton-l)
    • [examples/flax] pass decay_mask fn to optimizer #12087 (@patil-suraj)
    • [versions] rm require_version_examples #12088 (@stas00)
    • [Wav2Vec2ForPretraining] Correct checkpoints wav2vec2 & fix tests #12089 (@patrickvonplaten)
    • Add text_column_name and label_column_name to run_ner and run_ner_no_trainer args #12083 (@kumapo)
    • CLIPFeatureExtractor should resize images with kept aspect ratio #11994 (@TobiasNorlund)
    • New TF GLUE example #12028 (@Rocketknight1)
    • Appending label2id and id2label to models for inference #12102 (@Rocketknight1)
    • Fix a condition in test_generate_with_head_masking #11911 (@stancld)
    • [Flax] Adding Visual-Transformer #11951 (@jayendra13)
    • add relevant description to tqdm in examples #11927 (@bhavitvyamalik)
    • Fix head masking generate tests #12110 (@patrickvonplaten)
    • Flax CLM script #12023 (@patil-suraj)
    • Add from_pretrained to dummy timm objects #12097 (@LysandreJik)
    • Fix t5 error message #12136 (@cccntu)
    • Fix megatron_gpt2 attention block's causal mask #12007 (@novatig)
    • Add mlm pretraining xla torch readme #12011 (@patrickvonplaten)
    • add readme for flax clm #12111 (@patil-suraj)
    • [Flax] Add FlaxBart models #11537 (@stancld)
    • Feature to use the PreTrainedTokenizerFast class as a stand-alone tokenizer #11810 (@SaulLu)
    • [Flax] Add links to google colabs #12146 (@patrickvonplaten)
    • Don't log anything before logging is setup in examples #12121 (@sgugger)
    • Use text_column_name variable instead of "text" #12132 (@nbroad1881)
    • [lm examples] Replicate --config_overrides addition to other LM examples #12135 (@kumar-abhishek)
    • [Flax] fix error message #12148 (@patil-suraj)
    • [optim] implement AdafactorSchedule #12123 (@stas00)
    • [style] consistent nn. and nn.functional #12124 (@stas00)
    • [Flax] Fix flax pt equivalence tests #12154 (@patrickvonplaten)
    • [style] consistent nn. and nn.functional: part2: templates #12153 (@stas00)
    • Flax Big Bird #11967 (@vasudevgupta7)
    • [style] consistent nn. and nn.functional: part 3 tests #12155 (@stas00)
    • [style] consistent nn. and nn.functional: part 4 examples #12156 (@stas00)
    • consistent nn. and nn.functional: part 5 docs #12161 (@stas00)
    • [Flax generate] Add params to generate #12171 (@patrickvonplaten)
    • Use a released version of optax rather than installing from Git. #12173 (@avital)
    • Have dummy processors have a from_pretrained method #12145 (@LysandreJik)
    • Add course banner #12157 (@sgugger)
    • Enable add_prefix_space on run_ner if necessary #12116 (@kumapo)
    • Update AutoModel classes in summarization example #12178 (@ionicsolutions)
    • Ray Tune Integration Updates #12134 (@amogkam)
    • [testing] ensure concurrent pytest workers use a unique port for torch.dist #12166 (@stas00)
    • Model card defaults #12122 (@sgugger)
    • Temporarily deactivate torch-scatter while we wait for new release #12181 (@LysandreJik)
    • Temporarily deactivate torchhub test #12184 (@sgugger)
    • [Flax] Add Beam Search #12131 (@patrickvonplaten)
    • updated DLC images and sample notebooks #12191 (@philschmid)
    • Enabling AutoTokenizer for HubertConfig. #12198 (@Narsil)
    • Use yaml to create metadata #12185 (@sgugger)
    • [Docs] fixed broken link #12205 (@bhadreshpsavani)
    • Pipeline update & tests #12207 (@LysandreJik)
    Source code(tar.gz)
    Source code(zip)
  • v4.6.1(May 20, 2021)

    • Fix regression in models for sequence classification used for regression tasks #11785
    • Fix checkpoint deletion when load_best_model_at_end = True #11748
    • Fix evaluation in question answering examples #11746
    • Fix release utils #11784

    Source code(tar.gz)
    Source code(zip)
  • v4.6.0(May 12, 2021)

    v4.6.0: ViT, DeiT, CLIP, LUKE, BigBirdPegasus, MegatronBERT

    Transformers aren't just for text - they can handle a huge range of input types, and there's been a flurry of papers and new models in the last few months applying them to vision tasks that had traditionally been dominated by convolutional networks. With this release, we're delighted to announce that several state-of-the-art pretrained vision and multimodal text+vision transformer models are now accessible in the huggingface/transformers repo. Give them a try!

    ViT (@NielsRogge)

    Two new models are released as part of the ViT implementation: ViTModel and ViTForImageClassification, in PyTorch.

    ViT is an image transformer-based model obtaining state-of-the-art results on image classification tasks. It was the first work to successfully train a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.

    The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=vit
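
    As a quick illustration, a minimal sketch of image classification with ViT might look as follows (the checkpoint and image URL are illustrative placeholders, not part of the release notes; any RGB image works):

    >>> import requests
    >>> from PIL import Image
    >>> from transformers import ViTFeatureExtractor, ViTForImageClassification

    # load an example image
    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    # turn the image into pixel values and classify it
    >>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
    >>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
    >>> inputs = feature_extractor(images=image, return_tensors="pt")
    >>> logits = model(**inputs).logits
    >>> print(model.config.id2label[logits.argmax(-1).item()])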

    DeiT (@NielsRogge)

    Three new models are released as part of the DeiT implementation: DeiTModel, DeiTForImageClassification and DeiTForImageClassificationWithTeacher, in PyTorch.

    DeiT is an image transformer model similar to the ViT model. DeiT (data-efficient image transformers) models are more efficiently trained transformers for image classification, requiring far less data and far fewer computing resources compared to the original ViT models.

    The DeiT model was proposed in Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deit

    • Add DeiT (PyTorch) #11056 (@NielsRogge)

    CLIP (@patil-suraj)

    Three new models are released as part of the CLIP implementation: CLIPModel, CLIPVisionModel and CLIPTextModel, in PyTorch.

    CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.

    The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=clip

    • CLIP #11445 (@patil-suraj)
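
    For instance, zero-shot image classification with CLIP can be sketched as follows (the checkpoint name, image URL and candidate labels are illustrative):

    >>> import requests
    >>> from PIL import Image
    >>> from transformers import CLIPProcessor, CLIPModel

    >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)
    >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

    >>> outputs = model(**inputs)
    # image-text similarity scores, turned into probabilities over the candidate labels
    >>> probs = outputs.logits_per_image.softmax(dim=1)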

    BigBirdPegasus (@vasudevgupta7)

    BigBird is a sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention while being computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context, BigBird has shown improved performance on various long document NLP tasks, such as question answering and summarization, compared to BERT or RoBERTa.

    The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others.

    • Add BigBirdPegasus #10991 (@vasudevgupta7)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=bigbird_pegasus
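
    A minimal summarization sketch using one of the released checkpoints (the generation arguments and the 4096-token limit below are illustrative choices, not prescriptions):

    >>> from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

    >>> tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
    >>> model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")

    >>> long_document = "..."  # your (long) article text
    >>> inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)
    >>> summary_ids = model.generate(**inputs, num_beams=4, max_length=256)
    >>> print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))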

    LUKE (@NielsRogge, @ikuyamada)

    LUKE is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps improve performance on various downstream tasks involving reasoning about entities such as named entity recognition, extractive and cloze-style question answering, entity typing, and relation classification.

    The LUKE model was proposed in LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto.

    • Add LUKE #11223 (@NielsRogge, @ikuyamada)

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=luke
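
    A minimal sketch of retrieving word and entity representations (the checkpoint name and the example sentence/spans are illustrative):

    >>> from transformers import LukeTokenizer, LukeModel

    >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
    >>> model = LukeModel.from_pretrained("studio-ousia/luke-base")

    >>> text = "Beyoncé lives in Los Angeles."
    # character spans of the entities we want representations for
    >>> entity_spans = [(0, 7), (17, 28)]
    >>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> word_embeddings = outputs.last_hidden_state
    >>> entity_embeddings = outputs.entity_last_hidden_state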

    Megatron (@jdemouth)

    The MegatronBERT model is added to the library, giving access to the 345M-parameter variants.

    It is implemented in PyTorch and comes with nine different models: MegatronBertModel, MegatronBertForMaskedLM, MegatronBertForCausalLM, MegatronBertForNextSentencePrediction, MegatronBertForPreTraining, MegatronBertForSequenceClassification, MegatronBertForMultipleChoice, MegatronBertForTokenClassification, and MegatronBertForQuestionAnswering.

    The MegatronBERT model was proposed in Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.

    • Add nvidia megatron models #10911 (@jdemouth)

    Hub integration in Transformers

    The Hugging Face Hub is now better integrated within transformers, through two newly added features (both sketched after the list below):

    • Models, configurations and tokenizers now have a push_to_hub method to automatically push their state to the hub.

    • The Trainer can now automatically push its underlying model, configuration and tokenizer in a similar fashion. Additionally, it is able to create a draft model card on the fly with the training hyperparameters and evaluation results.

    • Auto modelcard #11599 (@sgugger)

    • Trainer push to hub #11328 (@sgugger)
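
    A minimal sketch of the new API (the repository name is a placeholder, and you need to be authenticated, e.g. via huggingface-cli login):

    >>> from transformers import AutoModelForSequenceClassification, AutoTokenizer

    >>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # ... fine-tune the model ...

    # push the weights, config and tokenizer files to your namespace on the hub
    >>> model.push_to_hub("my-finetuned-bert")
    >>> tokenizer.push_to_hub("my-finetuned-bert")

    # when using the Trainer, trainer.push_to_hub() uploads the model together
    # with a draft model card generated from the training run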

    DeepSpeed ZeRO Stage 3 & ZeRO-Infinity

    The Trainer now integrates two additional stages of ZeRO: ZeRO stage 3 for parameter partitioning, and ZeRO-Infinity, which extends CPU offload with NVMe offload.

    • [DeepSpeed] ZeRO Stage 3 #10753 (@stas00) release notes
    • [Deepspeed] ZeRO-Infinity integration plus config revamp #11418 (@stas00) release notes
    • Please read both release notes for the configuration file changes.
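
    As a rough sketch of how this is wired in (the config file name and contents below are placeholders; the full set of ZeRO-3 / ZeRO-Infinity options is described in the release notes linked above), enabling ZeRO stage 3 boils down to pointing the Trainer at a DeepSpeed config and launching the script with the deepspeed launcher:

    >>> from transformers import TrainingArguments, Trainer

    # ds_config_zero3.json is assumed to contain, among other settings,
    # {"zero_optimization": {"stage": 3, ...}} plus optional NVMe offload options;
    # model and train_dataset are assumed to be defined elsewhere
    >>> training_args = TrainingArguments(output_dir="output", deepspeed="ds_config_zero3.json")
    >>> trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    >>> trainer.train()  # run with: deepspeed your_script.py ...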

    Flax

    Flax support is getting more robust, with model code stabilizing and new models being added to the library.

    • [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py #11470 (@patrickvonplaten)
    • [Flax] Add Electra models #11426 (@CoderPat)
    • Adds Flax BERT finetuning example on GLUE #11564 (@marcvanzee)
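
    A minimal sketch of a Flax forward pass (the checkpoint is illustrative; if it does not host Flax weights yet, pass from_pt=True to convert the PyTorch weights on the fly, which requires torch to be installed):

    >>> from transformers import AutoTokenizer, FlaxBertModel

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    >>> model = FlaxBertModel.from_pretrained("bert-base-cased")

    >>> inputs = tokenizer("Flax models are now part of Transformers.", return_tensors="np")
    >>> outputs = model(**inputs)
    >>> last_hidden_state = outputs.last_hidden_state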

    TensorFlow

    We welcome @Rocketknight1 as a TensorFlow contributor. This version includes a brand new TensorFlow example based on Keras, which will be followed by examples covering most tasks. Additionally, more TensorFlow setups are covered by adding support for AMD-based GPUs and M1 Macs.

    • Merge new TF example script #11360 (@Rocketknight1)
    • Update TF text classification example #11496 (@Rocketknight1)
    • run_text_classification.py fix #11660 (@Rocketknight1)
    • Accept tensorflow-rocm package when checking TF availability #11595 (@mvsjober)
    • Add MacOS TF version #11674 (@jplu)

    Pipelines

    Two new pipelines are added:

    • Adding AutomaticSpeechRecognitionPipeline. #11337 (@Narsil)
    • Add the ImageClassificationPipeline #11598 (@LysandreJik)
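
    Both follow the usual pipeline API. A minimal sketch (the checkpoints and file paths are placeholders, and the speech pipeline needs ffmpeg installed to decode audio files):

    >>> from transformers import pipeline

    # speech recognition on a local audio file
    >>> asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
    >>> asr("path/to/audio.flac")

    # image classification on a local image
    >>> image_classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
    >>> image_classifier("path/to/image.jpg")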

    Notebooks

    • [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips #11142 (@Muennighoff)
    • add bigbird-pegasus evaluation notebook #11654 (@vasudevgupta7)
    • Vit notebooks + vit/deit fixes #11309 (@NielsRogge)

    General improvements and bugfixes

    • [doc] gpt-neo #11098 (@stas00)
    • Auto feature extractor #11097 (@sgugger)
    • accelerate question answering examples with no trainer #11091 (@theainerd)
    • dead link fixed #11103 (@cronoik)
    • GPTNeo: handle padded wte (#11078) #11079 (@leogao2)
    • fix: The 'warn' method is deprecated #11105 (@stas00)
    • [examples] fix white space #11099 (@stas00)
    • Dummies multi backend #11100 (@sgugger)
    • Some styling of the training table in Notebooks #11118 (@sgugger)
    • Adds a note to resize the token embedding matrix when adding special … #11120 (@LysandreJik)
    • [BigBird] fix bigbird slow tests #11109 (@vasudevgupta7)
    • [versions] handle version requirement ranges #11110 (@stas00)
    • Adds use_auth_token with pipelines #11123 (@philschmid)
    • Fix and refactor check_repo #11127 (@sgugger)
    • Fix typing error in Trainer class (prediction_step) #11138 (@jannisborn)
    • Typo fix of the name of BertLMHeadModel in BERT doc #11133 (@forest1988)
    • [run_clm] clarify why we get the tokenizer warning on long input #11145 (@stas00)
    • [trainer] solve "scheduler before optimizer step" warning #11144 (@stas00)
    • Add fairscale and deepspeed back to the CI #11147 (@LysandreJik)
    • Updates SageMaker docs for updating DLCs #11140 (@philschmid)
    • Don't duplicate logs in TensorBoard and handle --use_env #11141 (@sgugger)
    • Run mlm pad to multiple for fp16 #11128 (@ak314)
    • [tests] relocate core integration tests #11146 (@stas00)
    • [setup] extras[docs] must include 'all' #11148 (@stas00)
    • Add support for multiple models for one config in auto classes #11150 (@sgugger)
    • [setup] make fairscale and deepspeed setup extras #11151 (@stas00)
    • typo #11152 (@stas00)
    • Fix LogitsProcessor documentation #11130 (@k-tahiro)
    • Correct typographical error in README.md #11161 (@Seyviour)
    • Make get_special_tokens_mask consider all tokens #11163 (@sgugger)
    • Add a special tokenizer for CPM model #11068 (@JetRunner)
    • [examples/translation] support mBART-50 and M2M100 fine-tuning #11170 (@patil-suraj)
    • [examples run_clm] fix _LazyModule hasher error #11168 (@stas00)
    • added json dump and extraction of train run time #11167 (@philschmid)
    • Minor typos fixed #11182 (@cronoik)
    • model_path should be ignored as the checkpoint path #11157 (@tsuchm)
    • Added documentation for data collator. #10941 (@fghuman)
    • Fix typo #11188 (@tma15)
    • Replaced which with who #11183 (@cronoik)
    • Import torch.utils.checkpoint in ProphetNet #11214 (@LysandreJik)
    • Sagemaker test docs update for framework upgrade #11206 (@philschmid)
    • Use MSELoss with single class label in (M)BartForSequenceClassification #11178 (@calpt)
    • wav2vec2 converter: create the proper vocab.json while converting fairseq wav2vec2 finetuned model #11041 (@cceyda)
    • Add Matt as the TensorFlow reference #11212 (@LysandreJik)
    • Fix GPT-2 warnings #11213 (@LysandreJik)
    • fix docs for decoder_input_ids #11221 (@patil-suraj)
    • Add documentation for BertJapanese #11219 (@forest1988)
    • Replace error by warning when loading an architecture in another #11207 (@sgugger)
    • Refactor GPT2 #11225 (@patil-suraj)
    • Doc check: a bit of clean up #11224 (@sgugger)
    • added cache_dir=model_args.cache_dir to all example with cache_dir arg #11220 (@philschmid)
    • Avoid using no_sync on SageMaker DP #11229 (@sgugger)
    • Indent code block in the documentation #11233 (@sgugger)
    • Run CI on deepspeed and fairscale #11172 (@LysandreJik)
    • [Deepspeed] zero3 tests band aid #11235 (@stas00)
    • Wav2Vec2 CommonVoice training - Save the processor before training starts #10910 (@Nithin-Holla)
    • Make "embeddings" plural in warning message within tokenization_utils_base #11228 (@jstremme)
    • Stale bot updated #10562 (@LysandreJik)
    • Close open files to suppress ResourceWarning #11240 (@parakalan)
    • Fix dimention misspellings. #11238 (@odellus)
    • Add prefix to examples in model_doc rst #11226 (@forest1988)
    • [troubleshooting] add 2 points of reference to the offline mode #11236 (@stas00)
    • Fix #10128 #11248 (@sgugger)
    • [deepspeed] test on one node 2 gpus max #11237 (@stas00)
    • Trainer iterable dataset #11254 (@sgugger)
    • Adding pipeline task aliases. #11247 (@Narsil)
    • Support for set_epoch in IterableDataset #11258 (@sgugger)
    • Tokenizer fast save #11234 (@sgugger)
    • update dependency_versions_table #11273 (@stas00)
    • Workflow fixes #11270 (@LysandreJik)
    • Enabling multilingual models for translation pipelines. #10536 (@Narsil)
    • Trainer support for IterableDataset for evaluation and predict #11286 (@sgugger)
    • move device statements outside if statements #11292 (@e-yi)
    • modify double considering special tokens in language_modeling.py #11275 (@taepd)
    • [Trainer] fix the placement on device with fp16_full_eval #11322 (@stas00)
    • [Trainer] Add a progress bar for batches skipped #11324 (@sgugger)
    • Load checkpoint without re-creating the model #11318 (@sgugger)
    • Added translation example script #11196 (@rajvi-k)
    • [Generate] Remove outdated code #11331 (@patrickvonplaten)
    • [GPTNeo] create local attention mask ones #11335 (@patil-suraj)
    • Update to use datasets remove_cloumns method #11343 (@sgugger)
    • Add an error message for Reformer w/ .backward() #11117 (@forest1988)
    • Removed max_length from being mandatory within generate. #11314 (@Narsil)
    • Honor contributors to models #11329 (@sgugger)
    • [deepspeed] fix resume from checkpoint #11352 (@stas00)
    • Examples reorg #11350 (@sgugger)
    • Extract metric_key_prefix during NotebookProgressCallback.on_evaluate #11347 (@lewtun)
    • [testing doc] bring doc up to date #11359 (@stas00)
    • Remove boiler plate code #11340 (@patrickvonplaten)
    • Move old TF text classification script to legacy #11361 (@Rocketknight1)
    • [contributing doc] explain/link to good first issue #11346 (@stas00)
    • Fix token_type_ids error for big_bird model. #11355 (@wlhgtc)
    • [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer #11349 (@patrickvonplaten)
    • [Flax] Correct typo #11374 (@patrickvonplaten)
    • [run_translation.py] fix typo #11372 (@johnson7788)
    • Add space #11373 (@tma15)
    • Correctly cast num_train_epochs to int #11379 (@Rocketknight1)
    • Fix typo #11369 (@penut85420)
    • Fix Trainer with remove_unused_columns=False #11382 (@sgugger)
    • [Flax] Big FlaxBert Refactor #11364 (@patrickvonplaten)
    • [Flax] Typo #11393 (@patrickvonplaten)
    • [Flax] Correct Flax <=> PyTorch conversion #11394 (@patrickvonplaten)
    • Fix small typo in text #11396 (@maksym-del)
    • Fix typos in README for text-classification #11391 (@yoshitomo-matsubara)
    • [Blenderbot] Integration Test should be slow #11395 (@patrickvonplaten)
    • Fixed trainer total_flos relaoding in distributed mode #11383 (@TevenLeScao)
    • [Wav2Vec2] Correct conversion script #11400 (@patrickvonplaten)
    • added support for exporting of T5 models to onnx with past_key_values. #10651 (@Ki6an)
    • Fixing bug in generation #11297 (@nicola-decao)
    • Fix cross-attention head mask for Torch encoder-decoder models #10605 (@stancld)
    • Default to accuracy metric in run_glue_no_trainer #11405 (@sgugger)
    • Enable option for subword regularization in XLMRobertaTokenizer #11149 (@PhilipMay)
    • wrong parentclass in documentation #11410 (@cronoik)
    • EncoderDecoderConfigs should not create new objects #11300 (@cronoik)
    • Updating checkpoint for GPT2ForSequenceClassification #11334 #11434 (@abiolaTresor)
    • [BigBird] enable BigBirdForQuestionAnswering to return pooler output #11439 (@vasudevgupta7)
    • Upgrade Black to version 21.4b0 #11442 (@patrickvonplaten)
    • TF BART models - Add cross_attentions to model output and fix cross-attention head masking #10699 (@stancld)
    • Add basic support for FP16 in SageMaker model parallelism #11407 (@sgugger)
    • Fix link to the TPU launcher script in the pytorch examples #11427 (@amineabdaoui)
    • Typo fixes #11432 (@LSinev)
    • Pass along seed to DistributedSampler #11406 (@sgugger)
    • Clarify description of the is_split_into_words argument #11449 (@kstathou)
    • [docs] fix invalid class name #11438 (@stas00)
    • [Makefile] make sure to test against the local checkout #11437 (@stas00)
    • Give each hub test a different repo name #11453 (@sgugger)
    • [Examples] Fixes inconsistency around eval vs val and predict vs test #11380 (@bhadreshpsavani)
    • Variable Correction for Consistency in Distillation Example #11444 (@jaimeenahn)
    • Remove max length beam scorer #11378 (@GeetDsa)
    • update QuickTour docs to reflect model output object #11462 (@hamelsmu)
    • Finish Making Quick Tour respect the model object #11467 (@hamelsmu)
    • fix docs for decoder_input_ids #11466 (@patil-suraj)
    • Update min versions in README and add Flax #11472 (@sgugger)
    • Update PreTrainedTokenizerBase to check/handle batch length for text_pair parameter #11486 (@hamelsmu)
    • [Docs] remove paragraph on CI from installation instructions #11493 (@hamelsmu)
    • [Flax] Add docstrings & model outputs #11498 (@patrickvonplaten)
    • Reformat to make code clearer in tokenizer call #11497 (@sgugger)
    • solved coefficient issue for the TF version of gelu_fast #11514 (@michaelbenayoun)
    • Split checkpoint from model_name_or_path in examples #11492 (@sgugger)
    • Pin HuggingFace Hub dependency #11502 (@LysandreJik)
    • correct incorrect dimension comment in Longformer model #11494 (@fredo838)
    • Fix sp_model_kwargs param missing at unpickle in XLMRobertaTokenizer #11430 (@PhilipMay)
    • [Master] Make style #11520 (@patrickvonplaten)
    • Update README.md #11489 (@mrm8488)
    • T5 Gradient Checkpointing #11353 (@ceshine)
    • Implement Fast Tokenization for Deberta #11387 (@ShubhamSanghvi)
    • Accepts BatchEncoding in LengthGroupedSampler #11431 (@tma15)
    • Fix do_eval default value in training_args.py #11511 (@bonniehyeon)
    • [examples, translation/summerization] resize token embeds #11524 (@patil-suraj)
    • Run model templates on master #11527 (@LysandreJik)
    • [Examples] Added support for test-file in QA examples with no trainer #11510 (@bhadreshpsavani)
    • Add Stas and Suraj as authors #11526 (@sgugger)
    • Improve task summary docs #11513 (@hamelsmu)
    • [debug utils] activation/weights underflow/overflow detector #11274 (@stas00)
    • [DeepSpeed] fp32 support #11499 (@stas00)
    • Fix examples in M2M100 docstrings #11540 (@lewtun)
    • [Flax BERT/Roberta] few small fixes #11558 (@patil-suraj)
    • [Wav2Vec2] Fix convert #11562 (@patrickvonplaten)
    • Remove datasets submodule. #11563 (@LysandreJik)
    • fix the mlm longformer example by changing [MASK] to #11559 (@fredo838)
    • [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer #11538 (@Muktan)
    • Fix metric computation in run_glue_no_trainer #11569 (@sgugger)
    • Fixes a useless warning in generate. #11566 (@Narsil)
    • Fix checkpointing in SageMaker MP #11481 (@sgugger)
    • Update training tutorial #11533 (@sgugger)
    • [Deepspeed] fix resize_token_embeddings #11572 (@stas00)
    • Add multi-class, multi-label and regression to transformers #11012 (@abhi1thakur)
    • add importlib_metadata as dependency as it is required for py<3.8 #11490 (@cdeepali)
    • Enable added tokens #11325 (@LysandreJik)
    • Make quality scripts work when one backend is missing. #11573 (@sgugger)
    • Removes SageMakerTrainer code but keeps class as wrapper #11587 (@philschmid)
    • Reproducible checkpoint #11582 (@sgugger)
    • [trainer] document resume randomness #11588 (@stas00)
    • [template runner CI] copies need to be fixed too #11585 (@stas00)
    • add importlib_metadata and huggingface_hub as dependency in the conda recipe #11591 (@cdeepali)
    • Pytorch - Lazy initialization of models #11471 (@patrickvonplaten)
    • fix head_mask for albert encoder part(AlbertTransformer) #11596 (@baeseongsu)
    • Fix Python version #11607 (@LysandreJik)
    • fix typo in command #11605 (@vipulraheja)
    • Fix typo in docstring #11611 (@eldarkurtic)
    • Re-styling in seq2seq attention #11613 (@sgugger)
    • [Lazy init] Fix edge cases #11615 (@patrickvonplaten)
    • [cuda ext tests] fixing tests #11619 (@stas00)
    • Fix RNG saves in distributed mode. #11620 (@sgugger)
    • Fix comment in run_clm_no_trainer.py #11624 (@cccntu)
    • make fix copy #11627 (@patrickvonplaten)
    • Reduce to 1 worker and set timeout for GPU TF tests #11633 (@LysandreJik)
    • [self-push CI] sync with self-scheduled #11637 (@stas00)
    • [examples] fix sys.path in conftest.py #11636 (@stas00)
    • [Examples] Check key exists in datasets first #11503 (@oToToT)
    • [Examples] Fix invalid links after reorg #11650 (@oToToT)
    • Update code example #11631 (@NielsRogge)
    • Add missing git dependency for RAG example #11634 (@lhoestq)
    • updated user permissions based on umask #11119 (@bhavitvyamalik)
    • Big Bird Fast Tokenizer implementation #11075 (@tanmaylaud)
    • Save scaler state dict when checkpointing #11663 (@sgugger)
    • [BigBird Pegasus] Add config to auto tokenizer #11667 (@patrickvonplaten)
    • Fixes NoneType exception when topk is larger than one coupled with a small context in the Question-Answering pipeline #11628 (@psorianom)
    • Add --text_column to run_summarization_no_trainer #11673 (@cccntu)
    • Fix docstring of description about input_ids #11672 (@nxznm)
    • Grammar and style edits for the frontpage README #11679 (@Rocketknight1)
    • Fix TF Roberta for mixed precision training #11675 (@jplu)
    • Test checkpointing #11682 (@sgugger)
    • Fix clip docs #11694 (@patil-suraj)
    • [Flax] Updates README and fixes bug #11701 (@marcvanzee)
    • remove defaults to None if optional #11703 (@PhilipMay)
    Source code(tar.gz)
    Source code(zip)
  • v4.5.1(Apr 13, 2021)

  • v4.5.0(Apr 6, 2021)

    v4.5.0: BigBird, GPT Neo, Examples, Flax support

    BigBird (@vasudevgupta7)

    Seven new models are released as part of the BigBird implementation: BigBirdModel, BigBirdForPreTraining, BigBirdForMaskedLM, BigBirdForCausalLM, BigBirdForSequenceClassification, BigBirdForMultipleChoice, BigBirdForQuestionAnswering in PyTorch.

    BigBird is a sparse-attention-based transformer which extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence.

    The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.

    It is released with an accompanying blog post: Understanding BigBird's Block Sparse Attention

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=big_bird

    • BigBird #10183 (@vasudevgupta7)
    • [BigBird] Fix big bird gpu test #10967 (@patrickvonplaten)
    • [Notebook] add BigBird trivia qa notebook #10995 (@patrickvonplaten)
    • [Docs] Add blog to BigBird docs #10997 (@patrickvonplaten)
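
    A minimal encoding sketch (the checkpoint name is one of the released ones; the 4096-token limit is that checkpoint's maximum input length):

    >>> from transformers import AutoTokenizer, BigBirdModel

    >>> tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
    >>> model = BigBirdModel.from_pretrained("google/bigbird-roberta-base")

    >>> long_text = "..."  # your (long) input text
    >>> inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
    >>> outputs = model(**inputs)
    >>> last_hidden_state = outputs.last_hidden_state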

    GPT Neo (@patil-suraj)

    Two new models are released as part of the GPT Neo implementation: GPTNeoModel, GPTNeoForCausalLM in PyTorch.

    GPT-Neo is the code name for a family of transformer-based language models loosely styled around the GPT architecture. EleutherAI's primary goal is to replicate a GPT-3 DaVinci-sized model and open-source it to the public.

    The implementation within Transformers is a GPT2-like causal language model trained on the Pile dataset.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gpt_neo

    • GPT Neo #10848 (@patil-suraj)
    • GPT Neo few fixes #10968 (@patil-suraj)
    • GPT Neo configuration needs to be set to use GPT2 tokenizer #10992 (@LysandreJik)
    • [GPT Neo] fix example in config #10993 (@patil-suraj)
    • GPT Neo cleanup #10985 (@patil-suraj )
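
    A minimal text generation sketch with one of the public EleutherAI checkpoints (the prompt and sampling settings are illustrative):

    >>> from transformers import GPT2Tokenizer, GPTNeoForCausalLM

    >>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    >>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

    >>> inputs = tokenizer("In a shocking finding, scientists discovered", return_tensors="pt")
    >>> output_ids = model.generate(**inputs, do_sample=True, max_length=50)
    >>> print(tokenizer.decode(output_ids[0], skip_special_tokens=True))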

    Examples

    Features have been added to some examples, and additional examples have been added.

    Raw training loop examples

    Based on the accelerate library, examples that completely expose the training loop are now part of the library, for easy customization if you want to try a new research idea! A schematic of the pattern these scripts rely on follows the list below.

    • Expand a bit the presentation of examples #10799 (@sgugger)
    • Add examples/multiple-choice/run_swag_no_trainer.py #10934 (@stancld)
    • Update the example template for a no Trainer option #10865 (@sgugger)
    • Add examples/run_ner_no_trainer.py #10902 (@stancld)
    • Add examples/language_modeling/run_mlm_no_trainer.py #11001 (@hemildesai)
    • Add examples/language_modeling/run_clm_no_trainer.py #11026 (@hemildesai)
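
    Schematically, the no_trainer scripts all rely on the same Accelerate pattern (a simplified sketch, not the actual script contents; model, optimizer and train_dataloader are assumed to be built beforehand and the batches to contain labels):

    >>> from accelerate import Accelerator

    >>> accelerator = Accelerator()
    # wraps model/optimizer/dataloader for the current device and distributed setup
    >>> model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    >>> model.train()
    >>> for batch in train_dataloader:
    ...     loss = model(**batch).loss
    ...     accelerator.backward(loss)
    ...     optimizer.step()
    ...     optimizer.zero_grad()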

    Standardize examples with Trainer

    Thanks to the amazing contributions of @bhadreshpsavani, all examples with Trainer are now standardized: they all support the predict stage and return/save metrics in the same fashion.

    • [Example] Updating Question Answering examples for Predict Stage #10792 (@bhadreshpsavani)
    • [Examples] Added predict stage and Updated Example Template #10868 (@bhadreshpsavani)
    • [Example] Fixed finename for Saving null_odds in the evaluation stage in QA Examples #10939 (@bhadreshpsavani)
    • [trainer] Fixes Typo in Predict Method of Trainer #10861 (@bhadreshpsavani)

    Trainer & SageMaker Model Parallelism

    The Trainer now supports SageMaker model parallelism out of the box, the old SageMakerTrainer is deprecated as a consequence and will be removed in version 5.

    • Merge trainers #10975 (@sgugger)
    • added new notebook and merge of trainer #11015 (@philschmid)

    FLAX

    FLAX support has been widened to support all model heads of the BERT architecture, alongside a general conversion script for checkpoints in PyTorch to be used in FLAX.

    Auto models now have a FLAX implementation.

    • [Flax] Add general conversion script #10809 (@patrickvonplaten)
    • [Flax] Add other BERT classes #10977 (@patrickvonplaten)
    • Refactor AutoModel classes and add Flax Auto classes #11027 (@sgugger)

    General improvements and bugfixes

    • Patches the full import failure and adds a test #10750 (@LysandreJik)
    • Patches full import failure when sentencepiece is not installed #10752 (@LysandreJik)
    • [Deepspeed] Allow HF optimizer and scheduler to be passed to deepspeed #10464 (@cli99)
    • Fix ProphetNet Flaky Test #10771 (@patrickvonplaten)
    • [doc] [testing] extend the pytest -k section with more examples #10761 (@stas00)
    • Wav2Vec2 - fix flaky test #10773 (@patrickvonplaten)
    • [DeepSpeed] simplify init #10762 (@stas00)
    • [DeepSpeed] improve checkpoint loading code plus tests #10760 (@stas00)
    • [trainer] make failure to find a resume checkpoint fatal + tests #10777 (@stas00)
    • [Issue template] need to update/extend who to tag #10728 (@stas00)
    • [examples] document resuming #10776 (@stas00)
    • Check copies blackify #10775 (@sgugger)
    • Smmp batch not divisible by microbatches fix #10778 (@mansimane)
    • Add support for detecting intel-tensorflow version #10781 (@mfuntowicz)
    • wav2vec2: support datasets other than LibriSpeech #10581 (@elgeish)
    • add run_common_voice script #10767 (@patil-suraj)
    • Fix bug in input check for LengthGroupSampler #10783 (@thominj)
    • [file_utils] do not gobble certain kinds of requests.ConnectionError #10235 (@julien-c)
    • from_pretrained: check that the pretrained model is for the right model architecture #10586 (@vimarshc)
    • [examples/seq2seq/README.md] fix t5 examples #10734 (@stas00)
    • Fix distributed evaluation #10795 (@sgugger)
    • Add XLSR-Wav2Vec2 Fine-Tuning README.md #10786 (@patrickvonplaten)
    • addressing vulnerability report in research project deps #10802 (@stas00)
    • fix backend tokenizer args override: key mismatch #10686 (@theo-m)
    • [XLSR-Wav2Vec2 Info doc] Add a couple of lines #10806 (@patrickvonplaten)
    • Add transformers id to hub requests #10811 (@philschmid)
    • wav2vec doc tweaks #10808 (@julien-c)
    • Sort init import #10801 (@sgugger)
    • [wav2vec sprint doc] add doc for Local machine #10828 (@patil-suraj)
    • Add new community notebook - wav2vec2 with GPT #10794 (@voidful)
    • [Wav2Vec2] Small improvements for wav2vec2 info script #10829 (@patrickvonplaten)
    • [Wav2Vec2] Small tab fix #10846 (@patrickvonplaten)
    • Fix: typo in FINE_TUNE_XLSR_WAV2VEC2.md #10849 (@qqhann)
    • Bump jinja2 from 2.11.2 to 2.11.3 in /examples/research_projects/lxmert #10818 (@dependabot[bot])
    • [vulnerability] in example deps fix #10817 (@stas00)
    • Correct AutoConfig call docstrings #10822 (@Sebelino)
    • [makefile] autogenerate target #10814 (@stas00)
    • Fix on_step_begin and on_step_end Callback Sequencing #10839 (@siddk)
    • feat(wandb): logging and configuration improvements #10826 (@borisdayma)
    • Modify the Trainer class to handle simultaneous execution of Ray Tune and Weights & Biases #10823 (@ruanchaves)
    • Use DataCollatorForSeq2Seq in run_summarization in all cases #10856 (@elsanns)
    • [Generate] Add save mode logits processor to remove nans and infs if necessary #10769 (@patrickvonplaten)
    • Make convert_to_onnx runable as script again #10857 (@sgugger)
    • [trainer] fix nan in full-fp16 label_smoothing eval #10815 (@stas00)
    • Fix p_mask cls token masking in question-answering pipeline #10863 (@mmaslankowska-neurosys)
    • Amazon SageMaker Documentation #10867 (@philschmid)
    • [file_utils] import refactor #10859 (@stas00)
    • Fixed confusing order of args in generate() docstring #10862 (@RafaelWO)
    • Sm trainer smp init fix #10870 (@philschmid)
    • Fix test_trainer_distributed #10875 (@sgugger)
    • Add new notebook links in the docs #10876 (@sgugger)
    • error type of tokenizer in init definition #10879 (@ZhengZixiang)
    • [Community notebooks] Add notebook for fine-tuning Bart with Trainer in two langs #10883 (@elsanns)
    • Fix overflowing bad word ids #10889 (@LysandreJik)
    • Remove version warning in pretrained BART models #10890 (@sgugger)
    • Update Training Arguments Documentation: ignore_skip_data -> ignore_data_skip #10891 (@siddk)
    • run_glue_no_trainer: datasets -> raw_datasets #10898 (@jethrokuan)
    • updates sagemaker documentation #10899 (@philschmid)
    • Fix comment in modeling_t5.py #10886 (@lexhuismans)
    • Rename NLP library to Datasets library #10920 (@tomy0000000)
    • [vulnerability] fix dependency #10914 (@stas00)
    • Add ImageFeatureExtractionMixin #10905 (@sgugger)
    • Return global attentions (see #7514) #10906 (@gui11aume)
    • Updated colab links in readme of examples #10932 (@WybeKoper)
    • Fix initializing BertJapaneseTokenizer with AutoTokenizers #10936 (@singletongue)
    • Instantiate model only once in pipeline #10888 (@sgugger)
    • Use pre-computed lengths, if available, when grouping by length #10953 (@pcuenca)
    • [trainer metrics] fix cpu mem metrics; reformat runtime metric #10937 (@stas00)
    • [vulnerability] dep fix #10954 (@stas00)
    • Fixes in the templates #10951 (@sgugger)
    • Sagemaker test #10925 (@philschmid)
    • Fix summarization notebook link #10959 (@philschmid)
    • improved sagemaker documentation for git_config and examples #10966 (@philschmid)
    • Fixed a bug where the pipeline.framework would actually contain a fully qualified model. #10970 (@Narsil)
    • added py7zr #10971 (@philschmid)
    • fix md file to avoid evaluation crash #10962 (@ydshieh)
    • Fixed some typos and removed legacy url #10989 (@WybeKoper)
    • Sagemaker test fix #10987 (@philschmid)
    • Fix the checkpoint for I-BERT #10994 (@LysandreJik)
    • Add more metadata to the user agent #10972 (@sgugger)
    • Enforce string-formatting with f-strings #10980 (@sgugger)
    • In the group by length documentation length is misspelled as legnth #11000 (@JohnnyC08)
    • Fix Adafactor documentation (recommend correct settings) #10526 (@jsrozner)
    • Improve the speed of adding tokens from added_tokens.json #10780 (@cchen-dialpad)
    • Add Vision Transformer and ViTFeatureExtractor #10950 (@NielsRogge)
    • DebertaTokenizer Rework closes #10258 #10703 (@cronoik)
    • [doc] no more bucket #10793 (@julien-c)
    • Layout lm tf 2 #10636 (@atahmasb)
    • fixed typo: logging instead of logger #11025 (@versis)
    • Add a script to check inits are consistent #11024 (@sgugger)
    • fix incorrect case for s|Pretrained|PreTrained| #11048 (@stas00)
    • [doc] fix code-block rendering #11053 (@erensahin)
    • Pin docutils #11062 (@LysandreJik)
    • Remove unnecessary space #11060 (@LysandreJik)
    • Some models have no tokenizers #11064 (@LysandreJik)
    • Documentation about loading a fast tokenizer within Transformers #11029 (@LysandreJik)
    • Add example for registering callbacks with trainers #10928 (@amalad)
    • Replace pkg_resources with importlib_metadata #11061 (@konstin)
    • Add center_crop to ImageFeatureExtractionMixin #11066 (@sgugger)
    • Document common config attributes #11070 (@sgugger)
    • Fix distributed gather for tuples of tensors of varying sizes #11071 (@sgugger)
    • Make a base init in FeatureExtractionMixin #11074 (@sgugger)
    • Add Readme for language modeling scripts with custom training loop and accelerate #11073 (@hemildesai)
    • HF emoji unicode doesn't work in console #11081 (@stas00)
    • added social thumbnail for docs #11083 (@philschmid)
    • added new merged Trainer test #11090 (@philschmid)
    Source code(tar.gz)
    Source code(zip)
  • v4.4.2(Mar 18, 2021)

  • v4.4.0(Mar 16, 2021)

    v4.4.0: S2T, M2M100, I-BERT, mBART-50, DeBERTa-v2, XLSR-Wav2Vec2

    SpeechToText

    Two new models are released as part of the S2T implementation: Speech2TextModel and Speech2TextForConditionalGeneration, in PyTorch.

    Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech signal. It’s a transformer-based seq2seq model, so the transcripts/translations are generated autoregressively.

    The Speech2Text model was proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=speech_to_text

    • Speech2TextTransformer #10175 (@patil-suraj)
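
    A minimal transcription sketch (feature extraction relies on torchaudio; `speech` stands for a 1-D float array of raw 16 kHz audio, e.g. loaded with soundfile):

    >>> from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

    >>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
    >>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")

    >>> inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    >>> generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
    >>> transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)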

    M2M100

    Two new models are released as part of the M2M100 implementation: M2M100Model and M2M100ForConditionalGeneration, in PyTorch.

    M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.

    The M2M100 model was proposed in Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=m2m_100

    • Add m2m100 #10236 (@patil-suraj)
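
    A minimal English-to-French translation sketch (the sentence is illustrative; any of the 100 supported language codes can be used):

    >>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    >>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    >>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en")

    >>> inputs = tokenizer("Life is like a box of chocolates.", return_tensors="pt")
    # force the decoder to start with the French language token
    >>> generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    >>> print(tokenizer.batch_decode(generated, skip_special_tokens=True))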

    I-BERT

    Six new models are released as part of the I-BERT implementation: IBertModel, IBertForMaskedLM, IBertForSequenceClassification, IBertForMultipleChoice, IBertForTokenClassification and IBertForQuestionAnswering, in PyTorch.

    I-BERT is a quantized version of RoBERTa running inference up to four times faster.

    The I-BERT framework in PyTorch makes it possible to identify the best parameters for quantization. Once the model is exported in a framework that supports int8 execution (such as TensorRT), a speedup of up to 4x is visible, with no loss in performance thanks to the parameter search.

    The I-BERT model was proposed in I-BERT: Integer-only BERT Quantization by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=ibert

    • I-BERT model support #10153 (@kssteven418)
    • [IBert] Correct link to paper #10445 (@patrickvonplaten)
    • Add I-BERT to README #10462 (@LysandreJik)
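
    A minimal loading sketch (the checkpoint is one of the released ones; actual integer-only inference additionally involves the quant_mode configuration flag and an int8 runtime such as TensorRT, which is out of scope here):

    >>> from transformers import AutoTokenizer, IBertModel

    >>> tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
    >>> model = IBertModel.from_pretrained("kssteven/ibert-roberta-base")

    >>> inputs = tokenizer("I-BERT targets integer-only inference.", return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> last_hidden_state = outputs.last_hidden_state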

    mBART-50

    MBart-50 is created using the original mbart-large-cc25 checkpoint by extending its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50 languages.

    The MBart model was presented in Multilingual Translation with Extensible Multilingual Pretraining and Finetuning by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=mbart-50

    • Add mBART-50 #10154 (@patil-suraj)
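
    A minimal English-to-French translation sketch (the sentence is illustrative; src_lang and the forced target language token select the translation direction):

    >>> from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
    >>> tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX")

    >>> inputs = tokenizer("The head of the United Nations says there is no military solution in Syria.", return_tensors="pt")
    # translate English to French by forcing the target language token
    >>> generated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
    >>> print(tokenizer.batch_decode(generated, skip_special_tokens=True))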

    DeBERTa-v2

    Five new models are released as part of the DeBERTa-v2 implementation: DebertaV2Model, DebertaV2ForMaskedLM, DebertaV2ForSequenceClassification, DebertaV2ForTokenClassification and DebertaV2ForQuestionAnswering, in PyTorch.

    The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

    It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa.

    Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deberta-v2

    • Integrate DeBERTa v2(the 1.5B model surpassed human performance on Su… #10018 (@BigBird01)
    • DeBERTa-v2 fixes #10328 (@LysandreJik)
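
    A minimal encoding sketch (the tokenizer requires the sentencepiece package; the checkpoint name is one of the released DeBERTa-v2 checkpoints):

    >>> from transformers import DebertaV2Tokenizer, DebertaV2Model

    >>> tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
    >>> model = DebertaV2Model.from_pretrained("microsoft/deberta-v2-xlarge")

    >>> inputs = tokenizer("DeBERTa-v2 uses disentangled attention.", return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> last_hidden_state = outputs.last_hidden_state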

    Wav2Vec2

    XLSR-Wav2Vec2

    The XLSR-Wav2Vec2 model was proposed in Unsupervised Cross-Lingual Representation Learning For Speech Recognition by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.

    The checkpoint corresponding to that model is added to the model hub: facebook/wav2vec2-large-xlsr-53

    • [XLSR-Wav2Vec2] Add multi-lingual Wav2Vec2 models #10648 (@patrickvonplaten)

    Training script

    A fine-tuning script showcasing how the Wav2Vec2 model can be trained has been added.

    • Add Fine-Tuning for Wav2Vec2 #10145 (@patrickvonplaten)

    Further improvements

    The Wav2Vec2 architecture becomes more stable as several changes are made to it. This release introduces feature extractors and processors as the pre-processing components of multi-modal speech models.

    • Deprecate Wav2Vec2ForMaskedLM and add Wav2Vec2ForCTC #10089 (@patrickvonplaten)
    • Fix example in Wav2Vec2 documentation #10096 (@abhishekkrthakur)
    • [Wav2Vec2] Remove unused config #10457 (@patrickvonplaten)
    • [Wav2Vec2FeatureExtractor] smal fixes #10455 (@patil-suraj)
    • [Wav2Vec2] Improve Tokenizer & Model for batched inference #10117 (@patrickvonplaten)
    • [PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer #10324 (@patrickvonplaten)
    • [Wav2Vec2 Example Script] Typo #10547 (@patrickvonplaten)
    • [Wav2Vec2] Make wav2vec2 test deterministic #10714 (@patrickvonplaten)
    • [Wav2Vec2] Fix documentation inaccuracy #10694 (@MikeG112)
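
    With the new processor classes, CTC transcription can be sketched as follows (the fine-tuned English checkpoint is used purely for illustration; `speech` stands for a 1-D float array of raw 16 kHz mono audio):

    >>> import torch
    >>> from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    >>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    >>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    >>> inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    >>> logits = model(inputs.input_values).logits
    >>> predicted_ids = torch.argmax(logits, dim=-1)
    >>> transcription = processor.batch_decode(predicted_ids)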

    AMP & XLA Support for TensorFlow models

    Most of the TensorFlow models are now compatible with automatic mixed precision and have XLA support.

    • Add AMP for TF Albert #10141 (@jplu)
    • Unlock XLA test for TF ConvBert #10207 (@jplu)
    • Making TF BART-like models XLA and AMP compliant #10191 (@jplu)
    • Making TF XLM-like models XLA and AMP compliant #10211 (@jplu)
    • Make TF CTRL compliant with XLA and AMP #10209 (@jplu)
    • Making TF GPT2 compliant with XLA and AMP #10230 (@jplu)
    • Making TF Funnel compliant with AMP #10216 (@jplu)
    • Making TF Lxmert model compliant with AMP #10257 (@jplu)
    • Making TF MobileBert model compliant with AMP #10259 (@jplu)
    • Making TF MPNet model compliant with XLA #10260 (@jplu)
    • Making TF T5 model compliant with AMP and XLA #10262 (@jplu)
    • Making TF TransfoXL model compliant with AMP #10264 (@jplu)
    • Making TF OpenAI GPT model compliant with AMP and XLA #10261 (@jplu)
    • Rework the AMP for TF XLNet #10274 (@jplu)
    • Making TF Longformer-like models compliant with AMP #10233 (@jplu)

    SageMaker Trainer for model parallelism

    We are rolling out experimental support for model parallelism on SageMaker with a new SageMakerTrainer that can be used in place of the regular Trainer. This is a temporary class that will be removed in a future version; the end goal is to have Trainer support this feature out of the box.

    • Add SageMakerTrainer for model paralellism #10122 (@sgugger)
    • Extend trainer logging for sm #10633 (@philschmid)
    • Sagemaker Model Parallel tensoboard writing fix #10403 (@mansimane)
    • Multiple fixes in SageMakerTrainer #10687 (@sgugger)
    • Add DistributedSamplerWithLoop #10746 (@sgugger)

    General improvements and bugfixes

    • [trainer] deepspeed bug fixes and tests #10039 (@stas00)

    • Removing run_pl_glue.py from text classification docs, include run_xnli.py & run_tf_text_classification.py #10066 (@cbjuan)

    • remove token_type_ids from TokenizerBertGeneration output #10070 (@sadakmed)

    • [deepspeed tests] transition to new tests dir #10080 (@stas00)

    • Added integration tests for Pytorch implementation of the ELECTRA model #10073 (@spatil6)

    • Fix naming in TF MobileBERT #10095 (@jplu)

    • [examples/s2s] add test set predictions #10085 (@patil-suraj)

    • Logging propagation #10092 (@LysandreJik)

    • Fix some edge cases in report_to and add deprecation warnings #10100 (@sgugger)

    • Add head_mask and decoder_head_mask to TF LED #9988 (@stancld)

    • Replace strided slice with tf.expand_dims #10078 (@jplu)

    • Fix Faiss Import #10103 (@patrickvonplaten)

    • [RAG] fix generate #10094 (@patil-suraj)

    • Fix TFConvBertModelIntegrationTest::test_inference_masked_lm Test #10104 (@abhishekkrthakur)

    • doc: update W&B related doc #10086 (@borisdayma)

    • Remove speed metrics from default compute objective [WIP] #10107 (@shiva-z)

    • Fix tokenizers training in notebooks #10110 (@n1t0)

    • [DeepSpeed docs] new information #9610 (@stas00)

    • [CI] build docs faster #10115 (@stas00)

    • [scheduled github CI] add deepspeed fairscale deps #10116 (@stas00)

    • Line endings should be LF across repo and not CRLF #10119 (@LysandreJik)

    • Fix TF LED/Longformer attentions computation #10007 (@jplu)

    • remove adjust_logits_during_generation method #10087 (@patil-suraj)

    • [DeepSpeed] restore memory for evaluation #10114 (@stas00)

    • Update run_xnli.py to use Datasets library #9829 (@Qbiwan)

    • Add new community notebook - Blenderbot #10126 (@lordtt13)

    • [DeepSpeed in notebooks] Jupyter + Colab #10130 (@stas00)

    • [examples/run_s2s] remove task_specific_params and update rouge computation #10133 (@patil-suraj)

    • Fix typo in GPT2DoubleHeadsModel docs #10148 (@M-Salti)

    • [hf_api] delete deprecated methods and tests #10159 (@julien-c)

    • Revert propagation #10171 (@LysandreJik)

    • Conversion from slow to fast for BPE spm vocabs contained an error. #10120 (@Narsil)

    • Fix typo in comments #10157 (@mrm8488)

    • Fix typo in comment #10156 (@mrm8488)

    • [Doc] Fix version control in internal pages #10124 (@sgugger)

    • [t5 tokenizer] add info logs #9897 (@stas00)

    • Fix v2 model loading issue #10129 (@BigBird01)

    • Fix datasets set_format #10178 (@sgugger)

    • Fixing NER pipeline for list inputs. #10184 (@Narsil)

    • Add new model to labels that should not stale #10187 (@LysandreJik)

    • Check TF ops for ONNX compliance #10025 (@jplu)

    • [RAG] fix tokenizer #10167 (@patil-suraj)

    • Fix TF template #10189 (@jplu)

    • fix run_seq2seq.py; porting trainer tests to it #10162 (@stas00)

    • Specify dataset dtype #10195 (@LysandreJik)

    • [CI] make the examples sub-group of tests run always #10196 (@stas00)

    • [WIP][examples/seq2seq] move old s2s scripts to legacy #10136 (@patil-suraj)

    • set tgt_lang of MBart Tokenizer for summarization #10205 (@HeroadZ)

    • Store FLOS as floats to avoid overflow. #10213 (@sgugger)

    • Fix add_token_positions in custom datasets tutorial #10217 (@joeddav)

    • [trainer] fix ignored columns logger #10219 (@stas00)

    • Factor out methods #10215 (@LysandreJik)

    • Fix head masking for TFT5 models #9877 (@stancld)

    • [CI] 2 fixes #10248 (@stas00)

    • [trainer] refactor place_model_on_device logic, add deepspeed #10243 (@stas00)

    • [Trainer] doc update #10241 (@stas00)

    • Reduce the time spent for the TF slow tests #10152 (@jplu)

    • Introduce warmup_ratio training argument #10229 (@tanmay17061)

    • [Trainer] memory tracker metrics #10225 (@stas00)

    • Script for distilling zero-shot classifier to more efficient student #10244 (@joeddav)

    • [test] fix func signature #10271 (@stas00)

    • [trainer] implement support for full fp16 in evaluation/predict #10268 (@stas00)

    • [ISSUES.md] propose using google colab to reproduce problems #10270 (@stas00)

    • Introduce logging_strategy training argument #10267 (@tanmay17061)

    • [CI] Kill any run-away pytest processes #10281 (@stas00)

    • Patch zero shot distillation script cuda issue #10284 (@joeddav)

    • Move the TF NER example #10276 (@jplu)

    • Fix example links in the task summary #10291 (@sgugger)

    • fixes #10303 #10304 (@cronoik)

    • [ci] don't fail when there are no zombies #10308 (@stas00)

    • fix typo in conversion script #10316 (@tagucci)

    • Add note to resize token embeddings matrix when adding new tokens to voc #10331 (@LysandreJik)

    • Deprecate prepare_seq2seq_batch #10287 (@sgugger)

    • [examples/seq2seq] defensive programming + expand/correct README #10295 (@stas00)

    • [Trainer] implement gradient_accumulation_steps support in DeepSpeed integration #10310 (@stas00)

    • Loading from last checkpoint functionality in Trainer.train #10334 (@tanmay17061)

    • [trainer] add Trainer methods for metrics logging and saving #10266 (@stas00)

    • Fix evaluation with label smoothing in Trainer #10338 (@sgugger)

    • Fix broken examples/seq2seq/README.md markdown #10344 (@Wikidepia)

    • [bert-base-german-cased] use model repo, not external bucket #10353 (@julien-c)

    • [Trainer/Deepspeed] handle get_last_lr() before first step() #10362 (@stas00)

    • ConvBERT fix torch <> tf weights conversion #10314 (@abhishekkrthakur)

    • fix deprecated reference tokenizer.max_len in glue.py #10220 (@poedator)

    • [trainer] move secondary methods into a separate file #10363 (@stas00)

    • Run GA on every push even on forks #10383 (@LysandreJik)

    • GA: only run model templates once #10388 (@LysandreJik)

    • Bugfix: Removal of padding_idx in BartLearnedPositionalEmbedding #10200 (@mingruimingrui)

    • Remove unused variable in example for Q&A #10392 (@abhishekkrthakur)

    • Ignore unexpected weights from PT conversion #10397 (@LysandreJik)

    • Add support for ZeRO-2/3 and ZeRO-offload in fairscale #10354 (@sgugger)

    • Fix None in add_token_positions - issue #10210 #10374 (@andreabac3)

    • Make Barthez tokenizer tests a bit faster #10399 (@sgugger)

    • Fix run_glue evaluation when model has a label correspondence #10401 (@sgugger)

    • [ci, flax] non-existing models are unlikely to pass tests #10409 (@julien-c)

    • [LED] Correct Docs #10419 (@patrickvonplaten)

    • Add Ray Tune hyperparameter search integration test #10414 (@krfricke)

    • Ray Tune Integration Bug Fixes #10406 (@amogkam)

    • [examples] better model example #10427 (@stas00)

    • Fix conda-build #10431 (@LysandreJik)

    • [run_seq2seq.py] restore functionality: saving to test_generations.txt #10428 (@stas00)

    • updated logging and saving metrics #10436 (@bhadreshpsavani)

    • Introduce save_strategy training argument #10286 (@tanmay17061)

    • Adds terms to Glossary #10443 (@darigovresearch)

    • Fixes compatibility bug when using grouped beam search and constrained decoding together #10475 (@mnschmit)

    • Generate can return cross-attention weights too #10493 (@Mehrad0711)

    • Fix typos #10489 (@WybeKoper)

    • [T5] Fix speed degradation bug t5 #10496 (@patrickvonplaten)

    • feat(docs): navigate with left/right arrow keys #10481 (@ydcjeff)

    • Refactor checkpoint name in BERT and MobileBERT #10424 (@sgugger)

    • remap MODEL_FOR_QUESTION_ANSWERING_MAPPING classes to names auto-generated file #10487 (@stas00)

    • Fix the bug in constructing the all_hidden_states of DeBERTa v2 #10466 (@felixgwu)

    • Smp grad accum #10488 (@sgugger)

    • Remove unsupported methods from ModelOutput doc #10505 (@sgugger)

    • Not always consider a local model a checkpoint in run_glue #10517 (@sgugger)

    • Removes overwrites for output_dir #10521 (@philschmid)

    • Rework TPU checkpointing in Trainer #10504 (@sgugger)

    • [ProphetNet] Bart-like Refactor #10501 (@patrickvonplaten)

    • Fix example of custom Trainer to reflect signature of compute_loss #10537 (@lewtun)

    • Fixing conversation test for torch 1.8 #10545 (@Narsil)

    • Fix torch 1.8.0 segmentation fault #10546 (@LysandreJik)

    • Fixed dead link in Trainer documentation #10554 (@jwa018)

    • Typo correction. #10531 (@cliang1453)

    • Fix embeddings for PyTorch 1.8 #10549 (@sgugger)

    • Stale Bot #10509 (@LysandreJik)

    • Refactoring checkpoint names for multiple models #10527 (@danielpatrickhug)

    • offline mode for firewalled envs #10407 (@stas00)

    • fix tf doc bug #10570 (@Sniper970119)

    • [run_seq2seq] fix nltk lookup #10585 (@stas00)

    • Fix typo in docstring for pipeline #10591 (@silvershine157)

    • wrong model used for BART Summarization example #10582 (@orena1)

    • [M2M100] fix positional embeddings #10590 (@patil-suraj)

    • Enable torch 1.8.0 on GPU CI #10593 (@LysandreJik)

    • tokenization_marian.py: use current_spm for decoding #10357 (@Mehrad0711)

    • [trainer] fix double wrapping + test #10583 (@stas00)

    • Fix version control with anchors #10595 (@sgugger)

    • offline mode for firewalled envs (part 2) #10569 (@stas00)

    • [examples tests] various fixes #10584 (@stas00)

    • Added max_sample_ arguments #10551 (@bhadreshpsavani)

    • [examples tests on multigpu] resolving require_torch_non_multi_gpu_but_fix_me #10561 (@stas00)

    • Check layer types for Optimizer construction #10598 (@sgugger)

    • Speedup tf tests #10601 (@LysandreJik)

    • [docs] How to solve "Title level inconsistent" sphinx error #10600 (@stas00)

    • [FeatureExtractorSavingUtils] Refactor PretrainedFeatureExtractor #10594 (@patrickvonplaten)

    • fix flaky m2m100 test #10604 (@patil-suraj)

    • [examples template] added max_sample args and metrics changes #10602 (@bhadreshpsavani)

    • Fairscale FSDP fix model save #10596 (@sgugger)

    • Fix tests of TrainerCallback #10615 (@sgugger)

    • Fixes an issue in text-classification where MNLI eval/test datasets are not being preprocessed. #10621 (@allenwang28)

    • [M2M100] remove final_logits_bias #10606 (@patil-suraj)

    • Add new GLUE example with no Trainer. #10555 (@sgugger)

    • Copy tokenizer files in each of their repo #10624 (@sgugger)

    • Document Trainer limitation on custom models #10635 (@sgugger)

    • Fix Longformer tokenizer filename #10653 (@LysandreJik)

    • Update README.md #10647 (@Arvid-pku)

    • Ensure metric results are JSON-serializable #10632 (@sgugger)

    • S2S + M2M100 should be available in tokenization_auto #10657 (@LysandreJik)

    • Remove special treatment for custom vocab files #10637 (@sgugger)

    • [S2T] fix example in docs #10667 (@patil-suraj)

    • W2v2 test require torch #10665 (@LysandreJik)

    • Fix Marian/TFMarian tokenization tests #10661 (@LysandreJik)

    • Fixes Pegasus tokenization tests #10671 (@LysandreJik)

    • Onnx fix test #10663 (@mfuntowicz)

    • Fix integration slow tests #10670 (@sgugger)

    • Specify minimum version for sacrebleu #10662 (@LysandreJik)

    • Add DeBERTa to MODEL_FOR_PRETRAINING_MAPPING #10668 (@jeswan)

    • Fix broken link #10656 (@WybeKoper)

    • fix typing error for HfArgumentParser for Optional[bool] #10672 (@bfineran)

    • MT5 integration test: adjust loss difference #10669 (@LysandreJik)

    • Adding new parameter to generate: max_time. #9846 (@Narsil)

    • TensorFlow tests: having from_pt set to True requires torch to be installed. #10664 (@LysandreJik)

    • Add auto_wrap option in fairscale integration #10673 (@sgugger)

    • fix: #10628 expanduser path in TrainingArguments #10660 (@PaulLerner)

    • Pass encoder outputs into GenerationMixin #10599 (@ymfa)

    • [wip] [deepspeed] AdamW is now supported by default #9624 (@stas00)

    • [Tests] RAG #10679 (@patrickvonplaten)

    • enable loading Mbart50Tokenizer with AutoTokenizer #10690 (@patil-suraj)

    • Wrong link to super class #10709 (@cronoik)

    • Distributed barrier before loading model #10685 (@sgugger)

    • GPT2DoubleHeadsModel made parallelizable #10658 (@ishalyminov)

    • split seq2seq script into summarization & translation #10611 (@theo-m)

    • Adding required flags to non-default arguments in hf_argparser #10688 (@Craigacp)

    • Fix backward compatibility with EvaluationStrategy #10718 (@sgugger)

    • Tests run on Docker #10681 (@LysandreJik)

    • Rename zero-shot pipeline multi_class argument #10727 (@joeddav)

    • Add minimum version check in examples #10724 (@sgugger)

    • independent training / eval with local files #10710 (@riklopfer)

    • Flax testing should not run the full torch test suite #10725 (@patrickvonplaten)
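    The two offline-mode entries above (#10407 and #10569) make it possible to run the library behind a firewall using only previously cached files. Below is a minimal sketch of the intended workflow, assuming the checkpoint was downloaded and cached during an earlier online run (the model name is only an example):

    >>> import os

    # Must be set before importing transformers so no network calls are attempted;
    # all files are then resolved from the local cache.
    >>> os.environ["TRANSFORMERS_OFFLINE"] = "1"

    >>> from transformers import AutoTokenizer, AutoModel

    # These calls succeed only if the files were cached on a previous online run.
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> model = AutoModel.from_pretrained("bert-base-uncased")

    # The example scripts additionally honor HF_DATASETS_OFFLINE=1 for the datasets library.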

    Source code(tar.gz)
    Source code(zip)
  • v4.3.3(Feb 24, 2021)

  • v4.3.2(Feb 9, 2021)

  • v4.3.1(Feb 9, 2021)

    This patch release modifies the API of the Wav2Vec2 model: Wav2Vec2ForCTC was added as a replacement for Wav2Vec2ForMaskedLM, which is kept for backwards compatibility but is now deprecated.
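
    A minimal sketch of the replacement, assuming the facebook/wav2vec2-base-960h checkpoint and a dummy 16 kHz waveform (any raw speech array can be substituted):

    >>> import torch
    >>> import numpy as np
    >>> from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC

    # Wav2Vec2ForCTC replaces the now-deprecated Wav2Vec2ForMaskedLM
    >>> tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
    >>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # One second of silence as a placeholder for real 16 kHz speech
    >>> speech = np.zeros(16000, dtype=np.float32)
    >>> inputs = tokenizer(speech, return_tensors="pt")

    >>> logits = model(**inputs).logits
    >>> predicted_ids = torch.argmax(logits, dim=-1)
    >>> transcription = tokenizer.batch_decode(predicted_ids)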

    • Deprecate Wav2Vec2ForMaskedLM and add Wav2Vec2ForCTC #10089 (@patrickvonplaten)
    Source code(tar.gz)
    Source code(zip)