Use fastai-v2 with HuggingFace's pretrained transformers

Overview

FastHugs

Use fastai v2 with HuggingFace's pretrained transformers; see the notebook below that matches your task:

  • Text classification: fasthugs_seq_classification.ipynb
  • Language model pre-training or fine-tuning (RoBERTa only for now): fasthugs_language_model.ipynb

What's New

April 24, 2020

  • Added fasthugs_language_model.ipynb which shows you how to pre-train or fine-tune a Masked Language Model (MLM), RoBERTa in this case, from scratch
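
For reference, the heart of MLM pre-training is the BERT/RoBERTa-style masking scheme: roughly 15% of tokens are selected, and of those 80% are replaced with the mask token, 10% with a random token, and 10% left unchanged. Below is a minimal, framework-free sketch of that scheme (the function name and the handling of special tokens are illustrative, not taken from the notebook):

    import random

    def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), p_mask=0.15):
        "BERT-style masking: returns (masked inputs, labels with -100 at unmasked positions)."
        inputs, labels = list(token_ids), []
        for i, tok in enumerate(token_ids):
            if tok not in special_ids and random.random() < p_mask:
                labels.append(tok)                  # the model must predict the original token here
                r = random.random()
                if r < 0.8:
                    inputs[i] = mask_id             # 80%: replace with the mask token
                elif r < 0.9:
                    inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
                # remaining 10%: keep the original token
            else:
                labels.append(-100)                 # ignored by the cross-entropy loss
        return inputs, labels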

April 17, 2020

  • Added HuggingFace's new get_vocab functionality, a unified API for extracting a tokenizer's vocab
  • Added the new AutoModelForSequenceClassification and AutoConfig HuggingFace functionality to make things tidier (a usage sketch follows this list)
  • Tidied up and refactored FastHugsTokenizer and FastHugsModel
  • OLD demo and vocab files to be deleted soon
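
As a quick illustration of the Auto* classes mentioned above (the checkpoint name and label count are placeholders, not values from the notebooks), the usual pattern is:

    from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

    model_name = "roberta-base"                                   # placeholder checkpoint
    config = AutoConfig.from_pretrained(model_name, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    vocab = tokenizer.get_vocab()                                 # unified API: dict of token -> id
    model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)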

Things You Might Like ( ❤️ ?)

FastHugsTokenizer: A tokenizer wrapper that can be used with fastai-v2's tokenizer.
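
As a rough sketch of what such a wrapper involves (class and argument names here are illustrative, not the actual FastHugs API): fastai v2's Tokenizer only needs a callable that takes a batch of texts and yields lists of tokens, so a HuggingFace tokenizer can be wrapped like this:

    class HFTokenizeFn:
        "Illustrative wrapper: yields HuggingFace sub-word tokens for each text in a batch."
        def __init__(self, hf_tokenizer, max_seq_len=512):
            self.hf_tokenizer, self.max_seq_len = hf_tokenizer, max_seq_len
        def __call__(self, items):
            for text in items:
                # reserve 2 slots for the special tokens added later (e.g. <s> ... </s>)
                yield self.hf_tokenizer.tokenize(text)[: self.max_seq_len - 2]

    # e.g. fastai_tokenizer = Tokenizer.from_df(text_cols='text', tok=HFTokenizeFn(hf_tok), rules=[])
    # (older fastai2 releases used tok_func= instead of tok=)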

FastHugsModel: A model wrapper over the HF models, more or less the same as the wrappers from the HF/fastai-v1 articles mentioned below.
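
Roughly speaking, the wrapper only needs to build an attention mask and unpack the tuple/ModelOutput that HF models return, so that fastai sees a plain logits tensor. A minimal sketch under those assumptions (names are illustrative, not the actual FastHugs class):

    import torch.nn as nn

    class HFModelWrapper(nn.Module):
        "Illustrative wrapper: forwards input ids to a HF model and returns only the logits."
        def __init__(self, hf_model, pad_idx=1):
            super().__init__()
            self.hf_model, self.pad_idx = hf_model, pad_idx
        def forward(self, input_ids):
            attention_mask = (input_ids != self.pad_idx).long()   # mask out padding positions
            outputs = self.hf_model(input_ids, attention_mask=attention_mask)
            return outputs[0]                                     # the logits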

Padding: Settings for the padding token index and for whether the transformer expects left or right padding.
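
For example, the pad token id comes from the tokenizer and the pad side depends on the architecture (XLNet expects left padding, BERT/RoBERTa right padding). A sketch of how these settings are typically wired up, assuming fastai's pad_input transform (the exact before_batch plumbing varies between fastai versions):

    from functools import partial
    from fastai.text.all import pad_input
    from transformers import AutoTokenizer

    model_name = 'roberta-base'                         # placeholder checkpoint
    hf_tokenizer = AutoTokenizer.from_pretrained(model_name)

    pad_idx = hf_tokenizer.pad_token_id                 # e.g. 1 for RoBERTa, 0 for BERT
    pad_first = 'xlnet' in model_name                   # XLNet pads on the left, most models on the right

    pad_batch = partial(pad_input, pad_idx=pad_idx, pad_first=pad_first)
    # then e.g. dls = dsets.dataloaders(bs=16, before_batch=pad_batch)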

Model Splitters: Functions to split the classification head from the model backbone in line with fastai-v2’s new definition of Learner (splitters)
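
In fastai v2 a splitter is just a function that takes the model and returns one list of parameters per layer group, enabling discriminative learning rates and gradual unfreezing. A minimal sketch, assuming a wrapper that exposes the HF model as .hf_model and an architecture whose head lives at .classifier (attribute names differ between architectures):

    from fastai.torch_core import params   # returns a module's parameter list

    def classifier_splitter(model):
        "Two layer groups: transformer backbone vs. classification head."
        body = model.hf_model.base_model    # e.g. the RobertaModel inside RobertaForSequenceClassification
        head = model.hf_model.classifier
        return [params(body), params(head)]

    # then e.g. learn = Learner(dls, model, splitter=classifier_splitter, ...)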

Read these first 👇

This notebook heavily borrows from this notebook, which in turn is based on this tutorial and the accompanying article. Huge thanks to Melissa Rajaram and Maximilien Roberti for these great resources; if you're not familiar with the HuggingFace library, please give them a read first as they are quite comprehensive.

fastai-v2 ✌️ 2️⃣

This paper introduces the v2 version of the fastai library and you can follow and contribute to v2's progress on the forums. This notebook uses the small IMDB dataset and is based on the fastai-v2 ULMFiT tutorial. Huge thanks to Jeremy, Sylvain, Rachel and the fastai community for making this library what it is. I'm super excited about the additional flexibility v2 brings. 🎉

Comments
  • learn.lr_find() produces attribute error

    learn.lr_find() produces attribute error

    I appreciate your effort to connect huggingface to fastai. However, when I run these lines, learn.lr_find() errors as below. I've made no changes to your code with the exception of the setup that I detailed below. I wonder if they've changed things to make your code incompatible.

    learn = Learner(dls, fasthugs_model, opt_func=opt_func, splitter=splitter,
                    loss_func=loss, cbs=cbs, metrics=[accuracy])
    learn.lr_find(show_plot=True)

    produces error:

    AttributeError                            Traceback (most recent call last)
    /content/fastai2/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
        187         try:
    --> 188             self._do_begin_fit(n_epoch)
        189             for epoch in range(n_epoch):

    26 frames
    AttributeError: 'str' object has no attribute 'type'

    During handling of the above exception, another exception occurred:

    AttributeError                            Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/fastprogress/fastprogress.py in on_iter_end(self)
        155         total_time = format_time(time.time() - self.main_bar.start_t)
        156         self.text = f'Total time: {total_time}' + self.text
    --> 157         self.out.update(HTML(self.text))
        158
        159     def add_child(self, child):

    AttributeError: 'NBMasterBar' object has no attribute 'out'

    Run in Google Colab. Setup as specified elsewhere:

    !pip install transformers
    !git clone https://github.com/fastai/fastai2
    %cd fastai2
    !pip install -e ".[dev]"
    !pip uninstall --y fastprogress
    !pip install git+https://github.com/fastai/fastprogress --upgrade

    I posted this to fastaiv2 issues, but got the following response: "It's impossible to fix your problem without seeing the code you have executed. There is a problem when you try to train, that could come either from the data collection, the model, the loss function, the optimizer, and just giving the last line of the error message is not enough."

    While I clearly gave more than just "the last line of the error message" (this is undoubtedly a standard response anyway), it seems like they are indicating that the problem may be in your code.

    opened by randywreed 3
  • TypeError: __init__() got multiple values for argument 'transformer_tokenizer'

    TypeError: __init__() got multiple values for argument 'transformer_tokenizer'

    Hi,

    I am trying to run the fasthugs_seq_classification.ipynb notebook on Colab.

    fastai==2.2.5
    transformers==4.2.1
    tokenizers==0.9.4

    I get the error mentioned below when I try to run:

    splits = ColSplitter()(df)
    x_tfms = [attrgetter("text"), fastai_tokenizer, Numericalize(vocab=tokenizer_vocab_ls), SpecialClsTokens(tokenizer)]
    dsets = Datasets(df, splits=splits, tfms=[x_tfms, [attrgetter("label"), Categorize()]], dl_type=SortedDL)

    Error:

    Process Process-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.6/dist-packages/fastcore/parallel.py", line 118, in _f_pg
        for i,b in enumerate(obj(batch)): queue.put((start_idx+i,b))
      File "/usr/local/lib/python3.6/dist-packages/fastai/text/core.py", line 136, in __call__
        return (L(o).map(self.post_f) for o in self.tok(maps(*self.rules, batch)))
    TypeError: __init__() got multiple values for argument 'transformer_tokenizer'

    (Process Process-2 raises the same traceback)

    PS: I had to make the changes mentioned below to reach this step.

    1. Manually set max_seq_len = 256 to avoid the warning (not sure why this is happening..)

    Original:

    fht = FastHugsTokenizer(transformer_tokenizer=tokenizer, model_name='roberta', max_seq_len=max_seq_len, pretrained=True, pair=False)
    tokenized_text = next(fht(txt))

    Modified:

    fht = FastHugsTokenizer(transformer_tokenizer=tokenizer, model_name='roberta', max_seq_len=256, pretrained=True, pair=False)
    tokenized_text = next(fht(txt))

    2. Removed add_prefix_space=True from the do_tokenize function

    Original:

    def do_tokenize(self, o:str):
        """Returns tokenized text, adds prefix space if needed, limits the maximum sequence length"""
        if 'roberta' in model_name:
            tokens = self.tok.tokenize(o, add_prefix_space=True)[:self.max_seq_len-2]
        else:
            tokens = self.tok.tokenize(o)[:self.max_seq_len-2]
        return tokens

    def __call__(self, items):
        for o in items:
            yield self.do_tokenize(o)

    Modified:

    def do_tokenize(self, o:str):
        """Returns tokenized text, adds prefix space if needed, limits the maximum sequence length"""
        if 'roberta' in model_name:
            tokens = self.tok.tokenize(o)[:self.max_seq_len-2]
        else:
            tokens = self.tok.tokenize(o)[:self.max_seq_len-2]
        return tokens

    def __call__(self, items):
        for o in items:
            yield self.do_tokenize(o)

    3. Removed res_col_name and post_rules, and renamed tok_func to tok

    Original:

    fastai_tokenizer = Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok, rules=[], post_rules=[])

    Modified:

    fastai_tokenizer = Tokenizer.from_df(text_cols='text', tok=fasthugstok, rules=[])

    Thanks in advance! :)

    opened by msakthiganesh 2
  • fix max_len problem

    fix max_len problem

    Hi Morgan, thank you for your hard work, it helps me so much! However, when I tried to run your notebooks, I got some errors.

    First, I think max_seq_len should be assigned a fixed number (e.g. 512), as in my case the default tokenizer.max_len is 1000000000000000019884624838656, which is too large and triggers a CUDA error.

    Second, in the FastHugsTokenizer class, the do_tokenize function should limit the length of the tokenized sequence to (max_len - 2), as BERT-like models add two extra tokens (e.g. [CLS] and [SEP] in BERT, <s> and </s> in RoBERTa). In my case, when I kept the tokenized sequence length at max_len, dls.one_batch() raised an error like this: RuntimeError: Trying to create tensor with negative dimension -2: [-2].

    This PR fixes the above problems.
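
    For illustration only (this sketch is not taken from the PR itself), the two fixes described above amount to capping the reported maximum length and reserving two slots for the special tokens:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('roberta-base')    # placeholder checkpoint
    text = "an example document"

    # Cap the length: some tokenizers report an effectively unbounded maximum
    # (tokenizer.max_len in older transformers, tokenizer.model_max_length today).
    max_seq_len = min(getattr(tokenizer, 'model_max_length', 512), 512)

    # Keep max_seq_len - 2 sub-word tokens so the two special tokens ([CLS]/[SEP] or <s>/</s>) still fit.
    tokens = tokenizer.tokenize(text)[: max_seq_len - 2]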

    opened by leiyu-thunder 2
  • MLM: pass max seq length param to padding function

    MLM: pass max seq length param to padding function

    Hi! First of all, thanks for your work, it's really valuable and helpful.

    My sessions were crashing because a model I use - available on the HF zoo - has tokenizer.max_len == 1000000000000 (I suppose it's not the only one), and max_seq_len is currently not being passed on to the transformer_mlm_padding call. Any override is therefore ignored, which results in very funny 8 TB allocation requests at pad = x.new_zeros(max_len_l[idx]-x.shape[0])+pad_idx.

    This MR fixes it, preserving expected behaviour when max_seq_len is left as None.
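
    For illustration only (the function and argument names below are not the repo's transformer_mlm_padding signature), a padding helper that honours an explicit max_seq_len instead of padding out to the tokenizer's reported maximum would look roughly like this:

    import torch

    def pad_batch(seqs, pad_idx, max_seq_len=None):
        "Right-pad 1-D LongTensors to the longest sequence, optionally capped at max_seq_len."
        target = max(s.shape[0] for s in seqs)
        if max_seq_len is not None:
            target = min(target, max_seq_len)       # cap instead of padding to tokenizer.max_len
        out = torch.full((len(seqs), target), pad_idx, dtype=torch.long)
        for i, s in enumerate(seqs):
            s = s[:target]                          # truncate anything longer than the cap
            out[i, :s.shape[0]] = s
        return out

    # e.g. pad_batch([torch.tensor([5, 6, 7]), torch.tensor([5])], pad_idx=1, max_seq_len=512)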

    opened by riccardoangius 2
  • What versions of dependencies?

    What versions of dependencies?

    I keep running into tons of errors around versioning. This code did not work with the latest versions of torch, fastai, and transformers. I downgraded to:

    !pip install fastai==2.1.2
    !pip install fastcore==1.3.1
    !pip install torch==1.7.0
    

    but it keeps breaking in several spots. Rather than go through each error, could you share what versions you are using of everything in your environment to get this to work?

    opened by connormeaton 0
  • ValueError: Expected target size (12, 50265), got torch.Size([12, 510]) when calling learn.lr_find()

    ValueError: Expected target size (12, 50265), got torch.Size([12, 510]) when calling learn.lr_find()

    When trying to train the learner, there seems to be an issue where the expected target size is the size of the RoBERTa vocab:

    ---------------------------------------------------------------------------
    
    ValueError                                Traceback (most recent call last)
    
    <ipython-input-27-d81c6bd29d71> in <module>()
    ----> 1 learn.lr_find()
    
    18 frames
    
    /usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
       2272         if target.size()[1:] != input.size()[2:]:
       2273             raise ValueError('Expected target size {}, got {}'.format(
    -> 2274                 out_size, target.size()))
       2275         input = input.contiguous()
       2276         target = target.contiguous()
    
    ValueError: Expected target size (12, 50265), got torch.Size([12, 510])
    
    opened by max-rbi 0
  • Sep and fastai_tokenizer

    Sep and fastai_tokenizer

    Brilliant work here, Morgan - really looking forward to using this with my students on a project. Deepest apologies if I'm not doing this right - I'm very new to Github and also not a particularly good programmer.

    It looks like perhaps the FastAI v2 team made a change in Tokenizer that is making it choke on the sep argument when instantiating your custom tokenizer in the fasthugs_language_model notebook.

    class MLMTokenizer(Tokenizer): 
        def __init__(self, tokenizer, rules=None, counter=None, lengths=None, mode=None, **kwargs):  # removed sep=' '
            super().__init__(tokenizer, rules, counter, lengths, mode)  # removed sep
    

    Taking the sep argument out seemed to fix the issue at first, but then the fastai_tokenizer kept the datasets from being created. I checked the various other components and isolated the issue to the tokenizer, but wasn't able to parse the error message that resulted.

    tfms=[attrgetter("text"), fastai_tokenizer, AddSpecialTokens(tokenizer), MLMTokensLabels(tokenizer)]
    dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)
    

    Here are the head and tail of the resulting ten or so pages of error message (again, apologies if I'm not following protocol here):

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-100-070c60545587> in <module>
         11 
         12 #dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)
    ---> 13 dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)
         14 
         15 dsets[0][0][:20], dsets[0][1][:20]
    
    <ipython-input-99-0553a9fb405f> in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
          4     "Doesn't create a tuple in __getitem__ as x is already a tuple"
          5     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    ----> 6         super().__init__(items=items, tfms=tfms, tls=tls, n_inp=n_inp, dl_type=dl_type, **kwargs)
          7 
          8     def __getitem__(self, it)
    
    .
    .  (Pages later)
    .
    
    ~\.conda\envs\fastai2\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
         61 
         62 #
    
    AttributeError: Can't pickle local object 'parallel_gen.<locals>.f'
    

    Anyway, I hope this is helpful. Please keep up the amazing work!

    opened by jrlinton 0
Owner
Morgan McGuire
Enjoying playing around with data in and out of work. Machine learning until I learn better. Having fun along the way.
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a Python package that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)

CIRPLANT This repository contains the code and pre-trained models for Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT) For d

Zheyuan (David) Liu 29 Nov 17, 2022
Pytorch NLP library based on FastAI

Quick NLP Quick NLP is a deep learning NLP library inspired by the fast.ai library. It follows the same API as fastai and extends it allowing for quick

Agis pof 283 Nov 21, 2022
This repo contains simple to use, pretrained/training-less models for speaker diarization.

PyDiar This repo contains simple to use, pretrained/training-less models for speaker diarization. Supported Models Binary Key Speaker Modeling Based o

null 12 Jan 20, 2022
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 1, 2023
ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TensorFlow/Keras.

null 241 Jan 4, 2023
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 2, 2023
Simplified diarization pipeline using some pretrained models - audio file to diarized segments in a few lines of code

simple_diarizer Simplified diarization pipeline using some pretrained models. Made to be a simple as possible to go from an input audio file to diariz

Chau 65 Dec 30, 2022
A library for finding knowledge neurons in pretrained transformer models.

knowledge-neurons An open source repository replicating the 2021 paper Knowledge Neurons in Pretrained Transformers by Dai et al., and extending the t

EleutherAI 96 Dec 21, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 1. Paper: Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
Code for evaluating Japanese pretrained models provided by NTT Ltd.

japanese-dialog-transformers (Japanese description available) This repository provides the information necessary to evaluate the Japanese Transformer Encoder-decoder dialo

NTT Communication Science Laboratories 216 Dec 22, 2022
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

OpenBMB 377 Jan 2, 2023
🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation Korean | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

null 55 Nov 22, 2022
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Hugging Face 15k Jan 2, 2023
KoBART model on huggingface transformers

KoBART-Transformers KoBART, released by SKT, has been ported to transformers for convenient use. Install (Optional): if you use BartModel and PreTrainedTokenizerFast, no installation is needed. p

Hyunwoong Ko 58 Dec 7, 2022