πŸ₯A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

Overview

PyTorch implementation of OpenAI's Finetuned Transformer Language Model

This is a PyTorch implementation of the TensorFlow code provided with OpenAI's paper "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.

This implementation includes a script to load into the PyTorch model the weights pre-trained by the authors with the TensorFlow implementation.

Transformer Language Model

The model classes and loading script are located in model_pytorch.py.

The names of the modules in the PyTorch model follow the names of the Variables in the TensorFlow implementation. This implementation tries to stay as close as possible to the original code to minimize discrepancies.

This implementation thus also includes a modified Adam optimization algorithm, as used in OpenAI's paper, with:

  • fixed weight decay, and
  • a scheduled learning rate with warm-up, as commonly used for Transformers.

Requirements

To use the model itself by importing model_pytorch.py, you just need:

  • PyTorch (version >=0.4)

To run the classifier training script in train.py, you will additionally need:

  • tqdm
  • sklearn
  • spacy
  • ftfy
  • pandas

You can download the weights of the OpenAI pre-trained version by cloning Alec Radford's repo and placing the model folder containing the pre-trained weights in the present repo.

Using the pre-trained model as a Transformer Language Model

The model can be used as a transformer language model with OpenAI's pre-trained weights as follows:

from model_pytorch import TransformerModel, load_openai_pretrained_model, DEFAULT_CONFIG

args = DEFAULT_CONFIG
model = TransformerModel(args)
load_openai_pretrained_model(model)

This model generates Transformer's hidden states. You can use the LMHead class in model_pytorch.py to add a decoder tied with the weights of the encoder and get a full language model. You can also use the ClfHead class in model_pytorch.py to add a classifier on top of the transformer and get a classifier as described in OpenAI's publication. (see an example of both in the __main__ function of train.py)
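
For orientation, here is a minimal sketch of attaching both heads. It is a hedged example rather than a verbatim excerpt from train.py: check model_pytorch.py for the exact constructor signatures, and note that clf_token and x below are placeholders for your classify-token index and an encoded input batch.

from model_pytorch import (TransformerModel, LMHead, ClfHead,
                           load_openai_pretrained_model, DEFAULT_CONFIG)

args = DEFAULT_CONFIG
model = TransformerModel(args)
load_openai_pretrained_model(model)

lm_head = LMHead(model, args)        # decoder tied to the input embedding matrix
clf_head = ClfHead(clf_token, args)  # clf_token: index of your special classify token

h = model(x)                 # x: LongTensor of (token, position) indices, built as in train.py
lm_logits = lm_head(h)       # language-modeling logits over the extended vocabulary
clf_logits = clf_head(h, x)  # one logit per choice, taken at the classify-token position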

To use the positional encoder of the transformer, you should encode your dataset using the encode_dataset() function of utils.py. Please refer to the beginning of the __main__ function in train.py to see how to properly define the vocabulary and encode your dataset.
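
Below is a hedged sketch of that encoding step, loosely modeled on the beginning of train.py's __main__; the file names assume you copied OpenAI's model folder into this repo as described above, and data_dir is a placeholder for your ROCStories location.

from datasets import rocstories
from text_utils import TextEncoder
from utils import encode_dataset

data_dir = 'data/'  # directory containing the ROCStories csv files (adjust to your setup)

# BPE encoder and vocabulary shipped with OpenAI's pre-trained weights
text_encoder = TextEncoder('model/encoder_bpe_40000.json', 'model/vocab_40000.bpe')
n_vocab = len(text_encoder.encoder)

# Encode every split of the dataset into lists of BPE token indices
((trX1, trX2, trX3, trY),
 (vaX1, vaX2, vaX3, vaY),
 (teX1, teX2, teX3)) = encode_dataset(*rocstories(data_dir), encoder=text_encoder)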

Fine-tuning the pre-trained model on a classification task

This model can also be integrated into a classifier, as detailed in OpenAI's paper. An example of fine-tuning on the ROCStories Cloze task is included with the training code in train.py.

The ROCStories dataset can be downloaded from the associated website.

As with the TensorFlow code, this code implements the ROCStories Cloze Test result reported in the paper, which can be reproduced by running:

python -m spacy download en
python train.py --dataset rocstories --desc rocstories --submit --analysis --data_dir [path to data here]

First experiments on the ROCStories test set

Finetuning the PyTorch model for 3 epochs on ROCStories takes 10 minutes on a single NVIDIA K80.

The single-run test accuracy of this PyTorch version is 85.84%, while the authors report a median accuracy of 85.8% with the TensorFlow code, and the paper reports a best single-run accuracy of 86.5%.

The authors' implementation uses 8 GPUs and can thus accommodate a batch of 64 samples, while the present implementation is single-GPU and is consequently limited to 20 instances on a K80 for memory reasons. In our test, increasing the batch size from 8 to 20 samples increased the test accuracy by 2.5 points. Better accuracy may be obtained with a multi-GPU setting (not tried yet).
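
One untested way to try such a multi-GPU setting is PyTorch's built-in data parallelism. The sketch below is not part of the current code (dh_model refers to the model built in train.py), and the batch size and learning-rate schedule would need re-tuning.

import torch

device = torch.device('cuda')
if torch.cuda.device_count() > 1:
    # Replicate the model on every visible GPU and split each batch across them,
    # which allows a batch size closer to the 64 samples used by the authors.
    dh_model = torch.nn.DataParallel(dh_model)
dh_model.to(device)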

The previous SOTA on the ROCStories dataset is 77.6% (the "Hidden Coherence Model" of Chaturvedi et al., published in "Story Comprehension for Predicting What Happens Next", EMNLP 2017, which is a very nice paper too!).

Comments
  • How does Dropout2d help in cloze task?

    class ClfHead(nn.Module):
        """ Classifier Head for the transformer """
    
        def __init__(self, clf_token, cfg):
            super(ClfHead, self).__init__()
            self.n_embd = cfg.n_embd
            self.clf_token = clf_token
            self.dropout = nn.Dropout2d(cfg.clf_pdrop)  # To reproduce the noise_shape parameter of TF implementation
            self.linear = nn.Linear(cfg.n_embd, 1)
            nn.init.normal_(self.linear.weight, std=0.02)
            nn.init.normal_(self.linear.bias, 0)
    
        def forward(self, h, x):
            # Classification logits
            clf_h = h.view(-1, self.n_embd)
            flat = x[:, :, :, 0].contiguous().view(-1)
            clf_h = clf_h[flat == self.clf_token, :]
            clf_h = clf_h.view(-1, x.size(1), self.n_embd, 1)
            clf_h = self.dropout(clf_h)
            clf_h = clf_h.view(-1, self.n_embd)
            clf_logits = self.linear(clf_h)
            return clf_logits.view(-1, x.size(1))
    

    Here self.dropout(clf_h) essentially removes the representation of a sentence and its conclusion; there is a remote chance (0.2 * 0.2) that both representations get removed for a given data item. I am confused about how this aids training.
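
    A standalone illustration of the mechanism (not from the repo): with the [batch, n_choices, n_embd, 1] layout used above, nn.Dropout2d treats each choice as a channel and zeroes its whole embedding vector at once.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    batch, n_choices, n_embd = 2, 2, 4
    clf_h = torch.ones(batch, n_choices, n_embd, 1)  # same layout as in ClfHead.forward

    drop = nn.Dropout2d(p=0.5)  # exaggerated p to make the effect visible
    drop.train()
    print(drop(clf_h).squeeze(-1))
    # Dropout2d treats dim 1 (the choices) as channels: each choice's full n_embd-dim
    # representation is either kept (scaled by 1/(1-p)) or zeroed as a whole.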

    opened by sai-prasanna 12
  • Results and questions on text generation experiments with pretrained LM model

    Dear guys,

    I did some experiments on text generation with the pretrained LM model. I made a PR so you can see the changes: https://github.com/huggingface/pytorch-openai-transformer-lm/pull/35 I have some questions regarding the results.

    1. The generation quality is very poor. The model cannot generate grammatical sentences, let alone long coherent sentences. Here are some snippets: Input some beginning words: I love you , " you said . first . last click ... game ' keep ' ' the zer that

    Input some beginning words: Once upon a time . " freyja , freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja

    Input some beginning words: Everytime . the . - holding . - " nothing in . very ... out . " grin .

    Input some beginning words: I feel very royal . please . at , very , ' ! deserving ... ' something , family , had

    2. At each step, the top 5 candidates for the next token are dominated by the most frequent tokens, e.g. ",", "and", "the", "was", but also include some infrequent tokens, e.g. "-", "f". When these infrequent tokens show up, they are irrelevant to the sentence context. I don't know why.

    3. As the output layer also has weights for the 512 position embeddings, the output dim is 40478 (word indices) + 512 (position indices). The logits for these 512 indices are usually much larger than those of the 40478 word indices, so I have to mask them before the softmax. I find this a bit strange because during pretraining the correct labels are always within the 40478 word indices.

    The paper reported a very low perplexity of 18.4 on the BooksCorpus. I thought the pretrained model should be a very strong LM able to generate high-quality text. The results confused me. Can you give me some advice? Is it because a deep transformer LM is inherently not good at the generation task, or due to some hidden bug in my code?
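
    For reference, a hedged sketch of the masking mentioned in point 3, with a random tensor standing in for the LMHead output (the sizes assume no special tokens, as in pre-training; everything past the first 40478 indices is treated as a position index).

    import torch
    import torch.nn.functional as F

    n_vocab, n_special, n_ctx = 40478, 0, 512
    lm_logits = torch.randn(1, n_vocab + n_special + n_ctx)  # stand-in for the LMHead output

    lm_logits[:, n_vocab + n_special:] = -1e9     # never sample position (or special) indices
    probs = F.softmax(lm_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    print(next_token)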

    • Da Xiao
    opened by xiaoda99 10
  • How should one modify the code to successfully run text classification?

    Hi,

    I am new to PyTorch (but still more at ease with it than TF) so I thought to experiment with @thomwolf 's implementation in this repo (thanks for sharing it!!)

    I would like to try out the code to perform binary text classification of text snippets, similar to the classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.

    These are the steps that I think are needed to get the code working (but I am not sure that these are correct and/or exhaustive):

    1. Create two sets snippets_val.csv and snippets_test.csv containing two columns, text (string) and class (an int equal to 0 or 1).
    2. In datasets.py create two new functions:
      • _snippets returning two lists st, y, and
      • snippets, defined with different values of n_train and n_valid, and whose return statement looks like return (trX, trY), (vaX, vaY), (teX, )
    3. In train.py, rewrite transform_roc into a transform_snippet that doesn't use [delimiter] and takes only one input argument <- somewhat tricky to me, can anyone provide some guidance? (see the sketch at the end of this post)
    4. In train.py, in the encoding bit and afterwards:
    5. In train.py:
    6. In analysis.py:
      • create a new function snippets that invokes _snippets (from datasets.py) to read in snippets_test.csv, and adjust its call to _snippets to take into account that it outputs two lists (not four)
    7. Modify imports in train.py coherently with all of the above.

    Does all of the above make sense as a plan, or can somebody fill in missing bits or provide an alternative list of "sub-steps"? Also, can someone provide some guidance on how to rewrite transform_roc? (Comments on the original code would be fantastic; I am glad to annotate the original function and contribute to the repo as a result of this!)

    Thanks to anyone patiently reading this!
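
    For illustration, here is a hedged sketch of step 3, modeled on transform_roc. It assumes the same module-level encoder, clf_token, max_len, n_ctx, n_vocab and n_special as in train.py's __main__, so the names and offsets are assumptions to double-check.

    import numpy as np

    def transform_snippet(X):
        # Single-sequence analogue of transform_roc: no delimiter, one "choice" per example
        n_batch = len(X)
        xmb = np.zeros((n_batch, 1, n_ctx, 2), dtype=np.int32)
        mmb = np.zeros((n_batch, 1, n_ctx), dtype=np.float32)
        start = encoder['_start_']
        for i, x in enumerate(X):
            seq = [start] + x[:max_len] + [clf_token]
            xmb[i, 0, :len(seq), 0] = seq
            mmb[i, 0, :len(seq)] = 1
        # Second channel holds the position indices that get added to the token embeddings
        xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
        return xmb, mmb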

    opened by davidefiocco 7
  • Why do we need to apply mask while fine tuning?

    In the attention class, you have the following code for masking. I understand the logic for pre-training, but in fine-tuning, if we don't include the language model loss, we should have a check here for not applying the mask. Do we have to always apply the masking because the model was trained that way? Is there an intuitive idea for this? I don't see a necessity for it experimentally.

    This is the line I am talking about: w = w * self.b + -1e9 * (1 - self.b)  # TF implem method: mask_attn_weights
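
    For context, a standalone sketch of what that line does: self.b is a lower-triangular buffer, so each position only attends to itself and earlier positions. Keeping the mask during fine-tuning at least matches the conditions the weights were pre-trained under, though whether it is strictly necessary is a fair question.

    import torch

    n_ctx = 5
    b = torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)  # causal mask buffer
    w = torch.randn(1, 1, n_ctx, n_ctx)                                # raw attention scores

    w = w * b + -1e9 * (1 - b)       # future positions get ~-1e9 ...
    attn = torch.softmax(w, dim=-1)  # ... and vanish after the softmax
    print(attn[0, 0])                # row i has non-zero weight only for columns <= i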

    opened by pranoy-k 4
  • dimensioning bug?

    https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/model_py.py#L77-L84

    The reference implementation and paper use TensorFlow, in which the channel dimension comes last. But in PyTorch, the channel dimension is usually dim=1, right after the batch. So does the equation need to be reversed? Is this done somewhere else in the code?

    so in pytorch it should look like: matmul(v, matmul(k.t(), q))

    example (tensorflow):

    q size [attn_depth=3, feature_depth=2]
    k size [attn_depth=3, feature_depth=2]
    v size [attn_depth=3, feature_depth=2]
    
    weights = matmul(q, k.t()) -> [3, 3]
    result = matmul(weights, v) -> [3, 2]
    

    example (pytorch) with error:

    q size [2, 3]
    k size [2, 3]
    v size [2, 3]
    
    weights = matmul(q, k.t()) -> [2, 2]
    result = matmul(weights, v) -> [2, 3]
    

    Notice that the dimensionality of the weights has been over-reduced: [2, 2] instead of [3, 3].

    example (pytorch) with correction:

    q size [2, 3]
    k size [2, 3]
    v size [2, 3]
    
    weights = matmul(k.t(), q) -> [3, 3]
    result = matmul(v, weights) -> [2, 3]
    

    Apologies if this is handled correctly somewhere; I'm limited on time at the moment for a thorough read.
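
    For a quick check, here is the same computation with an explicit [batch, heads, seq, features] layout and a transposed key, which (as far as I can tell) is what the PyTorch code does instead of moving features to dim 1.

    import torch

    batch, heads, seq, d_head = 1, 1, 3, 2
    q = torch.randn(batch, heads, seq, d_head)
    k = torch.randn(batch, heads, seq, d_head)
    v = torch.randn(batch, heads, seq, d_head)

    w = torch.matmul(q, k.transpose(-2, -1))  # [1, 1, 3, 3]: one weight per (query, key) pair
    out = torch.matmul(w, v)                  # [1, 1, 3, 2]: weighted sum of the values
    print(w.shape, out.shape)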

    opened by jtatusko 4
  • Noise shape dropout

    This PR reproduces the specific behavior of the classifier dropout from OpenAI's original implementation of the article. The details of this patch can be found in issue #11.

    opened by rodgzilla 3
  • Pre-trained LMHead

    Hi!

    First, I would like to thank you for your translation of the implementation of this paper.

    I have read the code and managed to run it, but I am not able to find how to load pre-trained weights into an LMHead to get a general English language model. Did the OpenAI guys not release the weights of this final layer?

    I thought the performance would be better starting the finetuning process from a pre-trained LMHead.

    opened by rodgzilla 3
  • fix the scope of optimizer

    The target of the optimizer should contain clf_head (the new task-specific output matrix) in addition to model (the Transformer encoder). The code might fail to do that, right?
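
    In other words, the fix amounts to something like the following in train.py (a sketch only: model, clf_head and the remaining OpenAIAdam arguments are the ones already defined there).

    from opt import OpenAIAdam

    # Include the randomly initialized classification head, not just the transformer body
    params = list(model.parameters()) + list(clf_head.parameters())
    model_opt = OpenAIAdam(params,
                           lr=6.25e-5, schedule='warmup_linear', warmup=0.002,
                           t_total=n_updates_total)  # keep the other arguments as in train.py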

    opened by soskek 3
  • How does position embedding implementation work?

    So there's TransformerModel's forward method, and I just can't get a hold of the position embedding part (and I might be wrong about the others). As far as I can tell, step by step it goes like this:

    1. Reshape our input to have 3 dimensions -> [ ? x sequences (?) x tokens (512) ]
    2. Get the individual token embeddings -> [ ? x sequences (?) x tokens (512) x emb_dim (768) ]
    3. Sum up those embeddings along axis 2 (summing token embeddings element-wise for each sequence?) -> [ ? x sequences x emb_dim (768) ]
    4. Shouldn't we have [ sequences x tokens (512) x emb_dim (768) ] here?
    def forward(self, x):
            x = x.view(-1, x.size(-2), x.size(-1))
            e = self.embed(x)
            # Add the position information to the input embeddings
            h = e.sum(dim=2)
            for block in self.h:
                h = block(h)
            return h
    

    My questions are:

    • What are the axes of the x, e, and h tensors?
    • How can a sum of an internal part add positional information to our token embeddings?
    • How is that operation equivalent to the paper's, where the position embedding is an external, learned matrix which is added to the token embeddings?

    Thank you in advance!
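
    A hedged reading of the code, for anyone landing on the same question: the last axis of x has size 2 and holds (token index, position index), the position indices being offset into the same embedding matrix (hence its size n_vocab + n_special + n_ctx), so e.sum(dim=2) adds the token embedding and the position embedding elementwise; that is exactly the paper's learned position embedding, just stored in one shared matrix.

    import torch
    import torch.nn as nn

    n_vocab, n_special, n_ctx, n_embd = 10, 0, 4, 3
    embed = nn.Embedding(n_vocab + n_special + n_ctx, n_embd)  # words and positions share one matrix

    tokens = torch.tensor([[1, 5, 2, 7]])                       # [batch, seq]
    positions = torch.arange(n_vocab + n_special,
                             n_vocab + n_special + n_ctx).expand_as(tokens)
    x = torch.stack([tokens, positions], dim=-1)                # [batch, seq, 2]

    e = embed(x)      # [batch, seq, 2, n_embd]
    h = e.sum(dim=2)  # token embedding + position embedding -> [batch, seq, n_embd]
    print(h.shape)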

    opened by bcserna 2
  • help to understand bpe logic

    Hello. Sorry, but I can't understand how this function works. In my tests, in most cases the result is equal to the original token parameter value. https://github.com/openai/finetune-transformer-lm/blob/master/text_utils.py#L49

    opened by BogdanDidenko 2
  • Vocabulary size code explanation and occasionally shape error

    From the model definition, vocab is used to define the size of the embedding matrix.

       vocab = n_vocab + n_special + n_ctx
    

    I am guessing that the n_ctx here is used for the position embeddings, but I'm still not clear on it.

    In my case, I sometimes run into the following shape error if n_ctx is very large.

    Traceback (most recent call last):
      File "/home/vimos/git/QA/pytorch-openai-transformer-lm/train.py", line 413, in <module>
        load_openai_pretrained_model(dh_model.transformer, n_ctx=n_ctx, n_special=n_special)
      File "/home/vimos/git/QA/pytorch-openai-transformer-lm/model_pytorch.py", line 402, in load_openai_pretrained_model
        assert model.embed.weight.shape == init_params[0].shape
    AssertionError: (torch.Size([41140, 768]), (40993, 768))
    

    Can anybody explain the code? Should I restrict n_ctx to a value? Thanks!
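
    For what it's worth, the numbers in the traceback are consistent with n_ctx being the number of position-embedding slots: the pre-trained matrix covers 40478 BPE tokens, 3 special tokens and 512 positions, so an n_ctx above 512 makes the freshly built embedding larger than the checkpoint's. This is a hedged reading; restricting n_ctx to at most 512 should avoid the assertion.

    n_vocab, n_special = 40478, 3   # BPE tokens + _start_ / _delimiter_ / _classify_
    n_positions_pretrained = 512    # position embeddings shipped with the OpenAI weights

    print(n_vocab + n_special + n_positions_pretrained)  # 40993 -> shape of the checkpoint matrix
    print(41140 - (n_vocab + n_special))                 # 659   -> the n_ctx that triggered the error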

    opened by Vimos 2
  • Implementation of Similarity Head

    The Similarity Head and loss function were tested on the STS-B dataset, achieving nearly the same performance as reported (82.45% PC, relative to the 82% in the paper). I can provide the changed code needed for loading the dataset and reproducing my results if wanted.

    opened by TEGELB 0
  • Training from scratch: Repeated and mangled words

    I am trying to use this repository to train a language model with an additional input. My data looks like this:

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”
    β”‚side infoβ”‚startβ”‚The β”‚catβ”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”˜
    

    The labels look like this

    β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
    β”‚The β”‚catβ”‚meowsβ”‚
    β””β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
    

    Since my objective is quite different from the original training script, I implemented the training from scratch, but I noticed that it takes much more time than a simple LSTM model to become somewhat decent, and the results are not fully coherent language even after 15 epochs on 2 million sentences. I am getting outputs that look like this:

    Gold label: In most cases , accurate results can only be achieved after a laborious and expensive trial and error process .

    Output: only most accurate cases can be achieved after a laborious error and process results In trial and expensive suit.

    Currently I am using a small model with 4 layers and 2 heads each.

    I randomly initialized the position encodings and multiplied them by 0.1 to match the variance of my word embeddings.

    Any ideas what I could have missed?

    Here is some of my code

    batch_size = 32
    n_epochs = 100
    max_len = 120
    
    embeddings, emb_weights = load_embeddings(data_path+'de.en.fr.ka.tok.60000.shuf.vec',max_len)
    train_dataset = SortedSentenceDataset(data_path+'train.txt', 200000, max_len, embeddings, 'avg',device)
    train_sampler = train_dataset.get_sampler(batch_size)
    train_loader = DataLoader(train_dataset, batch_size=1, sampler=train_sampler)
    dev_dataset = SortedSentenceDataset(data_path+'valid.txt', 1000, max_len, embeddings, 'avg',device)
    dev_sampler = dev_dataset.get_sampler(batch_size)
    dev_loader = DataLoader(dev_dataset, batch_size=1, sampler=dev_sampler)
    
    args = DEFAULT_CONFIG
    args.n_embd = emb_weights.size(1)
    # Constraint: embedding size % number of heads = 0
    args.n_head = 2
    args.n_layer = 4
    model = load_model(args, emb_weights)
    
    model.to(device)
    
    criterion = torch.nn.CrossEntropyLoss()
    
    optimizer = OpenAIAdam(model.parameters(),
                               lr=6.25e-3,
                               schedule='warmup_linear',
                               warmup=0.02,
                               t_total=n_epochs*len(train_dataset)*20,
                               b1=0.9,
                               b2=0.999,
                               e=1e-8,
                               l2=0.01,
                               vector_l2='store_true',
                               max_grad_norm=1)
    
    best = 1000
    for epoch in range(n_epochs):
        do_epoch(train_loader)
        val_loss = eval(dev_loader)
        print('Validation loss: {}'.format(val_loss))
        if val_loss < best:
            best = val_loss
            print('Saving model')
            torch.save(model.state_dict(),"context-at-each-layer-checkpoint-{}k{}e4b.pt".format(len(train_dataset)//1000,n_epochs))
        print(' '.join(generate(train_dataset,max_len,embeddings)))
    
    opened by maruker 0
  • Instructions for encoding own sentences

    I'd like to use GPT to encode my dataset and use the representations further for the task of question generation. I have problems understanding the code and the names of the arguments in the train.py file (in __main__). Could anyone direct me to some examples (I already searched online) or possibly post some here?

    Cheers

    opened by izaskr 1
  • Running on new dataset similar to rocstories

    Hi all,

    I am trying to train on a new dataset with a structure similar to ROCStories: it has a story part, 2 options and one correct option. I just added a new function in datasets.py, but this is not enough and I am not able to train. Has anyone done this and can provide me with some suggestions?

    Thanks in advance.

    opened by priyanka-chaudhary 0
  • ConvAI

    From the ConvAI slides, it sounds like the Hugging Face submission was based on this model -- is the code for your ConvAI system available somewhere to take a look at? Thanks!

    opened by bkj 0
  • vocab = n_vocab + n_special + n_ctx means?

    I know that n_vocab is the total number of tokens in the encoder dictionary. But when I saw vocab = n_vocab + n_special + n_ctx, I was confused. Maybe n_special is for the start, delimiter and classify tokens, but what is n_ctx? Why add these three things? (Why is there so little commenting of variables and functions... Is there somewhere else to see an explanation of the code?) I am new to the transformer.

    opened by JiahangOK 1