Transformer training code for sequential tasks

Overview

Sequential Transformer

This is code for training Transformers on sequential tasks such as language modeling. Unlike the original Transformer architecture, it caches previous representations and uses relative position embeddings to better adapt to sequential tasks. In addition, the code implements the projects described below, which are also covered in this blog post.
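The caching idea can be illustrated with a minimal sketch (illustrative only, not the repository's actual code; attend_with_cache and its layer interface are hypothetical): each layer keeps the most recent hidden states from previous segments and attends over them together with the current segment.

    import torch

    def attend_with_cache(layer, h, cache, cache_size):
        # h: (batch, seg_len, d_model) current segment
        # cache: (batch, cache_size, d_model) states from previous segments
        # `layer` is assumed to be any attention module taking explicit
        # query/key/value tensors; this interface is hypothetical.
        h_all = torch.cat([cache, h], dim=1)          # extended context as keys/values
        out = layer(query=h, key=h_all, value=h_all)  # attend over cache + current segment
        new_cache = h_all[:, -cache_size:].detach()   # keep only the most recent states
        return out, new_cache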

Requirements

You need PyTorch 0.4.1 or above and a CUDA-enabled GPU to run the code. If multiple GPUs are available, the code uses nn.DataParallel to utilize them. For better efficiency, enable distributed training with the --distributed argument, which can also run on multiple nodes.
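As a rough sketch of this dispatch (not the repository's actual code; wrap_model and local_rank are illustrative names), the model is wrapped differently depending on the flag:

    import torch
    import torch.nn as nn

    def wrap_model(model, distributed, local_rank=0):
        # Illustrative sketch only. With --distributed, each process owns one
        # GPU and wraps the model in DistributedDataParallel; otherwise a
        # single process spreads the batch across GPUs with nn.DataParallel.
        if distributed:
            torch.cuda.set_device(local_rank)
            # assumes the usual env:// variables (RANK, WORLD_SIZE, MASTER_ADDR, ...)
            # are set by the launcher
            torch.distributed.init_process_group(backend="nccl", init_method="env://")
            model = model.cuda(local_rank)
            return nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
        model = model.cuda()
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model)
        return model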

Adaptive Attention Span

This code can be used to run the experiments from the Adaptive Attention Span in Transformers paper. The adaptive span allows a model to learn an optimal context size for each self-attention head from the training data. As shown in the figure below, only a few heads require a long attention span, which makes it possible to increase the context size to 8k tokens without significantly increasing computation time and memory footprint.

The --adapt-span argument enables adaptive span; otherwise the model will have a fixed attention span. The adaptive span is implemented as an nn.Module to make it easy to plug into other models.
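To give a flavor of how such a module can wrap attention weights, here is a minimal sketch of the soft masking idea from the paper (class name, arguments, and shapes are illustrative and simplified; the real module also handles details such as trimming the cache):

    import torch
    import torch.nn as nn

    class AdaptiveSpanMask(nn.Module):
        # Sketch: each head learns a span z (as a fraction of max_span); attention
        # weights are multiplied by a soft ramp mask and then renormalized.
        def __init__(self, max_span, n_heads, ramp_size=32, init_val=0.0):
            super().__init__()
            self.max_span = max_span
            self.ramp_size = ramp_size
            # one trainable span parameter per head
            self.current_val = nn.Parameter(torch.zeros(n_heads, 1, 1) + init_val)

        def forward(self, attn):
            # attn: (batch, n_heads, q_len, span), keys ordered oldest -> newest
            span = attn.size(-1)
            pos = torch.arange(span - 1, -1, -1.0, device=attn.device)  # distance to the query
            z = self.current_val.clamp(0, 1) * self.max_span            # learned span per head
            mask = ((z - pos) / self.ramp_size + 1.0).clamp(0, 1)       # (n_heads, 1, span)
            attn = attn * mask
            return attn / (attn.sum(-1, keepdim=True) + 1e-8)           # renormalize

        def get_loss(self):
            # L1 penalty on the spans, encouraging each head to stay short
            return self.max_span * self.current_val.clamp(0, 1).mean()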

Running experiments in the paper

Scripts for running the experiments in the paper are located in the ./experiments/ directory. For example, a smaller 8-layer version of our model can be trained on a single GPU by running:

bash experiments/enwik8_small.sh

It should reach about 1.3 bpc on dev after 150k steps.

For training larger models, multiple GPUs are recommended. In the script files, you can configure the number of available GPUs. Increase the --batch-split argument if you run out of GPU memory (it splits batches into smaller pieces without changing the final result).
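Conceptually, --batch-split performs gradient accumulation over sub-batches, which is why it changes only peak memory and not the final result. A minimal sketch with illustrative names:

    def train_step(model, criterion, optimizer, X, Y, batch_split):
        # Split the batch into `batch_split` chunks and accumulate gradients;
        # scaling each loss by 1/batch_split keeps the accumulated gradients
        # identical to a single pass over the full batch.
        optimizer.zero_grad()
        for x, y in zip(X.chunk(batch_split, dim=0), Y.chunk(batch_split, dim=0)):
            loss = criterion(model(x), y) / batch_split
            loss.backward()
        optimizer.step()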

We obtained the following results in our experiments:

Experiment   | #params | dev      | test
enwik8       | 38M     | 1.04 bpb | 1.02 bpb
enwik8_large | 209M    | 1.00 bpb | 0.98 bpb
text8        | 39M     | 1.05 bpc | 1.11 bpc
text8_large  | 209M    | 1.01 bpc | 1.07 bpc

Training a large model takes about 1.2 sec/batch near the end (initially it is faster because the attention spans are smaller) on 8 V100 GPUs. So, for example, the whole enwik8_large training of 170k steps should take less than 2.4 days.
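That estimate follows directly from the per-batch time:

    170,000 steps × 1.2 s/step ≈ 204,000 s ≈ 57 hours ≈ 2.4 days

and it is a slight overestimate, since the early batches are faster.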

Pre-trained models

You can download pre-trained models by running the get_pretrained.sh script. The same scripts in ./experiments/ can then be used to evaluate those models. Since the download script puts models in ./checkpoints/, make sure there is no file with the same name there. Note that these pre-trained models were obtained by rerunning the training scripts after the code cleanup, so there are small differences from the results above due to the randomness of training.
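For example (assuming the corresponding checkpoint was downloaded into ./checkpoints/ by the script):

    bash get_pretrained.sh
    bash experiments/enwik8_small.sh   # with the checkpoint present, the script evaluates it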

All-attention Network

The code can also be used for training All-attention Networks, introduced in Augmenting Self-attention with Persistent Memory. If the --pers-mem-size argument is set to N, all feed-forward sublayers are removed from the model and N persistent memory vectors are added to every self-attention sublayer (see the sketch after the table below). The following experiments can be found in the ./experiments/ directory.

Experiment           | #params | dev        | test
enwik8_pers_small.sh | 39M     | 1.03 bpb   | 1.01 bpb
enwik8_pers.sh       | 114M    | 1.00 bpb   | 0.98 bpb
wiki103_pers.sh      | 133M    | 18.8 ppl * | 19.7 ppl *

(* This number is slightly better than in the paper because it includes the end-of-line as a token.)
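A minimal sketch of the persistent memory idea (names and shapes are illustrative, not the repository's implementation): each head gets N trainable key/value vectors that are concatenated to the regular keys and values, so every position can also attend to them in place of the removed feed-forward sublayer.

    import torch
    import torch.nn as nn

    class PersistentMemoryAttention(nn.Module):
        # Sketch: persistent key/value vectors shared across all positions,
        # standing in for the removed feed-forward sublayer.
        def __init__(self, d_model, n_heads, pers_mem_size):
            super().__init__()
            head_dim = d_model // n_heads
            self.scale = head_dim ** -0.5
            self.pers_k = nn.Parameter(torch.randn(n_heads, pers_mem_size, head_dim) * 0.02)
            self.pers_v = nn.Parameter(torch.randn(n_heads, pers_mem_size, head_dim) * 0.02)

        def forward(self, q, k, v):
            # q, k, v: (batch, n_heads, seq_len, head_dim)
            B = q.size(0)
            # append the persistent vectors to the keys and values of every example
            k = torch.cat([k, self.pers_k.unsqueeze(0).expand(B, -1, -1, -1)], dim=2)
            v = torch.cat([v, self.pers_v.unsqueeze(0).expand(B, -1, -1, -1)], dim=2)
            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            return attn @ v   # (batch, n_heads, seq_len, head_dim)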

License

The code is licensed under the CC-BY-NC license. See the LICENSE file for more details.

Acknowledgement

We thank Xavier Martinet for helping with cleaning the code. The data preprocessing scripts were downloaded from the awd-lstm and transformer-XL repos. The adagrad_with_grad_clip.py file is mostly adapted from PyTorch.

Comments
  • Understanding adaptive-span loss

    Hi,

    Sorry to bother you. I have gone through the paper several times and looked at the code many times; I just had one query about the adaptive span loss. Here's what I interpreted: the parameter self.current_val = nn.Parameter(torch.zeros(*shape) + init_val) is responsible for calculating the loss, mask, and span. In this case the parameter will be initialized with zero values, since init_val is 0 in your config (so the mean of all the values of the parameter will be 0).

    My question is: how does this parameter get updated?

    When I call adaptive_span.get_loss(), it in turn computes self._loss_coeff * self._max_span * self._mask.current_val.mean(), which will also return 0. When I call adaptive_span.clamp_param(), nothing happens, since all the values inside the parameter were initialized to 0. These are the only two function calls happening inside the train method. Can you please point out what I am missing?

    opened by prajjwal1 7
  • Question: How to reduce the memory in this project

    Hi, I read your paper and it's great. I'm very interested in how the memory is reduced in a real project.

    I guess the memory-saving part is here: https://github.com/facebookresearch/adaptive-span/blob/d882404be50f488d85683b8b925f0c6aef33e9f3/adaptive_span.py#L127

    But I only see that you trim key_pe, which reduces memory just a little and, I think, doesn't help reduce the Q/K memory.

    So, can you explain how the memory is reduced in the code?

    thanks

    opened by yangyaofei 7
  • BPC

    Scripts in the experiments directory calculate bits per byte, not bits per character. Am I right?

    This matters when comparing character or word perplexities.

    For example, for English enwik8 the chars-to-bytes ratio is 1.0033040809995477: BPB 1.0 -> byte perplexity 2.718 -> char perplexity 2.727.

    For Polish the chars-to-bytes ratio is 1.0505100080652954: BPB 1.0 -> byte perplexity 2.718 -> char perplexity 2.859.

    opened by djstrong 6
  • The way you preprocess data is different from that of Transformer-XL

    I noticed that you add an <eos> token at the end of each line: https://github.com/facebookresearch/adaptive-span/blob/master/data.py#L34

    But in Transformer-XL's code, they do not add <eos> for enwik8 and text8: https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/data_utils.py#L205

    In my experience, on enwik8 (where lines are short), using <eos> makes the final bpc/bpb about 0.02 lower. It would be better to use the same setting for a fair comparison.

    opened by yzh119 5
  • Warning with PyTorch 1.4

    UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate

    opened by djstrong 4
  • did you try to start with maximum possible cache size

    Just curious: would the results (the final span length and prediction accuracy) still hold if you start with the maximum cache size, i.e. initialize with torch.ones (need to reduce S) or with random numbers between 0 and 1 instead of zeros?

    https://github.com/facebookresearch/adaptive-span/blob/a8d90b8a8481ef1ae50a73b696c290aa88d34744/adaptive_span.py#L34

    opened by rouniuyizu 2
  • Accept a mask to remove padding in batch

    @tesatory and team.

    Thank you for releasing the Adaptive Span Transformer. For me it is the best version of the transformer so far!

    One thing I noticed, comparing to another (great) transformer (https://github.com/idiap/fast-transformers), is that when I pass a mask of padded items in the forward call, the model converges much faster.

    Is this something that could be added to adaptive-span?

    opened by bratao 1
  • confuse

    Are the dev and test results both computed on the test data set?

    Experiment   | #params | dev      | test
    enwik8       | 38M     | 1.04 bpb | 1.02 bpb
    enwik8_large | 209M    | 1.00 bpb | 0.98 bpb
    text8        | 39M     | 1.05 bpc | 1.11 bpc
    text8_large  | 209M    | 1.01 bpc | 1.07 bpc

    Do they need to be evaluated multiple times on the test set? When I reproduce the model, the train and valid bpcs are much larger than those obtained on the test set. Is that normal?

    opened by wymxz 1
  • Generate text

    How can I generate text given some seed?

    Is there a better way than iterating one byte at a time with https://github.com/facebookresearch/adaptive-span/blob/a8d90b8a8481ef1ae50a73b696c290aa88d34744/trainer.py#L20?

    opened by djstrong 1
  • Queries about adaptive span

    Hi, I had a few queries:

    • Does the adaptive span change over time as the model sees more data, or is the span static? In my experiments, the spans do not seem to change for some reason.
    • Secondly, as long as the values in current_val lie in [0, 1], the adaptive span loss won't change, right, since you are using _clamp(0, 1)? So how much weight does this loss carry?
    opened by prajjwal1 1
  • Compute attention span of individual attention heads

    I am working on model interpretability and wish to learn more about what each head is looking at and its attention span (similar to the graphs in the paper). Could you please share what you used to get the span of an individual head?

    opened by prajjwal1 1
  • Please convert to a permissive license

    Other Facebook projects like React use permissive licenses like MIT. Would it be possible to relicense this for commercial use so that startups could also participate in development?

    opened by bionicles 0
  • Understanding graphs from papers

    Thanks for replying to my previous questions. I had a few queries about Fig. 3 of your paper.

    1. In Average Span vs Span Limit (central graph), you showed that for the fixed-span model the span increases as the span limit increases. As per your code base, spans are only monitored by current_val if adapt_span_enabled is set to True (line). For a fixed-span model that flag is false, so AdaptiveSpan won't monitor the span. How did you measure the span of the fixed model?

    2. In FLOPS vs Span Limit, you showed that FLOPS keep increasing for the fixed-span model while for the adaptive span they stay roughly constant. After thorough inspection, FLOPS are constant with adaptive span, but they don't seem to be rising with standard attention either; in both cases the FLOPS are the same. Could you please share some insights?

    Thanks

    opened by prajjwal1 0