Overview

Improving Transformer Models by Reordering their Sublayers

This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers (video presentation here, summary here).

Our character-level model (and this repo) is based on the Adaptive Attention Span for Transformers model. In our paper we showed that by simply reordering that model's self-attention and feedforward sublayers, we could improve performance on the enwik8 benchmark (where we achieve 0.968 BPC on the test set).

The code here simply adds a way to reorder the sublayers of the Adaptive Span model, using the --architecture parameter.

If you use this code or results from our paper, please cite:

@inproceedings{press-etal-2020-improving,
    title = "Improving Transformer Models by Reordering their Sublayers",
    author = "Press, Ofir and Smith, Noah A. and Levy, Omer",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.270",
    doi = "10.18653/v1/2020.acl-main.270",
    pages = "2996--3005",
}

Requirements

You need CUDA 10 and PyTorch 1.2.0 to run this code. See this page for installation instructions. To replicate our experimental conditions, eight V100 GPUs are needed.
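To sanity-check an environment against these versions, something like the following can be used (this snippet is purely illustrative; it only reports what the current setup provides):

```python
# Quick, illustrative environment check against the versions listed above.
import torch

print(torch.__version__)           # the paper's experiments used PyTorch 1.2.0
print(torch.version.cuda)          # expected to report a CUDA 10.x build
print(torch.cuda.device_count())   # eight V100 GPUs were used in the paper's setup
```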

Running experiments in the paper

The scripts for training the character-level models from the paper are located in the ./experiments/ directory. For example, to train the enwik8 model, run:

bash experiments/enwik8_large.sh

We used eight V100 GPUs, but if you'd like to run this model on GPUs with less memory, you can increase --batch-split (it splits each batch into smaller pieces without changing the final result).
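For intuition, here is a minimal sketch of the idea behind --batch-split, i.e. gradient accumulation over chunks of a batch. It illustrates the concept only; it is not the Adaptive Span implementation, and the model, loss, and shapes are hypothetical.

```python
import torch

def train_step(model, optimizer, loss_fn, batch_x, batch_y, batch_split=2):
    # Process the batch in `batch_split` chunks, accumulate gradients, and
    # take a single optimizer step, matching the result of the unsplit batch.
    optimizer.zero_grad()
    for chunk_x, chunk_y in zip(batch_x.chunk(batch_split), batch_y.chunk(batch_split)):
        loss = loss_fn(model(chunk_x), chunk_y)
        (loss / batch_split).backward()  # scale so accumulated gradients average correctly
    optimizer.step()

# Toy usage with hypothetical shapes:
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_step(model, optimizer, torch.nn.functional.mse_loss,
           torch.randn(8, 16), torch.randn(8, 4), batch_split=4)
```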

We obtained the following results in our experiments:

| Experiment | #params | valid (bpc) | test (bpc) |
| --- | --- | --- | --- |
| enwik8 Sandwich Transformer | 209M | 0.992 | 0.968 |
| text8 Sandwich Transformer | 209M | 1.012 | 1.076 |

The --architecture parameter

A standard transformer with 3 layers (so 6 self-attention and feedforward sublayers) would be trained using --architecture sfsfsf. That 6-sublayer model with a sandwiching coefficient of 1 would be --architecture s.sfsf.f, and with a sandwiching coefficient of 2 it would be --architecture s.s.sf.f.f. Make sure to also set the --nlayers parameter to half the number of sublayers in the architecture string (3 in these examples).
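To make the mapping concrete, here is an illustrative helper (not part of this repository) that reads an architecture string the way it is described above, treating the dots in the examples as visual separators (an assumption):

```python
def parse_architecture(arch: str) -> list:
    # 's' = self-attention sublayer, 'f' = feedforward sublayer; dots are
    # assumed to be visual separators only and are ignored here.
    sublayers = [c for c in arch if c in ("s", "f")]
    assert len(sublayers) % 2 == 0, "expected an even number of sublayers"
    return sublayers

sublayers = parse_architecture("s.s.sf.f.f")  # sandwiching coefficient 2
nlayers = len(sublayers) // 2                 # value to pass as --nlayers (here, 3)
print(sublayers, nlayers)
```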

License

The code is licensed under the CC-BY-NC license. See the LICENSE file for more details.

Acknowledgements + More Information

This code is based on the code of the Adaptive Span model. We recommend reading the Adaptive Span README for further information on this codebase.

Comments
  • How to modify the reported error

    UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate

    /home/zqw/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:143: UserWarning: The epoch parameter in scheduler.step() was not necessary and is being deprecated where possible. Please use scheduler.step() to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.

    (A sketch of the call order PyTorch recommends appears after these comments.)

    opened by wymxz 2
  • position

    Originally, the positional encoding was only added to the input, but here key_pe is added at every transformer layer. What is the benefit of doing it this way? Also, what does the following code do? I don't understand it, could you explain it? (A short, self-contained demo of _skew appears after these comments.)

    ```python
    def _skew(X, pad_value):
        """shift every row 1 step to right"""
        # X = B x M x L
        B, M, L = X.size()
        X = F.pad(X, (0, M + 1), value=pad_value)  # B x M x (L+M+1)
        X = X.view(B, -1)        # B x ML+MM+M
        X = X[:, :-M]            # B x ML+MM
        X = X.view(B, M, M + L)  # B x M x L+M
        return X

    def _unskew(X):
        """reverse _skew operation"""
        # X = B x M x L+M
        B, M, L = X.size()
        L -= M
        X = X.view(B, -1)            # B x ML+MM
        X = F.pad(X, (0, M))         # B x ML+MM+M
        X = X.view(B, M, M + L + 1)  # B x M x L+M+1
        X = X[:, :, :L]              # B x M x L
        return X
    ```

    opened by wymxz 1
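Two notes on the comments above, added for reference.

On the scheduler warning: the warning text itself gives the fix, namely calling optimizer.step() before lr_scheduler.step() and stepping the scheduler without an epoch argument. Below is a minimal, generic sketch of that order; it is not this repository's training loop, and the model, optimizer, and scheduler are purely illustrative.

```python
# Generic sketch of the call order PyTorch expects (illustrative only).
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()   # step the optimizer first ...
    scheduler.step()   # ... then the scheduler, with no epoch argument
```

On the _skew question: the toy tensor below is made up, but it shows what the function does as written, shifting row i of each M x L matrix i steps to the right (with _unskew reversing the operation). The definition is repeated in compact form so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

def _skew(X, pad_value):
    # Same computation as in the comment above, condensed: shift row i
    # of each M x L matrix i steps to the right, padding with pad_value.
    B, M, L = X.size()
    X = F.pad(X, (0, M + 1), value=pad_value)
    X = X.view(B, -1)[:, :-M]
    return X.view(B, M, M + L)

X = torch.arange(1, 7).view(1, 2, 3)  # [[1, 2, 3], [4, 5, 6]]
print(_skew(X, 0))
# tensor([[[1, 2, 3, 0, 0],
#          [0, 4, 5, 6, 0]]])
```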