Source code of paper "BP-Transformer: Modelling Long-Range Context via Binary Partitioning"

Related tags

Text Data & NLP BPT
Overview

BP-Transformer

This repo contains the code for our paper

BP-Transformer: Modeling Long-Range Context via Binary Partition

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang

The code is written in DGL with PyTorch as backend.

Requirements

  • torchtext 0.4
  • dgl 0.4 (the code on master branch is not compatible with dgl 0.5, please checkout develop branch for dgl 0.5 compatible version).
  • yaml
  • spacy
  • PyTorch 1.1+

Usage

For Multi-GPU training, please export NCCL_LL_THRESHOLD=0 before running scripts because of a PyTorch bug mentioned here.

The codebase has two dependencies: graph_kernel and graph_builder, the first one is for efficient graph attention on GPU with node parallel strategy written in CUDA, the second one is for efficient graph construction written in Cython. To install them:

cd graph_builder
python setup.py install
cd ..
cd graph_kernel
python setup.py install
cd ..

We support the following tasks with BPT as backbone:

  • Text Classification: text_classification.py
  • Language Modeling: lm.py
  • Machine Translation: mt.py
  • Natural Language Inference: nli.py

All experiment settings mentioned in our paper are available at configs/.

python *.py --config configs/*.yml --gpu [GPUs]

Note that this repo does not contain any data files, to get dataset required for experiments, run . get_*.sh and the corresponding dataset would be downloaded and preprocessed.

For machine translation, we have another script mt_infer.py for decoding:

python mt_infer.py --config configs/*.yml --gpu [GPU]

Before decoding, please make sure you have finished the training using mt.py with the same config file.

NOTE: Currently we do not support CPU training/inference.

Visualization

Following is the visualization of the sparse matrix of BPT underlying graph when sequence length is 8192 and k is 4. image

Results

  • Character-Level Language Modeling (enwik8, metric: bpc), 12 layers.
    • BPT(context length=8192): 1.02
    • Adaptive Transformer: 1.02
    • Transformer-XL: 1.06
    • To reproduce: python lm.py --config configs/enwik8-8192.yml --gpu 0,1,2,3,4,5,6,7
  • Document-Level Machine Translation (IWSLT 2015 Zh-En, metric: BLEU), base setting.
    • BPT(context length=64): 19.84
    • HAN-NMT: 17.68
    • To reproduce: python mt.py --config configs/iwslt-4-64.yml --gpu 0
  • Text Classification (IMDB, metric: accuracy), 5 layers.
    • BPT+GloVe: 92.12(±0.11)
    • LSTM+CoVe: 91.8
    • Transformer+Glove: 89.24(±0.20)
    • Star Transformer: 90.50
    • To reproduce: python text_classification.py --config configs/imdb-4.yml --gpu 0
      • Note that our CUDA kernel uses atomic operations which may result in non-determinism, we report the mean and std of accuracy in multiple(10) runs.
      • The IMDB dataset has not official train/dev split, we follow the setting of Bryan et al., 2017 and hold out 10% samples for validation. We report the test accuracy of model with best valid loss.

For sentence level modeling, we show that BPT models better inductive bias than vanilla transformer by attending fine-grained features of neighbors and coarse-grained features of far-away tokens.

  • Machine Translation(WMT14 En-De, metric: BLEU), base setting.
    • BPT(k=1): 26.9
    • BPT(k=2): 27.4
    • BPT(k=4): 27.6
    • BPT(k=8): 26.7
    • Transformer-base(our implementation): 27.2
    • To reproduce: python mt.py --config configs/wmt-*.yml --gpu 0,1,2,3,4,5,6,7
      • We report SacreBLEU result for reproducibility (setting: BLEU+c.mixed+l.en-de+#.1+s.exp+t.wmt14+tok.intl+v.1.4.1), the sacrebleu score is usually lower than that produced by get_ende_bleu.sh script in tensor2tensor as described here.
  • Natural Language Inference(SNLI, metric: accuracy), ESIM-like structure, 3 layers for self-attention and 3 layers for cross-sentence attention.
    • BPT(k=4): 88.25(±0.07)
    • Transformer: 87.89(±0.31)
    • To reproduce: python nli.py --config configs/snli.yml --gpu 0
      • Like Text Classification, the result on NLI is also not stable because of randomness in our CUDA kernel, we report the mean and std of accuracy in multiple(7) runs.
  • Text Classification(SST-5, metric: accuracy), 4 layers.
    • BPT+GloVe: 52.71(±0.32)
    • Transformer+GloVe: 50.40
    • Tree-LSTM+GloVe: 51.0
    • To reproduce: python text_classification.py --config configs/sst5-2.yml --gpu 0

TODOs

  • FP16 support (mixed-precision training/inference)
  • Integrate kernels with dgl 0.5
  • CPU support
You might also like...
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Guide to using pre-trained large language models of source code
Guide to using pre-trained large language models of source code

Large Models of Source Code I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe

Code of paper: A Recurrent Vision-and-Language BERT for Navigation

Recurrent VLN-BERT Code of the Recurrent-VLN-BERT paper: A Recurrent Vision-and-Language BERT for Navigation Yicong Hong, Qi Wu, Yuankai Qi, Cristian

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

Code for CVPR 2021 paper: Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning This is the PyTorch companion code for the paper: A

This repository will contain the code for the CVPR 2021 paper
This repository will contain the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields Project Page | Paper | Supplementary | Video | Slides | Blog | Talk If

Code for ACL 2021 main conference paper
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

Comments
  • How to use BPT with a custom dataset for a machine translation task?

    How to use BPT with a custom dataset for a machine translation task?

    Thanks for this great artefact. I am interested to apply this library for a machine translation task. However, in my case, I have to feed a custom dataset. Furthermore, I have to add customized nodes and edges as my task is slightly different.

    Do you think it is possible with BPT? Would be great if I can discuss this with you in greater detail.

    opened by nashid 2
  • I think there is a bug in the implementation of bpc

    I think there is a bug in the implementation of bpc

    According to the material I have find from here and here, bpc=log2(NLL). But in the implementation in your code, I found that bpc = NLL / log2. Is there something wrong for the calculation of bpc, or I have missed anything?

    opened by OleNet 2
Owner
Zihao Ye
Ph.D. Student@University of Washington, focusing on Compilers and Computer Architecture.
Zihao Ye
Source code for the paper "TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations"

TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations Created by Jiahao Pang, Duanshun Li, and Dong Tian from InterDigital In

InterDigital 21 Dec 29, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

null 10 Jul 1, 2022
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

Justin Terry 32 Nov 9, 2021
This is the source code of RPG (Reward-Randomized Policy Gradient)

RPG (Reward-Randomized Policy Gradient) Zhenggang Tang*, Chao Yu*, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu (

null 40 Nov 25, 2022
Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference Source code for RCDG model in AAAI20 Generating Persona Consistent Di

null 16 Oct 8, 2022
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
The source code of HeCo

HeCo This repo is for source code of KDD 2021 paper "Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning". Paper Link: htt

Nian Liu 106 Dec 27, 2022
Open source code for AlphaFold.

AlphaFold This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP

DeepMind 9.7k Jan 2, 2023
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 1, 2023