Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Introduction

Funnel-Transformer is a new self-attention model that gradually compresses the sequence of hidden states to a shorter one, and hence reduces the computation cost. More importantly, by re-investing the FLOPs saved from this length reduction into building a deeper or wider model, Funnel-Transformer usually provides higher capacity under the same FLOPs budget. In addition, with a decoder, Funnel-Transformer is able to recover a deep token-level representation for every token from the reduced hidden sequence, which enables standard pretraining objectives.
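
As a toy illustration of the compression idea (a minimal sketch, not the released model code; the released models pool only the attention queries and treat the [CLS] position separately, per the pool_q_only and separate_cls options quoted in the comments below), the hidden sequence can be shortened between encoder blocks by strided mean pooling and later up-sampled back to full length for token-level pretraining losses:

    import torch
    import torch.nn.functional as F

    def pool_hidden(h, pooling_size=2):
        # Strided mean pooling along the sequence dimension:
        # [batch, seq_len, d_model] -> [batch, seq_len // pooling_size, d_model]
        return F.avg_pool1d(h.transpose(1, 2), kernel_size=pooling_size).transpose(1, 2)

    def upsample_hidden(h, pooling_size=2):
        # Nearest-neighbor up-sampling used to recover a token-level sequence length.
        return h.repeat_interleave(pooling_size, dim=1)

    h = torch.randn(2, 8, 16)            # toy hidden states: batch=2, length=8, width=16
    pooled = pool_hidden(h)              # length 8 -> 4, so later attention is cheaper
    recovered = upsample_hidden(pooled)  # length 4 -> 8 for token-level objectives
    print(h.shape, pooled.shape, recovered.shape)

In the released configurations quoted below, pooling_type is "mean" with pooling_size 2; the computation saved on the shorter sequence is what gets re-invested into a deeper or wider model.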

For a detailed description of the method and the experimental results, please refer to our paper:

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Zihang Dai*, Guokun Lai*, Yiming Yang, Quoc V. Le

(*: equal contribution)

Preprint 2020

Source Code

Data Download

  • The corresponding source code and instructions are in the data-scripts folder, which describes how to access the raw data used in this work.

TensorFlow

  • The corresponding source code is in the tensorflow folder, which is exactly the code that was developed and used for the TPU pretraining & finetuning presented in the paper.
  • The TensorFlow finetuning code mainly supports TPU finetuning on the GLUE benchmark, text classification, SQuAD, and RACE.
  • Please refer to tensorflow/README.md for details.

PyTorch

  • The source code is in the pytorch folder, which only serves as an example PyTorch implementation of Funnel-Transformer.
  • Hence, the PyTorch code only supports GPU finetuning for the GLUE benchmark & text classification.
  • Please refer to pytorch/README.md for details.

Pretrained models

Model Size        PyTorch   TensorFlow   TensorFlow-Full
B10-10-10H1024    Link      Link         Link
B8-8-8H1024       Link      Link         Link
B6-6-6H768        Link      Link         Link
B6-3x2-3x2H768    Link      Link         Link
B4-4-4H768        Link      Link         Link

Model names follow the pattern B{layers per block}H{hidden size}; for example, B6-6-6H768 has three encoder blocks of 6 layers each with hidden size 768.

Each .tar.gz file contains three items:

  • A TensorFlow or PyTorch checkpoint (model.ckpt-* or model.ckpt.pt) containing the pre-trained weights (note: the TensorFlow checkpoint actually corresponds to 3 files).
  • A Word Piece model (vocab.uncased.txt) used for (de)tokenization.
  • A config file (net_config.json or net_config.pytorch.json) which specifies the hyperparameters of the model.

You can also use download_all_ckpts.sh to download all the checkpoints mentioned above.

For how to use the pretrained models, please refer to tensorflow/README.md or pytorch/README.md respectively.
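
For example, here is a minimal sketch of peeking inside a downloaded PyTorch package (the archive name and extracted paths are hypothetical, and it assumes model.ckpt.pt is a standard torch-loadable state dict; see pytorch/README.md for the supported loading path):

    import json
    import tarfile

    import torch

    # Hypothetical archive/paths; the actual files inside each .tar.gz are
    # model.ckpt.pt (or model.ckpt-*), net_config.pytorch.json (or net_config.json),
    # and vocab.uncased.txt, as listed above.
    with tarfile.open("funnel_ckpt.tar.gz") as tar:
        tar.extractall("funnel_ckpt")

    with open("funnel_ckpt/net_config.pytorch.json") as f:
        net_config = json.load(f)   # model hyperparameters (block sizes, d_model, ...)
    print(net_config)

    state_dict = torch.load("funnel_ckpt/model.ckpt.pt", map_location="cpu")
    print(len(state_dict), "entries in the checkpoint")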

Results

(Figure: GLUE dev set results)

(Figure: question answering results)

Comments
  • Issue when trying to create TFRecords

    Hey @laiguokun @zihangdai

    I am trying to train Funnel-Transformer from scratch and I am running into this issue while trying to create TFRecords. Input files: 2, each containing approx. 5 million lines.

    Traceback (most recent call last):
      File "/home/nemaniarjun/Funnel-Transformer/tensorflow/create_pretrain_data.py", line 365, in <module>
        tf.app.run()
      File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 299, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 250, in _run_main
        sys.exit(main(argv))
      File "/home/nemaniarjun/Funnel-Transformer/tensorflow/create_pretrain_data.py", line 196, in main
        create_pretrain_data(task_file_paths, tokenizer)
      File "/home/nemaniarjun/Funnel-Transformer/tensorflow/create_pretrain_data.py", line 163, in create_pretrain_data
        input_data = np.concatenate(input_data_list)
      File "<__array_function__ internals>", line 6, in concatenate
    ValueError: need at least one array to concatenate
    

    Any ideas how to debug this?

    Thanks!
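
    For reference, NumPy raises this particular ValueError whenever np.concatenate is given an empty list, so here it indicates that input_data_list ended up empty, i.e. no examples were parsed from the input files (worth double-checking the input file paths and that the files are non-empty). A standalone reproduction of just the NumPy error:

    import numpy as np

    input_data_list = []           # what happens when no input files are parsed
    try:
        np.concatenate(input_data_list)
    except ValueError as e:
        print(e)                   # "need at least one array to concatenate"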

    opened by nemani 4
  • Would you mind providing a more efficient pipeline to preprocess a large corpus?

    Hi @zihangdai and @laiguokun ,

    I found that if there are thousands of txt files, the provided script outputs one large tfrecord file and the preprocessing speed is very slow. Would you consider providing a more efficient pipeline for preprocessing the pretraining data, for example using multiple threads?

    By the way, why do you construct negative pairs of two sentences? I found that there is no next-sentence-prediction task, so what is the benefit of doing this?

    If I want to use RoBERTa's doc-sentences preprocessing method, does that mean no passid needs to be set, or should I just set the passid to "0"?

    opened by RyanHuangNLP 4
  • Several Issues About the Paper

    1. In the Introduction of the paper, there is the following sentence:

    Another line of research aims at designing an architecture that not only has a lower resource-to-performance ratio (more efficient) but also scales as well as the Transformer, at least in certain domains.

    I don't quite understand what "but also scales as well as the Transformer" means.

    2. In the Encoder part, it says:

    each pooled hidden state corresponds to a window of 2 unpooled hidden vectors.

    What does "a window of 2 unpooled hidden vectors" mean?

    Could anyone who knows explain? Thanks very much.

    opened by hscspring 2
  • Creation of pretraining data

    Hi @zihangdai and @laiguokun ,

    thanks for publishing the paper and open sourcing the implementation :heart:

    I've one technical question regarding the vocab. I just downloaded a TF checkpoint, and in the vocab.uncased.txt file the [SEP] and [CLS] tokens that are famously known from BERT are missing. Instead, their <sep> and <cls> counterparts can be found.

    Could you confirm that I can just use my BERT-compatible vocab and "transform" it to the <> notation? 🤔

    So the following symbols must be found in the vocab file:

    <pad>
    <s>
    </s>
    <eod>
    <eop>
    <unk>
    <cls>
    <sep>
    <mask>
    

    Like it is specified here:

    https://github.com/laiguokun/Funnel-Transformer/blob/1085523bc768e499d8c55edf6af0d70cb1cd27d2/tensorflow/data_utils.py#L35-L45
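
    A minimal sketch of such a transformation, assuming a plain one-token-per-line BERT vocab file (file names are hypothetical, and only the tokens with obvious counterparts are mapped; the remaining symbols listed above, e.g. <s>, </s>, <eod>, <eop>, would still need to be present in the vocab):

    # Hypothetical helper: rewrite BERT-style special tokens into the <...> notation.
    BERT_TO_FUNNEL = {
        "[PAD]": "<pad>",
        "[UNK]": "<unk>",
        "[CLS]": "<cls>",
        "[SEP]": "<sep>",
        "[MASK]": "<mask>",
    }

    with open("vocab-bert.txt", encoding="utf-8") as src, \
         open("vocab.uncased.txt", "w", encoding="utf-8") as dst:
        for line in src:
            token = line.rstrip("\n")
            dst.write(BERT_TO_FUNNEL.get(token, token) + "\n")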

    I just wanted to train a Funnel-Transformer for Turkish (in addition to BERTurk models) and I would really like to use the same vocab for comparison reasons :)

    Best,

    Stefan

    opened by stefan-it 2
  • Question related to pretraining with residual connection

    Hi,

    I wonder: since the proposed one-step decoder uses the full-length encoded representation as well, is it possible that the decoder mostly relies on the unpooled representation, which has more representational power than the pooled one, and thus depends less on the pooling operation to achieve good pretraining results (correct me if I'm wrong)? If so, how do you ensure the effectiveness of the pooled representation when it is used in downstream tasks?

    Thanks

    opened by hxu38691 0
  • Deep VAEs as a research direction for the successor of XLNet

    https://paperswithcode.com/paper/very-deep-vaes-generalize-autoregressive-1

    XLNet is arguably the state-of-the-art language model and is autoregressive. I wonder whether the observation that very deep VAEs generalize autoregressive models and can outperform them on images carries over to language models. @zihangdai I am posting this here instead of on the XLNet repository because you are not active on it.

    opened by LifeIsStrange 0
  • An Issue about the position embedding code

    Hello there. Regarding the code below:

    def get_pos_enc(self, seq_len, dtype, device):
        ...
        for bidx in range(0, self.net_config.n_block):
            # For each block with bidx > 0, we need two types pos_encs:
            #   - Attn(pooled-q, unpooled-kv)
            #   - Attn(pooled-q, pooled-kv)

            #### First type: Attn(pooled-q, unpooled-kv)
            if bidx > 0:
                # HERE, the pos_id has been changed in the `Second type` below
                pooled_pos_id = self.stride_pool_pos(pos_id, bidx)

                # construct rel_pos_id
                q_stride = self.net_config.pooling_size ** bidx
                k_stride = self.net_config.pooling_size ** (bidx - 1)
                rel_pos_id = self.construct_rel_pos_seq(
                    q_pos=pooled_pos_id, q_stride=q_stride,
                    k_pos=pos_id, k_stride=k_stride)

                # gather relative positional encoding
                rel_pos_id = rel_pos_id[:, None] + zero_offset
                rel_pos_id = rel_pos_id.expand(rel_pos_id.size(0), d_model)
                pos_enc_2 = torch.gather(pos_enc, 0, rel_pos_id)
            else:
                pos_enc_2 = None

            #### Second type: Attn(pooled-q, pooled-kv)
            # construct rel_pos_id
            pos_id = pooled_pos_id
            stride = self.net_config.pooling_size ** bidx
            rel_pos_id = self.construct_rel_pos_seq(
                q_pos=pos_id, q_stride=stride,
                k_pos=pos_id, k_stride=stride)

            # gather relative positional encoding
            rel_pos_id = rel_pos_id[:, None] + zero_offset
            rel_pos_id = rel_pos_id.expand(rel_pos_id.size(0), d_model)
            pos_enc_1 = torch.gather(pos_enc, 0, rel_pos_id)

            pos_enc_list.append([pos_enc_1, pos_enc_2])

        return pos_enc_list
    

    Here, the pos_id used in the first type has already been changed (pooled) after the first block, since it is overwritten in the "Second type" branch. I'm a little confused about that.

    Can anyone explain this? Thanks.

    opened by hscspring 0
  • Pretraining Issues

    Hey, I am trying to train Funnel-Transformer with the following hparams. The CPU usage on my TPUv3-8 has not gone above 4% in the 90 hours the code has been running, and training seems very slow: it took approx. 90 hours for 9,000 steps.

    Do you guys think there is something wrong here or is this time expected?

    {
        "block_size": "6_6_6",
        "d_embed": 1024,
        "d_head": 64,
        "d_inner": 4096,
        "d_model": 1024,
        "decoder_size": "2",
        "dropact": 0.0,
        "dropatt": 0.1,
        "dropout": 0.1,
        "ff_activation": "gelu",
        "init": "truncated_normal",
        "init_range": 0.1,
        "init_std": 0.02,
        "n_head": 16,
        "pool_q_only": true,
        "pooling_size": 2,
        "pooling_type": "mean",
        "rel_attn_type": "factorized",
        "separate_cls": true,
        "vocab_size": 32000
    }
    
    opened by nemani 7