Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Introduction

Funnel-Transformer is a new self-attention model that gradually compresses the sequence of hidden states to a shorter one, and hence reduces the computation cost. More importantly, by re-investing the FLOPs saved from this length reduction into building a deeper or wider model, Funnel-Transformer usually provides higher capacity under the same FLOPs budget. In addition, with a decoder, Funnel-Transformer is able to recover a deep token-level representation for every token from the reduced hidden sequence, which enables standard pretraining objectives.
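
As a toy illustration of the compression idea (a minimal sketch, not the released model code; the released models pool only the attention queries and treat the [CLS] position separately, per the pool_q_only and separate_cls options quoted in the comments below), the hidden sequence can be shortened between encoder blocks by strided mean pooling and later up-sampled back to full length for token-level pretraining losses:

    import torch
    import torch.nn.functional as F

    def pool_hidden(h, pooling_size=2):
        # Strided mean pooling along the sequence dimension:
        # [batch, seq_len, d_model] -> [batch, seq_len // pooling_size, d_model]
        return F.avg_pool1d(h.transpose(1, 2), kernel_size=pooling_size).transpose(1, 2)

    def upsample_hidden(h, pooling_size=2):
        # Nearest-neighbor up-sampling used to recover a token-level sequence length.
        return h.repeat_interleave(pooling_size, dim=1)

    h = torch.randn(2, 8, 16)            # toy hidden states: batch=2, length=8, width=16
    pooled = pool_hidden(h)              # length 8 -> 4, so later attention is cheaper
    recovered = upsample_hidden(pooled)  # length 4 -> 8 for token-level objectives
    print(h.shape, pooled.shape, recovered.shape)

In the released configurations quoted below, pooling_type is "mean" with pooling_size 2; the computation saved on the shorter sequence is what gets re-invested into a deeper or wider model.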

For a detailed description of the method and the experimental results, please refer to our paper:

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Zihang Dai*, Guokun Lai*, Yiming Yang, Quoc V. Le

(*: equal contribution)

Preprint 2020

Source Code

Data Download

  • The corresponding source code and instructions are in the data-scripts folder, which describes how to access the raw data used in this work.

TensorFlow

  • The corresponding source code is in the tensorflow folder, which is exactly the code that was developed and used for the TPU pretraining & finetuning presented in the paper.
  • The TensorFlow finetuning code mainly supports TPU finetuning on the GLUE benchmark, text classification, SQuAD, and RACE.
  • Please refer to tensorflow/README.md for details.

PyTorch

  • The source code is in the pytorch folder, which only serves as an example PyTorch implementation of Funnel-Transformer.
  • Hence, the PyTorch code only supports GPU finetuning for the GLUE benchmark & text classification.
  • Please refer to pytorch/README.md for details.

Pretrained models

Model Size        PyTorch   TensorFlow   TensorFlow-Full
B10-10-10H1024    Link      Link         Link
B8-8-8H1024       Link      Link         Link
B6-6-6H768        Link      Link         Link
B6-3x2-3x2H768    Link      Link         Link
B4-4-4H768        Link      Link         Link

Model names follow the pattern B{layers per block}H{hidden size}; for example, B6-6-6H768 has three encoder blocks of 6 layers each with hidden size 768.

Each .tar.gz file contains three items:

  • A TensorFlow or PyTorch checkpoint (model.ckpt-* or model.ckpt.pt) containing the pre-trained weights (note: the TensorFlow checkpoint actually corresponds to 3 files).
  • A Word Piece model (vocab.uncased.txt) used for (de)tokenization.
  • A config file (net_config.json or net_config.pytorch.json) which specifies the hyperparameters of the model.

You can also use download_all_ckpts.sh to download all the checkpoints mentioned above.

For how to use the pretrained models, please refer to tensorflow/README.md or pytorch/README.md respectively.
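
For example, here is a minimal sketch of peeking inside a downloaded PyTorch package (the archive name and extracted paths are hypothetical, and it assumes model.ckpt.pt is a standard torch-loadable state dict; see pytorch/README.md for the supported loading path):

    import json
    import tarfile

    import torch

    # Hypothetical archive/paths; the actual files inside each .tar.gz are
    # model.ckpt.pt (or model.ckpt-*), net_config.pytorch.json (or net_config.json),
    # and vocab.uncased.txt, as listed above.
    with tarfile.open("funnel_ckpt.tar.gz") as tar:
        tar.extractall("funnel_ckpt")

    with open("funnel_ckpt/net_config.pytorch.json") as f:
        net_config = json.load(f)   # model hyperparameters (block sizes, d_model, ...)
    print(net_config)

    state_dict = torch.load("funnel_ckpt/model.ckpt.pt", map_location="cpu")
    print(len(state_dict), "entries in the checkpoint")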

Results

(Figure: GLUE dev set results)

(Figure: question answering results)

Comments
  • Issue when trying to create TFRecords

    Hey @laiguokun @zihangdai

    I am trying to train Funnel-Transformer from scratch and I am running into this issue while trying to create TFRecords. Input files: 2, each containing approx. 5 million lines.

    Traceback (most recent call last):
      File "/home/nemaniarjun/Funnel-Transformer/tensorflow/create_pretrain_data.py", line 365, in <module>
        tf.app.run()
      File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 299, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 250, in _run_main
        sys.exit(main(argv))
      File "/home/nemaniarjun/Funnel-Transformer/tensorflow/create_pretrain_data.py", line 196, in main
        create_pretrain_data(task_file_paths, tokenizer)
      File "/home/nemaniarjun/Funnel-Transformer/tensorflow/create_pretrain_data.py", line 163, in create_pretrain_data
        input_data = np.concatenate(input_data_list)
      File "<__array_function__ internals>", line 6, in concatenate
    ValueError: need at least one array to concatenate
    

    Any ideas how to debug this?

    Thanks!
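
    For reference, NumPy raises this particular ValueError whenever np.concatenate is given an empty list, so here it indicates that input_data_list ended up empty, i.e. no examples were parsed from the input files (worth double-checking the input file paths and that the files are non-empty). A standalone reproduction of just the NumPy error:

    import numpy as np

    input_data_list = []           # what happens when no input files are parsed
    try:
        np.concatenate(input_data_list)
    except ValueError as e:
        print(e)                   # "need at least one array to concatenate"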

    opened by nemani 4
  • Would you mind providing a more efficient pipeline to preprocess a large corpus?

    Hi @zihangdai and @laiguokun ,

    I found that if there are thousands of txt files, the provided script outputs one large tfrecord file and the preprocessing speed is very slow. Would you consider providing a more efficient pipeline for preprocessing the pretraining data, for example using multiple threads?

    By the way, why do you construct negative pairs of two sentences? I found that there is no next-sentence-prediction task, so what is the benefit of doing this?

    If I want to use RoBERTa's doc-sentences preprocessing method, does that mean no passid needs to be set, or should I just set the passid to "0"?

    opened by RyanHuangNLP 4
  • Several Issues About the Paper

    1. In the Introduction of the paper, there is the following sentence:

    Another line of research aims at designing an architecture that not only has a lower resource-to-performance ratio (more efficient) but also scales as well as the Transformer, at least in certain domains.

    I don't quite understand what "but also scales as well as the Transformer" means.

    2. In the Encoder part, it says:

    each pooled hidden state corresponds to a window of 2 unpooled hidden vectors.

    What does "a window of 2 unpooled hidden vectors" mean?

    Could anyone who knows explain? Thanks very much.

    opened by hscspring 2
  • Creation of pretraining data

    Hi @zihangdai and @laiguokun ,

    thanks for publishing the paper and open sourcing the implementation :heart:

    I've one technical question regarding the vocab. I just downloaded a TF checkpoint, and in the vocab.uncased.txt file the [SEP] and [CLS] tokens that are famously known from BERT are missing. Instead, their <sep> and <cls> counterparts can be found.

    Could you confirm that I can just use my BERT-compatible vocab and "transform" it to the <> notation? 🤔

    So the following symbols must be found in the vocab file:

    <pad>
    <s>
    </s>
    <eod>
    <eop>
    <unk>
    <cls>
    <sep>
    <mask>
    

    Like it is specified here:

    https://github.com/laiguokun/Funnel-Transformer/blob/1085523bc768e499d8c55edf6af0d70cb1cd27d2/tensorflow/data_utils.py#L35-L45
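
    A minimal sketch of such a transformation, assuming a plain one-token-per-line BERT vocab file (file names are hypothetical, and only the tokens with obvious counterparts are mapped; the remaining symbols listed above, e.g. <s>, </s>, <eod>, <eop>, would still need to be present in the vocab):

    # Hypothetical helper: rewrite BERT-style special tokens into the <...> notation.
    BERT_TO_FUNNEL = {
        "[PAD]": "<pad>",
        "[UNK]": "<unk>",
        "[CLS]": "<cls>",
        "[SEP]": "<sep>",
        "[MASK]": "<mask>",
    }

    with open("vocab-bert.txt", encoding="utf-8") as src, \
         open("vocab.uncased.txt", "w", encoding="utf-8") as dst:
        for line in src:
            token = line.rstrip("\n")
            dst.write(BERT_TO_FUNNEL.get(token, token) + "\n")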

    I just wanted to train a Funnel-Transformer for Turkish (in addition to BERTurk models) and I would really like to use the same vocab for comparison reasons :)

    Best,

    Stefan

    opened by stefan-it 2
  • Question related to pretraining with residual connection

    Hi,

    I wonder: since the proposed one-step decoder uses the full-length encoded representation as well, is it possible that the decoder mostly relies on the unpooled representation, which has more representational power than the pooled one, and thus depends less on the pooling operation to achieve good pretraining results (correct me if I'm wrong)? If so, how do you ensure the effectiveness of the pooled representation when it is used in downstream tasks?

    Thanks

    opened by hxu38691 0
  • Deep VAEs as a research direction for the successor of XLNet

    https://paperswithcode.com/paper/very-deep-vaes-generalize-autoregressive-1

    XLNet is arguably the state-of-the-art language model and is autoregressive. I wonder whether the observation that very deep VAEs generalize autoregressive models and can outperform them on images carries over to language models. @zihangdai I am posting this here instead of on the XLNet repository because you are not active on it.

    opened by LifeIsStrange 0
  • An Issue about the position embedding code

    Hello there. Regarding the code below:

    def get_pos_enc(self, seq_len, dtype, device):
        ...
        for bidx in range(0, self.net_config.n_block):
            # For each block with bidx > 0, we need two types pos_encs:
            #   - Attn(pooled-q, unpooled-kv)
            #   - Attn(pooled-q, pooled-kv)

            #### First type: Attn(pooled-q, unpooled-kv)
            if bidx > 0:
                # HERE, the pos_id has been changed in the `Second type` below
                pooled_pos_id = self.stride_pool_pos(pos_id, bidx)

                # construct rel_pos_id
                q_stride = self.net_config.pooling_size ** bidx
                k_stride = self.net_config.pooling_size ** (bidx - 1)
                rel_pos_id = self.construct_rel_pos_seq(
                    q_pos=pooled_pos_id, q_stride=q_stride,
                    k_pos=pos_id, k_stride=k_stride)

                # gather relative positional encoding
                rel_pos_id = rel_pos_id[:, None] + zero_offset
                rel_pos_id = rel_pos_id.expand(rel_pos_id.size(0), d_model)
                pos_enc_2 = torch.gather(pos_enc, 0, rel_pos_id)
            else:
                pos_enc_2 = None

            #### Second type: Attn(pooled-q, pooled-kv)
            # construct rel_pos_id
            pos_id = pooled_pos_id
            stride = self.net_config.pooling_size ** bidx
            rel_pos_id = self.construct_rel_pos_seq(
                q_pos=pos_id, q_stride=stride,
                k_pos=pos_id, k_stride=stride)

            # gather relative positional encoding
            rel_pos_id = rel_pos_id[:, None] + zero_offset
            rel_pos_id = rel_pos_id.expand(rel_pos_id.size(0), d_model)
            pos_enc_1 = torch.gather(pos_enc, 0, rel_pos_id)

            pos_enc_list.append([pos_enc_1, pos_enc_2])

        return pos_enc_list
    

    Here, the pos_id used in the first type has already been changed (pooled) after the first block, since it is overwritten in the "Second type" branch. I'm a little confused about that.

    Can anyone explain this? Thanks.

    opened by hscspring 0
  • Pretraining Issues

    Hey, I am trying to train Funnel-Transformer with the following hparams. The CPU usage on my TPUv3-8 has not gone above 4% in the 90 hours the code has been running, and training seems very slow: it took approx. 90 hours for 9,000 steps.

    Do you guys think there is something wrong here or is this time expected?

    {
        "block_size": "6_6_6",
        "d_embed": 1024,
        "d_head": 64,
        "d_inner": 4096,
        "d_model": 1024,
        "decoder_size": "2",
        "dropact": 0.0,
        "dropatt": 0.1,
        "dropout": 0.1,
        "ff_activation": "gelu",
        "init": "truncated_normal",
        "init_range": 0.1,
        "init_std": 0.02,
        "n_head": 16,
        "pool_q_only": true,
        "pooling_size": 2,
        "pooling_type": "mean",
        "rel_attn_type": "factorized",
        "separate_cls": true,
        "vocab_size": 32000
    }
    
    opened by nemani 7