Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Overview

This repository contains the code in both PyTorch and TensorFlow for our paper

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov (*: equal contribution)

Preprint 2018

TensorFlow

  • The source code is in the tf/ folder, supporting (1) single-node multi-gpu training, and (2) multi-host TPU training.
  • Besides the source code, we also provide pretrained TensorFlow models with state-of-the-art (SoTA) performance as reported in the paper.
  • Please refer to tf/README.md for details.

PyTorch

  • The source code is in the pytorch/ folder, supporting single-node multi-gpu training via the module nn.DataParallel; a minimal usage sketch follows this list.
  • Please refer to pytorch/README.md for details.
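
The sketch below is an editor's illustration of how nn.DataParallel is typically wrapped around a model for single-node multi-gpu training. It uses a stand-in module rather than the repo's actual model class, and all names are illustrative only.

    import torch
    import torch.nn as nn

    # Stand-in two-layer module instead of the repo's actual model class.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).to(device)

    if torch.cuda.device_count() > 1:
        # nn.DataParallel replicates the module on every visible GPU and
        # splits the input batch along dim 0 for each forward pass.
        model = nn.DataParallel(model)

    x = torch.randint(0, 1000, (32, 16), device=device)  # (batch, seq_len) dummy token ids
    logits = model(x)                                     # (32, 16, 1000)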

Results

Transformer-XL achieves new state-of-the-art results on multiple language modeling benchmarks. Transformer-XL is also the first to break through the 1.0 barrier on char-level language modeling. Below is a summary.

Method            enwiki8 (bpc)   text8 (bpc)   One Billion Word (ppl)   WT-103 (ppl)   PTB w/o finetuning (ppl)
Previous Best     1.06            1.13          23.7                     20.5           55.5
Transformer-XL    0.99            1.08          21.8                     18.3           54.5
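
As a reading aid (editor's note, not from the original README): bits per character and perplexity are both simple transforms of the average cross-entropy loss in nats, as the small check below shows.

    import math

    # bpc = cross-entropy loss in nats / ln(2); perplexity = exp(loss in nats)
    enwik8_loss_nats = 0.99 * math.log(2)   # 0.99 bpc  -> ~0.686 nats per character
    wt103_loss_nats = math.log(18.3)        # ppl 18.3  -> ~2.907 nats per token
    print(round(enwik8_loss_nats, 3), round(wt103_loss_nats, 3))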

Acknowledgement

A large portion of the getdata.sh script comes from the awd-lstm repo. Happy Language Modeling :)

Comments
  • Unable to replicate experiment results

    I am not sure where it went wrong, but I have been training enwik8 with your TensorFlow code and default parameters on 4 GPUs for 4 days, and the loss never drops below 4.2 while the learning rate has already dropped to 0.000001. Are there any special tricks needed to replicate the experiment?

    Thanks.

    P.S. I am using Python 3 and TensorFlow 1.11.0. I have not tried the other 3 datasets yet. I also tried Transformer-XL on a private dataset (where a single-layer word-level LSTM can achieve around 60%+ accuracy), and its loss also never drops below 4.2 and its accuracy never goes higher than 15%.

    opened by felixhao28 10
  • Expected Results for PyTorch run_lm1b_base.sh

    I wanted to know the expected test perplexity for the lm1b base model. It would be especially great if you could upload the training log file if possible. I wanted to include the results in my ICML paper.

    opened by rdspring1 9
  • OOM issue when training 1 billion corpus

    I am trying to train with the 1 Billion Word corpus on a Tesla P40. The following values are being used:

    N_LAYER = 12, D_MODEL = 512, D_EMBED = 512, D_INNER = 2048, D_HEAD = 64

    I also tried with a BSZ of 128, but it still gives an OOM error.
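
    A common workaround when a full batch does not fit in GPU memory is to split each batch into chunks and accumulate gradients before the optimizer step; the repo's PyTorch trainer exposes a similar mechanism, if I recall correctly. The sketch below is an editor's illustration with hypothetical names, not the repo's code.

        import torch

        def train_step(model, criterion, optimizer, data, target, n_chunks=4):
            # Split the batch (dim 0 here) into n_chunks pieces and accumulate gradients.
            optimizer.zero_grad()
            total_loss = 0.0
            for data_i, target_i in zip(data.chunk(n_chunks, 0), target.chunk(n_chunks, 0)):
                loss = criterion(model(data_i), target_i) / n_chunks
                loss.backward()          # gradients add up across chunks
                total_loss += loss.item()
            optimizer.step()             # one optimizer step per full batch
            return total_loss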

    opened by deep-speech 7
  • Quick question on comparison against BERT

    Thanks for the code! I am sure my question will be asked over and over again in the near future. I have also read your paper, which is all about the comparison against the vanilla Transformer.

    But still, in terms of performance, have you compared your model against BERT? I understand it may not be a 100% fair comparison. But at the end of the day... which one (BERT or Transformer-XL) is better on typical NLP tasks? Thanks.

    opened by hohoCode 6
  • Train a new corpus!

    What changes do we need to make inside the script to train on a new corpus?

    I have checked the script, and there are a lot of if conditions that depend on each corpus.

    opened by agemagician 5
  • can not reproduce sota wikitext103 results

    I use the pretrained-xl weights and the same vocab to build Transformer-XL large (we use TensorFlow 2.0) to evaluate the test set. But in my experiments, I find that {tgt_len=128, mem_len=1600, clamp_len=1000} only reaches a test ppl of around 35, {tgt_len=384, mem_len=384, clamp_len=1000} reaches around 24, and {tgt_len=2048, mem_len=2048, clamp_len=1000} reaches around 20, but none of these settings reaches the paper's result of 18.3. Why?

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pickle
from DataService import DataObjForWT_PTB as DataObj
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

vocab_size_dic = {
    "wikitext-103": 267736,
    "enwiki8": 0,
    "text8": 0
}


class Vanilla_XL(keras.Model):
    def __init__(self, dataset_name: str, segment_size: int, dropout_attn, dropout_norm,
                 n_layers, n_heads, d_embed, d_model, ffn_mul, cutoffs):
        super(Vanilla_XL, self).__init__()
        self.vocab_size = vocab_size_dic[dataset_name]
        self.segment_size = segment_size
        self.dropout_attn = dropout_attn
        self.dropout_norm = dropout_norm
        self.d_model = d_model
        self.d_embed = d_embed
        self.ffn_mul = ffn_mul
        self.cutoffs = cutoffs
        self.n_layers = n_layers
        self.n_heads = n_heads

        # embedding
        self.token_embedding = AdaptiveEmbedding(
            cutoffs=self.cutoffs, d_embed=self.d_embed, embed_drop_rate=self.dropout_norm,
            input_dim=self.vocab_size, out_dim=d_model, div_value=4
        )
    
        self.all_encoder_layers = []
        for layer in range(self.n_layers):
            self.all_encoder_layers.append(
                SingleTransformerBlock(
                    d_model=d_model, ffn_size=self.ffn_mul * d_model, n_heads=self.n_heads,
                    dropout_attn=self.dropout_attn, dropout_norm=self.dropout_norm,
                    cur_layer=layer
                )
            )
    
        self.softmax_out_layer = AdaptiveSoftmax(cutoffs=self.cutoffs, d_embed=self.d_embed,
                                                 adaptive_embedding_obj=self.token_embedding, div_value=4)
    
    def call(self, inputs, training=None, **kwargs):
        cache = kwargs["cache"] 
        padding_mask = kwargs["padding_mask"]
        segment_embedding = self.token_embedding(inputs=inputs, is_training=training)
        new_cache = [segment_embedding[:, tf.newaxis, :, :]]
    
        cur_layer_out = segment_embedding
        for layer in range(self.n_layers):
            cur_layer_out = self.all_encoder_layers[layer](
                inputs=cur_layer_out, cache=cache[:, layer, :, :],
                is_training=training, padding_mask=padding_mask
            )
            if layer != self.n_layers - 1:
                new_cache.append(cur_layer_out[:, tf.newaxis, :, :])
        final_out = cur_layer_out
        g_t = kwargs["ground_truth"]
        no_pad_indices = tf.where(tf.not_equal(g_t, PAD))
        final_out = tf.gather_nd(final_out, no_pad_indices)
        g_t = tf.gather_nd(g_t, no_pad_indices)
        log_prob = self.softmax_out_layer(inputs=final_out, ground_truth=g_t)
        return log_prob, tf.concat(new_cache, axis=1)
    

class AdaptiveEmbedding(keras.layers.Layer):
    def __init__(self, cutoffs, embed_drop_rate, input_dim, out_dim, d_embed, div_value=4):
        super(AdaptiveEmbedding, self).__init__()
        assert isinstance(cutoffs, list)
        self.cutoffs = cutoffs
        self.input_dim = input_dim
        self.out_dim = out_dim
        self.d_embed = d_embed
        self.div_value = div_value

        self.cluster_embedding_list = []
        self.projection_list = []
        self.dropout_layer = keras.layers.Dropout(rate=embed_drop_rate)
    
        for i in range(len(self.cutoffs) - 1):
            in_dims = self.cutoffs[i + 1] - self.cutoffs[i]
            o_dims = self.d_embed // (self.div_value ** i)
            self.cluster_embedding_list.append(
                keras.layers.Embedding(
                    input_dim=in_dims, output_dim=o_dims,
                    weights=[tf.convert_to_tensor(
                        pre_train_weights["transformer/adaptive_embed/cutoff_%d/lookup_table:0" % i],
                        dtype=tf.float32)]
                ))
            self.projection_list.append(
                tf.Variable(
                    initial_value=tf.convert_to_tensor(
                        pre_train_weights["transformer/adaptive_embed/cutoff_%d/proj_W:0" % i]
                    ), dtype=tf.float32
                )
            )
    
    def call(self, inputs, **kwargs):
        x = []  # per-cluster projected embeddings, summed below
        for i in range(len(self.cutoffs) - 1):
            start = self.cutoffs[i]
            end = self.cutoffs[i + 1] 
            actual = tf.math.logical_and(inputs >= start, inputs < end)
            mask = tf.expand_dims(tf.cast(actual, dtype=tf.float32), axis=2)
            new_input = inputs - start
            new_input = tf.where(actual, new_input, tf.zeros_like(new_input, dtype=tf.int32))
            embed = self.cluster_embedding_list[i](inputs=new_input)
            linear_proj = tf.matmul(embed, self.projection_list[i], transpose_b=False)
            x.append(tf.multiply(linear_proj, mask))
        out = tf.zeros_like(x[0], dtype=tf.float32)
        for j in range(len(x)):
            out += x[j]
        out *= self.out_dim ** 0.5
        return self.dropout_layer(out, training=kwargs["is_training"])
    

class AdaptiveSoftmax(keras.layers.Layer):
    def __init__(self, cutoffs, d_embed, adaptive_embedding_obj, div_value=4):
        super(AdaptiveSoftmax, self).__init__()
        self.cutoffs = cutoffs
        self.d_embed = d_embed
        self.div_value = div_value
        assert isinstance(adaptive_embedding_obj, AdaptiveEmbedding)
        self.adaptive_embedding_obj = adaptive_embedding_obj
        self.tail_clusters_embedding = keras.layers.Embedding(
            input_dim=len(self.cutoffs) - 2, output_dim=self.d_embed,
            weights=[tf.convert_to_tensor(
                pre_train_weights["transformer/adaptive_softmax/cutoff_0/cluster_W:0"])]
        )
        self.clusters_bias = tf.Variable(
            initial_value=tf.convert_to_tensor(
                pre_train_weights["transformer/adaptive_softmax/cutoff_0/cluster_b:0"]),
            dtype=tf.float32
        )

        self.head_projection = tf.Variable(
            initial_value=tf.convert_to_tensor(
                pre_train_weights["transformer/adaptive_softmax/cutoff_0/proj:0"]
            ), dtype=tf.float32
        )
    
        self.bias_list = []
        for i in range(len(self.cutoffs) - 1):
            self.bias_list.append(
                tf.convert_to_tensor(pre_train_weights["transformer/adaptive_softmax/cutoff_%d/b:0" % i])
            )
        self.projection_list = self.adaptive_embedding_obj.projection_list
    
    def call(self, inputs, **kwargs):
        x = []
        g_t = kwargs["ground_truth"]
        head_all_vocab_embedding = self.adaptive_embedding_obj.cluster_embedding_list[0](
            inputs=tf.convert_to_tensor([i for i in range(self.cutoffs[1] - self.cutoffs[0])], dtype=tf.int32)
        )  # (c0, dim)
    
        all_tail_cluster_embedding = self.tail_clusters_embedding(
            inputs=tf.convert_to_tensor([i for i in range(len(self.cutoffs) - 2)], dtype=tf.int32)
        )  # (3, dim)
        head_embedding = tf.concat([head_all_vocab_embedding, all_tail_cluster_embedding], axis=0)
        head_proj_out = tf.matmul(inputs, self.head_projection, transpose_b=True)
        head_logits = tf.matmul(head_proj_out, head_embedding, transpose_b=True)
        head_logits += tf.concat([self.bias_list[0], self.clusters_bias], axis=0)
        head_softmax = tf.nn.softmax(head_logits, axis=-1)
    
        for i in range(len(self.cutoffs) - 1):
            start = self.cutoffs[i]
            end = self.cutoffs[i + 1]
            cur_cluster_indices = tf.where(tf.math.logical_and(g_t >= start, g_t < end))
            seq_len = tf.shape(cur_cluster_indices)[0]
            cur_g_t = tf.gather_nd(g_t, cur_cluster_indices)
            cur_g_t = cur_g_t - start
            cur_g_t = tf.expand_dims(cur_g_t, axis=1)
            first_dim = tf.expand_dims(tf.range(seq_len, dtype=tf.int32), axis=1)
            r_s = tf.concat([first_dim, cur_g_t], axis=1)
            if i == 0: 
                cur_softmax = tf.gather_nd(head_softmax, cur_cluster_indices)
                cur_out_prob = tf.gather_nd(cur_softmax, r_s)
                cur_out_prob = tf.where(cur_out_prob >= 1e-9, cur_out_prob,
                                        tf.ones_like(cur_out_prob, dtype=tf.float32) * 1e-9)
                cur_log_prob = -tf.math.log(cur_out_prob)
            else:
                pre_softmax = tf.gather_nd(head_softmax, cur_cluster_indices)[..., self.cutoffs[1] + i - 2]
                pre_softmax = tf.where(pre_softmax > 1e-9, pre_softmax,
                                       tf.ones_like(pre_softmax, dtype=tf.float32) * 1e-9)
                pre_log_prob = -tf.math.log(pre_softmax)
    
                cur_inputs = tf.gather_nd(inputs, cur_cluster_indices)
    
                all_cur_cluster_embedding = self.adaptive_embedding_obj.cluster_embedding_list[i](
                    tf.convert_to_tensor([i for i in range(end - start)], dtype=tf.int32)
                )
                cur_inputs = tf.matmul(cur_inputs, self.projection_list[i], transpose_b=True)
                cur_logits = tf.matmul(cur_inputs, all_cur_cluster_embedding, transpose_b=True)
                cur_logits += self.bias_list[i]
    
                cur_softmax = tf.nn.softmax(cur_logits, axis=-1)
                cur_out_prob = tf.gather_nd(cur_softmax, r_s)
                cur_out_prob = tf.where(cur_out_prob >= 1e-9, cur_out_prob,
                                        tf.ones_like(cur_out_prob, dtype=tf.float32) * 1e-9)
                cur_log_prob = -tf.math.log(cur_out_prob)
    
                cur_log_prob += pre_log_prob
            x.append(cur_log_prob)
        return tf.concat(x, axis=0)
    

class SingleTransformerBlock(keras.layers.Layer):
    def __init__(self, d_model, ffn_size, n_heads, dropout_attn, dropout_norm, cur_layer):
        super(SingleTransformerBlock, self).__init__()
        self.n_heads = n_heads
        self.cur_layer = cur_layer
        self.d_model = d_model

        self.w_query = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/qkv/kernel:0" % cur_layer][:, 0:d_model]
            )
        )
        self.w_key = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/qkv/kernel:0" % cur_layer][:, d_model:2 * d_model]
            )
        )
        self.w_value = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/qkv/kernel:0" % cur_layer][:, 2 * d_model:]
            )
        )
        self.w_rel_pos = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/r/kernel:0" % cur_layer]
            )
        )
    
        self.w_attn = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/o/kernel:0" % cur_layer]
            )
        )
    
        self.w_ffn_up = keras.layers.Dense(
            units=ffn_size, activation="relu",
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_1/kernel:0" % cur_layer]
            ),
            bias_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_1/bias:0" % cur_layer]
            )
        )
        self.w_ffn_down = keras.layers.Dense(
            units=d_model,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_2/kernel:0" % cur_layer]
            ),
            bias_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_2/bias:0" % cur_layer]
            )
        )
    
        x_ut = tf.convert_to_tensor(pre_train_weights["transformer/r_w_bias:0"][cur_layer],
                                    dtype=tf.float32)  # (head, dim // head)
        x_vt = tf.convert_to_tensor(pre_train_weights["transformer/r_r_bias:0"][cur_layer],
                                    dtype=tf.float32)  # (head, dim // head)
    
        self.ut = tf.Variable(initial_value=tf.reshape(
            x_ut, shape=(d_model,)
        ), dtype=tf.float32, trainable=True)
        self.vt = tf.Variable(initial_value=tf.reshape(
            x_vt, shape=(d_model,)
        ), dtype=tf.float32, trainable=True)
    
        self.attn_drop = keras.layers.Dropout(rate=dropout_attn)
        self.ffn_drop = keras.layers.Dropout(rate=dropout_norm)
    
        self.attn_ln = keras.layers.LayerNormalization(
            gamma_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/LayerNorm/gamma:0" % cur_layer]
            ), beta_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/LayerNorm/beta:0" % cur_layer]
            )
        )
        self.ffn_ln = keras.layers.LayerNormalization(
            gamma_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/LayerNorm/gamma:0" % cur_layer]
            ), beta_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/LayerNorm/beta:0" % cur_layer]
            )
        )
    
    def call(self, inputs, **kwargs):
        cache = kwargs["cache"]
        fusion_inputs = tf.concat([cache, inputs], axis=1)
    
        query = tf.concat(tf.split(self.w_query(inputs) + self.ut, axis=2, num_or_size_splits=self.n_heads), axis=0)
        q2 = tf.concat(tf.split(self.w_query(inputs) + self.vt, axis=2, num_or_size_splits=self.n_heads), axis=0)
        key = tf.concat(tf.split(self.w_key(fusion_inputs), axis=2, num_or_size_splits=self.n_heads), axis=0)
        value = tf.concat(tf.split(self.w_value(fusion_inputs), axis=2, num_or_size_splits=self.n_heads), axis=0)
    
        pos_enc = G.create_pre_relative_encoding(seq_length=fusion_inputs.shape[1], dim=fusion_inputs.shape[2])
        pos_enc = tf.tile(self.w_rel_pos(pos_enc)[tf.newaxis, ...], multiples=[fusion_inputs.shape[0], 1, 1])
        pos_enc = tf.concat(tf.split(pos_enc, axis=2, num_or_size_splits=self.n_heads), axis=0)
    
        attn_out = self.rel_scaled_dot_product_attention(query=query, key=key, value=value, pos_enc=pos_enc,
                                                         padding_mask=kwargs["padding_mask"], q2=q2,
                                                         look_ahead_mask=G.create_look_ahead_mask(
                                                             q_len=inputs.shape[1], k_len=fusion_inputs.shape[1]))
        attn_out = tf.concat(tf.split(attn_out, axis=0, num_or_size_splits=self.n_heads), axis=2)
    
        attn_out = self.w_attn(attn_out)
        attn_out = self.attn_drop(attn_out, training=kwargs["is_training"])
        res_out_1 = attn_out + inputs
        ln_out_1 = self.attn_ln(res_out_1)
    
        ffn_up = self.w_ffn_up(ln_out_1)
        ffn_down = self.w_ffn_down(ffn_up)
    
        ffn_out = self.ffn_drop(ffn_down, training=kwargs["is_training"])
        res_out_2 = ln_out_1 + ffn_out
        ln_out_2 = self.ffn_ln(res_out_2)
        return ln_out_2
    
    @staticmethod
    def rel_scaled_dot_product_attention(query, q2, key, value, pos_enc, padding_mask, look_ahead_mask):
        matmul_qk = tf.matmul(query, key, transpose_b=True)
        matmul_qp = tf.matmul(q2, pos_enc, transpose_b=True)
    
        pad_zero_1 = tf.zeros(shape=(query.shape[0], key.shape[1] - query.shape[1], key.shape[1]),
                              dtype=tf.float32)
        pad_zero_2 = tf.zeros(shape=(query.shape[0], key.shape[1], 1), dtype=tf.float32)
        matmul_qp = tf.concat([pad_zero_2, tf.concat([pad_zero_1, matmul_qp], axis=1)], axis=2)
    
        matmul_qp = tf.reshape(matmul_qp, shape=(matmul_qp.shape[0], matmul_qp.shape[2], matmul_qp.shape[1]))[:, 1:, :]
    
        matmul_qp = matmul_qp[:, -query.shape[1]:, :]
    
        matmul_out = matmul_qk + matmul_qp
        dk = tf.cast(tf.shape(value)[-1], tf.float32)
        scaled_attention_logits = matmul_out / tf.math.sqrt(dk)
    
        pad_one = tf.ones(shape=(padding_mask.shape[0], key.shape[1] - query.shape[1]), dtype=tf.float32)
        padding_mask = tf.concat([pad_one, padding_mask], axis=1) 
        padding_mask = tf.tile(padding_mask[:, tf.newaxis, :],
                               multiples=[query.shape[0] // padding_mask.shape[0], query.shape[1], 1])
        look_ahead_mask = tf.tile(look_ahead_mask[tf.newaxis, :], multiples=[query.shape[0], 1, 1])
        mask = tf.multiply(padding_mask, look_ahead_mask)
        scaled_attention_logits += (1 - mask) * -1e9
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
        output = tf.matmul(attention_weights, value)
    
        return output
    

class GeneralFunction:
    @staticmethod
    def create_look_ahead_mask(q_len: int, k_len: int, same_length=True):
        mask = tf.linalg.band_part(tf.ones(shape=(k_len, k_len), dtype=tf.float32), -1, 0)[-q_len:, ...]
        if same_length:
            x = mask[:, 0:q_len]
            y = mask[:, q_len:]
            x = tf.linalg.band_part(x, 0, -1)
            mask = tf.concat([x, y], axis=1)
        return mask

    @staticmethod
    def create_pre_relative_encoding(seq_length: int, dim: int):
        pos = np.arange(start=seq_length - 1, step=-1, stop=-1, dtype=np.float32)[..., np.newaxis]
        pos = np.minimum(pos, 1000)
        all_i = np.arange(dim, dtype=np.float32)[np.newaxis, ...]
        angle_rates = 1 / np.power(10000, (2 * (all_i // 2)) / np.float32(dim))
        angle_rads = pos * angle_rates
        x = np.sin(angle_rads[:, 0::2])
        y = np.cos(angle_rads[:, 1::2])
        pos_enc = tf.convert_to_tensor(tf.concat([x, y], axis=-1), dtype=tf.float32)
        return pos_enc
    

class Main:
    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.data_obj = DataObj(dataset_name=kwargs["dataset_name"], segment_size=kwargs["segment_size"],
                                pad_id=PAD, batch_size=batch_size)
        self.cache = self.get_init_cache()
        self.model = Vanilla_XL(
            dataset_name=kwargs["dataset_name"], n_heads=kwargs["n_heads"], n_layers=kwargs["n_layers"],
            dropout_norm=kwargs["dropout_norm"], dropout_attn=kwargs["dropout_attn"],
            d_embed=kwargs["d_embed"], ffn_mul=kwargs["ffn_mul"], segment_size=kwargs["segment_size"],
            cutoffs=kwargs["cutoffs"], d_model=kwargs["d_model"]
        )

    def train(self):
        ppl, count = self.eval(is_valid=True)
        print("valid_ppl: %.3f, all_tokens:%d" % (ppl, count))
        ppl, count = self.eval(is_valid=False)
        print("test_ppl: %.3f, all_tokens:%d" % (ppl, count))
    
    def eval(self, is_valid):
        sum_loss, sum_count = 0.0, 0
        dic = self.data_obj.get_next_valid_test_segment(is_valid=is_valid)
        self.cache = self.get_init_cache()
        while dic is not None:
            loss, count, new_cache = self.eval_step(inputs=dic["input_ids"], ground_truth=dic["ground_truth"],
                                                    padding_mask=dic["input_mask"])
            self.cache = tf.concat(
                [self.cache[:, :, self.kwargs["segment_size"]:, :], new_cache], axis=2
            )
            sum_loss += loss
            sum_count += count
            dic = self.data_obj.get_next_valid_test_segment(is_valid=is_valid)
        ppl = tf.exp(sum_loss / sum_count)
        return ppl, sum_count
    
    @tf.function
    def eval_step(self, inputs, ground_truth, padding_mask):
        log_prob, new_seg_cache = self.model(inputs=inputs, training=False, padding_mask=padding_mask,
                                             cache=self.cache, ground_truth=ground_truth)
        total_loss = tf.reduce_sum(log_prob)
        count = tf.cast(tf.shape(log_prob)[0], dtype=tf.float32)
        return total_loss, count, new_seg_cache
    
    def get_init_cache(self):
        return tf.zeros(
            shape=(batch_size, self.kwargs["n_layers"], self.kwargs["mem_len"], self.kwargs["d_model"]),
            dtype=tf.float32)
    

if __name__ == "__main__":
    with open("InitWeights/WT103/weights.p", "rb") as f:
        pre_train_weights = pickle.load(f)
    dataset = "wikitext-103"
    PAD = 0
    batch_size = 1
    G = GeneralFunction()
    _cutoffs = [1, 20001, 40001, 200001, vocab_size_dic[dataset]]
    a_epoch_segment = {
        "384": 268820 // batch_size,
        "512": 201615 // batch_size,
        "256": 403230 // batch_size
    }
    E = Main(dataset_name=dataset, segment_size=128, mem_len=1600, n_heads=16, d_model=1024,
             n_layers=18, d_embed=1024, batch_size=batch_size, dropout_attn=0.2, dropout_norm=0.2,
             ffn_mul=4, cutoffs=_cutoffs, method="AC001")
    E.train()

    opened by menghuanlater 4
  • Different training steps in tf and pytorch

    Hi, I notice that the number of training steps for base_wt103 in the PyTorch code is 200K, while it is 400K in the TF scripts. However, for the large wt103 model, both are 4M.

    I am confused about the training steps, as I am training the large PyTorch model on 16 x 32GB V100s. The speed is too slow to finish the 4,000,000 steps (1300 ms per step, i.e. about 2 months; is that right?).
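
    The time estimate in the question checks out (editor's arithmetic, using only the numbers quoted above):

        steps = 4_000_000
        seconds_per_step = 1.3            # 1300 ms per step, as reported above
        days = steps * seconds_per_step / 86_400
        print(round(days, 1))             # ~60.2 days, i.e. roughly two months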

    By the way, will the TF code be faster than the PyTorch code in this project?

    Thanks for your help!

    opened by richardbaihe 3
  • TPU settings

    Hi,

    I would like to train a model on TPU, but I'm not able to find the correct settings for a v2-8 TPU.

    What parameters are needed for NUM_HOST and NUM_CORE? I tried different values, but I always get "num_replicas should be (8), got (XXX)." error messages.

    What TPU model did you use for the 1 Billion word benchmark?

    Can I create the tfrecords locally (on a non-TPU) in the train_data step?

    Thanks :)

    opened by stefan-it 3
  • PyTorch multiGPU training

    Thank you for releasing such an awesome and easy to use code!

    Could you please elaborate a little bit on the multi-GPU setup of the PyTorch implementation? More concretely, what does the parameter "gpu0_bsz" mean, and which parameters should I change to scale this code to setups with more (or fewer) than 4 GPUs?

    From the description it seems that "gpu0_bsz" is the batch size for GPU 0, but it is not clear to me why it should differ from the batch sizes on the other GPUs.
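
    For illustration (editor's sketch, not the repo's implementation): the usual motivation for an uneven split is that GPU 0 also gathers outputs and computes the loss, so it is given a smaller slice gpu0_bsz while the remaining samples are divided evenly among the other GPUs. The helper below is hypothetical.

        def split_batch(total_bsz: int, n_gpus: int, gpu0_bsz: int):
            # GPU 0 takes gpu0_bsz samples; the rest are spread over the remaining GPUs.
            rest = total_bsz - gpu0_bsz
            per_gpu, remainder = divmod(rest, n_gpus - 1)
            return [gpu0_bsz] + [per_gpu + (1 if i < remainder else 0) for i in range(n_gpus - 1)]

        print(split_batch(total_bsz=60, n_gpus=4, gpu0_bsz=4))   # [4, 19, 19, 18]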

    opened by AlexGrinch 3
  • Bounty: PTB Transformer-xl

    https://twitter.com/srush_nlp/status/1245825437240102913?s=19

    Open-Science NLP Bounty: ($100 + $100 to charity)

    Task: A notebook demonstrating experiments of this widely cited LM baseline on PTB.

    It seems many people on twitter have not been able to replicate anything near the PTB numbers reported in this paper. I would love for someone to prove me wrong and am happy to pay for it.

    opened by srush 2
  • Is transformer-xl like a seq2seq model or a word-embedding model?

    Hi, I am reading the code of the model, but I cannot find which part is the encoder and which is the decoder. So I want to know: is Transformer-XL used like a word-embedding layer, as BERT is, rather than as a seq2seq model? Thanks a lot.

    opened by huangnengCSU 2
  • [W C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:963] Warning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (function masked_fill__cuda)

    I am hitting this warning and really cannot resolve it: [W C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:963] Warning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (function masked_fill__cuda) Could you somehow fix it? Thanks!!
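
    This is only a deprecation warning, not an error, and it can be silenced by using a torch.bool mask instead of a uint8 one. A minimal editor's sketch of the kind of change involved (the tensor names used in the repo may differ):

        import torch

        scores = torch.randn(3, 3)
        mask = torch.triu(torch.ones(3, 3), diagonal=1).byte()   # old-style uint8 mask
        scores.masked_fill_(mask.bool(), float("-inf"))          # cast to bool to avoid the warning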

    opened by Arsmart1 0
  • docs: demo, experiments and live inference API on Tiyaro

    Hello Maintainer of Github repo kimiyoung/transformer-xl (@lopuhin @ijkilchenko @kimiyoung)!

    Thank you for your work on kimiyoung/transformer-xl. This GitHub project is interesting, and we think that it would be a great addition to make this work instantly discoverable & available as an API for all your users, to quickly try and use it in their applications.

    On Tiyaro, every model in kimiyoung/transformer-xl will get its own:
      • Dedicated model card (see https://console.tiyaro.ai/explore/transfo-xl-wt103)
      • Model demo (see https://console.tiyaro.ai/explore/transfo-xl-wt103/demo)
      • Unique Inference API (https://api.tiyaro.ai/explore/huggingface/1//transfo-xl-wt103)
      • Sample code snippets and swagger spec for the API

    Users will also be able to compare your model with other models of similar types on various parameters using Tiyaro Experiments (https://blog.tiyaro.ai/evaluate-openmmlabs-mmocr-models-using-tiyaro-experiments)

    I am from Tiyaro.ai (https://tiyaro.ai/). We are working on enabling developers to instantly evaluate, use and customize the world's best AI. We are constantly working on adding new features to Tiyaro EasyTrain, EasyServe & Experiments, to make the best use of your ML model, and making AI more accessible for anyone.

    Sincerely, I-Jong Lin

    opened by ijonglin 2
  • enwiki8 18 layer model .sh file

    I am trying to find the hyperparameter .sh file for the mid-size enwiki8 model with 88M parameters for PyTorch. I searched the web but unfortunately found nothing.

    opened by vasily789 0
  • feat: replace einsum with matmul for efficiency

    Hi @kimiyoung,

    I recently started using Transformer-XL in a personal project (mostly based on this repo), and I found that the speed can be improved by about 3 to 4 times by replacing all the torch.einsum calls with torch.matmul (equivalent to the @ operator). Feel free to review it if you have time.
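
    A small self-contained check (editor's sketch) of the kind of rewrite this PR describes: an einsum contraction over the head dimension can be expressed as a batched matmul plus permutes. The subscript pattern below is assumed to match the attention-score contraction in mem_transformer.py.

        import torch

        q = torch.randn(10, 4, 8, 16)   # (qlen, bsz, n_head, d_head)
        k = torch.randn(12, 4, 8, 16)   # (klen, bsz, n_head, d_head)

        ref = torch.einsum('ibnd,jbnd->ijbn', q, k)
        # matmul version: move (bsz, n_head) to the front, contract over d_head, move back.
        alt = torch.matmul(q.permute(1, 2, 0, 3), k.permute(1, 2, 3, 0)).permute(2, 3, 0, 1)
        print(torch.allclose(ref, alt, atol=1e-5))   # True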

    opened by yoyololicon 0
  • CUBLAS_STATUS_EXECUTION_FAILED and Blas GEMM launch failed

    I have followed the required TensorFlow 1.12 and Python 2.7, but the following errors are still raised. I wonder if you could help me. By the way, it is suggested on the internet that CUBLAS_STATUS_EXECUTION_FAILED is raised when the TensorFlow version does not match the CUDA version. Could you please tell me the GPU type and CUDA version you used for training? Looking forward to your reply.

    2021-09-08 23:23:08.258752: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
    Traceback (most recent call last):
      File "train_gpu.py", line 475, in tf.app.run()
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv))
      File "train_gpu.py", line 471, in main evaluate(n_token, cutoffs, "/gpu:0")
      File "train_gpu.py", line 446, in evaluate fetched = sess.run(fetches, feed_dict=feed_dict)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(31424, 1024), b.shape=(1024, 3072), m=31424, n=3072, k=1024
      [[node transformer/layer_0/rel_attn/qkv/Tensordot/MatMul (defined at /home/caoyq/transformer-xl-master/tf/model.py:54) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/layer_0/rel_attn/qkv/Tensordot/Reshape, transformer/layer_0/rel_attn/qkv/kernel/read)]]

    Caused by op u'transformer/layer_0/rel_attn/qkv/Tensordot/MatMul', defined at:
      File "train_gpu.py", line 475, in tf.app.run()
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv))
      File "train_gpu.py", line 471, in main evaluate(n_token, cutoffs, "/gpu:0")
      File "train_gpu.py", line 400, in evaluate mems=mems_i)
      File "train_gpu.py", line 218, in single_core_graph is_training=is_training)
      File "train_gpu.py", line 186, in model_fn proj_same_dim=FLAGS.proj_same_dim)

    return self.__call__(inputs, *args, **kwargs)
    

    File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 374, in call outputs = super(Layer, self).call(inputs, *args, **kwargs) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in call outputs = self.call(inputs, *args, **kwargs) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/keras/layers/core.py", line 963, in call outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]]) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2985, in tensordot ab_matmul = matmul(a_reshape, b_reshape) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2057, in matmul a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4560, in mat_mul name=name) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func return func(*args, **kwargs) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op op_def=op_def) File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in init self._traceback = tf_stack.extract_stack()

    InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(31424, 1024), b.shape=(1024, 3072), m=31424, n=3072, k=1024
      [[node transformer/layer_0/rel_attn/qkv/Tensordot/MatMul (defined at /home/caoyq/transformer-xl-master/tf/model.py:54) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/layer_0/rel_attn/qkv/Tensordot/Reshape, transformer/layer_0/rel_attn/qkv/kernel/read)]]

    opened by CaoYiqingT 0