Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (code in both PyTorch and TensorFlow)

Overview

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

This repository contains the code in both PyTorch and TensorFlow for our paper

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov (*: equal contribution)

Preprint 2018

TensorFlow

  • The source code is in the tf/ folder, supporting (1) single-node multi-gpu training, and (2) multi-host TPU training.
  • Besides the source code, we also provide pretrained "TensorFlow" models with state-of-the-art (SoTA) performances reported in the paper.
  • Please refer to tf/README.md for details.

PyTorch

  • The source code is in the pytorch/ folder, supporting single-node multi-gpu training via the module nn.DataParallel (a minimal usage sketch follows after this list).
  • Please refer to pytorch/README.md for details.
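
For readers unfamiliar with nn.DataParallel, here is a minimal, editorial sketch of the wrapping it performs; the module below is a stand-in, not the repo's actual Transformer-XL model class.

    import torch
    import torch.nn as nn

    # Stand-in module; the real model lives in the pytorch/ folder of this repo.
    model = nn.Linear(512, 512)

    if torch.cuda.device_count() > 1:
        # nn.DataParallel replicates the module on each visible GPU, splits the
        # input batch along dim 0, and gathers the outputs back on GPU 0.
        model = nn.DataParallel(model)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    x = torch.randn(8, 512, device=device)
    y = model(x)    # each GPU processes its own slice of the batch
    print(y.shape)  # torch.Size([8, 512])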

Results

Transformer-XL achieves new state-of-the-art results on multiple language modeling benchmarks. Transformer-XL is also the first to break through the 1.0 barrier on char-level language modeling. Below is a summary.

Method         | enwiki8 | text8 | One Billion Word | WT-103 | PTB (w/o finetuning)
-------------- | ------- | ----- | ---------------- | ------ | --------------------
Previous Best  | 1.06    | 1.13  | 23.7             | 20.5   | 55.5
Transformer-XL | 0.99    | 1.08  | 21.8             | 18.3   | 54.5
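
For context, the enwik8 and text8 numbers are bits per character (bpc), while the remaining columns are word-level perplexity; both are simple transforms of the average cross-entropy loss in nats. A small conversion sketch (an editorial addition, not code from this repo):

    import math

    def bits_per_character(loss_nats: float) -> float:
        # char-level cross-entropy in nats -> bits per character
        return loss_nats / math.log(2)

    def perplexity(loss_nats: float) -> float:
        # word-level cross-entropy in nats -> perplexity
        return math.exp(loss_nats)

    # e.g. a char-level loss of ~0.686 nats is ~0.99 bpc,
    # and a word-level loss of ~2.907 nats is ~18.3 ppl.
    print(bits_per_character(0.686), perplexity(2.907))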

Acknowledgement

A large portion of the getdata.sh script comes from the awd-lstm repo. Happy Language Modeling :)

Comments
  • Unable to replicate experiment results

    I am not sure where it is going wrong, but I have been training enwik8 with your TensorFlow code and default parameters on 4 GPUs for 4 days, and the loss never drops below 4.2 while the learning rate has already dropped to 0.000001. Are there any special tricks needed to replicate the experiment?

    Thanks.

    P.S. I am using Python 3 and TensorFlow 1.11.0. I have not tried the other 3 datasets yet. I also tried Transformer-XL on a private dataset (where a single-layer word-level LSTM can achieve around 60%+ accuracy), and its loss also never drops below 4.2 and its accuracy never goes above 15%.

    opened by felixhao28 10
  • Expected Results for PyTorch run_lm1b_base.sh

    I would like to know the expected test perplexity for the lm1b base model. It would be especially great if you could upload the training log file. I would like to include the results in my ICML paper.

    opened by rdspring1 9
  • OOM issue when training 1 billion corpus

    I am trying to train on the One Billion Word corpus on a Tesla P40. The following values are being used:

    N_LAYER = 12, D_MODEL = 512, D_EMBED = 512, D_INNER = 2048, D_HEAD = 64

    I also tried with a BSZ of 128, but it still gives an OOM error.

    opened by deep-speech 7
  • Quick question on comparison against BERT

    Thanks for the code! I am sure my question will be asked over and over again in the near future. I have also read your paper, which is all about comparison against the vanilla Transformer.

    But still, in terms of performance, have you compared your model against BERT? I know it may not be a 100% fair comparison, but at the end of the day, which one (BERT or Transformer-XL) is better on typical NLP tasks? Thanks.

    opened by hohoCode 6
  • Train a new corpus !

    What changes do we need to make inside the script to train on a new corpus?

    I have checked the script, and there are a lot of if conditions that depend on each corpus.

    opened by agemagician 5
  • can not reproduce sota wikitext103 results

    I use the pretrained-xl weights and the same vocab to build Transformer-XL large (we use TensorFlow 2.0) to evaluate the test set. But in my experiments, I find that {tgt_len=128, mem_len=1600, clamp_len=1000} only reaches a test ppl of around 35, {tgt_len=384, mem_len=384, clamp_len=1000} reaches around 24, and {tgt_len=2048, mem_len=2048, clamp_len=1000} reaches around 20, but none of these settings reach the paper result of 18.3. Why?

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import tensorflow as tf
    from tensorflow import keras
    import numpy as np
    import pickle
    from DataService import DataObjForWT_PTB as DataObj
    import os

    os.environ["CUDA_VISIBLE_DEVICES"] = "2"

    vocab_size_dic = { "wikitext-103": 267736, "enwiki8": 0, "text8": 0 }

    class Vanilla_XL(keras.Model):
    def __init__(self, dataset_name: str, segment_size: int, dropout_attn, dropout_norm,
                 n_layers, n_heads, d_embed, d_model, ffn_mul, cutoffs):
        super(Vanilla_XL, self).__init__()
        self.vocab_size = vocab_size_dic[dataset_name]
        self.segment_size = segment_size
        self.dropout_attn = dropout_attn
        self.dropout_norm = dropout_norm
        self.d_model = d_model
        self.d_embed = d_embed
        self.ffn_mul = ffn_mul
        self.cutoffs = cutoffs
        self.n_layers = n_layers
        self.n_heads = n_heads

        # embedding
        self.token_embedding = AdaptiveEmbedding(
            cutoffs=self.cutoffs, d_embed=self.d_embed, embed_drop_rate=self.dropout_norm,
            input_dim=self.vocab_size, out_dim=d_model, div_value=4
        )
    
        self.all_encoder_layers = []
        for layer in range(self.n_layers):
            self.all_encoder_layers.append(
                SingleTransformerBlock(
                    d_model=d_model, ffn_size=self.ffn_mul * d_model, n_heads=self.n_heads,
                    dropout_attn=self.dropout_attn, dropout_norm=self.dropout_norm,
                    cur_layer=layer
                )
            )
    
        self.softmax_out_layer = AdaptiveSoftmax(cutoffs=self.cutoffs, d_embed=self.d_embed,
                                                 adaptive_embedding_obj=self.token_embedding, div_value=4)
    
    def call(self, inputs, training=None, **kwargs):
        cache = kwargs["cache"] 
        padding_mask = kwargs["padding_mask"]
        segment_embedding = self.token_embedding(inputs=inputs, is_training=training)
        new_cache = [segment_embedding[:, tf.newaxis, :, :]]
    
        cur_layer_out = segment_embedding
        for layer in range(self.n_layers):
            cur_layer_out = self.all_encoder_layers[layer](
                inputs=cur_layer_out, cache=cache[:, layer, :, :],
                is_training=training, padding_mask=padding_mask
            )
            if layer != self.n_layers - 1:
                new_cache.append(cur_layer_out[:, tf.newaxis, :, :])
        final_out = cur_layer_out
        g_t = kwargs["ground_truth"]
        no_pad_indices = tf.where(tf.not_equal(g_t, PAD))
        final_out = tf.gather_nd(final_out, no_pad_indices)
        g_t = tf.gather_nd(g_t, no_pad_indices)
        log_prob = self.softmax_out_layer(inputs=final_out, ground_truth=g_t)
        return log_prob, tf.concat(new_cache, axis=1)
    

    class AdaptiveEmbedding(keras.layers.Layer):
    def __init__(self, cutoffs, embed_drop_rate, input_dim, out_dim, d_embed, div_value=4):
        super(AdaptiveEmbedding, self).__init__()
        assert isinstance(cutoffs, list)
        self.cutoffs = cutoffs
        self.input_dim = input_dim
        self.out_dim = out_dim
        self.d_embed = d_embed
        self.div_value = div_value

        self.cluster_embedding_list = []
        self.projection_list = []
        self.dropout_layer = keras.layers.Dropout(rate=embed_drop_rate)
    
        for i in range(len(self.cutoffs) - 1):
            in_dims = self.cutoffs[i + 1] - self.cutoffs[i]
            o_dims = self.d_embed // (self.div_value ** i)
            self.cluster_embedding_list.append(
                keras.layers.Embedding(
                    input_dim=in_dims, output_dim=o_dims,
                    weights=[tf.convert_to_tensor(
                        pre_train_weights["transformer/adaptive_embed/cutoff_%d/lookup_table:0" % i],
                        dtype=tf.float32)]
                ))
            self.projection_list.append(
                tf.Variable(
                    initial_value=tf.convert_to_tensor(
                        pre_train_weights["transformer/adaptive_embed/cutoff_%d/proj_W:0" % i]
                    ), dtype=tf.float32
                )
            )
    
    def call(self, inputs, **kwargs):
        x = []  # collects the masked, projected embedding of each cluster
        for i in range(len(self.cutoffs) - 1):
            start = self.cutoffs[i]
            end = self.cutoffs[i + 1] 
            actual = tf.math.logical_and(inputs >= start, inputs < end)
            mask = tf.expand_dims(tf.cast(actual, dtype=tf.float32), axis=2)
            new_input = inputs - start
            new_input = tf.where(actual, new_input, tf.zeros_like(new_input, dtype=tf.int32))
            embed = self.cluster_embedding_list[i](inputs=new_input)
            linear_proj = tf.matmul(embed, self.projection_list[i], transpose_b=False)
            x.append(tf.multiply(linear_proj, mask))
        out = tf.zeros_like(x[0], dtype=tf.float32)
        for j in range(len(x)):
            out += x[j]
        out *= self.out_dim ** 0.5
        return self.dropout_layer(out, training=kwargs["is_training"])
    

    class AdaptiveSoftmax(keras.layers.Layer):
    def __init__(self, cutoffs, d_embed, adaptive_embedding_obj, div_value=4):
        super(AdaptiveSoftmax, self).__init__()
        self.cutoffs = cutoffs
        self.d_embed = d_embed
        self.div_value = div_value
        assert isinstance(adaptive_embedding_obj, AdaptiveEmbedding)
        self.adaptive_embedding_obj = adaptive_embedding_obj
        self.tail_clusters_embedding = keras.layers.Embedding(
            input_dim=len(self.cutoffs) - 2, output_dim=self.d_embed,
            weights=[tf.convert_to_tensor(
                pre_train_weights["transformer/adaptive_softmax/cutoff_0/cluster_W:0"])]
        )
        self.clusters_bias = tf.Variable(
            initial_value=tf.convert_to_tensor(
                pre_train_weights["transformer/adaptive_softmax/cutoff_0/cluster_b:0"]),
            dtype=tf.float32
        )

        self.head_projection = tf.Variable(
            initial_value=tf.convert_to_tensor(
                pre_train_weights["transformer/adaptive_softmax/cutoff_0/proj:0"]
            ), dtype=tf.float32
        )
    
        self.bias_list = []
        for i in range(len(self.cutoffs) - 1):
            self.bias_list.append(
                tf.convert_to_tensor(pre_train_weights["transformer/adaptive_softmax/cutoff_%d/b:0" % i])
            )
        self.projection_list = self.adaptive_embedding_obj.projection_list
    
    def call(self, inputs, **kwargs):
        x = []
        g_t = kwargs["ground_truth"]
        head_all_vocab_embedding = self.adaptive_embedding_obj.cluster_embedding_list[0](
            inputs=tf.convert_to_tensor([i for i in range(self.cutoffs[1] - self.cutoffs[0])], dtype=tf.int32)
        )  # (c0, dim)
    
        all_tail_cluster_embedding = self.tail_clusters_embedding(
            inputs=tf.convert_to_tensor([i for i in range(len(self.cutoffs) - 2)], dtype=tf.int32)
        )  # (3, dim)
        head_embedding = tf.concat([head_all_vocab_embedding, all_tail_cluster_embedding], axis=0)
        head_proj_out = tf.matmul(inputs, self.head_projection, transpose_b=True)
        head_logits = tf.matmul(head_proj_out, head_embedding, transpose_b=True)
        head_logits += tf.concat([self.bias_list[0], self.clusters_bias], axis=0)
        head_softmax = tf.nn.softmax(head_logits, axis=-1)
    
        for i in range(len(self.cutoffs) - 1):
            start = self.cutoffs[i]
            end = self.cutoffs[i + 1]
            cur_cluster_indices = tf.where(tf.math.logical_and(g_t >= start, g_t < end))
            seq_len = tf.shape(cur_cluster_indices)[0]
            cur_g_t = tf.gather_nd(g_t, cur_cluster_indices)
            cur_g_t = cur_g_t - start
            cur_g_t = tf.expand_dims(cur_g_t, axis=1)
            first_dim = tf.expand_dims(tf.range(seq_len, dtype=tf.int32), axis=1)
            r_s = tf.concat([first_dim, cur_g_t], axis=1)
            if i == 0: 
                cur_softmax = tf.gather_nd(head_softmax, cur_cluster_indices)
                cur_out_prob = tf.gather_nd(cur_softmax, r_s)
                cur_out_prob = tf.where(cur_out_prob >= 1e-9, cur_out_prob,
                                        tf.ones_like(cur_out_prob, dtype=tf.float32) * 1e-9)
                cur_log_prob = -tf.math.log(cur_out_prob)
            else:
                pre_softmax = tf.gather_nd(head_softmax, cur_cluster_indices)[..., self.cutoffs[1] + i - 2]
                pre_softmax = tf.where(pre_softmax > 1e-9, pre_softmax,
                                       tf.ones_like(pre_softmax, dtype=tf.float32) * 1e-9)
                pre_log_prob = -tf.math.log(pre_softmax)
    
                cur_inputs = tf.gather_nd(inputs, cur_cluster_indices)
    
                all_cur_cluster_embedding = self.adaptive_embedding_obj.cluster_embedding_list[i](
                    tf.convert_to_tensor([i for i in range(end - start)], dtype=tf.int32)
                )
                cur_inputs = tf.matmul(cur_inputs, self.projection_list[i], transpose_b=True)
                cur_logits = tf.matmul(cur_inputs, all_cur_cluster_embedding, transpose_b=True)
                cur_logits += self.bias_list[i]
    
                cur_softmax = tf.nn.softmax(cur_logits, axis=-1)
                cur_out_prob = tf.gather_nd(cur_softmax, r_s)
                cur_out_prob = tf.where(cur_out_prob >= 1e-9, cur_out_prob,
                                        tf.ones_like(cur_out_prob, dtype=tf.float32) * 1e-9)
                cur_log_prob = -tf.math.log(cur_out_prob)
    
                cur_log_prob += pre_log_prob
            x.append(cur_log_prob)
        return tf.concat(x, axis=0)
    

    class SingleTransformerBlock(keras.layers.Layer):
    def __init__(self, d_model, ffn_size, n_heads, dropout_attn, dropout_norm, cur_layer):
        super(SingleTransformerBlock, self).__init__()
        self.n_heads = n_heads
        self.cur_layer = cur_layer
        self.d_model = d_model

        self.w_query = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/qkv/kernel:0" % cur_layer][:, 0:d_model]
            )
        )
        self.w_key = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/qkv/kernel:0" % cur_layer][:, d_model:2 * d_model]
            )
        )
        self.w_value = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/qkv/kernel:0" % cur_layer][:, 2 * d_model:]
            )
        )
        self.w_rel_pos = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/r/kernel:0" % cur_layer]
            )
        )
    
        self.w_attn = keras.layers.Dense(
            units=d_model, use_bias=False,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/o/kernel:0" % cur_layer]
            )
        )
    
        self.w_ffn_up = keras.layers.Dense(
            units=ffn_size, activation="relu",
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_1/kernel:0" % cur_layer]
            ),
            bias_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_1/bias:0" % cur_layer]
            )
        )
        self.w_ffn_down = keras.layers.Dense(
            units=d_model,
            kernel_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_2/kernel:0" % cur_layer]
            ),
            bias_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/layer_2/bias:0" % cur_layer]
            )
        )
    
        x_ut = tf.convert_to_tensor(pre_train_weights["transformer/r_w_bias:0"][cur_layer],
                                    dtype=tf.float32)  # (head, dim // head)
        x_vt = tf.convert_to_tensor(pre_train_weights["transformer/r_r_bias:0"][cur_layer],
                                    dtype=tf.float32)  # (head, dim // head)
    
        self.ut = tf.Variable(initial_value=tf.reshape(
            x_ut, shape=(d_model,)
        ), dtype=tf.float32, trainable=True)
        self.vt = tf.Variable(initial_value=tf.reshape(
            x_vt, shape=(d_model,)
        ), dtype=tf.float32, trainable=True)
    
        self.attn_drop = keras.layers.Dropout(rate=dropout_attn)
        self.ffn_drop = keras.layers.Dropout(rate=dropout_norm)
    
        self.attn_ln = keras.layers.LayerNormalization(
            gamma_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/LayerNorm/gamma:0" % cur_layer]
            ), beta_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/rel_attn/LayerNorm/beta:0" % cur_layer]
            )
        )
        self.ffn_ln = keras.layers.LayerNormalization(
            gamma_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/LayerNorm/gamma:0" % cur_layer]
            ), beta_initializer=tf.constant_initializer(
                pre_train_weights["transformer/layer_%d/ff/LayerNorm/beta:0" % cur_layer]
            )
        )
    
    def call(self, inputs, **kwargs):
        cache = kwargs["cache"]
        fusion_inputs = tf.concat([cache, inputs], axis=1)
    
        query = tf.concat(tf.split(self.w_query(inputs) + self.ut, axis=2, num_or_size_splits=self.n_heads), axis=0)
        q2 = tf.concat(tf.split(self.w_query(inputs) + self.vt, axis=2, num_or_size_splits=self.n_heads), axis=0)
        key = tf.concat(tf.split(self.w_key(fusion_inputs), axis=2, num_or_size_splits=self.n_heads), axis=0)
        value = tf.concat(tf.split(self.w_value(fusion_inputs), axis=2, num_or_size_splits=self.n_heads), axis=0)
    
        pos_enc = G.create_pre_relative_encoding(seq_length=fusion_inputs.shape[1], dim=fusion_inputs.shape[2])
        pos_enc = tf.tile(self.w_rel_pos(pos_enc)[tf.newaxis, ...], multiples=[fusion_inputs.shape[0], 1, 1])
        pos_enc = tf.concat(tf.split(pos_enc, axis=2, num_or_size_splits=self.n_heads), axis=0)
    
        attn_out = self.rel_scaled_dot_product_attention(query=query, key=key, value=value, pos_enc=pos_enc,
                                                         padding_mask=kwargs["padding_mask"], q2=q2,
                                                         look_ahead_mask=G.create_look_ahead_mask(
                                                             q_len=inputs.shape[1], k_len=fusion_inputs.shape[1]))
        attn_out = tf.concat(tf.split(attn_out, axis=0, num_or_size_splits=self.n_heads), axis=2)
    
        attn_out = self.w_attn(attn_out)
        attn_out = self.attn_drop(attn_out, training=kwargs["is_training"])
        res_out_1 = attn_out + inputs
        ln_out_1 = self.attn_ln(res_out_1)
    
        ffn_up = self.w_ffn_up(ln_out_1)
        ffn_down = self.w_ffn_down(ffn_up)
    
        ffn_out = self.ffn_drop(ffn_down, training=kwargs["is_training"])
        res_out_2 = ln_out_1 + ffn_out
        ln_out_2 = self.ffn_ln(res_out_2)
        return ln_out_2
    
    @staticmethod
    def rel_scaled_dot_product_attention(query, q2, key, value, pos_enc, padding_mask, look_ahead_mask):
        matmul_qk = tf.matmul(query, key, transpose_b=True)
        matmul_qp = tf.matmul(q2, pos_enc, transpose_b=True)
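        # The block below implements the Transformer-XL "relative shift": pad the
        # query-position scores with zeros, reshape so the positional axis is offset
        # by one, then slice back so each query row lines up with its own relative
        # distances before being added to the content scores.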
    
        pad_zero_1 = tf.zeros(shape=(query.shape[0], key.shape[1] - query.shape[1], key.shape[1]),
                              dtype=tf.float32)
        pad_zero_2 = tf.zeros(shape=(query.shape[0], key.shape[1], 1), dtype=tf.float32)
        matmul_qp = tf.concat([pad_zero_2, tf.concat([pad_zero_1, matmul_qp], axis=1)], axis=2)
    
        matmul_qp = tf.reshape(matmul_qp, shape=(matmul_qp.shape[0], matmul_qp.shape[2], matmul_qp.shape[1]))[:, 1:, :]
    
        matmul_qp = matmul_qp[:, -query.shape[1]:, :]
    
        matmul_out = matmul_qk + matmul_qp
        dk = tf.cast(tf.shape(value)[-1], tf.float32)
        scaled_attention_logits = matmul_out / tf.math.sqrt(dk)
    
        pad_one = tf.ones(shape=(padding_mask.shape[0], key.shape[1] - query.shape[1]), dtype=tf.float32)
        padding_mask = tf.concat([pad_one, padding_mask], axis=1) 
        padding_mask = tf.tile(padding_mask[:, tf.newaxis, :],
                               multiples=[query.shape[0] // padding_mask.shape[0], query.shape[1], 1])
        look_ahead_mask = tf.tile(look_ahead_mask[tf.newaxis, :], multiples=[query.shape[0], 1, 1])
        mask = tf.multiply(padding_mask, look_ahead_mask)
        scaled_attention_logits += (1 - mask) * -1e9
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
        output = tf.matmul(attention_weights, value)
    
        return output
    

    class GeneralFunction:
    @staticmethod
    def create_look_ahead_mask(q_len: int, k_len: int, same_length=True):
        mask = tf.linalg.band_part(tf.ones(shape=(k_len, k_len), dtype=tf.float32), -1, 0)[-q_len:, ...]
        if same_length:
            x = mask[:, 0: q_len]
            y = mask[:, q_len:]
            x = tf.linalg.band_part(x, 0, -1)
            mask = tf.concat([x, y], axis=1)
        return mask

    @staticmethod
    def create_pre_relative_encoding(seq_length: int, dim: int):
        pos = np.arange(start=seq_length - 1, step=-1, stop=-1, dtype=np.float32)[..., np.newaxis]
        pos = np.minimum(pos, 1000)
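        # relative positions are clamped at 1000, mirroring the clamp_len=1000 evaluation setting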
        all_i = np.arange(dim, dtype=np.float32)[np.newaxis, ...]
        angle_rates = 1 / np.power(10000, (2 * (all_i // 2)) / np.float32(dim))
        angle_rads = pos * angle_rates
        x = np.sin(angle_rads[:, 0::2])
        y = np.cos(angle_rads[:, 1::2])
        pos_enc = tf.convert_to_tensor(tf.concat([x, y], axis=-1), dtype=tf.float32)
        return pos_enc
    

    class Main:
    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.data_obj = DataObj(dataset_name=kwargs["dataset_name"], segment_size=kwargs["segment_size"],
                                pad_id=PAD, batch_size=batch_size)
        self.cache = self.get_init_cache()
        self.model = Vanilla_XL(
            dataset_name=kwargs["dataset_name"], n_heads=kwargs["n_heads"], n_layers=kwargs["n_layers"],
            dropout_norm=kwargs["dropout_norm"], dropout_attn=kwargs["dropout_attn"],
            d_embed=kwargs["d_embed"], ffn_mul=kwargs["ffn_mul"], segment_size=kwargs["segment_size"],
            cutoffs=kwargs["cutoffs"], d_model=kwargs["d_model"]
        )

    def train(self):
        ppl, count = self.eval(is_valid=True)
        print("valid_ppl: %.3f, all_tokens:%d" % (ppl, count))
        ppl, count = self.eval(is_valid=False)
        print("test_ppl: %.3f, all_tokens:%d" % (ppl, count))
    
    def eval(self, is_valid):
        sum_loss, sum_count = 0.0, 0
        dic = self.data_obj.get_next_valid_test_segment(is_valid=is_valid)
        self.cache = self.get_init_cache()
        while dic is not None:
            loss, count, new_cache = self.eval_step(inputs=dic["input_ids"], ground_truth=dic["ground_truth"],
                                                    padding_mask=dic["input_mask"])
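            # slide the memory window: drop the oldest segment_size positions
            # and append the cache produced by the new segment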
            self.cache = tf.concat(
                [self.cache[:, :, self.kwargs["segment_size"]:, :], new_cache], axis=2
            )
            sum_loss += loss
            sum_count += count
            dic = self.data_obj.get_next_valid_test_segment(is_valid=is_valid)
        ppl = tf.exp(sum_loss / sum_count)
        return ppl, sum_count
    
    @tf.function
    def eval_step(self, inputs, ground_truth, padding_mask):
        log_prob, new_seg_cache = self.model(inputs=inputs, training=False, padding_mask=padding_mask,
                                             cache=self.cache, ground_truth=ground_truth)
        total_loss = tf.reduce_sum(log_prob)
        count = tf.cast(tf.shape(log_prob)[0], dtype=tf.float32)
        return total_loss, count, new_seg_cache
    
    def get_init_cache(self):
        return tf.zeros(
            shape=(batch_size, self.kwargs["n_layers"], self.kwargs["mem_len"], self.kwargs["d_model"]),
            dtype=tf.float32)
    

    if __name__ == "__main__":
        with open("InitWeights/WT103/weights.p", "rb") as f:
            pre_train_weights = pickle.load(f)
        dataset = "wikitext-103"
        PAD = 0
        batch_size = 1
        G = GeneralFunction()
        _cutoffs = [1, 20001, 40001, 200001, vocab_size_dic[dataset]]
        a_epoch_segment = {
            "384": 268820 // batch_size,
            "512": 201615 // batch_size,
            "256": 403230 // batch_size
        }
        E = Main(dataset_name=dataset, segment_size=128, mem_len=1600, n_heads=16, d_model=1024,
                 n_layers=18, d_embed=1024, batch_size=batch_size, dropout_attn=0.2, dropout_norm=0.2,
                 ffn_mul=4, cutoffs=_cutoffs, method="AC001")
        E.train()

    opened by menghuanlater 4
  • Different training steps in tf and pytorch

    Hi, I notice that the number of training steps for base_wt103 in the PyTorch code is 200K, while it is 400K in the TF scripts. However, for the large wt103 model, both are 4M.

    I am confused about the training steps, as I am training the large PyTorch model with 16 x 32GB V100s. The speed is too slow to finish the 4,000,000 steps (1300 ms per step, about 2 months; is that right?).

    By the way, will the TF code be faster than the PyTorch code in this project?

    Thanks for your help!

    opened by richardbaihe 3
  • TPU settings

    Hi,

    I would like to train a model on TPU, but I'm not able to find the correct settings for a v2-8 TPU.

    What parameters are needed for NUM_HOST and NUM_CORE? I tried different values, but I always get "num_replicas should be (8), got (XXX)." error messages.

    What TPU model did you use for the 1 Billion word benchmark?

    Can I create the tfrecords locally (on a non-TPU) in the train_data step?

    Thanks :)

    opened by stefan-it 3
  • PyTorch multiGPU training

    Thank you for releasing such an awesome and easy to use code!

    Could you please elaborate a little on the multi-GPU setup of the PyTorch implementation? More concretely, what does the parameter "gpu0_bsz" mean, and what parameters should I change to scale this code to setups with more (or fewer) than 4 GPUs?

    From the description it seems that "gpu0_bsz" is the batch size for GPU 0, but it is not clear to me why it should differ from the batch sizes on the other GPUs.
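
    As an editorial sketch of the assumed meaning (not taken from the repo's implementation): in DataParallel-style training, GPU 0 also holds the master parameters and the gathered outputs, so it tends to run out of memory first; "gpu0_bsz" appears to let GPU 0 take a smaller slice of the batch than the other devices. A hypothetical split:

        # Hypothetical split of a batch of 60 across 4 GPUs, with a smaller share on GPU 0.
        total_bsz, n_gpus, gpu0_bsz = 60, 4, 12   # gpu0_bsz: assumed batch slice for GPU 0
        rest = (total_bsz - gpu0_bsz) // (n_gpus - 1)
        print([gpu0_bsz] + [rest] * (n_gpus - 1))  # -> [12, 16, 16, 16]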

    opened by AlexGrinch 3
  • Bounty: PTB Transformer-xl

    https://twitter.com/srush_nlp/status/1245825437240102913?s=19

    Open-Science NLP Bounty: ($100 + $100 to charity)

    Task: A notebook demonstrating experiments of this widely cited LM baseline on PTB.

    It seems many people on twitter have not been able to replicate anything near the PTB numbers reported in this paper. I would love for someone to prove me wrong and am happy to pay for it.

    opened by srush 2
  • Is transformer-xl like a seq2seq model or a word-embedding model?

    Hi, I am reading the code of the model, but I cannot find which part is the encoder and which is the decoder. So I would like to know: is Transformer-XL more like a word-embedding layer, as in BERT, rather than a seq2seq model? Thanks a lot.

    opened by huangnengCSU 2
  • [W C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:963] Warning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (function masked_fill__cuda)

    I get this warning and really cannot solve it: [W C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:963] Warning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (function masked_fill__cuda) Could you somehow fix it? Thanks!
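
    Not a fix to this repo specifically, but as a general illustration: the warning disappears when the mask handed to masked_fill_ is cast to torch.bool first.

        import torch

        scores = torch.zeros(2, 3)
        mask_uint8 = torch.tensor([[0, 1, 0], [1, 0, 0]], dtype=torch.uint8)

        # Deprecated: a uint8 mask triggers the warning quoted above.
        # scores.masked_fill_(mask_uint8, float("-inf"))

        # Preferred: convert the mask to bool before masking.
        scores.masked_fill_(mask_uint8.bool(), float("-inf"))
        print(scores)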

    opened by Arsmart1 0
  • docs: demo, experiments and live inference API on Tiyaro

    Hello Maintainer of Github repo kimiyoung/transformer-xl (@lopuhin @ijkilchenko @kimiyoung)!

    Thank you for your work on kimiyoung/transformer-xl. This GitHub project is interesting, and we think that it would be a great addition to make this work instantly discoverable & available as an API for all your users, to quickly try and use it in their applications.

    On Tiyaro, every model in kimiyoung/transformer-xl will get its own:
    - Dedicated model card (see https://console.tiyaro.ai/explore/transfo-xl-wt103)
    - Model demo (see https://console.tiyaro.ai/explore/transfo-xl-wt103/demo)
    - Unique Inference API (https://api.tiyaro.ai/explore/huggingface/1//transfo-xl-wt103)
    - Sample code snippets and swagger spec for the API

    Users will also be able to compare your model with other models of similar types on various parameters using Tiyaro Experiments (https://blog.tiyaro.ai/evaluate-openmmlabs-mmocr-models-using-tiyaro-experiments)

    I am from Tiyaro.ai (https://tiyaro.ai/). We are working on enabling developers to instantly evaluate, use and customize the world’s best AI. We are constantly working on adding new features to Tiyaro EasyTrain, EasyServe & Experiments, to make the best use of your ML model, and making AI more accessible for anyone.

    Sincerely, I-Jong Lin

    opened by ijonglin 2
  • enwiki8 18 layer model .sh file

    I am trying to find the hyperparameter .sh file for the mid-size enwiki8 model with 88M parameters for PyTorch. I searched the web but unfortunately found nothing.

    opened by vasily789 0
  • feat: replace einsum with matmul for efficiency

    Hi @kimiyoung,

    I recently started to use Transformer-XL in a personal project (mostly based on this repo), and I found that the speed can be improved by about 3 to 4 times by replacing all the torch.einsum calls with torch.matmul (equivalent to @). Feel free to review it if you have time.
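
    As a generic illustration of the einsum-to-matmul rewrite (the contractions in this repo use different axis layouts, so this is only a sketch):

        import torch

        q = torch.randn(4, 8, 16)   # (batch, query_len, dim)
        k = torch.randn(4, 8, 16)   # (batch, key_len, dim)

        # einsum form of a batched attention-score contraction
        scores_einsum = torch.einsum("bid,bjd->bij", q, k)

        # matmul form: the same result, usually dispatched to a faster batched GEMM
        scores_matmul = q @ k.transpose(-1, -2)

        print(torch.allclose(scores_einsum, scores_matmul, atol=1e-5))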

    opened by yoyololicon 0
  • CUBLAS_STATUS_EXECUTION_FAILED and Blas GEMM launch failed

    I have followed the required TensorFlow 1.12 and Python 2.7, but the following errors are still raised. I wonder if you could help me. By the way, it is suggested on the internet that CUBLAS_STATUS_EXECUTION_FAILED is raised when the TensorFlow version does not match the CUDA version. Could you please tell me the GPU type and CUDA version you used for training? Looking forward to your reply.

    2021-09-08 23:23:08.258752: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
    Traceback (most recent call last):
      File "train_gpu.py", line 475, in tf.app.run()
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv))
      File "train_gpu.py", line 471, in main evaluate(n_token, cutoffs, "/gpu:0")
      File "train_gpu.py", line 446, in evaluate fetched = sess.run(fetches, feed_dict=feed_dict)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(31424, 1024), b.shape=(1024, 3072), m=31424, n=3072, k=1024
      [[node transformer/layer_0/rel_attn/qkv/Tensordot/MatMul (defined at /home/caoyq/transformer-xl-master/tf/model.py:54) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/layer_0/rel_attn/qkv/Tensordot/Reshape, transformer/layer_0/rel_attn/qkv/kernel/read)]]

    Caused by op u'transformer/layer_0/rel_attn/qkv/Tensordot/MatMul', defined at:
      File "train_gpu.py", line 475, in tf.app.run()
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv))
      File "train_gpu.py", line 471, in main evaluate(n_token, cutoffs, "/gpu:0")
      File "train_gpu.py", line 400, in evaluate mems=mems_i)
      File "train_gpu.py", line 218, in single_core_graph is_training=is_training)
      File "train_gpu.py", line 186, in model_fn proj_same_dim=FLAGS.proj_same_dim)
        return self.__call__(inputs, *args, **kwargs)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 374, in call outputs = super(Layer, self).call(inputs, *args, **kwargs)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in call outputs = self.call(inputs, *args, **kwargs)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/keras/layers/core.py", line 963, in call outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2985, in tensordot ab_matmul = matmul(a_reshape, b_reshape)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2057, in matmul a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4560, in mat_mul name=name)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func return func(*args, **kwargs)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op op_def=op_def)
      File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__ self._traceback = tf_stack.extract_stack()

    InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(31424, 1024), b.shape=(1024, 3072), m=31424, n=3072, k=1024
      [[node transformer/layer_0/rel_attn/qkv/Tensordot/MatMul (defined at /home/caoyq/transformer-xl-master/tf/model.py:54) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/layer_0/rel_attn/qkv/Tensordot/Reshape, transformer/layer_0/rel_attn/qkv/kernel/read)]]

    opened by CaoYiqingT 0
Owner
Zhilin Yang