Sequence modeling benchmarks and temporal convolutional networks

Overview

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN)

This repository contains the experiments done in the work An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling by Shaojie Bai, J. Zico Kolter and Vladlen Koltun.

We specifically target a comprehensive set of tasks that have been repeatedly used to compare the effectiveness of different recurrent networks, and evaluate a simple, generic but powerful (purely) convolutional network on the recurrent nets' home turf.

Experiments are done in PyTorch. If you find this repository helpful, please cite our work:

@article{BaiTCN2018,
	author    = {Shaojie Bai and J. Zico Kolter and Vladlen Koltun},
	title     = {An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling},
	journal   = {arXiv:1803.01271},
	year      = {2018},
}

Domains and Datasets

Update: The code should be directly runnable with PyTorch v1.0.0 or above (PyTorch v1.3.0 or later strongly recommended). Older versions of PyTorch are no longer supported.

This repository contains benchmarks for the following tasks, with details explained in each sub-directory:

  • The Adding Problem with various T (we evaluated on T=200, 400, 600)
  • Copying Memory Task with various T (we evaluated on T=500, 1000, 2000)
  • Sequential MNIST digit classification
  • Permuted Sequential MNIST (based on Seq. MNIST, but more challenging)
  • JSB Chorales polyphonic music
  • Nottingham polyphonic music
  • PennTreebank [SMALL] word-level language modeling (LM)
  • Wikitext-103 [LARGE] word-level LM
  • LAMBADA [LARGE] word-level LM and textual understanding
  • PennTreebank [MEDIUM] char-level LM
  • text8 [LARGE] char-level LM

Some of the large datasets are not included in this repo; we use the observations package, which can be easily installed via pip, to download them.
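
As a hedged example, the snippet below sketches how one of the smaller corpora could be fetched with observations; the loader name (ptb) and the return order are assumptions here, so check the observations documentation for the datasets you actually need:

    # pip install observations
    from observations import ptb

    # Downloads Penn Treebank into ./data on first call and returns the
    # train/test/validation splits (assumed return order; see the observations docs).
    x_train, x_test, x_valid = ptb('./data')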

Usage

Each task is contained in its own directory, with the following structure:

[TASK_NAME] /
    data/
    [TASK_NAME]_test.py
    models.py
    utils.py

To run the TCN model on a task, one only needs to run [TASK_NAME]_test.py (e.g. add_test.py). To tune the hyperparameters, one can specify them via argument options, which can be seen via the -h flag.
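
The underlying TCN building block can also be used directly in your own code. The snippet below is a minimal sketch (the channel sizes, kernel size and input shape are illustrative); it assumes the repository root is on your Python path so that TCN/tcn.py is importable as TCN.tcn:

    import torch
    from TCN.tcn import TemporalConvNet

    # Four temporal blocks of 25 channels each, kernel size 7 (illustrative values).
    net = TemporalConvNet(num_inputs=1, num_channels=[25] * 4, kernel_size=7, dropout=0.0)

    x = torch.randn(16, 1, 400)   # (batch, input channels, sequence length)
    y = net(x)                    # (batch, 25, 400): one causal output per time step
    print(y.shape)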

Comments
  • About different sequence input

    About different sequence input

    I have sequences of very different lengths: the shortest is about 100 words and the longest about 5000. I tried zero-padding everything to the same length of 5000, but the classification result is terrible. However, if I keep the original lengths and just use a batch size of 1, it works well. I don't know why this happens.

    opened by LemoJa 10
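
    Regarding the comment above: a common workaround (not something this repo provides) is to pad only up to the longest sequence in each batch and read the features at each sequence's last real time step, which is safe because the TCN is causal. A minimal sketch with made-up shapes and a stand-in for the TCN output:

        import torch
        import torch.nn as nn
        from torch.nn.utils.rnn import pad_sequence

        # Hypothetical batch of variable-length feature sequences, each of shape (L_i, C).
        seqs = [torch.randn(120, 32), torch.randn(480, 32), torch.randn(75, 32)]
        lengths = torch.tensor([s.size(0) for s in seqs])

        x = pad_sequence(seqs, batch_first=True).transpose(1, 2)  # (N, C, L_max), padded at the end
        # y = tcn(x)                                              # (N, C_out, L_max)
        y = torch.randn(len(seqs), 64, x.size(2))                 # stand-in for the TCN output

        # Causality means the output at each sequence's last real step sees only real inputs.
        last = y[torch.arange(y.size(0)), :, lengths - 1]         # (N, C_out)
        logits = nn.Linear(64, 10)(last)                          # classify from the last real step
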
  • Can the TCN module be made faster?

    Can the TCN module be made faster?

    I'm using your TCN module for a language modeling task. My code follows the structure of your char_cnn code. It works but the performance is very bad compared to an LSTM network. Each epoch with the TCN network takes about 10 times longer. Do you know if the performance can be improved? Here is the forward method from the TCN class:

        def forward(self, x):
            emb = self.drop(self.encoder(x))
            y = self.tcn(emb.transpose(1, 2))
            o = self.decoder(y.transpose(1, 2))
            return o.contiguous()
    

    Perhaps it is the transpose calls that are making the code slow?

    opened by bjourne 8
  • When I use a sequence length of 28 for LSTM and TCN, LSTM is much faster than TCN.

    When I use a sequence length of 28 for LSTM and TCN, LSTM is much faster than TCN.

    It seems to me that LSTM is faster when the sequence length is short (say 28). When the sequence length is long (say 784), LSTM will be much slower than TCN.

    It seems to me that for TCN, the computation time is independent of the sequence length.

    Am I correct?

    opened by KinWaiCheuk 7
  • RNN/LSTM Baselines?

    RNN/LSTM Baselines?

    This is a great set of experiments! I'm wondering if the code for the RNN/LSTM baselines reported in the paper are available somewhere. At present, I only see code for the TCN model.

    Thanks!

    opened by millerjohnp 7
  • why temporal pad at all?

    why temporal pad at all?

    Nice work! I'm researching time series regression using machine learning, so I'm looking at LSTM-, TCN- and Transformer-based models, and I'm getting good results with your model.

    One general question: I'm not sure I understand why we pad each layer of a TCN at all. I understand that it ensures each layer produces a sequence of the same length, so there's a benefit in that your predictions are aligned with your inputs. But it's very similar to initialising an AR(p) model with a vector of zeros when you predict forward: the initial predictions will all be "wrong" until the effect of the initial state has decayed out.

    LSTMs also have this issue; most applications seem to set the initial state per batch to zero, which results in transient errors at the start of the batch (some authors train a separate model to estimate the initial state, which I've had good success with). I would assume this impacts training as well, and it seems to make sense to mask out the start of the output sequence when calculating the loss, or the model may try to adapt to "fix" the impact of the wrong initial condition.

    Certainly, when I train a regression-based TCN I can observe transient errors at the start of the prediction - i.e. the diagram below underpredicts for the first 96 samples (that's one day of 15-minute electricity consumption), then overpredicts for the first week before settling down. Interested in your thoughts.

    Also, one general observation: the predictions from the TCN seem noisier than the LSTM's; I thought the long AR window might filter out more noise than it has. Plus, it's quite sensitive to the learning rate - a low learning rate produces a very noisy output sequence.

    image

    opened by david-waterworth 6
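
    On the transient issue raised above: one common mitigation (this repo does something similar in the character-level LM evaluation by only scoring positions with enough effective history) is to drop the first receptive-field-worth of outputs from the loss. A hedged sketch; the formula assumes this repo's TemporalBlock with two convolutions per level and dilations 1, 2, 4, ...:

        import torch
        import torch.nn.functional as F

        def receptive_field(kernel_size: int, num_levels: int) -> int:
            # Two dilated convolutions per level, dilations 1, 2, ..., 2**(num_levels - 1).
            # (kernel_size=2 with 2 levels gives 7, matching the "Receptive field 7"
            # printed in the padding issue further below.)
            return 1 + 2 * (kernel_size - 1) * (2 ** num_levels - 1)

        r = receptive_field(kernel_size=7, num_levels=4)   # 181 steps of warm-up here

        y_pred = torch.randn(8, 1000)                      # (batch, time) regression outputs
        y_true = torch.randn(8, 1000)
        loss = F.mse_loss(y_pred[:, r:], y_true[:, r:])    # ignore the warm-up region
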
  • Recommendation for image to text

    Recommendation for image to text

    My goal is to train a model that can output sequences of text from image inputs. Using the IAM handwriting dataset for example, we would pass the model an image

    image

    and expect it to return "broadcast and television report on his". Historically, the common (i.e. recurrent) way to accomplish this would be an encoder (CNN) + decoder (LSTM) architecture like OpenNMT's implementation. I am interested in replacing the decoder with a TCN, but am unsure how to approach the image data. The CNN encoder will create a batch of N feature maps with reduced spatial dimensions (H', W')

    image

    The issue is that a TCN expects 3D tensors (N, L, C), whereas each "timestep" of the image is 2D, giving a 4D batch (N, H, W, C). Following the p-MNIST example in the paper, we could flatten the image into a 1D sequence with length H' x W'. Then the TCN would effectively snake through the pseudo-timesteps like below

    image

    However, if we want one prediction per timestep, it makes much more sense to define a left-to-right sequence instead of a snaking one, since that's the direction the text is depicted in the image. Did you experiment at all with image-to-text models, and if so, how did you choose to represent the images?

    I also wonder about the loss function for training a TCN decoder. Assuming you divide the image width into more timesteps than your maximum expected sequence length, it seems like connectionist temporal classification (CTC) would be a good choice. Then you do not have to worry about alignment between the target sequence and the model's prediction. For instance, "bbb--ee-cau--sssss----e" would be collapsed to "because" by combining neighboring duplicates and removing blanks. Do you agree, or is there a different loss function you would suggest?

    opened by addisonklinke 5
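
    On the image-to-text question above, one option (an assumption, not something from this repo) is to collapse the height of the encoder's feature maps into the channel dimension so that each image column becomes one left-to-right time step, and to let nn.CTCLoss handle the alignment. A minimal sketch with stand-in tensors:

        import torch
        import torch.nn as nn

        N, C, Hp, Wp = 4, 256, 8, 100                            # hypothetical encoder output (N, C, H', W')
        feats = torch.randn(N, C, Hp, Wp)

        # One time step per image column: fold H' into channels.
        seq = feats.permute(0, 3, 1, 2).reshape(N, Wp, C * Hp)   # (N, W', C*H')
        seq = seq.transpose(1, 2)                                # (N, C*H', W') as a TCN would expect

        # decoder = TemporalConvNet(num_inputs=C * Hp, num_channels=[256] * 4, kernel_size=3)
        # logits = linear(decoder(seq).transpose(1, 2))          # (N, W', vocab + blank)
        logits = torch.randn(N, Wp, 80)                          # stand-in: 79 characters + blank

        log_probs = logits.log_softmax(2).transpose(0, 1)        # CTC expects (T, N, classes)
        targets = torch.randint(1, 80, (N, 35))                  # padded label indices (blank = 0)
        target_lengths = torch.tensor([35, 30, 28, 35])
        input_lengths = torch.full((N,), Wp, dtype=torch.long)
        loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
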
  • Clarification on figure 3(a)

    Clarification on figure 3(a)

    Hello and thanks for the paper and the helpful codebase!

    I just wanted to clarify how the convergence plots in the paper were generated, particularly fig 3(a). The Y axis is labelled test accuracy, however the X values seem to be more frequent than every epoch. Could you confirm what data is being evaluated here and whether smoothing is taking place? Thanks

    opened by alanjeffares 4
  • How to reproduce results from the paper?

    How to reproduce results from the paper?

    Is it just the testing result in the last epoch using default parameters? I have tried to run add_test.py and below are the results I get for the 10 epochs.

    Test set: Average loss: 0.168699
    Test set: Average loss: 0.001142
    Test set: Average loss: 0.000922
    Test set: Average loss: 0.000345
    Test set: Average loss: 0.000143
    Test set: Average loss: 0.000188
    Test set: Average loss: 0.000121
    Test set: Average loss: 0.000028
    Test set: Average loss: 0.000244
    Test set: Average loss: 0.000042

    Which one should I use for benchmarking? In the paper, the result of TCN was 5.8e-5 but it seems like we can use 2.8e-5 or 4.2e-5 here.

    opened by johnsyin 4
  • Cutting off effective history when evaluating char_cnn model.

    Cutting off effective history when evaluating char_cnn model.

    I don't understand why, at test time (or when evaluating the model on a validation set), we don't compute the loss on the whole sequence instead of only on the part of the sequence that has sufficient history. The model is not evaluated on the whole dataset but only on a sub-part; are the results reliable, or even comparable to other models (LSTM, etc.) that don't use this method?

    opened by mok33 4
  • Is TCN suitable for time series regression?

    Is TCN suitable for time series regression?

    Hello,

    Thank you for your great paper and sharing!

    I'm wondering how to use TCN to solve time series regression problems. In my time series scenario, the data for each moment contains multiple variables and each variable is a real number. For example, the data for time step 0 is something like "vector_0 = <0.1, 0.2, 0.3, ...>", and I want to use the last k vectors to predict the next vector.

    I have developed an LSTM model for this problem. The input shape of the LSTM model is (batch_size, time_steps (k), input_size (the length of each vector)), and the prediction is the LSTM's last output. Then I calculate the MSE loss and run the backward pass. How can I use a TCN to solve this problem?

    Best Regards

    opened by liuzf13 4
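
    Regarding the question above: the adding-problem model in this repo does essentially this for a scalar target, passing the TCN output at the last time step through a linear layer. The sketch below adapts that pattern to vector-valued targets (the sizes are illustrative, and it assumes the repo's TCN/tcn.py is importable):

        import torch
        import torch.nn as nn
        from TCN.tcn import TemporalConvNet

        class TCNRegressor(nn.Module):
            def __init__(self, input_size, output_size, num_channels, kernel_size=3, dropout=0.1):
                super().__init__()
                self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
                self.linear = nn.Linear(num_channels[-1], output_size)

            def forward(self, x):
                # x: (batch, time_steps k, input_size); the TCN wants (batch, channels, time).
                y = self.tcn(x.transpose(1, 2))
                return self.linear(y[:, :, -1])    # last time step predicts the next vector

        model = TCNRegressor(input_size=16, output_size=16, num_channels=[32] * 4)
        x = torch.randn(64, 48, 16)                # k = 48 historical vectors of 16 features each
        loss = nn.functional.mse_loss(model(x), torch.randn(64, 16))
        loss.backward()
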
  • Causal Transposed Convolution

    Causal Transposed Convolution

    Hi,

    Thanks for this great paper ... I am trying to use this architecture in an auto-encoder setting, such that the encoder part is a stack of strided dilated causal conv layers, and I am now thinking about the decoder part.

    In terms of up-sampling using transposed convolutions, does the same intuition apply in order to have causal up-sampling (i.e. to exclude reconstruction of the future part)? Or should we generate sample-by-sample without transposed conv layers?

    With many thanks in advance. Best regards,

    opened by ahmed-fau 4
  • Zero padding - possibly incorrect behavior?

    Zero padding - possibly incorrect behavior?

    Thank you for sharing the code and paper, it has been very helpful. I think I may have found a subtle issue with the padding scheme and would appreciate another opinion.

    Conceptually, we'd like every sequence input before the first to be zero. But I noticed that the implementation pads every Conv1d input with zeros, not just the first one. In my opinion, this is incorrect behavior for each layer beyond the first.

    Here is a diagram of the issue.

    [diagram]

    The triangles represent padded inputs. The bottom row (sequence input) is padded with 0, which is correct. However, the first layer's outputs are also padded with 0 (red triangles) before feeding to the next layer. I think we should instead pad with a constant vector, the result of convolving an all-zero receptive field. (Resulting in conv1's bias term.)

    Similarly, the next layer up should be padded with a constant vector, whose value is the result of convolving a receptive field with a constant value (the padding of the previous layer).

    Impact: A network with receptive field $r$ will produce incorrect results prior to the $r$-th input. "Incorrect" in this case means at least inconsistent with its behavior in the steady state, far from the beginning of the input. This might be especially important with long receptive fields, where sequences are similar in length to the receptive field, because a substantial portion of the training examples will be using these wrong padding values.

    Here's a simple test case that demonstrates that prepending a sequence of zeros to the input changes the output.

    import torch
    import torch.nn as nn
    import tcn  # the repo's TCN/tcn.py module; adjust the import path to your setup

    def test_tcn():
        torch.manual_seed(42)
        def init_weights(m):
            if isinstance(m, nn.Conv1d):
                if hasattr(m, 'weight_g'):
                    # weight_norm was applied to this layer
                    torch.nn.init.uniform_(m.weight_g)
                    torch.nn.init.uniform_(m.weight_v)
                    # XXX: not sure if this is correct way to initialize
                else:
                    torch.nn.init.uniform_(m.weight)
                torch.nn.init.uniform_(m.bias)
    
        with torch.no_grad():
            net = tcn.TemporalConvNet(num_inputs=1, num_channels=[2, 1], kernel_size=2, dropout=0)
            net.apply(init_weights)
            print("Receptive field", net.receptive_field_size)
    
            for i in range(8):
                print(f"Padding with {i} zeros:",
                      net(torch.Tensor([[ [0] * i + [1] ]])))
    
            print("Zero input response:", net(torch.Tensor([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]])))
    
    Receptive field 7
    Padding with 0 zeros: tensor([[[2.1018]]])
    Padding with 1 zeros: tensor([[[1.3458, 2.2364]]])
    Padding with 2 zeros: tensor([[[1.3458, 1.4805, 2.4149]]])
    Padding with 3 zeros: tensor([[[1.3458, 1.4805, 1.6590, 2.4309]]])
    Padding with 4 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 2.4466]]])
    Padding with 5 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 2.4550]]])
    Padding with 6 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 1.6991, 2.4550]]])
    Padding with 7 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 1.6991, 1.6991, 2.4550]]])
    
    Zero input response: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 1.6991, 1.6991, 1.6991,
              1.6991, 1.6991, 1.6991, 1.6991]]])
    

    Clearly this TCN implementation is still able to achieve great results, so I am not yet sure of the practical impact. I'll experiment with changing it for my application.

    opened by iceboundflame 0
  • Correlate .mat files with songs in Nottingham dataset

    Correlate .mat files with songs in Nottingham dataset

    I have all the .abc files. I have made sure that the shape of all the X variables combined matches the number of songs. But after I load the .mat file for Nottingham, how do I determine that the first element in the numpy array corresponds to the song entitled "...."? I want to match the data with the songs listed in the .abc files.

    opened by demongolem-biz2 0
  • why?

    why?

    I changed the output to "return torch.mean(self.network(x), dim=2)" for multi-feature time series and the training time drops significantly... so why? lol

    opened by RaganrokV 0
  • What is the accuracy supposed to be for the MNIST problem?

    What is the accuracy supposed to be for the MNIST problem?

    After each epoch (at least the first 6 so far), I get 982/10000 (about 10%) accuracy, which can't be right. What should the accuracy be as originally designed?

    opened by demongolem-biz 0
  • Code Question about: input the final conv-layer output to the linear layer

    Code Question about: input the final conv-layer output to the linear layer

    Great code guys! Can I ask a question about this code? https://github.com/locuslab/TCN/blob/master/TCN/adding_problem/model.py#L17

    Usually when I implement CNN-style models, calculating the output dimension of the last conv layer is always a problem.

    In your code, at line 17, it looks like the final linear layer only takes part of the conv-layer output. Is this understanding correct? Does it ignore many of the other values in the conv-layer output?

    What surprises me is that when I test such an implementation on other traditional CNN models, it also works (I mean just using self.linear(y1[:, :, -1])). Does this mean the task is simple for the designed CNN because we just dropped a lot of neurons in it?

    It would be highly appreciated if someone could advise.

    opened by ShengzheXu 0
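
    A note on the question above: y1[:, :, -1] does not drop learned parameters; it selects the network's output at the final time step, which (because the convolutions are causal) already summarizes the entire input history, and every convolutional weight contributes to that value. A short shape sketch with illustrative sizes:

        import torch
        import torch.nn as nn

        batch, channels, length = 32, 30, 400
        y1 = torch.randn(batch, channels, length)   # output of the final temporal block

        last = y1[:, :, -1]                         # (batch, channels): features at the final step
        out = nn.Linear(channels, 1)(last)          # one scalar prediction per sequence
        print(last.shape, out.shape)                # torch.Size([32, 30]) torch.Size([32, 1])
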
Owner

CMU Locus Lab (Zico Kolter's Research Group)