PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Overview

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset.

Latest Results

  • 39.98 Perplexity after 5 training epochs, using an LSTM language model with the Adam optimizer
  • Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB of GPU memory)

Previous Results

  • 46.47 Perplexity after 5 training epochs with a 1-layer, 2048-unit, 256-projection LSTM language model [3]
  • Trained for 3 days on 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented Sampled Softmax and Log-Uniform Sampler functions (see the sketch below)
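
The Log-Uniform Sampler draws negative classes from a Zipfian distribution over word ids sorted by descending frequency, P(k) = (log(k+2) - log(k+1)) / log(V+1), the same distribution used by TensorFlow's log_uniform_candidate_sampler [4]. A minimal NumPy sketch of that distribution (illustrative only; the repo implements the sampler in C++/Cython):

    import numpy as np

    def log_uniform_probs(vocab_size):
        # P(k) = (log(k+2) - log(k+1)) / log(V+1) for word ids 0..V-1,
        # assuming ids are assigned in order of descending frequency.
        k = np.arange(vocab_size, dtype=np.float64)
        return (np.log(k + 2.0) - np.log(k + 1.0)) / np.log(vocab_size + 1.0)

    def sample_negatives(vocab_size, num_samples, rng=None):
        # Draw unique negative classes for sampled softmax.
        rng = rng or np.random.default_rng()
        probs = log_uniform_probs(vocab_size)
        return rng.choice(vocab_size, size=num_samples, replace=False, p=probs)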

GPU Hardware Requirement

Type                  LM Memory Size   GPU
w/o tied weights      ~9 GB            Nvidia 1080 Ti, Nvidia Titan X
w/ tied weights [6]   ~7 GB            Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
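
A minimal sketch of the tying idea (module names are illustrative, not the repo's): the input embedding and the output softmax matrix are both vocab_size x embedding_size, so one matrix can serve both roles whenever the projection size matches the embedding size (256 here), saving roughly the 2 GB difference shown in the table.

    import torch.nn as nn

    class TiedLM(nn.Module):
        def __init__(self, vocab_size, embed_size, hidden_size):
            super().__init__()
            self.encoder = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.project = nn.Linear(hidden_size, embed_size)
            self.decoder = nn.Linear(embed_size, vocab_size)
            # Share one (vocab_size x embed_size) matrix between the input
            # embedding and the output softmax layer [6].
            self.decoder.weight = self.encoder.weight

        def forward(self, x, hidden=None):
            output, hidden = self.rnn(self.encoder(x), hidden)
            return self.decoder(self.project(output)), hidden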

Hyper-Parameters [3]

Parameter                  Value
# Epochs                   5
Training Batch Size        128
Evaluation Batch Size      1
BPTT                       20
Embedding Size             256
Hidden Size                2048
Projection Size            256
Tied Embedding + Softmax   False
# Layers                   1
Optimizer                  AdaGrad
Learning Rate              0.10
Gradient Clipping          1.00
Dropout                    0.01
Weight-Decay (L2 Penalty)  1e-6
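
The optimizer rows of the table map directly onto PyTorch. A sketch of one training step under these settings (model, criterion, and train_loader are assumed to exist and are not the repo's exact names):

    import torch

    # AdaGrad with the learning rate and L2 penalty from the table.
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.10, weight_decay=1e-6)

    for data, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), targets)
        loss.backward()
        # Clip gradients to the table's 1.00 threshold before stepping.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.00)
        optimizer.step()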

Setup - Torch Data Format

  1. Download the Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install the Cython framework and build the Log_Uniform Sampler
  4. Convert the Torch data tensors to the PyTorch tensor format (requires PyTorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence; together with the preprocessing step, it speeds up loading of the massive training data.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
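
A sketch of how the two tensors fit together, assuming both files load with torch.load() after the conversion step (the repo's actual loader is lm/fast_gbw.py):

    import torch

    corpus = torch.load('train_data.pt')   # (#words x 2): (sentence id, word id)
    sids = torch.load('train_data.sid')    # (#sentences x 2): (start, length)

    def sentence(i):
        # Slice the i-th independent sentence out of the flat word stream.
        start, length = sids[i, 0].item(), sids[i, 1].item()
        return corpus[start:start + length, 1]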

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
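
A sketch of that trade-off: the original text shards can be streamed one file at a time instead of materializing one giant tensor, so memory stays small at the cost of repeated disk reads (the shard naming below follows the released dataset and is an assumption):

    import glob

    def stream_sentences(data_dir):
        # Yield one tokenized sentence at a time, shard by shard, so only a
        # single chunk of the dataset is ever held in memory.
        for path in sorted(glob.glob(f'{data_dir}/news.en-*-of-*')):
            with open(path, encoding='utf-8') as f:
                for line in f:
                    yield line.split()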

References

  1. Exploring the Limits of Language Modeling (GitHub)
  2. Factorization Tricks for LSTM Networks (GitHub)
  3. Efficient Softmax Approximation for GPUs (GitHub)
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
Comments
  • state of the art performance?

    Nice work! I have a question regarding the result: the paper "Exploring the Limits of Language Modeling" reports a test perplexity of 54.1 using LSTM-512-512. Does that mean 2 layers are used in the paper, while your result is obtained from 4 layers? If so, what makes the difference?

    opened by eric-haibin-lin 8
  • RuntimeError: inconsistent tensor size

    I have a problem:

        load word frequency mapping - complete
        loaded tensor torch.Size([798949912])
        loaded tensor torch.Size([798949912, 3])
        #sentences 798949912
        load train data - complete
        #sentences 6073
        load test data - complete
        Traceback (most recent call last):
          File "main.py", line 195, in <module>
            train()
          File "main.py", line 157, in train
            for batch, item in enumerate(train_loader):
          File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
            tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
          File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
            source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
        RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86

    opened by maydaygmail 7
  • ImportError: cannot import name 'LogUniformSampler'

    After running 'python3 setup.py build_ext --inplace', I still get ImportError: cannot import name 'LogUniformSampler'. It seems that the log_uniform module is not built correctly.

    Any suggestion?

    Thanks!

    opened by songyuzhou324 4
  • Resume Training?

    Hi, I am wondering whether it is possible to resume training using the saved checkpoint? Based on the code I think I just need to re-define the scheduler by myself. Is there anything that you think I missed?

    Thank you so much for your code btw.

    opened by WilliamLwj 2
  • Pretrained Model?

    Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!

    Do you plan to release the pre-trained model?

    (I see it takes roughly 3 days...so probably it's ok)

    opened by windweller 2
  • sample_ids being ignored?

    Hi! thanks for your code. I've been reading through it to understand the approach and I've noticed that the output of sampled is actually always a zero long-tensor:

    https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69

    Is this the way it's supposed to work? I understood that the sampled softmax obtains its speed-up by computing the loss on only a sample of the entire vocabulary. But the way it's set up, the loss would always be computed with respect to the same target (0).

    Or is there something else I might be missing?

    greetings!

    opened by emanjavacas 2
  • dead link (Google Billion Word Dataset for Torch)

    Hi, I'd like to use your language model for my research. I can't train it because the link to the Google Billion Word Dataset for Torch is down. Is there a mirror somewhere?

    opened by jxmorris12 1
  • how to build Log_Uniform Sampler?

    On my MacBook, I ran 'python setup.py install' and 'python setup.py build_ext --inplace' in the log_uniform folder and got this error:

    ➜  log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
    running install
    running build
    running build_ext
    building 'log_uniform' extension
    creating build
    creating build/temp.macosx-10.7-x86_64-3.7
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
    warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
          [-Wstdlibcxx-not-found]
    log_uniform.cpp:635:10: fatal error: 'ios' file not found
    #include "ios"
             ^~~~~
    1 warning and 1 error generated.
    error: command 'gcc' failed with exit status 1
    

    I installed the Xcode command line tools, but the error still exists.

    opened by universewill 1
  • TypeError: iteration over a 0-d tensor

    File "main_dev.py", line 99, in repackage_hidden return [repackage_hidden(state) for state in h] File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in iter raise TypeError('iteration over a 0-d tensor') TypeError: iteration over a 0-d tensor

    Have you encountered this issue before?

    opened by Machine-Tom 1
  • Preprocess problem

    It seems torch.load() cannot load train_data.th7? I cannot figure out how to "run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file."

    opened by jiangtianli91 1
  • build Log_Uniform Sampler

    Hi

    I have Cython installed, but I'm not sure how to do the step "build Log_Uniform Sampler". Could you be more detailed in what commands should I run?

    I tried to do python setup.py install but I got the following error:

    running install
    running build
    running build_ext
    building 'log_uniform' extension
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    

    So I'm not sure if I'm doing the right thing.

    opened by goncalomcorreia 1
  • missing train_data.pt

    It seems that process_gbw.py is looking for train_data.pt but couldn't find it. Are there any instructions on how to create this file (or does it belong to the dataset downloaded)?

    Thanks!

    opened by flint-stone 0
Owner
Ryan Spring
A PhD student researching Deep Learning, Locality-Sensitive Hashing, and other large-scale machine learning algorithms.