PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Overview

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset.

Latest Results

  • 39.98 Perplexity after 5 training epochs, using an LSTM language model with the Adam optimizer
  • Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB of GPU memory)

Previous Results

  • 46.47 Perplexity after 5 training epochs with a 1-layer, 2048-unit, 256-projection LSTM language model [3]
  • Trained for 3 days on 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented Sampled Softmax and Log-Uniform Sampler functions (see the sketch below)
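
The Log-Uniform Sampler draws negative classes from a Zipfian distribution over word ids sorted by descending frequency, P(k) = (log(k+2) - log(k+1)) / log(V+1), the same distribution used by TensorFlow's log_uniform_candidate_sampler [4]. A minimal NumPy sketch of that distribution (illustrative only; the repo implements the sampler in C++/Cython):

    import numpy as np

    def log_uniform_probs(vocab_size):
        # P(k) = (log(k+2) - log(k+1)) / log(V+1) for word ids 0..V-1,
        # assuming ids are assigned in order of descending frequency.
        k = np.arange(vocab_size, dtype=np.float64)
        return (np.log(k + 2.0) - np.log(k + 1.0)) / np.log(vocab_size + 1.0)

    def sample_negatives(vocab_size, num_samples, rng=None):
        # Draw unique negative classes for sampled softmax.
        rng = rng or np.random.default_rng()
        probs = log_uniform_probs(vocab_size)
        return rng.choice(vocab_size, size=num_samples, replace=False, p=probs)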

GPU Hardware Requirement

Type                  LM Memory Size   GPU
w/o tied weights      ~9 GB            Nvidia 1080 Ti, Nvidia Titan X
w/ tied weights [6]   ~7 GB            Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
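
A minimal sketch of the tying idea (module names are illustrative, not the repo's): the input embedding and the output softmax matrix are both vocab_size x embedding_size, so one matrix can serve both roles whenever the projection size matches the embedding size (256 here), saving roughly the 2 GB difference shown in the table.

    import torch.nn as nn

    class TiedLM(nn.Module):
        def __init__(self, vocab_size, embed_size, hidden_size):
            super().__init__()
            self.encoder = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.project = nn.Linear(hidden_size, embed_size)
            self.decoder = nn.Linear(embed_size, vocab_size)
            # Share one (vocab_size x embed_size) matrix between the input
            # embedding and the output softmax layer [6].
            self.decoder.weight = self.encoder.weight

        def forward(self, x, hidden=None):
            output, hidden = self.rnn(self.encoder(x), hidden)
            return self.decoder(self.project(output)), hidden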

Hyper-Parameters [3]

Parameter                  Value
# Epochs                   5
Training Batch Size        128
Evaluation Batch Size      1
BPTT                       20
Embedding Size             256
Hidden Size                2048
Projection Size            256
Tied Embedding + Softmax   False
# Layers                   1
Optimizer                  AdaGrad
Learning Rate              0.10
Gradient Clipping          1.00
Dropout                    0.01
Weight-Decay (L2 Penalty)  1e-6
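
The optimizer rows of the table map directly onto PyTorch. A sketch of one training step under these settings (model, criterion, and train_loader are assumed to exist and are not the repo's exact names):

    import torch

    # AdaGrad with the learning rate and L2 penalty from the table.
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.10, weight_decay=1e-6)

    for data, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), targets)
        loss.backward()
        # Clip gradients to the table's 1.00 threshold before stepping.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.00)
        optimizer.step()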

Setup - Torch Data Format

  1. Download the Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install the Cython framework and build the Log_Uniform Sampler
  4. Convert the Torch data tensors to the PyTorch tensor format (requires PyTorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence; together with the preprocessing step, it speeds up loading of the massive training data.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
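
A sketch of how the two tensors fit together, assuming both files load with torch.load() after the conversion step (the repo's actual loader is lm/fast_gbw.py):

    import torch

    corpus = torch.load('train_data.pt')   # (#words x 2): (sentence id, word id)
    sids = torch.load('train_data.sid')    # (#sentences x 2): (start, length)

    def sentence(i):
        # Slice the i-th independent sentence out of the flat word stream.
        start, length = sids[i, 0].item(), sids[i, 1].item()
        return corpus[start:start + length, 1]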

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
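
A sketch of that trade-off: the original text shards can be streamed one file at a time instead of materializing one giant tensor, so memory stays small at the cost of repeated disk reads (the shard naming below follows the released dataset and is an assumption):

    import glob

    def stream_sentences(data_dir):
        # Yield one tokenized sentence at a time, shard by shard, so only a
        # single chunk of the dataset is ever held in memory.
        for path in sorted(glob.glob(f'{data_dir}/news.en-*-of-*')):
            with open(path, encoding='utf-8') as f:
                for line in f:
                    yield line.split()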

References

  1. Exploring the Limits of Language Modeling (GitHub)
  2. Factorization Tricks for LSTM Networks (GitHub)
  3. Efficient Softmax Approximation for GPUs (GitHub)
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
Comments
  • state of the art performance?

    Nice work! I have a question regarding the result: the paper "Exploring the Limits of Language Modeling" reports a test perplexity of 54.1 using LSTM-512-512. Does that mean 2 layers are used in the paper, while your result is obtained from 4 layers? If so, what makes the difference?

    opened by eric-haibin-lin 8
  • RuntimeError: inconsistent tensor size

    I have a problem:

        load word frequency mapping - complete
        loaded tensor torch.Size([798949912])
        loaded tensor torch.Size([798949912, 3])
        #sentences 798949912
        load train data - complete
        #sentences 6073
        load test data - complete
        Traceback (most recent call last):
          File "main.py", line 195, in <module>
            train()
          File "main.py", line 157, in train
            for batch, item in enumerate(train_loader):
          File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
            tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
          File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
            source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
        RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86

    opened by maydaygmail 7
  • ImportError: cannot import name 'LogUniformSampler'

    After running 'python3 setup.py build_ext --inplace', I still get ImportError: cannot import name 'LogUniformSampler'. It seems that the log_uniform module is not built correctly.

    Any suggestion?

    Thanks!

    opened by songyuzhou324 4
  • Resume Training?

    Hi, I am wondering whether it is possible to resume training using the saved checkpoint? Based on the code I think I just need to re-define the scheduler by myself. Is there anything that you think I missed?

    Thank you so much for your code btw.

    opened by WilliamLwj 2
  • Pretrained Model?

    Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!

    Do you plan to release the pre-trained model?

    (I see it takes roughly 3 days...so probably it's ok)

    opened by windweller 2
  • sample_ids being ignored?

    Hi! thanks for your code. I've been reading through it to understand the approach and I've noticed that the output of sampled is actually always a zero long-tensor:

    https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69

    Is this the way it's supposed to work? I understood that the sampled softmax obtains its speed-up by computing the loss on only a sample of the entire vocabulary. But the way it's set up, the loss would always be computed with respect to the same target (0).

    Or is there something else I might be missing?

    greetings!

    opened by emanjavacas 2
  • dead link (Google Billion Word Dataset for Torch)

    Hi, I'd like to use your language model for my research. I can't train it because the link to the Google Billion Word Dataset for Torch is down. Is there a mirror somewhere?

    opened by jxmorris12 1
  • how to build Log_Uniform Sampler?

    On my MacBook, I ran 'python setup.py install' and 'python setup.py build_ext --inplace' in the log_uniform folder and got this error:

    ➜  log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
    running install
    running build
    running build_ext
    building 'log_uniform' extension
    creating build
    creating build/temp.macosx-10.7-x86_64-3.7
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
    warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
          [-Wstdlibcxx-not-found]
    log_uniform.cpp:635:10: fatal error: 'ios' file not found
    #include "ios"
             ^~~~~
    1 warning and 1 error generated.
    error: command 'gcc' failed with exit status 1
    

    I installed the Xcode command line tools, but the error still exists.

    opened by universewill 1
  • TypeError: iteration over a 0-d tensor

    File "main_dev.py", line 99, in repackage_hidden return [repackage_hidden(state) for state in h] File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in iter raise TypeError('iteration over a 0-d tensor') TypeError: iteration over a 0-d tensor

    Have you encountered this issue before?

    opened by Machine-Tom 1
  • Preprocess problem

    It seems torch.load() cannot load train_data.th7? I cannot figure out how to "run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file."

    opened by jiangtianli91 1
  • build Log_Uniform Sampler

    Hi

    I have Cython installed, but I'm not sure how to do the step "build Log_Uniform Sampler". Could you be more detailed in what commands should I run?

    I tried to do python setup.py install but I got the following error:

    running install
    running build
    running build_ext
    building 'log_uniform' extension
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    

    So I'm not sure if I'm doing the right thing.

    opened by goncalomcorreia 1
  • missing train_data.pt

    It seems that process_gbw.py is looking for train_data.pt but couldn't find it. Are there any instructions on how to create this file (or does it belong to the dataset downloaded)?

    Thanks!

    opened by flint-stone 0
Owner
Ryan Spring
A PhD student researching Deep Learning, Locality-Sensitive Hashing, and other large-scale machine learning algorithms.