PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Overview

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset.

Latest Results

  • 39.98 Perplexity after 5 training epochs using an LSTM Language Model with the Adam optimizer
  • Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB GPU memory)

Previous Results

  • 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
  • Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented Sampled Softmax and Log-Uniform Sampler functions
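
The idea behind those two functions: instead of normalizing over the full GBW vocabulary, the loss is computed over the true next word plus a small set of negatives drawn from a log-uniform (Zipfian) proposal, with the log proposal probability subtracted from each logit. The snippet below is a minimal PyTorch sketch of that idea, not the repo's Cython/C++ sampler; the function names are illustrative and accidental hits (a negative sample equal to the target) are not removed.

    import torch
    import torch.nn.functional as F

    def log_uniform_sample(vocab_size, num_samples):
        # Log-uniform (Zipfian) proposal: word ids are assumed to be sorted by
        # descending frequency, so small ids are drawn more often.
        log_range = torch.log(torch.tensor(vocab_size + 1.0))
        u = torch.rand(num_samples)
        samples = (torch.exp(u * log_range) - 1.0).long().clamp_(0, vocab_size - 1)
        # Proposal probability of every word id, used for the logit correction.
        probs = torch.log1p(1.0 / (torch.arange(vocab_size, dtype=torch.float) + 1.0)) / log_range
        return samples, probs

    def sampled_softmax_loss(hidden, weight, bias, targets, samples, probs):
        # hidden: (batch, dim), weight: (vocab, dim), targets: (batch,), samples: (num_samples,)
        true_logits = (hidden * weight[targets]).sum(-1) + bias[targets] - torch.log(probs[targets])
        samp_logits = hidden @ weight[samples].t() + bias[samples] - torch.log(probs[samples])
        logits = torch.cat([true_logits.unsqueeze(1), samp_logits], dim=1)
        # The true word always sits in column 0 of the reduced softmax.
        return F.cross_entropy(logits, torch.zeros_like(targets))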

GPU Hardware Requirement

Type                  LM Memory Size   GPU
w/o tied weights      ~9 GB            Nvidia 1080 TI, Nvidia Titan X
w/ tied weights [6]   ~7 GB            Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
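
A minimal sketch of what tying means in PyTorch; the variable names are illustrative, and the vocabulary size is the ~793K-word GBW vocabulary reported in [1]:

    import torch.nn as nn

    vocab_size, embed_size = 793471, 256     # GBW vocabulary, 256-dim embeddings
    embedding = nn.Embedding(vocab_size, embed_size)
    decoder = nn.Linear(embed_size, vocab_size)
    # Tying: both layers share one (vocab_size x embed_size) matrix,
    # roughly halving the memory spent on the two largest parameter tensors.
    decoder.weight = embedding.weight

Tying is only possible here because the projection size equals the embedding size (256), so the input and output word matrices have the same shape.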

Hyper-Parameters [3]

Parameter Value
# Epochs 5
Training Batch Size 128
Evaluation Batch Size 1
BPTT 20
Embedding Size 256
Hidden Size 2048
Projection Size 256
Tied Embedding + Softmax False
# Layers 1
Optimizer AdaGrad
Learning Rate 0.10
Gradient Clipping 1.00
Dropout 0.01
Weight-Decay (L2 Penalty) 1e-6
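
For orientation, here is a minimal sketch of the 1-layer projected LSTM that these hyper-parameters describe. The repo itself targets PyTorch v0.4.1 and implements its own projection; the proj_size argument used below only exists in newer PyTorch releases, so treat this as an illustration rather than the repo's model class.

    import torch.nn as nn

    class ProjectedLSTMLM(nn.Module):
        """1-layer LSTM LM: 256-dim embeddings, 2048 hidden units, 256-dim projection."""
        def __init__(self, vocab_size, embed=256, hidden=2048, proj=256, dropout=0.01):
            super().__init__()
            self.drop = nn.Dropout(dropout)
            self.embedding = nn.Embedding(vocab_size, embed)
            # proj_size maps the 2048-dim hidden state back down to 256 dims,
            # keeping the softmax input (and an optionally tied embedding) small.
            self.lstm = nn.LSTM(embed, hidden, num_layers=1, proj_size=proj)
            self.decoder = nn.Linear(proj, vocab_size)

        def forward(self, tokens, state=None):
            emb = self.drop(self.embedding(tokens))   # (bptt, batch, 256)
            out, state = self.lstm(emb, state)        # (bptt, batch, 256)
            return self.decoder(self.drop(out)), state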

Setup - Torch Data Format

  1. Download Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install the Cython framework and build the Log_Uniform Sampler
  4. Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start and end positions of each independent sentence. The preprocessing step and the "train_data.sid" file speed up loading the massive training data.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
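
A sketch of how the two tensors fit together, assuming both files load with torch.load (the repo's actual loader lives in fast_gbw.py, and the .sid file may use a different serialization format, so treat the file names and load calls here as illustrative):

    import torch

    corpus = torch.load('train_data.pt')   # (#words x 2): (sentence id, word id)
    sid = torch.load('train_data.sid')     # (#sentences x 2): (start position, sentence length)

    # Recover the word ids of the first sentence from its (start, length) record.
    start, length = sid[0].tolist()
    first_sentence = corpus[start:start + length, 1]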

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.

References

  1. Exploring the Limits of Language Modeling (GitHub)
  2. Factorization Tricks for LSTM networks (GitHub)
  3. Efficient softmax approximation for GPUs (GitHub)
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
Comments
  • state of the art performance?

    Nice work! I have a question regarding the result: in the paper "Exploring the limits of language modeling", it reports a test ppl of 54.1 using LSTM-512-512. Does it mean 2 layers are used in the paper, while your result is obtained from 4 layers? If so, what makes the difference?

    opened by eric-haibin-lin 8
  • RuntimeError: inconsistent tensor size

    I have a problem:

    load word frequency mapping - complete
    loaded tensor torch.Size([798949912])
    loaded tensor torch.Size([798949912, 3])
    #sentences 798949912
    load train data - complete
    #sentences 6073
    load test data - complete
    Traceback (most recent call last):
      File "main.py", line 195, in <module>
        train()
      File "main.py", line 157, in train
        for batch, item in enumerate(train_loader):
      File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
        tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
      File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
        source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
    RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86

    opened by maydaygmail 7
  • ImportError: cannot import name 'LogUniformSampler'

    After running 'python3 setup.py build_ext --inplace', I still get ImportError: cannot import name 'LogUniformSampler'. It seems that the log_uniform module is not built correctly.

    Any suggestion?

    Thanks!

    opened by songyuzhou324 4
  • Resume Training?

    Hi, I am wondering whether it is possible to resume training using the saved checkpoint? Based on the code I think I just need to re-define the scheduler by myself. Is there anything that you think I missed?

    Thank you so much for your code btw.

    opened by WilliamLwj 2
  • Pretrained Model?

    Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!

    Do you plan to release the pre-trained model?

    (I see it takes roughly 3 days...so probably it's ok)

    opened by windweller 2
  • sample_ids being ignored?

    Hi! Thanks for your code. I've been reading through it to understand the approach, and I've noticed that the output of sampled is actually always a zero long-tensor:

    https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69

    Is this the way it's supposed to work? I was under the impression that the sampled softmax obtains its speed-up by computing the loss on only a sample of the entire vocabulary. But the way it's set up, the loss would always be computed with respect to the same target (0).

    Or is there something else I might be missing?

    greetings!

    opened by emanjavacas 2
  • dead link (Google Billion Word Dataset for Torch)

    Hi, I'd like to use your language model for my research. I can't train it because the link to the Google Billion Word Dataset for Torch is down. Is there a mirror somewhere?

    opened by jxmorris12 1
  • how to build Log_Uniform Sampler?

    On my MacBook, I ran 'python setup.py install' or 'python setup.py build_ext --inplace' in the log_uniform folder and got this error:

    ➜  log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
    running install
    running build
    running build_ext
    building 'log_uniform' extension
    creating build
    creating build/temp.macosx-10.7-x86_64-3.7
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
    warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
          [-Wstdlibcxx-not-found]
    log_uniform.cpp:635:10: fatal error: 'ios' file not found
    #include "ios"
             ^~~~~
    1 warning and 1 error generated.
    error: command 'gcc' failed with exit status 1
    

    I installed the Xcode command line tools, but the error still exists.

    opened by universewill 1
  • TypeError: iteration over a 0-d tensor

    File "main_dev.py", line 99, in repackage_hidden return [repackage_hidden(state) for state in h] File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in iter raise TypeError('iteration over a 0-d tensor') TypeError: iteration over a 0-d tensor

    Have you come across this kind of issue before?

    opened by Machine-Tom 1
  • Preprocess problem

    It seems torch.load() cannot load train_data.th7? I cannot figure out how to run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file.

    opened by jiangtianli91 1
  • build Log_Uniform Sampler

    Hi

    I have Cython installed, but I'm not sure how to do the step "build Log_Uniform Sampler". Could you give more detail on which commands I should run?

    I tried to do python setup.py install but I got the following error:

    running install
    running build
    running build_ext
    building 'log_uniform' extension
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    

    So I'm not sure if I'm doing the right thing.

    opened by goncalomcorreia 1
  • missing train_data.pt

    It seems that process_gbw.py is looking for train_data.pt but can't find it. Are there any instructions on how to create this file (or does it come with the downloaded dataset)?

    Thanks!

    opened by flint-stone 0
Owner
Ryan Spring
A PhD student researching Deep Learning, Locality-Sensitive Hashing, and other large-scale machine learning algorithms.