Ongoing research training transformer language models at scale, including: BERT & GPT-2

NVIDIA Corporation

Last update: Dec 30, 2022

Overview

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor and pipeline), and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision.

Below are some of the projects where we have directly used Megatron:

Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage NVIDIA's Selene supercomputer to perform scaling studies and use up to 3072 A100 GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations including data loading, optimization, and even logging.

All the cases from 1 billion to 1 trillion parameters achieve more than 43% half precision utilization, which is high for an end-to-end application. We observe that initially the utilization remains constant but as hidden size increases for larger models, utilization starts increasing and reaches 52% for the largest model. We also note that achieved aggregate petaFLOPs across all GPUs increases almost linearly with number of GPUs, demonstrating good weak scaling.

Contents
Setup
- Downloading Checkpoints
Usage
Training
Evaluation and Tasks
Datasets
- Collecting Wikipedia Training Data
- Collecting GPT Webtext Data

Setup

We have tested Megatron with NGC's PyTorch container version 20.12, which uses python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3.

To use this repository, please install the latest supported versions of PyTorch with GPU support (python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3 and above) and NVIDIA APEX. We strongly recommend using one of NGC's recent PyTorch containers (the latest compatible version at time of publication can be pulled with docker pull nvcr.io/nvidia/pytorch:20.12-py3). Data preprocessing requires NLTK, though this is not required for training, evaluation, or downstream tasks.

Downloading Checkpoints

We have provided pretrained BERT-345M and GPT-345M checkpoints for evaluating or finetuning downstream tasks. To access these checkpoints, first sign up for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the NGC documentation.

Alternatively, you can directly download the checkpoints using:

BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip

The models require vocabulary files to run. The BERT WordPiece vocab file can be extracted from Google's pretrained BERT models: uncased, cased. The GPT vocab file and merge table can be downloaded directly.

Usage

After installation, there are several possible workflows. The most comprehensive is:

1. Data preprocessing
2. Pretraining
3. Finetuning (Optional for zero-shot tasks)
4. Downstream task evaluation or text generation

However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above.

We've provided several scripts for pretraining both BERT and GPT in the examples directory, as well as scripts for both zero-shot and fine-tuned downstream tasks including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT interactive text generation.

Training

Data Preprocessing

The training data requires preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example:

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

The name of the text field of the json can be changed by using the --json-key flag in preprocess_data.py. The other metadata are optional and are not used in training.
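As a rough illustration of the loose json layout described above, the following sketch serializes a list of samples into one JSON object per line. The helper name to_loose_json is hypothetical and not part of this repository; only the "text" field name and the my-corpus.json file name come from the examples in this document.

```python
import json


def to_loose_json(samples, json_key="text"):
    """Serialize samples as one JSON object per line (loose json).

    `to_loose_json` is a hypothetical helper for illustration only.
    """
    lines = []
    for sample in samples:
        if json_key not in sample:
            raise KeyError(f"sample has no {json_key!r} field")
        # json.dumps escapes embedded newlines, so each sample stays on one line.
        lines.append(json.dumps(sample, ensure_ascii=False))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    samples = [
        {"text": "The quick brown fox", "id": "0"},
        {"text": "jumps over the lazy dog", "id": "42"},
    ]
    # "my-corpus.json" matches the --input name used in the examples.
    with open("my-corpus.json", "w", encoding="utf-8") as f:
        f.write(to_loose_json(samples))
```

If the text lives under a different field, pass that name as json_key here and give the same name to --json-key during preprocessing.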

The loose json is then processed into a binary format for training. To convert the json into mmap, cached index file, or the lazy loader format use preprocess_data.py. Set the --dataset-impl flag to mmap, cached, or lazy, respectively (default is mmap). An example script to prepare data for BERT training is:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

The output will be two files named, in this case, my-bert_text_sentence.bin and my-bert_text_sentence.idx. The --data-path specified in later BERT training is the full path and new filename, but without the file extension.

For T5 use the same preprocessing as BERT, perhaps renaming it to:

       --output-prefix my-t5 \

Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

Here the output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. As before, in GPT training, use the longer name without the extension as --data-path.

Further command line arguments are described in the source file preprocess_data.py.

BERT Pretraining

The examples/pretrain_bert.sh script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations, starting at --lr and falling to a minimum set by --min-lr over --lr-decay-iters iterations. The fraction of training iterations used for warmup is set by --lr-warmup-fraction. While this is single GPU training, the batch size specified by --micro-batch-size is the batch size of a single forward-backward pass, and the code performs gradient accumulation steps until it reaches --global-batch-size, which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with --seed). We use --train-iters as the number of training iterations requested. Alternatively, one can provide --train-samples, which is the total number of samples to train on. If this option is present, then instead of providing --lr-decay-iters, one will need to provide --lr-decay-samples.
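To make the relationship between these arguments concrete, here is a minimal sketch of the gradient accumulation count and of a linear schedule with warmup. Both function names are hypothetical, and the sketch assumes the warmup fraction is applied to --lr-decay-iters; the scheduler in the code base may differ in its details.

```python
def accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size=1):
    """Forward-backward passes accumulated per iteration (hypothetical helper)."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be a multiple of "
                         "micro batch size * data parallel size")
    return global_batch_size // per_pass


def linear_lr(iteration, lr, min_lr, decay_iters, warmup_fraction):
    """Linear warmup to `lr`, then linear decay to `min_lr` (a sketch).

    Assumption: warmup lasts warmup_fraction * decay_iters iterations.
    """
    warmup_iters = int(warmup_fraction * decay_iters)
    if warmup_iters > 0 and iteration <= warmup_iters:
        return lr * iteration / warmup_iters
    if iteration >= decay_iters:
        return min_lr  # the rate stays at the floor after the decay window
    progress = (iteration - warmup_iters) / (decay_iters - warmup_iters)
    return min_lr + (lr - min_lr) * (1.0 - progress)


if __name__ == "__main__":
    # Values taken from the BERT example: micro batch 4, global batch 8.
    print(accumulation_steps(8, 4))
    print(linear_lr(500_000, 1e-4, 1e-5, 990_000, 0.01))
```

With the values from the BERT example, a micro batch of 4 and a global batch of 8 on one GPU give two accumulation steps per iteration.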

The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the --data-path now includes the additional _text_sentence suffix added in preprocessing, but does not include the file extensions.

CHECKPOINT_PATH=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
DATA_PATH=my-bert_text_sentence

BERT_ARGS="--num-layers 24 \
           --hidden-size 1024 \
           --num-attention-heads 16 \
           --seq-length 512 \
           --max-position-embeddings 512 \
           --lr 0.0001 \
           --lr-decay-iters 990000 \
           --train-iters 2000000 \
           --min-lr 0.00001 \
           --lr-warmup-fraction 0.01 \
           --micro-batch-size 4 \
           --global-batch-size 8 \
           --vocab-file $VOCAB_FILE \
           --split 949,50,1 \
           --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

Further command line arguments are described in the source file arguments.py.

GPT Pretraining

The examples/pretrain_gpt.sh script runs single GPU 345M parameter GPT pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training.

It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a json vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the --lr-decay-style has been set to cosine decay. Note that the --data-path now includes the additional _text_document suffix added in preprocessing, but does not include the file extensions.
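The two GPT-specific points above, cosine decay and the position embedding constraint, can be sketched as follows. The function names are hypothetical, the minimum learning rate of 0 is an assumption (the GPT example does not set --min-lr), and warmup is again assumed to span warmup_fraction * decay_iters iterations.

```python
import math


def check_positions(seq_length, max_position_embeddings):
    """Raise if the model cannot address every position in a sequence."""
    if max_position_embeddings < seq_length:
        raise ValueError("max position embeddings must be >= sequence length")


def cosine_lr(iteration, lr, decay_iters, warmup_fraction, min_lr=0.0):
    """Linear warmup followed by cosine decay to `min_lr` (a sketch)."""
    warmup_iters = int(warmup_fraction * decay_iters)
    if warmup_iters > 0 and iteration <= warmup_iters:
        return lr * iteration / warmup_iters
    if iteration >= decay_iters:
        return min_lr
    progress = (iteration - warmup_iters) / (decay_iters - warmup_iters)
    # The coefficient falls smoothly from 1 to 0 as progress goes from 0 to 1.
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (lr - min_lr)


if __name__ == "__main__":
    check_positions(seq_length=1024, max_position_embeddings=1024)
    print(cosine_lr(161_600, 1.5e-4, 320_000, 0.01))
```

Halfway through the decay window the cosine coefficient is 0.5, so the rate sits midway between --lr and the floor.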

CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH=my-gpt2_text_document

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 4 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS=<same as those in BERT pretraining above>

python pretrain_gpt.py \
       $GPT_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

Further command line arguments are described in the source file arguments.py.

T5 Pretraining

Very similar to BERT and GPT, the examples/pretrain_t5.sh script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:

--kv-channels sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.
--ffn-hidden-size sets the hidden size in the feed-forward networks within a transformer layer. For BERT and GPT this defaults to 4 times the transformer hidden size, but can be configured for T5.
--encoder-seq-length and --decoder-seq-length set the sequence length for the encoder and decoder separately.

All of the other arguments remain as they were for BERT and GPT pretraining.
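The defaults described for --kv-channels and --ffn-hidden-size can be written down as a small sketch. The function resolve_attention_dims is hypothetical and only mirrors the rules stated above; it is not the argument parsing code of this repository.

```python
def resolve_attention_dims(hidden_size, num_attention_heads,
                           kv_channels=None, ffn_hidden_size=None):
    """Apply the documented defaults when the T5-specific flags are omitted."""
    if kv_channels is None:
        if hidden_size % num_attention_heads != 0:
            raise ValueError("hidden size must be divisible by the head count")
        # Default: hidden size divided by the number of attention heads.
        kv_channels = hidden_size // num_attention_heads
    if ffn_hidden_size is None:
        # Default: 4 times the transformer hidden size.
        ffn_hidden_size = 4 * hidden_size
    return kv_channels, ffn_hidden_size


if __name__ == "__main__":
    print(resolve_attention_dims(1024, 16))            # BERT/GPT style defaults
    print(resolve_attention_dims(1024, 16, 64, 3072))  # values from the T5 example
```

In the T5 example below, --kv-channels 64 happens to equal the default for a hidden size of 1024 with 16 heads, while --ffn-hidden-size 3072 overrides the default of 4096.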

CHECKPOINT_PATH=checkpoints/t5_base
VOCAB_FILE=t5-vocab.txt
DATA_PATH=my-t5_text_sentence

T5_ARGS="--num-layers 24 \
         --hidden-size 1024 \
         --num-attention-heads 16 \
         --kv-channels 64 \
         --ffn-hidden-size 3072 \
         --encoder-seq-length 512 \
         --decoder-seq-length 128 \
         --max-position-embeddings 512 \
         --lr 0.0001 \
         --lr-decay-iters 990000 \
         --train-iters 2000000 \
         --min-lr 0.00001 \
         --lr-warmup-fraction 0.01 \
         --micro-batch-size 16 \
         --global-batch-size 2048 \
         --vocab-file $VOCAB_FILE \
         --vocab-extra-ids 100 \
         --split 949,50,1 \
         --fp16"

OUTPUT_ARGS=<same as those in BERT pretraining above>

python pretrain_t5.py \
       $T5_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

Distributed Pretraining

The examples/pretrain_{bert,gpt,t5}_distributed.sh scripts use the PyTorch distributed launcher for distributed training. As such, multi-node training can be achieved by properly setting environment variables and using init_method='env://' in the launcher. See the official PyTorch documentation for further description of these environment variables. By default, multi-node training uses the nccl distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the Python flag -m torch.distributed.launch, detailed below, are the only additional requirements to adopt distributed training.

We use two types of parallelism: data and model parallelism. First, we facilitate two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of the back propagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with back propagation computation. To switch between these two options use --DDP-impl local or --DDP-impl torch, respectively. As expected, Torch distributed data parallelism is more efficient at larger model sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory and for some configurations (e.g., 2.5 billion parameters using 2-way model parallel and 1.2 billion parameters with no model parallel) can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time.

Second, we developed a simple and efficient two-dimensional model-parallel approach. To use tensor model parallelism (splitting execution of a single transformer module over multiple GPUs), add the --tensor-model-parallel-size flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. To use pipeline model parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then pipelining execution by breaking the batch into smaller microbatches), use the --pipeline-model-parallel-size flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages would mean each stage gets 6 transformer layers each).
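As a sanity check on a planned configuration, the sketch below derives the data parallel size and the number of layers per pipeline stage from the two model-parallel sizes. The function parallel_layout is hypothetical, and it assumes the data parallel size is the world size divided by the product of the tensor and pipeline sizes.

```python
def parallel_layout(world_size, tensor_mp_size, pipeline_mp_size, num_layers):
    """Return the derived data parallel size and layers per pipeline stage."""
    model_parallel_size = tensor_mp_size * pipeline_mp_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world size must be divisible by tensor * pipeline size")
    if num_layers % pipeline_mp_size != 0:
        raise ValueError("layers must divide evenly across pipeline stages")
    return {
        # Assumption: the GPUs left over after model parallelism form replicas.
        "data_parallel_size": world_size // model_parallel_size,
        "layers_per_stage": num_layers // pipeline_mp_size,
    }


if __name__ == "__main__":
    # 24 layers over 4 pipeline stages gives 6 layers per stage, as in the text.
    print(parallel_layout(world_size=8, tensor_mp_size=2,
                          pipeline_mp_size=4, num_layers=24))
```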

We have examples of how to use these two different forms of model parallelism in the example scripts ending in distributed_with_mp.sh. Note that pipeline parallelism is not currently supported in the T5 model.

Other than these minor changes, the distributed training is identical to the training on a single GPU.

Distributed training:

WORLD_SIZE=8
TENSOR_MP_SIZE=2
PIPELINE_MP_SIZE=2

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=<same as above>
VOCAB_FILE=<same as above>
DATA_PATH=<same as above>
MODEL_ARGS=<same as above>
OUTPUT_ARGS=<same as above>

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_<model>.py \
                $MODEL_ARGS \
                $OUTPUT_ARGS \
                --save $CHECKPOINT_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl torch

The interleaved pipelining schedule (more details in Section 2.2.2 of our paper) can be enabled using the --num-layers-per-virtual-pipeline-stage argument, which controls the number of transformer layers in a virtual stage (by default with the non-interleaved schedule, each GPU will execute a single virtual stage with NUM_LAYERS / PIPELINE_MP_SIZE transformer layers). The total number of layers in the transformer model should be divisible by this argument value. Additionally, the number of microbatches in the pipeline (computed as GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)) should be divisible by the PIPELINE_MP_SIZE when using this schedule (this condition is checked in an assertion in the code). The interleaved schedule is not supported for pipelines with 2 stages (PIPELINE_MP_SIZE=2).
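The constraints listed above can be checked before launching a job. The following is a sketch with hypothetical function names; it only encodes the conditions stated in this section and is not the assertion code from the repository.

```python
def num_microbatches(global_batch_size, micro_batch_size, data_parallel_size):
    """GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)."""
    per_pass = data_parallel_size * micro_batch_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size is not a multiple of "
                         "data parallel size * micro batch size")
    return global_batch_size // per_pass


def interleaved_schedule_problems(num_layers, pipeline_mp_size,
                                  layers_per_virtual_stage, microbatches):
    """Return a list of violated conditions (empty if the config looks valid)."""
    problems = []
    if pipeline_mp_size <= 2:
        problems.append("interleaving needs more than 2 pipeline stages")
    if num_layers % layers_per_virtual_stage != 0:
        problems.append("num layers not divisible by layers per virtual stage")
    if microbatches % pipeline_mp_size != 0:
        problems.append("microbatch count not divisible by pipeline size")
    return problems


if __name__ == "__main__":
    mb = num_microbatches(global_batch_size=64, micro_batch_size=4,
                          data_parallel_size=2)
    print(mb, interleaved_schedule_problems(24, 4, 3, mb))
```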

GPT-3 Example

In examples/pretrain_gpt3_175B.sh we have provided an example of how to configure Megatron to run GPT-3 with 175 billion parameters on 1024 GPUs. The script is designed for slurm with the pyxis plugin but can be easily adapted to any other scheduler. It uses 8-way tensor parallelism and 16-way pipeline parallelism. With the options --global-batch-size 1536 and --rampup-batch-size 16 16 5859375, training starts with a global batch size of 16 and linearly increases it to 1536 over 5,859,375 samples in incremental steps of 16. The training dataset can be either a single set or multiple datasets combined with a set of weights.
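One way to picture the batch size ramp-up is the sketch below, which assumes the ramp-up samples are split evenly across the increments. The function rampup_batch_size is hypothetical, and the exact stepping rule used by the code base may differ.

```python
def rampup_batch_size(consumed_samples, start, increment,
                      rampup_samples, global_batch_size):
    """Global batch size after `consumed_samples` training samples (a sketch)."""
    if (global_batch_size - start) % increment != 0:
        raise ValueError("batch size range must be a multiple of the increment")
    if consumed_samples >= rampup_samples:
        return global_batch_size
    num_increments = (global_batch_size - start) // increment
    # Assumption: every increment lasts the same number of samples.
    samples_per_increment = rampup_samples / num_increments
    steps = int(consumed_samples / samples_per_increment)
    return min(start + steps * increment, global_batch_size)


if __name__ == "__main__":
    # Values from the GPT-3 example: 16 16 5859375 with a final batch of 1536.
    for consumed in (0, 1_000_000, 3_000_000, 5_859_375):
        print(consumed, rampup_batch_size(consumed, 16, 16, 5_859_375, 1536))
```

Under this assumption the 95 increments between 16 and 1536 each last roughly 61,678 samples.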

With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds resulting in 138 teraFLOPs per GPU which is 44% of the theoretical peak FLOPs.
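The utilization figure can be reproduced with simple arithmetic. The sketch below assumes a peak of 312 teraFLOPs per A100 for dense half precision math; that number is not stated in this document and is supplied here as an assumption.

```python
# Assumption: A100 dense FP16/BF16 tensor core peak, in teraFLOPs.
A100_PEAK_TFLOPS = 312.0


def utilization(achieved_tflops, peak_tflops=A100_PEAK_TFLOPS):
    """Fraction of the theoretical peak that was achieved."""
    return achieved_tflops / peak_tflops


def aggregate_pflops(per_gpu_tflops, num_gpus):
    """Aggregate throughput over all GPUs, in petaFLOPs."""
    return per_gpu_tflops * num_gpus / 1000.0


if __name__ == "__main__":
    print(f"utilization: {utilization(138.0):.1%}")
    print(f"aggregate:   {aggregate_pflops(138.0, 1024):.1f} petaFLOPs")
    # Throughput in sequences per second at 32 s per 1536-sample iteration.
    print(f"sequences/s: {1536 / 32:.0f}")
```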

Evaluation and Tasks

We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the --finetune flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the --finetune flag before continuing, otherwise the training will start again from the beginning.

Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on input and pipeline model parallelism on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.

TENSOR_MODEL_PARALLEL_SIZE=2
TARGET_PIPELINE_MODEL_PARALLEL_SIZE=2

VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m

WORLD_SIZE=$TENSOR_MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
        --model-type BERT \
        --tensor-model-parallel-size $TENSOR_MODEL_PARALLEL_SIZE \
        --pipeline-model-parallel-size 1 \
        --target-pipeline-model-parallel-size $TARGET_PIPELINE_MODEL_PARALLEL_SIZE \
        --tokenizer-type BertWordPieceLowerCase \
        --vocab-file $VOCAB_FILE \
        --num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 512 \
        --max-position-embeddings 512 \
        --load $CHECKPOINT_PATH \
        --save $CHECKPOINT_PATH/merged

Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.

GPT Text Generation

bash examples/generate_text.sh

We generate text samples using largely the same arguments as the GPT pretraining script. A few changes are needed: provide the path to the pretrained checkpoint, the length of the output samples, and whether to generate text unconditionally (use --num-samples to set how many samples to generate) or conditionally (pass --sample-input-file, where each line of the file is used as the conditioning text). There are a few optional sampling parameters to experiment with, e.g. top-k, top-p, or greedy sampling (set top-k and top-p to 0).
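For readers unfamiliar with the sampling options, here is a minimal pure Python sketch of top-k, top-p, and greedy selection over a single probability distribution. The function names are hypothetical and this is not the generation code in the repository; as a simplification, top-k takes precedence when both options are set.

```python
import random


def filter_probs(probs, top_k=0, top_p=0.0):
    """Zero out excluded tokens and renormalize the rest (a sketch)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k == 0 and top_p == 0.0:
        keep = order[:1]          # greedy: only the most likely token
    elif top_k > 0:
        keep = order[:top_k]      # the k most likely tokens
    else:
        keep, cumulative = [], 0.0
        for i in order:           # smallest set whose mass reaches top_p
            keep.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


def sample_token(probs, top_k=0, top_p=0.0, rng=random):
    """Draw one token index from the filtered distribution."""
    weights = filter_probs(probs, top_k, top_p)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    print(filter_probs(probs, top_p=0.9))
    print(sample_token(probs))  # greedy, always the index of the largest value
```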

CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
GPT_ARGS=<same as those in GPT pretraining above>

MAX_OUTPUT_SEQUENCE_LENGTH=1024
TEMPERATURE=1.0
TOP_P=0.9
NUMBER_OF_SAMPLES=2
OUTPUT_FILE=samples.json

python tools/generate_samples_gpt.py \
       $GPT_ARGS \
       --load $CHECKPOINT_PATH \
       --out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
       --temperature $TEMPERATURE \
       --genfile $OUTPUT_FILE \
       --num-samples $NUMBER_OF_SAMPLES \
       --top_p $TOP_P \
       --recompute

GPT Evaluation

We include example scripts for GPT evaluation on WikiText perplexity evaluation and LAMBADA Cloze accuracy.

WikiText Perplexity Evaluation

For a fair comparison with prior work, we evaluate perplexity on the word-level WikiText-103 test dataset, and adjust the perplexity computation to account for the change in token count when using our subword tokenizer.
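One common way to make this adjustment is to normalize the summed loss by the number of original word-level tokens rather than the number of subword tokens. The sketch below illustrates that idea; the function name is hypothetical, and the exact formula used by the evaluation script is assumed rather than quoted.

```python
import math


def word_level_perplexity(total_loss, num_original_tokens, num_tokenized_tokens):
    """Perplexity per original (word-level) token from a summed subword loss.

    Assumption: `total_loss` is the sum of next-token losses over the
    tokenized text, so there are num_tokenized_tokens - 1 predictions.
    """
    avg_subword_loss = total_loss / (num_tokenized_tokens - 1)
    # Rescale by how many subword tokens stand in for each original token.
    token_ratio = (num_tokenized_tokens - 1) / (num_original_tokens - 1)
    return math.exp(avg_subword_loss * token_ratio)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(word_level_perplexity(total_loss=300.0,
                                num_original_tokens=100,
                                num_tokenized_tokens=120))
```

Because subword tokenization produces more tokens than there are words, skipping this rescaling would report a perplexity that looks lower than the word-level figure used in prior work.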

We use the following command to run WikiText-103 evaluation on a 345M parameter model.

TASK="WIKITEXT103"

VALID_DATA=<wikitext path>.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m

COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 1024 \
                  --max-position-embeddings 1024 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --checkpoint-activations \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

LAMBADA Cloze Accuracy

To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the LAMBADA dataset.

We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the --strict-lambada flag should be used to require whole word matching. Make sure that lambada is part of the file path.
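Whole word matching can be pictured as follows: a passage counts as correct only if every token of the final word is predicted correctly. This is a sketch with a hypothetical function name and made-up token ids, not the scoring code from the repository.

```python
def cloze_accuracy(predicted, targets):
    """Fraction of passages whose final word is predicted token for token.

    `predicted` and `targets` are parallel lists; each entry is the list of
    token ids for the last word of one passage.
    """
    if len(predicted) != len(targets):
        raise ValueError("need one prediction per target")
    if not targets:
        raise ValueError("no examples to score")
    correct = sum(1 for p, t in zip(predicted, targets) if list(p) == list(t))
    return correct / len(targets)


if __name__ == "__main__":
    # Hypothetical token ids: the second passage misses one subword token.
    print(cloze_accuracy([[11, 12], [30, 31]], [[11, 12], [30, 99]]))
```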

TASK="LAMBADA"

VALID_DATA=<lambada path>.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS=<same as those in WikiText Perplexity Evaluation above>

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --strict-lambada \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --checkpoint-activations \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

Further command line arguments are described in the source file main.py.

BERT Task Evaluation

RACE Evaluation

The following script finetunes the BERT model for evaluation on the RACE dataset. The TRAIN_DATA and VALID_DATA directories contain the RACE dataset as separate .txt files. Note that for RACE, the batch size is the number of RACE queries to evaluate. Since each RACE query has four samples, the effective batch size passed through the model will be four times the batch size specified on the command line.

TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
            data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 512 \
                  --max-position-embeddings 512 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
                      --valid-data $VALID_DATA \
                      --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
                      --checkpoint-activations \
                      --save-interval 10000 \
                      --save $CHECKPOINT_PATH \
                      --log-interval 100 \
                      --eval-interval 1000 \
                      --eval-iters 10 \
                      --weight-decay 1.0e-1"

python tasks/main.py \
       --task RACE \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 3 \
       --micro-batch-size 4 \
       --lr 1.0e-5 \
       --lr-warmup-fraction 0.06

MNLI Evaluation

The following script finetunes the BERT model for evaluation with the MultiNLI sentence pair corpus. Because the matching tasks are quite similar, the script can be quickly tweaked to work with the Quora Question Pairs (QQP) dataset as well.

TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
            data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
COMMON_TASK_ARGS=<same as those in RACE Evaluation above>
COMMON_TASK_ARGS_EXT=<same as those in RACE Evaluation above>

python tasks/main.py \
       --task MNLI \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 5 \
       --micro-batch-size 8 \
       --lr 5.0e-5 \
       --lr-warmup-fraction 0.065

Datasets

We do not host any datasets for GPT or BERT training; however, we detail their collection so that our results may be reproduced.

Collecting Wikipedia Training Data

We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text."

We recommend using the --json argument when using WikiExtractor, which will dump the Wikipedia data into loose json format (one json per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this json dataset by nltk punctuation standardization. For BERT training, use the --split-sentences flag to preprocess_data.py as described above to include sentence breaks in the produced index. If you'd like to use Wikipedia data for GPT training you should still clean it with nltk/spacy/ftfy, but do not use the --split-sentences flag.

Collecting GPT Webtext Data

We utilize the publicly available OpenWebText library from jcpeterson and eukaryote31's work to download urls. We then filtered, cleaned, and deduplicated all downloaded content according to the procedure described in our openwebtext directory. For reddit URLs corresponding to content up to October 2018 we arrived at approximately 37GB of content.

Comments

Error in fused softmax kernel result

Problem ?

스크린샷 2021-08-12 오전 11 28 52

The result of the fused softmax layer is different from the result of the original torch softmax layer.

How to reproduce ?

import math

import torch
from torch.nn import Softmax
from transformers import BertTokenizer
from transformers.models.bert.modeling_bert import BertModel
from fused import FusedScaleMaskSoftmax
from fused import AttnMaskType

def load_fused_kernels():
    try:
        import fused_mix_prec_layer_norm_cuda
        import scaled_masked_softmax_cuda
        import scaled_upper_triang_masked_softmax_cuda
        import torch

        print("[Success] load_fused_kernels")
    except ImportError as e:
        print("[Fail] load_fused_kernels")
        raise e


def attention_mask_func(attention_scores, attention_mask):
    attention_scores.masked_fill_(attention_mask, -10000.0)
    return attention_scores


def test_softmax():
    bert = BertModel.from_pretrained("bert-base-cased").cuda().half()
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    # len_query=24, batch_per_block=8 (in my setting)
    tokens = tokenizer(
        [
            "Hello. How are you? I am fine thank you and you? yes Good. hi hello hello hello hello"
        ]
        * 4,
        return_tensors="pt",
    )

    embedding_output = bert.embeddings(
        input_ids=tokens["input_ids"].cuda(),
        position_ids=None,
        token_type_ids=tokens["token_type_ids"].cuda(),
        inputs_embeds=None,
        past_key_values_length=0,
    )

    # (bsz, 1, 1, seq_len), all values are 0.
    mask = bert.get_extended_attention_mask(
        attention_mask=tokens["attention_mask"].cuda(),
        input_shape=tokens["input_ids"].shape,
        device=bert.device,
    )
    # (bsz, 1, seq_len, seq_len)
    mask = mask.repeat(1, 1, mask.size()[-1], 1)

    attention = bert.encoder.layer[0].attention.self
    query_proj = attention.query
    key_proj = attention.key
    value_proj = attention.value

    key_layer = attention.transpose_for_scores(key_proj(embedding_output))
    value_layer = attention.transpose_for_scores(value_proj(embedding_output))
    query_layer = attention.transpose_for_scores(query_proj(embedding_output))

    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    attention_scores /= math.sqrt(key_layer.size()[-1])

    fused_softmax = FusedScaleMaskSoftmax(
        mask_func=attention_mask_func,
        attn_mask_type=AttnMaskType.padding,
        input_in_fp16=True,
        input_in_bf16=False,
        scale=None,
        softmax_in_fp32=False,
        scaled_masked_softmax_fusion=True,
    )

    fused_softmax_output = fused_softmax(
        attention_scores,
        (mask != 0),
    )

    torch_softmax = FusedScaleMaskSoftmax(
        mask_func=attention_mask_func,
        attn_mask_type=AttnMaskType.padding,
        input_in_fp16=True,
        input_in_bf16=False,
        scale=None,
        softmax_in_fp32=False,
        scaled_masked_softmax_fusion=False,
    )

    torch_softmax_output = torch_softmax(
        attention_scores,
        (mask != 0),
    )

    print("fused (turn on fusion):", fused_softmax_output[0][0][0])
    print("\n")
    print("fused (turn off fusion):", torch_softmax_output[0][0][0])

    torch_softmax = torch.nn.Softmax(dim=-1)
    torch_softmax_output = torch_softmax(attention_scores)

    print("\n")
    print("torch softmax", torch_softmax_output[0][0][0])


if __name__ == "__main__":
    load_fused_kernels()
    test_softmax()

opened by hyunwoongko 22

There is a difference in the calculation of num_warmup_microbatches

In interleaved-1F1B:

https://github.com/NVIDIA/Megatron-LM/blob/b31e1296354e979722627a6c4dedafe19b51fa97/megatron/schedules.py#L222-L223

but in 1F1B:

https://github.com/NVIDIA/Megatron-LM/blob/b31e1296354e979722627a6c4dedafe19b51fa97/megatron/schedules.py#L531-L533

what is the purpose of this diff?

opened by unlimblue 11

Compatibility with pytorch-transformers for fine-tuning

Hi,

Thanks for the great package! I wanted to check about the compatibility of the trained GPT-2 model/tokenizer with the pytorch-transformers package. Is it possible that, with a few changes, the trained model can be imported using that package, in order to perform additional fine-tuning there with different heads for example? I understand that there are some config files expected by that package, so I'm assuming these can be added. But I'm interested in knowing about the compatibility of the model/tokenizer mainly.

Thanks!

opened by harkous 6

perplexity too big for gpt2 wikitext evaluation

When running the wikitext evaluation of gpt2

python evaluate_gpt2.py 
    --valid-data wikitext-103-v1/wiki.test.tokens 
    --load-openai 
    --hidden-size 768 
    --vocab-size 50257 
    --tokenizer-type GPT2BPETokenizer 
    --max-position-embeddings 1024

the resulting perplexity is 2.9290E+02 -- why is the value so extremely high?

Here is the console output with logging level DEBUG:

Evaluate GPT2 model
WARNING: No training data specified
using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-vocab.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-merges.txt HTTP/1.1" 200 0
INFO:data_utils.tokenization_gpt2:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /braintree/home/msch/.pytorch_pretrained_bert/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
INFO:data_utils.tokenization_gpt2:loading merges file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /braintree/home/msch/.pytorch_pretrained_bert/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
wikitext
Original Tokens: 270330, Detokenized tokens: 245566
> padded vocab (size: 50257) with 0 dummy tokens (new size: 50257)
global rank: 0 | vocab size: 50257 | eod token: 50256 | num_examples: 8448 | num_original_tokens: 245566 | num_tokenized_tokens: 270330
building GPT2 model ...
 > number of parameters: 209494272
loading openai weights
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-pytorch_model.bin HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-config.json HTTP/1.1" 200 0
INFO:pytorch_pretrained_bert.modeling_gpt2:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin from cache at gpt2_weights/4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1
INFO:pytorch_pretrained_bert.modeling_gpt2:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at gpt2_weights/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d272f80
INFO:pytorch_pretrained_bert.modeling_gpt2:Model config {
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "vocab_size": 50257
}

global rank: 0 | max iters: 2112
global rank: 0 | iteration: 0
global rank: 0 | iteration: 100
...
global rank: 0 | iteration: 1900
global rank: 0 | iteration: 2000
global rank: 0 | iteration: 2100
----------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
 validation results on wiki | avg loss: 5.6798E+00 | ppl: 2.9290E+02 | adjusted ppl: 5.1937E+02 | token ratio: 1.1008449901248143 |
------------------------------------------------------------------------------------------------------------------------------------
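The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes that ppl is exp(avg loss) and that adjusted ppl is exp(avg loss * token ratio), and it counts parameters for a standard GPT-2 block layout (fused QKV, 4x MLP, biases, two LayerNorms per block, tied output embedding). The helper name gpt2_param_count is hypothetical.

```python
import math

avg_loss = 5.6798                  # "avg loss" from the log
token_ratio = 1.1008449901248143   # "token ratio" from the log

ppl = math.exp(avg_loss)
adjusted_ppl = math.exp(avg_loss * token_ratio)
print(f"ppl={ppl:.2f} adjusted_ppl={adjusted_ppl:.2f}")


def gpt2_param_count(n_layer, n_embd=768, vocab=50257, n_pos=1024):
    # Assumed layout: fused QKV + output projection, 4x MLP, biases,
    # two LayerNorms per block, one final LayerNorm, tied output embedding.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * 2 * n_embd
    block = attn + mlp + norms
    embeddings = vocab * n_embd + n_pos * n_embd
    return n_layer * block + embeddings + 2 * n_embd


print(gpt2_param_count(12))  # 12 layers, as in the printed model config
print(gpt2_param_count(24))  # 24 layers
```

Both perplexity values in the log are reproduced by these formulas, so the reported figures are internally consistent. Under the assumed layout, the logged parameter count of 209494272 matches the 24-layer count rather than the 12-layer one (124439808), which suggests the model was built with 24 layers while the loaded config has n_layer 12. Whether that mismatch explains the high perplexity is not confirmed in this thread.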

opened by mschrimpf 5

[Question]Megatron Performance with NGC PyTorch

Hi, I'm not sure if this is the right repo for this question; please redirect me if it isn't.

I'm training Megatron-LM with NGC container, but I need a custom change on PyTorch. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

I found that the stock PyTorch in the NGC container is consistently faster than the alternatives in forward/backward compute time. If I make custom changes and compile from source, or install from conda/pip, it is always slower. Any ideas why, or how I can match the performance? I'm already using the same PyTorch commit listed in the NGC release notes: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-03.html#rel_22-03

I'd like to make some custom changes and still match the performance of NGC PyTorch.

Any insights will be helpful, thanks!

# NGC PyTorch
iteration      600/     800 | consumed samples:         4800 | elapsed time per iteration (ms): 437.3 | learning rate: 2.733E-05 | global batch size:     8 | lm loss: 6.766993E+00 | loss scale: 65536.0 | grad norm: 1.671 | number of skipped iterations:   0 | number of nan iterations:   0 |
7: time (ms) | forward-compute: 117.89 | backward-compute: 278.99 | backward-params-all-reduce: 1.87 | backward-layernorm-all-reduce: 0.01 | backward-embedding-all-reduce: 0.02 | backward-reduce-model-grads: 1.94 | backward-gather-model-params: 0.01 | optimizer-cop

# Conda/Pip installed PyTorch and Compiled from source PyTorch
iteration      600/     800 | consumed samples:         4800 | elapsed time per iteration (ms): 451.7 | learning rate: 2.733E-05 | global batch size:     8 | lm loss: 6.767890E+00 | loss scale: 65536.0 | grad norm: 1.687 | number of skipped iterations:   0 | number of nan iterations:   0 |
7: time (ms) | forward-compute: 120.25 | backward-compute: 290.83 | backward-params-all-reduce: 1.87 | backward-layernorm-all-reduce: 0.01 | backward-embedding-all-reduce: 0.02 | backward-reduce-model-grads: 1.94 | backward-gather-model-params: 0.01 | optimizer-copy-to-main-grad: 4.85 | optimizer-unscale-and-check-inf: 4.91 | optimizer-clip-main-grad: 7.58 | optimizer-count-zeros: 0.01 | optimizer-inner-step: 14.71 | optimizer-copy-main-to-model-params: 5.20 | optimizer: 37.35 | batch-generator: 1.44
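To put the gap in relative terms, the timing fields can be parsed and compared. This is a minimal sketch using only the forward-compute and backward-compute values quoted in the two logs above; parse_timings is a hypothetical helper, not part of Megatron-LM.

```python
import re


def parse_timings(line):
    # Turn "name: 1.23 | other-name: 4.56" into {"name": 1.23, ...}.
    pairs = re.findall(r"([a-z][a-z\-]*): ([0-9.]+)", line)
    return {name: float(value) for name, value in pairs}


ngc = parse_timings("forward-compute: 117.89 | backward-compute: 278.99")
pip = parse_timings("forward-compute: 120.25 | backward-compute: 290.83")

for name in ngc:
    slowdown = (pip[name] / ngc[name] - 1.0) * 100.0
    print(f"{name}: {slowdown:.1f}% slower")
```

With these values the forward pass is about 2% slower and the backward pass about 4% slower outside the NGC build.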

opened by roywei 4

AttributeError: 'Namespace' object has no attribute 'model_parallel_size'

When I run preprocess_data.py, it fails with the error 'Namespace' object has no attribute 'model_parallel_size'.

!python Megatron-LM/tools/preprocess_data.py
--input 'manifest_file.json'
--output-prefix 'my_t5'
--vocab 'vocab/vocab.txt'
--dataset-impl mmap
--tokenizer-type BertWordPieceLowerCase
--workers 1
--split-sentences

Error: Opening manifest_file.json

building BertWordPieceLowerCase tokenizer ...
Traceback (most recent call last):
  File "Megatron-LM/tools/preprocess_data.py", line 203, in <module>
    main()
  File "Megatron-LM/tools/preprocess_data.py", line 155, in main
    tokenizer = build_tokenizer(args)
  File "/opt/conda/lib/python3.6/site-packages/megatron/tokenizer/tokenizer.py", line 48, in build_tokenizer
    args)
  File "/opt/conda/lib/python3.6/site-packages/megatron/tokenizer/tokenizer.py", line 59, in _vocab_size_with_padding
    args.model_parallel_size
AttributeError: 'Namespace' object has no attribute 'model_parallel_size'
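The traceback mixes two locations: the script runs from the Megatron-LM/ checkout, while the megatron package is imported from site-packages. One possibility, not confirmed in this thread, is that the two come from different versions. A quick way to see which copy Python would import is sketched below; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    # Return the file a top-level module would be imported from, or None
    # if no such module can be found on the current import path.
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


print(module_origin("megatron"))
```

If the printed path points into site-packages rather than the checkout that contains tools/preprocess_data.py, the script and the library may be out of sync.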

opened by abdul756 4

checkpoint wget download doesn't work

FYI, the instructions at https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints lead to 0-sized files. e.g.,

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
--2021-05-05 11:42:01--  https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 13.57.84.77, 13.52.19.24
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|13.57.84.77|:443... connected.
HTTP request sent, awaiting response... 200 
Length: unspecified
Saving to: ‘megatron_lm_345m_v0.0.zip’

megatron_lm_345m_v0.0.zip                    [ <=>                                                                            ]       0  --.-KB/s    in 0s      

2021-05-05 11:42:01 (0.00 B/s) - ‘megatron_lm_345m_v0.0.zip’ saved [0]

I was able to download the files manually via https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m/version

opened by stas00 4

merge_file_ in MMapIndexedDatasetBuilder does not work because of _doc_idx

task

try to merge two .bin files into one

code

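# Import paths assumed from the Megatron-LM layout shown in the traceback below.
from megatron.data import indexed_dataset
from megatron.tokenizer import build_tokenizer

data_path_prefix = ["test1", "test2"]

# Minimal stand-in for the parsed arguments that build_tokenizer expects.
class A:
    def __init__(self):
        self.tokenizer_type = 'BertWordPieceCase'
        self.rank = 0
        self.vocab_file = '/blue/yonghui.wu/alexgre/data/vocabs/bert/vocab.txt'
        self.merge_file = None
        self.make_vocab_size_divisible_by = 128
        self.tensor_model_parallel_size = 1

args = A()
tokenizer = build_tokenizer(args)

# output_bin_files and output_idx_files are the output .bin and .idx paths
# (not defined in this snippet).
builders = indexed_dataset.make_builder(output_bin_files,  impl='mmap', vocab_size=30592)

# Append each input dataset to the output builder.
for each in data_path_prefix:
    builders.merge_file_(each)

builders.finalize(output_idx_files)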

issue

After merging, running training on the merged data raises an error:

> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      1024000000
    validation: 107520
    test:       5120
> building train, validation, and test datasets for BERT ...
 > building dataset index ...
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
Traceback (most recent call last):
  File "../Megatron-LM/pretrain_bert.py", line 154, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/blue/yonghui.wu/alexgre/Megatron-LM/megatron/training.py", line 115, in pretrain
    = build_train_valid_test_data_iterators(
  File "/blue/yonghui.wu/alexgre/Megatron-LM/megatron/training.py", line 995, in build_train_valid_test_data_iterators
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
  File "../Megatron-LM/pretrain_bert.py", line 137, in train_valid_test_datasets_provider
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
  File "/blue/yonghui.wu/alexgre/Megatron-LM/megatron/data/dataset_utils.py", line 398, in build_train_valid_test_datasets
    return _build_train_valid_test_datasets(data_prefix[0],
  File "/blue/yonghui.wu/alexgre/Megatron-LM/megatron/data/dataset_utils.py", line 453, in _build_train_valid_test_datasets
    indexed_dataset = get_indexed_dataset_(data_prefix,
  File "/blue/yonghui.wu/alexgre/Megatron-LM/megatron/data/dataset_utils.py", line 549, in get_indexed_dataset_
    assert indexed_dataset.sizes.shape[0] == indexed_dataset.doc_idx[-1]
AssertionError

solution

I modified the merge_file_ function of the MMapIndexedDatasetBuilder class in indexed_dataset.py, and now it works:

class MMapIndexedDatasetBuilder(object):
    def __init__(self, out_file, dtype=np.int64):
        self._data_file = open(out_file, 'wb')
        self._dtype = dtype
        self._sizes = []
        self._doc_idx = [0]
        self._merge_idx = 0

    def add_item(self, tensor):
        np_array = np.array(tensor.numpy(), dtype=self._dtype)
        self._data_file.write(np_array.tobytes(order='C'))
        self._sizes.append(np_array.size)

    def end_document(self):
        self._doc_idx.append(len(self._sizes))

    def merge_file_(self, another_file):
        # Concatenate index
        index = MMapIndexedDataset.Index(index_file_path(another_file))
        assert index.dtype == self._dtype

        for s in index.sizes:
            self._sizes.append(s)
        
        if self._merge_idx == 0:
            self._doc_idx = []
            self._doc_idx.extend(index.doc_idx)
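        else:
            start_pt = self._doc_idx[-1]
            # index.doc_idx starts with 0; skip it so the boundary between
            # files is not appended twice (which would add an empty document)
            for each in index.doc_idx[1:]:
                new_doc_idx = start_pt + each
                self._doc_idx.append(new_doc_idx)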

        # Concatenate data
        with open(data_file_path(another_file), 'rb') as f:
            shutil.copyfileobj(f, self._data_file)

        self._merge_idx += 1

    def finalize(self, index_file):
        self._data_file.close()

        self._sizes = np.array(self._sizes)
        self._doc_idx = np.array(self._doc_idx)
        
        print(self._sizes.shape)
        print(self._doc_idx.shape)
        print(self._sizes.shape[0], self._doc_idx[-1])

        with MMapIndexedDataset.Index.writer(index_file, self._dtype) as index:
            index.write(self._sizes, self._doc_idx)
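The document-index bookkeeping can be looked at in isolation. The sketch below is a pure-Python illustration, not the library's code: it assumes each per-file doc_idx starts at 0 and ends at that file's sentence count, and it shifts every later file by the number of sentences already written. merge_doc_idx is a hypothetical helper name.

```python
def merge_doc_idx(doc_idx_per_file):
    # Each input list starts at 0 and ends at that file's sentence count.
    merged = [0]
    for doc_idx in doc_idx_per_file:
        offset = merged[-1]  # sentences written so far
        # Skip the leading 0 so the boundary between files is not repeated.
        merged.extend(offset + i for i in doc_idx[1:])
    return merged


# Two files with 5 and 4 sentences respectively.
print(merge_doc_idx([[0, 2, 5], [0, 3, 4]]))
```

For the two example files the result is [0, 2, 5, 8, 9]: the last entry equals the total number of sentences, which is the property checked by the assertion in get_indexed_dataset_.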

follow up

do you want me to create a pull request on this?

opened by bugface 4

Unintended error caused by compiling fused_kernels

https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L186-L198
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/fused_kernels/__init__.py#L46-L72

When I tried to train GPT-3 on multiple nodes using torch.distributed.launch, the training process sometimes got stuck while compiling the fused_kernels. This bug can be caused by a timing issue when multiple processes compile concurrently. The simplest workaround is to remove ./fused_kernels/build and run the script again, but I don't think that solves the fundamental problem.

In my case, I resolved this issue by using torch.distributed.barrier and letting only the master rank (rank == 0) compile the fused_kernels. If the authors think resolving this issue is necessary, I will open a PR :)
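The approach described here can be sketched as follows. This is an illustration of the ordering only, not the code from the eventual PR: compile_then_sync is a hypothetical helper, and the build and barrier steps are passed in as callables so the ordering can be checked without a GPU. In a real run barrier_fn would be torch.distributed.barrier and build_fn the kernel-loading call.

```python
def compile_then_sync(rank, build_fn, barrier_fn):
    # Rank 0 compiles first while every other rank waits at the barrier;
    # afterwards the other ranks load the already-built kernels.
    if rank == 0:
        build_fn()      # only rank 0 compiles and fills the build cache
        barrier_fn()    # release the waiting ranks
    else:
        barrier_fn()    # wait until rank 0 has finished compiling
        build_fn()      # expected to be a fast cache hit


# Record the order of operations for two simulated ranks.
for rank in (0, 1):
    order = []
    compile_then_sync(rank, lambda: order.append("build"),
                      lambda: order.append("barrier"))
    print(rank, order)
```

This ordering assumes all ranks share one build directory. If each node has its own file system, the first process on each node would have to compile instead of global rank 0.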

opened by wade3han 4
Improve and fix bugs in the fused softmax layer
Fix bugs related to ELEMENTS_PER_LDG_STG (reported in https://github.com/NVIDIA/Megatron-LM/issues/132)

Add test code for all fused CUDA kernels using Hugging Face transformers

Add a constraint 0 <= length_key <= 2048 (originally in the header file as TORCH_INTERNAL_ASSERT)

Add a constraint on batch_per_block (originally in the header file as TORCH_INTERNAL_ASSERT)

Refactor the Python fused scale mask softmax layer code
opened by hyunwoongko 3

Can't find scaled_masked_softmax.cpp

I want to run bash examples/generate_text.sh, but an error occurs:

Traceback (most recent call last):
  File "tools/generate_samples_gpt.py", line 116, in <module>
    main()
  File "tools/generate_samples_gpt.py", line 94, in main
    initialize_megatron(extra_args_provider=add_text_generate_args,
  File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/initialize.py", line 48, in initialize_megatron
    set_global_variables(extra_args_provider=extra_args_provider,
  File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/global_vars.py", line 82, in set_global_variables
    args = _parse_args(extra_args_provider=extra_args_provider,
  File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/global_vars.py", line 97, in _parse_args
    _GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
  File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/arguments.py", line 190, in parse_args
    fused_kernels.load_scaled_masked_softmax_fusion_kernel()
  File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/fused_kernels/__init__.py", line 88, in load_scaled_masked_softmax_fusion_kernel
    scaled_upper_triang_masked_softmax_cuda = cpp_extension.load(
  File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1262, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
  File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
  File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
    with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/fused_kernels/scaled_masked_softmax.cpp'

It seems that the program can't find scaled_masked_softmax.cpp. I have the directory /opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/fused_kernels/, but it contains only __init__.py, __pycache__, and build.

I am not sure that I've set up Megatron correctly. I ran python setup.py install in the cloned repo folder. When checking with conda list, it shows megatron-lm 1.1.5 pypi_0 pypi.

opened by Co1lin 3

Module 'megatron.core.parallel_state' has no attribute 'parallel_state'
Hi there,

I am trying to merge a GPT2-6.7B trained with

TENSOR_MP_SIZE=8 PIPELINE_MP_SIZE=1

using the tools/checkpoint_util.py script. However, I am getting the following error message:

AttributeError: module 'megatron.core.parallel_state' has no attribute 'parallel_state'

This can be fixed by going through tools/checkpoint_loader_megatron.py and substituting mpu.parallel_state with mpu but then I get another error:

  File "Megatron-LM/megatron/core/parallel_state.py", line 227, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

Is the first fix causing this?
opened by chrisby 0
Are there more features to be released?

It has been a long time since the last release. Are there more features to be released, such as support for more high-performance ops, compatibility with PyTorch 2.0, and so on?

opened by GxjGit 0
Add UL2 data sampling and pretraining

This adds pretraining using UL2 for encoder-decoder, non-causal decoder-only, and causal decoder-only models. I have not yet run large-scale tests to see if it yields the desired training improvements, but I wanted to give others the option to take a look at the code already.

I'm also not super sure about the non-causal GPT model, but I can disable (or even remove) that part if desired.

opened by janEbert 2
T5 model run on a single gpu

sh examples/pretrain_t5.sh

setting number of micro-batches to constant 1

building BertWordPieceLowerCase tokenizer ...
padded vocab (size: 21230) with 18 dummy tokens (new size: 21248)
initializing torch distributed ...
Traceback (most recent call last):
  File "pretrain_t5.py", line 181, in <module>
    forward_step, args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
  File "/workspace/Megatron-LM-3.0/megatron/training.py", line 103, in pretrain
    args_defaults=args_defaults)
  File "/workspace/Megatron-LM-3.0/megatron/initialize.py", line 81, in initialize_megatron
    finish_mpu_init()
  File "/workspace/Megatron-LM-3.0/megatron/initialize.py", line 62, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/Megatron-LM-3.0/megatron/initialize.py", line 182, in _initialize_distributed
    timeout=timedelta(minutes=10))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    master_addr = _get_env_or_raise("MASTER_ADDR")
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 206, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
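The ValueError says the env:// rendezvous could not find MASTER_ADDR, a variable that a distributed launcher normally sets together with MASTER_PORT. For a single-process run, one option is to provide the rendezvous variables before the process group is initialized. The sketch below is an assumption about a workable setup, not the repository's documented procedure: ensure_single_process_env is a hypothetical helper, the port value is arbitrary, and RANK and WORLD_SIZE are included in case the script reads them from the environment.

```python
import os


def ensure_single_process_env(port="6000"):
    # Fill in rendezvous variables only if a launcher has not set them.
    defaults = {
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": port,   # arbitrary free port, assumed for illustration
        "RANK": "0",
        "WORLD_SIZE": "1",
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in defaults}


print(ensure_single_process_env())
```

The same effect can be had by exporting these variables in the shell before running the example script, or by starting the script through a launcher that sets them.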

opened by HueCheng1021 0
Adds black and isort for formatting

We were wondering if Megatron-LM would be interested in adopting a formatting style for the codebase. We realize that not adopting any formatting may have been an intentional choice, but in our own usage of Megatron-LM we've been trying to adopt formatting, which makes it harder to sync with upstream and vice versa.
We wanted to create a PR with the formatting we thought was closest to what is already used in the repo, to show some of the diffs it would make in the codebase.

Config in a new pyproject.toml

opened by Averylamp 0

Releases(v3.0.2)

v3.0.2(May 25, 2022)

Includes sequence parallelism and selective activation recomputation.
Source code(tar.gz)
Source code(zip)
v2.5(Aug 11, 2021)

Source code(tar.gz)
Source code(zip)
v0.1(Mar 27, 2019)

Initial commit with BERT mixed precision training.
Source code(tar.gz)
Source code(zip)