Ongoing research on training transformer language models at scale, including BERT and GPT-2

Overview

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor and pipeline), and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision.

Megatron has been used directly in a number of our downstream research projects.

Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage NVIDIA's Selene supercomputer to perform scaling studies and use up to 3072 A100 GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations including data loading, optimization, and even logging.

[Table: model configurations from 1 billion to 1 trillion parameters with achieved per-GPU and aggregate FLOPs]

All the cases from 1 billion to 1 trillion parameters achieve more than 43% half-precision utilization, which is high for an end-to-end application. We observe that initially the utilization remains roughly constant, but as hidden size increases for larger models, utilization starts increasing and reaches 52% for the largest model. We also note that the achieved aggregate petaFLOPs across all GPUs increases almost linearly with the number of GPUs, demonstrating good weak scaling.


Setup

We have tested Megatron with NGC's PyTorch container version 20.12, which uses Python 3.8, PyTorch 1.8, CUDA 11.1, and NCCL 2.8.3.

To use this repository, please install the latest supported versions of PyTorch with GPU support (Python 3.8, PyTorch 1.8, CUDA 11.1, and NCCL 2.8.3 and above) and NVIDIA APEX. We strongly recommend using one of NGC's recent PyTorch containers (the latest compatible version at the time of publication can be pulled with docker pull nvcr.io/nvidia/pytorch:20.12-py3). Data preprocessing requires NLTK, though this is not required for training, evaluation, or downstream tasks.
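
For example, after pulling that container, one possible way to launch it with this repository mounted is shown below; the host paths are placeholders and should be adjusted to your setup.

docker pull nvcr.io/nvidia/pytorch:20.12-py3
docker run --gpus all -it --rm \
       -v /path/to/Megatron-LM:/workspace/megatron \
       -v /path/to/data:/workspace/data \
       nvcr.io/nvidia/pytorch:20.12-py3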

Downloading Checkpoints

We have provided pretrained BERT-345M and GPT-345M checkpoints for use in evaluation or for finetuning on downstream tasks. To access these checkpoints, first sign up for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the NGC documentation.

Alternatively, you can directly download the checkpoints using:

BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip

The models require vocabulary files to run. The BERT WordPiece vocab file can be extracted from Google's pretrained BERT models: uncased, cased. The GPT vocab file and merge table can be downloaded directly.
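
As a convenience, the GPT-2 vocab file and merge table have historically been available at the URLs below (these mirrors may change over time, so treat the exact locations as an assumption):

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt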

Usage

After installation, there are several possible workflows. The most comprehensive is:

  1. Data preprocessing
  2. Pretraining
  3. Finetuning (Optional for zero-shot tasks)
  4. Downstream task evaluation or text generation

However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above.

We've provided several scripts for pretraining both BERT and GPT in the examples directory, as well as scripts for both zero-shot and fine-tuned downstream tasks including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT interactive text generation.

Training

Data Preprocessing

The training data requires preprocessing. First, place your training data in a loose JSON format, with one JSON object containing a text sample per line. For example:

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

The name of the text field of the json can be changed by using the --json-key flag in preprocess_data.py. The other metadata fields are optional and are not used in training.
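
For example, if your corpus stored the sample text under a hypothetical key named content, the BERT preprocessing command shown below could pass that key explicitly:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --json-key content \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences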

The loose json is then processed into a binary format for training. To convert the json into mmap, cached index file, or the lazy loader format use preprocess_data.py. Set the --dataset-impl flag to mmap, cached, or lazy, respectively (default is mmap). An example script to prepare data for BERT training is:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

The output will be two files named, in this case, my-bert_text_sentence.bin and my-bert_text_sentence.idx. The --data-path specified in later BERT training is the full path and new filename, but without the file extension.

For T5, use the same preprocessing as for BERT, perhaps renaming the output prefix to:

       --output-prefix my-t5 \

Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

Here the output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. As before, in GPT training, use the longer name without the extension as --data-path.

Further command line arguments are described in the source file preprocess_data.py.

BERT Pretraining

The examples/pretrain_bert.sh script runs single-GPU 345M-parameter BERT pretraining. Debugging is the primary use for single-GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at --lr to a minimum set by --min-lr over --lr-decay-iters iterations. The fraction of training iterations used for warmup is set by --lr-warmup-fraction. While this is single-GPU training, the batch size specified by --micro-batch-size is the batch size of a single forward-backward pass, and the code will perform gradient accumulation steps until it reaches --global-batch-size, which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (the default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with --seed). We use --train-iters as the number of training iterations requested. Alternatively, one can provide --train-samples, which is the total number of samples to train on. If this option is present, then instead of providing --lr-decay-iters, one will need to provide --lr-decay-samples.
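
As a concrete illustration with the values used in the script below (and data-parallel size 1 on a single GPU):

    gradient accumulation steps = global-batch-size / (micro-batch-size x data-parallel size)
                                = 8 / (4 x 1)
                                = 2 forward-backward passes per iteration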

The logging, checkpoint-saving, and evaluation intervals are specified below. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the --data-path now includes the additional _text_sentence suffix added in preprocessing, but does not include the file extension.

CHECKPOINT_PATH=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
DATA_PATH=my-bert_text_sentence

BERT_ARGS="--num-layers 24 \
           --hidden-size 1024 \
           --num-attention-heads 16 \
           --seq-length 512 \
           --max-position-embeddings 512 \
           --lr 0.0001 \
           --lr-decay-iters 990000 \
           --train-iters 2000000 \
           --min-lr 0.00001 \
           --lr-warmup-fraction 0.01 \
           --micro-batch-size 4 \
           --global-batch-size 8 \
           --vocab-file $VOCAB_FILE \
           --split 949,50,1 \
           --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

Further command line arguments are described in the source file arguments.py.

GPT Pretraining

The examples/pretrain_gpt.sh script runs single GPU 345M parameter GPT pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training.

It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a json vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the --lr-decay-style has been set to cosine decay. Note that the --data-path now includes the additional _text_document suffix added in preprocessing, but does not include the file extensions.

CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH=my-gpt2_text_document

GPT_ARGS="--num-layers 24 \
          --hidden-size 1024 \
          --num-attention-heads 16 \
          --seq-length 1024 \
          --max-position-embeddings 1024 \
          --micro-batch-size 4 \
          --global-batch-size 8 \
          --lr 0.00015 \
          --train-iters 500000 \
          --lr-decay-iters 320000 \
          --lr-decay-style cosine \
          --vocab-file $VOCAB_FILE \
          --merge-file $MERGE_FILE \
          --lr-warmup-fraction .01 \
          --fp16"

OUTPUT_ARGS=<same as those in BERT pretraining above>

python pretrain_gpt.py \
       $GPT_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

Further command line arguments are described in the source file arguments.py.

T5 Pretraining

Very similar to BERT and GPT, the examples/pretrain_t5.sh script runs single-GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:

  • --kv-channels sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.

  • --ffn-hidden-size sets the hidden size in the feed-forward networks within a transformer layer. For BERT and GPT this defaults to 4 times the transformer hidden size, but can be configured for T5.

  • --encoder-seq-length and --decoder-seq-length set the sequence length for the encoder and decoder separately.

All of the other arguments remain as they were for BERT and GPT pretraining.

CHECKPOINT_PATH=checkpoints/t5_base
VOCAB_FILE=t5-vocab.txt
DATA_PATH=my-t5_text_sentence

T5_ARGS="--num-layers 24 \
         --hidden-size 1024 \
         --num-attention-heads 16 \
         --kv-channels 64 \
         --ffn-hidden-size 3072 \
         --encoder-seq-length 512 \
         --decoder-seq-length 128 \
         --max-position-embeddings 512 \
         --lr 0.0001 \
         --lr-decay-iters 990000 \
         --train-iters 2000000 \
         --min-lr 0.00001 \
         --lr-warmup-fraction 0.01 \
         --micro-batch-size 16 \
         --global-batch-size 2048 \
         --vocab-file $VOCAB_FILE \
         --vocab-extra-ids 100 \
         --split 949,50,1 \
         --fp16"

OUTPUT_ARGS=<same as those in BERT pretraining above>

python pretrain_t5.py \
       $T5_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

Distributed Pretraining

The examples/pretrain_{bert,gpt,t5}_distributed.sh scripts use the PyTorch distributed launcher for distributed training. As such, multi-node training can be achieved by properly setting environment variables and using init_method='env://' in the launcher. See the official PyTorch documentation for further description of these environment variables. By default, multi-node training uses the nccl distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the Python flag -m torch.distributed.launch, detailed below, are the only additional requirements to adopt distributed training.

We use two types of parallelism: data and model parallelism. We facilitate two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of the back-propagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with back-propagation computation. To switch between these two options, use --DDP-impl local or --DDP-impl torch, respectively. As expected, Torch distributed data parallelism is more efficient at larger model sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory and for some configurations (e.g., 2.5 billion parameters using 2-way model parallelism and 1.2 billion parameters with no model parallelism) can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time.

Second, we developed a simple and efficient two-dimensional model-parallel approach. To use tensor model parallelism (splitting execution of a single transformer module over multiple GPUs), add the --tensor-model-parallel-size flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. To use pipeline model parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then pipelining execution by breaking the batch into smaller microbatches), use the --pipeline-model-parallel-size flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages would mean each stage gets 6 transformer layers each).

We have examples of how to use these two different forms of model parallelism in the example scripts ending in distributed_with_mp.sh. Note that pipeline parallelism is not currently supported for the T5 model.

Other than these minor changes, the distributed training is identical to the training on a single GPU.

Distributed training:

WORLD_SIZE=8
TENSOR_MP_SIZE=2
PIPELINE_MP_SIZE=2

DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT_PATH=<specify path>
VOCAB_FILE=<specify path to vocab file>
DATA_PATH=<specify path and file prefix>
MODEL_ARGS=<model-specific arguments, as in the corresponding pretraining script above>
OUTPUT_ARGS=<same as those in BERT pretraining above>

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_<model>.py \
                $MODEL_ARGS \
                $OUTPUT_ARGS \
                --save $CHECKPOINT_PATH \
                --load $CHECKPOINT_PATH \
                --data-path $DATA_PATH \
                --tensor-model-parallel-size $TENSOR_MP_SIZE \
                --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
                --DDP-impl torch

The interleaved pipelining schedule (more details in Section 2.2.2 of our paper) can be enabled using the --num-layers-per-virtual-pipeline-stage argument, which controls the number of transformer layers in a virtual stage (by default with the non-interleaved schedule, each GPU will execute a single virtual stage with NUM_LAYERS / PIPELINE_MP_SIZE transformer layers). The total number of layers in the transformer model should be divisible by this argument value. Additionally, the number of microbatches in the pipeline (computed as GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)) should be divisible by the PIPELINE_MP_SIZE when using this schedule (this condition is checked in an assertion in the code). The interleaved schedule is not supported for pipelines with 2 stages (PIPELINE_MP_SIZE=2).
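
As a sketch of these divisibility constraints with hypothetical values (chosen for illustration only, not taken from the scripts above):

NUM_LAYERS=24
PIPELINE_MP_SIZE=4
VIRTUAL_STAGE_LAYERS=3    # passed as --num-layers-per-virtual-pipeline-stage
GLOBAL_BATCH_SIZE=512
MICRO_BATCH_SIZE=4
DATA_PARALLEL_SIZE=8

# Each pipeline stage holds NUM_LAYERS / PIPELINE_MP_SIZE = 6 layers,
# split into 6 / 3 = 2 virtual stages per GPU.
NUM_MICROBATCHES=$(( GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE) ))    # = 16
# 16 is divisible by PIPELINE_MP_SIZE (4), so the schedule's assertion is satisfied.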

GPT-3 Example

In examples/pretrain_gpt3_175B.sh we have provided an example of how to configure Megatron to run GPT-3 with 175 billion parameters on 1024 GPUs. The script is designed for Slurm with the pyxis plugin but can be easily adapted to any other scheduler. It uses 8-way tensor parallelism and 16-way pipeline parallelism. With the options global-batch-size 1536 and rampup-batch-size 16 16 5859375, training will start with a global batch size of 16 and linearly increase the global batch size to 1536 over 5,859,375 samples in increments of 16. The training dataset can be either a single dataset or multiple datasets combined with a set of weights.
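
A sketch of the corresponding parallelism and batch-size arguments described above (the remaining model, data, and launcher arguments follow the distributed examples earlier in this README):

       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 16 \
       --global-batch-size 1536 \
       --rampup-batch-size 16 16 5859375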

With the full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds, resulting in 138 teraFLOPs per GPU, which is 44% of the theoretical peak FLOPs.

Evaluation and Tasks

We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the --finetune flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the --finetune flag before continuing, otherwise the training will start again from the beginning.
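
For example, a minimal sketch of finetuning the downloaded 345M BERT checkpoint on a new corpus might adjust the BERT pretraining invocation as follows; the checkpoint and data paths here are placeholders:

python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --finetune \
       --load checkpoints/bert_345m \
       --save checkpoints/bert_345m_finetuned \
       --data-path my-new-corpus_text_sentence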

Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on the input and only pipeline model parallelism on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.

TENSOR_MODEL_PARALLEL_SIZE=2
TARGET_PIPELINE_MODEL_PARALLEL_SIZE=2

VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m

WORLD_SIZE=$TENSOR_MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
        --model-type BERT \
        --tensor-model-parallel-size $TENSOR_MODEL_PARALLEL_SIZE \
        --pipeline-model-parallel-size 1 \
        --target-pipeline-model-parallel-size $TARGET_PIPELINE_MODEL_PARALLEL_SIZE \
        --tokenizer-type BertWordPieceLowerCase \
        --vocab-file $VOCAB_FILE \
        --num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 512 \
        --max-position-embeddings 512 \
        --load $CHECKPOINT_PATH \
        --save $CHECKPOINT_PATH/merged

Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.

GPT Text Generation

bash examples/generate_text.sh

We generate text samples using largely the GPT pretraining script. A few changes are needed, such as providing the path to the pretrained checkpoint, the length of the output samples, and whether to generate text unconditionally (--num-samples denotes how many samples to generate) or conditionally (pass --sample-input-file, where each line of the file is used as the conditioning text). There are a few optional parameters to experiment with, e.g., top-k, top-p, or greedy (set top-k and top-p to 0) sampling.

CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
GPT_ARGS=<same as those in GPT pretraining above>

MAX_OUTPUT_SEQUENCE_LENGTH=1024
TEMPERATURE=1.0
TOP_P=0.9
NUMBER_OF_SAMPLES=2
OUTPUT_FILE=samples.json

python tools/generate_samples_gpt.py \
       $GPT_ARGS \
       --load $CHECKPOINT_PATH \
       --out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
       --temperature $TEMPERATURE \
       --genfile $OUTPUT_FILE \
       --num-samples $NUMBER_OF_SAMPLES \
       --top_p $TOP_P \
       --recompute

GPT Evaluation

We include example scripts for GPT evaluation on WikiText perplexity evaluation and LAMBADA Cloze accuracy.

WikiText Perplexity Evaluation

For a fair comparison with prior work, we evaluate perplexity on the word-level WikiText-103 test dataset, and appropriately compute perplexity given the change in token count introduced by our subword tokenizer.
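
Concretely, the adjustment scales the average per-token loss by the ratio of subword tokens to original word-level tokens before exponentiating; a sketch of the calculation (variable names are ours) is:

    ppl          = exp(average loss per tokenized token)
    adjusted ppl = exp(average loss per tokenized token * num tokenized tokens / num original tokens)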

We use the following command to run WikiText-103 evaluation on a 345M parameter model.

TASK="WIKITEXT103"

VALID_DATA=<wikitext path>.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m

COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 1024 \
                  --max-position-embeddings 1024 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --checkpoint-activations \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

LAMBADA Cloze Accuracy

To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the LAMBADA dataset.

We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the --strict-lambada flag should be used to require whole-word matching. Make sure that lambada is part of the file path.

TASK="LAMBADA"

VALID_DATA=<lambada path>.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS=<same as those in WikiText Perplexity Evaluation above>

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --strict-lambada \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --checkpoint-activations \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

Further command line arguments are described in the source file main.py.

BERT Task Evaluation

RACE Evaluation

The following script finetunes the BERT model for evaluation on the RACE dataset. The TRAIN_DATA and VALID_DATA directories contain the RACE dataset as separate .txt files. Note that for RACE, the batch size is the number of RACE queries to evaluate. Since each RACE query has four samples, the effective batch size passed through the model will be four times the batch size specified on the command line.
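
For example, with the --micro-batch-size 4 used in the script below:

    effective batch size = 4 queries x 4 samples per query = 16 samples per micro-batch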

TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
            data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 512 \
                  --max-position-embeddings 512 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
                      --valid-data $VALID_DATA \
                      --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
                      --checkpoint-activations \
                      --save-interval 10000 \
                      --save $CHECKPOINT_PATH \
                      --log-interval 100 \
                      --eval-interval 1000 \
                      --eval-iters 10 \
                      --weight-decay 1.0e-1"

python tasks/main.py \
       --task RACE \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 3 \
       --micro-batch-size 4 \
       --lr 1.0e-5 \
       --lr-warmup-fraction 0.06

MNLI Evaluation

The following script finetunes the BERT model for evaluation with the MultiNLI sentence pair corpus. Because the matching tasks are quite similar, the script can be quickly tweaked to work with the Quora Question Pairs (QQP) dataset as well.

TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
            data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
COMMON_TASK_ARGS=<same as those in RACE Evaluation above>
COMMON_TASK_ARGS_EXT=<same as those in RACE Evaluation above>

python tasks/main.py \
       --task MNLI \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 5 \
       --micro-batch-size 8 \
       --lr 5.0e-5 \
       --lr-warmup-fraction 0.065
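
As noted above, adapting this to QQP mainly amounts to swapping the task name and GLUE data paths. The sketch below is a hypothetical variant: it assumes a QQP task handler and the standard GLUE file layout, carries over the MNLI hyperparameters purely for illustration, and requires TRAIN_DATA and VALID_DATA to be redefined before COMMON_TASK_ARGS_EXT is constructed.

TRAIN_DATA="data/glue_data/QQP/train.tsv"
VALID_DATA="data/glue_data/QQP/dev.tsv"

python tasks/main.py \
       --task QQP \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 5 \
       --micro-batch-size 8 \
       --lr 5.0e-5 \
       --lr-warmup-fraction 0.065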

Datasets

We do not host any datasets for GPT or BERT training; however, we detail their collection so that our results may be reproduced.

Collecting Wikipedia Training Data

We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text."

We recommend using the --json argument when using WikiExtractor, which will dump the Wikipedia data into loose JSON format (one JSON object per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this JSON dataset with NLTK punctuation standardization. For BERT training, pass the --split-sentences flag to preprocess_data.py as described above to include sentence breaks in the produced index. If you'd like to use Wikipedia data for GPT training you should still clean it with nltk/spacy/ftfy, but do not use the --split-sentences flag.
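
A hypothetical WikiExtractor invocation along these lines is shown below; the dump filename and output directory are placeholders, and the exact flags may differ between WikiExtractor versions:

python WikiExtractor.py --json enwiki-latest-pages-articles.xml.bz2 -o extracted_wiki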

Collecting GPT Webtext Data

We utilize the publicly available OpenWebText library from jcpeterson and eukaryote31's work to download URLs. We then filter, clean, and deduplicate all downloaded content according to the procedure described in our openwebtext directory. For Reddit URLs corresponding to content up to October 2018 we arrived at approximately 37GB of content.

Issues
  • Error in fused softmax kernel result

    Problem ?

    The result of the fused softmax layer is different from the result of the original torch softmax layer.

    How to reproduce ?

    import math
    
    import torch
    from torch.nn import Softmax
    from transformers import BertTokenizer
    from transformers.models.bert.modeling_bert import BertModel
    from fused import FusedScaleMaskSoftmax
    from fused import AttnMaskType
    
    def load_fused_kernels():
        try:
            import fused_mix_prec_layer_norm_cuda
            import scaled_masked_softmax_cuda
            import scaled_upper_triang_masked_softmax_cuda
            import torch
    
            print("[Success] load_fused_kernels")
        except ImportError as e:
            print("[Fail] load_fused_kernels")
            raise e
    
    
    def attention_mask_func(attention_scores, attention_mask):
        attention_scores.masked_fill_(attention_mask, -10000.0)
        return attention_scores
    
    
    def test_softmax():
        bert = BertModel.from_pretrained("bert-base-cased").cuda().half()
        tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    
        # len_query=24, batch_per_block=8 (in my setting)
        tokens = tokenizer(
            [
                "Hello. How are you? I am fine thank you and you? yes Good. hi hello hello hello hello"
            ]
            * 4,
            return_tensors="pt",
        )
    
        embedding_output = bert.embeddings(
            input_ids=tokens["input_ids"].cuda(),
            position_ids=None,
            token_type_ids=tokens["token_type_ids"].cuda(),
            inputs_embeds=None,
            past_key_values_length=0,
        )
    
        # (bsz, 1, 1, seq_len), all values are 0.
        mask = bert.get_extended_attention_mask(
            attention_mask=tokens["attention_mask"].cuda(),
            input_shape=tokens["input_ids"].shape,
            device=bert.device,
        )
        # (bsz, 1, seq_len, seq_len)
        mask = mask.repeat(1, 1, mask.size()[-1], 1)
    
        attention = bert.encoder.layer[0].attention.self
        query_proj = attention.query
        key_proj = attention.key
        value_proj = attention.value
    
        key_layer = attention.transpose_for_scores(key_proj(embedding_output))
        value_layer = attention.transpose_for_scores(value_proj(embedding_output))
        query_layer = attention.transpose_for_scores(query_proj(embedding_output))
    
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores /= math.sqrt(key_layer.size()[-1])
    
        fused_softmax = FusedScaleMaskSoftmax(
            mask_func=attention_mask_func,
            attn_mask_type=AttnMaskType.padding,
            input_in_fp16=True,
            input_in_bf16=False,
            scale=None,
            softmax_in_fp32=False,
            scaled_masked_softmax_fusion=True,
        )
    
        fused_softmax_output = fused_softmax(
            attention_scores,
            (mask != 0),
        )
    
        torch_softmax = FusedScaleMaskSoftmax(
            mask_func=attention_mask_func,
            attn_mask_type=AttnMaskType.padding,
            input_in_fp16=True,
            input_in_bf16=False,
            scale=None,
            softmax_in_fp32=False,
            scaled_masked_softmax_fusion=False,
        )
    
        torch_softmax_output = torch_softmax(
            attention_scores,
            (mask != 0),
        )
    
        print("fused (turn on fusion):", fused_softmax_output[0][0][0])
        print("\n")
        print("fused (turn off fusion):", torch_softmax_output[0][0][0])
    
        torch_softmax = torch.nn.Softmax(dim=-1)
        torch_softmax_output = torch_softmax(attention_scores)
    
        print("\n")
        print("torch softmax", torch_softmax_output[0][0][0])
    
    
    if __name__ == "__main__":
        load_fused_kernels()
        test_softmax()
    
    opened by hyunwoongko 22
  • Compatibility with pytorch-transformers for fine-tuning

    Hi,

    Thanks for the great package! I wanted to check about the compatibility of the trained GPT-2 model/tokenizer with the pytorch-transformers package. Is it possible that, with a few changes, the trained model can be imported using that package, in order to perform additional fine-tuning there with different heads for example? I understand that there are some config files expected by that package, so I'm assuming these can be added. But I'm interested in knowing about the compatibility of the model/tokenizer mainly.

    Thanks!

    opened by harkous 6
  • perplexity too big for gpt2 wikitext evaluation

    When running the wikitext evaluation of gpt2

    python evaluate_gpt2.py \
        --valid-data wikitext-103-v1/wiki.test.tokens \
        --load-openai \
        --hidden-size 768 \
        --vocab-size 50257 \
        --tokenizer-type GPT2BPETokenizer \
        --max-position-embeddings 1024
    

    the resulting perplexity is 2.9290E+02 -- why is the value so extremely high?

    Here is the console output with logging level DEBUG:

    Evaluate GPT2 model
    WARNING: No training data specified
    using world size: 1 and model-parallel size: 1 
     > using dynamic loss scaling
    > initializing model parallel with size 1
    > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
    DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-vocab.json HTTP/1.1" 200 0
    DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
    DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-merges.txt HTTP/1.1" 200 0
    INFO:data_utils.tokenization_gpt2:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /braintree/home/msch/.pytorch_pretrained_bert/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
    INFO:data_utils.tokenization_gpt2:loading merges file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /braintree/home/msch/.pytorch_pretrained_bert/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
    wikitext
    Original Tokens: 270330, Detokenized tokens: 245566
    > padded vocab (size: 50257) with 0 dummy tokens (new size: 50257)
    global rank: 0 | vocab size: 50257 | eod token: 50256 | num_examples: 8448 | num_original_tokens: 245566 | num_tokenized_tokens: 270330
    building GPT2 model ...
     > number of parameters: 209494272
    loading openai weights
    DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
    DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-pytorch_model.bin HTTP/1.1" 200 0
    DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
    DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-config.json HTTP/1.1" 200 0
    INFO:pytorch_pretrained_bert.modeling_gpt2:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin from cache at gpt2_weights/4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1
    INFO:pytorch_pretrained_bert.modeling_gpt2:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at gpt2_weights/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d272f80
    INFO:pytorch_pretrained_bert.modeling_gpt2:Model config {
      "initializer_range": 0.02,
      "layer_norm_epsilon": 1e-05,
      "n_ctx": 1024,
      "n_embd": 768,
      "n_head": 12,
      "n_layer": 12,
      "n_positions": 1024,
      "vocab_size": 50257
    }
    
    global rank: 0 | max iters: 2112
    global rank: 0 | iteration: 0
    global rank: 0 | iteration: 100
    ...
    global rank: 0 | iteration: 1900
    global rank: 0 | iteration: 2000
    global rank: 0 | iteration: 2100
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------------------------------------------------------
     validation results on wiki | avg loss: 5.6798E+00 | ppl: 2.9290E+02 | adjusted ppl: 5.1937E+02 | token ratio: 1.1008449901248143 |
    ------------------------------------------------------------------------------------------------------------------------------------
    
    opened by mschrimpf 5
  • Unintended error caused by compiling fused_kernels

    https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L186-L198 https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/fused_kernels/init.py#L46-L72

    When I tried to train GPT-3 on multiple nodes using torch.distributed.launch, sometimes the training process got stuck while compiling the fused_kernels. This bug can occur due to a timing issue when multiple processes compile concurrently. The simplest workaround is to remove ./fused_kernels/build and run the script again, but I thought that does not solve the fundamental problem.

    In my case, I resolved this issue by using torch.distributed.barrier and letting only the master rank (rank == 0) compile the fused_kernels. If the authors think resolving this issue is necessary for the codebase, then I will leave a PR :)

    opened by wade3han 4
  • AttributeError: 'Namespace' object has no attribute 'model_parallel_size'

    When I run the preprocess_data.py file, it shows the error 'Namespace' object has no attribute 'model_parallel_size'

    !python Megatron-LM/tools/preprocess_data.py \
    --input 'manifest_file.json' \
    --output-prefix 'my_t5' \
    --vocab 'vocab/vocab.txt' \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --workers 1 \
    --split-sentences

    Error: Opening manifest_file.json

    building BertWordPieceLowerCase tokenizer ...
    Traceback (most recent call last):
      File "Megatron-LM/tools/preprocess_data.py", line 203, in <module>
        main()
      File "Megatron-LM/tools/preprocess_data.py", line 155, in main
        tokenizer = build_tokenizer(args)
      File "/opt/conda/lib/python3.6/site-packages/megatron/tokenizer/tokenizer.py", line 48, in build_tokenizer
        args)
      File "/opt/conda/lib/python3.6/site-packages/megatron/tokenizer/tokenizer.py", line 59, in _vocab_size_with_padding
        args.model_parallel_size
    AttributeError: 'Namespace' object has no attribute 'model_parallel_size'

    opened by abdul756 4
  • checkpoint wget download doesn't work

    FYI, the instructions at https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints lead to 0-sized files. e.g.,

    wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
    --2021-05-05 11:42:01--  https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip
    Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 13.57.84.77, 13.52.19.24
    Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|13.57.84.77|:443... connected.
    HTTP request sent, awaiting response... 200 
    Length: unspecified
    Saving to: ‘megatron_lm_345m_v0.0.zip’
    
    megatron_lm_345m_v0.0.zip                    [ <=>                                                                            ]       0  --.-KB/s    in 0s      
    
    2021-05-05 11:42:01 (0.00 B/s) - ‘megatron_lm_345m_v0.0.zip’ saved [0]
    

    I was able to download the files manually via https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m/version

    opened by stas00 4
  • Reproduce 71.9 TFlops throughput

    Hi,

    I wonder what batch size per GPU was used for the 71.9 TFlops model-parallel scaling benchmark, if I want to reproduce it?

    opened by eric-haibin-lin 3
  • Can we get some samples?

    Hi!

    Out of interest in GPT-2 and the Megatron LM, can we get an idea of what the code outputs? I.e., some output samples of what the tool actually does, instead of having to run it just to see what it can do.

    opened by bladedsupernova 3
  • Can't find scaled_masked_softmax.cpp

    I want to run bash examples/generate_text.sh, but an error occurs:

    Traceback (most recent call last):
      File "tools/generate_samples_gpt.py", line 116, in <module>
        main()
      File "tools/generate_samples_gpt.py", line 94, in main
        initialize_megatron(extra_args_provider=add_text_generate_args,
      File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/initialize.py", line 48, in initialize_megatron
        set_global_variables(extra_args_provider=extra_args_provider,
      File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/global_vars.py", line 82, in set_global_variables
        args = _parse_args(extra_args_provider=extra_args_provider,
      File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/global_vars.py", line 97, in _parse_args
        _GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
      File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/arguments.py", line 190, in parse_args
        fused_kernels.load_scaled_masked_softmax_fusion_kernel()
      File "/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/fused_kernels/__init__.py", line 88, in load_scaled_masked_softmax_fusion_kernel
        scaled_upper_triang_masked_softmax_cuda = cpp_extension.load(
      File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
        return _jit_compile(
      File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1262, in _jit_compile
        version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
      File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
        hash_value = hash_source_files(hash_value, source_files)
      File "/opt/conda/envs/as/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
        with open(filename) as file:
    FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/fused_kernels/scaled_masked_softmax.cpp'
    

    It seems that the program can't find scaled_masked_softmax.cpp. I have the directory /opt/conda/envs/as/lib/python3.8/site-packages/megatron_lm-1.1.5-py3.8.egg/megatron/fused_kernels/, but it only contains __init__.py, __pycache__, and build.

    I am not sure that I've set up Megatron well. I have run python setup.py install in the cloned repo folder. When checking with conda list, it shows megatron-lm 1.1.5 pypi_0 pypi.

    opened by Co1lin 3
  • Improve and fix bugs about fused softmax layer

    1. Fix bugs about ELEMENTS_PER_LDG_STG (reported in https://github.com/NVIDIA/Megatron-LM/issues/132)
    2. Add test codes for all fused cuda kernel using huggingface transformers
    3. Add constraint about 0 <= length_key <= 2048 (originally it was in the header file as TORCH_INTERNAL_ASSERT)
    4. Add constraint about batch_per_block (originally it was in the header file as TORCH_INTERNAL_ASSERT)
    5. Refactor python fused scale mask softmax layer codes
    opened by hyunwoongko 3
  • There is a difference in the calculation of num_warmup_microbatches

    In interleaved-1F1B:

    https://github.com/NVIDIA/Megatron-LM/blob/b31e1296354e979722627a6c4dedafe19b51fa97/megatron/schedules.py#L222-L223

    but in 1F1B:

    https://github.com/NVIDIA/Megatron-LM/blob/b31e1296354e979722627a6c4dedafe19b51fa97/megatron/schedules.py#L531-L533

    what is the purpose of this diff?

    opened by unlimblue 4
  • Update README.md

    Remove duplicated bulletpoint

    opened by kvtoraman 0
  • torch.cuda.synchronize() might be unnecessary in p2p_communication.py

    Hi. I notice that there is an explicit CUDA device synchronization to avoid a race condition in p2p_communication.py.

            if len(ops) > 0:
                reqs = torch.distributed.batch_isend_irecv(ops)
                for req in reqs:
                    req.wait()
        # To protect against race condition when using batch_isend_irecv().
        torch.cuda.synchronize()
    

    However, I think the synchronization here is not needed. This is because req.wait() will block the default stream (i.e., the compute stream) until the communication operations on the NCCL stream finishes. Refer to this upstream issue https://github.com/pytorch/pytorch/issues/68112 and the related code for details.

    opened by vycezhong 1
  • <Signals.SIGSEGV: 11> occurred in multi-node pretraining

    Hi, Dear! I'm trying to pretrain BERT on two nodes. I used the docker image pulled from nvcr.io/nvidia/pytorch:20.12-py3 as my environment. I have managed to run pretraining on one node with a single T4 card, but when I try to train the same model on two nodes, each with a single T4 card, I always get a <Signals.SIGSEGV: 11> error.

    But if I set nnodes to one, everything goes well. Can anyone give me some guidance? Thanks very much.

    opened by xiongjun19 0
  • LM for long sequence (e.g. - BigBird) support into Megatron-LM

    I am exploring possible ways to add BigBird support to the current Megatron-LM. I would like to work on it. Let me know whether this sounds good or not.

    opened by tanmoyio 0
  • Why is it 3us?

    https://github.com/NVIDIA/Megatron-LM/blob/b31e1296354e979722627a6c4dedafe19b51fa97/megatron/mpu/layers.py#L226

    Mentioned in the comment above:

    Delay the start of weight gradient computation shortly (3us) to have all-reduce scheduled first and have GPU resources allocated

    , but I am confused about some details:

    1. Why should we wait for the all-reduce to be scheduled and GPU resources to be allocated?
    2. Why is it 3us and not some other value?
    3. What is the purpose of the +1 after torch.empty?
    opened by unlimblue 0
  • AttributeError: 'IndexedDataset' object has no attribute 'get_doc_idx’

    I am preparing the data for GPT training according to the Megatron-LM project's (tag 1.1) README.md at path "Megatron-LM/tools/openwebtext/tools". I ran preprocess_data.py with: python3 tools/preprocess_data.py --input data/corpus/test.json --output-prefix my-bert --vocab data/tokenizer/bert/bert-large-uncased-vocab.txt --tokenizer-type BertWordPieceLowerCase --split-sentences --dataset-impl cached

    I use the “cached” type of --dataset-impl, but when I run

    python3 pretrain_bert.py \
    --num-layers 12 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --batch-size 4 \
    --seq-length 512 \
    --max-position-embeddings 512 \
    --train-iters 2000000 \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --vocab-file data/tokenizer/bert/bert-large-uncased-vocab.txt \
    --data-impl cached \
    --split 949,50,1 \
    --distributed-backend nccl \
    --lr 0.0001 \
    --min-lr 0.00001 \
    --lr-decay-style linear \
    --lr-decay-iters 990000 \
    --weight-decay 1e-2 \
    --clip-grad 1.0 \
    --warmup .01 \
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 10 \
    --fp16

    I encountered the error below: AttributeError: 'IndexedDataset' object has no attribute 'get_doc_idx'. How can I fix this error? Can someone give me some tips? Thanks!

    opened by whulxl 0
  • ImportError: cannot import name 'Tokenizer' from 'tokenizer'

    I am preparing the data for GPT training according to the Megatron-LM project's README.md at path "Megatron-LM/tools/openwebtext/tools". When I run cleanup_dataset.py, whose code begins with:

    import sys

    from tokenizer import Tokenizer

    MIN_DOCUMENT_LENGHT = 128

    I encountered the error: ImportError: cannot import name 'Tokenizer' from 'tokenizer'. Where can I find the tokenizer lib? Can someone give me some tips? Thanks!

    opened by Neleon 0
  • How to do punctuation standardization?

    I'm wondering how to do the standardization of the corpus extracted from Wikipedia?

    opened by xiongjun19 0