Guide to using pre-trained large language models of source code

Overview

Large Models of Source Code

I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe how to use these.

  1. Setup
  2. Models (incl. PolyCoder)
  3. Datasets
  4. Evaluation
  5. How to cite

Getting Started

All current models were trained using the GPT-NeoX toolkit. First, download a pretrained checkpoint as described below, then use it either via the provided Docker image or from source through our fork of this toolkit to generate code or replicate our evaluation.

Retrieving Checkpoints

Checkpoint files for training PolyCoder are hosted on this public Zenodo repository. See this section for details on currently available models. Model checkpoints range up to 6GB, which is also the amount of GPU memory they require to run (running on CPU is neither tested nor recommended). Download and untar a checkpoint file (in this case for a 2.7B parameter model trained for 150K steps) to a directory called checkpoints/, using:

mkdir checkpoints
cd checkpoints
wget https://zenodo.org/record/6363556/files/2-7B-150K.tar
tar -xvf 2-7B-150K.tar

From Source

We maintain a public fork of the NeoX repository here, which includes the (minor) changes we made to the codebase to allow for tabs & newlines in the tokenization, and also includes instructions for running the perplexity and HumanEval tasks. Note that this repository uses a forked version of the LM Evaluation Harness with the code benchmark from our work.

Building this repository should match the process for GPT-NeoX almost exactly. You may also use the Docker image mentioned next, mounting a checkout of the latest version of this fork over the /gpt-neox directory inside the container. Once set up, use the generate.py entrypoint (described below) for free-form code generation, or use one of the commands here to calculate perplexity and HumanEval results as in the paper.
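
For example, with a local checkout of the fork in ./gpt-neox and the checkpoints in ./checkpoints, a command along the following lines (the paths are illustrative) mounts the checkout over the container's copy:

nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD/gpt-neox,dst=/gpt-neox --mount type=bind,src=$PWD/checkpoints,dst=/gpt-neox/checkpoints vhellendoorn/code-lms-neox:base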

Via Docker

A base Docker image containing a slightly modified version of the gpt-neox repository is available via DockerHub:

docker pull vhellendoorn/code-lms-neox:base

This image can be used together with a checkpoint file hosted on this public Zenodo repository. The base Docker image size is 5.4GB. Once a checkpoint has been retrieved, start the container with the following command (substituting another GPU device index if needed):

nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD/checkpoints,dst=/gpt-neox/checkpoints vhellendoorn/code-lms-neox:base

Code Generation

The following command can be used to generate code from a prompt:

sudo ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/2-7B.yml

Note: if not using the 2.7B parameter model, replace the final config file with the appropriate model size (e.g., small = 160M parameters, medium = 405M).
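
For example, to generate with the 405M parameter model:

sudo ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/medium.yml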

Once the checkpoint has been loaded, you can feed it an example such as def return1():\n """Returns 1."""\n (note the whitespace tokens) and watch it predict return 1 (and then probably a bunch of other returnX methods, depending on the sample).

The modifications to gpt-neox mentioned above center on the need to allow tabs and newlines in the prompt input. In interactive mode, these can be added using their escaped versions (\t, \n); when using file-based input, the project will read the entire file instead of treating each line as a prompt. By default, the command above will create an interactive prompt and return relatively short outputs (256 tokens) with a sampling temperature of 0.5; this behavior can be changed in /gpt-neox/checkpoints/configs/text_generation.yml.
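
For reference, the generation-related settings in that file look roughly as follows (the values shown are the defaults described above):

{
  # Text gen type: `input-file`, `unconditional` or `interactive`
  "text-gen-type": "interactive",

  "maximum_tokens": 256,
  "temperature": 0.5,
  "top_p": 0.0,
  "top_k": 0,

  # Only used when "text-gen-type" is "input-file"
  "sample-input-file": "sample_input.txt",
  "sample-output-file": "sample_output.txt",
}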

A lower temperature (e.g., 0.2) produces more consistent and plausible (to the model) predictions; a higher temperature such as the default may be useful for generating and evaluating many candidates (see our paper for recommendations). For the latter setting, consider switching to the input-file mode and providing an entire snippet (without escaping whitespace) in the corresponding file.
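
When staying in interactive mode, a small helper along these lines (hypothetical, not part of the repository) turns a multi-line snippet into the escaped single-line form the prompt expects:

# Escape tabs and newlines so a multi-line snippet can be pasted into the interactive prompt.
def escape_prompt(snippet: str) -> str:
    return snippet.replace("\t", "\\t").replace("\n", "\\n")

print(escape_prompt('def return1():\n    """Returns 1."""\n'))
# -> def return1():\n    """Returns 1."""\n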

Multi-lingual Models

Several models have been trained on a large corpus of code spanning 12 programming languages. This includes a 2.7B parameter model (nick-named PolyCoder, trained for 100K and 150K steps), a 405M parameter model (100K & 150K steps) and a 160M parameter model (150K steps).

Available Models

All models are available at a public Zenodo repository, in the form of .tar files with fairly self-explanatory names (e.g., 2-7B-100K => a 2.7B parameter model trained for 100K steps). Currently available models include:

  • GPT2 - 2.7B: A 32 layer, 2,560 dimensional Transformer model, trained with a batch size of 128 sequences (256K tokens). Models are available at both 100K and 150K steps.
    • Note that GPT-NeoX's default config for this model was modified to reduce the number of training steps (and learning rate decay steps accordingly) to 160K, down from 320K, to better match the available training resources. Hence, this model may not have reached its peak performance.
  • GPT2 - 0.4B: A 24 layer, 1,024 dimensional Transformer model based on the medium config, trained with 256K tokens per batch.
  • GPT2 - 160M: A 12 layer, 768 dimensional Transformer model based on the small config, trained with 256K tokens per batch.
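
As a rough sanity check on these names, the parameter counts follow from the layer count, hidden dimension and padded vocabulary (50,304 tokens). The sketch below is a back-of-the-envelope estimate that assumes untied input/output embeddings, not an exact count:

def approx_params(layers, hidden, vocab=50304):
    # ~12 * hidden^2 parameters per Transformer block (attention + MLP),
    # plus separate input-embedding and output-projection matrices.
    return 12 * layers * hidden ** 2 + 2 * vocab * hidden

print(approx_params(32, 2560) / 1e9)  # ~2.77 -> the 2.7B model
print(approx_params(24, 1024) / 1e6)  # ~405  -> the 0.4B model
print(approx_params(12, 768) / 1e6)   # ~162  -> the 160M model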

Training Process

Training was done on 4 to 8 NVIDIA RTX 8000 GPUs, largely following the standard config values, except also enabling "scaled-upper-triang-masked-softmax-fusion" and "bias-gelu-fusion" for performance and slightly changing the batch size (see model details), data split (changed to 98.9%, 0.1%, 1%), initial loss scale (2^16), and print/eval intervals.

The image below shows the validation loss of the various models over the course of training.

Caveats

The trained models come with a few minor known limitations:

  • This model was not trained to solve programming problems and may not perform well on a benchmark such as HumanEval. Models like Codex (powering Copilot) are pretrained on natural language, which may boost their ability to interpret NL prompts; this model only learned language from comments in code.
  • The model appears to start generating a random new file once it reaches the (predicted) end of the current one. It is possible that the end-of-document token was not properly added to the training data.
  • Whitespace is very important to the model, since no preprocessing was done on the input files. For instance, the following snippet will yield poor predictions, because in Java we would never expect an instance-method at the top-level, as is indicated by the single level of (\t) indentation of the two lines within this method:
public int getTotalWeight(List<Integer> weights) {\n\t// Sum weights in parallel.\n\treturn 

Adjusting the indentation makes it predict more reasonable continuations:

public int getTotalWeight(List<Integer> weights) {\n\t\t// Sum weights in parallel.\n\t\treturn 

The Codex paper discusses controlling for this to increase usability; this may be worth doing in a future version of the model.

Datasets

249GB Multi-Lingual Corpus

This is the corpus used to train PolyCoder.

The repositories were cloned overnight on October 9-10, 2021. To mine a similar training set, see Data.

The list of file paths can be downloaded from: https://zenodo.org/record/6363556/files/index.zip. Each row in the file is the file path along with its SHA-256 hash, to ease deduplication. That is, the hashes allow checking if files from any future test set were already contained in the training set.
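
For example, a short script along these lines (hypothetical, and assuming the hash is taken over the raw file contents) loads the index and checks whether a local file already appears in the training set:

import hashlib

# Each line of index.txt is "<file path>\t<SHA-256 hash>".
train_hashes = set()
with open("index.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:  # skip the few malformed lines
            train_hashes.add(parts[1])

def in_training_set(path):
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    return digest in train_hashes

print(in_training_set("Main.java"))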

The data collection and filtering process is described in detail in the paper and below. The final, filtered dataset statistics are:

Language Repositories Size(GB) Files
C 10,749 55G 3,037,112
C# 9,511 21G 2,514,494
C++ 13,726 52G 4,289,506
Go 12,371 15G 1,416,789
Java 15,044 41G 5,120,129
JavaScript 25,144 22G 1,774,174
PHP 9,960 13G 1,714,058
Python 25,446 16G 1,550,208
Ruby 5,826 4.1G 674,343
Rust 4,991 3.5G 304,842
Scala 1,497 1.8G 245,100
TypeScript 12,830 9.2G 1,441,926

Data Collection & Filtering

I cloned the most popular repositories (those with at least 50 stars, stopping at ~25K repositories per language) for 12 popular programming languages from GitHub in October 2021. For each project, each file belonging to the majority-language of that project was extracted, yielding the training set above (after cleaning). This initial, unfiltered dataset spanned 631GB and 38.9M files.
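
For illustration only, listing candidate repositories for one language via the GitHub search API could look like the sketch below. The actual collection used cloning scripts, and the search API caps each query at 1,000 results (and rate-limits unauthenticated requests), so reaching ~25K repositories per language would require splitting queries, e.g., by star ranges:

import requests

def top_repos(language, min_stars=50, pages=10):
    # Query the GitHub search API for popular repositories of one language.
    repos = []
    for page in range(1, pages + 1):
        r = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"language:{language} stars:>={min_stars}",
                    "sort": "stars", "order": "desc",
                    "per_page": 100, "page": page},
        )
        r.raise_for_status()
        repos += [item["clone_url"] for item in r.json()["items"]]
    return repos

print(top_repos("Java")[:5])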

Next, similar to Codex and CodeParrot, very large (>1MB) and very short (<100 tokens) files were filtered out, reducing the dataset to 424GB. Files were then deduplicated based on a hash of their content, which reduced the number of files by another 30% or so, leaving 249GB of data and 24.1M files. No tokenization filters were applied; the model processes entire files including all comments. A code-specific vocabulary was constructed on a random 5% subset of the files above.
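
A condensed, hypothetical re-implementation of these filters is sketched below; note that the <100-token check is approximated here by whitespace-splitting rather than by the trained code vocabulary:

import hashlib, os

def keep(path, seen_hashes, max_bytes=1_000_000, min_tokens=100):
    # Drop very large files (>1MB).
    if os.path.getsize(path) > max_bytes:
        return False
    data = open(path, "rb").read()
    # Drop very short files (<100 tokens, approximated by whitespace-splitting).
    if len(data.decode("utf-8", errors="replace").split()) < min_tokens:
        return False
    # Deduplicate on a hash of the file contents.
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True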

Evaluation

Please find detailed instructions for replicating our perplexity and HumanEval results on our public fork of the NeoX repository. This in turn leverages our extension of the LM Evaluation Harness.

Evaluating Codex

To download the test sets that we used in the paper (12 programming languages), use:

wget https://zenodo.org/record/6363556/files/unseen_test_sets.tar.gz
tar -xvzf unseen_test_sets.tar.gz

To get perplexity results on these samples using Codex' API, use:

export OPENAI_API_KEY=<YOUR OPEN AI API KEY>
python3 -u Evaluation/eval_codex_all.py --dirs Code-sampled100

Where <YOUR OPEN AI API KEY> is a private string that can be obtained by signing up for OpenAI's beta.

As of March 2022, getting an API Key is free for 3 months, and afterwards a credit card needs to be entered. However, even after entering a credit card, using our evaluation script does not lead to any costs.
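
Conceptually, the script asks the API to echo each file with per-token log-probabilities and averages their negative values. A minimal sketch of that idea using the legacy openai Python client is shown below; the engine name is an assumption, and eval_codex_all.py remains the authoritative version:

import math
import openai

def codex_nll(code, engine="code-davinci-001"):  # engine name is an assumption
    # Score the prompt itself: echo it back with log-probabilities, generate nothing.
    resp = openai.Completion.create(
        engine=engine, prompt=code, max_tokens=0, echo=True, logprobs=0,
    )
    token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"][1:]  # first token has no score
    return -sum(token_logprobs) / len(token_logprobs)

print(math.exp(codex_nll('def add(a, b):\n    return a + b\n')))  # perplexity of a tiny snippet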

Results - HumanEval

These are PolyCoder's results on the HumanEval benchmark:

Model Pass@1 Pass@10 Pass@100
PolyCoder (160M) 2.13% 3.35% 4.88%
PolyCoder (400M) 2.96% 5.29% 11.59%
PolyCoder (2.7B) 5.59% 9.87% 17.68%
CodeParrot (110M) 3.80% 6.57% 12.78%
CodeParrot (1.5B) 3.58% 8.03% 14.96%
GPT-Neo (125M) 0.75% 1.88% 2.97%
GPT-Neo (1.3B) 4.79% 7.47% 16.30%
GPT-Neo (2.7B) 6.41% 11.27% 21.37%
GPT-J (6B) 11.62% 15.74% 27.74%
Codex (300M) 13.17% 20.37% 36.27%
Codex (2.5B) 21.36% 35.42% 59.50%
Codex (12B) 28.81% 46.81% 72.31%
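
For context, Pass@k numbers like these are typically computed with the unbiased estimator from the Codex paper: draw n samples per problem, count the c samples that pass the unit tests, and average 1 - C(n-c, k)/C(n, k) over problems. A minimal sketch:

from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of the probability that at least one of k samples is correct,
    # given that c out of n generated samples passed the tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=12, k=1))  # 0.06, i.e. 6% pass@1 for this problem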

Results - Multilingual Language Modeling

These are the perplexity results of PolyCoder on the multilingual test sets:

Language Perplexity
C 2.3464
C# 2.5832
C++ 2.9189
Go 2.567
Java 2.9194
JavaScript 3.0611
PHP 3.6954
Python 3.1767
Ruby 3.9742
Rust 3.2449
Scala 3.8735
TypeScript 3.6143
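
For interpretation: these are per-token perplexities, i.e. the exponential of the average negative log-likelihood per token, so lower is better. For example:

import math
mean_nll = 1.0716          # hypothetical average negative log-likelihood (nats) per token
print(math.exp(mean_nll))  # ~2.92, comparable to the Java row above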

A comparison with the other models is available in Figure 6 in the paper.

Citation

A Systematic Evaluation of Large Language Models of Code

@article{xu2022systematic,
  title={A Systematic Evaluation of Large Language Models of Code},
  author={Xu, Frank F and Alon, Uri and Neubig, Graham and Hellendoorn, Vincent J},
  journal={arXiv preprint arXiv:2202.13169},
  year={2022}
}
Comments
  • Convert GPT-NeoX to HuggingFace

    This PR includes a script named convert_neox_pt_to_huggingface_neox.py, which is used to convert a PolyCoder checkpoint trained by GPT-NeoX into HuggingFace format. A transformers.GPTNeoXConfig file that matches the checkpoint is also provided.

    I have checked with a 0.4B model, which was trained following the medium.yml config. The greedy-decoding outputs of GPT-NeoX's inference script and HF's generate() are identical.

    NOTE: The HuggingFace model type I used is "gpt_neox", but the model architecture needs to be adjusted. Please refer to this PR.

    opened by NinedayWang 9
  • Docker Image code generation fails

    Hello!

    While trying to run polycoder in a dockerized setup, we bumped into the error: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

    Could you help us to get over this problem?

    This is odd, because I guess the docker solution should have been error free.

    We're trying to run the 160M version on a Ubuntu 22.04 machine with a GTX 1060 6GB by using the provided command to start the container.

    Here is the full log: https://pastebin.com/JzayrXUr

    opened by Longman-Stan 9
  • Code generation fails if more than 1 token in input

    Hi, I encounter much the same problem as in #23, but in a slightly different GPU context. Like that reporter, I can generate for a single-token input, but it fails as soon as I add a character to the input.

    I had to use singularity instead of docker as I had no root access on this cluster. For this to work, I had to copy the /gpt-neox out of the image and bind-mount it in order to have it in read-write mode.

    I start it like that:

    srun --account=xxx --partition=xxx --gres=gpu:1 --time=0-2:00:00 --mem=70G --pty  bash
    [xxx@nodexx xxx]$ singularity shell --nv -B $PWD/gpt-neox:/gpt-neox -B $PWD/checkpoints-0-4B:/gpt-neox/checkpoints --writable-tmpfs code-lms-neox_base.sif 
    ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/medium.yml
    

    The full output, including the correct generation for the input "for" and the failure for "fora", is here: https://paste.libre-service.eu/?f9efce75b3836b4b#BNZq675ihp2LnSqRywMa2Y9EtwzCnu4tZqUwt9o57Q9L

    The GPU is a Quadro P5000. Here is the output of nvidia-smi:

    nvidia-smi
    Wed Jun 29 15:14:34 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro P5000        On   | 00000000:1B:00.0 Off |                  Off |
    | 52%   80C    P0   163W / 180W |   3133MiB / 16278MiB |    100%      Default |
    |                               |                      |                  N/A |
    

    The GPU memory is never above 2GB (on 16 available).

    I have no more ideas to try to solve this problem.

    (Note that I first tried to install from source but it failed when installing the requirements)

    opened by kleag 5
  • Hugging Face's Hub

    Hello! First I would like to congratulate you on the fantastic work done with PolyCoder and the paper.

    Second, I wanted to know whether there are plans to release PolyCoder on Hugging Face any time soon? I saw in #13 that this possibility was mentioned in March, but I couldn't find the model on the Hub.

    It is still complicated to run the model and adapt it to new scenarios, and making it available through the Hugging Face API would make everything easier.

    Thanks!

    opened by Otavio-Parraga 3
  • How to evaluate the perplexity of PolyCoder/CodeParrot?

    Hello,

    Thank you for the awesome PolyCoder project; I really think it helps a lot for evaluating all the code PTMs. However, I only found a script to evaluate the perplexity of Codex, so it's hard to reproduce the perplexity benchmark results comparing Codex, PolyCoder and other PTMs. Is it possible for you to add a PolyCoder/CodeParrot perplexity evaluation script?

    Thanks again, Sen

    opened by nforest 3
  • PlaidML support for (MacOS)

    I am currently on a Mac Pro with an AMD Radeon Pro 580X, but code generation doesn't complete due to this error.

    NeoXArgs.from_ymls() ['configs/text_generation.yml', 'configs/local_setup.yml', 'configs/2-7B.yml']
    INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 0
    Traceback (most recent call last):
      File "/Users/xoxrumblelorexox/Desktop/Workbash/polyai/checkpoints/checkpoints/./gpt-neox-main/deepy.py", line 28, in <module>
        neox_args = NeoXArgs.consume_deepy_args()
      File "/Users/xoxrumblelorexox/Desktop/Workbash/polyai/checkpoints/checkpoints/gpt-neox-main/megatron/neox_arguments/arguments.py", line 321, in consume_deepy_args
        neox_args = cls.from_ymls(
      File "/Users/xoxrumblelorexox/Desktop/Workbash/polyai/checkpoints/checkpoints/gpt-neox-main/megatron/neox_arguments/arguments.py", line 201, in from_ymls
        return cls(**config)
      File "<string>", line 186, in __init__
      File "/Users/xoxrumblelorexox/Desktop/Workbash/polyai/checkpoints/checkpoints/gpt-neox-main/megatron/neox_arguments/arguments.py", line 106, in __post_init__
        self.calculate_derived()
      File "/Users/xoxrumblelorexox/Desktop/Workbash/polyai/checkpoints/checkpoints/gpt-neox-main/megatron/neox_arguments/arguments.py", line 675, in calculate_derived
        self.check_batch_parameters(
      File "/Users/xoxrumblelorexox/Desktop/Workbash/polyai/checkpoints/checkpoints/gpt-neox-main/megatron/neox_arguments/arguments.py", line 589, in check_batch_parameters
        assert (
    AssertionError: Train batch size: 0 has to be greater than 0
    

    After a while I realised that TensorFlow doesn't recognise non-NVIDIA GPUs. There should be a disclaimer about this on the page. The only workaround I found was to use PlaidML, but I am still getting the same error (probably because PlaidML only supports Keras).

    Sorry, if already known.

    opened by XoxRumbleLorexoX 2
  • Dataset index.txt file contains some corrupted entries

    Hi,

    Thanks for open sourcing the dataset and PolyCoder.

    I'm looking at the dataset: https://zenodo.org/record/6363556/files/index.zip , From the README, it seems that each line in the index.txt should be in the form of {language}__{organization}__{project}__{full__file__path}\tSHA, however after parsing there are few lines that seems to be malformed.

    I've attached the malformed entries below

    Line 2818009 : Upside Down Numbers
    Line 2818010 : Upside Down Numbers
    Line 2818011 : Upside Down Numbers
    Line 2818012 : Upside Down Numbers
    Line 2818013 : Upside Down Numbers__main.cpp    0070bf300d9f1bf6ec6533142fbbaa4de8ff65374da8d29e6a85cba5d0ad38df
    Line 2818158 : Phone Number Combinations__main.cpp      39ae7219b6377846a3792efcf6db5a9cc1b949652a3cc76edbe5c368f37b90a1
    Line 11100410 : .java   3f49f41560cd7a5ea7c2d31120d98dfd2f56da204b6286e099af69d629ba3041
    
    opened by brutalsavage 2
  • cuda error

    Hi, I ran the 2-7B model with this command: sudo ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/2-7B.yml, and I get this error message. [2022-03-28 14:07:59,567] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=0 file=checkpoints/global_step150000/layer_00-model_00-model_states.pt

    Would you help me solve this problem? Thanks.

    opened by liuyongqiangjava 2
  • Issue when running the model

    Hello, could you help me take a look at this error? No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda' NeoXArgs.from_ymls() ['configs/text_generation.yml', 'checkpoints/configs/local_setup.yml', 'checkpoints/configs/2-7B.yml'] INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 0 Traceback (most recent call last): File "./deepy.py", line 29, in <module> neox_args = NeoXArgs.consume_deepy_args() File "/gpt-neox/megatron/neox_arguments/arguments.py", line 304, in consume_deepy_args neox_args = cls.from_ymls(paths_to_yml_files=conf_files, overwrite_values=overwrite_values) File "/gpt-neox/megatron/neox_arguments/arguments.py", line 199, in from_ymls return cls(**config) File "<string>", line 184, in __init__ File "/gpt-neox/megatron/neox_arguments/arguments.py", line 106, in __post_init__ self.calculate_derived() File "/gpt-neox/megatron/neox_arguments/arguments.py", line 656, in calculate_derived self.check_batch_parameters( File "/gpt-neox/megatron/neox_arguments/arguments.py", line 570, in check_batch_parameters assert ( AssertionError: Train batch size: 0 has to be greater than 0 Thanks!

    opened by ipsaxiaobo 2
  • Code Completion Support

    Hi, thanks for your nicely trained model! I would like to add it to a code completion (not code generation) backend, but I cannot find an API to do this, for example using the GPT cache to decode one step at a time and go through whole code files to calculate top-1 accuracy.

    opened by HoratioJSY 2
  • Plan on Releasing Generated Sample C Code

    Hi there, nice job on this work! I'm wondering whether you are planning on releasing some generated sample C code? I'm just curious what it looks like. For judging the functional correctness of the generated Python code, I know you used HumanEval; did you conduct a similar functionality check on C code as well? Thanks much!

    opened by Lightninghkm 2
  • Context Prompt - RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling ```cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)```

    I am trying to setup Polycoder inferencing on my machine with 2xP100 GPUs, and use the docker command as available in README:

    nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD/Downloads/checkpoints/checkpoints-2-7B,dst=/gpt-neox/checkpoints vhellendoorn/code-lms-neox:base
    

    And then within the container:

     sudo ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/2-7B.yml
    

    The following is the output (stdout+stderr):

    NeoXArgs.from_ymls() ['configs/text_generation.yml', 'checkpoints/configs/local_setup.yml', 'checkpoints/configs/2-7B.yml']
    INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 2
    -------------------- arguments --------------------
      attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
      attention_dropout ............... 0...........................updated
      batch_size ...................... 8...........................updated
      bias_gelu_fusion ................ True........................updated
      checkpoint_activations .......... True........................updated
      clip_grad ....................... 1.0.........................updated
      config_files .................... {'text_generation.yml': '# Parameters used for text generation\n# Make sure `load` is specified somewhere else\n{\n  # Text gen type: `input-file`, `unconditional` or `interactive`\n  "text-gen-type": "interactive",\n \n  # Params for all\n  "maximum_tokens": 256,\n  "temperature": 0.5,\n  "top_p": 0.0,\n  "top_k": 0,\n  "recompute": false,\n  \n  # `unconditional`: samples\n  "num-samples": 10,\n\n  # input/output file\n  "sample-input-file": "sample_input.txt",\n  "sample-output-file": "sample_output.txt",\n}', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data-path": "data/code/code_text_document",\n  \n  # or for weighted datasets: \n  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab-file": "data/code-vocab.json",\n  "merge-file": "data/code-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n  \n  "tensorboard-dir": "tensorboard",\n  "log-dir": "logs",\n  "use_wandb": True,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}', '2-7B.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe-parallel-size": 1,\n   "model-parallel-size": 1,\n\n   # model settings\n   "num-layers": 32,\n   "hidden-size": 2560,\n   "num-attention-heads": 32,\n   "seq-length": 2048,\n   "max-position-embeddings": 2048,\n   "norm": "layernorm",\n   "pos-emb": "rotary",\n   "no-weight-tying": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled-upper-triang-masked-softmax-fusion": true,\n   "bias-gelu-fusion": true,\n\n   # optimizer settings\n   "zero_allow_untested_optimizer": true,\n   "optimizer": {\n     "type": "adam",\n     "params": {\n       "lr": 0.00016,\n       "betas": [0.9, 0.999],\n       "eps": 1.0e-8,\n     }\n   },\n   "zero_optimization": {\n    "stage": 1,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n    "cpu_offload": False\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 8,\n   "gradient_accumulation_steps": 4,\n   "data-impl": "mmap",\n   "split": "989,10,1",\n\n   # activation checkpointing\n   "checkpoint-activations": true,\n   "checkpoint-num-layers": 1,\n   "partition-activations": true,\n   "synchronize-each-layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight-decay": 0,\n   "hidden-dropout": 0,\n   "attention-dropout": 0,\n\n   # precision settings\n   "fp16": { \n     "fp16": true,\n     "enabled": true,\n     "loss_scale": 0,\n     
"initial_scale_power": 16,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train-iters": 160000,\n   "lr-decay-iters": 160000,\n   "distributed-backend": "nccl",\n   "lr-decay-style": "cosine",\n   "warmup": 0.01,\n   "save-interval": 1000,\n   "eval-interval": 1000,\n   "eval-iters": 10,\n\n   # logging\n   "log-interval": 100,\n   "steps_per_print": 10,\n   "keep-last-n-checkpoints": 1,\n   "wall_clock_breakdown": true,\n}\n'}updated
      data_impl ....................... mmap........................updated
      data_path ....................... data/code/code_text_documentupdated
      dynamic_loss_scale .............. True........................updated
      eval_iters ...................... 10..........................updated
      fp16 ............................ {'fp16': True, 'enabled': True, 'loss_scale': 0, 'initial_scale_power': 16, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
      gas ............................. 4...........................updated
      global_num_gpus ................. 2...........................updated
      gradient_accumulation_steps ..... 4...........................updated
      gradient_clipping ............... 1.0.........................updated
      hidden_dropout .................. 0...........................updated
      hidden_size ..................... 2560........................updated
      is_pipe_parallel ................ True........................updated
      keep_last_n_checkpoints ......... 1...........................updated
      load ............................ checkpoints.................updated
      log_dir ......................... logs........................updated
      log_interval .................... 100.........................updated
      lr .............................. 0.00016.....................updated
      lr_decay_iters .................. 160000......................updated
      lr_decay_style .................. cosine......................updated
      max_position_embeddings ......... 2048........................updated
      maximum_tokens .................. 256.........................updated
      merge_file ...................... data/code-merges.txt........updated
      no_weight_tying ................. True........................updated
      num_attention_heads ............. 32..........................updated
      num_layers ...................... 32..........................updated
      num_samples ..................... 10..........................updated
      optimizer ....................... {'type': 'adam', 'params': {'lr': 0.00016, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
      partition_activations ........... True........................updated
      pipe_parallel_size .............. 1...........................updated
      pos_emb ......................... rotary......................updated
      precision ....................... fp16........................updated
      sample_input_file ............... sample_input.txt............updated
      sample_output_file .............. sample_output.txt...........updated
      save ............................ checkpoints.................updated
      save_interval ................... 1000........................updated
      scaled_upper_triang_masked_softmax_fusion  True...............updated
      seq_length ...................... 2048........................updated
      sparsity_config ................. {}..........................updated
      split ........................... 989,10,1....................updated
      synchronize_each_layer .......... True........................updated
      temperature ..................... 0.5.........................updated
      tensorboard_dir ................. tensorboard.................updated
      text_gen_type ................... interactive.................updated
      train_batch_size ................ 64..........................updated
      train_iters ..................... 160000......................updated
      train_micro_batch_size_per_gpu .. 8...........................updated
      use_wandb ....................... True........................updated
      user_script ..................... generate.py.................updated
      vocab_file ...................... data/code-vocab.json........updated
      wall_clock_breakdown ............ True........................updated
      wandb_group ..................... jtRPtjruy7PQkWHayfg7cH_6sweym4supdated
      weight_decay .................... 0...........................updated
      zero_allgather_bucket_size ...... 500000000...................updated
      zero_allow_untested_optimizer ... True........................updated
      zero_contiguous_gradients ....... True........................updated
      zero_optimization ............... {'stage': 1, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated
      zero_reduce_bucket_size ......... 500000000...................updated
      zero_reduce_scatter ............. True........................updated
      zero_stage ...................... 1...........................updated
      activation ...................... gelu........................default
      adlr_autoresume ................. False.......................default
      adlr_autoresume_interval ........ 1000........................default
      amp ............................. None........................default
      apply_query_key_layer_scaling ... False.......................default
      attention_softmax_in_fp32 ....... False.......................default
      bias_dropout_fusion ............. False.......................default
      char_level_ppl .................. False.......................default
      checkpoint_in_cpu ............... False.......................default
      checkpoint_num_layers ........... 1...........................default
      checkpoint_validation_with_forward_pass  False................default
      contiguous_checkpointing ........ False.......................default
      deepscale ....................... False.......................default
      deepscale_config ................ None........................default
      deepspeed ....................... True........................default
      deepspeed_activation_checkpointing  True......................default
      deepspeed_mpi ................... False.......................default
      detect_nvlink_pairs ............. False.......................default
      distributed_backend ............. nccl........................default
      do_test ......................... None........................default
      do_train ........................ None........................default
      do_valid ........................ None........................default
      dump_state ...................... False.......................default
      eod_mask_loss ................... False.......................default
      eval_interval ................... 1000........................default
      eval_results_prefix ............. ............................default
      eval_tasks ...................... None........................default
      exclude ......................... None........................default
      exit_interval ................... None........................default
      finetune ........................ False.......................default
      flops_profiler .................. None........................default
      fp16_lm_cross_entropy ........... False.......................default
      fp32_allreduce .................. False.......................default
      git_hash ........................ 98683ae.....................default
      gmlp_attn_dim ................... 64..........................default
      gpt_j_residual .................. False.......................default
      gradient_noise_scale_cpu_offload  False.......................default
      gradient_noise_scale_n_batches .. 5...........................default
      gradient_predivide_factor ....... 1.0.........................default
      hostfile ........................ None........................default
      hysteresis ...................... 2...........................default
      include ......................... None........................default
      init_method ..................... normal......................default
      init_method_std ................. 0.02........................default
      iteration ....................... None........................default
      launcher ........................ pdsh........................default
      layernorm_epsilon ............... 1e-05.......................default
      lazy_mpu_init ................... False.......................default
      local_rank ...................... None........................default
      log_grad_norm ................... False.......................default
      log_gradient_noise_scale ........ False.......................default
      log_optimizer_states ............ False.......................default
      log_param_norm .................. False.......................default
      loss_scale ...................... None........................default
      loss_scale_window ............... 1000.0......................default
      make_vocab_size_divisible_by .... 128.........................default
      master_addr ..................... None........................default
      master_port ..................... 29500.......................default
      min_lr .......................... 0.0.........................default
      min_scale ....................... 1.0.........................default
      mmap_warmup ..................... False.......................default
      model_parallel_size ............. 1...........................default
      no_load_optim ................... False.......................default
      no_load_rng ..................... False.......................default
      no_save_optim ................... False.......................default
      no_save_rng ..................... False.......................default
      norm ............................ layernorm...................default
      num_gpus ........................ None........................default
      num_nodes ....................... -1..........................default
      num_unique_layers ............... None........................default
      num_workers ..................... 2...........................default
      onnx_safe ....................... False.......................default
      optimizer_type .................. adam........................default
      output_layer_init_method ........ scaled_normal...............default
      output_layer_parallelism ........ row.........................default
      override_lr_scheduler ........... False.......................default
      padded_vocab_size ............... None........................default
      param_sharing_style ............. grouped.....................default
      pipe_partition_method ........... type:transformer|mlp........default
      prescale_gradients .............. False.......................default
      profile_backward ................ False.......................default
      rank ............................ None........................default
      recompute ....................... False.......................default
      rms_norm_epsilon ................ 1e-08.......................default
      rotary_emb_base ................. 10000.......................default
      rotary_pct ...................... 1.0.........................default
      rpe_max_distance ................ 128.........................default
      rpe_num_buckets ................. 32..........................default
      scaled_masked_softmax_fusion .... False.......................default
      scalenorm_epsilon ............... 1e-08.......................default
      scheduler ....................... None........................default
      seed ............................ 1234........................default
      short_seq_prob .................. 0.1.........................default
      soft_prompt_tuning .............. None........................default
      sparse_gradients ................ False.......................default
      steps_per_print ................. 10..........................default
      test_data_paths ................. None........................default
      test_data_weights ............... None........................default
      tokenizer_type .................. GPT2BPETokenizer............default
      top_k ........................... 0...........................default
      top_p ........................... 0.0.........................default
      train_data_paths ................ None........................default
      train_data_weights .............. None........................default
      use_bnb_optimizer ............... False.......................default
      use_checkpoint_lr_scheduler ..... False.......................default
      use_cpu_initialization .......... False.......................default
      valid_data_paths ................ None........................default
      valid_data_weights .............. None........................default
      wandb_host ...................... https://api.wandb.ai........default
      wandb_project ................... neox........................default
      wandb_team ...................... None........................default
      warmup .......................... 0.01........................default
      weight_by_num_documents ......... False.......................default
      weighted_sampler_alpha .......... 0.3.........................default
      world_size ...................... None........................default
    ---------------- end of arguments ----------------
    [2022-07-21 05:12:58,859] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    [2022-07-21 05:12:58,860] [INFO] [runner.py:366:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 generate.py --deepspeed_config {"train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 4, "optimizer": {"type": "adam", "params": {"lr": 0.00016, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "zero_allow_untested_optimizer": true} --megatron_config {"train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 4, "optimizer": {"type": "adam", "params": {"lr": 0.00016, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "zero_allow_untested_optimizer": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "lr_decay_style": "cosine", "lr_decay_iters": 160000, "zero_stage": 1, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "data_path": "data/code/code_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"text_generation.yml": "# Parameters used for text generation\n# Make sure `load` is specified somewhere else\n{\n  # Text gen type: `input-file`, `unconditional` or `interactive`\n  \"text-gen-type\": \"interactive\",\n \n  # Params for all\n  \"maximum_tokens\": 256,\n  \"temperature\": 0.5,\n  \"top_p\": 0.0,\n  \"top_k\": 0,\n  \"recompute\": false,\n  \n  # `unconditional`: samples\n  \"num-samples\": 10,\n\n  # input/output file\n  \"sample-input-file\": \"sample_input.txt\",\n  \"sample-output-file\": \"sample_output.txt\",\n}", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data-path\": \"data/code/code_text_document\",\n  \n  # or for weighted datasets: \n  # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  
# \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab-file\": \"data/code-vocab.json\",\n  \"merge-file\": \"data/code-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n  \n  \"tensorboard-dir\": \"tensorboard\",\n  \"log-dir\": \"logs\",\n  \"use_wandb\": True,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}", "2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe-parallel-size\": 1,\n   \"model-parallel-size\": 1,\n\n   # model settings\n   \"num-layers\": 32,\n   \"hidden-size\": 2560,\n   \"num-attention-heads\": 32,\n   \"seq-length\": 2048,\n   \"max-position-embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos-emb\": \"rotary\",\n   \"no-weight-tying\": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled-upper-triang-masked-softmax-fusion\": true,\n   \"bias-gelu-fusion\": true,\n\n   # optimizer settings\n   \"zero_allow_untested_optimizer\": true,\n   \"optimizer\": {\n     \"type\": \"adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.999],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"zero_optimization\": {\n    \"stage\": 1,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n    \"cpu_offload\": False\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 8,\n   \"gradient_accumulation_steps\": 4,\n   \"data-impl\": \"mmap\",\n   \"split\": \"989,10,1\",\n\n   # activation checkpointing\n   \"checkpoint-activations\": true,\n   \"checkpoint-num-layers\": 1,\n   \"partition-activations\": true,\n   \"synchronize-each-layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight-decay\": 0,\n   \"hidden-dropout\": 0,\n   \"attention-dropout\": 0,\n\n   # precision settings\n   \"fp16\": { \n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"initial_scale_power\": 16,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. 
training settings\n   \"train-iters\": 160000,\n   \"lr-decay-iters\": 160000,\n   \"distributed-backend\": \"nccl\",\n   \"lr-decay-style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"save-interval\": 1000,\n   \"eval-interval\": 1000,\n   \"eval-iters\": 10,\n\n   # logging\n   \"log-interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep-last-n-checkpoints\": 1,\n   \"wall_clock_breakdown\": true,\n}\n"}, "load": "checkpoints", "save_interval": 1000, "batch_size": 8, "train_iters": 160000, "eval_iters": 10, "keep_last_n_checkpoints": 1, "split": "989,10,1", "vocab_file": "data/code-vocab.json", "merge_file": "data/code-merges.txt", "attention_dropout": 0, "hidden_dropout": 0, "weight_decay": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 4, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "jtRPtjruy7PQkWHayfg7cH_6sweym4s", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "text_gen_type": "interactive", "temperature": 0.5, "maximum_tokens": 256, "sample_input_file": "sample_input.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "global_num_gpus": 2}
    [2022-07-21 05:12:59,743] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1]}
    [2022-07-21 05:12:59,743] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=2, node_rank=0
    [2022-07-21 05:12:59,743] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
    [2022-07-21 05:12:59,743] [INFO] [launch.py:104:main] dist_world_size=2
    [2022-07-21 05:12:59,743] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1
    NeoXArgs.configure_distributed_args() using world size: 2 and model-parallel size: 1 
    > building GPT2BPETokenizer tokenizer ...
     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
    > initializing torch distributed ...
    [2022-07-21 05:13:02,390] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
    [2022-07-21 05:13:02,482] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
    > initializing model parallel with size 1
    MPU DP: [0, 1]
    MPU PP: [0]
    MPU PP: [1]
    MPU MP: [0]
    MPU MP: [1]
    > setting random seeds to 1234 ...
    [2022-07-21 05:13:02,518] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    make: Entering directory '/gpt-neox/megatron/data'
    make: Nothing to be done for 'default'.
    make: Leaving directory '/gpt-neox/megatron/data'
    building GPT2 model ...
    SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
    Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1}
    [2022-07-21 05:13:02,651] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
    stage=0 layers=37
         0: EmbeddingPipe
         1: _pre_transformer_block
         2: ParallelTransformerLayerPipe
         3: ParallelTransformerLayerPipe
         4: ParallelTransformerLayerPipe
         5: ParallelTransformerLayerPipe
         6: ParallelTransformerLayerPipe
         7: ParallelTransformerLayerPipe
         8: ParallelTransformerLayerPipe
         9: ParallelTransformerLayerPipe
        10: ParallelTransformerLayerPipe
        11: ParallelTransformerLayerPipe
        12: ParallelTransformerLayerPipe
        13: ParallelTransformerLayerPipe
        14: ParallelTransformerLayerPipe
        15: ParallelTransformerLayerPipe
        16: ParallelTransformerLayerPipe
        17: ParallelTransformerLayerPipe
        18: ParallelTransformerLayerPipe
        19: ParallelTransformerLayerPipe
        20: ParallelTransformerLayerPipe
        21: ParallelTransformerLayerPipe
        22: ParallelTransformerLayerPipe
        23: ParallelTransformerLayerPipe
        24: ParallelTransformerLayerPipe
        25: ParallelTransformerLayerPipe
        26: ParallelTransformerLayerPipe
        27: ParallelTransformerLayerPipe
        28: ParallelTransformerLayerPipe
        29: ParallelTransformerLayerPipe
        30: ParallelTransformerLayerPipe
        31: ParallelTransformerLayerPipe
        32: ParallelTransformerLayerPipe
        33: ParallelTransformerLayerPipe
        34: _post_transformer_block
        35: NormPipe
        36: ParallelLinearPipe
    DeepSpeed is enabled.
    [2022-07-21 05:13:05,069] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
    [2022-07-21 05:13:05,070] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
    [2022-07-21 05:13:05,102] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
    [2022-07-21 05:13:05,172] [INFO] [config.py:759:print] DeepSpeedEngine configuration:
    [2022-07-21 05:13:05,173] [INFO] [config.py:763:print]   activation_checkpointing_config  {
        "partition_activations": false, 
        "contiguous_memory_optimization": false, 
        "cpu_checkpointing": false, 
        "number_checkpoints": null, 
        "synchronize_checkpoint_boundary": false, 
        "profile": false
    }
    [2022-07-21 05:13:05,173] [INFO] [config.py:763:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
    [2022-07-21 05:13:05,173] [INFO] [config.py:763:print]   allreduce_always_fp32 ........ False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   amp_enabled .................. False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   amp_params ................... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   checkpoint_tag_validation_enabled  True
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   checkpoint_tag_validation_fail  False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   disable_allgather ............ False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   dump_state ................... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   elasticity_enabled ........... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   flops_profiler_config ........ {
        "enabled": false, 
        "profile_step": 1, 
        "module_depth": -1, 
        "top_modules": 3, 
        "detailed": true
    }
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   fp16_enabled ................. True
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   fp16_type .................... fp16
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   global_rank .................. 0
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   gradient_accumulation_steps .. 4
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   gradient_clipping ............ 1.0
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   gradient_predivide_factor .... 1.0
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   initial_dynamic_scale ........ 65536
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   loss_scale ................... 0
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   memory_breakdown ............. False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   optimizer_legacy_fusion ...... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   optimizer_name ............... adam
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   optimizer_params ............. {'lr': 0.00016, 'betas': [0.9, 0.999], 'eps': 1e-08}
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   pld_enabled .................. False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   pld_params ................... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   precision .................... torch.float16
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   prescale_gradients ........... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   scheduler_name ............... None
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   scheduler_params ............. None
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   sparse_attention ............. None
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   sparse_gradients_enabled ..... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   steps_per_print .............. 10
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   tensorboard_enabled .......... False
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   tensorboard_job_name ......... DeepSpeedJobName
    [2022-07-21 05:13:05,174] [INFO] [config.py:763:print]   tensorboard_output_path ...... 
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   train_batch_size ............. 64
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   train_micro_batch_size_per_gpu  8
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   wall_clock_breakdown ......... True
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   world_size ................... 2
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   zero_allow_untested_optimizer  True
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   zero_config .................. {
        "stage": 0, 
        "contiguous_gradients": false, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": false, 
        "load_from_fp32_weights": true, 
        "elastic_checkpoint": true, 
        "offload_param": null, 
        "offload_optimizer": null, 
        "sub_group_size": 1.000000e+12, 
        "prefetch_bucket_size": 5.000000e+07, 
        "param_persistence_threshold": 1.000000e+05, 
        "max_live_parameters": 1.000000e+09, 
        "max_reuse_distance": 1.000000e+09, 
        "gather_fp16_weights_on_model_save": false
    }
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   zero_enabled ................. False
    [2022-07-21 05:13:05,175] [INFO] [config.py:763:print]   zero_optimization_stage ...... 0
    [2022-07-21 05:13:05,175] [INFO] [config.py:765:print]   json = {
        "train_batch_size": 64, 
        "train_micro_batch_size_per_gpu": 8, 
        "gradient_accumulation_steps": 4, 
        "optimizer": {
            "type": "adam", 
            "params": {
                "lr": 0.00016, 
                "betas": [0.9, 0.999], 
                "eps": 1e-08
            }
        }, 
        "fp16": {
            "fp16": true, 
            "enabled": true, 
            "loss_scale": 0, 
            "initial_scale_power": 16, 
            "loss_scale_window": 1000, 
            "hysteresis": 2, 
            "min_loss_scale": 1
        }, 
        "gradient_clipping": 1.0, 
        "zero_optimization": {
            "stage": 0, 
            "allgather_partitions": true, 
            "reduce_scatter": true, 
            "allgather_bucket_size": 5.000000e+08, 
            "overlap_comm": false, 
            "reduce_bucket_size": 5.000000e+08, 
            "contiguous_gradients": false, 
            "cpu_offload": false
        }, 
        "wall_clock_breakdown": true, 
        "zero_allow_untested_optimizer": true
    }
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
    Building extension module utils...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    ninja: no work to do.
    Loading extension module utils...
    Time to load utils op: 0.37787580490112305 seconds
    [2022-07-21 05:13:05,558] [INFO] [engine.py:84:__init__] CONFIG: micro_batches=4 micro_batch_size=8
    Loading extension module utils...
    Time to load utils op: 0.40610766410827637 seconds
    [2022-07-21 05:13:05,679] [INFO] [engine.py:141:__init__] RANK=0 STAGE=0 LAYERS=37 [0, 37) STAGE_PARAMS=2775208960 (2775.209M) TOTAL_PARAMS=2775208960 (2775.209M) UNIQUE_PARAMS=2775208960 (2775.209M)
     > number of parameters on model parallel rank 0: 2775208960
     > total params: 2,775,208,960
    [2022-07-21 05:13:05,702] [INFO] [engine.py:1551:_load_checkpoint] rank: 0 loading checkpoint: checkpoints/global_step150000/mp_rank_00_model_states.pt
    [2022-07-21 05:13:05,702] [INFO] [engine.py:1551:_load_checkpoint] rank: 1 loading checkpoint: checkpoints/global_step150000/mp_rank_00_model_states.pt
    [2022-07-21 05:13:05,901] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=0 file=checkpoints/global_step150000/layer_00-model_00-model_states.pt
    [2022-07-21 05:13:06,022] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=2 file=checkpoints/global_step150000/layer_02-model_00-model_states.pt
    [2022-07-21 05:13:06,138] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=3 file=checkpoints/global_step150000/layer_03-model_00-model_states.pt
    [2022-07-21 05:13:06,254] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=4 file=checkpoints/global_step150000/layer_04-model_00-model_states.pt
    [2022-07-21 05:13:06,370] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=5 file=checkpoints/global_step150000/layer_05-model_00-model_states.pt
    [2022-07-21 05:13:06,481] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=6 file=checkpoints/global_step150000/layer_06-model_00-model_states.pt
    [2022-07-21 05:13:06,592] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=7 file=checkpoints/global_step150000/layer_07-model_00-model_states.pt
    [2022-07-21 05:13:06,730] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=8 file=checkpoints/global_step150000/layer_08-model_00-model_states.pt
    [2022-07-21 05:13:06,854] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=9 file=checkpoints/global_step150000/layer_09-model_00-model_states.pt
    [2022-07-21 05:13:06,968] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=10 file=checkpoints/global_step150000/layer_10-model_00-model_states.pt
    [2022-07-21 05:13:07,083] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=11 file=checkpoints/global_step150000/layer_11-model_00-model_states.pt
    [2022-07-21 05:13:07,199] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=12 file=checkpoints/global_step150000/layer_12-model_00-model_states.pt
    [2022-07-21 05:13:07,313] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=13 file=checkpoints/global_step150000/layer_13-model_00-model_states.pt
    [2022-07-21 05:13:07,433] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=14 file=checkpoints/global_step150000/layer_14-model_00-model_states.pt
    [2022-07-21 05:13:07,550] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=15 file=checkpoints/global_step150000/layer_15-model_00-model_states.pt
    [2022-07-21 05:13:07,667] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=16 file=checkpoints/global_step150000/layer_16-model_00-model_states.pt
    [2022-07-21 05:13:07,782] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=17 file=checkpoints/global_step150000/layer_17-model_00-model_states.pt
    [2022-07-21 05:13:07,899] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=18 file=checkpoints/global_step150000/layer_18-model_00-model_states.pt
    [2022-07-21 05:13:08,007] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=19 file=checkpoints/global_step150000/layer_19-model_00-model_states.pt
    [2022-07-21 05:13:08,142] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=20 file=checkpoints/global_step150000/layer_20-model_00-model_states.pt
    [2022-07-21 05:13:08,251] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=21 file=checkpoints/global_step150000/layer_21-model_00-model_states.pt
    [2022-07-21 05:13:08,358] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=22 file=checkpoints/global_step150000/layer_22-model_00-model_states.pt
    [2022-07-21 05:13:08,466] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=23 file=checkpoints/global_step150000/layer_23-model_00-model_states.pt
    [2022-07-21 05:13:08,574] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=24 file=checkpoints/global_step150000/layer_24-model_00-model_states.pt
    [2022-07-21 05:13:08,681] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=25 file=checkpoints/global_step150000/layer_25-model_00-model_states.pt
    [2022-07-21 05:13:08,786] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=26 file=checkpoints/global_step150000/layer_26-model_00-model_states.pt
    [2022-07-21 05:13:08,894] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=27 file=checkpoints/global_step150000/layer_27-model_00-model_states.pt
    [2022-07-21 05:13:09,003] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=28 file=checkpoints/global_step150000/layer_28-model_00-model_states.pt
    [2022-07-21 05:13:09,114] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=29 file=checkpoints/global_step150000/layer_29-model_00-model_states.pt
    [2022-07-21 05:13:09,222] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=30 file=checkpoints/global_step150000/layer_30-model_00-model_states.pt
    [2022-07-21 05:13:09,332] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=31 file=checkpoints/global_step150000/layer_31-model_00-model_states.pt
    [2022-07-21 05:13:09,438] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=32 file=checkpoints/global_step150000/layer_32-model_00-model_states.pt
    [2022-07-21 05:13:09,544] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=33 file=checkpoints/global_step150000/layer_33-model_00-model_states.pt
    [2022-07-21 05:13:09,544] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=35 file=checkpoints/global_step150000/layer_35-model_00-model_states.pt
    [2022-07-21 05:13:09,752] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=36 file=checkpoints/global_step150000/layer_36-model_00-model_states.pt
     > validated currently set args with arguments in the checkpoint ...
      successfully loaded checkpoints/global_step150000/mp_rank_00_model_states.pt
    Loading checkpoint and starting from iteration 150000
    Finished loading model
    Context prompt >>> def return1():\n """Returns 1."""\n
    
    Traceback (most recent call last):
      File "generate.py", line 74, in <module>
        main()
      File "generate.py", line 59, in main
        generate_samples_interactive(
      File "/gpt-neox/megatron/text_generation_utils.py", line 751, in generate_samples_interactive
        for (
      File "/gpt-neox/megatron/text_generation_utils.py", line 317, in stream_tokens
        logits, layer_past = forward_model(neox_args, model, model_inputs)
      File "/gpt-neox/megatron/text_generation_utils.py", line 137, in forward_model
        return model.module(model_inputs)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 335, in forward
        x = func(forward_input)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 328, in exec_func
        inputs = layer(inputs)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/gpt-neox/megatron/model/transformer.py", line 686, in forward
        outputs = super().forward(hidden_states, attention_mask, layer_past=past)
      File "/gpt-neox/megatron/model/transformer.py", line 639, in forward
        attention_output, attention_bias = self.attention(
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/gpt-neox/megatron/model/transformer.py", line 516, in forward
        output, bias = self.dense(context_layer)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/gpt-neox/megatron/mpu/layers.py", line 446, in forward
        output_parallel = F.linear(input_parallel, self.weight)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
    Killing subprocess 1054
    Killing subprocess 1055
    Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate.py', '--local_rank=1', '--deepspeed_config', '{"train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 4, "optimizer": {"type": "adam", "params": {"lr": 0.00016, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "zero_allow_untested_optimizer": true}', '--megatron_config', '{"train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 4, "optimizer": {"type": "adam", "params": {"lr": 0.00016, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "zero_allow_untested_optimizer": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "lr_decay_style": "cosine", "lr_decay_iters": 160000, "zero_stage": 1, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "data_path": "data/code/code_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"text_generation.yml": "# Parameters used for text generation\\n# Make sure `load` is specified somewhere else\\n{\\n  # Text gen type: `input-file`, `unconditional` or `interactive`\\n  \\"text-gen-type\\": \\"interactive\\",\\n \\n  # Params for all\\n  \\"maximum_tokens\\": 256,\\n  \\"temperature\\": 0.5,\\n  \\"top_p\\": 0.0,\\n  \\"top_k\\": 0,\\n  \\"recompute\\": false,\\n  \\n  # `unconditional`: samples\\n  \\"num-samples\\": 10,\\n\\n  # input/output file\\n  \\"sample-input-file\\": \\"sample_input.txt\\",\\n  \\"sample-output-file\\": \\"sample_output.txt\\",\\n}", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n  \\"data-path\\": \\"data/code/code_text_document\\",\\n  \\n  # or for weighted datasets: \\n  # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n  # 
\\"train-data-weights\\": [1., 2.],\\n  # \\"test-data-weights\\": [2., 1.],\\n  # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \\n  # WARNING: setting this to True will override any user provided weights\\n  # \\"weight_by_num_documents\\": false,\\n  # \\"weighted_sampler_alpha\\": 0.3,\\n\\n  \\"vocab-file\\": \\"data/code-vocab.json\\",\\n  \\"merge-file\\": \\"data/code-merges.txt\\",\\n\\n  \\"save\\": \\"checkpoints\\",\\n  \\"load\\": \\"checkpoints\\",\\n  \\"checkpoint_validation_with_forward_pass\\": False,\\n  \\n  \\"tensorboard-dir\\": \\"tensorboard\\",\\n  \\"log-dir\\": \\"logs\\",\\n  \\"use_wandb\\": True,\\n  \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n  \\"wandb_project\\": \\"neox\\"\\n}", "2-7B.yml": "# GPT-2 pretraining setup\\n{\\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n   # across the node boundaries )\\n   \\"pipe-parallel-size\\": 1,\\n   \\"model-parallel-size\\": 1,\\n\\n   # model settings\\n   \\"num-layers\\": 32,\\n   \\"hidden-size\\": 2560,\\n   \\"num-attention-heads\\": 32,\\n   \\"seq-length\\": 2048,\\n   \\"max-position-embeddings\\": 2048,\\n   \\"norm\\": \\"layernorm\\",\\n   \\"pos-emb\\": \\"rotary\\",\\n   \\"no-weight-tying\\": true,\\n\\n   # these should provide some speedup but takes a while to build, set to true if desired\\n   \\"scaled-upper-triang-masked-softmax-fusion\\": true,\\n   \\"bias-gelu-fusion\\": true,\\n\\n   # optimizer settings\\n   \\"zero_allow_untested_optimizer\\": true,\\n   \\"optimizer\\": {\\n     \\"type\\": \\"adam\\",\\n     \\"params\\": {\\n       \\"lr\\": 0.00016,\\n       \\"betas\\": [0.9, 0.999],\\n       \\"eps\\": 1.0e-8,\\n     }\\n   },\\n   \\"zero_optimization\\": {\\n    \\"stage\\": 1,\\n    \\"allgather_partitions\\": True,\\n    \\"allgather_bucket_size\\": 500000000,\\n    \\"overlap_comm\\": True,\\n    \\"reduce_scatter\\": True,\\n    \\"reduce_bucket_size\\": 500000000,\\n    \\"contiguous_gradients\\": True,\\n    \\"cpu_offload\\": False\\n  },\\n\\n   # batch / data settings\\n   \\"train_micro_batch_size_per_gpu\\": 8,\\n   \\"gradient_accumulation_steps\\": 4,\\n   \\"data-impl\\": \\"mmap\\",\\n   \\"split\\": \\"989,10,1\\",\\n\\n   # activation checkpointing\\n   \\"checkpoint-activations\\": true,\\n   \\"checkpoint-num-layers\\": 1,\\n   \\"partition-activations\\": true,\\n   \\"synchronize-each-layer\\": true,\\n\\n   # regularization\\n   \\"gradient_clipping\\": 1.0,\\n   \\"weight-decay\\": 0,\\n   \\"hidden-dropout\\": 0,\\n   \\"attention-dropout\\": 0,\\n\\n   # precision settings\\n   \\"fp16\\": { \\n     \\"fp16\\": true,\\n     \\"enabled\\": true,\\n     \\"loss_scale\\": 0,\\n     \\"initial_scale_power\\": 16,\\n     \\"loss_scale_window\\": 1000,\\n     \\"hysteresis\\": 2,\\n     \\"min_loss_scale\\": 1\\n   },\\n\\n   # misc. 
training settings\\n   \\"train-iters\\": 160000,\\n   \\"lr-decay-iters\\": 160000,\\n   \\"distributed-backend\\": \\"nccl\\",\\n   \\"lr-decay-style\\": \\"cosine\\",\\n   \\"warmup\\": 0.01,\\n   \\"save-interval\\": 1000,\\n   \\"eval-interval\\": 1000,\\n   \\"eval-iters\\": 10,\\n\\n   # logging\\n   \\"log-interval\\": 100,\\n   \\"steps_per_print\\": 10,\\n   \\"keep-last-n-checkpoints\\": 1,\\n   \\"wall_clock_breakdown\\": true,\\n}\\n"}, "load": "checkpoints", "save_interval": 1000, "batch_size": 8, "train_iters": 160000, "eval_iters": 10, "keep_last_n_checkpoints": 1, "split": "989,10,1", "vocab_file": "data/code-vocab.json", "merge_file": "data/code-merges.txt", "attention_dropout": 0, "hidden_dropout": 0, "weight_decay": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 4, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "jtRPtjruy7PQkWHayfg7cH_6sweym4s", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "text_gen_type": "interactive", "temperature": 0.5, "maximum_tokens": 256, "sample_input_file": "sample_input.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "global_num_gpus": 2}']' returned non-zero exit status 1.
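
    The failing call at the bottom of this trace is a half-precision matrix multiply: cublasGemmEx with CUDA_R_16F inputs, reached through F.linear in the attention block. As a minimal illustrative sketch, assuming a PyTorch build with CUDA support and at least one visible GPU, the same kind of fp16 GEMM can be exercised in isolation to check whether it works at all on the card in question:

        # Minimal sketch: run an fp16 GEMM like the one behind the failing cublasGemmEx call.
        # Assumes PyTorch with CUDA and one visible GPU; 2560 matches the logged hidden_size.
        import torch

        device = torch.device("cuda:0")
        a = torch.randn(1024, 2560, dtype=torch.float16, device=device)
        b = torch.randn(2560, 2560, dtype=torch.float16, device=device)
        c = a @ b  # dispatches to a half-precision cuBLAS GEMM, as F.linear does in the model
        torch.cuda.synchronize()
        print(c.float().abs().mean().item())

    If this small check fails with the same CUBLAS error, the problem likely lies with the GPU/driver/CUDA combination rather than with the checkpoint or the configs.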
    
    
    opened by LakshyAAAgrawal 1
  • Context prompt - example not working

    Context prompt - example not working

    I have configured this setup on a 4-GPU machine, using the Docker image. I get the Context prompt, but when I feed it an example it falls apart. Can you please help me understand why it is failing?

    Context prompt >>> def return1():\n """Returns 1."""\n

    Traceback (most recent call last):
      File "generate.py", line 74, in <module>
        main()
      File "generate.py", line 59, in main
        generate_samples_interactive(
      File "/gpt-neox/megatron/text_generation_utils.py", line 779, in generate_samples_interactive
        generated_text = neox_args.tokenizer.detokenize(generated_tokens)
      File "/gpt-neox/megatron/tokenizer/tokenizer.py", line 162, in detokenize
        return self.tokenizer.decode(token_ids)
      File "/gpt-neox/megatron/tokenizer/gpt2_tokenization.py", line 279, in decode
        text = ''.join([self.decoder[token] for token in tokens])
      File "/gpt-neox/megatron/tokenizer/gpt2_tokenization.py", line 279, in <listcomp>
        text = ''.join([self.decoder[token] for token in tokens])
    KeyError: 50269
    (the same KeyError: 50269 traceback is printed two more times by the other ranks)
    Killing subprocess 118
    Killing subprocess 119
    Killing subprocess 120
    Killing subprocess 121
    Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 4, "optimizer": {"type": "adam", "params": {"lr": 0.00016, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, 
"overlap_comm": true, "reduce_scatter":true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "zero_allow_untested_optimizer": true}', '--megatron_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 8, "gradient_accumulation_steps": 4, "optimizer": {"type": "adam", "params": {"lr": 0.00016, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "zero_allow_untested_optimizer": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "lr_decay_style": "cosine", "lr_decay_iters": 160000, "zero_stage": 1, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "data_path": "data/code/code_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"text_generation.yml": "# Parameters used for text generation\\n# Make sureloadis specified somewhere else\\n{\\n # Text gen type:input-file,unconditionalorinteractive\\n \\"text-gen-type\\": \\"interactive\\",\\n \\n # Params for all\\n \\"maximum_tokens\\": 256,\\n \\"temperature\\": 0.5,\\n \\"top_p\\": 0.0,\\n \\"top_k\\": 0,\\n \\"recompute\\": false,\\n \\n #unconditional: samples\\n \\"num-samples\\": 10,\\n\\n # input/output file\\n \\"sample-input-file\\": \\"sample_input.txt\\",\\n \\"sample-output-file\\": \\"sample_output.txt\\",\\n}", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n \\"data-path\\": \\"data/code/code_text_document\\",\\n \\n # or for weighted datasets: \\n # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"train-data-weights\\": [1., 2.],\\n # \\"test-data-weights\\": [2., 1.],\\n # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
\\n # WARNING: setting this to True will override any user provided weights\\n # \\"weight_by_num_documents\\": false,\\n # \\"weighted_sampler_alpha\\": 0.3,\\n\\n \\"vocab-file\\": \\"data/code-vocab.json\\",\\n \\"merge-file\\": \\"data/code-merges.txt\\",\\n\\n \\"save\\": \\"checkpoints\\",\\n \\"load\\": \\"checkpoints\\",\\n \\"checkpoint_validation_with_forward_pass\\": False,\\n \\n \\"tensorboard-dir\\": \\"tensorboard\\",\\n \\"log-dir\\": \\"logs\\",\\n \\"use_wandb\\": True,\\n \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n \\"wandb_project\\": \\"neox\\"\\n}", "2-7B.yml": "# GPT-2 pretraining setup\\n{\\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n # across the node boundaries )\\n \\"pipe-parallel-size\\": 1,\\n \\"model-parallel-size\\": 1,\\n\\n # model settings\\n \\"num-layers\\": 32,\\n \\"hidden-size\\": 2560,\\n \\"num-attention-heads\\": 32,\\n \\"seq-length\\": 2048,\\n \\"max-position-embeddings\\": 2048,\\n \\"norm\\": \\"layernorm\\",\\n \\"pos-emb\\": \\"rotary\\",\\n \\"no-weight-tying\\": true,\\n\\n # these should provide some speedup but takes awhile to build, set to true if desired\\n \\"scaled-upper-triang-masked-softmax-fusion\\": true,\\n \\"bias-gelu-fusion\\": true,\\n\\n # optimizer settings\\n \\"zero_allow_untested_optimizer\\": true,\\n \\"optimizer\\": {\\n \\"type\\": \\"adam\\",\\n \\"params\\": {\\n \\"lr\\": 0.00016,\\n \\"betas\\": [0.9, 0.999],\\n \\"eps\\": 1.0e-8,\\n }\\n },\\n \\"zero_optimization\\": {\\n \\"stage\\": 1,\\n \\"allgather_partitions\\": True,\\n \\"allgather_bucket_size\\": 500000000,\\n \\"overlap_comm\\": True,\\n \\"reduce_scatter\\": True,\\n \\"reduce_bucket_size\\": 500000000,\\n \\"contiguous_gradients\\": True,\\n \\"cpu_offload\\": False\\n },\\n\\n # batch / data settings\\n \\"train_micro_batch_size_per_gpu\\": 8,\\n \\"gradient_accumulation_steps\\": 4,\\n \\"data-impl\\": \\"mmap\\",\\n \\"split\\": \\"989,10,1\\",\\n\\n # activation checkpointing\\n \\"checkpoint-activations\\": true,\\n \\"checkpoint-num-layers\\": 1,\\n \\"partition-activations\\": true,\\n \\"synchronize-each-layer\\": true,\\n\\n # regularization\\n \\"gradient_clipping\\": 1.0,\\n \\"weight-decay\\": 0,\\n \\"hidden-dropout\\": 0,\\n \\"attention-dropout\\": 0,\\n\\n # precision settings\\n \\"fp16\\": { \\n \\"fp16\\": true,\\n \\"enabled\\": true,\\n \\"loss_scale\\": 0,\\n \\"initial_scale_power\\": 16,\\n \\"loss_scale_window\\": 1000,\\n \\"hysteresis\\": 2,\\n \\"min_loss_scale\\": 1\\n },\\n\\n # misc. 
training settings\\n \\"train-iters\\": 160000,\\n \\"lr-decay-iters\\": 160000,\\n \\"distributed-backend\\": \\"nccl\\",\\n \\"lr-decay-style\\": \\"cosine\\",\\n \\"warmup\\": 0.01,\\n \\"save-interval\\": 1000,\\n \\"eval-interval\\": 1000,\\n \\"eval-iters\\": 10,\\n\\n # logging\\n \\"log-interval\\": 100,\\n \\"steps_per_print\\": 10,\\n \\"keep-last-n-checkpoints\\": 1,\\n \\"wall_clock_breakdown\\": true,\\n}\\n"}, "load": "checkpoints", "save_interval": 1000, "batch_size": 8, "train_iters": 160000, "eval_iters": 10, "keep_last_n_checkpoints": 1, "split": "989,10,1", "vocab_file": "data/code-vocab.json", "merge_file": "data/code-merges.txt", "attention_dropout": 0, "hidden_dropout": 0, "weight_decay": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 4, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "9j43ZWqmpkAaRAbvTjSFUt_2hatssyt", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "text_gen_type": "interactive", "temperature": 0.5, "maximum_tokens": 256, "sample_input_file": "sample_input.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "global_num_gpus": 4}']' returned non-zero exit status 1. mchorse@f4a108abb6e6:/gpt-neox$
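
    The KeyError itself comes from the detokenization step shown in the trace: gpt2_tokenization.py rebuilds the output text by looking each generated token id up in its id-to-string decoder table, so any generated id that is missing from that table (here 50269) raises immediately. A toy sketch of that failure mode, using made-up stand-in values rather than the real vocabulary:

        # Toy sketch of the failure in gpt2_tokenization.decode: every generated id must be a
        # key of the decoder table. The table and ids below are stand-ins, not the real vocab.
        decoder = {0: "def", 1: " return", 2: " 1"}
        generated_tokens = [0, 1, 50269]  # 50269 is not a key in the table
        text = ''.join([decoder[token] for token in generated_tokens])  # raises KeyError: 50269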

    opened by pallabganai 2
  • CUDA out of memory error on training

    CUDA out of memory error on training

    I was trying to train PolyCoder on the preconfigured dataset, starting from the checkpoint checkpoints-2-7B. I used the following command as per the instructions in the repo (only changing the configs as appropriate):

    sudo python ./deepy.py train.py -d configs 2-7B.yml local_setup.yml

    which gave the following error:

    RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 23.70 GiB total capacity; 20.49 GiB already allocated; 1.74 GiB free; 20.50 GiB reserved in total by PyTorch)

    Interestingly, the full ~25 GB of our GPU's memory is free, according to nvidia-smi.

    I tried updating the batch size; the only place I found to set it in the config files was train_micro_batch_size_per_gpu: 8 in 2-7B.yml.

    It was 8; I changed it to 4 and then to 1, but in both cases I got the same error.

    I am running all of this in Docker, as per the containerized setup instructions.

    Appreciate any help!
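
    For reference, the DeepSpeed settings printed in the logs higher up on this page tie the batch-size knobs together as effective batch size = micro-batch per GPU x gradient-accumulation steps x number of GPUs; train_micro_batch_size_per_gpu is the main knob for per-GPU activation memory, while gradient_accumulation_steps and the GPU count mainly change the effective global batch size. A small sketch with the values from the 2-GPU log above, as an illustration only:

        # Sketch: how the logged DeepSpeed batch settings relate to each other.
        # Values are taken from the 2-GPU log earlier on this page, not from this machine.
        train_micro_batch_size_per_gpu = 8   # set in 2-7B.yml
        gradient_accumulation_steps = 4      # set in 2-7B.yml
        world_size = 2                       # number of GPUs in that log
        train_batch_size = (train_micro_batch_size_per_gpu
                            * gradient_accumulation_steps
                            * world_size)
        print(train_batch_size)  # 64, matching the logged train_batch_size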

    opened by AftabHussain 3
Owner
Vincent Hellendoorn (AI4SE Researcher, Assistant Professor at CMU)