Overview

GPT-NeoX

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger. This repository is under development and may change rapidly without warning.

Requirements

Everything you need to get started running the code can be installed via pip:

$ pip install -r requirements.txt

Important: This codebase does not install Microsoft's DeepSpeed library. It installs DeeperSpeed, EleutherAI's variant of the original DeepSpeed. We have added some necessary functionality for our purposes and patched holes created by the fact that only parts of DeepSpeed were publicly released, but DeeperSpeed uses the same namespace as DeepSpeed and may break other code built upon DeepSpeed. If you use or suspect you might use Microsoft's DeepSpeed for another project, we strongly recommend you use Anaconda to install this code in an isolated environment by creating a conda environment and running conda install --file requirements.txt. We welcome any suggestions for improvements to our DeeperSpeed library, but please open issues on its repo rather than this one.
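
For example, a minimal isolated setup might look like the following (the environment name and Python version are illustrative):

$ conda create -n gpt-neox python=3.8
$ conda activate gpt-neox
$ conda install --file requirements.txt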

EleutherAI members who wish to run models on our Kubernetes cluster will additionally need to install Kubernetes and obtain authorization from Stella Biderman or Sid Black. Please reach out on Discord in the #gpt-neo channel. You will also need to create a WandB account and share your username so that you can be added to the organization's WandB account.

Running the code

The core anatomy of a call to the DeepSpeed engine is the following

$ deepspeed --hostfile=host_path train_script.py user_args \
	--deepspeed \
	--deepspeed_config deepspeed_config.json

where

  • host_path (optional) is the path to the host file containing the addresses of the machines you wish to train on.
  • train_script.py is the training script you wish to use. Our main training script is train_pipeline.py.
  • deepspeed_config.json is the json file containing DeepSpeed-specific hyperparameters.
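
As a rough illustration, a minimal deepspeed_config.json might look like the sketch below; the keys shown also appear in the example configs quoted later on this page, and the values are placeholders rather than recommendations:

{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 32,
  "optimizer": {
    "type": "Adam",
    "params": {"lr": 0.0001, "betas": [0.9, 0.95], "eps": 1e-8}
  },
  "fp16": {"enabled": true},
  "gradient_clipping": 1.0,
  "zero_optimization": {"stage": 1}
}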

In this repository, we provide a lightweight wrapper for the above function call for two main reasons: first, we find the way the arguments are ordered and used somewhat counterintuitive, and second, our wrapper automatically uploads logging data to WandB. Everything in this repository will work with both the native DeepSpeed command and with our deepy command. The core anatomy of a deepy call is

$ ./deepy --hostfile=host_path train_script.py deepspeed_config.json
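
For instance, a run using the main training script and a hostfile stored in the default location might look like this (the config filename is illustrative):

$ ./deepy --hostfile=~/jobs/hostfile train_pipeline.py deepspeed_config.json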

Running the code locally

This code is set up to run automatically on as many GPUs as are available. If you have multiple GPUs and only wish to make use of some of them, you can find information about how to specify which GPU(s) to use in training here.

The most common pitfall for local training is pipeline parallelism. Pipeline parallelism partitions the model into segments (called PipelineModules in our code) that can decrease latency by running partially asynchronously.
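
For reference, the number of pipeline stages is controlled in the example configs by the pipe-parallel-size setting (for example, the bundled 20B configuration uses 2); if pipeline parallelism causes problems for a local run, this is presumably the knob to adjust. A purely illustrative excerpt:

# illustrative NeoX-style config excerpt; values are examples, not recommendations
"pipe-parallel-size": 2,
"model-parallel-size": 2,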

Running the code on a server

This code is set up to run automatically on as many GPUs as are available. To run across multiple machines, you need to make use of a hostfile, which lists the IP address of each machine you wish to run the code on followed by the number of GPUs to use. For example, 123.45.67.890 slots=8 instructs the code to run on all eight GPUs of the machine at 123.45.67.890. Each machine should be listed on a separate line with no end-of-line punctuation. It is officially recommended that you set up passwordless ssh, but we have had success entering the password at run-time. To have your hostfile used by GPT-NeoX automatically, store it at ~/jobs/hostfile. Otherwise, you can provide it as an argument as shown above.
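
For example, a hostfile for two eight-GPU machines might look like this (the addresses are placeholders):

123.45.67.890 slots=8
123.45.67.891 slots=8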

EleutherAI members: Once you have been granted access to the EleutherAI servers and have confirmed that an unused cluster is currently running, simply ssh into the cluster. If you have been granted the ability to create and destroy Kubernetes clusters, run kubernetes/deploy_k8s.sh branch_name num_pods cluster_name to create a cluster.
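
For example, a call to create a four-pod cluster might look like this (the branch, pod count, and cluster name are hypothetical):

$ kubernetes/deploy_k8s.sh main 4 my-test-cluster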

~/scripts/

The directory ~/scripts/ stores various scripts for automatically starting runs with particular settings and configs that we have found useful. They can be run using sh scripts/script_name.sh but should not be relied upon. We do not guarantee forward compatibility of any scripts.

Datasets

Tokenizers

Using our data

Using your data

Advanced Options

Contribute

If you want to get involved, check out our repo projects. Anything that is listed as "todo" or has not been assigned to anyone is fair game, but please leave a comment so that we know you're working on it!

Resources

If you have trouble getting the model to run, consider consulting this guide to installing in a GCE virtual machine. You may also find the (very sparse) DeepSpeed docs helpful.

Comments
  • Running on a single GPU

    Tried merging the checkpoints as described for single GPU: python tools/merge20b.py --input_dir ./20B_checkpoints --output_dir ./20B_checkpoints_merged

    However, I'm getting this error when generating: RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50432, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

    How can I adjust the current model to match size 50432? Or is it the other way around?

    bug 
    opened by huey2531 22
  • Clean up Neox configuration

    Clean up the neox configuration so config files can be used instead of a mishmash of files, command line args and environment variables.

    Aim:

    • All parameters can be set using passed json files
    • No parameters are repeated
    • Modify megatron's codebase as little as possible to make it easier to merge upstream megatron changes in the future.

    Nice to haves:

    Todo:

    • [x] Convert all examples and configs to new configuration
    • [ ] Create config documentation with all possible parameters
    • [x] Separate configs into: model and system
    • [x] Cast numbers to numbers in JSON (suggested by @StellaAthena)
    • [x] Calculate batch size from other params (micro_batch_per_gpu*GAS*n_gpus)
    opened by joshlk 19
  • Model and config code of an HF gpt-neox model; a conversion script.

    The modeling and configuration files are largely based on HF's gpt-j model. (I found gpt-j's architecture more similar to gpt-neox than gpt-neo, especially since it uses rotary embeddings.)

    Modifications to the original gpt-j modeling:

    • Added post-attention layernorm as ln_2.
    • Changed q_proj, k_proj, v_proj linear layers to a single qkv_proj that corresponds to gpt-neox's attention.query_key_value linear layer. And set bias=True.
    • Combined gpt-neox's and HF gpt-j's rotary embedding functions.
    • Set bias=False for lm_head.
    • Updated the computation in GPTNeoXBlock to match the two ways of computing the residual in gpt-neox, controlled by a new config argument gpt_j_residual.

    Modifications to the original gpt-j configuration:

    • Set the default value of activation_function to gelu.
    • Removed rotary_dim (so that its default value is None).
    • Added a gpt_j_residual argument (default value is False) corresponding to the two ways of computing the residual in gpt-neox.

    A conversion script:

    • that reads config files and gpt-neox's output state dict files and outputs a pretrained HF pytorch GPTNeoX model.
    • Note that weights are not loaded for two kinds of model parameters transformer.h.*.attn.bias and transformer.h.*.attn.masked_bias because they should keep their default values.

    Things that I have checked with a 1B model, which is trained basically following the default XL.yml config:

    • The above conversion script works correctly.
    • The greedy-decoding outputs by gpt-neox's inference script and HF's generate() are identical.
    • Intermediate outputs (e.g. hidden states) are almost identical when running from gpt-neox code and HF code. There are some small differences, which I think are caused by precision settings.

    Things that are not included in this pull request:

    • Tensorflow-related code.
    • Conversion script that considers checkpoints trained with model parallel.
    • Other model variants that might use a different type of, for example, rotary embeddings.
    • Things that haven't come to my mind.

    BTW, I haven't found a better place to put the new files so I simply created a directory huggingface under /tools.

    A table that summarizes parameters (and their shapes) in 1) a GPT-NeoX checkpoint, 2) an HF GPTNeo model, 3) an HF GPTJ model, and 4) the HF GPTNeoX model in this pull request is attached as a screenshot.

    opened by ZHAOTING 16
  • align gpt-j layernorm to hf

    Looking deeper into the gpt-j residual implementation, I found a delta in the way the layernorm(s) are applied. I don't see the point in applying two separate layer norm modules to the hidden_states (x).

    Compare the HF implementation. https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279

    Is there a reason for having two layernorms? Am I completely off?

    opened by sweinbach 15
  • 13B Model Out of Memory with Single Node 8 A100 GPUs

    Hi!

    Thanks for your contribution in making this repo available :)

    I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but was unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B parameters. What is required for this? I also tried DeepSpeed stage 3 with offload without using pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!

    opened by benathi 14
  • Add support for Flash attention

    This PR adds Tri Dao's Flash Attention as an optional backend for the global attention operation, enabled by setting the attention_config to [[["flash"], ...]]. I've tested the changes in my own environment and consistently see a 2x boost for 4K sequence lengths in models ranging from 100M - 3B parameters.
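
    As a sketch, assuming the standard NeoX shorthand for repeating a per-layer attention pattern (the layer count here is just an example), enabling Flash attention for every layer might look like:

    # hypothetical config excerpt; 44 is an example layer count
    "attention_config": [[["flash"], 44]],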

    Maybe relevant: @tridao @lucidrains

    opened by VHellendoorn 13
  • Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together

    Pipeline parallelism and gradient checkpointing both work when you use them individually. However, when you turn them both on, you get a mysterious KeyError: 0 from somewhere deep in DeepSpeed.

    bug 
    opened by StellaAthena 12
  • distributed training with multiple nodes.

    Hi, I want to train a 13B model on 4 nodes, each with 8 A100 GPUs, but I don't know how to run the code on my cluster. Can you show me an example? I have only run it successfully on a single node.

    bug 
    opened by cdj0311 11
  • fix alibi inference shapes for cached layer_past

    Restart the now reverted previous fix.

    History:

    • Old PR was merged, after which some (small?) differences in model output became apparent in discussion with @sdtblck
    • Old PR was reverted
    • This PR is opened to discuss the issue

    Tests and validations so far:

    Inference was tested on a trained neox checkpoint (model size ~4B, 60k steps trained). Random sampling is deactivated (no top_k, top_p, temperature).

    1. test NOT using recompute, i.e. with cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the model is working => generated text seems far from random (though not a fairytale :-) )

    2. test using recompute, i.e. with NO cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => The generated text is exactly the same as above

    Tests and validations that resulted in discussions.

    The following line has been added to text_generation_utils.py to print logits (screenshot of the change attached).

    1. test NOT using recompute, logit output

    (copy pasted from terminal) Context prompt >>> Once upon a time generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16) generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(34.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.4062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.5312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.3125, device='cuda:0', dtype=torch.float16) generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.7188, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.2812, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.4062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.4688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.3750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46., device='cuda:0', dtype=torch.float16) generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.9062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(30480, 
device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.8125, device='cuda:0', dtype=torch.float16) Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the generated text is the same as above

    1. test using recompute, logit output (copy pasted from terminal) Context prompt >>> Once upon a time generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.9062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(35.0312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(40.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.3125, device='cuda:0', dtype=torch.float16) generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.2812, device='cuda:0', dtype=torch.float16) generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.6875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.4688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.3750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.3438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.2188, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', 
dtype=torch.float16) generated_tokens tensor(30480, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', dtype=torch.float16) Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => Generated text still the same => Logits are different. This is the question at hand

    Questions

    • Why are logits different?
    • Does the difference only occur for alibi? Is this a general issue if an issue at all?
    opened by sweinbach 11
  • Running through Dockerfile broken

    Describe the bug: When using an image based on the provided Dockerfile and running the quick start steps (download enron data, run deepy.py), execution crashes before training begins.

    To Reproduce: Steps to reproduce the behavior:

    1. Build an image using the provided Dockerfile
    2. Run said image, mounting 8 RTX8000 GPUs
    3. Fetch enron data using the prepare_dataset.py script
    4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
    5. The code crashes with a non-descript NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

    Expected behavior: Training starts, or a specific error is provided.

    Proposed solution: The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know if this setup works out-of-the-box for others; in that case, it might be my resources or CUDA version.

    Environment (please complete the following information):

    • GPUs: 8 RTX8000 GPUs
    • Configs: Ubuntu 20.04, Cuda 11.2

    bug 
    opened by VHellendoorn 11
  • Create experiment runners

    We will want to run experiments with a variety of configs and options. To enable this, we need two things:

    • [ ] configs files that we can use to specify the settings for a particular run
    • [ ] an experiment runner for managing and automatically executing several runs
    feature request good first issue 
    opened by StellaAthena 11
  • In interactive mode prompt length more than one word causes to crash

    Describe the bug: In interactive mode, a prompt longer than one word causes a crash. When I type just one word, it generates text.

    text_generation.yml

    ` { "text-gen-type": "interactive", "maximum_tokens": 500, "temperature": 0.9, "top_p": 0, "top_k": 0, "recompute": false, "num-samples": 10, "sample-input-file": "prompt.txt", "sample-output-file": "sample_output.txt", }

    `

    `Context prompt >>> Hello from Traceback (most recent call last): Traceback (most recent call last): File "generate.py", line 89, in File "generate.py", line 89, in main() File "generate.py", line 72, in main main() File "generate.py", line 72, in main generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy

    IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] Killing subprocess 7479 Killing subprocess 7480 Killing subprocess 7481 Killing subprocess 7482 Traceback (most recent call last): File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in main() File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2}', '--megatron_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2, "precision": "fp16", "num_layers": 44, "hidden_size": 6144, "num_attention_heads": 64, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "rotary_pct": 0.25, "init_method": "small_init", "output_layer_init_method": "wang_init", "gpt_j_residual": true, "gpt_j_tied": true, "output_layer_parallelism": "column", "lr_decay_style": "cosine", "lr_decay_iters": 150000, "min_lr": 9.7e-06, "optimizer_type": "Adam", "zero_stage": 1, 
"zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 1260000000, "zero_allgather_bucket_size": 1260000000, "lr": 9.7e-05, "tokenizer_type": "HFTokenizer", "data_path": "./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document", "data_impl": "mmap", "save": "./20B_checkpoints", "config_files": {"20B.yml": "# DISCLAIMER: This is the configuration file for the GPT-NeoX-20B model as it was trained on 96x 40GB A100\n# GPUs. Depending on your system configuration, you may need to change some parameters in order to fit\n# the model in memory.\n\n{\n # Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in\n \"vocab-file\": \"./20B_checkpoints/20B_tokenizer.json\",\n \"save\": \"./20B_checkpoints\",\n \"load\": \"./20B_checkpoints\",\n\n # If finetuning, edit the following to the location of your finetuning dataset:\n \"data-path\": \"./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document\",\n\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 2,\n \"model-parallel-size\": 2,\n\n # model settings\n \"num-layers\": 44,\n \"hidden-size\": 6144,\n \"num-attention-heads\": 64,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"rotary_pct\": 0.25,\n \"no-weight-tying\": true,\n \"gpt_j_residual\": true,\n \"gpt_j_tied\": true,\n \"output_layer_parallelism\": \"column\",\n \"scaled-upper-triang-masked-softmax-fusion\": true,\n \"bias-gelu-fusion\": true,\n\n # init methods\n \"init_method\": \"small_init\",\n \"output_layer_init_method\": \"wang_init\",\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.97e-4,\n \"betas\": [0.9, 0.95],\n \"eps\": 1.0e-8,\n }\n },\n\n \"min_lr\": 0.97e-5,\n\n # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n \"zero_optimization\": {\n \"stage\": 1,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 1260000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 1260000000,\n \"contiguous_gradients\": True,\n },\n\n # batch / data settings (assuming 96 GPUs)\n \"train_micro_batch_size_per_gpu\": 4,\n \"gradient_accumulation_steps\": 32,\n \"data-impl\": \"mmap\",\n \"split\": \"995,4,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": false,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.01,\n \"hidden-dropout\": 0,\n \"attention-dropout\": 0,\n\n # precision settings\n \"fp16\": {\n \"fp16\": true,\n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_scale_power\": 12,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. 
training settings\n \"train-iters\": 150000,\n \"lr-decay-iters\": 150000,\n\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"checkpoint-factor\": 500,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 2,\n \"steps_per_print\": 2,\n \"wall_clock_breakdown\": false,\n\n ### NEW DATA: ####\n \"tokenizer_type\": \"HFTokenizer\",\n \"tensorboard-dir\": \"./tensorboard\",\n \"log-dir\": \"./logs\",\n\n}\n", "text_generation_interactive.yml": "# Parameters used for text generation\n# Make sure load is specified somewhere else\n{\n # Text gen type: input-file, unconditional or interactive\n \"text-gen-type\": \"interactive\",\n\n # Params for all\n \"maximum_tokens\": 500,\n \"temperature\": 0.9,\n \"top_p\": 0,\n \"top_k\": 0,\n \"recompute\": false,\n\n # unconditional: samples\n \"num-samples\": 10,\n\n # input/output file\n \"sample-input-file\": \"prompt.txt\",\n \"sample-output-file\": \"sample_output.txt\",\n}\n"}, "load": "./20B_checkpoints", "checkpoint_factor": 500, "batch_size": 4, "train_iters": 150000, "eval_iters": 10, "split": "995,4,1", "vocab_file": "./20B_checkpoints/20B_tokenizer.json", "attention_dropout": 0, "hidden_dropout": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "gas": 32, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "model_parallel_size": 2, "is_pipe_parallel": true, "wandb_group": "72fp5jTbC3iYzFUHnE9Fh2_35wkyadj", "log_dir": "./logs", "tensorboard_dir": "./tensorboard", "log_interval": 2, "text_gen_type": "interactive", "temperature": 0.9, "maximum_tokens": 500, "sample_input_file": "prompt.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "save_iters": [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, 10500, 11000, 11500, 12000, 12500, 13000, 13500, 14000, 14500, 15000, 15500, 16000, 16500, 17000, 17500, 18000, 18500, 19000, 19500, 20000, 20500, 21000, 21500, 22000, 22500, 23000, 23500, 24000, 24500, 25000, 25500, 26000, 26500, 27000, 27500, 28000, 28500, 29000, 29500, 30000, 30500, 31000, 31500, 32000, 32500, 33000, 33500, 34000, 34500, 35000, 35500, 36000, 36500, 37000, 37500, 38000, 38500, 39000, 39500, 40000, 40500, 41000, 41500, 42000, 42500, 43000, 43500, 44000, 44500, 45000, 45500, 46000, 46500, 47000, 47500, 48000, 48500, 49000, 49500, 50000, 50500, 51000, 51500, 52000, 52500, 53000, 53500, 54000, 54500, 55000, 55500, 56000, 56500, 57000, 57500, 58000, 58500, 59000, 59500, 60000, 60500, 61000, 61500, 62000, 62500, 63000, 63500, 64000, 64500, 65000, 65500, 66000, 66500, 67000, 67500, 68000, 68500, 69000, 69500, 70000, 70500, 71000, 71500, 72000, 72500, 73000, 73500, 74000, 74500, 75000, 75500, 76000, 76500, 77000, 77500, 78000, 78500, 79000, 79500, 80000, 80500, 81000, 81500, 82000, 82500, 83000, 83500, 84000, 84500, 85000, 85500, 86000, 86500, 87000, 87500, 88000, 88500, 89000, 89500, 90000, 90500, 91000, 91500, 92000, 92500, 93000, 93500, 94000, 94500, 95000, 95500, 96000, 96500, 97000, 97500, 98000, 98500, 99000, 99500, 100000, 100500, 101000, 101500, 102000, 102500, 103000, 103500, 104000, 104500, 105000, 105500, 106000, 106500, 107000, 107500, 108000, 108500, 109000, 109500, 110000, 110500, 111000, 111500, 112000, 112500, 113000, 113500, 114000, 114500, 115000, 115500, 116000, 116500, 117000, 117500, 118000, 118500, 119000, 119500, 120000, 120500, 121000, 121500, 122000, 122500, 123000, 123500, 124000, 124500, 
125000, 125500, 126000, 126500, 127000, 127500, 128000, 128500, 129000, 129500, 130000, 130500, 131000, 131500, 132000, 132500, 133000, 133500, 134000, 134500, 135000, 135500, 136000, 136500, 137000, 137500, 138000, 138500, 139000, 139500, 140000, 140500, 141000, 141500, 142000, 142500, 143000, 143500, 144000, 144500, 145000, 145500, 146000, 146500, 147000, 147500, 148000, 148500, 149000, 149500], "global_num_gpus": 4}']' returned non-zero exit status 1. mchorse@473650d01`

    bug 
    opened by ahmedavid 0
  • Upstream DeepSpeed -> HF checkpoint conversion script update

    This PR fixes #750. Upstream DeepSpeed saves checkpoints in a new layout which is incompatible with the old conversion script. This makes convert_to_hf.py work with upstream DeepSpeed, and leaves legacy_convert_to_hf.py for conversion from DeeperSpeed.

    Draft for now because I need to test on a real model.

    opened by haileyschoelkopf 0
  • Upstream DeepSpeed breaks HF conversion script

    The tools/convert_to_hf.py script will need to be updated / a different version may need to be created for checkpoints saved with DeepSpeed. Checkpoints are no longer saved layer-by-layer, it seems, and now all weights are in several mp_rank_{MP_RANK}_model_states.pt files for each Model Parallel partition.

    Upstream DeepSpeed checkpoint:

    drwxr-xr-x 2 hailey eleuther     33280 Dec 18 14:29 configs
    -rw-r--r-- 1 hailey eleuther 810771646 Dec 18 14:29 mp_rank_00_model_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_1_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_2_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_3_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_4_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_5_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_6_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_7_mp_rank_00_optim_states.pt
    

    DeeperSpeed checkpoint:

    drwxrwxrwx 2 hailey eleuther     33280 Nov 18 04:55 configs
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_00-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_02-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_03-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_04-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_05-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_06-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_07-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_08-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_09-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_10-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_11-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_12-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_13-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_14-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_15-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_16-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_17-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_18-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_19-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_20-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_21-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_22-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_23-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_24-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_25-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther      9127 Nov 18 04:55 layer_27-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_28-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther     16291 Nov 18 04:55 mp_rank_00_model_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_10_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_11_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_12_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_13_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_14_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_15_mp_rank_00_optim_states.pt
    ...
    

    Updating the script shouldn't be too hard at all though.

    bug 
    opened by haileyschoelkopf 0
  • Issue deploying GPT-NeoX-20b on AWS Sagemaker with Jupyter Notebook

    Describe the bug: I get the following error when trying to use predictor.predict(data) on AWS SageMaker.

    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
      "code": 400,
      "type": "InternalServerException",
      "message": "Could not load model /.sagemaker/mms/models/EleutherAI__gpt-neox-20b with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM\u0027\u003e)."
    }
    

    To Reproduce: Steps to reproduce the behavior:

    1. Create Dockerfile
    2. Add the following into Dockerfile
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
    RUN pip install --upgrade 'transformers==4.25.1'
    RUN pip install --upgrade 'torch==1.13.0'
    
    3. Build the image via the command docker build -t gpt-neox . in the directory the Dockerfile is in
    4. Create a file named dockerize.sh
    5. Add the following content into the file
    %%sh
    
    # Specify an algorithm name
    algorithm_name=gpt-neox
    
    account=$(aws sts get-caller-identity --query Account --output text)
    
    # Get the region defined in the current configuration (default to us-west-2 if none defined)
    region=$(aws configure get region)
    
    fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
    
    # If the repository doesn't exist in ECR, create it.
    
    aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
    if [ $? -ne 0 ]
    then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
    fi
    
    # Log into Docker
    aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
    
    # Build the docker image locally with the image name and then push it to ECR
    # with the full name.
    
    docker build -t ${algorithm_name} .
    docker tag ${algorithm_name} ${fullname}
    
    docker push ${fullname}
    
    6. Run command docker login (you need the docker cli)
    7. Execute the shell script file (you need the aws cli)
    8. Open Jupyter Notebook
    9. Add the following to the Jupyter Notebook
    %pip install sagemaker
    %pip install boto3
    
    from sagemaker.huggingface import HuggingFaceModel
    import boto3
    
    client=boto3.client('sts')
    account=client.get_caller_identity()['Account']
    
    my_session=boto3.session.Session()
    region=my_session.region_name
    
    algorithm_name="gpt-neox"
    tag="latest"
    ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, algorithm_name, tag)
    
    role = 'SageMaker'
    
    hub = {
        'HF_MODEL_ID':'EleutherAI/gpt-neox-20b',
        'HF_TASK':'text-generation'
    }
    
    huggingface_model = HuggingFaceModel(
        image_uri=ecr_image,
        env=hub,
        role=role,
    #     transformers_version="4.17", these are not needed anymore
    #     pytorch_version="1.10",
    #     py_version="py38",
    )
    
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge"
    )
    
    10. Run this final command in Jupyter Notebook once predictor is done.
    predictor.predict({
        "inputs": "The weather is"
    })
    
    11. See error

    Expected behavior: I expect to see generated output from the input, using the 20-billion-parameter pretrained EleutherAI model.

    Proposed solution: I suspect I could fix this issue if I ditched the Hugging Face SageMaker library altogether. Also, the model hasn't been updated in the last 8 months, so I'm not sure if that is the cause.

    Environment (please complete the following information):

    • GPUs: none
    • Configs: unsure

    Additional context: I have tried other GPT-Neo variants like 125M and 2.7B and those have worked perfectly. The reason I need to extend the Docker container for AWS is to avoid another error, which is apparently caused by the version of transformers (4.17?) on the default Docker image not being up to date enough.

    bug 
    opened by BjornTheProgrammer 1
  • Model ckpts from `DeeperSpeed` cannot be loaded using `deepspeed_main`/upstream DeepSpeed

    Describe the bug

    Using DeeperSpeed-trained model checkpoints (git+https://github.com/EleutherAI/DeeperSpeed.git@eb7f5cff36678625d23db8a8fe78b4a93e5d2c75#egg=deepspeed), loading them raises an error when trying to use the deepspeed_main branch with upstream DeepSpeed.

    To Reproduce Steps to reproduce the behavior:

    Train a model from the main branch using DeeperSpeed (or download a model checkpoint from s-eai-neox/pythia/1.3B/global_step71500).

    Try to load this checkpoint using the deepspeed_main branch and upstream DeepSpeed (for either training or evaluation); this gives the following error:

    Traceback (most recent call last):
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 76, in <module>
        main()
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 35, in main
        model, neox_args = setup_for_inference_or_eval(use_cache=False)
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
        model, _, _ = setup_model_and_optimizer(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/training.py", line 437, in setup_model_and_optimizer
        neox_args.iteration = load_checkpoint(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/checkpointing.py", line 235, in load_checkpoint
        checkpoint_name, state_dict = model.load_checkpoint(
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2647, in load_checkpoint
        load_path, client_states = self._load_checkpoint(load_dir,
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2713, in _load_checkpoint
        self.load_module_state_dict(state_dict=checkpoint['module'],
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2507, in load_module_state_dict
        self.module.load_state_dict(state_dict, # TODO
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1620, in load_state_dict
        raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
    TypeError: Expected state_dict to be dict-like, got <class 'NoneType'>.
    

    This gives the above traceback and checkpoint loading fails.

    Expected behavior: The checkpoints should ideally be loadable by either DeepSpeed version.

    Proposed solution: This could be an issue with DeepSpeed checkpoint formats changing over the course of 4 versions; not sure yet.

    Additional context: Relevant to merging #663, since we have checkpoints trained with DeeperSpeed that we want to use.

    cc @Quentin-Anthony @dashstander @StellaAthena

    bug 
    opened by haileyschoelkopf 3
Releases (legacy_gptj_residual.1.0.0)
Owner
EleutherAI
Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 1, 2023
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

Princeton Natural Language Processing 92 Dec 27, 2022
Seonghwan Kim 24 Sep 11, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 5, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

Jeffrey M. Binder 20 Jan 9, 2023
Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

SpeechMix Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together. Introduction For the same input: from datas

Eric Lam 31 Nov 7, 2022
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 1.2k Jan 8, 2023
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 903 Feb 17, 2021
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 7, 2023
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 2.5k Feb 17, 2021
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

Víctor Gallego 276 Dec 31, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Honor's thesis project analyzing whether the GPT-2 model can more effectively generate free-verse or structured poetry.

gpt2-poetry The following code is for my senior honor's thesis project, under the guidance of Dr. Keith Holyoak at the University of California, Los A

Ashley Kim 2 Jan 9, 2022
🕹 An esoteric language designed so that the program looks like the transcript of a Pokémon battle

PokéBattle is an esoteric language designed so that the program looks like the transcript of a Pokémon battle. Original inspiration and specification

Eduardo Correia 9 Jan 11, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

What is this fork of Megatron-LM and Megatron-DeepSpeed This is a detached fork of https://github.com/microsoft/Megatron-DeepSpeed, which in itself is

BigScience Workshop 316 Jan 3, 2023