Overview

GPT-NeoX

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger. This repository is under development and may change rapidly without warning.

Requirements

Everything you need to get started running the code can be installed via pip:

$ pip install -r requirements.txt

Important: This codebase does not install Microsoft's DeepSpeed library. It installs DeeperSpeed, EleutherAI's variant of the original DeepSpeed. We have added some necessary functionality for our purposes and patched holes created by the fact that only parts of DeepSpeed were publicly released, but DeeperSpeed uses the same namespace as DeepSpeed and may break other code built upon DeepSpeed. If you use or suspect you might use Microsoft's DeepSpeed for another project, we strongly recommend you use Anaconda to install this code in an isolated environment by creating a conda environment and running conda install --file requirements.txt. We welcome any suggestions for improvements to our DeeperSpeed library, but please open issues on its repo rather than this one.
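
For example, a minimal isolated setup might look like the following (the environment name and Python version are illustrative):

$ conda create -n gpt-neox python=3.8
$ conda activate gpt-neox
$ conda install --file requirements.txt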

EleutherAI members who wish to run models on our Kubernetes cluster will additionally need to install Kubernetes and obtain authorization from Stella Biderman or Sid Black. Please reach out on Discord in the #gpt-neo channel. You will also need to create a WandB account and share your username so that you can be added to the organization's WandB account.

Running the code

The core anatomy of a call to the DeepSpeed engine is the following

$ deepspeed --hostfile=host_path train_script.py user_args \
	--deepspeed \
	--deepspeed_config deepspeed_config.json

where

  • host_path (optional) is the path to the host file containing the addresses of the machines you wish to train on.
  • train_script.py is the training script you wish to use. Our main training script is train_pipeline.py.
  • deepspeed_config.json is the json file containing DeepSpeed-specific hyperparameters.
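
As a rough illustration, a minimal deepspeed_config.json might look like the sketch below; the keys shown also appear in the example configs quoted later on this page, and the values are placeholders rather than recommendations:

{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 32,
  "optimizer": {
    "type": "Adam",
    "params": {"lr": 0.0001, "betas": [0.9, 0.95], "eps": 1e-8}
  },
  "fp16": {"enabled": true},
  "gradient_clipping": 1.0,
  "zero_optimization": {"stage": 1}
}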

In this repository, we provide a lightweight wrapper for the above function call for two main reasons: first, we find the way the arguments are ordered and used somewhat counterintuitive, and second, our wrapper automatically uploads logging data to WandB. Everything in this repository will work with both the native DeepSpeed command and with our deepy command. The core anatomy of a deepy call is

$ ./deepy --hostfile=host_path train_script.py deepspeed_config.json
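
For instance, a run using the main training script and a hostfile stored in the default location might look like this (the config filename is illustrative):

$ ./deepy --hostfile=~/jobs/hostfile train_pipeline.py deepspeed_config.json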

Running the code locally

This code is set up to run automatically on as many GPUs as are available. If you have multiple GPUs and only wish to make use of some of them, you can find information about how to specify which GPU(s) to use in training here.

The most common pitfall for local training is pipeline parallelism. Pipeline parallelism partitions the model into segments (called PipelineModules in our code) that can decrease latency by running partially asynchronously.
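
For reference, the number of pipeline stages is controlled in the example configs by the pipe-parallel-size setting (for example, the bundled 20B configuration uses 2); if pipeline parallelism causes problems for a local run, this is presumably the knob to adjust. A purely illustrative excerpt:

# illustrative NeoX-style config excerpt; values are examples, not recommendations
"pipe-parallel-size": 2,
"model-parallel-size": 2,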

Running the code on a server

This code is set up to run automatically on as many GPUs as are available. To run across multiple machines, you need to make use of a hostfile, which lists the IP address of each machine you wish to run the code on followed by the number of GPUs to use. For example, 123.45.67.890 slots=8 instructs the code to run on all eight GPUs of the machine at 123.45.67.890. Each machine should be listed on a separate line with no end-of-line punctuation. It is officially recommended that you set up passwordless ssh, but we have had success entering the password at run-time. To have your hostfile used by GPT-NeoX automatically, store it at ~/jobs/hostfile. Otherwise, you can provide it as an argument as shown above.
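
For example, a hostfile for two eight-GPU machines might look like this (the addresses are placeholders):

123.45.67.890 slots=8
123.45.67.891 slots=8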

EleutherAI members: Once you have been granted access to the EleutherAI servers and have confirmed that an unused cluster is currently running, simply ssh into the cluster. If you have been granted the ability to create and destroy Kubernetes clusters, run kubernetes/deploy_k8s.sh branch_name num_pods cluster_name to create a cluster.
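
For example, a call to create a four-pod cluster might look like this (the branch, pod count, and cluster name are hypothetical):

$ kubernetes/deploy_k8s.sh main 4 my-test-cluster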

~/scripts/

The directory ~/scripts/ stores various scripts for automatically starting runs with particular settings and configs that we have found useful. They can be run using sh scripts/script_name.sh but should not be relied upon. We do not guarantee forward compatibility of any scripts.

Datasets

Tokenizers

Using our data

Using your data

Advanced Options

Contribute

If you want to get involved, check out our repo projects. Anything that is listed as "todo" or has not been assigned to anyone is fair game, but please leave a comment so that we know you're working on it!

Resources

If you have trouble getting the model to run, consider consulting this guide to installing in a GCE virtual machine. You may also find the (very sparse) DeepSpeed docs helpful.

Comments
  • Running on a single GPU

    Tried merging the checkpoints as described for single GPU: python tools/merge20b.py --input_dir ./20B_checkpoints --output_dir ./20B_checkpoints_merged

    However, I'm getting this error when generating: RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50432, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

    How can I adjust the current model to match size 50432? Or is it the other way around?

    bug 
    opened by huey2531 22
  • Clean up Neox configuration

    Clean up the neox configuration so config files can be used instead of a mishmash of files, command line args and environment variables.

    Aim:

    • All parameters can be set using passed json files
    • No parameters are repeated
    • Modify megatron's codebase as little as possible to make it easier to merge upstream megatron changes in the future.

    Nice to haves:

    Todo:

    • [x] Convert all examples and configs to new configuration
    • [ ] Create config documentation with all possible parameters
    • [x] Separate configs into: model and system
    • [x] Cast numbers to numbers in JSON (suggested by @StellaAthena)
    • [x] Calculate batch size from other params (micro_batch_per_gpu*GAS*n_gpus)
    opened by joshlk 19
  • Model and config code of an HF gpt-neox model; a conversion script.

    The modeling and configuration files are largely based on HF's gpt-j model. (I found gpt-j's architecture more similar to gpt-neox than gpt-neo, especially since it uses rotary embeddings.)

    Modifications to the original gpt-j modeling:

    • Added post-attention layernorm as ln_2.
    • Changed q_proj, k_proj, v_proj linear layers to a single qkv_proj that corresponds to gpt-neox's attention.query_key_value linear layer. And set bias=True.
    • Combined gpt-neox's and HF gpt-j's rotary embedding functions.
    • Set bias=False for lm_head.
    • Updated the computation in GPTNeoXBlock to match the two ways of computing the residual in gpt-neox, controlled by a new config argument gpt_j_residual.

    Modifications to the original gpt-j configuration:

    • Set the default value of activation_function to gelu.
    • Removed rotary_dim (so that its default value is None).
    • Added a gpt_j_residual argument (default value is False) corresponding to the two ways of computing the residual in gpt-neox.

    A conversion script:

    • that reads config files and gpt-neox's output state dict files and outputs a pretrained HF pytorch GPTNeoX model.
    • Note that weights are not loaded for two kinds of model parameters transformer.h.*.attn.bias and transformer.h.*.attn.masked_bias because they should keep their default values.

    Things that I have checked with a 1B model, which is trained basically following the default XL.yml config:

    • The above conversion script works correctly.
    • The greedy-decoding outputs by gpt-neox's inference script and HF's generate() are identical.
    • Intermediate outputs (e.g. hidden states) are almost identical when running from gpt-neox code and HF code. There are some small differences, which I think are caused by precision settings.

    Things that are not included in this pull request:

    • Tensorflow-related code.
    • Conversion script that considers checkpoints trained with model parallel.
    • Other model variants that might use a different type of, for example, rotary embeddings.
    • Things that haven't come to my mind.

    BTW, I haven't found a better place to put the new files so I simply created a directory huggingface under /tools.

    A table that summarizes parameters (and their shapes) in 1) a GPT-NeoX checkpoint, 2) an HF GPTNeo model, 3) an HF GPTJ model, and 4) the HF GPTNeoX model in this pull request is attached as a screenshot.

    opened by ZHAOTING 16
  • align gpt-j layernorm to hf

    Looking deeper into the gpt-j residual implementation, I found a delta in the way the layernorm(s) are applied. I don't see the point in applying two separate layer norm modules to the hidden_states (x).

    Compare the HF implementation. https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279

    Is there a reason for having two layernorms? Am I completely off?

    opened by sweinbach 15
  • 13B Model Out of Memory with Single Node 8 A100 GPUs

    Hi!

    Thanks for your contribution in making this repo available :)

    I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but was unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B parameters. What is required for this? I also tried DeepSpeed stage 3 with offload without using pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!

    opened by benathi 14
  • Add support for Flash attention

    This PR adds Tri Dao's Flash Attention as an optional backend for the global attention operation, enabled by setting the attention_config to [[["flash"], ...]]. I've tested the changes in my own environment and consistently see a 2x boost for 4K sequence lengths in models ranging from 100M - 3B parameters.
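
    As a sketch, assuming the standard NeoX shorthand for repeating a per-layer attention pattern (the layer count here is just an example), enabling Flash attention for every layer might look like:

    # hypothetical config excerpt; 44 is an example layer count
    "attention_config": [[["flash"], 44]],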

    Maybe relevant: @tridao @lucidrains

    opened by VHellendoorn 13
  • Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together

    Pipeline parallelism and gradient checkpointing both work when you use them individually. However, when you turn them both on, you get a mysterious KeyError: 0 from somewhere deep in DeepSpeed.

    bug 
    opened by StellaAthena 12
  • distributed training with multiple nodes.

    Hi, I want to train a 13B model on 4 nodes, each with 8 A100 GPUs, but I don't know how to run the code on my cluster. Can you show me an example? I have only run it successfully on a single node.

    bug 
    opened by cdj0311 11
  • fix alibi inference shapes for cached layer_past

    Restart the now reverted previous fix.

    History:

    • Old PR was merged, after which some (small?) differences in model output became apparent in discussion with @sdtblck
    • Old PR was reverted
    • This PR is opened to discuss the issue

    Tests and validations so far:

    Inference was tested on a trained neox checkpoint (model size ~4B, 60k steps trained). Random sampling is deactivated (no top_k, top_p, temperature).

    1. test NOT using recompute, i.e. with cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the model is working => generated text seems far from random (though not a fairytale :-) )

    2. test using recompute, i.e. with NO cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => The generated text is exactly the same as above

    Tests and validations that resulted in discussions.

    The following line has been added to text_generation_utils.py to print logits (screenshot of the change attached).

    1. test NOT using recompute, logit output

    (copy pasted from terminal) Context prompt >>> Once upon a time generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16) generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(34.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.4062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.5312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.3125, device='cuda:0', dtype=torch.float16) generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.7188, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.2812, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.4062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.4688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.3750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46., device='cuda:0', dtype=torch.float16) generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.9062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(30480, 
device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.8125, device='cuda:0', dtype=torch.float16) Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the generated text is the same as above

    1. test using recompute, logit output (copy pasted from terminal) Context prompt >>> Once upon a time generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.9062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(35.0312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(40.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.3125, device='cuda:0', dtype=torch.float16) generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.2812, device='cuda:0', dtype=torch.float16) generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.6875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.4688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.3750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.3438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.2188, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', 
dtype=torch.float16) generated_tokens tensor(30480, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', dtype=torch.float16) Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => Generated text still the same => Logits are different. This is the question at hand

    Questions

    • Why are logits different?
    • Does the difference only occur for alibi? Is this a general issue if an issue at all?
    opened by sweinbach 11
  • Running through Dockerfile broken

    Describe the bug: When using an image based on the provided Dockerfile and running the quick start steps (download enron data, run deepy.py), execution crashes before training begins.

    To Reproduce: Steps to reproduce the behavior:

    1. Build an image using the provided Dockerfile
    2. Run said image, mounting 8 RTX8000 GPUs
    3. Fetch enron data using the prepare_dataset.py script
    4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
    5. The code crashes with a non-descript NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

    Expected behavior: Training starts, or a specific error is provided.

    Proposed solution: The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know if this setup works out-of-the-box for others; in that case, it might be my resources or CUDA version.

    Environment (please complete the following information):

    • GPUs: 8 RTX8000 GPUs
    • Configs: Ubuntu 20.04, Cuda 11.2

    bug 
    opened by VHellendoorn 11
  • Create experiment runners

    We will want to run experiments with a variety of configs and options. To enable this, we need two things:

    • [ ] configs files that we can use to specify the settings for a particular run
    • [ ] an experiment runner for managing and automatically executing several runs
    feature request good first issue 
    opened by StellaAthena 11
  • In interactive mode prompt length more than one word causes to crash

    Describe the bug: In interactive mode, a prompt longer than one word causes a crash. When I type just one word, it generates text.

    text_generation.yml

    ` { "text-gen-type": "interactive", "maximum_tokens": 500, "temperature": 0.9, "top_p": 0, "top_k": 0, "recompute": false, "num-samples": 10, "sample-input-file": "prompt.txt", "sample-output-file": "sample_output.txt", }

    `

    `Context prompt >>> Hello from Traceback (most recent call last): Traceback (most recent call last): File "generate.py", line 89, in File "generate.py", line 89, in main() File "generate.py", line 72, in main main() File "generate.py", line 72, in main generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy

    IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] Killing subprocess 7479 Killing subprocess 7480 Killing subprocess 7481 Killing subprocess 7482 Traceback (most recent call last): File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in main() File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2}', '--megatron_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2, "precision": "fp16", "num_layers": 44, "hidden_size": 6144, "num_attention_heads": 64, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "rotary_pct": 0.25, "init_method": "small_init", "output_layer_init_method": "wang_init", "gpt_j_residual": true, "gpt_j_tied": true, "output_layer_parallelism": "column", "lr_decay_style": "cosine", "lr_decay_iters": 150000, "min_lr": 9.7e-06, "optimizer_type": "Adam", "zero_stage": 1, 
"zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 1260000000, "zero_allgather_bucket_size": 1260000000, "lr": 9.7e-05, "tokenizer_type": "HFTokenizer", "data_path": "./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document", "data_impl": "mmap", "save": "./20B_checkpoints", "config_files": {"20B.yml": "# DISCLAIMER: This is the configuration file for the GPT-NeoX-20B model as it was trained on 96x 40GB A100\n# GPUs. Depending on your system configuration, you may need to change some parameters in order to fit\n# the model in memory.\n\n{\n # Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in\n \"vocab-file\": \"./20B_checkpoints/20B_tokenizer.json\",\n \"save\": \"./20B_checkpoints\",\n \"load\": \"./20B_checkpoints\",\n\n # If finetuning, edit the following to the location of your finetuning dataset:\n \"data-path\": \"./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document\",\n\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 2,\n \"model-parallel-size\": 2,\n\n # model settings\n \"num-layers\": 44,\n \"hidden-size\": 6144,\n \"num-attention-heads\": 64,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"rotary_pct\": 0.25,\n \"no-weight-tying\": true,\n \"gpt_j_residual\": true,\n \"gpt_j_tied\": true,\n \"output_layer_parallelism\": \"column\",\n \"scaled-upper-triang-masked-softmax-fusion\": true,\n \"bias-gelu-fusion\": true,\n\n # init methods\n \"init_method\": \"small_init\",\n \"output_layer_init_method\": \"wang_init\",\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.97e-4,\n \"betas\": [0.9, 0.95],\n \"eps\": 1.0e-8,\n }\n },\n\n \"min_lr\": 0.97e-5,\n\n # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n \"zero_optimization\": {\n \"stage\": 1,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 1260000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 1260000000,\n \"contiguous_gradients\": True,\n },\n\n # batch / data settings (assuming 96 GPUs)\n \"train_micro_batch_size_per_gpu\": 4,\n \"gradient_accumulation_steps\": 32,\n \"data-impl\": \"mmap\",\n \"split\": \"995,4,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": false,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.01,\n \"hidden-dropout\": 0,\n \"attention-dropout\": 0,\n\n # precision settings\n \"fp16\": {\n \"fp16\": true,\n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_scale_power\": 12,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. 
training settings\n \"train-iters\": 150000,\n \"lr-decay-iters\": 150000,\n\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"checkpoint-factor\": 500,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 2,\n \"steps_per_print\": 2,\n \"wall_clock_breakdown\": false,\n\n ### NEW DATA: ####\n \"tokenizer_type\": \"HFTokenizer\",\n \"tensorboard-dir\": \"./tensorboard\",\n \"log-dir\": \"./logs\",\n\n}\n", "text_generation_interactive.yml": "# Parameters used for text generation\n# Make sure load is specified somewhere else\n{\n # Text gen type: input-file, unconditional or interactive\n \"text-gen-type\": \"interactive\",\n\n # Params for all\n \"maximum_tokens\": 500,\n \"temperature\": 0.9,\n \"top_p\": 0,\n \"top_k\": 0,\n \"recompute\": false,\n\n # unconditional: samples\n \"num-samples\": 10,\n\n # input/output file\n \"sample-input-file\": \"prompt.txt\",\n \"sample-output-file\": \"sample_output.txt\",\n}\n"}, "load": "./20B_checkpoints", "checkpoint_factor": 500, "batch_size": 4, "train_iters": 150000, "eval_iters": 10, "split": "995,4,1", "vocab_file": "./20B_checkpoints/20B_tokenizer.json", "attention_dropout": 0, "hidden_dropout": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "gas": 32, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "model_parallel_size": 2, "is_pipe_parallel": true, "wandb_group": "72fp5jTbC3iYzFUHnE9Fh2_35wkyadj", "log_dir": "./logs", "tensorboard_dir": "./tensorboard", "log_interval": 2, "text_gen_type": "interactive", "temperature": 0.9, "maximum_tokens": 500, "sample_input_file": "prompt.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "save_iters": [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, 10500, 11000, 11500, 12000, 12500, 13000, 13500, 14000, 14500, 15000, 15500, 16000, 16500, 17000, 17500, 18000, 18500, 19000, 19500, 20000, 20500, 21000, 21500, 22000, 22500, 23000, 23500, 24000, 24500, 25000, 25500, 26000, 26500, 27000, 27500, 28000, 28500, 29000, 29500, 30000, 30500, 31000, 31500, 32000, 32500, 33000, 33500, 34000, 34500, 35000, 35500, 36000, 36500, 37000, 37500, 38000, 38500, 39000, 39500, 40000, 40500, 41000, 41500, 42000, 42500, 43000, 43500, 44000, 44500, 45000, 45500, 46000, 46500, 47000, 47500, 48000, 48500, 49000, 49500, 50000, 50500, 51000, 51500, 52000, 52500, 53000, 53500, 54000, 54500, 55000, 55500, 56000, 56500, 57000, 57500, 58000, 58500, 59000, 59500, 60000, 60500, 61000, 61500, 62000, 62500, 63000, 63500, 64000, 64500, 65000, 65500, 66000, 66500, 67000, 67500, 68000, 68500, 69000, 69500, 70000, 70500, 71000, 71500, 72000, 72500, 73000, 73500, 74000, 74500, 75000, 75500, 76000, 76500, 77000, 77500, 78000, 78500, 79000, 79500, 80000, 80500, 81000, 81500, 82000, 82500, 83000, 83500, 84000, 84500, 85000, 85500, 86000, 86500, 87000, 87500, 88000, 88500, 89000, 89500, 90000, 90500, 91000, 91500, 92000, 92500, 93000, 93500, 94000, 94500, 95000, 95500, 96000, 96500, 97000, 97500, 98000, 98500, 99000, 99500, 100000, 100500, 101000, 101500, 102000, 102500, 103000, 103500, 104000, 104500, 105000, 105500, 106000, 106500, 107000, 107500, 108000, 108500, 109000, 109500, 110000, 110500, 111000, 111500, 112000, 112500, 113000, 113500, 114000, 114500, 115000, 115500, 116000, 116500, 117000, 117500, 118000, 118500, 119000, 119500, 120000, 120500, 121000, 121500, 122000, 122500, 123000, 123500, 124000, 124500, 
125000, 125500, 126000, 126500, 127000, 127500, 128000, 128500, 129000, 129500, 130000, 130500, 131000, 131500, 132000, 132500, 133000, 133500, 134000, 134500, 135000, 135500, 136000, 136500, 137000, 137500, 138000, 138500, 139000, 139500, 140000, 140500, 141000, 141500, 142000, 142500, 143000, 143500, 144000, 144500, 145000, 145500, 146000, 146500, 147000, 147500, 148000, 148500, 149000, 149500], "global_num_gpus": 4}']' returned non-zero exit status 1. mchorse@473650d01`

    bug 
    opened by ahmedavid 0
  • Upstream DeepSpeed -> HF checkpoint conversion script update

    This PR fixes #750. Upstream DeepSpeed saves checkpoints in a new layout which is incompatible with the old conversion script. This makes convert_to_hf.py work with upstream DeepSpeed, and leaves legacy_convert_to_hf.py for conversion from DeeperSpeed.

    Draft for now because I need to test on a real model.

    opened by haileyschoelkopf 0
  • Upstream DeepSpeed breaks HF conversion script

    The tools/convert_to_hf.py script will need to be updated / a different version may need to be created for checkpoints saved with DeepSpeed. Checkpoints are no longer saved layer-by-layer, it seems, and now all weights are in several mp_rank_{MP_RANK}_model_states.pt files for each Model Parallel partition.

    Upstream DeepSpeed checkpoint:

    drwxr-xr-x 2 hailey eleuther     33280 Dec 18 14:29 configs
    -rw-r--r-- 1 hailey eleuther 810771646 Dec 18 14:29 mp_rank_00_model_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_1_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_2_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_3_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_4_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_5_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_6_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_7_mp_rank_00_optim_states.pt
    

    DeeperSpeed checkpoint:

    drwxrwxrwx 2 hailey eleuther     33280 Nov 18 04:55 configs
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_00-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_02-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_03-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_04-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_05-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_06-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_07-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_08-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_09-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_10-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_11-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_12-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_13-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_14-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_15-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_16-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_17-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_18-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_19-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_20-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_21-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_22-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_23-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_24-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_25-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther      9127 Nov 18 04:55 layer_27-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_28-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther     16291 Nov 18 04:55 mp_rank_00_model_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_10_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_11_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_12_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_13_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_14_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_15_mp_rank_00_optim_states.pt
    ...
    

    Updating the script shouldn't be too hard at all though.

    bug 
    opened by haileyschoelkopf 0
  • Issue deploying GPT-NeoX-20b on AWS Sagemaker with Jupyter Notebook

    Describe the bug: I get the following error when trying to use predictor.predict(data) on AWS SageMaker.

    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
      "code": 400,
      "type": "InternalServerException",
      "message": "Could not load model /.sagemaker/mms/models/EleutherAI__gpt-neox-20b with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM\u0027\u003e)."
    }
    

    To Reproduce: Steps to reproduce the behavior:

    1. Create Dockerfile
    2. Add the following into Dockerfile
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
    RUN pip install --upgrade 'transformers==4.25.1'
    RUN pip install --upgrade 'torch==1.13.0'
    
    3. Build the image via the command docker build -t gpt-neox . in the directory the Dockerfile is in
    4. Create a file named dockerize.sh
    5. Add the following content into the file
    %%sh
    
    # Specify an algorithm name
    algorithm_name=gpt-neox
    
    account=$(aws sts get-caller-identity --query Account --output text)
    
    # Get the region defined in the current configuration (default to us-west-2 if none defined)
    region=$(aws configure get region)
    
    fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
    
    # If the repository doesn't exist in ECR, create it.
    
    aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
    if [ $? -ne 0 ]
    then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
    fi
    
    # Log into Docker
    aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
    
    # Build the docker image locally with the image name and then push it to ECR
    # with the full name.
    
    docker build -t ${algorithm_name} .
    docker tag ${algorithm_name} ${fullname}
    
    docker push ${fullname}
    
    6. Run command docker login (you need the docker cli)
    7. Execute the shell script file (you need the aws cli)
    8. Open Jupyter Notebook
    9. Add the following to the Jupyter Notebook
    %pip install sagemaker
    %pip install boto3
    
    from sagemaker.huggingface import HuggingFaceModel
    import boto3
    
    client=boto3.client('sts')
    account=client.get_caller_identity()['Account']
    
    my_session=boto3.session.Session()
    region=my_session.region_name
    
    algorithm_name="gpt-neox"
    tag="latest"
    ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, algorithm_name, tag)
    
    role = 'SageMaker'
    
    hub = {
        'HF_MODEL_ID':'EleutherAI/gpt-neox-20b',
        'HF_TASK':'text-generation'
    }
    
    huggingface_model = HuggingFaceModel(
        image_uri=ecr_image,
        env=hub,
        role=role,
    #     transformers_version="4.17", these are not needed anymore
    #     pytorch_version="1.10",
    #     py_version="py38",
    )
    
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge"
    )
    
    10. Run this final command in Jupyter Notebook once predictor is done.
    predictor.predict({
        "inputs": "The weather is"
    })
    
    11. See error

    Expected behavior: I expect to see generated output from the input, using the 20-billion-parameter pretrained EleutherAI model.

    Proposed solution: I suspect I could fix this issue if I ditched the Hugging Face SageMaker library altogether. Also, the model hasn't been updated in the last 8 months, so I'm not sure if that is the cause.

    Environment (please complete the following information):

    • GPUs: none
    • Configs: unsure

    Additional context: I have tried other GPT-Neo variants like 125M and 2.7B and those have worked perfectly. The reason I need to extend the Docker container for AWS is to avoid another error, which is apparently caused by the version of transformers (4.17?) on the default Docker image not being up to date enough.

    bug 
    opened by BjornTheProgrammer 1
  • Model ckpts from `DeeperSpeed` cannot be loaded using `deepspeed_main`/upstream DeepSpeed

    Describe the bug

    Using DeeperSpeed-trained model checkpoints (git+https://github.com/EleutherAI/DeeperSpeed.git@eb7f5cff36678625d23db8a8fe78b4a93e5d2c75#egg=deepspeed), loading them raises an error when trying to use the deepspeed_main branch with upstream DeepSpeed.

    To Reproduce Steps to reproduce the behavior:

    Train a model from the main branch using DeeperSpeed (or download a model checkpoint from s-eai-neox/pythia/1.3B/global_step71500).

    Try to load this checkpoint using the deepspeed_main branch and upstream DeepSpeed (for either training or evaluation); this gives the following error:

    Traceback (most recent call last):
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 76, in <module>
        main()
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 35, in main
        model, neox_args = setup_for_inference_or_eval(use_cache=False)
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
        model, _, _ = setup_model_and_optimizer(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/training.py", line 437, in setup_model_and_optimizer
        neox_args.iteration = load_checkpoint(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/checkpointing.py", line 235, in load_checkpoint
        checkpoint_name, state_dict = model.load_checkpoint(
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2647, in load_checkpoint
        load_path, client_states = self._load_checkpoint(load_dir,
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2713, in _load_checkpoint
        self.load_module_state_dict(state_dict=checkpoint['module'],
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2507, in load_module_state_dict
        self.module.load_state_dict(state_dict, # TODO
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1620, in load_state_dict
        raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
    TypeError: Expected state_dict to be dict-like, got <class 'NoneType'>.
    

    This gives the above traceback and checkpoint loading fails.

    Expected behavior: The checkpoints should ideally be loadable by either DeepSpeed version.

    Proposed solution: This could be an issue with DeepSpeed checkpoint formats changing over the course of 4 versions; not sure yet.

    Additional context: Relevant to merging #663, since we have checkpoints trained with DeeperSpeed that we want to use.

    cc @Quentin-Anthony @dashstander @StellaAthena

    bug 
    opened by haileyschoelkopf 3
Releases (legacy_gptj_residual.1.0.0)
Owner
EleutherAI
Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 1, 2023
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

Princeton Natural Language Processing 92 Dec 27, 2022
Seonghwan Kim 24 Sep 11, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 5, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

Jeffrey M. Binder 20 Jan 9, 2023
Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

SpeechMix Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together. Introduction For the same input: from datas

Eric Lam 31 Nov 7, 2022
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 1.2k Jan 8, 2023
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 903 Feb 17, 2021
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 7, 2023
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 2.5k Feb 17, 2021
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

Víctor Gallego 276 Dec 31, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Honor's thesis project analyzing whether the GPT-2 model can more effectively generate free-verse or structured poetry.

gpt2-poetry The following code is for my senior honor's thesis project, under the guidance of Dr. Keith Holyoak at the University of California, Los A

Ashley Kim 2 Jan 9, 2022
🕹 An esoteric language designed so that the program looks like the transcript of a Pokémon battle

PokéBattle is an esoteric language designed so that the program looks like the transcript of a Pokémon battle. Original inspiration and specification

Eduardo Correia 9 Jan 11, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

What is this fork of Megatron-LM and Megatron-DeepSpeed This is a detached fork of https://github.com/microsoft/Megatron-DeepSpeed, which in itself is

BigScience Workshop 316 Jan 3, 2023