Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Overview

  • Finetuning large language models like GPT2-xl is often difficult, as these models are too big to fit on a single GPU.
  • This guide explains how to finetune GPT2-xl and GPT-NEO (2.7B Parameters) with just one command of the Huggingface Transformers library on a single GPU.
  • This is made possible by using the DeepSpeed library and gradient checkpointing to lower the required GPU memory usage of the model.
  • I also explain how to set up a server on Google Cloud with a V100 GPU (16 GB VRAM) that you can use if you don't have a GPU.

1. (Optional) Setup VM with V100 in Google Compute Engine

Note: The GPT2-xl model runs on any server with a GPU with at least 16 GB VRAM and 60 GB RAM. The GPT-NEO model needs at least 70 GB RAM. If you use your own server instead of the setup described here, you will need to install CUDA and PyTorch on it.
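
The thresholds in the note above can be written down as a small pre-flight check. This is only a sketch: the function name and the dictionary layout are hypothetical and not part of the repository, and the 16 GB VRAM figure for GPT-NEO is an assumption taken from the V100 setup this guide was tested on.

```python
# Minimum resources quoted in this guide, in GB.
# The names below are illustrative and not part of the repository.
REQUIREMENTS_GB = {
    "gpt2-xl": {"vram": 16, "ram": 60},
    # VRAM for GPT-NEO is assumed from the tested V100 (16 GB) setup.
    "gpt-neo-2.7B": {"vram": 16, "ram": 70},
}


def meets_requirements(model: str, vram_gb: float, ram_gb: float) -> bool:
    """Return True if the machine has at least the VRAM and RAM listed above."""
    needed = REQUIREMENTS_GB[model]
    return vram_gb >= needed["vram"] and ram_gb >= needed["ram"]


if __name__ == "__main__":
    # Example: a machine with 16 GB VRAM and 78 GB RAM.
    for name in REQUIREMENTS_GB:
        print(name, meets_requirements(name, vram_gb=16, ram_gb=78))
```

Pass the nominal sizes of your hardware. Tools usually report slightly less than the nominal VRAM, so comparing a measured value against 16 directly would be too strict.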

Requirements

  1. Install the Google Cloud SDK: Click Here
  2. Register a Google Cloud account, create a project and set up billing (you can only use the $300 sign-up credit for GPUs once billing is set up).
  3. Request a quota limit increase for "GPU All Regions" to 1. Here is a step-by-step guide. The UI has changed a bit and now looks like this.
  4. Log in and initialize the Cloud SDK with gcloud auth login and gcloud init, and follow the steps until you are set up.

Create VM

  • Replace YOURPROJECTID in the command below with the project id from your GCE project.
  • You can add the --preemptible flag to the command below. This reduces your cost to about 1/3, but Google can then shut down your instance at any point. At the time of writing, this configuration costs only about $1.28 / hour in GCE when using preemptible.
  • You can change the zone if there are no resources available. Here is a list of all zones and whether they have V100 GPUs. Depending on the time of day, you might need to try a few.
  • We need a GPU server with at least 60 GB RAM, otherwise the run will crash whenever the script wants to save/pickle a model. The setup below gives us as much RAM as possible with 12 CPU cores in GCE (without paying for extended memory). You also can't use more than 12 CPU cores with a single V100 GPU in GCE.

Run this to create the instance:

gcloud compute instances create gpuserver \
   --project YOURPROJECTID \
   --zone us-west1-b \
   --custom-cpu 12 \
   --custom-memory 78 \
   --maintenance-policy TERMINATE \
   --image-family pytorch-1-7-cu110 \
   --image-project deeplearning-platform-release \
   --boot-disk-size 200GB \
   --metadata "install-nvidia-driver=True" \
   --accelerator="type=nvidia-tesla-v100,count=1"

After 5 minutes or so (the server needs to install the NVIDIA drivers first), you can connect to your instance with the command below. If you changed the zone, you will also need to change it here.

  • Replace YOURSDKACCOUNT with your SDK account name.
gcloud compute ssh YOURSDKACCOUNT@gpuserver --zone=us-west1-b

Don't forget to shut down the server once you're done, otherwise you will keep getting billed for it. This can be done here.

Next time, you can restart the server from the same web UI here.

2. Download script and install libraries

Run this to download the script and to install all libraries:

git clone https://github.com/Xirider/finetune-gpt2xl.git
chmod -R 777 finetune-gpt2xl/
cd finetune-gpt2xl
pip install -r requirements.txt 
  • This installs transformers from source, as the current release doesn't work well with deepspeed.

(Optional) If you want to use Wandb.ai for experiment tracking, you have to log in:

wandb login

3. Finetune GPT2-xl (1.5 Billion Parameters)

Then add your training data:

  • Replace the example train.txt and validation.txt files in the folder with your own training data, using the same names, and then run python text2csv.py. This converts your .txt files into one-column csv files with a "text" header and puts all the text into a single line. We need to use .csv files instead of .txt files because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files.
  • If you want to feed the model separate examples instead of one continuous block of text, you need to pack each of your examples into a separate line in the csv train and validation files.
  • Be careful with the encoding of your text. If you don't clean your text files, or if you just copy text from the web into a text editor, the dataloader from the datasets library might not load them.
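
The conversion described above can be sketched with Python's standard csv module. This is not the repository's text2csv.py, just a minimal illustration of the same idea; both function names are hypothetical.

```python
import csv
from typing import Iterable


def txt_to_single_row_csv(txt_path: str, csv_path: str) -> None:
    """Write the whole text file as one CSV row under a "text" header."""
    with open(txt_path, encoding="utf-8") as src:
        text = src.read()
    with open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["text"])
        # Line breaks survive because the csv module quotes the field.
        writer.writerow([text])


def examples_to_csv(examples: Iterable[str], csv_path: str) -> None:
    """Write one CSV row per example, for training on separate examples."""
    with open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["text"])
        for example in examples:
            writer.writerow([example])
```

The first function covers the "one continuous block of text" case, the second the "separate examples" case, where each example ends up in its own row.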

Run this:

deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config.json \
--model_name_or_path gpt2-xl \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--eval_steps 200 \
--num_train_epochs 1 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 8
  • This command runs the standard run_clm.py file from Huggingface's examples with deepspeed, with just 2 lines added to enable gradient checkpointing to use less memory.
  • Training on the Shakespeare example should take about 17 minutes. With gradient accumulation 2 and batch size 8, one gradient step takes about 9 seconds. This means the model training speed should be almost 2 examples / second. You can go up to a batch size of 12 before running out of memory, but that doesn't provide any speedup.
  • Note that the default Huggingface optimizer hyperparameters and the hyperparameters given as flags overwrite the hyperparameters in the ds_config.json file. Therefore, if you want to adjust the learning rate, warmup and more, you need to set these as flags to the training command. For an example, see the GPT-NEO training command further below, which changes the learning rate.
  • You might want to try different hyperparameters like --learning_rate and --warmup_steps to improve the finetuning.
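
The throughput estimate above follows from simple arithmetic, worked through here with the values from the training command. The 9 seconds per gradient step is the approximate timing quoted above, not a measured constant.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 1
seconds_per_gradient_step = 9  # approximate timing quoted in this guide

# Number of examples that contribute to one optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
examples_per_second = effective_batch_size / seconds_per_gradient_step

print(effective_batch_size)           # 16
print(round(examples_per_second, 2))  # 1.78
```

16 examples every 9 seconds is roughly 1.8 examples per second, which is where "almost 2 examples / second" comes from.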

4. Generate text with your finetuned model

You can test your finetuned GPT2-xl model with this script from Huggingface Transformers (it is included in the folder):

python run_generation.py --model_type=gpt2 --model_name_or_path=finetuned --length 200

Or you can use it now in your own code like this to generate text in batches:

# credit to Niels Rogge - https://github.com/huggingface/transformers/issues/10704

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = GPT2Tokenizer.from_pretrained('finetuned')
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('finetuned').to(device)
print("model loaded")

# this is a single input batch with size 3
texts = ["From off a hill whose concave womb", "Another try", "A third test"]

encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding, max_length=100)
generated_texts = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True)

print(generated_texts)
  • Model inference runs even on small GPUs or on CPUs without any additional changes.

Finetune GPT-NEO (2.7 Billion Parameters)

This works now. I tested it with a server with one V100 GPU (16 GB VRAM) and 78 GB normal RAM, but it might not actually need that much RAM.

Add your training data like you would for GPT2-xl:

  • Replace the example train.txt and validation.txt files in the folder with your own training data, using the same names, and then run python text2csv.py. This converts your .txt files into one-column csv files with a "text" header and puts all the text into a single line. We need to use .csv files instead of .txt files because Huggingface's dataloader removes line breaks when loading text from a .txt file, which does not happen with .csv files.

  • If you want to feed the model separate examples instead of one continuous block of text, you need to pack each of your examples into a separate line in the csv train and validation files.

  • Be careful with the encoding of your text. If you don't clean your text files, or if you just copy text from the web into a text editor, the dataloader from the datasets library might not load them.

  • Be sure to either log in to wandb.ai with wandb login or uninstall it completely. Otherwise it might cause a memory error during the run.
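
One way to catch the encoding problems mentioned above before starting a long run is to try decoding your files as UTF-8. This is a sketch that assumes your files are meant to be UTF-8; both helper names are hypothetical.

```python
from typing import Optional


def find_invalid_utf8(path: str) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def rewrite_as_utf8(src_path: str, dst_path: str) -> None:
    """Copy a file, replacing undecodable bytes with U+FFFD."""
    with open(src_path, "rb") as f:
        text = f.read().decode("utf-8", errors="replace")
    with open(dst_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Replacing bad bytes loses information, so it is worth looking at the reported offset first and fixing the source text if you can.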

Then start the training with this command:

deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config_gptneo.json \
--model_name_or_path EleutherAI/gpt-neo-2.7B \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--num_train_epochs 1 \
--eval_steps 15 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 4 \
--use_fast_tokenizer False \
--learning_rate 5e-06 \
--warmup_steps 10
  • This uses a smaller "allgather_bucket_size" setting in the ds_config_gptneo.json file and a smaller batch size to further reduce GPU memory usage.
  • You might want to change the hyperparameters to be closer to the original EleutherAI training config. You can find these here.
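
For orientation, the memory-related settings mentioned above live in the "zero_optimization" section of a DeepSpeed config. The sketch below builds such a config from Python. The values are placeholders for illustration and are not the contents of the repository's ds_config_gptneo.json; the function name is hypothetical.

```python
import json


def make_ds_config(bucket_size: float = 5e7) -> dict:
    """Build an illustrative ZeRO stage 2 config with CPU offload.

    The bucket size is a placeholder; check ds_config_gptneo.json in the
    repository for the values this guide actually uses.
    """
    return {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            # Keeps optimizer state in CPU RAM. Newer DeepSpeed versions
            # warn that this key is deprecated in favor of offload_optimizer.
            "cpu_offload": True,
            # Smaller buckets lower peak GPU memory at some cost in speed.
            "allgather_bucket_size": bucket_size,
            "reduce_bucket_size": bucket_size,
        },
    }


if __name__ == "__main__":
    print(json.dumps(make_ds_config(), indent=2))
```

If you run out of GPU memory, lowering the bucket sizes and the batch size are the two knobs this guide relies on.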

Generate text with a GPT-NEO 2.7 Billion Parameters model

I provided a script that allows you to interactively prompt your GPT-NEO model. If you just want to sample from the pretrained model without finetuning it yourself, replace "finetuned" with "EleutherAI/gpt-neo-2.7B". Start it with this:

python run_generate_neo.py finetuned

Or use this snippet to generate text from your finetuned model within your code:

# credit to Suraj Patil - https://github.com/huggingface/transformers/pull/10848 - modified

from transformers import GPTNeoForCausalLM, AutoTokenizer

model = GPTNeoForCausalLM.from_pretrained("finetuned").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("finetuned")

text = "From off a hill whose concave"
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")

max_length = 400 + ids.shape[1] # add the length of the prompt tokens to match with the mesh-tf generation

gen_tokens = model.generate(
  ids,
  do_sample=True,
  min_length=max_length,
  max_length=max_length,
  temperature=0.9,
  use_cache=True
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)

(Optional) Configuration

You can change the learning rate, weight decay and warmup by setting them as flags to the training command. Warmup and learning rates in the config are ignored, as the script always uses the Huggingface optimizer/trainer default values. If you want to overwrite them, you need to use flags. You can check all the explanations here:

https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed

The rest of the training arguments can be provided as flags and are all listed here:

https://huggingface.co/transformers/master/main_classes/trainer.html#trainingarguments
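
As a sketch of how such overrides are passed, the snippet below assembles the GPT-NEO training command with the hyperparameters as flags. The helper build_train_command is hypothetical. The flags are the ones used in the commands earlier in this guide, plus --weight_decay from the Huggingface training arguments.

```python
import shlex


def build_train_command(learning_rate=5e-6, warmup_steps=10, weight_decay=0.0):
    """Return the deepspeed training command as a list of arguments."""
    return [
        "deepspeed", "--num_gpus=1", "run_clm.py",
        "--deepspeed", "ds_config_gptneo.json",
        "--model_name_or_path", "EleutherAI/gpt-neo-2.7B",
        "--train_file", "train.csv",
        "--validation_file", "validation.csv",
        "--do_train", "--do_eval", "--fp16",
        "--output_dir", "finetuned",
        # Hyperparameters go here as flags, because values in the
        # DeepSpeed config file are overwritten by the trainer.
        "--learning_rate", str(learning_rate),
        "--warmup_steps", str(warmup_steps),
        "--weight_decay", str(weight_decay),
    ]


if __name__ == "__main__":
    # Prints a command line you can paste into a shell.
    print(shlex.join(build_train_command(learning_rate=1e-5)))
```

Building the command as a list also makes it easy to launch several runs with different learning rates from one script, for example with subprocess.run.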

Comments
  • Freezing at "Using /home/user/.cache/torch_extensions as PyTorch extensions root..."

    After installing the dependencies and running the given commands to fine-tune a model, some GPU VRAM is allocated (looking at nvidia-smi), but then the program seems to just stop once "Using /home/user/.cache/torch_extensions as PyTorch extensions root..." prints.

    opened by mallorbc 4
  • Errors while trying to train with two GPUs

    Hi,

    When trying to train on two GPUs, I'm getting this error:

    Traceback (most recent call last):
      File "run_clm.py", line 478, in <module>
        main()
      File "run_clm.py", line 441, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1083, in train
        deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
      File "/root/miniconda3/lib/python3.8/site-packages/transformers/integrations.py", line 520, in deepspeed_init
        model, optimizer, _, lr_scheduler = deepspeed.initialize(
      File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/__init__.py", line 116, in initialize
        engine = DeepSpeedEngine(args=args,
      File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 148, in __init__
        self._configure_with_arguments(args, mpu)
      File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 517, in _configure_with_arguments
        self._config = DeepSpeedConfig(config_file, mpu, param_dict=self.config_params)
      File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 597, in __init__
        self._configure_train_batch_size()
      File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 732, in _configure_train_batch_size
        self._set_batch_related_parameters()
      File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 728, in _set_batch_related_parameters
        assert False,
    AssertionError: Either train_batch_size or micro_batch_per_gpu needs to be provided

    So I added the flag --train_batch_size 8 and got the following error:

    Traceback (most recent call last):
      File "run_clm.py", line 478, in <module>
        main()
      File "run_clm.py", line 192, in main
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
      File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 196, in parse_args_into_dataclasses
    Traceback (most recent call last):
      File "run_clm.py", line 478, in <module>
        raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
    ValueError: Some specified arguments are not used by the HfArgumentParser: ['--train_batch_size', '8']
        main()
      File "run_clm.py", line 192, in main
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
      File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 196, in parse_args_into_dataclasses
        raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
    ValueError: Some specified arguments are not used by the HfArgumentParser: ['--train_batch_size', '8']

    Looks to me like a mismatch between deepspeed and transformers, do you have any suggestions on how to solve it?

    This is my ds_report:

    DeepSpeed general environment info:
    torch install path ............... ['/root/miniconda3/lib/python3.8/site-packages/torch']
    torch version .................... 1.7.1
    torch cuda version ............... 11.0
    nvcc version ..................... 11.0
    deepspeed install path ........... ['/root/miniconda3/lib/python3.8/site-packages/deepspeed']
    deepspeed info ................... 0.3.15, unknown, unknown
    deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0

    opened by barakw2021 4
  • Gpt-neo inference with Deepspeed: IndexError: Dimension out of range

    Thanks for this useful repository. I was able to follow it to train a GPT-NEO 2.7B model.

    Inference on the model works well for me, using less than 8 GB of VRAM, so it fits on consumer-level GPUs. However, I'm not yet able to get inference working with DeepSpeed.

    To be clear...

    I am using the code from here:

    https://github.com/Xirider/finetune-gpt2xl/blob/main/README.md#generate-text-with-a-gpt-neo-27-billion-parameters-model

    And it works well, if I comment out this line:

    deepspeed.init_inference(model, mp_size=1, dtype=torch.half, replace_method='auto')

    If I retain the line, then the inference fails with this error message:

      File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 374, in forward
        output = DeepSpeedSelfAttentionFunction.apply(
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 312, in forward
        output, key_layer, value_layer, context_layer = selfAttention_fp()
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 270, in selfAttention_fp
        qkv_out = qkv_func(input,
    IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)
    

    I'm actually a bit unsure whether DeepSpeed should be used for inference with GPT-NEO at all.

    Huggingface says....

    https://huggingface.co/transformers/main_classes/deepspeed.html

    DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.

    But, Microsoft has a guide which shows the usage of Deepspeed for inference with this model...

    https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/inference-tutorial.md#end-to-end-gpt-neo-27b-inference

    opened by kingpalethe 3
  • Can't change BOS token or EOS token for GPT Neo

    In order to better control the start and stop of generated text, I have added BOS and EOS tokens for GPT2xl. This works well: the generated text stops at an appropriate length and starts the way a normal sentence would. However, I want to do the same for GPT Neo, and this does not work. I have discovered that for some reason the arguments that normally set BOS and EOS do not work when GPT Neo is run, even if I change the tokenizer from AutoTokenizer to GPT2Tokenizer. Below is some code that shows what I mean.

    tokenizer = GPT2Tokenizer.from_pretrained(
        model_args.model_name_or_path,
        bos_token='<|beginingtext|>',
        eos_token='<|endingtext|>',
        pad_token='<|pad|>',
        **tokenizer_kwargs)
    print(tokenizer.eos_token)
    print(tokenizer.bos_token)
    quit()
    

    As I said, when I run this with GPT2xl, the tokens are appropriately changed. When I run this with GPT Neo, both the BOS and EOS tokens are <|endoftext|>

    opened by mallorbc 3
  • Training on a larger dataset fails due to memory issues on faster GPUs

    Thanks so much for producing this repo, it's been really helpful in getting up and running on the biggest GPT-Neo model.

    I'm having an issue training gpt-neo_2-7B though - my dataset is just over 200mb, which leads to an out of memory issue on the very last step of loading a model into memory before training.

    [INFO|integrations.py:533] 2021-04-20 12:40:32,650 >> Attempting to resume from paragraphs/checkpoint-600
    [2021-04-20 12:40:32,664] [INFO] [engine.py:1445:_load_checkpoint] rank: 0 loading checkpoint: paragraphs/checkpoint-600/global_step600/mp_rank_00_model_states.pt
    Traceback (most recent call last):
      File "run_clm.py", line 478, in <module>
        main()
      File "run_clm.py", line 441, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
    [...]
    RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 10605230080 bytes. Error code 12 (Cannot allocate memory)

    I've tried a number of GPUs on Google Cloud, and I can get it to run on the P100 since I can up the RAM to 100 GB, but both the V100 and A100 fail (with 78 GB and 85 GB respectively).

    Unfortunately Google puts a hard limit on RAM for these GPUs, and increasing the number of GPUs also doubles the number of processes run and so the RAM required - so unless I pay for 2 GPUs and let one sit idle I have to train on the much slower P100.

    This is .. ok .. 😅 but I'd love to go faster if I can. So far I've tried:

    • Reducing per_device_train_batch_size to 2
    • Halving the dataset size

    but neither has made a difference.

    Do you have any other tips on how I might squeeze into the 85GB you get with an A100? It's so tantalizingly close - I wish Google would just let me add more RAM!

    opened by jonnyplatt 3
  • Resume from checkpoint

    I have an RTX 3090 (24 GB), 64 GB RAM and 50 GB swap memory. Training works pretty nicely, but unfortunately resuming training from checkpoints results in OOM:

    [2021-05-07 19:18:39,962] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    [2021-05-07 19:18:39,973] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config_gptneo_new.json --model_name_or_path /datadrive/model/checkpoint-800/ --train_file merged_train.txt.csv --do_train --fp16 --overwrite_cache --output_dir /datadrive/model --num_train_epochs 1 --gradient_accumulation_steps 2 --per_device_train_batch_size 4 --use_fast_tokenizer False --learning_rate 5e-06 --save_steps 400
    [2021-05-07 19:18:40,526] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
    [2021-05-07 19:18:40,526] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
    [2021-05-07 19:18:40,526] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=1, node_rank=0
    [2021-05-07 19:18:40,526] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
    [2021-05-07 19:18:40,526] [INFO] [launch.py:102:main] dist_world_size=1
    [2021-05-07 19:18:40,526] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0
    [2021-05-07 19:18:41,601] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
    05/07/2021 19:18:41 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
    05/07/2021 19:18:41 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/datadrive/model, overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=8, gradient_accumulation_steps=2, eval_accumulation_steps=None, learning_rate=5e-06, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/May07_19-18-41_9c3c6cac903e, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=400, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/datadrive/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=ds_config_gptneo_new.json, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, _n_gpu=1, mp_parameters=)
    05/07/2021 19:18:42 - WARNING - datasets.builder -   Using custom data configuration default-b5898a6a80220f13
    05/07/2021 19:18:42 - WARNING - datasets.builder -   Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b5898a6a80220f13/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
    [INFO|configuration_utils.py:515] 2021-05-07 19:18:42,390 >> loading configuration file /datadrive/model/checkpoint-800/config.json
    [INFO|configuration_utils.py:553] 2021-05-07 19:18:42,390 >> Model config GPTNeoConfig {
      "_name_or_path": "EleutherAI/gpt-neo-2.7B",
      "activation_function": "gelu_new",
      "architectures": [
        "GPTNeoForCausalLM"
      ],
      "attention_dropout": 0,
      "attention_layers": [
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local",
        "global",
        "local"
      ],
      "attention_types": [
        [
          [
            "global",
            "local"
          ],
          16
        ]
      ],
      "bos_token_id": 50256,
      "embed_dropout": 0,
      "eos_token_id": 50256,
      "gradient_checkpointing": true,
      "hidden_size": 2560,
      "initializer_range": 0.02,
      "intermediate_size": null,
      "layer_norm_epsilon": 1e-05,
      "max_position_embeddings": 2048,
      "model_type": "gpt_neo",
      "num_heads": 20,
      "num_layers": 32,
      "resid_dropout": 0,
      "summary_activation": null,
      "summary_first_dropout": 0.1,
      "summary_proj_to_labels": true,
      "summary_type": "cls_index",
      "summary_use_proj": true,
      "task_specific_params": {
        "text-generation": {
          "do_sample": true,
          "max_length": 50,
          "temperature": 0.9
        }
      },
      "tokenizer_class": "GPT2Tokenizer",
      "transformers_version": "4.6.0.dev0",
      "use_cache": false,
      "vocab_size": 50257,
      "window_size": 256
    }
    
    [INFO|configuration_utils.py:517] 2021-05-07 19:18:42,765 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /models/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
    [INFO|configuration_utils.py:553] 2021-05-07 19:18:42,765 >> Model config GPT2Config {
      "activation_function": "gelu_new",
      "architectures": [
        "GPT2LMHeadModel"
      ],
      "attn_pdrop": 0.1,
      "bos_token_id": 50256,
      "embd_pdrop": 0.1,
      "eos_token_id": 50256,
      "gradient_checkpointing": false,
      "initializer_range": 0.02,
      "layer_norm_epsilon": 1e-05,
      "model_type": "gpt2",
      "n_ctx": 1024,
      "n_embd": 768,
      "n_head": 12,
      "n_inner": null,
      "n_layer": 12,
      "n_positions": 1024,
      "resid_pdrop": 0.1,
      "scale_attn_weights": true,
      "summary_activation": null,
      "summary_first_dropout": 0.1,
      "summary_proj_to_labels": true,
      "summary_type": "cls_index",
      "summary_use_proj": true,
      "task_specific_params": {
        "text-generation": {
          "do_sample": true,
          "max_length": 50
        }
      },
      "transformers_version": "4.6.0.dev0",
      "use_cache": true,
      "vocab_size": 50257
    }
    
    [INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /models/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
    [INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /models/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
    [INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
    [INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
    [INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
    [INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /models/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
    [INFO|modeling_utils.py:1147] 2021-05-07 19:18:44,955 >> loading weights file /datadrive/model/checkpoint-800/pytorch_model.bin
    [INFO|modeling_utils.py:1328] 2021-05-07 19:18:59,255 >> All model checkpoint weights were used when initializing GPTNeoForCausalLM.
    
    [INFO|modeling_utils.py:1336] 2021-05-07 19:18:59,255 >> All the weights of GPTNeoForCausalLM were initialized from the model checkpoint at /datadrive/model/checkpoint-800/.
    If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoForCausalLM for predictions without further training.
      0%|                                                     | 0/1 [00:00<?, ?ba/s][WARNING|tokenization_utils_base.py:3170] 2021-05-07 19:19:40,807 >> Token indices sequence length is longer than the specified maximum sequence length for this model (14397149 > 1024). Running this sequence through the model will result in indexing errors
    100%|█████████████████████████████████████████████| 1/1 [00:42<00:00, 42.00s/ba]
    100%|█████████████████████████████████████████████| 1/1 [00:08<00:00,  8.47s/ba]
    [INFO|trainer.py:414] 2021-05-07 19:19:50,812 >> Using amp fp16 backend
    [INFO|trainer.py:1042] 2021-05-07 19:19:50,865 >> Loading model from /datadrive/model/checkpoint-800/).
    [2021-05-07 19:19:50,867] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.16, git-hash=unknown, git-branch=unknown
    [2021-05-07 19:19:50,867] [WARNING] [config.py:79:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
    [2021-05-07 19:19:54,135] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    Detected CUDA files, patching ldflags
    Emitting ninja build file /root/.cache/torch_extensions/cpu_adam/build.ninja...
    Building extension module cpu_adam...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    ninja: no work to do.
    Loading extension module cpu_adam...
    Time to load cpu_adam op: 2.1879847049713135 seconds
    Adam Optimizer #0 is created with AVX2 arithmetic capability.
    Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
    [2021-05-07 19:19:58,240] [INFO] [engine.py:610:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
    [2021-05-07 19:19:58,240] [INFO] [engine.py:615:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
    Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
    [2021-05-07 19:19:58,240] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
    [2021-05-07 19:19:58,240] [INFO] [stage2.py:102:__init__] Reduce bucket size 200000000.0
    [2021-05-07 19:19:58,240] [INFO] [stage2.py:103:__init__] Allgather bucket size 200000000.0
    [2021-05-07 19:19:58,240] [INFO] [stage2.py:104:__init__] CPU Offload: True
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
    Building extension module utils...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    ninja: no work to do.
    Loading extension module utils...
    Time to load utils op: 1.4445114135742188 seconds
    [2021-05-07 19:21:35,500] [INFO] [stage2.py:381:__init__] optimizer state initialized
    [2021-05-07 19:21:35,709] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
    [2021-05-07 19:21:35,760] [INFO] [engine.py:439:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
    [2021-05-07 19:21:35,761] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fe9d20fb5b0>
    [2021-05-07 19:21:35,769] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[[0.9, 0.999]]
    [2021-05-07 19:21:35,777] [INFO] [config.py:747:print] DeepSpeedEngine configuration:
    [2021-05-07 19:21:35,925] [INFO] [config.py:751:print]   activation_checkpointing_config  {
        "partition_activations": false, 
        "contiguous_memory_optimization": false, 
        "cpu_checkpointing": false, 
        "number_checkpoints": null, 
        "synchronize_checkpoint_boundary": false, 
        "profile": false
    }
    [2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
    [2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   allreduce_always_fp32 ........ False
    [2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_enabled .................. False
    [2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_params ................... False
    [2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   checkpoint_tag_validation_enabled  True
    [2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   checkpoint_tag_validation_fail  False
    [2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   disable_allgather ............ False
    [2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   dump_state ................... False
    [2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
    [2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   elasticity_enabled ........... False
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   flops_profiler_config ........ {
        "enabled": false, 
        "profile_step": 1, 
        "module_depth": -1, 
        "top_modules": 3, 
        "detailed": true
    }
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   fp16_enabled ................. True
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   global_rank .................. 0
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 2
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_clipping ............ 1.0
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 65536
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   loss_scale ................... 0
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   memory_breakdown ............. False
    [2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
    [2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_name ............... adamw
    [2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_params ............. {'lr': 5e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_enabled .................. False
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_params ................... False
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   prescale_gradients ........... False
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_name ............... WarmupLR
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-06, 'warmup_num_steps': 0}
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_attention ............. None
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   steps_per_print .............. 2000
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_output_path ...... 
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_batch_size ............. 8
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  4
    [2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   wall_clock_breakdown ......... False
    [2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   world_size ................... 1
    [2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  False
    [2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_config .................. {
        "stage": 2, 
        "contiguous_gradients": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "load_from_fp32_weights": true, 
        "elastic_checkpoint": true, 
        "offload_param": null, 
        "offload_optimizer": {
            "device": "cpu", 
            "nvme_path": null, 
            "buffer_count": 4, 
            "pin_memory": false, 
            "pipeline_read": false, 
            "pipeline_write": false, 
            "fast_init": false
        }, 
        "sub_group_size": 1.000000e+12, 
        "prefetch_bucket_size": 5.000000e+07, 
        "param_persistence_threshold": 1.000000e+05, 
        "max_live_parameters": 1.000000e+09, 
        "max_reuse_distance": 1.000000e+09, 
        "gather_fp16_weights_on_model_save": false, 
        "find_unused_parameters": false
    }
    [2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_enabled ................. True
    [2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_optimization_stage ...... 2
    [2021-05-07 19:21:35,942] [INFO] [config.py:753:print]   json = {
        "fp16": {
            "enabled": true, 
            "loss_scale": 0, 
            "loss_scale_window": 1000, 
            "initial_scale_power": 16, 
            "hysteresis": 2, 
            "min_loss_scale": 1
        }, 
        "optimizer": {
            "type": "AdamW", 
            "params": {
                "lr": 5e-06, 
                "betas": [0.9, 0.999], 
                "eps": 1e-08, 
                "weight_decay": 0.0
            }
        }, 
        "scheduler": {
            "type": "WarmupLR", 
            "params": {
                "warmup_min_lr": 0, 
                "warmup_max_lr": 5e-06, 
                "warmup_num_steps": 0
            }
        }, 
        "zero_optimization": {
            "stage": 2, 
            "allgather_partitions": true, 
            "allgather_bucket_size": 2.000000e+08, 
            "overlap_comm": true, 
            "reduce_scatter": true, 
            "reduce_bucket_size": 2.000000e+08, 
            "contiguous_gradients": true, 
            "cpu_offload": true
        }, 
        "gradient_accumulation_steps": 2, 
        "gradient_clipping": 1.0, 
        "steps_per_print": 2.000000e+03, 
        "train_batch_size": 8, 
        "train_micro_batch_size_per_gpu": 4, 
        "wall_clock_breakdown": false
    }
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    No modifications detected for re-loaded extension module utils, skipping build step...
    Loading extension module utils...
    Time to load utils op: 0.09232521057128906 seconds
    [INFO|integrations.py:536] 2021-05-07 19:21:36,160 >> Attempting to resume from /datadrive/model/checkpoint-800/
    [2021-05-07 19:21:36,175] [INFO] [engine.py:1480:_load_checkpoint] rank: 0 loading checkpoint: /datadrive/model/checkpoint-800/global_step800/mp_rank_00_model_states.pt
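
Two details of the log above can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: `migrate_cpu_offload` and `effective_batch_size` are hypothetical helper names. The first rewrites the deprecated `cpu_offload` flag into the `offload_optimizer` form that the warning asks for (and that the printed `zero_config` already shows), and the second reproduces the `train_batch_size` of 8 from the micro batch size, the gradient accumulation steps and the world size.

```python
import copy


def migrate_cpu_offload(ds_config):
    """Return a copy of a DeepSpeed config dict without the deprecated flag.

    Hypothetical helper: replaces zero_optimization.cpu_offload = true with
    zero_optimization.offload_optimizer = {"device": "cpu"}.
    """
    cfg = copy.deepcopy(ds_config)
    zero = cfg.get("zero_optimization", {})
    if zero.pop("cpu_offload", False):
        zero.setdefault("offload_optimizer", {"device": "cpu"})
    return cfg


def effective_batch_size(micro_batch, grad_accum, world_size):
    """train_batch_size = micro batch per GPU * accumulation steps * GPUs."""
    return micro_batch * grad_accum * world_size


# Values taken from the configuration printed in the log.
old = {"zero_optimization": {"stage": 2, "cpu_offload": True}}
new = migrate_cpu_offload(old)
print(new["zero_optimization"])
print(effective_batch_size(micro_batch=4, grad_accum=2, world_size=1))
```

The original dict is left untouched, so the migrated config can be compared with the old one before writing it back to `ds_config.json`.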
    
    opened by ArturTan 2
  • Exception: Installed CUDA version 11.0 does not match the version torch was compiled with 11.1 [SOLUTION]

    Exception: Installed CUDA version 11.0 does not match the version torch was compiled with 11.1 [SOLUTION]

    Hey, first off, awesome project. I'm getting this error when I try to run the deepspeed command. I found a solution, in case anyone else has this problem:

    wget https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda_11.1.1_455.32.00_linux.run
    sudo sh cuda_11.1.1_455.32.00_linux.run
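
Before reinstalling CUDA, it can help to confirm which two versions disagree. The sketch below is an illustration under assumptions: `parse_nvcc_release` and `versions_match` are hypothetical helpers, and the sample string imitates the last line of `nvcc --version`. On a real machine you would pass the actual `nvcc --version` output and the value of `torch.version.cuda`.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of nvcc --version."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def versions_match(nvcc_output, torch_cuda_version):
    """Compare the installed CUDA toolkit with the version torch was built for."""
    installed = parse_nvcc_release(nvcc_output)
    major, minor = (int(part) for part in torch_cuda_version.split(".")[:2])
    return installed == (major, minor)


# Sample values mirroring the error message in this issue (assumed, not measured).
sample = "Cuda compilation tools, release 11.0, V11.0.221"
print(parse_nvcc_release(sample))
print(versions_match(sample, "11.1"))
```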

    opened by CupOfGeo 2
  • separate examples for finetuning

    separate examples for finetuning

    #17 Hey, I have been using this. I would like to help update the docs to add the separated_samples_max_length argument and explain when you should use it. This has been a super helpful repo. If you want me to change anything, just let me know.

    opened by CupOfGeo 1
  • Crashes with new Transformers version

    Crashes with new Transformers version

    Here's the error:

    Traceback (most recent call last):
      File "run_clm.py", line 478, in <module>
        main()
      File "run_clm.py", line 422, in main
        trainer = Trainer(
      File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 295, in __init__
        logging.set_verbosity(log_level)
      File "/root/miniconda3/lib/python3.8/site-packages/transformers/utils/logging.py", line 161, in set_verbosity
        _get_library_root_logger().setLevel(verbosity)
      File "/root/miniconda3/lib/python3.8/logging/__init__.py", line 1409, in setLevel
        self.level = _checkLevel(level)
      File "/root/miniconda3/lib/python3.8/logging/__init__.py", line 194, in _checkLevel
        raise ValueError("Unknown level: %r" % level)

    The fix was to install transformers v4.6.0 from pip

    opened by barakw2021 1
  • Ideal number of epochs? Number of examples meaning?

    Ideal number of epochs? Number of examples meaning?

    Is there a recommended number of epochs to use? I was able to successfully train on a custom dataset with nearly 45k entries in the training set and nearly 11k in the validation set. The example sets the epoch flag to only 1. However, I have found that training for 4 epochs leads to a lower loss than 1 epoch, and I imagine that continuing to train the model would lead to an even better result. It is difficult to say at what point overfitting may start occurring, as the validation data is only evaluated at the end of the training.

    Thus I ask: is there a rough ideal number of epochs for fine-tuning? If there is, I think it would be a good idea to add that to the README (which I can do if needed).

    My second question is related to the Num examples part of training and evaluation. As I said, I have nearly 45k training texts and nearly 11k validation texts. However, Num examples says 1472 and 365 respectively for training and validation. What does this mean? Is not all the data being used? Why does it not show the much larger numbers of 45k and 11k?

    Thanks for the repo and for your help. This is very cool and relatively easy to work with once one has some experience with DeepSpeed.
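
A likely explanation for the smaller Num examples count is the grouping step in run_clm.py, which concatenates all tokenized texts and cuts them into fixed-size blocks, so the counter reports blocks rather than input texts. The sketch below only imitates that idea on toy data. The `group_texts` function here is a simplified stand-in written for this illustration, and the block size of 4 is an arbitrary assumption.

```python
from itertools import chain


def group_texts(tokenized_texts, block_size):
    """Concatenate token lists and split them into equal blocks.

    Simplified stand-in for the grouping step: the remainder that does not
    fill a whole block is dropped.
    """
    concatenated = list(chain.from_iterable(tokenized_texts))
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]


# Four short "texts" with 10 tokens in total (toy data).
texts = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = group_texts(texts, block_size=4)
print(len(texts), "texts ->", len(blocks), "examples")
```

With many short texts and a large block size, the number of examples ends up far below the number of texts even though all but the final partial block of tokens is used.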

    opened by mallorbc 1
  • Multiple entries csv

    Multiple entries csv

    Hi, I come from Upwork. Is this what you are looking for: splitting the dataset into a multi-row CSV?

    
    import csv

    start_token = "|<start of text>|"
    end_token = "|<end of text>|"


    def txt_to_csv(txt_path, csv_path):
        # Strip the start tokens and split the file into one sample per end token.
        with open(txt_path, encoding='utf-8') as txtfile:
            all_text = txtfile.read().replace(start_token, "").split(end_token)
        # Drop the trailing fragment that follows the last end token.
        all_text = all_text[:-1]
        # newline='' lets the csv module control line endings itself.
        with open(csv_path, mode='w', encoding='utf-8', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=['text'])
            writer.writeheader()
            for row in all_text:
                # Write each sample as its own row (not the whole list).
                writer.writerow({'text': row})


    txt_to_csv('train.txt', 'train.csv')
    txt_to_csv('validation.txt', 'validation.csv')

    print("created train.csv and validation.csv files")
    
    opened by kikirizki 1
  • subprocess.CalledProcessError:

    subprocess.CalledProcessError:

    I got the following error:

    [2022-01-13 14:47:32,154] [INFO] [launch.py:131:sigkill_handler] Killing subprocess 2273
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 167, in <module>
        main()
      File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 156, in main
        sigkill_handler(signal.SIGTERM, None) # not coming back
      File "/home/ubuntu/anaconda3/envs/gpt2_lm/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 137, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/gpt2_lm/bin/python', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--model_name_or_path', 'gpt2-xl', '--train_file', '../../dataset/train.txt', '--validation_file', '../../dataset/test.txt', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--eval_steps', '500', '--num_train_epochs', '1', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1']' died with <Signals.SIGKILL: 9>.

    opened by Dhanachandra 1
  • Out of memory with RTX3090

    Out of memory with RTX3090

    Hi, I'm trying to train GPT2-xl but keep getting OOM, even when I set the batch size to 1, gradient_accumulation to 8/16/512, contiguous_gradients to false, and allgather_bucket_size / reduce_bucket_size to 2e2. I can see in nvidia-smi that I'm only reaching about half the memory capacity, around 12 GB. My system, as stated: an RTX 3090 with 24 GB memory, 80 GB RAM, and a 5600X CPU (if that matters), running WSL2 on Windows 10. Thanks.

    opened by PyxAI 4
  • Feeding the model separate examples instead of one continuous block of text

    Feeding the model separate examples instead of one continuous block of text

    Hello, I'm interested in adding this feature: a function in text2csv.py that takes a folder of texts, and then padding and truncating them in run_clm.py instead of using the group_text function.
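
The pad-and-truncate idea from this request can be sketched in plain Python. This is an illustration under assumptions, not the code that was merged: `pad_or_truncate` is a hypothetical helper, and reusing GPT-2's end-of-text id 50256 as the padding id is an assumption, since GPT-2 does not define a padding token by default.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Bring one tokenized sample to exactly max_length tokens.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    ids = list(token_ids[:max_length])        # truncate long samples
    n_pad = max_length - len(ids)             # how many pad tokens are needed
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


PAD_ID = 50256  # assumption: GPT-2 end-of-text id reused as padding

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Each sample stays a separate training example this way, at the cost of some wasted positions on padding, whereas grouping packs every block full but lets unrelated texts share a block.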

    opened by CupOfGeo 1
  • AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

    AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

    I tried to use your script (gpt2-xl) but I get an error: AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

    pip list

    Package             Version
    certifi             2021.5.30
    charset-normalizer  2.0.4
    click               8.0.1
    configparser        5.0.2
    datasets            1.8.0
    deepspeed           0.4.0
    dill                0.3.4
    docker-pycreds      0.4.0
    filelock            3.0.12
    fsspec              2021.7.0
    gitdb               4.0.7
    GitPython           3.1.18
    huggingface-hub     0.0.8
    idna                3.2
    importlib-metadata  4.7.0
    joblib              1.0.1
    multiprocess        0.70.12.2
    ninja               1.10.2
    numpy               1.21.2
    packaging           21.0
    pandas              1.3.2
    pathtools           0.1.2
    Pillow              8.3.1
    pip                 21.2.4
    promise             2.3
    protobuf            3.17.3
    psutil              5.8.0
    pyarrow             3.0.0
    pyparsing           2.4.7
    python-dateutil     2.8.2
    pytz                2021.1
    PyYAML              5.4.1
    regex               2021.8.21
    requests            2.26.0
    sacremoses          0.0.45
    sentry-sdk          1.3.1
    setuptools          57.4.0
    shortuuid           1.0.1
    six                 1.16.0
    smmap               4.0.0
    subprocess32        3.5.4
    tensorboardX        1.8
    tokenizers          0.10.3
    torch               1.9.0
    torchvision         0.10.0
    tqdm                4.49.0
    transformers        4.7.0
    triton              1.0.0
    typing-extensions   3.10.0.0
    urllib3             1.26.6
    wandb               0.12.0
    wheel               0.37.0
    xxhash              2.0.2
    zipp                3.5.0

    opened by remotejob 7
  • New issue with Pandas

    New issue with Pandas

    I got this error:

    Traceback (most recent call last):
      File "run_clm.py", line 478, in <module>
        main()
      File "run_clm.py", line 271, in main
        datasets = load_dataset(
      File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 742, in load_dataset
        builder_instance.download_and_prepare(
      File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 574, in download_and_prepare
        self._download_and_prepare(
      File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 652, in _download_and_prepare
        self._prepare_split(split_generator, **prepare_split_kwargs)
      File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1041, in _prepare_split
        for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
      File "/root/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
        for obj in iterable:
      File "/root/miniconda3/lib/python3.8/site-packages/datasets/packaged_modules/csv/csv.py", line 92, in _generate_tables
        csv_file_reader = pd.read_csv(
      File "/root/miniconda3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
        return func(*args, **kwargs)
      File "/root/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 571, in read_csv
        kwds_defaults = _refine_defaults_read(
      File "/root/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1306, in _refine_defaults_read
        raise ValueError("Specified named and prefix; you can only specify one.")
    ValueError: Specified named and prefix; you can only specify one.
    Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-84d6151a5e4565ed/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...

    Apparently it's a known error with the latest Pandas: https://github.com/pandas-dev/pandas/issues/42387

    I solved it by downgrading to Pandas 1.2.5

    opened by barakw2021 0
Owner
Seonghwan Kim 24 Sep 11, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" This is an implementation for https://github

张博 1 Feb 2, 2022
Train 🤗 transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 1, 2023
Label data using HuggingFace's transformers and automatically get a prediction service

Label Studio for Hugging Face's Transformers Website • Docs • Twitter • Join Slack Community Transfer learning for NLP models by annotating your textu

Heartex 135 Dec 29, 2022
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ... ... ask questions in natural language and find gran

deepset 6.4k Jan 9, 2023
:mag: End-to-End Framework for building natural language search interfaces to data by utilizing Transformers and the State-of-the-Art of NLP. Supporting DPR, Elasticsearch, HuggingFace's Modelhub and much more!

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 1.4k Feb 18, 2021
KoBART model on huggingface transformers

KoBART-Transformers KoBART, released by SKT, has been ported to transformers so that it can be used conveniently. Install (Optional) If you use BartModel and PreTrainedTokenizerFast, you do not need to install it. p

Hyunwoong Ko 58 Dec 7, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large GitHub Repository Documentat

Xing Han Lu 244 Dec 30, 2022
Huggingface Transformers + Adapters = ❤️

adapter-transformers A friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models adapter-transformers is an extension of

AdapterHub 1.2k Jan 9, 2023
Code for lyric-section-to-comment generation based on huggingface transformers.

CommentGeneration Code for lyric-section-to-comment generation based on huggingface transformers. Migrate Guyu model and code (both 12-layers and 24-l

Yawei Sun 8 Sep 4, 2021
Partially offline multi-language translator built upon Huggingface transformers.

Translate Command-line interface to translation pipelines, powered by Huggingface transformers. This tool can download translation models, and then us

Richard Jarry 8 Oct 25, 2022
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 4, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 5, 2022
Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5

NLP-Summarizer Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5 This project aimed to provide in

Samuel Sharkey 1 Feb 7, 2022
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

Princeton Natural Language Processing 92 Dec 27, 2022
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing We released the 2.0.0 version with TF2 Support. If you

Eliyar Eziz 2.3k Dec 29, 2022