Training COMET in a seq2seq setting
Use AutoModelForSeq2SeqLM in Hugging Face Transformers to train COMET. The code is modified from run_summarization.py in the official examples for transformers version 4.16.0.dev0.
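For reference, here is a minimal sketch of what loading and querying such a seq2seq COMET model looks like with AutoModelForSeq2SeqLM. The "head event + relation" prompt format follows the usual COMET-ATOMIC2020 convention and is an assumption, not taken from this repo's code.

```python
# Minimal sketch (not the training script): load a seq2seq model and generate a
# tail event for a head event + relation. The prompt format is an assumption.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # or a trained checkpoint, e.g. data/models/t5-small
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("PersonX goes to the store xIntent", return_tensors="pt")
outputs = model.generate(**inputs, max_length=18, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```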
The ./deepspeed/ folder is copied from https://github.com/huggingface/transformers/tree/master/tests/deepspeed.
The ATOMIC2020 training data can be downloaded at https://allenai.org/data/atomic-2020. You need to convert the .tsv files to .csv to be compatible with the dataloader in transformers; a conversion sketch is given below.
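One possible conversion looks like the following. The head_event / tail_event column names match the --text_column / --summary_column flags used in the commands below; the .tsv column layout and the appended "[GEN]" marker are assumptions in the style of COMET-ATOMIC2020, so adapt them to the actual files.

```python
# Hypothetical ATOMIC2020 tsv -> csv conversion (column names and "[GEN]" suffix
# are assumptions; adjust to the real file layout).
import pandas as pd

df = pd.read_csv("train.tsv", sep="\t", header=None, names=["head", "relation", "tail"])
df = df.dropna()
df["head_event"] = df["head"] + " " + df["relation"] + " [GEN]"
df["tail_event"] = df["tail"]
df[["head_event", "tail_event"]].to_csv("train.csv", index=False)
```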
Dependencies
Python:
- torch==1.7.1
- cudatoolkit=11.0
- transformers==4.15.0
- deepspeed==0.5.10
Others:
- GCC/G++ 5.2.0 (to compile DeepSpeed ops)
Usage
1. Normal training without memory optimization:
```bash
CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5
```
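For orientation, here is roughly how the Trainer-level flags above map onto Seq2SeqTrainingArguments inside a run_summarization.py-style script. This is a sketch, not the repo's actual argument parsing; max_source_length and max_target_length are data arguments and therefore not part of the training arguments.

```python
# Sketch: the Trainer-level flags above expressed as Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="data/models/t5-small",
    overwrite_output_dir=True,
    do_train=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    num_train_epochs=3,
    save_strategy="epoch",
)
```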
2. Training with gradient_checkpointing=True: lower memory usage, at the cost of training speed.
```bash
CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --gradient_checkpointing
```
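What the flag does, in a minimal sketch: activation checkpointing drops intermediate activations during the forward pass and recomputes them in the backward pass, which is where both the memory savings and the slowdown come from.

```python
# Sketch of enabling gradient checkpointing outside the Trainer.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.gradient_checkpointing_enable()  # same effect as passing --gradient_checkpointing
model.config.use_cache = False         # the decoder cache is incompatible with checkpointing during training
```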
3. Training with DeepSpeed (either ZeRO stage 2 or ZeRO stage 3)
```bash
# google/t5-xl-lm-adapt (~3B parameters) training, e.g. on 2080Ti (11GB) GPUs
deepspeed --include localhost:0,1 --master_port 30000 models/comet_seq2seq.py \
    --deepspeed deepspeed/ds_config_zero2.json \
    --model_name_or_path google/t5-xl-lm-adapt \
    --do_train \
    --train_file data/kg/atomic2020_data-feb2021/train.csv \
    --source_prefix "" \
    --output_dir data/models/comet/t5_xl_s2_bs32_fp16 \
    --overwrite_output_dir \
    --gradient_accumulation_steps=1 \
    --per_device_train_batch_size=16 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16
```
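The ZeRO stage-2 config used here is ./deepspeed/ds_config_zero2.json from the transformers tests. As a sketch, its key fields look roughly like the dict below (the exact contents may differ); the Trainer can also take such a dict directly via the deepspeed training argument, and it fills in the "auto" placeholders from the other training arguments.

```python
# Illustrative ZeRO stage-2 config as a Python dict (see deepspeed/ds_config_zero2.json
# for the file actually used; field values here are a sketch).
ds_config_zero2 = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                                            # shard optimizer states and gradients
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": "auto",
}
```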
4. Comparison of memory usage across the different memory-optimization methods
Memory usage is compared on an NVIDIA RTX A6000 (48685 MB) and an NVIDIA GeForce RTX 3090 (24268 MB).
1. fp16
T5-3B: fp16 reduces memory usage by roughly a third (47.5k MB to 31k MB on the A6000).
Method | Device | fp16 | Batch Size x Grad-Accum x Num-GPU | Memory Usage | Time to Train a Batch
---|---|---|---|---|---
vanilla | A6000 | False | 8x4x1 | 47.5k MB | 1.5s/32ex
vanilla | A6000 | True | 8x4x1 | 31k MB | 1.0s/32ex
vanilla | 3090 | False | 1x32x1 | - | -
vanilla | 3090 | True | 1x32x1 | - | -
2. gradient_checkpointing
T5-3B: Effects of gradient_checkpointing.
Method | Device | fp16 | Batch Size x Grad-Accum x Num-GPU | Memory Usage | Time to Train a Batch
---|---|---|---|---|---
vanilla | A6000 | False | 8x4x1 | 47k MB | 1.5s/32ex
vanilla | A6000 | True | 8x4x1 | 31k MB | 1.0s/32ex
grad-ckpt | A6000 | False | 8x4x1 | 46.4k MB | 1.3s/32ex
grad-ckpt | A6000 | True | 8x4x1 | 23.9k MB | 1.1s/32ex
vanilla | 3090 | True | 1x32x1 | - | -
grad-ckpt | 3090 | True | 1x32x1 | 23.8k MB | 15s/32ex
3. DeepSpeed ZeRO stage 2
T5-3B: Effects of DeepSpeed ZeRO stage 2.
Method | Device | fp16 | Batch Size x Grad-Accum x Num-GPU | Memory Usage | Time to Train a Batch
---|---|---|---|---|---
vanilla | 3090 | True | 1x32x1 | - | -
grad-ckpt | 3090 | True | 1x32x1 | 23k MB | 13.5s/32ex
stage2 | 3090 | True | 32x1x1 | 20.3k MB | 7.5s/32ex
stage2 | 3090 | True | 16x1x2 | 20.3k MB | 6.36s/32ex
stage2 | 3090 | True | 32x1x2 | 20.3k MB | 3.75s/32ex
4. DeepSpeed ZeRO stage 3
Stage 3 reduces memory usage further, but at a much lower training speed.
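Switching to stage 3 is mostly a config change; the sketch below shows what the zero_optimization block roughly looks like (field values are assumptions, see the shipped deepspeed/ds_config_zero3.json for the reference). Stage 3 additionally partitions the model parameters and can offload them to CPU, which is where the extra memory savings and the extra time come from.

```python
# Illustrative zero_optimization block for ZeRO stage 3 (sketch only; values are
# assumptions, see deepspeed/ds_config_zero3.json for the real config).
zero3_optimization = {
    "stage": 3,                                                # also shard the model parameters
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "offload_param": {"device": "cpu", "pin_memory": True},    # parameter offload is stage-3 only
    "overlap_comm": True,
    "contiguous_gradients": True,
}
```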
5. Automatic evaluation results on ATOMIC2020 data
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr
---|---|---|---|---|---|---|---
T5-3B (no deepspeed), lr1e-5, epoch 3 | 0.346 | 0.184 | 0.12 | 0.084 | 0.19 | 0.422 | 0.646
T5-3B (no deepspeed), lr1e-5, epoch 2 | 0.348 | 0.185 | 0.121 | 0.085 | 0.19 | 0.424 | 0.651
T5-3B (no deepspeed), lr1e-5, epoch 1 | 0.343 | 0.177 | 0.113 | 0.079 | 0.186 | 0.416 | 0.629
T5-3B (ds_stage2, fp16), epoch 3 | 0.340 | 0.182 | 0.118 | 0.083 | 0.189 | 0.418 | 0.637
T5-3B (ds_stage2, fp16), epoch 2 | 0.337 | 0.177 | 0.114 | 0.078 | 0.189 | 0.419 | 0.633
T5-3B (ds_stage2, fp16), epoch 1 | 0.335 | 0.174 | 0.112 | 0.076 | 0.186 | 0.415 | 0.632
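The scores above come from the standard COMET-ATOMIC2020 evaluation pipeline. As a rough, self-contained way to compute the n-gram metrics (not the repo's evaluation code), something like the following works with the Hugging Face evaluate library; CIDEr usually needs an extra package such as pycocoevalcap and is omitted here.

```python
# Hedged sketch: score generated tails against gold tails with BLEU / ROUGE-L / METEOR.
import evaluate

predictions = ["to have fun"]               # one generation per (head, relation)
references = [["to have fun", "to relax"]]  # gold tails grouped per example

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references, max_order=4))  # BLEU-4
print(rouge.compute(predictions=predictions, references=[r[0] for r in references]))
print(meteor.compute(predictions=predictions, references=[r[0] for r in references]))
```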
Useful discussions regarding environment setups
- Errors building DeepSpeed Ops: https://github.com/microsoft/DeepSpeed/issues/885
TODO
- DeepSpeed without Trainer(): https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-non-trainer-integration (see the sketch below)
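A minimal sketch of what that non-Trainer integration looks like, following the linked docs (untested here; the batch size, learning rate, and choice of ZeRO stage 3 are placeholder assumptions). Note that outside the Trainer the "auto" placeholders in the shipped JSON configs are not resolved for you, so concrete values are needed.

```python
# Sketch of DeepSpeed without Trainer(), per the linked HF docs (assumed values).
import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig  # location in transformers 4.15

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

# Must be created before from_pretrained so ZeRO-3 partitioned init kicks in; keep it alive.
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config_params=ds_config
)
```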