Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Pytorch Lightning

Last update: Dec 21, 2022

Related tags

Overview

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

What is Lightning Transfomers • Using Lightning Transformers • Docs • Community • License

Installation

Option 1: from PyPI

pip install lightning-transformers
# instead of: `python train.py ...`, run with:
pl-transformers-train ...

Option 2: from source

git clone https://github.com/PyTorchLightning/lightning-transformers.git
cd lightning-transformers
pip install .
python train.py ...
# the `pl-transformers-train` endpoint is also available!

What is Lightning-Transformers

Lightning Transformers offers a flexible interface for training and fine-tuning SOTA Transformer models using the PyTorch Lightning Trainer.

Train using HuggingFace Transformers models and datasets with Lightning custom Callbacks, Loggers, Accelerators and high performance scaling.
Seamless Memory and Speed Optimizations such as DeepSpeed ZeRO or FairScale Sharded Training with no code changes.
Powerful config composition backed by Hydra - Easily swap out models, optimizers, schedulers and many more configurations without touching the code.
Transformer Task Abstraction for Rapid Research & Experimentation - Built from the ground up to be task agnostic, the library supports creating transformer tasks across all modalities with little friction.

Lightning Transformers tasks allow you to train models using HuggingFace Transformer models and datasets, use Hydra to hotswap models, optimizers or schedulers and leverage all the advances features that Lightning has to offer, including custom Callbacks, Loggers, Accelerators and high performance scaling with minimal changes.

Using Lightning-Transformers

Grid is our platform for training models at scale on the cloud! Sign up here.

Task	Quick Commands	Run
Language Modeling	`python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer.gpus=1 training.batch_size=8`
Multiple Choice	`python train.py task=nlp/multiple_choice dataset=nlp/multiple_choice/race trainer.gpus=1`
Question Answering	`python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=1`
Summarization	`python train.py task=nlp/summarization dataset=nlp/summarization/xsum trainer.gpus=1`
Text Classification	`python train.py task=nlp/text_classification dataset=nlp/text_classification/emotion trainer.gpus=1`
Token Classification	`python train.py task=nlp/token_classification dataset=nlp/token_classification/conll trainer.gpus=1`
Translation	`python train.py task=nlp/translation dataset=nlp/translation/wmt16 trainer.gpus=1`

Quick recipes

Train bert-base-cased on the CARER emotion dataset using the Text Classification task.

python train.py \
    task=nlp/text_classification \
    dataset=nlp/text_classification/emotion

See the composed Hydra config used under-the-hood

optimizer:
  _target_: torch.optim.AdamW
  lr: ${training.lr}
  weight_decay: 0.001
scheduler:
  _target_: transformers.get_linear_schedule_with_warmup
  num_training_steps: -1
  num_warmup_steps: 0.1
training:
  run_test_after_fit: true
  lr: 5.0e-05
  output_dir: .
  batch_size: 16
  num_workers: 16
trainer:
  _target_: pytorch_lightning.Trainer
  logger: true
  checkpoint_callback: true
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: null
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: null
  sync_batchnorm: false
  precision: 32
  weights_summary: top
  weights_save_path: null
  num_sanity_val_steps: 2
  truncated_bptt_steps: null
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  plugins: null
  amp_backend: native
  amp_level: O2
  move_metrics_to_cpu: false
task:
  _recursive_: false
  backbone: ${backbone}
  optimizer: ${optimizer}
  scheduler: ${scheduler}
  _target_: lightning_transformers.task.nlp..text_classification.TextClassificationTransformer
  downstream_model_type: transformers.AutoModelForSequenceClassification
dataset:
  cfg:
    batch_size: ${training.batch_size}
    num_workers: ${training.num_workers}
    dataset_name: emotion
    dataset_config_name: null
    train_file: null
    validation_file: null
    test_file: null
    train_val_split: null
    max_samples: null
    cache_dir: null
    padding: max_length
    truncation: only_first
    preprocessing_num_workers: 1
    load_from_cache_file: true
    max_length: 128
    limit_train_samples: null
    limit_val_samples: null
    limit_test_samples: null
  _target_: lightning_transformers.task.nlp.text_classification.TextClassificationDataModule
experiment_name: ${now:%Y-%m-%d}_${now:%H-%M-%S}
log: false
ignore_warnings: true
tokenizer:
  _target_: transformers.AutoTokenizer.from_pretrained
  pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
  use_fast: true
backbone:
  pretrained_model_name_or_path: bert-base-cased

Swap the backbone to RoBERTa and the optimizer to RMSprop:

python train.py \
    task=nlp/text_classification \
    dataset=nlp/text_classification/emotion
    backbone.pretrained_model_name_or_path=roberta-base
    optimizer=rmsprop

See the changed Hydra config under-the-hood

 optimizer:
-  _target_: torch.optim.AdamW
+  _target_: torch.optim.RMSprop
   lr: ${training.lr}
-  weight_decay: 0.001
 scheduler:
   _target_: transformers.get_linear_schedule_with_warmup
   num_training_steps: -1
....
tokenizer:
   pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
   use_fast: true
 backbone:
-  pretrained_model_name_or_path: bert-base-cased
+  pretrained_model_name_or_path: roberta-base

Enable Sharded Training.

python train.py \
    task=nlp/text_classification \
    dataset=nlp/text_classification/emotion \
    trainer=sharded

See the changed Hydra config under-the-hood

Without the need to modify any code, the config updated automatically for sharded training:

optimizer:
   _target_: torch.optim.AdamW
   lr: ${training.lr}
trainer:
   process_position: 0
   num_nodes: 1
   num_processes: 1
-  gpus: null
+  gpus: 1
   auto_select_gpus: false
   tpu_cores: null
   log_gpu_memory: null
   ...
   log_every_n_steps: 50
-  accelerator: null
+  accelerator: ddp
   sync_batchnorm: false
-  precision: 32
+  precision: 16
   weights_summary: top
   ....
   terminate_on_nan: false
   auto_scale_batch_size: false
   prepare_data_per_node: true
-  plugins: null
+  plugins:
+    _target_: pytorch_lightning.plugins.DDPShardedPlugin
   amp_backend: native
   amp_level: O2
   move_metrics_to_cpu: false
tokenizer:
   pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
   use_fast: true
 backbone:
   pretrained_model_name_or_path: bert-base-cased

Enable DeepSpeed ZeRO Training.

python train.py \
    task=nlp/text_classification \
    dataset=nlp/text_classification/emotion \
    trainer=deepspeed

See the changed Hydra config under-the-hood

Without the need to modify any code, the config updated automatically for DeepSpeed:

optimizer:
   _target_: torch.optim.AdamW
   lr: ${training.lr}
trainer:
   process_position: 0
   num_nodes: 1
   num_processes: 1
-  gpus: null
+  gpus: 1
   auto_select_gpus: false
   tpu_cores: null
   log_gpu_memory: null
   ...
   val_check_interval: 1.0
   flush_logs_every_n_steps: 100
   log_every_n_steps: 50
-  accelerator: null
+  accelerator: ddp
   sync_batchnorm: false
-  precision: 32
+  precision: 16
   ...
-  plugins: null
+  plugins:
+    _target_: pytorch_lightning.plugins.DeepSpeedPlugin
+    stage: 2
+    cpu_offload: true
   amp_backend: native
   amp_level: O2
   move_metrics_to_cpu: false
...

Train with a pre-trained t5-base backbone, on the XSUM dataset using the Summarization task.

python train.py \
    task=nlp/summarization \
    dataset=nlp/summarization/xsum \
    backbone.pretrained_model_name_or_path=t5-base

Train with a pre-trained mt5-base backbone, on the WMT16 dataset using the Translation task with 2 GPUs.

python train.py \
    task=nlp/translation \
    dataset=nlp/translation/wmt16 \
    backbone.pretrained_model_name_or_path=google/mt5-base \
    trainer.gpus=2

Custom Files & Datasets

You can train, validate and test Lightning transformers tasks on your own data files, and you can extend datasets for custom processing and your own tasks.

How to train, validate and test on custom files

How to extend datasets

Custom Tasks

Extending the Language Modeling Task

Contribute

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Community

For help or questions, join our huge community on Slack!

License

Please observe the Apache 2.0 license that is listed in this repository. In addition, the Lightning framework is Patent Pending.

Comments

sparseml integration
Closes #196

Add option to use sparseml. This is still a little rough around the edges, but I welcome the feedback! Implementation on Google Colab.

Still needs:

Implementation of MODELS_PATH and RECIPE_PATH through Hydra instead of environment variable

"test" stage needs to work, only training and evaluation work at the moment

stylistic coding changes

Will continue to add more changes after my reactions engineering exam tomorrow. Feel free to give me pointers or links to tutorials if you see any changes I need to make.
enhancement
opened by mathemusician 12
compact packaging - all inside namespace

🚀 Feature

refactor to make the package more compact, move configs and train inside

Motivation

be able to call it from everywhere as python -m lightning_transformers.train --some asrgs

Pitch

Easier to use from everywhere reduce collision with configs (if any) from other packages

Alternatives

Additional context
enhancement help wanted wontfix

opened by Borda 11

Question answering example throws an exception even if sanity check is skipped

🐛 Bug

Running the squad example python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0 throws an exception while finalizing training. This is not a replication of #218

To Reproduce

Steps to reproduce the behavior:

Run python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
See error

Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 12442/12445 [44:35<00:00,  4.65it/s, loss=0.957
Error executing job with overrides: ['task=nlp/question_answering', 'dataset=nlp/question_answering/squad', 'trainer.gpus=[1]', 'training.batch_size=8', 'trainer.num_sanity_val_steps=0']3 [01:39<00:00, 13.59it/s]
Traceback (most recent call last):
  File "/home/vrt/lightning-transformers/train.py", line 10, in hydra_entry
    main(cfg)
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 69, in main
    run(
  File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 60, in run
    trainer.fit(model, datamodule=data_module)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
    self.trainer.call_hook(hook_name)
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/model.py", line 59, in on_validation_epoch_end
    metric_dict = self.metric.compute()
  File "/home/vrt/miniconda3/lib/python3.9/site-packages/torchmetrics/metric.py", line 380, in wrapped_func
    value = compute(*args, **kwargs)
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in compute
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
  File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in <listcomp>
    example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
KeyError: 0

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Environment

PyTorch Version: 1.6.0
OS: Ubuntu 18.04.6 LTS
How you installed PyTorch: conda
Python version: 3.9.7
CUDA/cuDNN version: 11.4
GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (First device not used)
Any other relevant information: The same error occurs during sanity check if trainer.num_sanity_val_steps=-1 is used, as in #184

bug / fix help wanted

opened by Pointy-Hat 10

Glue MNLI task fails due to missing 'validation' key in dataset
🐛 Bug

MNLI has two validation and two test sets, called validation_matched, validation_mismatched, test_matched and test_matched. I assume that this was not taken into account in the datamodule.

To Reproduce

Steps to reproduce the behavior:

Run the following command:

python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli

Expected behavior

I would expect the dataloader to handle the special case of MNLI and load validation_matched and test_matched by default. Maybe add an option to additionally test on test_mismatched as well, when desired.

Environment

A standard pip install from source, as of 2021.12.01. Fails with or without GPU.
bug / fix help wanted
opened by mariomeissner 10
Fix protobuf version

See https://github.com/Lightning-AI/lightning-transformers/pull/284#issuecomment-1243050284

This is based upon @rohitgr7 's awesome https://github.com/Lightning-AI/lightning-transformers/pull/284

opened by turian 8
Should we get rid of Hydra?

Motivation

This is motivated by recent development of LightningCLI which IMO is simpler, and better supported primarily because it exists within Lightning.

Over time I've noticed that due to the changes upstream in Lightning, hydra default configs are going out of sync every so often.

I also notice that there are fundamental issues such as this https://github.com/PyTorchLightning/lightning-transformers/issues/149 where the confs exist within the package. We assume that the user will always clone the repo and make their modifications there. I would be curious if this works for users currently.

Pitch

Remove Hydra, treating this library more as a set of components people can import into their own scripts. If we'd like to keep the ability to run training and testing from the library, we should use the LightningCLI instead of Hydra.

Alternatives

Keep as is :)

cc @mathemusician @Borda @carmocca
help wanted User Experience refactor & code health

opened by SeanNaren 8
Use and rename special subset name if provided
This PR addresses issue, closes #213

I modify the load_dataset function in core/nlp/data.py to find and use special subset names if provided, then rename them back to the standard names to avoid failures further along.

Cannot be done in [subset]_dataloader function as initially proposed, because other functions such as _select_samples are called before that (and would not work well).

Two main reasons for patching core/nlp/data/py instead of core/data.py:

load_dataset is not present in the former. Could create a function there and then let the function in the latter call that first...

Never saw this issue happen in speech or vision, it really seems to be an MNLI only thing for now.

It's up to debate if this should really be moved to core/data.py instead.

This PR should likely also include documentation changes, but I didn't check yet how to do that. Pointers appreciated.

I tested the following commands successfully:

# A normal glue task without any subset name issues CUDA_VISIBLE_DEVICES=3 python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=sst2 trainer.val_check_interval=0.01 training.batch_size=128 trainer.gpus=1

# MNLI issue with renamed subset, smaller validation size and more frequent validation to test that subsampling also works fine CUDA_VISIBLE_DEVICES=3 python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli ++dataset.cfg.validation_subset_name=validation_matched trainer.val_check_interval=0.01 training.batch_size=128 trainer.gpus=1 ++dataset.cfg.limit_val_samples=256
enhancement
opened by mariomeissner 7
Add an option to use Huggingface metrics

🚀 Feature

Support Huggingface metrics.

Motivation

Torchmetrics are great, but there are many metrics that are not publicly available. Luckily, Huggingface implemented lots of them. Can you please an easy way to add metrics from Huggingface?

Pitch

Specifying a metric from Huggingface, making sure I give it the correct arguments without needing to implement it on my own.
enhancement help wanted

opened by yuvalkirstain 7

Custom data file for classification seems to be failing

🐛 Bug

When training a classification model on custom data file, the training fails because it expect num_classes

To Reproduce

Use this collab: https://colab.research.google.com/drive/1uamw6SNaOr_4ch24JNxAj2yfgLUKfJqO?usp=sharing Error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/cli/train.py", line 84, in hydra_entry
    main(cfg)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/cli/train.py", line 78, in main
    logger=logger,
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/cli/train.py", line 61, in run
    trainer.fit(model, datamodule=data_module)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self.call_setup_hook(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1066, in call_setup_hook
    model.setup(stage_name)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/core/model.py", line 88, in setup
    self.configure_metrics(stage)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/task/nlp/text_classification/model.py", line 61, in configure_metrics
    self.prec = Precision(num_classes=self.num_classes)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 948, in __getattr__
    type(self).__name__, name))
AttributeError: 'TextClassificationTransformer' object has no attribute 'num_classes'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Expected behavior

It should start training.

Environment

check the notebook

bug / fix help wanted wontfix

opened by zippeurfou 7

Ver0.2.4 compatibility PL v1.8

Suggest to assign to

Ver0.2.4 new NotImplementedError: LightningDataModule.on_load_checkpoint was deprecated in v1.6 and is no longer supported as of v1.8. Use load_state_dict instead.

🐛 Bug

model.fit fails owing to Pytorch-Lightning checks failure

> File ~\miniconda3\envs\UnBias-99-5\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:61, in verify_loop_configurations(trainer)
>      59 _check_deprecated_logger_methods(trainer)
>      60 # TODO: Delete this check in v2.0
> ---> 61 _check_unsupported_datamodule_hooks(trainer)

To Reproduce

trainer.fit(model,datamodel)

> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> Input In [58], in <cell line: 1>()
> ----> 1 trainer.fit(model,dm)
> 
> File ~\miniconda3\envs\MyEnv\lib\site-packages\pytorch_lightning\trainer\trainer.py:579, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
>     577     raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.__class__.__qualname__}")
>     578 self.strategy._lightning_module = model
> --> 579 call._call_and_handle_interrupt(
>     580     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
>     581 )
> 
> File ~\miniconda3\envs\MyEnv\lib\site-packages\pytorch_lightning\trainer\call.py:38, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
>      36         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
>      37     else:
> ---> 38         return trainer_fn(*args, **kwargs)
>      40 except _TunerExitException:
>      41     trainer._call_teardown_hook()
> 
> File ~\miniconda3\envs\MyEnv\lib\site-packages\pytorch_lightning\trainer\trainer.py:621, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
>     614 ckpt_path = ckpt_path or self.resume_from_checkpoint
>     615 self._ckpt_path = self._checkpoint_connector._set_ckpt_path(
>     616     self.state.fn,
>     617     ckpt_path,  # type: ignore[arg-type]
>     618     model_provided=True,
>     619     model_connected=self.lightning_module is not None,
>     620 )
> --> 621 self._run(model, ckpt_path=self.ckpt_path)
>     623 assert self.state.stopped
>     624 self.training = False
> 
> File ~\miniconda3\envs\MyEnv\lib\site-packages\pytorch_lightning\trainer\trainer.py:984, in Trainer._run(self, model, ckpt_path)
>     981 self._callback_connector._attach_model_callbacks()
>     982 self._callback_connector._attach_model_logging_functions()
> --> 984 verify_loop_configurations(self)
>     986 # hook
>     987 log.detail(f"{self.__class__.__name__}: preparing data")
> 
> File ~\miniconda3\envs\MyEnv\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:61, in verify_loop_configurations(trainer)
>      59 _check_deprecated_logger_methods(trainer)
>      60 # TODO: Delete this check in v2.0
> ---> 61 _check_unsupported_datamodule_hooks(trainer)
> 
> File ~\miniconda3\envs\MyEnv\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:295, in _check_unsupported_datamodule_hooks(trainer)
>     290     raise NotImplementedError(
>     291         "`LightningDataModule.on_save_checkpoint` was deprecated in v1.6 and is no longer supported as of v1.8."
>     292         " Use `state_dict` instead."
>     293     )
>     294 if is_overridden("on_load_checkpoint", datahook_selector.datamodule):
> --> 295     raise NotImplementedError(
>     296         "`LightningDataModule.on_load_checkpoint` was deprecated in v1.6 and is no longer supported as of v1.8."
>     297         " Use `load_state_dict` instead."
>     298     )
> 
> NotImplementedError: `LightningDataModule.on_load_checkpoint` was deprecated in v1.6 and is no longer supported as of v1.8. Use `load_state_dict` instead.

Code sample

import os
from accelerate import (init_empty_weights)
from transformers import (FlaubertTokenizer, FlaubertWithLMHeadModel, TrainingArguments, DataCollatorForLanguageModeling) 
from datasets import (load_dataset)
import pytorch_lightning as pl
from lightning_transformers.task.nlp.masked_language_modeling import (MaskedLanguageModelingTransformer, MaskedLanguageModelingDataModule)

dataset = load_from_disk(os.path.join(drive_letter,os.path.join(dataset_dir, 'dataset')))
dataset = dataset.remove_columns(["text"])
dataset = dataset.shuffle()
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask', 'labels'], device=device) 

LM_tokenizer = FlaubertTokenizer.from_pretrained('./tokenizer/FlauBERT_tokenizer', do_lowercase=False)

with init_empty_weights():
  model = MaskedLanguageModelingTransformer(
              pretrained_model=LMhead_model,
              tokenizer=LM_tokenizer,
              load_weights=False,
              low_cpu_mem_usage=True,
              device_map="auto"
              #deepspeed_sharding=True,  # Linux only, defer initialization of the model to shard/load pre-train weights
          )

batch_size=2

datamodel = MaskedLanguageModelingDataModule(
    batch_size=batch_size,
    dataset=dataset,
    tokenizer=LM_tokenizer,
    num_workers=os.cpu.count())

trainer = pl.Trainer(
    accelerator="auto",
    devices="auto",
    #strategy="deepspeed_stage_3", # linux only
    precision=16,
    max_epochs=1,
    #strategy='dp',
    #auto_lr_find=True,
    #detect_anomaly=True
    #val_check_interval=0
    #progress_bar_refresh_rate=50
)

trainer.fit(model,datamodel)

Expected behavior

Trainer fits

Environment

PyTorch Version (e.g., 1.0): 1.2.1
OS (e.g., Linux): Windows 10
How you installed PyTorch (conda, pip, source): conda

py3.9_cuda11.6_cudnn8_0 pytorch

Python version: 3.9
CUDA/cuDNN version: CUDA 11.6, cuDNN 8.0
GPU models and configuration: NVIDIA Quadro RTX 3000
Any other relevant information: none

Additional context

Comes on top of 0.2.3 and despite 0.2.4 release

bug / fix help wanted

opened by BenoitDalFerro 6

ViT Image Classification Support

Starter code

import pytorch_lightning as pl
from transformers import AutoFeatureExtractor

from lightning_transformers.task.vision.image_classification import (
    ImageClassificationDataConfig,
    ImageClassificationDataModule,
    ImageClassificationTransformer,
)

MODEL_NAME = "nateraw/vit-base-beans"
feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_model_name_or_path=MODEL_NAME)

dm = ImageClassificationDataModule(
    cfg=ImageClassificationDataConfig(
        batch_size=8,
        dataset_name="beans",
        num_workers=8
    ),
    feature_extractor=feature_extractor,
)

model = ImageClassificationTransformer(pretrained_model_name_or_path=MODEL_NAME)
trainer = pl.Trainer(accelerator="gpu", max_epochs=5)
trainer.fit(model, dm)

opened by tanmoyio 6

`TransformerDataModule.setup()` run more than once unnecessarily

🐛 Bug

TransformerDataModule.setup() is run more than once unnecessarily. For example, when running the code included below, it runs setup() when calling dm.num_classes and then when calling trainer.fit(model, dm).

setup() then calls self.load_dataset(), self.split_dataset(dataset) and self.process_data(dataset, stage=stage). Calling self.load_dataset() several times is not a big deal because it will load it from the cache, but the other two methods are expensive and I think it does not make sense to run them again (since they just override whatever self.ds was there before.

To Reproduce

Take the below example from the docs and just check the console output or run it in debug mode with a breakpoint. It can be seen that TransformerDataModule.setup() and the subsequent methods load_dataset(), split_dataset() and are run more than once.

import pytorch_lightning as pl
from transformers import AutoTokenizer

from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataModule,
    TextClassificationTransformer,
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
dm = TextClassificationDataModule(
    batch_size=1,
    dataset_name="glue",
    dataset_config_name="sst2",
    max_length=512,
    tokenizer=tokenizer,
)
model = TextClassificationTransformer(pretrained_model_name_or_path="bert-base-uncased", num_labels=dm.num_classes)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)

trainer.fit(model, dm)

Expected behavior

Given that TransformerDataModule.setup() currently does the following:

def setup(self, stage: Optional[str] = None): 
  dataset = self.load_dataset()
  dataset = self.split_dataset(dataset)
  dataset = self.process_data(dataset, stage=stage)
  self.ds = dataset

Perhaps a way to avoid running it again would be creating the class attribute self.setup_stages_run = [] when the class is initialized and then defining the setup method as:

    def setup(self, stage: Optional[str] = None): 
        # Load and split dataset only if setup has not been run before
        if len(self.setup_stages_run) == 0: 
            dataset = self.load_dataset()
            dataset = self.split_dataset(dataset)
        else:
            dataset = self.ds

        # Process dataset only if setup has not been run before for this stage    
        if stage not in self.setup_stages_run:            
            self.ds = self.process_data(dataset, stage=stage)
            self.setup_stages_run.append(stage)

Can create a PR if you think this makes sense. Thanks!

enhancement help wanted

opened by RR-28023 0

Shuffling support

🚀 Feature

Add option for data shuffling in core/data.py Data shuffling is crucial for removing dataset structure bias.

Motivation

I noticed my model was not performing well when I was using a custom dataset with spikes in performance across the epoch. I then realized it was because the class data was in sequence, and there was no shuffling performed by default. I then looked into the code but couldn't find any option to add shuffling: core/data.py I had to then overwrite 3 functions, train_dataloader, val_dataloader, test_dataloader in order to get this functionality.

Pitch

Add a boolean shuffling argument in the constructor that enables this.

Alternatives

Additional context
enhancement help wanted

opened by juliusfrost 0
Jointly train Question Answering and Multiple Choice

🚀 Feature: Multi task NLP model

For such an IELTS exam paper, there are several types of questions such as Question Answering and Multiple Choice. The current implementation of lightning_transformer does well for a single task but I wonder whether a case to jointly train 2 tasks at the same time? Because the context will be shared during two tasks, therefore sharing the encoder will be beneficial.

Alternatives

I found a reference to do this directly on huggingface transformer but dont know how to structure it to adapt with lightning transformers. https://colab.research.google.com/github/zphang/zphang.github.io/blob/master/files/notebooks/Multi_task_Training_with_Transformers_NLP.ipynb#scrollTo=xW8bnTgCsx5c
enhancement help wanted

opened by tmquan 1
Deepspeed sharding and load from checkpoint with custom lightning module - setup() not called during checkpoint loading
❓ Questions and Help

Before asking:

search the issues.

search the docs.

What is your question?

Hi, I'm doing training from scratch using deepspeed, pytorch lightning, and transformers in a multi node setting, and wanted to know how to setup the code to handle loading from a pytorch checkpoint.

Going off of the docs here, I see that the model is intended to be defined in setup(). However, this doesn't work when loading from a state dict since setup is not called. What's the right way to structure the code here? Does enable_transformers_pretrained_deepspeed_sharding need to be called in setup or can it be called in the constructor?

This has been my potential workaround in the constructor, because it does seem to fail on certain ranks

def __init__(self, config): # irrelevant constructor things here try: enable_transformers_pretrained_deepspeed_sharding(self) except AttributeError: pl.utilities.rank_zero.rank_zero_warn( "Transformers sharding initialization not enabled..." ) # needed to load from checkpoint self.model = AutoModelForCausalLM.from_config(self.base_config)

As opposed to:

def setup(self, stage): if not hasattr(self, "model"): try: enable_transformers_pretrained_deepspeed_sharding(self) ### sometimes using ddp for inference so this will fail except AttributeError: pl.utilities.rank_zero.rank_zero_warn( "Transformers sharding initialization not enabled - likely not using DeepSpeed..." ) self.model = AutoModelForCausalLM.from_config(self.base_config)

Code

What have you tried?

What's your environment?

Linux, conda/pip, deepspeed==0.7.3 pytorch-lightning==1.6.5 lighting-transformers==0.2.1

OS: [e.g. iOS, Linux, Win]

Packaging [e.g. pip, conda]

Version [e.g. 0.5.2.1]

Thanks in advance for the help!
question
opened by maxzvyagin 2
Can you demonstrate how to fine-tune a pretrained model on unlabeled data
🚀 Feature

Documentation or example showing how to fine-tune a pretrained model on unlabeled data.

Motivation

It's great to fine-tune your pretrained model on untrained data, so that---if you have precious few labels in the target domain---you still have adapted to that domain using untrained data.

Pitch

We have these super huge foundational models, but for niche domains without larges it's great to fine tune. Examples:

Want to work on a particular style of text.

Want to fine-tune on a spoken language that it was not exposed.

etc.

Alternatives

Hack around, maybe use hugging face. IDK?
enhancement help wanted
opened by turian 6
TextClassificationTransformer should log torchmetrics object instead of computed Tensors
🐛 Bug

In line 58 of the TextClassificationTransformer.common_step() method (https://github.com/Lightning-AI/lightning-transformers/blob/master/lightning_transformers/task/nlp/text_classification/model.py#L58), Logging is called with a dictionary of metric values computed for the current batch. I am new to this package but I believe the logger has to be called with the torchmetrics.Metric subclass for cases where the computed value is different from the average of the per-batch values, such as for class Precision - otherwise the aggregation gives wrong results. Similar code exists in TokenClassificationTransformer and ImageClassificationTransformer as well.

To Reproduce

Using TextClassificationTransformer with Precision and Recall metrics (as configured in the default) will result in inaccurate per-epoch values for validation and testing.

Environment

PyTorch Version: 1.12.1

OS: Linux

How you installed PyTorch (conda, pip, source): pip

Build command you used (if compiling from source):

Python version: 3.7

CUDA/cuDNN version:

help wanted question
opened by stefan-schroedl 1

Releases(0.2.5)

0.2.5(Nov 21, 2022)
[0.2.5] - 2022-11-21

Fixed

Fixed loading HF model (#306)

Fixed passing config name to CNNDailyMailSummarizationDataModule (#310)

Fixed move pipeline to self.device as default (#309)

New Contributors

@lantiga made their first contribution in https://github.com/Lightning-AI/lightning-transformers/pull/305

Full Changelog: https://github.com/Lightning-AI/lightning-transformers/compare/0.2.4...0.2.5
Source code(tar.gz)
Source code(zip)
0.2.4(Nov 4, 2022)
[0.2.4] - 2022-11-04

Changed

Added support for Lightning v1.8.0 (#297)

Contributors

@rohitgr7

Full Changelog: https://github.com/Lightning-AI/lightning-transformers/compare/0.2.3...0.2.4
Source code(tar.gz)
Source code(zip)
0.2.3(Oct 8, 2022)

Use Lightning Utilities (#292) by @Borda

Full Changelog: https://github.com/Lightning-AI/lightning-transformers/compare/0.2.2...0.2.3
Source code(tar.gz)
Source code(zip)
0.2.2(Oct 7, 2022)

Update tests for latest PL release (#284) by @rohitgr7

Full Changelog: https://github.com/Lightning-AI/lightning-transformers/compare/0.2.1...0.2.2
Source code(tar.gz)
Source code(zip)
0.2.1(Jun 28, 2022)
⚡ Lightning Transformers 0.2.1

This is an incremental release with some documentation changes, DeepSpeed training support and a refactor to expose transformer model creation.

What's Changed

Refractor the code for model creation by @espoirMur in https://github.com/Lightning-AI/lightning-transformers/pull/268

Simplify Big Model inference support/Add DeepSpeed Training by @SeanNaren in https://github.com/Lightning-AI/lightning-transformers/pull/269

New Contributors

@espoirMur made their first contribution in https://github.com/Lightning-AI/lightning-transformers/pull/268

Full Changelog: https://github.com/Lightning-AI/lightning-transformers/compare/0.2.0...0.2.1
Source code(tar.gz)
Source code(zip)

0.2.0(Jun 23, 2022)

⚡ Lightning Transformers 0.2.0 Release

Below is a summary of the features/fixes we’ve added since the previous release!

ViT Image Classification 🚀

Thanks to @tanmoyio we now have ViT support within Lightning Transformers!

Here is a simple example showing how you can fine-tune a ViT model using Lightning Transformers:

import pytorch_lightning as pl
from transformers import AutoFeatureExtractor

from lightning_transformers.task.vision.image_classification import (
    ImageClassificationDataModule,
    ImageClassificationTransformer,
)

feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_model_name_or_path="nateraw/vit-base-beans")
dm = ImageClassificationDataModule(
    batch_size=8, 
    dataset_name="beans", 
    num_workers=8,
    feature_extractor=feature_extractor,
)
model = ImageClassificationTransformer(
    pretrained_model_name_or_path="nateraw/vit-base-beans", num_labels=dm.num_classes
)

trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=5)
trainer.fit(model, dm)

Save HuggingFace Hub Compatible Checkpoints 💾

Many users have requested the ability to save HF Hub-compatible models. Look no further, we offer manual support + saving an additional HF compatible checkpoint automatically during training.

from lightning_transformers.task.nlp.text_classification import TextClassificationTransformer

model = TextClassificationTransformer(pretrained_model_name_or_path="prajjwal1/bert-tiny")

# saves a HF checkpoint to this path.
model.save_hf_checkpoint("checkpoint")

To save via training, just pass the HFSaveCheckpoint plugin within your training code:

import pytorch_lightning as pl
from lightning_transformers.plugins.checkpoint import HFSaveCheckpoint
...

model = TextClassificationTransformer(pretrained_model_name_or_path="prajjwal1/bert-tiny")
trainer = pl.Trainer(plugins=HFSaveCheckpoint(model=model))
trainer.fit(model, dm)

Big Model Inference 🤖

As transformer models get larger, they require more compute to run. In Lightning Transformers, we've utilized HF Accelerate to allow users to run billion parameter model inference.

This will in turn allow people who do not have the GPU memory or compute to run these models, by leveraging CPU memory & compute and disk space!

import torch
from accelerate import init_empty_weights
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer

# initializes empty model for us to the load the checkpoint.
with init_empty_weights():
model = LanguageModelingTransformer(
   pretrained_model_name_or_path="EleutherAI/gpt-j-6B",
   tokenizer=AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
)

# automatically selects the best devices (cpu/gpu) to load model layers based on available memory
# even offloads to disk if necessary.
model.load_checkpoint_and_dispatch("sharded-gpt-j-6B", device_map="auto", no_split_module_classes=["GPTJBlock"])

output = model.generate("Hello, my name is", device=torch.device("cuda"))
print(model.tokenizer.decode(output[0].tolist()))

SparseML Support 🔍

We now have native support for SparseML! SparseML provides GPU-class performance on CPUs through sparsification, pruning, and quantization.

To enable SparseML, all you do is pass the callback to the Lightning Trainer with paths to your recipe!

import pytorch_lightning as pl

from lightning_transformers.callbacks import TransformerSparseMLCallback

pl.Trainer(
    callbacks=TransformerSparseMLCallback(
        output_dir="/content/MODELS",
        recipe_path="/content/recipe.yaml"
    )
)

See our medium blog post for more details.

Align with PyTorch Lightning ⚡

Within this release we've simplified the API, removing complicated internal boilerplate and configuration that should exist outside this library. Keeping this library minimal means easier extensibility and easier contributions for everyone 🔥

Thanks to all the contributors that helped out!

Source code(tar.gz)
Source code(zip)

0.2.0rc1(May 31, 2022)

Source code(tar.gz)
Source code(zip)
0.1.0(Apr 21, 2021)
The first release for Lightning Transformers!

Lightning Transformers offers a flexible interface for training and fine-tuning SOTA Transformer models using the PyTorch Lightning Trainer.

Train using HuggingFace Transformers models and datasets with Lightning custom Callbacks, Loggers, Accelerators and high performance scaling.

Seamless Memory and Speed Optimizations such as DeepSpeed ZeRO or FairScale Sharded Training with no code changes.

Powerful config composition backed by Hydra - Easily swap out models, optimizers, schedulers and many more configurations without touching the code.

Transformer Task Abstraction for Rapid Research & Experimentation - Built from the ground up to be task agnostic, the library supports creating transformer tasks across all modalities with little friction.

Lightning Transformers tasks allow you to train models using HuggingFace Transformer models and datasets, use Hydra to hotswap models, optimizers or schedulers and leverage all the advances features that Lightning has to offer, including custom Callbacks, Loggers, Accelerators and high performance scaling with minimal changes.

In this release, we introduce these Transformer Tasks to train and predict:

Casual Language Modeling

Multiple Choice

Question Answering

Summarization

Text Classification

Token Classification

Translation

Each task supports various datasets, see our documentation for more information!
Source code(tar.gz)
Source code(zip)