The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

PyTorch Lightning

Last update: Jan 1, 2023

Related tags

Deep Learning python data-science machine-learning ai deep-learning pytorch artificial-intelligence

Overview



*Codecov coverage is above 90%, but build delays may show less

NEWS

Dec 2020 - Read about how Facebook uses Lightning to standardize deep learning across research and production teams

PyTorch Lightning is just organized PyTorch

Lightning disentangles PyTorch code to decouple the science from the engineering.

Lightning Philosophy

Lightning is designed with these principles in mind:

Principle 1: Enable maximal flexibility.
Principle 2: Abstract away unnecessary boilerplate, but make it accessible when needed.
Principle 3: Systems should be self-contained (i.e. optimizers, computation code, etc.).
Principle 4: Deep learning code should be organized into 4 distinct categories.

Research code (the LightningModule).
Engineering code (you delete it, and it is handled by the Trainer).
Non-essential research code (logging, etc... this goes in Callbacks).
Data (use PyTorch Dataloaders or organize them into a LightningDataModule).

Once you do this, you can train on multiple GPUs, TPUs, CPUs and even in 16-bit precision without changing your code!

Get started with our 2 step guide

Inference

Lightning is also designed for the fast inference AI researchers and production teams need to scale up things like BERT and self-supervised learning. Lightning can automatically export to ONNX or TorchScript for those cases.

Continuous Integration

System / PyTorch ver.	1.3 (min. req.)*	1.4	1.5	1.6	1.7 (latest)	1.8 (nightly)
Conda py3.7 [linux]
Linux py3.7 [GPUs**]	-	-	-		-	-
Linux py3.{6,7} [TPUs***]	-	-	-			-
Linux py3.{6,7}		-	-	-		-
OSX py3.{6,7,8}	-		-	-		-
Windows py3.{6,7,8}		-	-	-		-

* torch>=1.4 is the minimal pytorch version for Python 3.8
** tests run on two NVIDIA K80
*** tests run on Google GKE TPUv2/3
TPU w/ py3.6/py3.7 means we support Colab and Kaggle env.

How To Use

Step 0: Install

Simple installation from PyPI

pip install pytorch-lightning

To get the full package experience, you can also install all optional dependencies with pytorch-lightning['extra'], or with pytorch-lightning['cpu-extra'] for CPU users.

From Conda

conda install pytorch-lightning -c conda-forge

Install bleeding-edge - future 1.2


Install future release from the source (no guarantees)

pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@release/1.2-dev --upgrade

or nightly from testing PyPI

pip install -U -i https://test.pypi.org/simple/ pytorch-lightning

Step 1: Add these imports

import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl

Step 2: Define a LightningModule (nn.Module subclass)

A LightningModule defines a full system (i.e. a GAN, autoencoder, BERT or a simple image classifier).

class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the training loop. It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

Note: training_step defines the training loop. forward defines how the LightningModule behaves during inference/prediction.

Step 3: Train!

dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train, val = random_split(dataset, [55000, 5000])

autoencoder = LitAutoEncoder()
trainer = pl.Trainer()
trainer.fit(autoencoder, DataLoader(train), DataLoader(val))

And without changing a single line of code, you could run on GPUs/TPUs

# 8 GPUs
trainer = pl.Trainer(max_epochs=1, gpus=8)

# 256 GPUs
trainer = pl.Trainer(max_epochs=1, gpus=8, num_nodes=32)

# TPUs
trainer = pl.Trainer(tpu_cores=8)

And even export for production via ONNX or TorchScript

# torchscript
autoencoder = LitAutoEncoder()
torch.jit.save(autoencoder.to_torchscript(), "model.pt")

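# onnx
import tempfile

with tempfile.NamedTemporaryFile(suffix='.onnx', delete=False) as tmpfile:
    autoencoder = LitAutoEncoder()
    # the sample must match the encoder's input size (a flattened 28 x 28 image)
    input_sample = torch.randn((1, 28 * 28))
    autoencoder.to_onnx(tmpfile.name, input_sample, export_params=True)
    # confirm the exported file exists
    assert os.path.isfile(tmpfile.name)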

For advanced users, you can still own complex training loops

class LitAutoEncoder(pl.LightningModule):
    def training_step(self, batch, batch_idx, optimizer_idx):
        # access your optimizers; use_pl_optimizer defaults to True, pass False to get the unwrapped optimizers
        (opt_a, opt_b) = self.optimizers(use_pl_optimizer=True)

        loss_a = ...
        self.manual_backward(loss_a, opt_a)
        opt_a.step()
        opt_a.zero_grad()

        loss_b = ...
        self.manual_backward(loss_b, opt_b, retain_graph=True)
        self.manual_backward(loss_b, opt_b)
        opt_b.step()
        opt_b.zero_grad()

Key Features

Scale your models to run on any hardware (CPU, GPUs, TPUs) without changing your model
Makes code more readable by decoupling the research code from the engineering
Easier to reproduce
Less error prone by automating most of the training loop and tricky engineering
Keeps all the flexibility (LightningModules are still PyTorch modules), but removes a ton of boilerplate
Lightning has out-of-the-box integration with the popular logging/visualizing frameworks (Tensorboard, MLFlow, Neptune.ai, Comet.ml, Wandb).
Tested rigorously with every new PR. We test every combination of PyTorch and Python supported versions, every OS, multi GPUs and even TPUs.
Minimal running speed overhead (about 300 ms per epoch compared with pure PyTorch).

Lightning automates 40+ parts of DL/ML research

GPU training
Distributed GPU (cluster) training
TPU training
EarlyStopping
Logging/Visualizing
Checkpointing
Experiment management
Full list here

Examples

Hello world

Contrastive Learning

NLP

BERT
GPT-2

Reinforcement Learning

Vision

Classic ML

Community

The lightning community is maintained by

16 core contributors, a mix of professional engineers, research scientists, and Ph.D. students from top AI labs.
280+ community contributors.

Lightning is also part of the PyTorch ecosystem which requires projects to have solid testing, documentation and support.

Asking for help

If you have any questions please:

Funding

Building open-source software with only a few part-time people is hard!

We're venture funded and backed by some of the top VC funds in the world: Index Ventures, Bain Capital Ventures, and First Minute Capital.

Their funding ensures we can continue to build awesome tooling like Grid, give you around the clock support, hire a full-time staff, attend conferences, and move faster through implementing features you request.

To supercharge your research and production work, visit our Grid.ai platform

Grid AI

Grid AI is our native platform for training models at scale on the cloud!

Sign up for early access here

To use grid, take your regular command:

    python my_model.py --learning_rate 1e-6 --layers 2 --gpus 4

And change it to use the grid train command:

    grid train --grid_gpus 4 my_model.py --learning_rate 'uniform(1e-6, 1e-1, 20)' --layers '[2, 4, 8, 16]'

The above command will launch (20 * 4) experiments each running on 4 GPUs (320 GPUs!) - by making ZERO changes to your code.
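The experiment count is the size of a Cartesian product: every sampled learning rate is paired with every layer setting. The following is a minimal sketch of that arithmetic, assuming `uniform(1e-6, 1e-1, 20)` means 20 learning rates drawn uniformly from that range. How Grid actually samples is not specified here, and `build_sweep` is a hypothetical helper, not part of the Grid CLI.

```python
import itertools
import random


def build_sweep(num_lr_samples, layer_choices, seed=0):
    """Return every (learning_rate, layers) pair in the sweep (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: learning rates are drawn uniformly from [1e-6, 1e-1].
    learning_rates = [rng.uniform(1e-6, 1e-1) for _ in range(num_lr_samples)]
    # One experiment per combination of learning rate and layer count.
    return list(itertools.product(learning_rates, layer_choices))


experiments = build_sweep(20, [2, 4, 8, 16])
gpus_per_experiment = 4
total_gpus = len(experiments) * gpus_per_experiment
print(len(experiments), total_gpus)  # 80 experiments, 320 GPUs
```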

Licence

Please observe the Apache 2.0 license that is listed in this repository. In addition the Lightning framework is Patent Pending.

BibTeX

If you want to cite the framework feel free to use this (but only if you loved it 😊 ):

@article{falcon2019pytorch,
  title={PyTorch Lightning},
  author={Falcon, WA},
  journal={GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning},
  volume={3},
  year={2019}
}

Comments

Cross validation feature

🚀 Feature

Cross-validation is a crucial model validation technique for assessing how the model generalizes to new data.

Motivation

Research papers usually require cross-validation. From my point of view, this kind of feature would simplify the work of researchers.

Pitch

I want to pass a parameter to the Trainer object to specify that I want to train the model on K-folds.
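To make the request concrete, here is a minimal sketch of the index splitting that K-fold training implies. It is plain Python rather than an existing Trainer option, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of k folds (illustrative helper)."""
    if not 2 <= k <= num_samples:
        raise ValueError("k must be between 2 and num_samples")
    indices = list(range(num_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(num_samples=10, k=3))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    # In a real run, a fresh model and Trainer would be fit on each split.
    print(fold_id, len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so averaging the per-fold validation metric gives the cross-validated estimate.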

In the case that nobody wants to make a PR, I can start working on that.
feature help wanted good first issue discussion

opened by BraveDistribution 106
Improve typing coverage (4/n)
🚀 Typing coverage

Let's improve typing coverage of PyTorch Lightning together!

I'm creating a new issue in order to increase visibility. There are three older issues (#7037, #5023, #4698) which became stale over time.

Plan

Currently, there are 55 files which are excluded from mypy checks so that our CI does not fail. These files differ vastly in how difficult it is to complete their typing. For this reason, we are introducing a difficulty estimate for each file so that community members can choose files appropriate to their skill level.

Please, comment on this issue in order to reserve a particular file to work on. Once you do so, I will edit this top comment to avoid collisions. Once you think your work is finished, please open a PR referencing this issue which:

removes the corresponding line from pyproject.toml

and passes mypy checks with the corresponding line removed. You can test it locally by running mypy from root directory

If you are struggling with pushing it over the finish line, open the PR anyway and someone from our team will help you to get it there. 🚀

Please note that you may need to edit more than just one file. This is fine, but please keep in mind that the goal of your PR is to make the check pass for the chosen file. Also, please note that the difficulty is just an educated guess.

For those of you who are not familiar with the process of contributing a PR, we have prepared a simple guide that will walk you through the necessary steps. You can do it! 🚀 💪

List of files and guesstimated difficulty

Completed

Difficulty 1 of 3

[x] pytorch_lightning/core/decorators.py #14044

[x] pytorch_lightning/profilers/advanced.py @nninept #13792

~~[ ] pytorch_lightning/profilers/base.py @LeeChanHyuk #13879~~

[x] pytorch_lightning/loggers/base.py @JustinGoheen #13494

[x] pytorch_lightning/__setup__.py @CyprienRicque #13472

~~[ ] pytorch_lightning/distributed/dist.py @puhuk #13492~~

[x] pytorch_lightning/strategies/single_device.py @CyprienRicque #13532

[x] pytorch_lightning/trainer/optimizers.py @gautierdag #13470

[x] pytorch_lightning/utilities/distributed.py @krishnakalyan3 #13678

[x] pytorch_lightning/callbacks/finetuning.py @ar90n #13516

[x] pytorch_lightning/loggers/mlflow.py @JustinGoheen ~~#13690~~ #13691

[x] pytorch_lightning/tuner/tuning.py @donlapark ~~#13616~~ #13631

[x] pytorch_lightning/strategies/single_tpu.py @CyprienRicque #13534

[x] pytorch_lightning/strategies/ddp2.py @CyprienRicque #13535

[x] pytorch_lightning/strategies/parallel.py @CyprienRicque #13556

[x] pytorch_lightning/loggers/csv_logs.py @JustinGoheen #13538

[x] pytorch_lightning/tuner/lr_finder.py @donlapark #13513 #13652

[x] pytorch_lightning/strategies/dp.py @CyprienRicque #13564

[x] pytorch_lightning/profilers/simple.py @krishnakalyan3 #14103

[x] pytorch_lightning/strategies/sharded_spawn.py @krishnakalyan3 #14102

[x] pytorch_lightning/demos/mnist_datamodule.py @alro923 #13929

[x] pytorch_lightning/demos/boring_classes.py @krishnakalyan3 #14201

[x] pytorch_lightning/tuner/batch_size_scaling.py @ar90n #13518

Difficulty 2 of 3

[x] pytorch_lightning/loops/epoch/training_epoch_loop.py @himkt #13555

[x] pytorch_lightning/core/mixins/device_dtype_mixin.py @krishnakalyan3 #13704

[x] pytorch_lightning/loggers/comet.py @JustinGoheen #13689

[x] pytorch_lightning/loggers/tensorboard.py @JustinGoheen #13688

[x] pytorch_lightning/strategies/horovod.py @CyprienRicque #13570

[x] pytorch_lightning/callbacks/model_checkpoint.py @BongYang #13617

[x] pytorch_lightning/strategies/fully_sharded.py @BongYang #13941

[x] pytorch_lightning/loggers/neptune.py @JustinGoheen #13692

[x] pytorch_lightning/utilities/meta.py @nninept #13763 #13868

[x] pytorch_lightning/strategies/tpu_spawn.py @BongYang #13813

[x] pytorch_lightning/loggers/logger.py @JustinGoheen #13541

[x] pytorch_lightning/loggers/wandb.py @gautierdag #13483

[x] pytorch_lightning/callbacks/stochastic_weight_avg.py @donlapark #13685 #13860

[x] pytorch_lightning/strategies/strategy.py @CyprienRicque #13519

[x] pytorch_lightning/strategies/deepspeed.py @donlapark #13832

[x] pytorch_lightning/strategies/ddp_spawn.py @donlapark #13865

[x] pytorch_lightning/strategies/ipu.py @HalestormAI #13786

[x] pytorch_lightning/trainer/connectors/callback_connector.py @krishnakalyan3 #13750

[x] pytorch_lightning/strategies/ddp.py @lijm1358 #13885

[x] pytorch_lightning/core/saving.py @JustinGoheen #13932

[x] pytorch_lightning/callbacks/quantization.py @krishnakalyan3 #13782

[x] pytorch_lightning/strategies/sharded.py @lijm1358 #14184

[x] pytorch_lightning/core/datamodule.py @JustinGoheen #13693

Difficulty 3 of 3

~~[ ] pytorch_lightning/trainer/callback_hook.py @JustinGoheen #13807~~

[x] pytorch_lightning/core/module.py @JustinGoheen #13603

[x] pytorch_lightning/trainer/connectors/data_connector.py @JustinGoheen #13806

[x] pytorch_lightning/utilities/auto_restart.py @donlapark #13904

[x] pytorch_lightning/trainer/supporters.py @donlapark #14633

[x] pytorch_lightning/profilers/pytorch.py @krishnakalyan3 #14405

[x] pytorch_lightning/utilities/data.py @nandwalritik #13901

[x] pytorch_lightning/trainer/trainer.py ~@JustinGoheen #13810~ @BongYang #14204

[x] pytorch_lightning/callbacks/progress/rich_progress.py @donlapark #14963

cc @borda @justusschock @awaelchli @rohitgr7 @Borda @tchaton @aniketmaurya @kingjuno @alat-rights @carmocca @akihironitta @stancld as you were all involved in previous issues
help wanted good first issue let's do it! code quality
opened by otaj 105
Code stuck on "initializing ddp" when using more than one gpu
🐛 Bug

I am trying to run a pytorch lightning model on a 4-GPU node. In my trainer, if I specify

pl.Trainer(gpus=[0])

It runs fine. However, once I add another GPU

pl.Trainer(gpus=[0,1,2,3])

I get this output:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4

And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.

Any idea why this may happen? I have tried with both ddp and ddp_spawn.

PyTorch Version-- tried both 1.4 and 1.7

OS-- Linux

Installed with pip

Python version: 3.8.5

CUDA/cuDNN version: 10.1

GPU models and configuration: NVIDIA K80s

bug help wanted distributed priority: 1
opened by JosephGatto 78
Implementing mAP
What does this PR do?

Implements mAP, as mentioned in #2552. I'm creating a draft pull request, as opposed to a regular pull request, to receive some feedback as well as guidance on some implementation details.

Before submitting

[x] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)

[x] Did you read the contributor guideline, Pull Request section?

[x] Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.

[x] Did you make sure to update the documentation with your changes?

[x] Did you write any new necessary tests?

[x] Did you verify new and existing tests pass locally with your changes?

[x] If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed. Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

[x] Is this pull request ready for review? (if not, please submit in draft mode)

[x] Check that all items from Before submitting are resolved

[x] Make sure the title is self explanatory and the description concisely explains the PR

[x] Add labels and milestones (and optionally projects) to the PR so it can be classified; Bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃
feature has conflicts
opened by briankosw 68
Add Support for multiple train loaders
Before submitting

[ ] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)

[x] Did you read the contributor guideline, Pull Request section?

[ ] Did you make sure to update the docs?

[x] Did you write any new necessary tests?

[ ] If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

When this is finished it adds support for drawing batches from multiple train loaders at once. If the loaders are specified as a Mapping (dict), the resulting batch will consist of one batch per loader under the same keys as the loaders like this:

loaders = {"x": loader_x, "y": loader_y, "z": loader_z}

will result in a batch like this:

{"x": batch_from_loader_x, "y": batch_from_loader_y, "z": batch_from_loader_z}

and loaders given as a sequence will produce a sequence batch built from the separate batches in the same order:

loaders = [loader_0, loader_1, loader_2]

will result in a batch like this:

[batch_from_loader_0, batch_from_loader_1, batch_from_loader_2]
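The batch shapes described above can be sketched in plain Python. This is an illustration only, not the PR's implementation: `combine_loaders` is a hypothetical helper, plain lists stand in for DataLoaders, and stopping at the shortest loader is an assumption.

```python
from collections.abc import Mapping


def combine_loaders(loaders):
    """Yield one combined batch per step from several loaders (illustrative only)."""
    if isinstance(loaders, Mapping):
        keys = list(loaders)
        # zip stops at the shortest loader; the PR may handle length mismatches differently.
        for batches in zip(*(loaders[key] for key in keys)):
            yield dict(zip(keys, batches))
    else:
        for batches in zip(*loaders):
            yield list(batches)


# Plain lists stand in for DataLoaders here.
loader_x, loader_y = [1, 2, 3], ["a", "b", "c"]
dict_batches = list(combine_loaders({"x": loader_x, "y": loader_y}))
list_batches = list(combine_loaders([loader_x, loader_y]))
print(dict_batches[0])  # {'x': 1, 'y': 'a'}
print(list_batches[0])  # [1, 'a']
```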

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃
feature help wanted ready priority: 0 design
opened by justusschock 67
Remove deprecated code after the 1.6 release
Proposed refactor

Remove deprecated code after the 1.6 release.

NOTE: Please pick up a single item from the list (by commenting here in the issue) - and if there are no conflicts - we will happily assign you and put your name in front of the item in the list.

Please note that unless mentioned, the classes are importable from pytorch_lightning, example: from pytorch_lightning import Trainer.

[x] LightningModule.summarize -> #12559

[x] pytorch_lightning.core.memory.LayerSummary -> #12593

[x] pytorch_lightning.core.memory.ModelSummary -> #12593

[x] pytorch_lightning.core.memory.get_gpu_memory_map -> #12644

[x] pytorch_lightning.core.memory.get_memory_profile -> #12659

[x] LightningModule.model_size -> #12641

[x] LightningDataModule.train_transforms -> #12662

[x] LightningDataModule.val_transforms -> #12763

[x] LightningDataModule.test_transforms -> #12773

[x] LightningDataModule.size -> #12780

[x] LightningDataModule.dims and LightningDataModule(dims=...) -> #12780

[x] LightningModule.get_progress_bar_dict -> #12839

[x] Trainer.progress_bar_dict -> #12839

[x] Trainer(prepare_data_per_node=...) -> #12536

[x] Trainer(stochastic_weight_avg=...) -> #12535

[x] Trainer(terminate_on_nan=...) and Trainer.terminate_on_nan -> #12553

[x] LightningModule.on_{train,val,test,predict}_dataloader -> #13033

[x] pytorch_lightning.loggers.TestTubeLogger -> #12859

[x] pytorch_lightning.Callback.on_keyboard_interrupt -> #13438

[x] Trainer(process_position=...) -> #13071

[x] Trainer(flush_logs_every_n_steps=...) -> #13074

[x] LightningModule.add_to_queue -> @shenoynikhil

[x] LightningModule.get_from_queue -> @shenoynikhil

[x] Trainer(progress_bar_refresh_rate=...) -> #12514

[x] LightningLoggerBase.close and pytorch_lightning.loggers.LoggerCollection.close -> #13149

[x] pytorch_lightning.distributed.dist.LightningDistributed #13549

[x] Trainer(checkpoint_callback=...) -> #13027

[x] Passing dataloader_idx to on_train_batch_start of pytorch_lightning.Callback and LightningModule -> #12769

[x] LightningModule.on_post_move_to_device #13548

[x] pytorch_lightning.core.decorators.parameter_validation #13514

[x] Trainer(accelerator="ddp_spawn") #12696

[x] Trainer(plugins="ddp_spawn") #12700

[x] Trainer(weights_summary="full"), Trainer(weights_summary=None), Trainer.weights_summary -> #13070

[x] Trainer(log_gpu_memory=...) -> #12657

[x] Trainer.slurm_job_id #13459

[x] pytorch_lightning.callbacks.gpu_stats.GPUStatsMonitor -> #12554

[x] pytorch_lightning.callbacks.gpu_stats.XLAStatsMonitor -> #12688

[x] pytorch_lightning.callbacks.progress.ProgressBar -> #12658

[x] Trainer(max_steps=None) and Trainer.fit_loop.max_steps = None #13591

[x] pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor.lr_sch_names -> #13353

[x] KubeflowEnvironment.is_using_kubeflow, LSFEnvironment.is_using_lsf, TorchElasticEnvironment.is_using_torchelastic #13458

[x] pytorch_lightning.overrides.distributed.IndexBatchSamplerWrapper.batch_indices #13565

[x] pytorch_lightning.strategies.SingleDeviceStrategy.post_dispatch #13461

[x] pytorch_lightning.trainer.connectors.logger_connector.logger_connector.LoggerConnector.gpu_metrics

Feel free to cross-check against the test file to ensure that the relevant test now fails (since the feature is no longer deprecated but removed).

Pitch

All the deprecated features we have are tested here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/deprecated_api/test_remove_1-7.py
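The pattern such tests follow can be sketched in plain Python. This is an illustration with hypothetical names (`DemoModule`, `emits_deprecation_warning`), not code from the linked test file.

```python
import warnings


class DemoModule:
    """Hypothetical class with one deprecated method."""

    def summarize(self):
        warnings.warn(
            "`summarize` is deprecated and will be removed in a future release",
            DeprecationWarning,
            stacklevel=2,
        )
        return "summary"


def emits_deprecation_warning(func):
    """Return True if calling `func` raises at least one DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return any(issubclass(w.category, DeprecationWarning) for w in caught)


# While the method is deprecated, the check passes; after removal the call
# would raise AttributeError, so the deprecation test must be deleted as well.
print(emits_deprecation_warning(DemoModule().summarize))
```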

If you are interested in taking care of one item, post a comment here asking to take it. This avoids multiple people working on the same thing.

Additional context

See pull requests linked in #10312 for examples on how to contribute :) Or a recent pull request #12514.

If you enjoy Lightning, check out our other projects! ⚡

Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

cc @borda @justusschock @awaelchli @rohitgr7 @krshrimali
good first issue refactor
opened by akihironitta 65

replace Hparams by init args

Problem

hparams was a temporary fix for users' args not being stored automatically. It’s something everyone hacks around, it is not intuitive, and it makes the pl module somewhat less like a PyTorch module.

end of hparams!

This PR

This PR removes that and instead:

Stores all the args passed in init automatically so checkpoints can have this information.
Doesn’t store things like losses, etc.; only primitives, lists, dicts, tuples and namespaces.
Auto-saves this info into checkpoints.
Does NOT assign properties automatically.
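One way to picture the behavior listed above is the following minimal sketch. It uses frame inspection from the standard library and is not the PR's actual implementation; `collect_init_args`, `STORABLE_TYPES`, and `DemoModule` are hypothetical names.

```python
import inspect
from argparse import Namespace

STORABLE_TYPES = (bool, int, float, str, list, dict, tuple, Namespace)


def collect_init_args(frame):
    """Return the storable arguments of the __init__ call running in `frame`."""
    arg_info = inspect.getargvalues(frame)
    return {
        name: arg_info.locals[name]
        for name in arg_info.args
        if name != "self" and isinstance(arg_info.locals[name], STORABLE_TYPES)
    }


class DemoModule:
    def __init__(self, in_dim, out_dim, loss_fn=None):
        # Capture the arguments, but do not assign them as attributes.
        self.saved_init_args = collect_init_args(inspect.currentframe())


module = DemoModule(in_dim=28 * 28, out_dim=10, loss_fn=abs)
print(module.saved_init_args)  # {'in_dim': 784, 'out_dim': 10}
```

The callable passed as `loss_fn` is skipped because it is not a primitive or container, which matches the rule about not storing things like losses.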

Backward compatibility

This PR is still backward compatible for people who want to continue using hparams directly.

Summary

Before:

hparams = dict or Namespace

class LitModel(pl.LightningModule):
    def __init__(self, hparams, my_pretrained_nn_module):
        super().__init__()
        self.hparams = hparams
        self.l1 = nn.Linear(hparams.in_dim, hparams.out_dim)
        self.feature_extractor = my_pretrained_nn_module()

# old way had a ton of problems with this
model = LitModel.load_from_checkpoint(PATH)

New:

class LitModel(pl.LightningModule):
    def __init__(self, in_dim, out_dim, my_pretrained_nn_module):
        super().__init__()
        self.in_dim = in_dim
        self.out_dim = out_dim
        
        # self.in_dim, etc were auto registered to the module
        self.l1 = nn.Linear(in_dim, out_dim)
        self.feature_extractor = my_pretrained_nn_module()

# load from checkpoint still works as normal, but objects and such need to be specified
model = LitModel.load_from_checkpoint(PATH, my_pretrained_nn_module=MyModule)

# or can overwrite the old settings as well
model = LitModel.load_from_checkpoint(PATH, in_dim=some_new_dim, my_pretrained_nn_module=MyModule)

feature help wanted

opened by williamFalcon 63

Unify usage of multiple callbacks

🚀 Feature

Simplify the API for callbacks: as Keras does, for example, pass just a list of callbacks to be executed and the Trainer will call them when needed, instead of having them specified as in https://github.com/PyTorchLightning/pytorch-lightning/blob/b1040523b2180300574d961444b00abfa3c84195/pytorch_lightning/trainer/trainer.py#L65-L66

mentioned also in https://github.com/PyTorchLightning/pytorch-lightning/issues/825#issuecomment-588226411
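A minimal sketch of the proposed shape, in plain Python. All class and hook names here (`Callback`, `PrintEpoch`, `MiniTrainer`, `on_epoch_start`, `on_epoch_end`) are hypothetical and only illustrate the idea of a single callbacks list. They are not the Lightning API.

```python
class Callback:
    """Base class with no-op hooks (hypothetical, modeled on the Keras pattern)."""

    def on_epoch_start(self, epoch):
        pass

    def on_epoch_end(self, epoch):
        pass


class PrintEpoch(Callback):
    def __init__(self):
        self.seen = []

    def on_epoch_end(self, epoch):
        self.seen.append(epoch)
        print(f"finished epoch {epoch}")


class MiniTrainer:
    """Toy stand-in for a trainer that accepts a single list of callbacks."""

    def __init__(self, max_epochs, callbacks=None):
        self.max_epochs = max_epochs
        self.callbacks = list(callbacks or [])

    def fit(self):
        for epoch in range(self.max_epochs):
            for callback in self.callbacks:
                callback.on_epoch_start(epoch)
            # ... the training work for this epoch would run here ...
            for callback in self.callbacks:
                callback.on_epoch_end(epoch)


printer = PrintEpoch()
MiniTrainer(max_epochs=2, callbacks=[printer]).fit()
```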
feature help wanted discussion

opened by Borda 60
Lose performance between 0.6.0 and 0.7.1
🐛 Bug

When I train exactly the same model with pl 0.7.1, I get worse performance compared to pl 0.6.0. I did a fresh install of Asteroid with both versions and ran exactly the same script on the same hardware. I get significantly worse performance with pl 0.7.1. Are there some known issues I should be aware of? In the meantime, I'll have to downgrade to 0.6.0.

Environment

PL 0.6.0

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Debian GNU/Linux 10 (buster)
GCC version: (Debian 8.3.0-6) 8.3.0
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] pytorch-lightning==0.6.0
[pip3] torch==1.4.0
[pip3] torchvision==0.4.2
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.14 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] torch 1.3.1 pypi_0 pypi
[conda] torchvision 0.4.2 pypi_0 pypi

Diff between 0.6.0 and 0.7.1 envs

diff env_0.7 env_0.6

19c19
< [pip3] pytorch-lightning==0.7.1
---
> [pip3] pytorch-lightning==0.6.0
help wanted
opened by mpariente 53

CUDA OOM when initializing DDP

🐛 Bug

Hey everyone,

I am trying to train a model on the GPU workstation of our lab (which has 10 GPUs, of which usually only 1 is in use) using Lightning and DDP. I have tried with several models (including the BoringModel) without success. In particular, I get a CUDA OOM error when DDP initializes. I tried BoringModel with the following Trainer configuration:

trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        accelerator="ddp",
        auto_select_gpus=True
)

And the output I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "boring_model.py", line 138, in <module>
    run_test()
  File "boring_model.py", line 133, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 138, in <module>
    run_test()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 133, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe

The script with the BoringModel I run on our workstation is in this gist.

However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).

I also tried to run locally the same notebook as Colab, and the result at the first attempt is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-10-d400f0366266> in test_x(tmpdir)
     16 
     17     # Train the model ⚡
---> 18     trainer.fit(model, train, val)
     19 
     20     trainer.test(test_dataloaders=test)

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    442         self.call_hook('on_fit_start')
    443 
--> 444         results = self.accelerator_backend.train()
    445         self.accelerator_backend.teardown()
    446 

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in train(self)
    146         model = self.trainer.model
    147 
--> 148         results = self.ddp_train(process_idx=self.task_idx, model=model)
    149         if 'WORLD_SIZE' in os.environ:
    150             del os.environ['WORLD_SIZE']

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in ddp_train(self, process_idx, model)
    236         # where to store ip_table
    237         model.trainer = self.trainer
--> 238         self.init_ddp_connection(
    239             self.trainer.global_rank,
    240             self.trainer.world_size,

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks)
    213                 f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"
    214             )
--> 215             torch_distrib.init_process_group(
    216                 torch_backend, rank=global_rank, world_size=world_size
    217             )

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
    440     # process groups including global variables are updated correctly on all
    441     # ranks.
--> 442     barrier()
    443 
    444 def _new_process_group_helper(world_size,

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in barrier(group, async_op)
   1945     if group == GroupMember.WORLD:
   1946         _check_default_pg()
-> 1947         work = _default_pg.barrier()
   1948     else:
   1949         work = group.barrier()

RuntimeError: CUDA error: out of memory

At the second attempt, though, it works, as expected (i.e. the model trains with no errors, even with multiple GPUs)! So in the script, I tried to do the following to attempt the fit twice as in the notebook:

try:
    trainer.fit(model, train_data, val_data)
except RuntimeError:
    trainer.fit(model, train_data, val_data)

As a result, I get this stack trace:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "boring_model.py", line 135, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "boring_model.py", line 143, in <module>
    run_test()
  File "boring_model.py", line 137, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
    model = self.ddp_plugin.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
    model = LightningDistributedDataParallel(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 135, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 143, in <module>
    run_test()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 137, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
    model = self.ddp_plugin.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
    model = LightningDistributedDataParallel(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Broken pipe

Expected behavior

The models should train without issues.

Environment

CUDA:
- GPU:
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
  - TITAN V
- available: True
- version: 10.1
Packages:
- numpy: 1.19.2
- pyTorch_debug: True
- pyTorch_version: 1.7.0
- pytorch-lightning: 1.0.6
- tqdm: 4.52.0
System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.8.5
- version: #1 SMP Fri Oct 18 17:15:30 UTC 2019

Additional context

I tried installing torch, torchvision, and pl with both conda and pip in fresh environments, but the problem persists.

This also happens if I select (free) GPUs manually by passing them to the gpus flag as a List[int]. Interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. One final interesting fact: with accelerator="dp" I have no issues either.

Thanks in advance!

bug help wanted distributed

opened by dedeswim 51

NCCL error using DDP and PyTorch 1.7

🐛 Bug

Getting this error when attempting to use ddp with the "getting started" autoencoder example:

Stack Trace:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "01_getting_started_autoencoder.py", line 66, in <module>
    modle, trainer = cli_main()
  File "01_getting_started_autoencoder.py", line 60, in cli_main
    trainer.fit(model, train_dl)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
Traceback (most recent call last):
  File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 66, in <module>
    results = self.accelerator_backend.train()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
    torch_backend, rank=global_rank, world_size=world_size
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    modle, trainer = cli_main()
  File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 60, in cli_main
    trainer.fit(model, train_dl)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
    torch_backend, rank=global_rank, world_size=world_size
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

To Reproduce

Follow the code in the getting started question with these parameters to Trainer:

model = LitAutoEncoder()
trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
trainer.fit(model, train_dl)

Expected behavior

For it to train on multiple GPUs :)

Environment

PyTorch Version: 1.7
OS (e.g., Linux): Ubuntu 18.04
How you installed PyTorch (conda, pip, source): pip
Build command you used (if compiling from source): n/a
Python version: 3.7
CUDA/cuDNN version: 10.2/7.6.5
GPU models and configuration: 2 1080Tis
Any other relevant information: n/a

bug help wanted priority: 0 distributed 3rd party

opened by ohmeow 51

Use SIGTERM instead of SIGKILL on DDP Errors
Description & Motivation

In DDP training, when a process has an exception then a SIGKILL is issued to all other processes to tear them down: https://github.com/Lightning-AI/lightning/blob/4c3ce605ad814155e309b6bb0737db6b428a2a0c/src/pytorch_lightning/strategies/ddp.py#L458

Because it's a SIGKILL, everything is immediately torn down and no finalizers can run. This is a problem for us, because we have code set up on the main process to report errors (e.g. log the stack trace to our job tracker, upload any dumped files / logs to S3 for permanent storage + post-hoc debugging), and that code will not run.

Pitch

The simplest alternative would be to issue a SIGTERM instead of SIGKILL -- this would let us install a signal handler on the main process to do all of the "finalizers" for us before exiting.
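A minimal sketch of what such a handler could look like on the main process, assuming the teardown signal were SIGTERM. The names make_sigterm_handler and install_sigterm_handler and the finalizer callables are hypothetical and not part of Lightning; only the standard-library signal and sys calls are real.

```python
import signal
import sys


def make_sigterm_handler(finalizers):
    """Build a signal handler that runs each finalizer, then exits.

    `finalizers` is a list of zero-argument callables (hypothetical
    examples: report the stack trace, upload logs to storage).
    """

    def handler(signum, frame):
        for finalize in finalizers:
            try:
                finalize()
            except Exception as exc:
                # One failing finalizer should not stop the others from running.
                print(f"finalizer failed: {exc!r}", file=sys.stderr)
        # Conventional exit status for "terminated by signal N" is 128 + N.
        sys.exit(128 + signum)

    return handler


def install_sigterm_handler(finalizers):
    """Register the handler for SIGTERM; must be called from the main thread."""
    handler = make_sigterm_handler(finalizers)
    signal.signal(signal.SIGTERM, handler)
    return handler
```

Because sys.exit raises SystemExit, finally blocks and context managers on the main thread still unwind after the finalizers have run. None of this helps with SIGKILL, which cannot be caught.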

Alternatives

There are other things we could do, of course:

We could send SIGTERM, wait for some time, and check whether the process has exited; if it has not exited within the timeout, send SIGKILL. This would guarantee that processes are always killed, even with a badly behaved SIGTERM handler.

We could use a SIGINT (which raises a KeyboardInterrupt) -- that way, all of the try/excepts and with statements will naturally run cleanup without having to write a special handler.

We could make a Callback that tries to do things in the on_exception handler. This should be mostly possible, but would require more work on our side because right now only the main process establishes connections with our services for error handling + logging.
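The first alternative (SIGTERM, wait, then SIGKILL) can be sketched with the standard library alone. This is an illustration, not Lightning's teardown code: terminate_with_escalation is a hypothetical helper, and it assumes the worker is available as a subprocess.Popen object.

```python
import subprocess


def terminate_with_escalation(proc, timeout=5.0):
    """Ask `proc` (a subprocess.Popen) to exit; force-kill it after `timeout` seconds.

    Returns the return code of the process.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited, nothing to do
    proc.terminate()  # sends SIGTERM on POSIX, so handlers can run
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()
```

The timeout bounds how long a slow or broken SIGTERM handler can delay teardown, which is the guarantee described above.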

Additional context

No response

cc @borda @justusschock @awaelchli
feature strategy: ddp
opened by alanhdu 1

save_on_train_epoch_end=True same behavior as save_on_train_epoch_end=False

Bug description

I believe setting save_on_train_epoch_end=True still runs checkpointing after the validation loop.

I put a breakpoint in the module hook on_save_checkpoint(), and it is called only after validation, even with save_on_train_epoch_end=True. I also wrote my own class to debug this by inspecting the on_train_epoch_end function:

from pytorch_lightning.callbacks import ModelCheckpoint

class MyCKPT(ModelCheckpoint):
    def on_train_epoch_end(self, trainer, pl_module) -> None:
        """Save a checkpoint at the end of the training epoch."""
        if not self._should_skip_saving_checkpoint(trainer) and self._save_on_train_epoch_end:
            monitor_candidates = self._monitor_candidates(trainer)
            if self._every_n_epochs >= 1 and (trainer.current_epoch + 1) % self._every_n_epochs == 0:
                self._save_topk_checkpoint(trainer, monitor_candidates)
            self._save_last_checkpoint(trainer, monitor_candidates)

Similarly, with both save_on_train_epoch_end=True and save_on_train_epoch_end=False, the above function is called only after validation has run.

I use version 1.6.0.

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

needs triage

opened by DianeBouchacourt 0

fix legacy creation
What does this PR do?

Fixes the Trainer arguments for creating legacy checkpoints; see the master failures: https://github.com/Lightning-AI/lightning/actions/runs/3856241754/jobs/6572256322

Traceback (most recent call last):
  File "/home/runner/work/lightning/lightning/tests/legacy/simple_classif_training.py", line 52, in <module>
    main_train(path_dir)
  File "/home/runner/work/lightning/lightning/tests/legacy/simple_classif_training.py", line 30, in main_train
    trainer = pl.Trainer(
  File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 340, in insert_env_defaults
    return fn(self, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 414, in __init__
    self._accelerator_connector = AcceleratorConnector(
  File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 191, in __init__
    self._check_device_config_and_set_final_flags(
  File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 419, in _check_device_config_and_set_final_flags
    raise MisconfigurationException(
lightning_fabric.utilities.exceptions.MisconfigurationException: `Trainer(devices=0)` value is not a valid input using None accelerator.

Before submitting

[ ] Was this discussed/approved via a GitHub issue? (not for typos and docs)

[x] Did you read the contributor guideline, Pull Request section?

[x] Did you make sure your PR does only one thing, instead of bundling different changes together?

[ ] Did you make sure to update the documentation with your changes? (if necessary)

[ ] Did you write any new necessary tests? (not for typos and docs)

[ ] Did you verify new and existing tests pass locally with your changes?

[ ] Did you list all the breaking changes introduced by this pull request?

[ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

[x] Is this pull request ready for review? (if not, please submit in draft mode)

[x] Check that all items from Before submitting are resolved

[x] Make sure the title is self-explanatory and the description concisely explains the PR

[x] Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃
bug pl
opened by Borda 1
Lightning package includes additional dir
Bug description

The package includes additional folders besides lightning, so the tree is:

src
|-lightning
|-lightning.egg-info
|-lightning_app       <- WRONG
|-lightning_fabric    <- WRONG
|-pytorch_lightning   <- WRONG
'-version.info

On top of that, this should be caught by our CI.
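One way a CI job could catch this is to list the top-level entries of the built archive and fail on anything unexpected. The sketch below is not the project's actual CI; unexpected_top_level and check_wheel are hypothetical helpers, and it assumes the check runs on a built wheel with the set of allowed top-level names supplied by the caller.

```python
import zipfile
from pathlib import PurePosixPath


def unexpected_top_level(member_names, allowed):
    """Return the sorted top-level entries of an archive listing that are not allowed.

    Metadata directories (*.dist-info, *.egg-info) are ignored.
    """
    top_level = {PurePosixPath(name).parts[0] for name in member_names if name}
    return sorted(
        entry
        for entry in top_level
        if entry not in allowed and not entry.endswith((".dist-info", ".egg-info"))
    )


def check_wheel(wheel_path, allowed):
    """Raise AssertionError if the wheel at `wheel_path` ships extra top-level entries."""
    with zipfile.ZipFile(wheel_path) as archive:
        extra = unexpected_top_level(archive.namelist(), allowed)
    assert not extra, f"unexpected top-level entries: {extra}"
```

For the tree above, calling unexpected_top_level with allowed={"lightning", "version.info"} would report lightning_app, lightning_fabric and pytorch_lightning.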

How to reproduce the bug

See the latest 1.9.0rc0.

More info

No response

cc @carmocca @borda
priority: 0 release
opened by Borda 0
Deepspeed checkpoint cannot be loaded for my model
Bug description

I am training a Whisper model using the Lightning Trainer. If I add the option strategy="deepspeed_stage_3_offload" to pytorch_lightning.Trainer(), then trainer.fit() runs fine, but the checkpoint is saved in a way that I can't load.

Converting the checkpoint with the zero_to_fp32.py script found in the saved checkpoint directory produces a file that cannot be loaded.

Loading the produced model with this script gives this error:

RuntimeError: Error(s) in loading state_dict for Whisper: Missing key(s) in state_dict: "model.proj_out.weight".
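When narrowing down an error like this, it can help to compare the key names the module expects with the key names in the converted file before calling load_state_dict. The helper below is a hypothetical sketch that works on plain key collections; in a real run the expected keys would presumably come from the module's state_dict() and the loaded keys from the converted checkpoint, which is an assumption about the setup rather than something shown in this report.

```python
def diff_state_dict_keys(expected_keys, loaded_keys):
    """Compare two collections of state_dict key names.

    Returns a dict with the keys the module expects but the checkpoint
    lacks ("missing") and the keys only the checkpoint has ("unexpected").
    """
    expected, loaded = set(expected_keys), set(loaded_keys)
    return {
        "missing": sorted(expected - loaded),
        "unexpected": sorted(loaded - expected),
    }


# Illustrative key names only; "model.encoder.weight" is made up for the example.
report = diff_state_dict_keys(
    expected_keys=["model.encoder.weight", "model.proj_out.weight"],
    loaded_keys=["model.encoder.weight"],
)
print(report)
```

A non-empty "missing" list with an empty "unexpected" list suggests the parameter was dropped during conversion rather than renamed.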

See also the discussion in this closed issue.

How to reproduce the bug

from transformers import WhisperForConditionalGeneration
import pytorch_lightning as pl

class Whisper(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model
    ...

model_card = "openai/whisper-base"
model = WhisperForConditionalGeneration.from_pretrained(model_card)
whisper = Whisper(model)  # wrap the Hugging Face model in the LightningModule
trainer = pl.Trainer(accelerator='gpu', precision=16, strategy="deepspeed_stage_3_offload")
trainer.fit(model=whisper, train_dataloaders=...)

With a proper data loader, training_step() etc.

Error messages and logs

RuntimeError: Error(s) in loading state_dict for Whisper: Missing key(s) in state_dict: "model.proj_out.weight".

Environment

CUDA:

GPU:

NVIDIA GeForce RTX 3090

available: True

version: 11.7

Lightning:

lightning-utilities: 0.5.0

pytorch-lightning: 1.8.6

torch: 1.13.0

torchaudio: 0.13.1

torchmetrics: 0.11.0

Packages:

aiohttp: 3.8.3

aiosignal: 1.2.0

antlr4-python3-runtime: 4.8

anyio: 3.6.2

appdirs: 1.4.4

argon2-cffi: 21.3.0

argon2-cffi-bindings: 21.2.0

asttokens: 2.0.8

async-timeout: 4.0.2

attrs: 22.1.0

audioread: 3.0.0

babel: 2.10.3

backcall: 0.2.0

beautifulsoup4: 4.11.1

bitarray: 2.6.1

bleach: 5.0.1

certifi: 2022.9.24

cffi: 1.15.1

charset-normalizer: 2.1.1

click: 8.1.3

colorama: 0.4.6

comm: 0.1.2

cython: 0.29.32

datasets: 2.6.1

debugpy: 1.6.4

decorator: 5.1.1

deepspeed: 0.7.7

defusedxml: 0.7.1

dill: 0.3.5.1

distlib: 0.3.6

docopt: 0.6.2

entrypoints: 0.4

evaluate: 0.4.0

executing: 1.1.1

fairseq: 1.0.0a0+d6855ba

fastjsonschema: 2.16.2

ffmpeg-python: 0.2.0

filelock: 3.8.0

flashlight: 1.0.0

frozenlist: 1.3.1

fsspec: 2022.10.0

future: 0.18.2

gitdb: 4.0.9

gitpython: 3.1.29

hjson: 3.1.0

huggingface-hub: 0.10.1

hydra-core: 1.0.7

idna: 3.4

importlib-metadata: 5.2.0

importlib-resources: 5.10.0

ipykernel: 6.19.4

ipython: 8.5.0

ipython-genutils: 0.2.0

ipywidgets: 8.0.3

jedi: 0.18.1

jinja2: 3.1.2

jiwer: 2.5.1

joblib: 1.2.0

jsonschema: 4.17.1

julia: 0.5.7

jupyter: 1.0.0

jupyter-client: 7.4.8

jupyter-console: 6.4.4

jupyter-core: 5.1.0

jupyter-events: 0.5.0

jupyter-server: 2.0.4

jupyter-server-terminals: 0.4.3

jupyterlab-pygments: 0.2.2

jupyterlab-widgets: 3.0.4

lark: 0.11.3

levenshtein: 0.20.2

librosa: 0.9.2

lightning-utilities: 0.5.0

llvmlite: 0.39.1

lxml: 4.9.1

markupsafe: 2.1.1

matplotlib-inline: 0.1.6

mistune: 2.0.4

more-itertools: 9.0.0

multidict: 6.0.2

multiprocess: 0.70.13

multiprocessing-logging: 0.3.3

nbclassic: 0.4.8

nbclient: 0.7.2

nbconvert: 7.2.7

nbformat: 5.7.1

nest-asyncio: 1.5.6

ninja: 1.11.1

notebook: 6.5.2

notebook-shim: 0.2.2

num2words: 0.5.12

numba: 0.56.4

numpy: 1.23.4

nvidia-cublas-cu11: 11.10.3.66

nvidia-cuda-nvrtc-cu11: 11.7.99

nvidia-cuda-runtime-cu11: 11.7.99

nvidia-cudnn-cu11: 8.5.0.96

omegaconf: 2.0.6

packaging: 21.3

pandas: 1.5.1

pandocfilters: 1.5.0

parso: 0.8.3

pexpect: 4.8.0

pickleshare: 0.7.5

pillow: 9.2.0

pip: 20.0.2

pkg-resources: 0.0.0

pkgutil-resolve-name: 1.3.10

platformdirs: 2.5.4

pooch: 1.6.0

portalocker: 2.6.0

prometheus-client: 0.15.0

prompt-toolkit: 3.0.31

protobuf: 3.20.1

psutil: 5.9.3

ptyprocess: 0.7.0

pure-eval: 0.2.2

py-cpuinfo: 9.0.0

pyarmor: 7.7.0

pyarrow: 10.0.0

pycparser: 2.21

pydantic: 1.10.4

pygments: 2.13.0

pykaldi: 0.2.2

pyparsing: 3.0.9

pyrsistent: 0.19.2

python-dateutil: 2.8.2

python-json-logger: 2.0.4

pytorch-lightning: 1.8.6

pytz: 2022.5

pyyaml: 6.0

pyzmq: 24.0.1

qtconsole: 5.4.0

qtpy: 2.3.0

rapidfuzz: 2.13.7

regex: 2022.9.13

requests: 2.28.1

resampy: 0.4.2

responses: 0.18.0

sacrebleu: 2.3.1

sacremoses: 0.0.53

scikit-learn: 1.2.0

scipy: 1.9.3

send2trash: 1.8.0

setproctitle: 1.3.2

setuptools: 44.0.0

six: 1.16.0

smmap: 5.0.0

sniffio: 1.3.0

soundfile: 0.11.0

soupsieve: 2.3.2.post1

spraaklab-text: 0.1.0

stack-data: 0.5.1

tabulate: 0.9.0

tensorboardx: 2.5.1

terminado: 0.17.1

text-unidecode: 1.3

threadpoolctl: 3.1.0

tinycss2: 1.2.1

tokenizers: 0.13.1

torch: 1.13.0

torchaudio: 0.13.1

torchmetrics: 0.11.0

tornado: 6.2

tqdm: 4.64.1

traitlets: 5.8.0

transformers: 4.25.1

transitions: 0.9.0

typing-extensions: 4.4.0

urllib3: 1.26.12

uvloop: 0.17.0

virtualenv: 20.17.0

wcwidth: 0.2.5

webencodings: 0.5.1

websocket-client: 1.4.2

websockets: 10.4

wheel: 0.38.4

whisper: 1.0

widgetsnbextension: 4.0.4

worderrorrate: 0.1.1

xxhash: 3.1.0

yarl: 1.8.1

zipp: 3.10.0

System:

OS: Linux

architecture:

64bit

ELF

processor: x86_64

python: 3.8.10

version: #127-Ubuntu SMP Wed May 18 14:30:56 UTC 2022

More info

No response
needs triage
opened by davidavdav 1
Handle `set_to_none` when using DeepSpeed optimizer in Lite
What does this PR do?

The deepspeed optimizer does not accept the argument optimizer.zero_grad(set_to_none=...). In their optimizer, the name is optimizer.zero_grad(set_grads_to_None=...) 🤣 We are translating this inconsistency for the user.

Perhaps DeepSpeed should change their name to be consistent with PyTorch.
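The translation idea can be illustrated without either library by inspecting which keyword the optimizer's zero_grad accepts. This is only a sketch and not how the PR implements it: zero_grad_compat is a hypothetical helper, and the two stand-in classes mimic nothing more than the keyword names mentioned above.

```python
import inspect


def zero_grad_compat(optimizer, set_to_none=True):
    """Call optimizer.zero_grad with whichever keyword name it understands."""
    parameters = inspect.signature(optimizer.zero_grad).parameters
    if "set_to_none" in parameters:
        optimizer.zero_grad(set_to_none=set_to_none)
    elif "set_grads_to_None" in parameters:
        optimizer.zero_grad(set_grads_to_None=set_to_none)
    else:
        # Neither keyword is supported; fall back to the optimizer's default.
        optimizer.zero_grad()


class TorchStyleOptimizer:
    """Stand-in using the PyTorch keyword name."""

    def __init__(self):
        self.received = None

    def zero_grad(self, set_to_none=True):
        self.received = set_to_none


class DeepSpeedStyleOptimizer:
    """Stand-in using the keyword name quoted above for DeepSpeed."""

    def __init__(self):
        self.received = None

    def zero_grad(self, set_grads_to_None=True):
        self.received = set_grads_to_None


torch_style = TorchStyleOptimizer()
deepspeed_style = DeepSpeedStyleOptimizer()
zero_grad_compat(torch_style, set_to_none=False)
zero_grad_compat(deepspeed_style, set_to_none=False)
```

User code keeps a single call site and a single keyword name, whichever optimizer is underneath.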

cc @borda @carmocca @justusschock @awaelchli

Before submitting

[x] Was this discussed/approved via a GitHub issue? (not for typos and docs)

[x] Did you read the contributor guideline, Pull Request section?

[x] Did you make sure your PR does only one thing, instead of bundling different changes together?

[x] Did you make sure to update the documentation with your changes? (if necessary)

[x] Did you write any new necessary tests? (not for typos and docs)

[x] Did you verify new and existing tests pass locally with your changes?

[x] Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed. Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

[x] Is this pull request ready for review? (if not, please submit in draft mode)

[x] Check that all items from Before submitting are resolved

[x] Make sure the title is self-explanatory and the description concisely explains the PR

[x] Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃
feature lightninglite fun
opened by awaelchli 1

Releases(1.9.0rc0)

1.9.0rc0(Jan 6, 2023)

The release candidate is a preview of new features and improvements included in the upcoming stable version 1.9.0 🎉

Changelog

Coming soon.
Source code(tar.gz)
Source code(zip)
lightning-1.9.0rc0-py3-none-any.whl(1.96 MB)
lightning-1.9.0rc0.tar.gz(1.98 MB)
lightning-app-1.9.0rc0.tar.gz(1.09 MB)
lightning-fabric-1.9.0rc0.tar.gz(105.41 KB)
lightning_app-1.9.0rc0-py3-none-any.whl(1.16 MB)
lightning_fabric-1.9.0rc0-py3-none-any.whl(146.68 KB)
pytorch-lightning-1.9.0rc0.tar.gz(580.97 KB)
pytorch_lightning-1.9.0rc0-py3-none-any.whl(798.50 KB)
1.8.6(Dec 21, 2022)
App

Added

Added partial support for fastapi Request annotation in configure_api handlers (#16047)

Added a nicer UI with URL and examples for the autoscaler component (#16063)

Enabled users to have more control over scaling out/in intervals (#16093)

Added more datatypes to the serving component (#16018)

Added work.delete method to delete the work (#16103)

Added display_name property to LightningWork for the cloud (#16095)

Added ColdStartProxy to the AutoScaler (#16094)

Added status endpoint, enable ready (#16075)

Implemented ready for components (#16129)

Changed

The default start_method for creating Work processes locally on macOS is now 'spawn' (previously 'fork') (#16089)

The utility lightning.app.utilities.cloud.is_running_in_cloud now returns True during the loading of the app locally when running with --cloud (#16045)

Updated Multinode Warning (#16091)

Updated app testing (#16000)

Changed overwrite to True (#16009)

Simplified messaging in cloud dispatch (#16160)

Added annotations endpoint (#16159)

Fixed

Fixed PythonServer messaging "Your app has started" (#15989)

Fixed auto-batching to enable batching for requests that arrive after the batch interval but are still in the queue (#16110)

Fixed a bug where AutoScaler would fail with min_replica=0 (#16092)

Fixed a non-thread safe deepcopy in the scheduler (#16114)

Fixed HTTP Queue sleeping for 1 sec by default if no delta was found (#16114)

Fixed the endpoint info tab not showing up in the AutoScaler UI (#16128)

Fixed an issue where an exception would be raised in the logs when using a recent version of streamlit (#16139)

Fixed e2e tests (#16146)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.5.post0...1.8.6
Source code(tar.gz)
Source code(zip)
lightning-1.8.6-py3-none-any.whl(1.70 MB)
lightning-1.8.6.tar.gz(1.42 MB)
lightning-app-1.8.6.tar.gz(1.11 MB)
lightning-lite-1.8.6.tar.gz(93.07 KB)
lightning_app-1.8.6-py3-none-any.whl(1.18 MB)
lightning_lite-1.8.6-py3-none-any.whl(133.47 KB)
pytorch-lightning-1.8.6.tar.gz(562.70 KB)
pytorch_lightning-1.8.6-py3-none-any.whl(781.50 KB)
1.8.5.post0(Dec 16, 2022)
App

Fixed install/upgrade - removing single quote (#16079)

Fixed bug where components that are re-instantiated several times failed to initialize if they were modifying self.lightningignore (#16080)

Fixed a bug where apps that had previously been deleted could not be run again from the CLI (#16082)

Pytorch

Add function to remove checkpoint to allow override for extended classes (#16067)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.5...1.8.5.post0
Source code(tar.gz)
Source code(zip)
lightning-1.8.5.post0-py3-none-any.whl(1.69 MB)
lightning-1.8.5.post0.tar.gz(1.41 MB)
lightning-app-1.8.5.post0.tar.gz(1.11 MB)
lightning-lite-1.8.5.post0.tar.gz(93.19 KB)
lightning_app-1.8.5.post0-py3-none-any.whl(1.18 MB)
lightning_lite-1.8.5.post0-py3-none-any.whl(133.54 KB)
pytorch-lightning-1.8.5.post0.tar.gz(563.05 KB)
pytorch_lightning-1.8.5.post0-py3-none-any.whl(781.74 KB)
1.8.5(Dec 15, 2022)
App

Added

Added Lightning{Flow,Work}.lightningignores attributes to programmatically ignore files before uploading to the cloud (#15818)

Added a progress bar while connecting to an app through the CLI (#16035)

Support running on multiple clusters (#16016)

Added guards to cluster deletion from cli (#16053)

Added creation of the default .lightningignore that ignores venv (#16056)

Changed

Cleanup cluster waiting (#16054)

Fixed

Fixed DDPStrategy import in app framework (#16029)

Fixed AutoScaler raising an exception when non-default cloud compute is specified (#15991)

Fixed and improved the login flow (#16052)

Fixed the debugger detection mechanism for the lightning App in VSCode (#16068)

PyTorch

Some minor cleaning

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.4.post0...1.8.5
Source code(tar.gz)
Source code(zip)
lightning-1.8.5-py3-none-any.whl(1.69 MB)
lightning-1.8.5.tar.gz(1.41 MB)
lightning-app-1.8.5.tar.gz(1.11 MB)
lightning-lite-1.8.5.tar.gz(93.15 KB)
lightning_app-1.8.5-py3-none-any.whl(1.18 MB)
lightning_lite-1.8.5-py3-none-any.whl(133.46 KB)
pytorch-lightning-1.8.5.tar.gz(562.80 KB)
pytorch_lightning-1.8.5-py3-none-any.whl(781.60 KB)
1.8.4.post0(Dec 9, 2022)
App

Fixed MultiNode Component to use separate cloud computes (#15965)

Fixed Registration for CloudComputes of Works in L.app.structures (#15964)

Fixed a bug where auto-upgrading to the latest lightning via the CLI could get stuck in a loop (#15984)

PyTorch

Fixed the XLAProfiler not recording anything due to mismatching of action names (#15885)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.4...1.8.4.post0
Source code(tar.gz)
Source code(zip)
lightning-1.8.4.post0-py3-none-any.whl(1.69 MB)
lightning-1.8.4.post0.tar.gz(1.41 MB)
lightning-app-1.8.4.post0.tar.gz(1.10 MB)
lightning-lite-1.8.4.post0.tar.gz(93.12 KB)
lightning_app-1.8.4.post0-py3-none-any.whl(1.17 MB)
lightning_lite-1.8.4.post0-py3-none-any.whl(133.50 KB)
pytorch-lightning-1.8.4.post0.tar.gz(562.79 KB)
pytorch_lightning-1.8.4.post0-py3-none-any.whl(781.56 KB)
1.8.3.post2(Dec 9, 2022)

:robot:
Source code(tar.gz)
Source code(zip)
lightning-1.8.3.post2-py3-none-any.whl(1.67 MB)
lightning-1.8.3.post2.tar.gz(1.39 MB)
lightning-app-1.8.3.post2.tar.gz(1.09 MB)
lightning-lite-1.8.3.post2.tar.gz(92.94 KB)
lightning_app-1.8.3.post2-py3-none-any.whl(1.16 MB)
lightning_lite-1.8.3.post2-py3-none-any.whl(133.40 KB)
pytorch-lightning-1.8.3.post2.tar.gz(561.32 KB)
pytorch_lightning-1.8.3.post2-py3-none-any.whl(780.22 KB)
1.8.4(Dec 8, 2022)
App

Added

Added code_dir argument to tracer run (#15771)

Added the CLI command lightning run model to launch a LightningLite accelerated script (#15506)

Added the CLI command lightning delete app to delete a lightning app on the cloud (#15783)

Added a CloudMultiProcessBackend which enables running a child App from within the Flow in the cloud (#15800)

Added a utility for pickling work objects safely, even from a child process (#15836)

Added AutoScaler component (#15769)

Added the property ready of the LightningFlow to inform when the Open App should be visible (#15921)

Added private work attribute _start_method to customize how to start the works (#15923)

Added a configure_layout method to the LightningWork which can be used to control how the work is handled in the layout of a parent flow (#15926)

Added the ability to run a Lightning App or Component directly from the Gallery using lightning run app organization/name (#15941)

Added automatic conversion of list and dict of works and flows to structures (#15961)

Changed

The MultiNode components now warn the user when running with num_nodes > 1 locally (#15806)

Cluster creation and deletion now wait by default (#15458)

Running an app without a UI locally no longer opens the browser (#15875)

Show a message when BuildConfig(requirements=[...]) is passed but a requirements.txt file is already present in the Work (#15799)

Show a message when BuildConfig(dockerfile="...") is passed but a Dockerfile file is already present in the Work (#15799)

Dropped name column from cluster list (#15721)

Apps without UIs no longer activate the "Open App" button when running in the cloud (#15875)

Wait for full file to be transferred in Path / Payload (#15934)

Removed

Removed the SingleProcessRuntime (#15933)

Fixed

Fixed SSH CLI command listing stopped components (#15810)

Fixed bug when launching apps on multiple clusters (#15484)

Fixed Sigterm Handler causing thread lock which caused KeyboardInterrupt to hang (#15881)

Fixed MPS error for multinode component (defaults to cpu on mps devices now as distributed operations are not supported by pytorch on mps) (#15748)

Fixed the work not being stopped after it succeeds when passed directly to the LightningApp (#15801)

Fixed the PyTorch Inference locally on GPU (#15813)

Fixed the enable_spawn method of the WorkRunExecutor (#15812)

Fixed require/import decorator (#15849)

Fixed a bug where using L.app.structures would cause multiple apps to be opened and fail with an error in the cloud (#15911)

Fixed PythonServer generating noise on M1 (#15949)

Fixed multiprocessing breakpoint (#15950)

Fixed detection of a Lightning App running in debug mode (#15951)

Fixed ImportError on Multinode if package not present (#15963)

Lite

Fixed shuffle=False having no effect when using DDP/DistributedSampler (#15931)

PyTorch

Changed

Direct support for compiled models (#15922)

Fixed

Fixed issue with unsupported torch.inference_mode() on hpu backends (#15918)

Fixed LRScheduler import for PyTorch 2.0 (#15940)

Fixed fit_loop.restarting to be False for lr finder (#15620)

Fixed torch.jit.script-ing a LightningModule causing an unintended error message about deprecated use_amp property (#15947)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.3...1.8.4
Source code(tar.gz)
Source code(zip)
lightning-1.8.4-py3-none-any.whl(1.69 MB)
lightning-1.8.4.tar.gz(1.41 MB)
lightning-app-1.8.4.tar.gz(1.10 MB)
lightning-lite-1.8.4.tar.gz(93.09 KB)
lightning_app-1.8.4-py3-none-any.whl(1.17 MB)
lightning_lite-1.8.4-py3-none-any.whl(133.42 KB)
pytorch-lightning-1.8.4.tar.gz(562.35 KB)
pytorch_lightning-1.8.4-py3-none-any.whl(781.21 KB)
1.8.3.post1(Nov 25, 2022)
App

Changed

Fixed the PyTorch Inference locally on GPU (#15813)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.3...1.8.3
Source code(tar.gz)
Source code(zip)
lightning-1.8.3.post1-py3-none-any.whl(1.60 MB)
lightning-1.8.3.post1.tar.gz(1.32 MB)
lightning-app-1.8.3.post1.tar.gz(1.02 MB)
lightning-lite-1.8.3.post1.tar.gz(92.99 KB)
lightning_app-1.8.3.post1-py3-none-any.whl(1.09 MB)
lightning_lite-1.8.3.post1-py3-none-any.whl(133.40 KB)
pytorch-lightning-1.8.3.post1.tar.gz(561.41 KB)
pytorch_lightning-1.8.3.post1-py3-none-any.whl(780.22 KB)
1.8.3.post0(Nov 23, 2022)

Source code(tar.gz)
Source code(zip)
lightning-1.8.3.post0-py3-none-any.whl(1.61 MB)
lightning-1.8.3.post0.tar.gz(1.33 MB)
lightning-app-1.8.3.post0.tar.gz(1.03 MB)
lightning-lite-1.8.3.post0.tar.gz(92.94 KB)
lightning_app-1.8.3.post0-py3-none-any.whl(1.10 MB)
lightning_lite-1.8.3.post0-py3-none-any.whl(133.39 KB)
pytorch-lightning-1.8.3.post0.tar.gz(561.37 KB)
pytorch_lightning-1.8.3.post0-py3-none-any.whl(780.22 KB)
1.8.3(Nov 23, 2022)
App

Changed

Deduplicate top-level lightning CLI command groups (#15761)

lightning add ssh-key CLI command has been transitioned to lightning create ssh-key

lightning remove ssh-key CLI command has been transitioned to lightning delete ssh-key

Set Torch inference mode for prediction (#15719)

Improved LightningTrainerScript start-up time (#15751)

Disable XSRF protection in StreamlitFrontend to support upload in localhost (#15684)

Fixed

Fixed debugging with VSCode IDE (#15747)

Fixed setting property to the LightningFlow (#15750)

Lite

Changed

Temporarily removed support for Hydra multi-run (#15737)

PyTorch

Changed

Temporarily removed support for Hydra multi-run (#15737)

Switch from tensorboard to tensorboardx in TensorBoardLogger (#15728)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.2...1.8.3
Source code(tar.gz)
Source code(zip)
lightning-1.8.3-py3-none-any.whl(1.61 MB)
lightning-1.8.3.tar.gz(1.33 MB)
lightning-app-1.8.3.tar.gz(1.03 MB)
lightning-lite-1.8.3.tar.gz(92.98 KB)
lightning_app-1.8.3-py3-none-any.whl(1.10 MB)
lightning_lite-1.8.3-py3-none-any.whl(133.32 KB)
pytorch-lightning-1.8.3.tar.gz(561.37 KB)
pytorch_lightning-1.8.3-py3-none-any.whl(780.14 KB)
1.8.2(Nov 18, 2022)
App

Added

Added title and description to ServeGradio (#15639)

Added a friendly error message when attempting to run the default cloud compute with a custom base image configured (#14929)

Changed

Improved support for running apps when dependencies aren't installed (#15711)

Changed the root directory of the app (which gets uploaded) to be the folder containing the app file, rather than any parent folder containing a .lightning file (#15654)

Enabled MultiNode Components to support state broadcasting (#15607)

Prevent artefactual "running from outside your current environment" error (#15647)

Rename failed -> error in tables (#15608)

Fixed

Fixed a race condition when overwriting the frontend with app info (#15398)

Fixed bi-directional queues sending delta with Drive Component name changes (#15642)

Fixed CloudRuntime works collection with structures and accelerated multi node startup time (#15650)

Fixed catimage import (#15712)

Parse all lines in app file looking for shebangs to run commands (#15714)

Lite

Fixed

Fixed the automatic fallback from LightningLite(strategy="ddp_spawn", ...) to LightningLite(strategy="ddp", ...) when on an LSF cluster (#15103)

PyTorch

Fixed

Make sure save_dir can be empty str (#15638)

Fixed the automatic fallback from Trainer(strategy="ddp_spawn", ...) to Trainer(strategy="ddp", ...) when on an LSF cluster (#15103)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.1...1.8.2
Source code(tar.gz)
Source code(zip)
lightning-1.8.2-py3-none-any.whl(1.61 MB)
lightning-1.8.2.tar.gz(1.33 MB)
lightning-app-1.8.2.tar.gz(1.03 MB)
lightning-lite-1.8.2.tar.gz(93.06 KB)
lightning_app-1.8.2-py3-none-any.whl(1.10 MB)
lightning_lite-1.8.2-py3-none-any.whl(133.40 KB)
pytorch-lightning-1.8.2.tar.gz(561.06 KB)
pytorch_lightning-1.8.2-py3-none-any.whl(779.93 KB)
1.8.1(Nov 10, 2022)
App

Added

Added the start method to the work (#15523)

Added a MultiNode Component to run with distributed computation with any frameworks (#15524)

Exposed RunWorkExecutor to the work and provided default ones for the MultiNode Component (#15561)

Added a start_with_flow flag to the LightningWork which can be disabled to prevent the work from starting at the same time as the flow (#15591)

Added support for running Lightning App with VSCode IDE debugger (#15590)

Added bi-directional delta updates between the flow and the works (#15582)

Added --setup flag to lightning run app CLI command allowing for dependency installation via app comments (#15577)

Auto-upgrade / detect environment mis-match from the CLI (#15434)

Added Serve component (#15609)

Changed

Changed flow.flows to be recursive, to align its behavior with flow.works (#15466)

The params argument in TracerPythonScript.run no longer prepends -- automatically to parameters (#15518)

Only check versions / env when not in the cloud (#15504)

Periodically sync database to the drive (#15441)

Slightly safer multi node (#15538)

Reuse existing commands when running connect more than once (#15471)

Fixed

Fixed writing app name and id in connect.txt file for the command CLI (#15443)

Fixed missing root flow among the flows of the app (#15531)

Fixed a bug with the Multi Node Component and added some examples (#15557)

Fixed a bug where payload would take a very long time locally (#15557)

Fixed an issue with the lightning CLI taking a long time to error out when the cloud is not reachable (#15412)

Lite

Fixed

Fixed an issue with the SLURM srun detection causing permission errors (#15485)

Fixed the import of lightning_lite causing a warning 'Redirects are currently not supported in Windows or MacOs' (#15610)

PyTorch

Fixed

Fixed TensorBoardLogger not validating the input array type when logging the model graph (#15323)

Fixed an attribute error in ColossalAIStrategy at import time when torch.distributed is not available (#15535)

Fixed an issue when calling fs.listdir with file URI instead of path in CheckpointConnector (#15413)

Fixed an issue with the BaseFinetuning callback not setting the track_running_stats attribute for batch normalization layers (#15063)

Fixed an issue with WandbLogger(log_model=True|'all') raising an error and not being able to serialize tensors in the metadata (#15544)

Fixed the gradient unscaling logic when using Trainer(precision=16) and fused optimizers such as Adam(..., fused=True) (#15544)

Fixed model state transfer in multiprocessing launcher when running multi-node (#15567)

Fixed manual optimization raising AttributeError with Bagua Strategy (#12534)

Fixed the import of pytorch_lightning causing a warning 'Redirects are currently not supported in Windows or MacOs' (#15610)

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.0...1.8.1
Source code(tar.gz)
Source code(zip)
lightning-1.8.1-py3-none-any.whl(1.59 MB)
lightning-1.8.1.tar.gz(1.31 MB)
lightning-app-1.8.1.tar.gz(1.03 MB)
lightning-lite-1.8.1.tar.gz(92.91 KB)
lightning_app-1.8.1-py3-none-any.whl(1.09 MB)
lightning_lite-1.8.1-py3-none-any.whl(133.31 KB)
pytorch-lightning-1.8.1.tar.gz(560.75 KB)
pytorch_lightning-1.8.1-py3-none-any.whl(779.69 KB)
1.8.0.post1(Nov 2, 2022)
What's Changed

Implement freeze batchnorm with freezing track running stats by @PososikTeam in https://github.com/Lightning-AI/lightning/pull/15063

Pkg: fix parsing versions by @Borda in https://github.com/Lightning-AI/lightning/pull/15401

Remove pytest as a requirement to run app by @manskx in https://github.com/Lightning-AI/lightning/pull/15449

New Contributors

@PososikTeam made their first contribution in https://github.com/Lightning-AI/lightning/pull/15063

Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.0...1.8.0.post1
Source code(tar.gz)
Source code(zip)
lightning-1.8.0.post1-py3-none-any.whl(1.57 MB)
lightning-1.8.0.post1.tar.gz(1.29 MB)
lightning-app-1.8.0.post1.tar.gz(1022.57 KB)
lightning-lite-1.8.0.post1.tar.gz(92.34 KB)
lightning_app-1.8.0.post1-py3-none-any.whl(1.06 MB)
lightning_lite-1.8.0.post1-py3-none-any.whl(133.24 KB)
pytorch-lightning-1.8.0.post1.tar.gz(558.13 KB)
pytorch_lightning-1.8.0.post1-py3-none-any.whl(777.42 KB)
1.8.0(Nov 1, 2022)
The core team is excited to announce the release of Lightning 1.8 :zap:

Highlights

Backward Incompatible Changes

Deprecations

Full Changelog

Contributors

Lightning v1.8 is the culmination of work from 52 contributors who have worked on features, bug-fixes, and documentation for a total of 550+ commits since v1.7.

Highlights

Colossal-AI

Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:

    # Select the strategy with good defaults
    trainer = Trainer(strategy="colossalai")

    # or tune parameters to your liking
    from lightning.pytorch.strategies import ColossalAIStrategy

    trainer = Trainer(strategy=ColossalAIStrategy(placement_policy="cpu", ...))

You can find Colossal-AI's benchmarks with Lightning on GPT-2 here.

Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:

Data Parallelism

Pipeline Parallelism

1D, 2D, 2.5D, 3D Tensor Parallelism

Sequence Parallelism

Zero Redundancy Optimization

Learn how to install and use Colossal-AI effectively with Lightning here.

NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.

Secrets for Lightning Apps

Introducing encrypted secrets (#14612), a feature requested by Lightning App users :tada:!

Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.

Add a secret to your Lightning account in lightning.ai (read more here)

Add an environment variable to your app to read the secret:

    # somewhere in your Flow or Work:
    GitHubComponent(api_token=os.environ["API_TOKEN"])

Pass the secret to your app run with the following command:

lightning run app app.py --cloud --secret API_TOKEN=github_api_token

These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.
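On the app side, the secret arrives as an ordinary environment variable. The following is a minimal sketch of reading it defensively, so that a missing secret fails with a clear message rather than a bare KeyError. The helper name read_secret and the wording of the error are assumptions made for illustration, not part of the Lightning API.

```python
import os


def read_secret(name: str) -> str:
    """Return the value of the environment variable `name`.

    Hypothetical helper: raises a descriptive error when the variable
    is missing or empty instead of a bare KeyError.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            f"Bind a secret to it when launching the app."
        )
    return value
```

Inside a Flow or Work, the call `read_secret("API_TOKEN")` would then replace the direct `os.environ["API_TOKEN"]` lookup.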

NOTE: This is an experimental feature.

CLI Commands for Lightning Apps

Introducing CLI commands for apps (#13602)! As a Lightning App builder, if you want to easily create a CLI for users to interact with your app, then this is for you.

Here is an example where users can dynamically create notebooks from the CLI. All you need to do is implement the configure_commands hook on the LightningFlow:

    import lightning as L
    from commands.notebook.run import RunNotebook


    class Flow(L.LightningFlow):
        ...

        def configure_commands(self):
            # Return a list of dictionaries with commands:
            return [{"run notebook": RunNotebook(method=self.run_notebook)}]


    app = L.LightningApp(Flow())

Once the app is running with lightning run app app.py, you can connect to the app with the following command:

lightning connect {app name} -y

and run the command that was configured:

lightning run notebook --name=my_notebook_name

NOTE: This is an experimental feature.

Auto-wrapping for FSDP Strategy

In Lightning v1.7, we introduced an integration for PyTorch FSDP in the form of our FSDP strategy, which allows you to train huge models with billions of parameters sharded across hundreds of GPUs and machines.

    # Native FSDP implementation
    trainer = Trainer(strategy="fsdp_native")

We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).

Here are some examples:

Case 1: Model is so large that it does not fit into CPU memory. Construct your layers in the configure_sharded_model hook and wrap the large ones you want to shard across GPUs:

    class MassiveModel(LightningModule):
        ...

        # Create model here and wrap the large layers for sharding
        def configure_sharded_model(self):
            for i, layer in enumerate(self.block):
                self.block[i] = wrap(layer)
            ...

Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP's policy:

    model = MassiveModel()
    trainer = Trainer(
        accelerator="gpu",
        devices=8,
        strategy="fsdp_native",  # or strategy="fsdp" for fairscale
        precision=16
    )

    # Automatically wraps the layers here:
    trainer.fit(model)

Case 3: Model fits into GPU memory. No action required, use any strategy you want.

Note: if you want to manually wrap layers for more control, you can still do that!

Read more about FSDP and how layer wrapping works in our docs.

New Tuner Callbacks

In this release, we focused on Tuner improvements and introduced two new callbacks that can help you customize the batch size finder and learning rate finder as per your use case.

Batch Size Finder (#11089)

You can customize the BatchSizeFinder callback to run at different epochs. This feature is useful while fine-tuning models since you can't always use the same batch size after unfreezing the backbone.

    from lightning.pytorch.callbacks import BatchSizeFinder


    class FineTuneBatchSizeFinder(BatchSizeFinder):
        def __init__(self, milestones, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.milestones = milestones

        def on_fit_start(self, *args, **kwargs):
            return

        def on_train_epoch_start(self, trainer, pl_module):
            if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
                self.scale_batch_size(trainer, pl_module)


    trainer = Trainer(callbacks=[FineTuneBatchSizeFinder(milestones=(5, 10))])
    trainer.fit(...)

Run batch size finder for validate/test/predict.

    from lightning.pytorch.callbacks import BatchSizeFinder


    class EvalBatchSizeFinder(BatchSizeFinder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        def on_fit_start(self, *args, **kwargs):
            return

        def on_test_start(self, trainer, pl_module):
            self.scale_batch_size(trainer, pl_module)


    trainer = Trainer(callbacks=[EvalBatchSizeFinder()])
    trainer.test(...)
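To give a feel for what a batch size search does, here is a minimal, framework-free sketch of a doubling search: it keeps doubling the batch size until a trial fails and returns the last size that worked. It is an illustration under stated assumptions, not Lightning's implementation. The function find_batch_size and the predicate fits are hypothetical; in practice, fits would stand in for running a few steps and catching an out-of-memory error.

```python
from typing import Callable


def find_batch_size(fits: Callable[[int], bool], init: int = 2, max_trials: int = 25) -> int:
    """Double the batch size until `fits` reports a failure.

    Returns the largest size that worked, or 0 if even `init` failed.
    `fits` is a hypothetical stand-in for "run a few steps without OOM".
    """
    size = init
    best = 0
    for _ in range(max_trials):
        if not fits(size):
            break
        best = size  # remember the last size that succeeded
        size *= 2
    return best


# Pretend the hardware can handle at most 100 samples per batch.
print(find_batch_size(lambda batch_size: batch_size <= 100))
```

Re-running such a search at fine-tuning milestones makes sense because unfreezing a backbone changes the memory footprint, so a batch size that worked earlier may no longer fit.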

Learning Rate Finder (#13802)

You can now use the LearningRateFinder callback to run at different intervals. This feature is useful when fine-tuning models, for example.

    from lightning.pytorch.callbacks import LearningRateFinder


    class FineTuneLearningRateFinder(LearningRateFinder):
        def __init__(self, milestones, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.milestones = milestones

        def on_fit_start(self, *args, **kwargs):
            return

        def on_train_epoch_start(self, trainer, pl_module):
            if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
                self.lr_find(trainer, pl_module)


    trainer = Trainer(callbacks=[FineTuneLearningRateFinder(milestones=(5, 10))])
    trainer.fit(...)

LightningCLI Improvements

Even though the LightningCLI class is designed to help in the implementation of command line tools, there are instances when it might be more desirable to run directly from Python. In Lightning 1.8, you can now do this (#14596):

    from lightning.pytorch.cli import LightningCLI


    def cli_main(args):
        cli = LightningCLI(MyModel, ..., args=args)
        ...

Anywhere in your program, you can now call the CLI directly:

cli_main(["--trainer.max_epochs=100", "--model.encoder_layers=24"])
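The same "pass an explicit argument list instead of reading the process arguments" pattern can be illustrated with the standard library's argparse. This is only an analogy to show the idea, not how LightningCLI is implemented; the function cli_main_sketch and its two options are made up for the example.

```python
import argparse
from typing import List, Optional


def cli_main_sketch(args: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `args` if given; with args=None argparse falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser()
    # Hypothetical options, named after the ones used in the text above.
    parser.add_argument("--trainer.max_epochs", dest="max_epochs", type=int, default=10)
    parser.add_argument("--model.encoder_layers", dest="encoder_layers", type=int, default=12)
    return parser.parse_args(args)


# Called from Python with an explicit list, no command line involved:
config = cli_main_sketch(["--trainer.max_epochs=100", "--model.encoder_layers=24"])
print(config.max_epochs, config.encoder_layers)
```

Accepting an optional list keeps one entry point for both uses: the command line when nothing is passed, and tests, notebooks, or sweeps when a list is.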

Learn about all features of the LightningCLI!

Improvements to the SLURM Support

Multi-node training on a SLURM cluster has been supported since the inception of Lightning Trainer, and has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we've added two quality of life improvements:

The preemption/termination signal is now configurable (#14626):

    # the default signal is SIGUSR1
    trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)])

    # customize it for your cluster
    trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])

Automatic requeuing of jobs now also works for array jobs (#15040)! Array jobs are a convenient way to group/launch several scripts at once. When the SLURM scheduler interrupts your jobs, Lightning will save a checkpoint, resubmit a new job, and, once the scheduler allocates resources, the Trainer will resume from where it left off.
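The mechanism behind a configurable requeue signal can be sketched with the standard signal module: a handler records that the scheduler asked the job to stop, and the loop checks that flag at a safe point. This is a simplified illustration for POSIX systems, not Lightning's actual signal connector. The names RequeueFlag and run_steps are hypothetical.

```python
import signal


class RequeueFlag:
    """Records that the configured signal arrived (hypothetical helper)."""

    def __init__(self, requeue_signal=signal.SIGUSR1):
        self.requeue_signal = requeue_signal
        self.requested = False

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(self.requeue_signal, self.handle)

    def handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the real work later.
        self.requested = True


def run_steps(flag, steps, on_step=None):
    """Run up to `steps` iterations, stopping early once a requeue is requested."""
    done = 0
    for step in range(steps):
        if flag.requested:
            break  # a real loop would save a checkpoint and resubmit the job here
        if on_step is not None:
            on_step(step)
        done += 1
    return done
```

A cluster that sends SIGHUP instead would use `RequeueFlag(requeue_signal=signal.SIGHUP)` followed by `install()`. Deferring the checkpoint to the loop rather than doing it inside the handler avoids interrupting a step halfway through.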

Read more about our SLURM integration here.

Backward Incompatible Changes

This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.

Callback hooks for loading and saving checkpoints

The signature and behavior of the on_load_checkpoint and on_save_checkpoint callback hooks have changed (#14835):

Before:

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        ...
        # previously, we were able to return state here
        return state

    def on_load_checkpoint(self, trainer, pl_module, callback_state):
        # previously, only the state for this callback was passed in as argument
        ...

Now:

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        ...
        # returning a value here is no longer supported
        # you can modify the checkpoint dict directly
        return None

    def state_dict(self):
        ...
        # Now, return state from this new method
        return state

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        # previously, only the state for this callback was passed in as argument
        ...

    def load_state_dict(self, state):
        # Now, the state for this callback gets passed to this new method
        ...
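The state_dict / load_state_dict pair described above amounts to a save-and-restore round trip. The sketch below uses a plain Python class as a stand-in for a Callback subclass so that it runs without Lightning installed. The class CounterCallback and its batches_seen attribute are invented for illustration.

```python
class CounterCallback:
    """Plain class standing in for a Callback subclass (hypothetical example)."""

    def __init__(self):
        self.batches_seen = 0

    def on_train_batch_end(self, *args, **kwargs):
        self.batches_seen += 1

    def state_dict(self):
        # Return everything this callback needs in order to resume.
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state):
        # Receives exactly what state_dict() returned when the checkpoint was saved.
        self.batches_seen = state["batches_seen"]


# Round trip: save from one instance, restore into a fresh one.
original = CounterCallback()
for _ in range(3):
    original.on_train_batch_end()
saved = original.state_dict()

restored = CounterCallback()
restored.load_state_dict(saved)
print(restored.batches_seen)
```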

DataModule hooks for loading and saving checkpoints

The on_save_checkpoint and on_load_checkpoint hooks on the LightningDataModule have been removed in favor of the state_dict and load_state_dict methods:

    -def on_save_checkpoint(self, checkpoint):
    -    checkpoint["banana"] = self.banana
    +def state_dict(self):
    +    return dict(banana=self.banana)

    -def on_load_checkpoint(self, checkpoint):
    -    self.banana = checkpoint["banana"]
    +def load_state_dict(self, state):
    +    self.banana = state["banana"]

Callback hooks

We removed some deprecated Callback hooks that were ambiguous to use (#14834):

| Old name                  | New name                  |
|---------------------------|---------------------------|
| on_batch_start            | on_train_batch_start      |
| on_batch_end              | on_train_batch_end        |
| on_epoch_start            | on_train_epoch_start      |
| on_epoch_start            | on_validation_epoch_start |
| on_epoch_start            | on_test_epoch_start       |
| on_pretrain_routine_start | on_fit_start              |
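When migrating a larger code base, a small script can flag callbacks that still define the removed hooks. The following sketch encodes the mapping from the table above and checks which old names are defined directly in a class body. The helper find_removed_hooks and the LegacyCallback class are hypothetical and exist only for this example.

```python
# Old hook name -> candidate replacements, taken from the table above.
RENAMED_HOOKS = {
    "on_batch_start": ["on_train_batch_start"],
    "on_batch_end": ["on_train_batch_end"],
    "on_epoch_start": [
        "on_train_epoch_start",
        "on_validation_epoch_start",
        "on_test_epoch_start",
    ],
    "on_pretrain_routine_start": ["on_fit_start"],
}


def find_removed_hooks(callback_cls):
    """Return {old_name: replacements} for removed hooks defined in the class body."""
    defined = vars(callback_cls)  # only names defined directly on this class
    return {old: new for old, new in RENAMED_HOOKS.items() if old in defined}


class LegacyCallback:
    """Stand-in for a user callback written against an older release."""

    def on_batch_start(self, trainer, pl_module):
        pass

    def on_fit_start(self, trainer, pl_module):
        pass


print(find_removed_hooks(LegacyCallback))
```

Note that on_epoch_start has three possible replacements, so the choice depends on whether the logic belongs to training, validation, or testing.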

Trainer Device Attributes

We cleaned up the properties related to device indices (#14829).

The attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores,num_processes,root_gpu,data_parallel_device_ids} have been removed in favor of accelerator-agnostic attributes:

    trainer = Trainer(...)

    # access the number of devices the trainer uses on this machine ...
    print(trainer.num_devices)

    # ... or the device IDs
    print(trainer.device_ids)

Setting the torch-distributed backend

In previous versions of Lightning, switching between the "gloo" and "nccl" backends for multi-GPU, multi-node training was possible through setting an environment variable like so:

PL_TORCH_DISTRIBUTED_BACKEND="gloo" python train.py

But not all strategies support changing the backend in this way. From now on, the backend has to be set in the code (#14693):

trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"))

The default remains "nccl", and you should choose "gloo" only for debugging purposes.

Logging with multiple loggers

Logging with multiple loggers can be super useful (and super easy with Lightning). For example, you could be using one logger to record sensitive image logs to a hosted MLFlow server within your organization, and at the same time log loss curves online to WandB.

    trainer = Trainer(
        logger=[WandbLogger(...), MLFlowLogger(...)]
    )

Here are two major changes that apply when using multiple loggers in 1.8:

Checkpoints and profiler reports no longer go to a strange folder with a long, hard to remember name (#14325). From now on, these artifacts will land in the version folder of the first logger in the list.

The loggers used to be wrapped by a LoggerCollection object, so that when you accessed trainer.logger you could log to all of them simultaneously. However, this "magic" caused confusion and errors among users and we decided to simplify this (#14283):

    # now returns the first logger in the list
    print(trainer.logger)

    # access all loggers in a list with plural
    loggers = trainer.loggers

    for logger in loggers:
        logger.do_something()

Deprecations

Why is Lightning deprecating APIs in every release?

Many users have this question, and it is a fair one! Deprecations are a normal part of API evolution in all software. We continually improve Lightning, which means we make APIs like class names, methods, hooks and arguments clear, easy to remember, and general enough to adopt more functionality in the future. Sometimes we have to let old things go to build new and better products.

Learn more about our deprecation window here.

So far, we have followed the pattern of removing deprecated functionality and APIs after two minor versions of deprecation. From Lightning 1.8 onward, we will additionally convert warnings to error messages after the deprecation phase ends. This way, we can greatly improve the upgrade experience with helpful messages for users who skip more than two minor Lightning versions. The exceptions to this rule are experimental features, which are marked as such in our documentation.

Here is a summary of major deprecations introduced in 1.8:

| API | Removal version | Alternative |
|-----|-----------------|-------------|
| Argument Trainer(amp_level=...) | 1.10 | Trainer(plugins=[ApexMixedPrecisionPlugin(amp_level=...)]) |
| Function unwrap_lightning_module | 1.10 | Strategy.lightning_module |
| Function unwrap_lightning_module_sharded | 1.10 | Strategy.lightning_module |
| Import pl.core.mixins.DeviceDtypeModuleMixin | 1.10 | No longer supported |
| Argument LightningCLI(save_config_filename=...) | 1.10 | LightningCLI(save_config_kwargs=dict(config_filename=...)) |
| Argument LightningCLI(save_config_overwrite=...) | 1.10 | LightningCLI(save_config_kwargs=dict(overwrite=...)) |
| Argument LightningCLI(save_config_multifile=...) | 1.10 | LightningCLI(save_config_kwargs=dict(multifile=...)) |
| Enum TrainerFn.TUNING | 1.10 | No longer supported |
| Enum RunningStage.TUNING | 1.10 | No longer supported |
| Attribute Trainer.tuning | 1.10 | No longer supported |

CHANGELOG

Lightning App

Added

Added load_state_dict and state_dict hooks for LightningFlow components (#14100)

Added a --secret option to CLI to allow binding secrets to app environment variables when running in the cloud (#14612)

Added support for running the works without cloud compute in the default container (#14819)

Added an HTTPQueue as an optional replacement for the default redis queue (#14978)

Added support for configuring flow cloud compute (#14831)

Added support for adding descriptions to commands either through a docstring or the DESCRIPTION attribute (#15193)

Added a try / catch mechanism around request processing to avoid killing the flow (#15187)

Added a Database Component (#14995)

Added authentication to HTTP queue (#15202)

Added support for passing a LightningWork to the LightningApp (#15215)

Added support for getting CLI help for connected apps even if the app isn't running (#15196)

Added support for adding requirements to commands and installing them if they are missing when running an app command (#15198)

Added support for scoping the Lightning CLI connection to the terminal session instead of globally (#15241)

Added support for managing SSH-keys via CLI (#15291)

Added a JustPyFrontend to ease UI creation with https://github.com/justpy-org/justpy (#15002)

Added a layout endpoint to the REST API and the ability to disable pulling from or pushing to the state (#15367)

Added support for functions for configure_api and configure_commands to be executed in the REST API process (#15098)

Added support for starting a Lightning app on the cloud without needing to install dependencies locally (#15019)

Changed

Improved the show logs command to be standalone and re-usable (#15343)

Removed the --instance-types option when creating clusters (#15314)

Fixed

Fixed an issue when using the CLI without arguments (#14877)

Fixed a bug where the upload files endpoint would raise an error when running locally (#14924)

Fixed BYOC cluster region selector -> hiding it from help since only us-east-1 has been tested and is recommended (#15277)

Fixed a bug when launching an app on multiple clusters (#15226)

Fixed a bug with a default CloudCompute for Lightning flows (#15371)

Lightning Trainer

Added

Added support for requeueing slurm array jobs (#15040)

Added native AMP support for ddp_fork (and associated alias strategies) with CUDA GPUs (#14983)

Added BatchSizeFinder callback (#11089)

Added LearningRateFinder callback (#13802)

Tuner now supports a new method argument which will determine when to run the BatchSizeFinder: one of fit, validate, test or predict (#11089)

Added prefix to log message in seed_everything with rank info (#14031)

Added support for auto wrapping for DDPFullyShardedNativeStrategy (#14252)

Added support for passing extra init-parameters to the LightningDataModule.from_datasets (#14185)

Added support for saving sharded optimizer state dict outside of DDPShardedStrategy (#14208)

Added support for auto wrapping for DDPFullyShardedStrategy (#14383)

Integrated the lightning_utilities package (#14475, #14537, #14556, #14558, #14575, #14620)

Added args parameter to LightningCLI to ease running from within Python (#14596)

Added WandbLogger.download_artifact and WandbLogger.use_artifact for managing artifacts with Weights and Biases (#14551)

Added an option to configure the signal SLURM sends when a job is preempted or requeued (#14626)

Added a warning when the model passed to LightningLite.setup() does not have all parameters on the same device (#14822)

The CometLogger now flags the Comet Experiments as being created from Lightning for analytics purposes (#14906)

Introduce ckpt_path="hpc" keyword for checkpoint loading (#14911)

Added a more descriptive error message when attempting to fork processes with pre-initialized CUDA context (#14709)

Added support for custom parameters in subclasses of SaveConfigCallback (#14998)

Added inference_mode flag to Trainer to let users enable/disable inference mode during evaluation (#15034)

Added LightningLite.no_backward_sync for control over efficient gradient accumulation with distributed strategies (#14966)

Added a sanity check that scripts are executed with the srun command in SLURM and that environment variables are not conflicting (#15011)

Added an error message when attempting to launch processes with python -i and an interactive-incompatible strategy (#15293)
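
The LightningLite.no_backward_sync entry in this list (#14966) concerns gradient accumulation: with a distributed strategy, synchronizing gradients on every backward pass is wasteful when the optimizer only steps every N batches. The sketch below is a toy, standard-library model of that idea and not Lightning's implementation. ToyWorker, its methods, and the accumulate parameter are hypothetical names introduced for illustration only.

```python
from contextlib import contextmanager


class ToyWorker:
    """Hypothetical stand-in for one model replica in a data-parallel group."""

    def __init__(self):
        self.sync_enabled = True
        self.sync_count = 0      # number of gradient synchronizations performed
        self.accumulated = 0.0   # locally accumulated "gradient"

    def backward(self, grad):
        self.accumulated += grad
        if self.sync_enabled:
            # A real strategy would all-reduce gradients across processes here.
            self.sync_count += 1

    @contextmanager
    def no_backward_sync(self, enabled=True):
        """Temporarily skip synchronization while `enabled` is True."""
        previous = self.sync_enabled
        self.sync_enabled = not enabled
        try:
            yield
        finally:
            self.sync_enabled = previous


def train(worker, grads, accumulate=4):
    """Accumulate `accumulate` batches per optimizer step, syncing once per step."""
    steps = 0
    for i, grad in enumerate(grads, start=1):
        is_accumulating = i % accumulate != 0
        with worker.no_backward_sync(enabled=is_accumulating):
            worker.backward(grad)
        if not is_accumulating:
            steps += 1               # optimizer.step() would go here
            worker.accumulated = 0.0
    return steps


worker = ToyWorker()
steps = train(worker, grads=[1.0] * 8, accumulate=4)
print(steps, worker.sync_count)
```

With eight batches and an accumulation factor of four, the toy worker synchronizes only on the two batches that precede an optimizer step, rather than on all eight.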

Changed

The Trainer.{fit,validate,test,predict,tune} methods now raise a useful error message if the input is not a LightningModule (#13892)

Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)

Replaced the unwrapping logic in strategies with direct access to unwrapped LightningModule (#13738)

Enabled on_before_batch_transfer for DPStrategy and IPUAccelerator (#14023)

When resuming training with Apex enabled, the Trainer will now raise an error (#14341)

Included torch.cuda rng state to the aggregate _collect_rng_states() and _set_rng_states() (#14384)

Changed trainer.should_stop to not stop in between an epoch and run until min_steps/min_epochs only (#13890)

The pyDeprecate dependency is no longer installed (#14472)

When using multiple loggers, by default checkpoints and profiler output now get saved to the log dir of the first logger in the list (#14325)

In Lightning Lite, state-dict access to the module wrapper now gets passed through to the original module reference (#14629)

Removed fall-back to LightningEnvironment when number of SLURM tasks does not correspond to number of processes in Trainer (#14300)

Aligned DDP and DDPSpawn strategies in setting up the environment (#11073)

Integrated the Lite Precision plugins into the PL Precision plugins - the base class in PL now extends the lightning_lite.precision.Precision base class (#14798)

The PrecisionPlugin.backward signature changed: The closure_loss argument was renamed to tensor

The PrecisionPlugin.{pre_,post_}backward signature changed: The closure_loss argument was renamed to tensor and moved as the first argument

The PrecisionPlugin.optimizer_step signature changed: The model, optimizer_idx and closure arguments need to be passed as keyword arguments now

Trainer queries the CUDA devices through NVML if available to avoid initializing CUDA before forking, which eliminates the need for the PL_DISABLE_FORK environment variable introduced in v1.7.4 (#14631)

The MLFlowLogger.finalize() now sets the status to FAILED when an exception occurred in Trainer, and sets the status to FINISHED on successful completion (#12292)

It is no longer needed to call model.double() when using precision=64 in Lightning Lite (#14827)

HPC checkpoints are now loaded automatically only in slurm environment when no specific value for ckpt_path has been set (#14911)

The Callback.on_load_checkpoint now gets the full checkpoint dictionary and the callback_state argument was renamed checkpoint (#14835)

Moved the warning about saving nn.Module in save_hyperparameters() to before the deepcopy (#15132)

To avoid issues with forking processes, from PyTorch 1.13 and higher, Lightning will directly use the PyTorch NVML-based check for torch.cuda.device_count and from PyTorch 1.14 and higher, Lightning will configure PyTorch to use a NVML-based check for torch.cuda.is_available. (#15110, #15133)

The NeptuneLogger now uses neptune.init_run instead of the deprecated neptune.init to initialize a run (#15393)
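
The trainer.should_stop change in this list (#13890) means that a stop request only takes effect once the configured minimums are met. A minimal sketch of that rule follows, assuming a simplified view of the loop state. The function name and its parameters are illustrative and do not mirror Lightning's internals, and the sketch covers only the min_steps/min_epochs part of the change.

```python
from typing import Optional


def should_stop_now(
    should_stop: bool,
    global_step: int,
    current_epoch: int,
    min_steps: Optional[int] = None,
    min_epochs: Optional[int] = None,
) -> bool:
    """Return True only if a stop was requested and both minimums are met."""
    if not should_stop:
        return False
    met_min_steps = min_steps is None or global_step >= min_steps
    met_min_epochs = min_epochs is None or current_epoch >= min_epochs
    return met_min_steps and met_min_epochs


# A stop requested at step 50 is deferred because min_steps=100 is not reached.
deferred = should_stop_now(True, global_step=50, current_epoch=1, min_steps=100)
# Once the minimum is reached, the request takes effect.
honoured = should_stop_now(True, global_step=100, current_epoch=2, min_steps=100)
print(deferred, honoured)
```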

Deprecated

Deprecated LightningDeepSpeedModule (#14000)

Deprecated amp_level from Trainer in favour of passing it explicitly via precision plugin (#13898)

Deprecated the calls to pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)

Deprecated the unwrap_lightning_module and unwrap_lightning_module_sharded utility functions in favor of accessing the unwrapped LightningModule on the strategy directly (#13738)

Deprecated the pl_module argument in LightningParallelModule, LightningDistributedModule, LightningShardedDataParallel, LightningBaguaModule and LightningDeepSpeedModule wrapper classes (#13738)

Deprecated the on_colab_kaggle function (#14247)

Deprecated the internal pl.core.mixins.DeviceDtypeModuleMixin class (#14511, #14548)

Deprecated all functions in pytorch_lightning.utilities.xla_device (#14514, #14550)

Deprecated the internal inner_f function

Deprecated the internal pl_multi_process function

Deprecated the internal XLADeviceUtils.xla_available staticmethod

Deprecated the XLADeviceUtils.tpu_device_exists staticmethod in favor of pytorch_lightning.accelerators.TPUAccelerator.is_available()

Deprecated pytorch_lightning.utilities.distributed.tpu_distributed in favor of lightning_lite.accelerators.tpu.tpu_distributed (#14550)

Deprecated all functions in pytorch_lightning.utilities.cloud_io in favor of lightning_lite.utilities.cloud_io (#14515)

Deprecated the functions in pytorch_lightning.utilities.apply_func in favor of lightning_utilities.core.apply_func (#14516, #14537)

Deprecated all functions in pytorch_lightning.utilities.device_parser (#14492, #14753)

Deprecated the pytorch_lightning.utilities.device_parser.determine_root_gpu_device in favor of lightning_lite.utilities.device_parser.determine_root_gpu_device

Deprecated the pytorch_lightning.utilities.device_parser.parse_gpu_ids in favor of lightning_lite.utilities.device_parser.parse_gpu_ids

Deprecated the pytorch_lightning.utilities.device_parser.is_cuda_available in favor of lightning_lite.accelerators.cuda.is_cuda_available

Deprecated the pytorch_lightning.utilities.device_parser.num_cuda_devices in favor of lightning_lite.accelerators.cuda.num_cuda_devices

Deprecated the pytorch_lightning.utilities.device_parser.parse_cpu_cores in favor of lightning_lite.accelerators.cpu.parse_cpu_cores

Deprecated the pytorch_lightning.utilities.device_parser.parse_tpu_cores in favor of lightning_lite.accelerators.tpu.parse_tpu_cores

Deprecated the pytorch_lightning.utilities.device_parser.parse_hpus in favor of pytorch_lightning.accelerators.hpu.parse_hpus

Deprecated duplicate SaveConfigCallback parameters in LightningCLI.__init__: save_config_kwargs, save_config_overwrite and save_config_multifile. New save_config_kwargs parameter should be used instead (#14998)

Deprecated TrainerFn.TUNING, RunningStage.TUNING and trainer.tuning property (#15100)

Deprecated custom pl.utilities.distributed.AllGatherGrad implementation in favor of PyTorch's (#15364)
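
Several of the deprecations in this list simply move a function to a new module, so a lookup table can make the migration mechanical. The mapping below is copied from the device_parser entries in this list. The helpers new_location and as_import are hypothetical conveniences written for this sketch and are not part of Lightning.

```python
# Old dotted path -> replacement, as given in the deprecation notes.
DEVICE_PARSER_MOVES = {
    "pytorch_lightning.utilities.device_parser.determine_root_gpu_device":
        "lightning_lite.utilities.device_parser.determine_root_gpu_device",
    "pytorch_lightning.utilities.device_parser.parse_gpu_ids":
        "lightning_lite.utilities.device_parser.parse_gpu_ids",
    "pytorch_lightning.utilities.device_parser.is_cuda_available":
        "lightning_lite.accelerators.cuda.is_cuda_available",
    "pytorch_lightning.utilities.device_parser.num_cuda_devices":
        "lightning_lite.accelerators.cuda.num_cuda_devices",
    "pytorch_lightning.utilities.device_parser.parse_cpu_cores":
        "lightning_lite.accelerators.cpu.parse_cpu_cores",
    "pytorch_lightning.utilities.device_parser.parse_tpu_cores":
        "lightning_lite.accelerators.tpu.parse_tpu_cores",
    "pytorch_lightning.utilities.device_parser.parse_hpus":
        "pytorch_lightning.accelerators.hpu.parse_hpus",
}


def new_location(old_path: str) -> str:
    """Return the replacement path, or the input unchanged if it did not move."""
    return DEVICE_PARSER_MOVES.get(old_path, old_path)


def as_import(dotted_path: str) -> str:
    """Render a dotted path as a `from ... import ...` statement."""
    module, _, name = dotted_path.rpartition(".")
    return f"from {module} import {name}"


migrated = as_import(
    new_location("pytorch_lightning.utilities.device_parser.parse_gpu_ids")
)
print(migrated)
```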

Removed

Removed the deprecated Trainer.training_type_plugin property in favor of Trainer.strategy (#14011)

Removed all deprecated training type plugins (#14011)

Removed the deprecated DDP2Strategy (#14026)

Removed the deprecated DistributedType and DeviceType enum classes (#14045)

Removed deprecated support for passing the rank_zero_warn warning category positionally (#14470)

Removed the legacy and unused Trainer.get_deprecated_arg_names() (#14415)

Removed the deprecated on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)

Removed the deprecated training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)

Removed the experimental pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)

Removed the deprecated LoggerCollection; Trainer.logger and LightningModule.logger now returns the first logger when more than one gets passed to the Trainer (#14283)

Removed the deprecated trainer.lr_schedulers (#14408)

Removed the deprecated LightningModule.{on_hpc_load,on_hpc_save} hooks in favor of the general purpose hooks LightningModule.{on_load_checkpoint,on_save_checkpoint} (#14315)

Removed deprecated support for old torchtext versions (#14375)

Removed deprecated support for the old neptune-client API in the NeptuneLogger (#14727)

Removed the deprecated weights_save_path Trainer argument and Trainer.weights_save_path property (#14424)

Removed the following deprecated imports (#14471):

pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only

pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug

pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info

pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn

pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation

pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning

Removed deprecated Trainer.num_processes attribute in favour of Trainer.num_devices (#14423)

Removed the deprecated Trainer.data_parallel_device_ids hook in favour of Trainer.device_ids (#14422)

Removed the deprecated class TrainerCallbackHookMixin (#14401)

Removed the deprecated BaseProfiler and AbstractProfiler classes (#14404)

Removed the deprecated way to set the distributed backend via the environment variable PL_TORCH_DISTRIBUTED_BACKEND, in favor of setting the process_group_backend in the strategy constructor (#14693)

Removed deprecated callback hooks (#14834)

Callback.on_configure_sharded_model in favor of Callback.setup

Callback.on_before_accelerator_backend_setup in favor of Callback.setup

Callback.on_batch_start in favor of Callback.on_train_batch_start

Callback.on_batch_end in favor of Callback.on_train_batch_end

Callback.on_epoch_start in favor of Callback.on_{train,validation,test}_epoch_start

Callback.on_epoch_end in favor of Callback.on_{train,validation,test}_epoch_end

Callback.on_pretrain_routine_{start,end} in favor of Callback.on_fit_start

Removed the deprecated device attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores} in favor of the accelerator-agnostic Trainer.num_devices (#14829)

Removed the deprecated LightningIPUModule (#14830)

Removed the deprecated Logger.agg_and_log_metrics hook in favour of Logger.log_metrics and the agg_key_funcs and agg_default_func arguments. (#14840)

Removed the deprecated precision plugin checkpoint hooks PrecisionPlugin.on_load_checkpoint and PrecisionPlugin.on_save_checkpoint (#14833)

Removed the deprecated Trainer.root_gpu attribute in favor of Trainer.strategy.root_device (#14829)

Removed the deprecated Trainer.use_amp and LightningModule.use_amp attributes (#14832)

Removed the deprecated callback hooks Callback.on_init_start and Callback.on_init_end (#14867)

Removed the deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#14870)

Removed the deprecated SimpleProfiler.profile_iterable and AdvancedProfiler.profile_iterable attributes (#14864)

Removed the deprecated Trainer.verbose_evaluate (#14884)

Removed the deprecated Trainer.should_rank_save_checkpoint (#14885)

Removed the deprecated TrainerOptimizersMixin (#14887)

Removed the deprecated Trainer.lightning_optimizers (#14889)

Removed the deprecated TrainerDataLoadingMixin (#14888)

Removed the deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#14869)

Removed the deprecated Trainer.{validated,tested,predicted}_ckpt_path (#14897)

Removed the deprecated device_stats_monitor_prefix_metric_keys (#14890)

Removed the deprecated LightningDataModule.on_save/load_checkpoint hooks (#14909)

Removed support for returning a value in Callback.on_save_checkpoint in favor of implementing Callback.state_dict (#14835)
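
With return values from Callback.on_save_checkpoint no longer supported (#14835), callback state travels through state_dict and load_state_dict instead. The sketch below shows the shape of that pattern on a plain Python class so that it runs without Lightning installed. In real code the class would subclass pytorch_lightning.Callback. BatchCounter and its batches_seen attribute are hypothetical examples of callback state.

```python
from typing import Any, Dict


class BatchCounter:
    """Plain-Python stand-in; a real callback would subclass pytorch_lightning.Callback."""

    def __init__(self) -> None:
        self.batches_seen = 0

    def on_train_batch_end(self, *args: Any, **kwargs: Any) -> None:
        self.batches_seen += 1

    def state_dict(self) -> Dict[str, Any]:
        # Everything needed to resume goes here instead of being
        # returned from on_save_checkpoint.
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.batches_seen = state_dict["batches_seen"]


# Simulate saving after three batches and restoring into a fresh instance.
original = BatchCounter()
for _ in range(3):
    original.on_train_batch_end()
saved = original.state_dict()

restored = BatchCounter()
restored.load_state_dict(saved)
print(restored.batches_seen)
```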

Fixed

Fixed an issue with LightningLite.setup() not setting the .device attribute correctly on the returned wrapper (#14822)

Fixed an attribute error when running the tuner together with the StochasticWeightAveraging callback (#14836)

Fixed MissingFieldException in offline mode for the NeptuneLogger() (#14919)

Fixed the wandb save_dir being overridden by a None dir when using the CLI (#14878)

Fixed a missing call to LightningDataModule.load_state_dict hook while restoring checkpoint using LightningDataModule.load_from_checkpoint (#14883)

Fixed torchscript error with containers of LightningModules (#14904)

Fixed reloading of the last checkpoint on run restart (#14907)

SaveConfigCallback instances should only save the config once to allow having the overwrite=False safeguard when using LightningCLI(..., run=False) (#14927)

Fixed an issue with terminating the trainer profiler when a StopIteration exception is raised while using an IterableDataset (#14940)

Do not update on-plateau schedulers when reloading from an end-of-epoch checkpoint (#14702)

Fixed Trainer support for PyTorch built without distributed support (#14971)

Fixed batch normalization statistics calculation in StochasticWeightAveraging callback (#14866)

Avoided initializing optimizers during deepspeed inference (#14944)

Fixed LightningCLI parse_env and description in subcommands (#15138)

Fixed an exception that would occur when creating a multiprocessing.Pool after importing Lightning (#15292)

Fixed a pickling error when using RichProgressBar together with checkpointing (#15319)

Fixed the RichProgressBar crashing when used with distributed strategies (#15376)

Fixed an issue with RichProgressBar not resetting the internal state for the sanity check progress (#15377)

Fixed an issue with DataLoader re-instantiation when the attribute is an array and the default value of the corresponding argument changed (#15409)

Full commit list: https://github.com/PyTorchLightning/pytorch-lightning/compare/1.7.0...1.8.0

Contributors

Veteran

@akihironitta @ananthsub @AndresAlgaba @ar90n @Atharva-Phatak @awaelchli @BongYang @Borda @carmocca @dependabot @donlapark @ethanwharris @Felonious-Spellfire @hhsecond @jerome-habana @JustinGoheen @justusschock @kaushikb11 @krishnakalyan3 @krshrimali @luca-medeiros @manangoel99 @manskx @mauvilsa @MrShevan @nicolai86 @nmiculinic @otaj @Queuecumber @rlizzo @rohitgr7 @rschireman @SeanNaren @speediedan @tchaton @tshu-w

New

@Birch-san @clementpoiret @HalestormAI @thongonary @alecmerdler @adam-lightning @yurijmikhalevich @lijm1358 @robert-s-lee @panos-is @kacperlukawski @alro923 @dmitsf @Anner-deJong @cschell @nishantb06 @Callidior @j0rd1smit @MarcSkovMadsen @KralaBenjamin @robertomest @daniel347x @pierocor @datumbox @nohalon @pritamsoni-hsr @nandwalritik @gilfree @ritsuki1227 @christopher-nguyen-re @JulesGM @jgbos @dconathan @jsr-p @NeoKish @Blaizzy @suyash-811 @alexkuzmik @ziyadsheeba @geoffrey-g-delhomme @amrutha1098 @AlessioQuercia @ver217 @Helias @zxvix @1SAA @fabiofumarola @luca3rd @kimpty @PaulLerner @rbracco @wouterzwerink

If we forgot somebody or you have a suggestion, find support here :zap:

Did you know?

Chuck Norris can write functions of infinite recursion ... and have them return.
App/0.7.0(Oct 20, 2022)
[0.7.0] - 2022-10-20

Added

Add --secret option to CLI to allow binding Secrets to app environment variables when running in the cloud (#14612)

Added support for adding descriptions to commands either through a docstring or the DESCRIPTION attribute (#15193)

Added option to add custom meta tags to the UI container (#14915)

Added support to pass a LightningWork to the LightningApp (#15215)

Changed

Allowed root path to run the app on /path (#14972)

app/0.6.3(Oct 7, 2022)
[0.6.3] - 2022-10-07

Added

Added option to add custom meta tags to the UI container (#14915)

Changed

Allowed root path to run the app on /path (#14972)

Contributors

@pritamsoni-hsr

If we forgot someone due to not matching commit email with GitHub account, let us know :]
1.7.7(Sep 22, 2022)
[1.7.7] - 2022-09-22

Fixed

Fixed the availability check for the neptune-client package (#14714)

Break HPU Graphs into two parts (forward + backward as one and optimizer as another) for better performance (#14656)

Fixed torchscript error with ensembles of LightningModules (#14657, #14724)

Fixed an issue with TensorBoardLogger.finalize creating a new experiment when none was created during the Trainer's execution (#14762)

Fixed TypeError on import when torch.distributed is not available (#14809)

Contributors

@awaelchli @Borda @carmocca @dependabot @otaj @raoakarsha

If we forgot someone due to not matching commit email with GitHub account, let us know :)
app/0.6.2(Sep 22, 2022)
[0.6.2] - 2022-09-22

Changed

Improved Lightning App connect logic by disconnecting automatically (#14532)

Improved the error message when the LightningWork is missing the run method (#14759)

Improved the error message when the root LightningFlow passed to LightningApp is missing the run method (#14760)

Fixed

Fixed a bug where the uploaded command file wasn't properly parsed (#14532)

Fixed an issue where custom property setters were not being used in the LightningWork class (#14259)

Fixed an issue where some terminals would display broken icons in the PL app CLI (#14226)

Contributors

@awaelchli, @borda, @pranjaldatta, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]
app/0.6.1(Sep 19, 2022)
[0.6.1] - 2022-09-19

Added

Add support to upload files to the Drive through an asynchronous upload_file endpoint (#14703)

Changed

Application storage prefix moved from app_id to project_id/app_id (#14583)

LightningCloud client calls to use keyword arguments instead of positional arguments (#14685)

Fixed

Making threadpool non-default from LightningCloud client (#14757)

Resolved a bug where the state change detection using DeepDiff won't work with Path, Drive objects (#14465)

Resolved a bug where the wrong client was passed to collect cloud logs (#14684)

Resolved the memory leak issue with the Lightning Cloud package and bumped the requirements to use the latest version (#14697)

Fixing 5000 log line limitation for Lightning AI BYOC cluster logs (#14458)

Fixed a bug where the uploaded command file wasn't properly parsed (#14532)

Resolved LightningApp(..., debug=True) (#14464)

Contributors

@dmitsf @hhsecond @tchaton @nohalon @krshrimali @pritamsoni-hsr @nmiculinic @ethanwharris @yurijmikhalevich @Felonious-Spellfire @otaj @Borda

If we forgot someone due to not matching commit email with GitHub account, let us know :)
1.7.6(Sep 13, 2022)
[1.7.6] - 2022-09-13

Changed

Improved the error messaging when passing Trainer.method(model, x_dataloader=None) with no module-method implementations available (#14614)

Fixed

Reset the dataloaders on OOM failure in batch size finder to use the last successful batch size (#14372)

Fixed an issue to keep downscaling the batch size in case there hasn't been even a single successful optimal batch size with mode="power" (#14372)

Fixed an issue where self.log-ing a tensor would create a user warning from PyTorch about cloning tensors (#14599)

Fixed compatibility when torch.distributed is not available (#14454)
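
The two batch size finder fixes in this release (#14372) concern what happens when a trial runs out of memory: fall back to the last successful batch size, or keep downscaling if nothing has succeeded yet. A minimal sketch of a doubling search with that fallback follows. It assumes a hypothetical try_batch_size callable that raises MemoryError when a size does not fit. It illustrates the idea only and is not Lightning's BatchSizeFinder code.

```python
from typing import Callable, Optional


def find_batch_size(
    try_batch_size: Callable[[int], None],
    init_size: int = 2,
    max_trials: int = 25,
) -> int:
    """Double the batch size until a trial fails, then fall back.

    If no trial has succeeded yet, keep halving instead of giving up.
    """
    size = init_size
    last_successful: Optional[int] = None
    for _ in range(max_trials):
        try:
            try_batch_size(size)
        except MemoryError:
            if last_successful is not None:
                # Fall back to the last size that worked and stop searching.
                return last_successful
            if size == 1:
                raise RuntimeError("even batch size 1 does not fit")
            size = max(1, size // 2)  # nothing worked yet: keep downscaling
            continue
        last_successful = size
        size *= 2
    if last_successful is None:
        raise RuntimeError("no successful batch size found")
    return last_successful


def fake_trial(limit: int) -> Callable[[int], None]:
    """Build a trial function that 'runs out of memory' above `limit`."""
    def trial(size: int) -> None:
        if size > limit:
            raise MemoryError(f"batch size {size} does not fit")
    return trial


grown = find_batch_size(fake_trial(limit=100), init_size=2)
shrunk = find_batch_size(fake_trial(limit=10), init_size=64)
print(grown, shrunk)
```

The first call grows from 2 until 128 fails and returns the last size that fit. The second starts too large, halves until a size fits, and then stops at the first failure after that.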

Contributors

@akihironitta @awaelchli @Borda @carmocca @dependabot @krshrimali @mauvilsa @pierocor @rohitgr7 @wangraying

If we forgot someone due to not matching commit email with GitHub account, let us know :)
app/0.6.0(Sep 8, 2022)
[0.6.0] - 2022-09-08

Added

Introduce lightning connect (#14452)

Adds PanelFrontend to easily create complex UI in Python (#13531)

Add support for Lightning App Commands through the configure_commands hook on LightningFlow and ClientCommand (#13602)

Add support for Lightning AI BYOC cluster management (#13835)

Add support to see Lightning AI BYOC cluster logs (#14334)

Add support to run Lightning apps on Lightning AI BYOC clusters (#13894)

Add support for listing Lightning AI apps (#13987)

Adds LightningTrainingComponent. LightningTrainingComponent orchestrates multi-node training in the cloud (#13830)

Add support for printing application logs using CLI lightning show logs <app_name> [components] (#13634)

Add support for Lightning API through the configure_api hook on the LightningFlow and the Post, Get, Delete, Put with HttpMethods (#13945)

Added a warning when configure_layout returns URLs configured with HTTP instead of HTTPS (#14233)

Add --app_args support from the CLI (#13625)

Changed

Default values and parameter names for Lightning AI BYOC cluster management (#14132)

Run the flow only if the state has changed from the previous execution (#14076)

Increased DeepDiff's verbose level to properly handle dict changes (#13960)

Setup: added requirement freeze for the next major version (#14480)

Fixed

Unification of app template: moved app.py to root dir for lightning init app <app_name> template (#13853)

Fixed an issue with lightning --version command (#14433)

Fixed imports of collections.abc for py3.10 (#14345)

Contributors

@adam-lightning, @awaelchli, @Borda, @dmitsf, @manskx, @MarcSkovMadsen, @nicolai86, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]
1.7.5(Sep 7, 2022)
[1.7.5] - 2022-09-06

Fixed

Squeezed tensor values when logging with LightningModule.log (#14489)

Fixed WandbLogger save_dir not being set after creation (#14326)

Fixed Trainer.estimated_stepping_batches when maximum number of epochs is not set (#14317)

Contributors

@carmocca @dependabot @robertomest @rohitgr7 @tshu-w

If we forgot someone due to not matching commit email with GitHub account, let us know :)
1.7.4(Aug 31, 2022)
[1.7.4] - 2022-08-31

Added

Added an environment variable PL_DISABLE_FORK that can be used to disable all forking in the Trainer (#14319)

Fixed

Fixed LightningDataModule hparams parsing (#12806)

Reset epoch progress with batch size scaler (#13846)

Fixed restoring the trainer after using lr_find() so that the correct LR schedule is used for the actual training (#14113)

Fixed incorrect values after transferring data to an MPS device (#14368)

Contributors

@rohitgr7 @tanmoyio @justusschock @cschell @carmocca @Callidior @awaelchli @j0rd1smit @dependabot @Borda @otaj
1.7.3(Aug 25, 2022)
[1.7.3] - 2022-08-25

Fixed

Fixed an assertion error when using a ReduceOnPlateau scheduler with the Horovod strategy (#14215)

Fixed an AttributeError when accessing LightningModule.logger and the Trainer has multiple loggers (#14234)

Fixed wrong num padding for RichProgressBar (#14296)

Added back support for logging in the configure_gradient_clipping hook after unintended removal in v1.7.2 (#14298)

Fixed an issue to avoid the impact of sanity check on reload_dataloaders_every_n_epochs for validation (#13964)

Contributors

@awaelchli @Borda @carmocca @dependabot @kaushikb11 @otaj @rohitgr7
app/0.5.7(Aug 22, 2022)
[0.5.7] - 2022-08-22

Changed

Release LAI docs as stable (#14250)

Compatibility for Python 3.10

Fixed

Pinning starsessions to 1.x (#14333)

Parsed local package versions (#13933)

Contributors

@borda, @hhsecond, @manskx

If we forgot someone due to not matching commit email with GitHub account, let us know :]
app/0.5.6(Aug 18, 2022)
[0.5.6] - 2022-08-16

Fixed

Resolved a bug where the install command was not installing the latest version of an app/component by default (#14181)

Contributors

@manskx

If we forgot someone due to not matching commit email with GitHub account, let us know :]
1.7.2(Aug 17, 2022)
[1.7.2] - 2022-08-17

Added

Added FullyShardedNativeMixedPrecisionPlugin to handle precision for DDPFullyShardedNativeStrategy (#14092)

Added profiling to these hooks: on_before_batch_transfer, transfer_batch_to_device, on_after_batch_transfer, configure_gradient_clipping, clip_gradients (#14069)

Changed

Updated compatibility for LightningLite to run with the latest DeepSpeed 0.7.0 (#13967)

Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)

The default project name in WandbLogger is now "lightning_logs" (#14145)

The WandbLogger.name property no longer returns the name of the experiment, and instead returns the project's name (#14145)

Fixed

Fixed a bug that caused spurious AttributeError when multiple DataLoader classes are imported (#14117)

Fixed epoch-end logging results not being reset after the end of the epoch (#14061)

Fixed saving hyperparameters in a composition where the parent class is not a LightningModule or LightningDataModule (#14151)

Fixed the device placement when LightningModule.cuda() gets called without specifying a device index and the current cuda device was not 0 (#14128)

Avoided false positive warning about using sync_dist when using torchmetrics (#14143)

Avoid metadata.entry_points deprecation warning on Python 3.10 (#14052)

Avoid raising the sampler warning if num_replicas=1 (#14097)

Fixed resuming from a checkpoint when using Stochastic Weight Averaging (SWA) (#9938)

Avoided requiring the FairScale package to use precision with the fsdp native strategy (#14092)

Fixed an issue in which the default name for a run in WandbLogger would be set to the project name instead of a randomly generated string (#14145)

Fixed not preserving set attributes on DataLoader and BatchSampler when instantiated inside *_dataloader hooks (#14212)

Contributors

@adamreeve @akihironitta @awaelchli @Borda @carmocca @dependabot @otaj @rohitgr7
1.7.1(Aug 9, 2022)
[1.7.1] - 2022-08-09

Fixed

Casted only floating point tensors to fp16 with IPUs (#13983)

Casted tensors to fp16 before moving them to device with DeepSpeedStrategy (#14000)

Fixed the NeptuneLogger dependency being unrecognized (#13988)

Fixed an issue where users would be warned about unset max_epochs even when fast_dev_run was set (#13262)

Fixed MPS device being unrecognized (#13992)

Fixed incorrect precision="mixed" being used with DeepSpeedStrategy and IPUStrategy (#14041)

Fixed dtype inference during gradient norm computation (#14051)

Fixed a bug that caused ddp_find_unused_parameters to be set False, whereas the intended default is True (#14095)

Contributors

@adamjstewart @akihironitta @awaelchli @Birch-san @carmocca @clementpoiret @dependabot @rohitgr7
app/0.5.5(Aug 9, 2022)
[0.5.5] - 2022-08-09

Deprecated

Deprecate sheety API (#14004)

Fixed

Resolved a bug where the work statuses will grow quickly and be duplicated (#13970)

Resolved a bug about a race condition when sending the work state through the caller_queue (#14074)

Fixed starting a Lightning App on the cloud when the repo name begins with "Lightning" (#14025)

Contributors

@manskx, @rlizzo, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]
1.7.0(Aug 2, 2022)
The core team is excited to announce the release of PyTorch Lightning 1.7 :zap:

Highlights

Backward Incompatible Changes

Deprecations

Full Changelog

Contributors

PyTorch Lightning 1.7 is the culmination of work from 106 contributors who have worked on features, bug-fixes, and documentation for a total of over 492 commits since 1.6.0.

Highlights

Apple Silicon Support

For those using PyTorch 1.12 on M1 or M2 Apple machines, we have created the MPSAccelerator. MPSAccelerator enables accelerated GPU training on Apple’s Metal Performance Shaders (MPS) as a backend process.

NOTE

Support for this accelerator is currently marked as experimental in PyTorch. Because many operators are still missing, you may run into a few rough edges.

    # Selects the accelerator
    trainer = pl.Trainer(accelerator="mps")

    # Equivalent to
    from pytorch_lightning.accelerators import MPSAccelerator

    trainer = pl.Trainer(accelerator=MPSAccelerator())

    # Defaults to "mps" when run on M1 or M2 Apple machines
    # to avoid code changes when switching computers
    trainer = pl.Trainer(accelerator="gpu")

Native Fully Sharded Data Parallel Strategy

PyTorch 1.12 also added native support for Fully Sharded Data Parallel (FSDP). Previously, PyTorch Lightning enabled this by using the fairscale project. You can now choose between both options.

NOTE

Support for this strategy is marked as beta in PyTorch.

    # Native PyTorch implementation
    trainer = pl.Trainer(strategy="fsdp_native")

    # Equivalent to
    from pytorch_lightning.strategies import DDPFullyShardedNativeStrategy

    trainer = pl.Trainer(strategy=DDPFullyShardedNativeStrategy())

    # For reference, FairScale's implementation can be used with
    trainer = pl.Trainer(strategy="fsdp")

A Collaborative Training strategy using Hivemind

Collaborative Training removes the need for top-tier multi-GPU servers by allowing you to train across unreliable machines, such as local ones or even preemptible cloud compute, across the Internet.

Under the hood, we use Hivemind, which provides decentralized training across the Internet.

    from pytorch_lightning.strategies import HivemindStrategy

    trainer = pl.Trainer(
        strategy=HivemindStrategy(target_batch_size=8192),
        accelerator="gpu",
        devices=1,
    )

For more information, check out the docs.

Distributed support in Jupyter Notebooks

So far, the only multi-GPU strategy supported in Jupyter notebooks (including Grid.ai, Google Colab, and Kaggle, for example) has been the Data-Parallel (DP) strategy (strategy="dp"). DP, however, has several limitations that often obstruct users' workflows. It can be slow, it's incompatible with TorchMetrics, it doesn't persist state changes on replicas, and it's difficult to use with non-primitive input and output structures.

In this release, we've added support for Distributed Data Parallel in Jupyter notebooks using the fork mechanism to address these shortcomings. This is only available for macOS and Linux (sorry Windows!).

NOTE

This feature is experimental.

This is how you use multi-device in notebooks now:

    # Train on 2 GPUs in a Jupyter notebook
    trainer = pl.Trainer(accelerator="gpu", devices=2)

    # Can be set explicitly
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")

    # Can also be used in non-interactive environments
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_fork")

By default, the Trainer detects the interactive environment and selects the right strategy for you. Learn more in the full documentation.

Versioning of "last" checkpoints

If a run is configured to save to the same directory as a previous run and ModelCheckpoint(save_last=True) is enabled, the "last" checkpoint is now versioned with a simple -v1 suffix to avoid overwriting the existing "last" checkpoint. This mimics the behaviour for checkpoints that monitor a metric.
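The suffixing rule described above can be sketched in a few lines of standard-library Python. This is only an illustration of the naming scheme, not Lightning's implementation; the helper name `versioned_checkpoint_path` and its parameters are hypothetical.

```python
import tempfile
from pathlib import Path


def versioned_checkpoint_path(dirpath: Path, name: str = "last", ext: str = ".ckpt") -> Path:
    """Return the first free path among name.ckpt, name-v1.ckpt, name-v2.ckpt, ..."""
    candidate = dirpath / f"{name}{ext}"
    version = 0
    # Keep bumping the suffix until we find a file name that is not taken yet.
    while candidate.exists():
        version += 1
        candidate = dirpath / f"{name}-v{version}{ext}"
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    first = versioned_checkpoint_path(run_dir)   # nothing exists yet
    first.touch()                                # pretend the first run saved it
    second = versioned_checkpoint_path(run_dir)  # a later run in the same directory
    names = (first.name, second.name)

print(names)
```

The second run gets a new file name instead of overwriting the checkpoint written by the first run.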

Automatically reload the "last" checkpoint

In certain scenarios, like when running in a cloud spot instance with fault-tolerant training enabled, it is useful to load the latest available checkpoint. It is now possible to pass the string ckpt_path="last" in order to load the latest available checkpoint from the set of existing checkpoints.

    trainer = Trainer(...)
    trainer.fit(..., ckpt_path="last")
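To make "latest available checkpoint" concrete, here is a rough standard-library sketch that picks the most recently modified file among candidate "last" checkpoints. The selection rule (modification time), the glob pattern, and the helper name `find_latest_checkpoint` are assumptions for illustration; Lightning's own lookup may differ.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(dirpath: Path, pattern: str = "last*.ckpt") -> Optional[Path]:
    """Return the most recently modified checkpoint matching `pattern`, or None."""
    candidates = list(dirpath.glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    older = folder / "last.ckpt"
    newer = folder / "last-v1.ckpt"
    older.touch()
    newer.touch()
    # Set explicit modification times so the ordering is unambiguous.
    os.utime(older, (1_000, 1_000))
    os.utime(newer, (2_000, 2_000))
    latest_name = find_latest_checkpoint(folder).name
    empty_result = find_latest_checkpoint(folder, pattern="missing*.ckpt")

print(latest_name, empty_result)
```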

Validation every N batches across epochs

In some cases, for example iteration based training, it is useful to run validation after every N number of training batches without being limited by the epoch boundary. Now, you can enable validation based on total training batches.

    trainer = Trainer(..., val_check_interval=N, check_val_every_n_epoch=None)
    trainer.fit(...)

For example, given 5 epochs of 10 batches, setting N=25 would run validation in the 3rd and 5th epoch.
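The arithmetic behind this example can be checked with a small standalone sketch. It only counts total training batches and maps each validation point back to an epoch; the function name `validation_points` is hypothetical and nothing here calls Lightning.

```python
def validation_points(num_epochs: int, batches_per_epoch: int, n: int):
    """Return 1-based (epoch, batch_in_epoch) pairs after which validation would run."""
    total_batches = num_epochs * batches_per_epoch
    points = []
    # Validation triggers after every n-th training batch, counted across epochs.
    for global_batch in range(n, total_batches + 1, n):
        epoch = (global_batch - 1) // batches_per_epoch + 1
        batch_in_epoch = (global_batch - 1) % batches_per_epoch + 1
        points.append((epoch, batch_in_epoch))
    return points


schedule = validation_points(num_epochs=5, batches_per_epoch=10, n=25)
print(schedule)  # [(3, 5), (5, 10)]
```

With N=25, validation runs after batch 5 of the 3rd epoch and after the last batch of the 5th epoch.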

CPU stats monitoring

PyTorch Lightning provides the DeviceStatsMonitor callback to monitor the stats of the hardware currently used. However, users often also want to monitor the stats of other hardware. In this release, we have added an option to additionally monitor CPU stats:

    from pytorch_lightning.callbacks import DeviceStatsMonitor

    # Log both CPU stats and GPU stats
    trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="gpu")

    # Log just the GPU stats
    trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=False), accelerator="gpu")

    # Equivalent to `DeviceStatsMonitor()`
    trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="cpu")

The CPU stats are gathered using the psutil package.

Automatic distributed samplers

It is now possible to use custom samplers in a distributed environment without the need to set replace_sampler_ddp=False and wrap your sampler manually with the DistributedSampler.
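As a rough picture of what wrapping a custom sampler for distributed use involves, the sketch below splits the index order produced by a sampler across ranks, padding so that every rank receives the same number of samples. It mirrors the general pad-and-stride idea used by distributed samplers, but it is a simplified assumption and not the code Lightning runs; `shard_sampler_indices` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def shard_sampler_indices(indices: Sequence[int], num_replicas: int, rank: int) -> List[int]:
    """Give each rank an equally sized, strided slice of the sampler's index order."""
    indices = list(indices)
    if not indices:
        return []
    per_replica = math.ceil(len(indices) / num_replicas)
    total_size = per_replica * num_replicas
    # Pad by repeating indices from the start so the list divides evenly.
    repeats = math.ceil(total_size / len(indices))
    padded = (indices * repeats)[:total_size]
    # Rank r takes positions r, r + num_replicas, r + 2 * num_replicas, ...
    return padded[rank:total_size:num_replicas]


custom_order = [4, 0, 3, 1, 2]  # order produced by some custom sampler
shards = [shard_sampler_indices(custom_order, num_replicas=2, rank=r) for r in range(2)]
print(shards)  # [[4, 3, 2], [0, 1, 4]]
```

The custom ordering is preserved within each shard, and index 4 is repeated once so that both ranks process three samples.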

Inference mode support

PyTorch 1.9 introduced torch.inference_mode, which is a faster alternative to torch.no_grad. Lightning will now use inference_mode wherever possible during evaluation.

Support for warn-level determinism

In PyTorch 1.11, operations that do not have a deterministic implementation can be set to throw a warning instead of an error when run in deterministic mode. This is now supported by our Trainer:

trainer = pl.Trainer(deterministic="warn")

LightningCLI improvements

After the latest updates to jsonargparse, the library supporting the LightningCLI, there's now complete support for shorthand notation. This includes automatic support for shorthand notation to all arguments, not just the ones that are part of the registries, plus support inside configuration files.

    + # pytorch_lightning==1.7.0
      trainer:
        callbacks:
    -     - class_path: pytorch_lightning.callbacks.EarlyStopping
    +     - class_path: EarlyStopping
            init_args:
              monitor: "loss"

A header with the version that generated the config is now included.

All subclasses for a given base class can be specified by name, so there's no need to explicitly register them. The only requirement is that the module where the subclass is defined is imported prior to parsing.

    from pytorch_lightning.cli import LightningCLI

    import my_code.models
    import my_code.optimizers

    cli = LightningCLI()

    # Now use any of the classes:
    # python trainer.py fit --model=Model1 --optimizer=CustomOptimizer

The new version renders the registries and the auto_registry flag, introduced in 1.6.0, unnecessary, so we have deprecated them.

Support was also added for list appending; for example, to add a callback to an existing list that might be already configured:

    $ python trainer.py fit \
    -   --trainer.callbacks=EarlyStopping \
    +   --trainer.callbacks+=EarlyStopping \
        --trainer.callbacks.patience=5 \
    -   --trainer.callbacks=LearningRateMonitor \
    +   --trainer.callbacks+=LearningRateMonitor \
        --trainer.callbacks.logging_interval=epoch

Callback registration through entry points

Entry Points are an advanced feature in Python's setuptools that allow packages to expose metadata to other packages. In Lightning, we allow an arbitrary package to include callbacks that the Lightning Trainer can automatically use when installed, without you having to manually add them to the Trainer. This is useful in production environments where it is common to provide specialized monitoring and logging callbacks globally for every application.

A setup.py file for a callbacks plugin package could look something like this:

    from setuptools import setup

    setup(
        name="my-package",
        version="0.0.1",
        entry_points={
            # Lightning will look for this key here in the environment:
            "pytorch_lightning.callbacks_factory": [
                "monitor_callbacks=factories:my_custom_callbacks_factory"
            ]
        },
    )

Read more about callback entry points in our docs.
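For readers unfamiliar with entry points, the sketch below shows how a consumer could discover and call such factories using only the standard library's importlib.metadata. It assumes each factory is a zero-argument callable returning a list of callbacks; the function name `collect_callbacks_from_entry_points` is hypothetical, and this is not Lightning's own discovery code.

```python
from importlib.metadata import entry_points


def collect_callbacks_from_entry_points(group: str = "pytorch_lightning.callbacks_factory") -> list:
    """Load every factory registered under `group` and gather the callbacks they return."""
    all_entry_points = entry_points()
    if hasattr(all_entry_points, "select"):
        # Python 3.10+: entry points are selectable by group.
        selected = all_entry_points.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = all_entry_points.get(group, [])

    callbacks = []
    for entry_point in selected:
        factory = entry_point.load()  # imports e.g. factories:my_custom_callbacks_factory
        callbacks.extend(factory())   # assumption: the factory returns a list of callbacks
    return callbacks


# With no package registering this (made-up) group, nothing is found.
found = collect_callbacks_from_entry_points("hypothetical.group.with.no.plugins")
print(found)  # []
```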

Rank-zero only EarlyStopping messages

Our EarlyStopping callback implementation, by default, logs the stopping messages on every rank when it's run in a distributed environment. This was done in case the monitored values were not synchronized. However, some users found this verbose. To avoid this, you can now set a flag:

    from pytorch_lightning.callbacks import EarlyStopping

    trainer = pl.Trainer(callbacks=EarlyStopping(..., log_rank_zero_only=True))

A base Checkpoint class for extra customization

If you want to customize the ModelCheckpoint callback without all the extra functionality this class provides, this release provides an empty Checkpoint class for easier inheritance. In all internal code, the check is made against the Checkpoint class to ensure everything works properly for custom classes.

Validation now runs in overfitting mode

Setting overfit_batches=N now enables validation and runs N validation batches during trainer.fit.

    # Uses 1% of each train & val set
    trainer = Trainer(overfit_batches=0.01)

    # Uses 10 batches for each train & val set
    trainer = Trainer(overfit_batches=10)
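The difference between a float and an int value can be illustrated with a small standalone helper. This is a simplified sketch of the convention (a float is a fraction of the available batches, an int is an absolute count); the name `resolve_overfit_batches` and the exact rounding are assumptions, not Lightning's code.

```python
from typing import Union


def resolve_overfit_batches(total_batches: int, overfit_batches: Union[int, float]) -> int:
    """Translate an overfit_batches-style setting into a number of batches."""
    if isinstance(overfit_batches, float):
        if not 0.0 <= overfit_batches <= 1.0:
            raise ValueError("a float value must be a fraction between 0.0 and 1.0")
        # Fraction of the dataset, rounded down to whole batches.
        return int(total_batches * overfit_batches)
    # Absolute number of batches, capped by what the dataloader can provide.
    return min(total_batches, overfit_batches)


one_percent = resolve_overfit_batches(1000, 0.01)
ten_batches = resolve_overfit_batches(1000, 10)
capped = resolve_overfit_batches(5, 10)
print(one_percent, ten_batches, capped)  # 10 10 5
```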

Device Stats Monitoring support for HPUs

The DeviceStatsMonitor callback can now be used to automatically monitor and log device stats during the training stage with Habana devices.

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import DeviceStatsMonitor

    device_stats = DeviceStatsMonitor()
    trainer = Trainer(accelerator="hpu", callbacks=[device_stats])

New Hooks

LightningDataModule.load_from_checkpoint

Hyperparameters from a LightningDataModule are now saved to checkpoints and reloaded when training is resumed. Just as you use LightningModule.load_from_checkpoint to load a model from a checkpoint filepath, you can now load a LightningDataModule using the same hook.

    # Load weights without mapping ...
    datamodule = MyLightningDataModule.load_from_checkpoint('path/to/checkpoint.ckpt')

    # Or load weights and hyperparameters from separate files.
    datamodule = MyLightningDataModule.load_from_checkpoint(
        'path/to/checkpoint.ckpt',
        hparams_file='/path/to/hparams_file.yaml'
    )

    # Override some of the params with new values
    datamodule = MyLightningDataModule.load_from_checkpoint(
        'path/to/checkpoint.ckpt',
        batch_size=32,
        num_workers=10,
    )

Experimental Features

ServableModule and its Servable Module Validator Callback

When serving models in production, it is generally good practice to ensure that the model can be served and optimized before starting training, to avoid wasting money.

To do so, you can import a ServableModule (an nn.Module) and add it as an extra base class to your base model as follows:

    from pytorch_lightning import LightningModule
    from pytorch_lightning.serve import ServableModule

    class ProductionReadyModel(LightningModule, ServableModule):
        ...

To make your model servable, you would need to implement three hooks:

configure_payload: Describe the format of the payload (data sent to the server).

configure_serialization: Describe the functions used to convert the payload to tensors (de-serialization) and tensors to payload (serialization).

serve_step: The method used to transform the input tensors to a dictionary of prediction tensors.

    import base64
    from io import BytesIO
    from typing import Dict

    import torch
    import torchvision.transforms as T

    from pytorch_lightning.serve import ServableModule, ServableModuleValidator

    # LitModule, Image and Top1 are not defined in this snippet; see the full example.
    class ProductionReadyModel(LitModule, ServableModule):
        def configure_payload(self):
            # 1: Access the train dataloader and load a single sample.
            image, _ = self.trainer.train_dataloader.loaders.dataset[0]

            # 2: Convert the image to a PIL Image, then to bytes, and encode it with base64
            pil_image = T.ToPILImage()(image)
            buffered = BytesIO()
            pil_image.save(buffered, format="JPEG")
            img_str = base64.b64encode(buffered.getvalue()).decode("UTF-8")

            payload = {"body": {"x": img_str}}
            return payload

        def configure_serialization(self):
            deserializers = {"x": Image(224, 224).deserialize}
            serializers = {"output": Top1().serialize}
            return deserializers, serializers

        def serve_step(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
            return {"output": self.model(x)}

Finally, add the ServableModuleValidator callback to the Trainer to validate that the model is servable in on_train_start. This uses a FastAPI server.

    pl_module = ProductionReadyModel()
    trainer = Trainer(..., callbacks=[ServableModuleValidator()])
    trainer.fit(pl_module)

Have a look at the full example here.

Asynchronous Checkpointing

You can now save checkpoints asynchronously using the AsyncCheckpointIO plugin without blocking your training process. To enable this, pass an AsyncCheckpointIO plugin to the Trainer.

    from pytorch_lightning.plugins.io import AsyncCheckpointIO

    trainer = Trainer(plugins=[AsyncCheckpointIO()])

Have a look at the full example here.
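The general idea of non-blocking checkpoint writes can be sketched with a single background worker thread from the standard library. This is only an illustration of the technique; the class `BackgroundCheckpointWriter`, the use of pickle, and the file name are hypothetical and do not describe how AsyncCheckpointIO is implemented.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class BackgroundCheckpointWriter:
    """Write checkpoints on one worker thread so the caller is not blocked by disk I/O."""

    def __init__(self) -> None:
        # A single worker keeps writes in the order they were requested.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def save(self, state: dict, path: Path):
        # Serialize eagerly so later mutations of `state` do not leak into the file.
        payload = pickle.dumps(state)
        future = self._executor.submit(path.write_bytes, payload)
        self._pending.append(future)
        return future

    def close(self) -> None:
        # Wait for outstanding writes and re-raise any error from the worker.
        for future in self._pending:
            future.result()
        self._executor.shutdown(wait=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "step-10.ckpt"
    writer = BackgroundCheckpointWriter()
    state = {"step": 10, "weights": [0.1, 0.2]}
    writer.save(state, target)
    state["step"] = 11  # "training" continues while the write may still be in flight
    writer.close()
    restored = pickle.loads(target.read_bytes())

print(restored)
```

The restored checkpoint reflects the state at the time save was called, even though the dictionary was modified afterwards.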

Backward Incompatible Changes

This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.

Removed support for the DDP2 strategy

The DDP2 strategy, previously known as the DDP2 plugin, has been part of Lightning since its inception. Due to both the technical challenges in maintaining the plugin after PyTorch's removal of the multi-device support in DistributedDataParallel, as well as a general lack of interest, we have decided to retire the strategy entirely.

Do not force metric synchronization on epoch end

In previous versions, metrics logged inside epoch-end hooks were forcefully synced. This makes the sync_dist flag irrelevant and causes communication overhead that might be undesired. In this release, we've removed this behaviour and instead warn the user that synchronization might be desired.

Deprecations

| API | Removal version | Alternative |
|---|---|---|
| Import pytorch_lightning.loggers.base.LightningLoggerBase | 1.9 | pytorch_lightning.loggers.logger.Logger |
| Import pytorch_lightning.callbacks.base.Callback | 1.9 | pytorch_lightning.callbacks.callback.Callback |
| Import pytorch_lightning.core.lightning.LightningModule | 1.9 | pytorch_lightning.core.module.LightningModule |
| Import pytorch_lightning.loops.base.Loop | 1.9 | pytorch_lightning.loops.loop.Loop |
| Import pytorch_lightning.profiler | 1.9 | pytorch_lightning.profilers |
| Arguments Trainer(num_processes=..., gpus=..., tpu_cores=..., ipus=...) | 2.0 | Trainer(accelerator=..., devices=...) |
| Argument LightningCLI(seed_everything_default=None) | 1.9 | LightningCLI(seed_everything_default=False) |
| Method Trainer.reset_train_val_dataloaders() | 1.9 | Trainer.reset_{train,val}_dataloader |
| Import pytorch_lightning.utilities.cli module | 1.9 | pytorch_lightning.cli |
| Objects pytorch_lightning.utilities.cli.{OPTIMIZER,LR_SCHEDULER,MODEL,DATAMODULE,CALLBACK,LOGGER}_REGISTRY | 1.9 | Not necessary anymore |
| Argument LightningCLI(auto_registry=...) | 1.9 | Not necessary anymore |
| Argument Trainer(strategy="ddp2") and class pytorch_lightning.strategies.DDP2Strategy | 1.8 | No longer supported |
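When migrating a codebase, the import renames in the table can be applied mechanically. The sketch below is a hypothetical helper (not part of Lightning) that maps an old dotted path to its new location using only the renames listed above; it rewrites strings and does not check that a given name exists at the new location.

```python
# Old import path -> new import path, taken from the deprecation table.
RENAMED_IMPORTS = {
    "pytorch_lightning.loggers.base.LightningLoggerBase": "pytorch_lightning.loggers.logger.Logger",
    "pytorch_lightning.callbacks.base.Callback": "pytorch_lightning.callbacks.callback.Callback",
    "pytorch_lightning.core.lightning.LightningModule": "pytorch_lightning.core.module.LightningModule",
    "pytorch_lightning.loops.base.Loop": "pytorch_lightning.loops.loop.Loop",
    "pytorch_lightning.profiler": "pytorch_lightning.profilers",
    "pytorch_lightning.utilities.cli": "pytorch_lightning.cli",
}


def migrate_import_path(path: str) -> str:
    """Rewrite a dotted import path if it (or a parent module) was renamed."""
    # Try longer prefixes first so the most specific rename wins.
    for old in sorted(RENAMED_IMPORTS, key=len, reverse=True):
        if path == old or path.startswith(old + "."):
            return RENAMED_IMPORTS[old] + path[len(old):]
    return path


migrated_module = migrate_import_path("pytorch_lightning.core.lightning.LightningModule")
migrated_profiler = migrate_import_path("pytorch_lightning.profiler.SimpleProfiler")
untouched = migrate_import_path("pytorch_lightning.Trainer")
print(migrated_module, migrated_profiler, untouched, sep="\n")
```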

CHANGELOG

Added

Added ServableModule and its associated callback called ServableModuleValidator to ensure the model can be served (#13614)

Converted validation loop config warnings to PossibleUserWarning (#13377)

Added a flag named log_rank_zero_only to EarlyStopping to disable logging to non-zero rank processes (#13233)

Added support for reloading the last checkpoint saved by passing ckpt_path="last" (#12816)

Added LightningDataModule.load_from_checkpoint to support loading datamodules directly from checkpoint (#12550)

Added a friendly error message when attempting to call Trainer.save_checkpoint() without a model attached (#12772)

Added a friendly error message when attempting to use DeepSpeedStrategy on unsupported accelerators (#12699)

Enabled torch.inference_mode for evaluation and prediction (#12715)

Added support for setting val_check_interval to a value higher than the amount of training batches when check_val_every_n_epoch=None (#11993)

Include the pytorch_lightning version as a header in the CLI config files (#12532)

Added support for Callback registration through entry points (#12739)

Added support for Trainer(deterministic="warn") to warn instead of fail when a non-deterministic operation is encountered (#12588)

Added profiling to the loops' dataloader __next__ calls (#12124)

Hivemind Strategy

Added CollaborativeStrategy (#12842)

Renamed CollaborativeStrategy to HivemindStrategy (#13388)

Removed unnecessary endpoint logic, renamed collaborative to hivemind (#13392)

Include a version suffix for new "last" checkpoints of later runs in the same directory (#12902)

Show a better error message when a Metric that does not return a Tensor is logged (#13164)

Added missing predict_dataset argument in LightningDataModule.from_datasets to create predict dataloaders (#12942)

Added class name prefix to metrics logged by DeviceStatsMonitor (#12228)

Automatically wrap custom samplers under a distributed environment by using DistributedSamplerWrapper (#12959)

Added profiling of LightningDataModule hooks (#12971)

Added Native FSDP Strategy (#12447)

Added breaking of lazy graph across training, validation, test and predict steps when training with habana accelerators to ensure better performance (#12938)

Added Checkpoint class to inherit from (#13024)

Added CPU metric tracking to DeviceStatsMonitor (#11795)

Added teardown() method to Accelerator (#11935)

Added support for using custom Trainers that don't include callbacks using the CLI (#13138)

Added a timeout argument to DDPStrategy and DDPSpawnStrategy. (#13244, #13383)

Added XLAEnvironment cluster environment plugin (#11330)

Added logging messages to notify when FitLoop stopping conditions are met (#9749)

Added support for calling unknown methods with DummyLogger (#13224)

Added support for recursively setting the Trainer reference for ensembles of LightningModules (#13638)

Added Apple Silicon Support via MPSAccelerator (#13123)

Added support for DDP Fork (#13405)

Added support for async checkpointing (#13658)

Added support for HPU Device stats monitor (#13819)

Changed

accelerator="gpu" now automatically selects an available GPU backend (CUDA and MPS currently) (#13642)

Enable validation during overfitting (#12527)

Added dataclass support to extract_batch_size (#12573)

Changed checkpoints save path in the case of one logger and user-provided weights_save_path from weights_save_path/name/version/checkpoints to weights_save_path/checkpoints (#12372)

Changed checkpoints save path in the case of multiple loggers and user-provided weights_save_path from weights_save_path/name1_name2/version1_version2/checkpoints to weights_save_path/checkpoints (#12372)

Marked swa_lrs argument in StochasticWeightAveraging callback as required (#12556)

LightningCLI's shorthand notation changed to use jsonargparse native feature (#12614)

LightningCLI changed to use jsonargparse native support for list append (#13129)

Changed seed_everything_default argument in the LightningCLI to type Union[bool, int]. If set to True a seed is automatically generated for the parser argument --seed_everything. (#12822, #13110)

Make positional arguments required for classes passed into the add_argparse_args function. (#12504)

Raise an error if there are insufficient training batches when using a float value of limit_train_batches (#12885)

DataLoader instantiated inside a *_dataloader hook will not set the passed arguments as attributes anymore (#12981)

When a multi-element tensor is logged, an error is now raised instead of silently taking the mean of all elements (#13164)

The WandbLogger will now use the run name in the logs folder if it is provided, and otherwise the project name (#12604)

Enabled using any Sampler in distributed environment in Lite (#13646)

Raised a warning instead of forcing sync_dist=True on epoch end (#13364)

Updated val_check_interval(int) to consider total train batches processed instead of _batches_that_stepped for validation check during training (#12832)

Updated Habana Accelerator's auto_device_count, is_available & get_device_name methods based on the latest torch habana package (#13423)

Disallowed using BatchSampler when running on multiple IPUs (#13854)

Deprecated

Deprecated pytorch_lightning.accelerators.gpu.GPUAccelerator in favor of pytorch_lightning.accelerators.cuda.CUDAAccelerator (#13636)

Deprecated pytorch_lightning.loggers.base.LightningLoggerBase in favor of pytorch_lightning.loggers.logger.Logger, and deprecated pytorch_lightning.loggers.base in favor of pytorch_lightning.loggers.logger (#120148)

Deprecated pytorch_lightning.callbacks.base.Callback in favor of pytorch_lightning.callbacks.callback.Callback (#13031)

Deprecated num_processes, gpus, tpu_cores, and ipus from the Trainer constructor in favor of using the accelerator and devices arguments (#11040)

Deprecated setting LightningCLI(seed_everything_default=None) in favor of False (#12804).

Deprecated pytorch_lightning.core.lightning.LightningModule in favor of pytorch_lightning.core.module.LightningModule (#12740)

Deprecated pytorch_lightning.loops.base.Loop in favor of pytorch_lightning.loops.loop.Loop (#13043)

Deprecated Trainer.reset_train_val_dataloaders() in favor of Trainer.reset_{train,val}_dataloader (#12184)

Deprecated LightningCLI's registries in favor of importing the respective package (#13221)

Deprecated public utilities in pytorch_lightning.utilities.cli.LightningCLI in favor of equivalent copies in pytorch_lightning.cli.LightningCLI (#13767)

Deprecated pytorch_lightning.profiler in favor of pytorch_lightning.profilers (#12308)

Removed

Removed deprecated IndexBatchSamplerWrapper.batch_indices (#13565)

Removed the deprecated LightningModule.add_to_queue and LightningModule.get_from_queue method (#13600)

Removed deprecated pytorch_lightning.core.decorators.parameter_validation from decorators (#13514)

Removed the deprecated Logger.close method (#13149)

Removed the deprecated weights_summary argument from the Trainer constructor (#13070)

Removed the deprecated flush_logs_every_n_steps argument from the Trainer constructor (#13074)

Removed the deprecated process_position argument from the Trainer constructor (#13071)

Removed the deprecated checkpoint_callback argument from the Trainer constructor (#13027)

Removed the deprecated on_{train,val,test,predict}_dataloader hooks from the LightningModule and LightningDataModule (#13033)

Removed the deprecated TestTubeLogger (#12859)

Removed the deprecated pytorch_lightning.core.memory.LayerSummary and pytorch_lightning.core.memory.ModelSummary (#12593)

Removed the deprecated summarize method from the LightningModule (#12559)

Removed the deprecated model_size property from the LightningModule class (#12641)

Removed the deprecated stochastic_weight_avg argument from the Trainer constructor (#12535)

Removed the deprecated progress_bar_refresh_rate argument from the Trainer constructor (#12514)

Removed the deprecated prepare_data_per_node argument from the Trainer constructor (#12536)

Removed the deprecated pytorch_lightning.core.memory.{get_gpu_memory_map,get_memory_profile} (#12659)

Removed the deprecated terminate_on_nan argument from the Trainer constructor (#12553)

Removed the deprecated XLAStatsMonitor callback (#12688)

Removed deprecated pytorch_lightning.callbacks.progress.progress (#12658)

Removed the deprecated dim and size arguments from the LightningDataModule constructor (#12780)

Removed the deprecated train_transforms argument from the LightningDataModule constructor (#12662)

Removed the deprecated log_gpu_memory argument from the Trainer constructor (#12657)

Removed the deprecated automatic logging of GPU stats by the logger connector (#12657)

Removed deprecated GPUStatsMonitor callback (#12554)

Removed support for passing strategy names or strategy instances to the accelerator Trainer argument (#12696)

Removed support for passing strategy names or strategy instances to the plugins Trainer argument (#12700)

Removed the deprecated val_transforms argument from the LightningDataModule constructor (#12763)

Removed the deprecated test_transforms argument from the LightningDataModule constructor (#12773)

Removed deprecated Trainer(max_steps=None) (#13591)

Removed deprecated dataloader_idx argument from on_train_batch_start/end hooks Callback and LightningModule (#12769, #12977)

Removed deprecated get_progress_bar_dict property from LightningModule (#12839)

Removed sanity check for multi-optimizer support with habana backends (#13217)

Removed the need to explicitly load habana module (#13338)

Removed the deprecated Strategy.post_dispatch() hook (#13461)

Removed deprecated pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor.lr_sch_names (#13353)

Removed deprecated Trainer.slurm_job_id in favor of SLURMEnvironment.job_id (#13459)

Removed support for the DDP2Strategy (#12705)

Removed deprecated LightningDistributed (#13549)

Removed deprecated ClusterEnvironment properties master_address and master_port in favor of main_address and main_port (#13458)

Removed deprecated ClusterEnvironment methods KubeflowEnvironment.is_using_kubeflow(), LSFEnvironment.is_using_lsf() and TorchElasticEnvironment.is_using_torchelastic() in favor of the detect() method (#13458)

Removed deprecated Callback.on_keyboard_interrupt (#13438)

Removed deprecated LightningModule.on_post_move_to_device (#13548)

Removed TPUSpawnStrategy.{tpu_local_core_rank,tpu_global_core_rank} attributes in favor of TPUSpawnStrategy.{local_rank,global_rank} (#11163)

Removed SingleTPUStrategy.{tpu_local_core_rank,tpu_global_core_rank} attributes in favor of SingleTPUStrategy.{local_rank,global_rank} (#11163)

Fixed

Improved support for custom DataLoaders when instantiated in *_dataloader hook (#12981)

Allowed custom BatchSamplers when instantiated in *_dataloader hook (#13640)

Fixed an issue with unsupported torch.inference_mode() on hpu backends by making it use no_grad (#13014)

The model wrapper returned by LightningLite.setup() now properly supports pass-through when looking up attributes (#12597)

Fixed issue where the CLI fails with certain torch objects (#13153)

Fixed LightningCLI signature parameter resolving for some lightning classes (#13283)

Fixed Model Summary when using DeepSpeed Stage 3 (#13427)

Fixed pytorch_lightning.utilities.distributed.gather_all_tensors to handle tensors of different dimensions (#12630)

Fixed the input validation for the accelerator Trainer argument when passed as a string (#13417)

Fixed Trainer.predict(return_predictions=False) to track prediction's batch_indices (#13629)

Fixed an issue that prevented setting a custom CheckpointIO plugin with strategies (#13785)

Fixed main progress bar counter when val_check_interval=int and check_val_every_n_epoch=None (#12832)

Improved support for custom ReduceLROnPlateau scheduler if reduce_on_plateau is set by the user in scheduler config (#13838)

Used global_step while restoring logging step for old checkpoints (#13645)

When training with precision=16 on IPU, the cast has been moved off the IPU onto the host, making the copies from host to IPU cheaper (#13880)

Fixed error handling in learning rate finder when not enough data points are available to give a good suggestion (#13845)

Fixed an issue that caused the learning rate finder to set the model's learning rate to None when no suggestion was possible (#13845)

Fixed an issue causing deterministic algorithms and other globals to get reset in spawned processes (#13921)

Fixed default amp_level for DeepSpeedPrecisionPlugin to O2 (#13897)

Fixed Python 3.10 compatibility for truncated back-propagation through time (TBPTT) (#13973)

Fixed TQDMProgressBar reset and update to show correct time estimation (2/2) (#13962)

Full commit list: https://github.com/PyTorchLightning/pytorch-lightning/compare/1.6.0...1.7.0

Contributors

Veteran

@akashkw @akihironitta @aniketmaurya @awaelchli @Benjamin-Etheredge @Borda @carmocca @catalys1 @daniellepintz @edenlightning @edward-io @EricWiener @fschlatt @ftorres16 @jerome-habana @justusschock @karthikrangasai @kaushikb11 @krishnakalyan3 @krshrimali @mauvilsa @nikvaessen @otaj @pre-commit-ci @puhuk @raoakarsha @rasbt @rohitgr7 @SeanNaren @s-rog @talregev @tchaton @tshu-w @twsl @weiji14 @williamFalcon @WrRan

New

@alvitawa @aminst @ankitaS11 @ar90n @Atharva-Phatak @bibhabasumohapatra @BongYang @code-review-doctor @CompRhys @Cyprien-Ricque @dependabot @digital-idiot @DN6 @donlapark @ekagra-ranjan @ethanfurman @gautierdag @georgestein @HallerPatrick @HenryLau0220 @hhsecond @himkt @HMellor @igorgad @inwaves @ishtos @JeroenDelcour @JiahaoYao @jiny419 @jinyoung-lim @JustinGoheen @jxmorris12 @Keiku @kingjuno @lsy643 @luca-medeiros @lukasugar @maciek-pioro @mads-oestergaard @manskx @martinosorb @MohammedAlkhrashi @MrShevan @myxik @naisofly @NathanielDamours @nayoungjun @niberger @nitinramvelraj @nninept @pbsds @Pragyanstha @PrajwalBorkar @Prometheos2 @rampartrange @rhjohnstone @rschireman @samz5320 @Schinkikami @semaphore-egg @shantam-8 @shenoynikhil @sisilmehta2000 @s-kumano @stanbiryukov @talregev @tanmoyio @tkonopka @vumichien @wangherr @yhl48 @YongWookHa

If we forgot somebody or you have a suggestion, find support here :zap:

Did you know?

Chuck Norris can unit-test entire applications with a single assert.

