PyTorch extensions for high performance and large scale training.

Facebook Research

Last update: Dec 28, 2022

Overview

Description

FairScale is a PyTorch extension library for high performance and large scale training on one or multiple machines/nodes. This library extends basic PyTorch capabilities while adding new experimental ones.

FairScale supports:

Parallelism:
- Pipeline parallelism (fairscale.nn.pipe)
- Asynchronous Pipeline parallelism (fairscale.nn.async_pipe)
- Model Parallelism (fairscale.nn.model_parallel.layers)
- experimental AmpNet (fairscale.experimental.nn.ampnet_pipe)
Sharded training:
- Optimizer state sharding (fairscale.optim.OSS)
- Sharded Data Parallel (SDP) (fairscale.nn.ShardedDataParallel)
- Fully Sharded Data Parallel (FSDP) (fairscale.nn.FullyShardedDataParallel) (PyTorch >= 1.6)
Optimization at scale:
- AdaScale SGD (fairscale.optim.AdaScale)
GPU memory optimization:
- Activation checkpointing wrapper (fairscale.nn.misc.checkpoint_wrapper)
GPU speed optimization:
- Sharded grad scaler - automatic mixed precision (fairscale.optim.grad_scaler)

Requirements

PyTorch >= 1.5.1

Installation

Normal installation:

pip install fairscale

Development mode:

cd fairscale
pip install -r requirements.txt
pip install -e .

If either of the above fails, add --no-build-isolation to the pip install command (this could be a problem with recent versions of pip).

Getting Started

The full documentation (https://fairscale.readthedocs.io/) contains instructions for getting started and extending fairscale.

Examples

Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

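import torch

import fairscale

# a, b, c, d are placeholder layers; substitute your own modules.
a, b, c, d = (torch.nn.Linear(10, 10) for _ in range(4))
model = torch.nn.Sequential(a, b, c, d)
# balance=[2, 2] puts two layers on each device; chunks=8 splits each batch into 8 micro-batches.
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)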
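To make the roles of balance and chunks more concrete, here is a small pure-Python sketch that does not use fairscale. The helper names split_by_balance and split_into_chunks are hypothetical; they only mimic the partitioning described above and are not part of the library.

```python
def split_by_balance(layers, balance):
    """Partition a flat list of layers into consecutive groups, one per device."""
    if sum(balance) != len(layers):
        raise ValueError("balance must sum to the number of layers")
    partitions, start = [], 0
    for count in balance:
        partitions.append(layers[start:start + count])
        start += count
    return partitions


def split_into_chunks(batch, chunks):
    """Split a non-empty batch into micro-batches of near-equal size."""
    size = -(-len(batch) // chunks)  # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]


partitions = split_by_balance(["a", "b", "c", "d"], balance=[2, 2])
micro_batches = split_into_chunks(list(range(32)), chunks=8)
print(partitions)          # [['a', 'b'], ['c', 'd']]
print(len(micro_batches))  # 8
```

Each device then processes its group of layers on one micro-batch while the next device works on the previous one, which is where the pipelining benefit comes from.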

Optimizer state sharding (ZeRO)

See a more complete example here, but a minimal example could look like the following:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD  # pick any PyTorch-compliant optimizer here
    base_optimizer_arguments = {"lr": 0.1}  # optimizer-specific arguments (illustrative value; older PyTorch releases require lr for SGD), or pass them directly when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
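As a rough illustration of what sharding optimizer state across ranks means, the sketch below assigns each parameter to the rank that currently owns the fewest elements. The greedy rule and the partition_parameters helper are assumptions made for illustration and are not necessarily how OSS partitions parameters internally. Each rank would then keep optimizer state only for the parameters it owns.

```python
def partition_parameters(param_sizes, world_size):
    """Greedily assign parameters (name -> number of elements) to ranks."""
    owned = [[] for _ in range(world_size)]
    load = [0] * world_size
    # Placing large parameters first tends to give a more even split.
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        rank = load.index(min(load))  # least-loaded rank so far
        owned[rank].append(name)
        load[rank] += size
    return owned, load


sizes = {"embed": 1000, "layer1": 400, "layer2": 400, "head": 200}
owned, load = partition_parameters(sizes, world_size=2)
print(owned)  # [['embed'], ['layer1', 'layer2', 'head']]
print(load)   # [1000, 1000]
```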

AdaScale SGD

AdaScale wraps an SGD optimizer and can be used in DDP (Distributed Data Parallel) training or in non-DDP training with gradient accumulation. The benefit is that the LR schedule from a baseline batch size can be reused when the effective batch size is bigger.

Note that AdaScale does not help increase per-GPU batch size.

from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy. (However, training time might be longer than without AdaScale.)

At a high level, we want ML researchers to:

- go parallel more easily (i.e. no need to find new learning rate schedules)
- not worry about losing accuracy
- potentially get higher GPU efficiency (fewer steps, less networking overhead, etc.)

Testing

We use CircleCI to test on PyTorch versions 1.6.0, 1.7.1, and 1.8.0. Please create an issue if you are having trouble with installation.

Contributors

See the CONTRIBUTING file for how to help out.

License

fairscale is licensed under the BSD-3-Clause License.

References

Here is a list of all authors on relevant research papers this work is based on:

torchgpipe: Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim. [Paper] [Code]
ZeRO: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. [Paper] [Code]
Megatron-LM: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. [Paper] [Code]
AdaScale SGD: Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin. [Paper]
GShard: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. [Paper]
AMPNet: Alexander L. Gaunt, Matthew A. Johnson, Maik Riechert, Daniel Tarlow, Ryota Tomioka, Dimitrios Vytiniotis, Sam Webster. [Paper]

Comments

[ShardedDDP] Handle transition to eval + parameter change

🐛 Bug

Reported by @SeanNaren: after a successful training run, the switch to eval() is not properly taken into account, and the grads are marked as "waiting to be reduced" when they should not be (we're in eval mode).

opened by blefaudeux 26
FSDP: issues with inferencing

❓ Questions and Help

I am trying to integrate FSDP into my code and have questions related to optimizer sharding. Does FSDP automatically shard the gradients, optimizer state, and parameters, or do I need to call OSS to shard the optimizer? Also, can anyone suggest the right way to do mixed precision with autocast and FSDP? Finally, does validation/testing happen on GPU rank 0 only or on all the nodes?
FSDP

opened by HITESHLPATEL 24
[feat] Add context manager to FSDP for easier child module wrapping
What does this PR do?

As discussed in https://github.com/PyTorchLightning/pytorch-lightning/pull/6152#issuecomment-785950642 this adds a context manager that assists in making child modules with similar defaults.

from fairscale.nn.misc import enable_wrap, wrap

with enable_wrap(**handleful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)  # Override parameters if you'd like
    ...

# without the context manager, creates Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))

If not within the FSDP context, this would be a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.

Before submitting

[x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)

[x] Did you read the contributor guideline?

[ ] Did you make sure to update the docs?

[x] Did you write any new necessary tests?

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃
CLA Signed
opened by SeanNaren 24
[Fix][FSDP] Don't remove post backward hooks for multiple backward fix

fixes #918

I am quite confident that we don't need to remove backward hooks even after finalizing. They will be removed automatically once the leaf variables go out of scope and the CUDA autograd graph is cleaned up.

Most tests were succeeding locally, apart from one related to CPU offload; I will debug that.
CLA Signed

opened by ngoyal2707 22
[feat][OSS] elastic and pytorch compatible checkpoints
Before submitting

[ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)

[x] Did you read the contributor guideline?

[ ] Did you make sure to update the docs?

[x] Did you write any new necessary tests?

What does this PR do?

Fixes #164 and makes the saved state PyTorch-compliant (no extra keyword). The number of ranks can change between saving and loading a checkpoint; the state adapts automatically by repartitioning at load time. Adds a new unit test which checks reproducibility (cc @joshim5).

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged. @joshim5 @stas00 @SeanNaren this breaks compatibility with old checkpoints, is that a big issue ? I could add some scaffolding to move old checkpoints to the new form.

@mannatsingh I know you mentioned that a long time ago, finally there. Not sure how you would rate the complexity of doing this (masking the sharding when rebuilding a pytorch-compatible state dict), but it's now out of the box with this PR

Did you have fun?

Make sure you had fun coding 🙃
CLA Signed
opened by blefaudeux 22
[RFC] FSDP arguments refactoring
cc @myleott @sshleifer @QuentinDuval @prigoyal @anj-s @blefaudeux @msbaines @tmarkstrum @SeanNaren

Motivations

Currently FSDP has a long list of arguments. They deal with flattening, sharding, mixed precision, and CPU offloading. The long list makes it harder for users to see at a glance which combinations are supported and which are not. It is also harder for developers to reason about interactions and to unit test all combinations.

Proposal

I propose we separate the params into different groups, as the following:

- parameter_handling: enum of "flatten", "original"
- sharding: enum of "none, full, shard_after_backward"
- mixed_precision: enum of "full, amp_w32_b32_g32, amp_w16_b16_g16, ..."
- cpu_offloading: enum of "none, grad, param, grad_and_param"

The combination space above is still very big but at least the related arguments are grouped and it is easier to see what's supported and what's not.

Implementation

We can first put the new arguments in place and deprecate the older ones. Some of the options may raise NotImplementedError at first, and we will add implementation support gradually.

I learned from Benjamin that we can use an enum class that is also a string, like:

class Foo(str, Enum):
    V1 = "val1"

This way, we can have both yaml config and enum based argument passing.
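A minimal sketch of how such a string-backed enum could accept both forms of input, using the sharding values from the proposal above. The class name Sharding and the parse_sharding helper are hypothetical and are not part of the FSDP API.

```python
from enum import Enum


class Sharding(str, Enum):
    """Hypothetical enum whose members are also plain strings."""

    NONE = "none"
    FULL = "full"
    SHARD_AFTER_BACKWARD = "shard_after_backward"


def parse_sharding(value):
    """Accept an enum member or a raw string (e.g. loaded from a YAML config)."""
    return Sharding(value)  # raises ValueError for unsupported values


print(parse_sharding("full") is Sharding.FULL)         # True
print(parse_sharding(Sharding.FULL) is Sharding.FULL)  # True
print(Sharding.FULL == "full")                         # True
```

Because each member is also a str, comparing against values read from a config file works without an explicit conversion step.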

Any comments or suggestions are most welcome from the cc'ed folks and the community. Please add people if I missed anyone.
opened by min-xu-ai 21

ShardedDataParallel doesn't work with multiple nodes

🐛 Bug

ShardedDataParallel works with 8 GPUs x 1 node, but fails with the following error on 8 GPUs x 2 nodes:

AssertionError: A bucket failed to be sent, probably unused parameters.Either remove the unused parameter or de-activate ShardedDDP buckets -set reduce_buffer_size to 0-

However, there are clearly no unused parameters. Note that torch.nn.parallel.DistributedDataParallel works in the same environment.

To Reproduce

import os
import sys 

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS 
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP


def train(rank, local_rank, world_size, init_method):
    print("DDP init", world_size, rank, local_rank, init_method, file=sys.stderr)
    dist.init_process_group(backend='nccl', init_method=init_method, rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(3, 3).cuda()
    base_optimizer = torch.optim.Adam
    base_optimizer_arguments = {}

    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)
    model = ShardedDDP(model, optimizer)

    print("Train", file=sys.stderr)
    model.train()
    model.zero_grad()
    outputs = model(torch.randn(2,3).cuda())
    loss = outputs.sum()
    loss.backward()
    optimizer.step()
    print("finish", file=sys.stderr)

I used shared file system initialization for init_method.

init_method="file:..."

Environment

Note that I tested both TCP and InfiniBand connections.

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 450.51.06
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.0.4
/usr/lib64/libcudnn_adv_infer.so.8.0.4
/usr/lib64/libcudnn_adv_train.so.8.0.4
/usr/lib64/libcudnn_cnn_infer.so.8.0.4
/usr/lib64/libcudnn_cnn_train.so.8.0.4
/usr/lib64/libcudnn_ops_infer.so.8.0.4
/usr/lib64/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.0
[pip3] torch==1.7.1
[pip3] torch-complex==0.2.0
[pip3] torch-optimizer==0.0.1a17
[pip3] torchaudio==0.7.2
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.0.221             h6bb024c_0  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.2.0            py38h23d657b_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.2           py38h54aff64_0  
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] pytorch-wpe               0.0.0                    pypi_0    pypi
[conda] torch-complex             0.2.0                    pypi_0    pypi
[conda] torch-optimizer           0.0.1a17                 pypi_0    pypi
[conda] torchaudio                0.7.2                    pypi_0    pypi

bug

opened by kamo-naoyuki 19

[feat] ShardedDataParallel with autoreduce
Before submitting

[x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)

[x] Did you read the contributor guideline?

[x] Did you make sure to update the docs?

[x] Did you write any new necessary tests?

What does this PR do?

Stopgap solution until PyTorch's DDP becomes flexible enough to accommodate different reduction patterns out of the box: another DDP, dedicated to the sharded optimizer, which automatically reduces gradients to the appropriate ranks and releases the grad buffers.

Key features that this PR brings or maintains in a different form:

[x] automatic gradient reduction to the appropriate ranks

[x] overlap the gradient reduction and the backward pass

[x] keep the tunable bucketing of small gradients

[x] keep the reduce calls asynchronous (non-blocking)

cc @mrshenli

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃
CLA Signed
opened by blefaudeux 19
[feat] Gossip/SlowMo
Before submitting

[ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)

[ ] Did you read the contributor guideline?

[ ] Did you make sure to update the docs?

[ ] Did you write any new necessary tests?

What does this PR do?

Disclaimer: I (@lefaudeux) am not the author, Vinayak (@vtantia) is. Just testing the CI and putting up a draft PR.

TODOs:

[x] Write documentation

[x] Fix the licensing

[x] Make sure that the unit tests run with the global pytest runner

[x] Factorize the unit tests a little, cleanup/autogenerate

[ ] Add tutorial

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃
CLA Signed
opened by blefaudeux 18

Fail on install - CUDA 11.1.0

🐛 Bug

Hi, pip install in the environment below throws an error. I'm happy to provide more info if it would be useful. Thanks!

fatal error: multi_tensor_apply.cuh: No such file or directory

Environment (NVIDIA-Python Docker: 20.10)

Python: 3.6
PyTorch: 1.7.0
CUDA: 11.1.0
cuDNN: 8.0.4

Error

Error:
b'  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpl6xin7ge
       cwd: /tmp/pip-install-0_xiegmi/fairscale
  Complete output (158 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/fairscale
  copying fairscale/__init__.py -> build/lib.linux-x86_64-3.6/fairscale
  creating build/lib.linux-x86_64-3.6/fairscale/optim
  copying fairscale/optim/grad_scaler.py -> build/lib.linux-x86_64-3.6/fairscale/optim
  copying fairscale/optim/oss.py -> build/lib.linux-x86_64-3.6/fairscale/optim
  copying fairscale/optim/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/optim
  copying fairscale/optim/utils.py -> build/lib.linux-x86_64-3.6/fairscale/optim
  copying fairscale/optim/adam.py -> build/lib.linux-x86_64-3.6/fairscale/optim
  creating build/lib.linux-x86_64-3.6/fairscale/nn
  copying fairscale/nn/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn
  creating build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/dependency.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/pipe.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/stream.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/batchnorm.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/pipeline.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/checkpoint.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/copy.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/phony.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/worker.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  copying fairscale/nn/pipe/microbatch.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
  creating build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/random.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/initialize.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/layers.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/cross_entropy.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/mappings.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  copying fairscale/nn/model_parallel/utils.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
  creating build/lib.linux-x86_64-3.6/fairscale/nn/data_parallel
  copying fairscale/nn/data_parallel/sharded_ddp.py -> build/lib.linux-x86_64-3.6/fairscale/nn/data_parallel
  copying fairscale/nn/data_parallel/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/data_parallel
  creating build/lib.linux-x86_64-3.6/fairscale/nn/moe
  copying fairscale/nn/moe/top2gate.py -> build/lib.linux-x86_64-3.6/fairscale/nn/moe
  copying fairscale/nn/moe/moelayer.py -> build/lib.linux-x86_64-3.6/fairscale/nn/moe
  copying fairscale/nn/moe/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/moe
  creating build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
  copying fairscale/nn/pipe/balance/profile.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
  copying fairscale/nn/pipe/balance/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
  copying fairscale/nn/pipe/balance/blockpartition.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
  creating build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  copying fairscale/nn/pipe/skip/skippable.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  copying fairscale/nn/pipe/skip/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  copying fairscale/nn/pipe/skip/tracker.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  copying fairscale/nn/pipe/skip/layout.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  copying fairscale/nn/pipe/skip/portal.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  copying fairscale/nn/pipe/skip/namespace.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
  running egg_info
  writing fairscale.egg-info/PKG-INFO
  writing dependency_links to fairscale.egg-info/dependency_links.txt
  writing requirements to fairscale.egg-info/requires.txt
  writing top-level names to fairscale.egg-info/top_level.txt
  reading manifest file \'fairscale.egg-info/SOURCES.txt\'
  reading manifest template \'MANIFEST.in\'
  writing manifest file \'fairscale.egg-info/SOURCES.txt\'
  creating build/lib.linux-x86_64-3.6/fairscale/clib
  creating build/lib.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
  copying fairscale/clib/fused_adam_cuda/fused_adam_cuda.cpp -> build/lib.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
  copying fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu -> build/lib.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
  running build_ext
  building \'fairscale.fused_adam_cuda\' extension
  creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6
  creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale
  creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib
  creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
  Emitting ninja build file /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/2] /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c -c /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu -o /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options \'\'"\'"\'-fPIC\'"\'"\'\' -O3 --use_fast_math -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -std=c++14
  FAILED: /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.o
  /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c -c /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu -o /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options \'\'"\'"\'-fPIC\'"\'"\'\' -O3 --use_fast_math -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -std=c++14
  /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu:12:10: fatal error: multi_tensor_apply.cuh: No such file or directory
   #include "multi_tensor_apply.cuh"
            ^~~~~~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  [2/2] c++ -MMD -MF /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda.o.d -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c -c /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda.cpp -o /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
  In file included from /usr/local/lib/python3.6/dist-packages/torch/include/ATen/Parallel.h:149:0,
                   from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                   from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                   from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                   from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                   from /usr/local/lib/python3.6/dist-packages/torch/include/torch/extension.h:4,
                   from /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda.cpp:1:
  /usr/local/lib/python3.6/dist-packages/torch/include/ATen/ParallelOpenMP.h:84:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
   #pragma omp parallel for if ((end - begin) >= grain_size)
  
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 1522, in _run_ninja_build
      env=env)
    File "/usr/lib/python3.6/subprocess.py", line 438, in run
      output=stdout, stderr=stderr)
  subprocess.CalledProcessError: Command \'[\'ninja\', \'-v\']\' returned non-zero exit status 1.

bug

opened by johncookds 17

Add FullyShardedDataParallel (FSDP)
Co-authored-by: @min-xu-ai and @sshleifer

Overview

Recent work by Microsoft and Google has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new FullyShardedDataParallel (FSDP) wrapper, which is a drop-in replacement for PyTorch's DistributedDataParallel (DDP) wrapper.

Compared to PyTorch DDP:

- FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
- FSDP with reshard_after_forward=False has the same communication cost as PyTorch DDP and is similar to ZeRO-2
- FSDP with reshard_after_forward=True increases total communication by 50% and is similar to ZeRO-3:
  - all-gather parameters at start of forward pass and start of backward pass
  - reduce-scatter grads at end of backward pass
- in practice, FSDP is faster than PyTorch DDP because the optimizer step is sharded, and the extra communication can be overlapped with the forward pass
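The 50% figure can be checked with a back-of-the-envelope calculation. The sketch below assumes ring-based collectives, where one reduce-scatter or one all-gather over M elements sends M * (N - 1) / N elements per worker and an all-reduce costs one of each. The function names are hypothetical, and real backends may behave differently.

```python
def collective_volume(num_elements, world_size):
    """Elements sent per worker for one ring reduce-scatter or all-gather."""
    return num_elements * (world_size - 1) / world_size


def ddp_volume(num_elements, world_size):
    # all-reduce = reduce-scatter + all-gather
    return 2 * collective_volume(num_elements, world_size)


def fsdp_volume(num_elements, world_size, reshard_after_forward):
    # One all-gather before forward, plus a second one before backward
    # when the full weights were discarded after the forward pass.
    all_gathers = 2 if reshard_after_forward else 1
    reduce_scatters = 1
    return (all_gathers + reduce_scatters) * collective_volume(num_elements, world_size)


n, w = 1_000_000_000, 8
print(fsdp_volume(n, w, False) / ddp_volume(n, w))  # 1.0
print(fsdp_volume(n, w, True) / ddp_volume(n, w))   # 1.5
```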

FSDP enables training 13B parameter models on 8 GPUs and 175B parameter models on 128 GPUs. When using the cpu_offload=True option, it's possible to train 1T parameter models on 256 GPUs.

General usage notes

- for best memory efficiency, wrap each layer in your network with FSDP and set reshard_after_forward=True
- for best training speed, set reshard_after_forward=False (wrapping each layer is not required, but will improve speed further)
- if you're using torch.cuda.amp.autocast for mixed precision, that's fully compatible with the FSDP wrapper; just set mixed_precision=True
- if combining with activation checkpointing, prefer FSDP(checkpoint_wrapper(module)) over checkpoint_wrapper(FSDP(module)). The latter will result in more communication and will be slower.
- this is fully compatible with pointwise optimizers, e.g., Adam, AdamW, Adadelta, Adamax, SGD, etc. However, the sharding will result in slightly different results when using non-pointwise optimizers, e.g., Adagrad, Adafactor, LAMB, etc.

How it works

In standard distributed data parallel (DDP) training every worker processes a separate batch and the gradients are summed across workers using an all-reduce operation. While DDP has become very popular, it wastes GPU memory because the model weights and optimizer states are replicated across all DDP workers.

The key insight to unlock full parameter sharding is that we can decompose the all-reduce operation in DDP into separate all-gather and reduce-scatter operations:
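As a rough illustration of this decomposition, the single-process simulation below represents each worker's gradient as a Python list and checks that a reduce-scatter followed by an all-gather gives every worker the same result as an all-reduce. The helper functions are hypothetical stand-ins for the real collectives, and the gradient length is assumed to be divisible by the number of workers.

```python
def all_reduce(grads_per_worker):
    """Every worker ends up with the element-wise sum over all workers."""
    total = [sum(values) for values in zip(*grads_per_worker)]
    return [list(total) for _ in grads_per_worker]


def reduce_scatter(grads_per_worker):
    """Worker i ends up with the summed i-th shard only."""
    world_size = len(grads_per_worker)
    shard_len = len(grads_per_worker[0]) // world_size
    return [
        [sum(g[i * shard_len + j] for g in grads_per_worker) for j in range(shard_len)]
        for i in range(world_size)
    ]


def all_gather(shards):
    """Every worker ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two workers, four gradient values each
print(reduce_scatter(grads))              # [[11, 22], [33, 44]]
print(all_gather(reduce_scatter(grads)))  # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(all_gather(reduce_scatter(grads)) == all_reduce(grads))  # True
```

Between the two steps each worker holds only its own summed shard, which is the point at which a sharded optimizer step can run.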

Then, we can rearrange the reduce-scatter + all-gather so that each DDP worker only needs to store a single shard of parameters and optimizer state. The figure below illustrates standard DDP training (left) and fully sharded training (right):

To maximize memory efficiency we can discard the full weights after each layer's forward pass, saving memory for subsequent layers. This can be implemented by applying the FSDP wrapper to every layer in your network (with reshard_after_forward=True). In pseudo-code:

FSDP forward pass:
    for layer_i in layers:
        all-gather full weights for layer_i
        forward pass for layer_i
        discard full weights for layer_i

FSDP backward pass:
    for layer_i in layers:
        all-gather full weights for layer_i
        backward pass for layer_i
        discard full weights for layer_i
        reduce-scatter gradients for layer_i
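To see why discarding the full weights after each layer helps, the sketch below walks through the forward loop and tracks how many parameter elements one worker holds at its peak. It is a simplified accounting that counts parameters only (no activations, gradients, or optimizer state) and assumes each layer size is divisible by the number of workers. The function name is hypothetical.

```python
def peak_param_elements(layer_sizes, world_size, reshard_after_forward):
    """Peak parameter elements held by one worker during the forward pass."""
    shard_sizes = [size // world_size for size in layer_sizes]
    resident = sum(shard_sizes)  # every worker always keeps its own shards
    peak = resident
    for size, shard in zip(layer_sizes, shard_sizes):
        resident += size - shard      # all-gather full weights for this layer
        peak = max(peak, resident)    # the layer's forward pass runs here
        if reshard_after_forward:
            resident -= size - shard  # discard full weights, keep the shard
    return peak


layers = [8, 8, 8, 8]
print(peak_param_elements(layers, world_size=4, reshard_after_forward=True))   # 14
print(peak_param_elements(layers, world_size=4, reshard_after_forward=False))  # 32
```

With resharding, the peak is the sharded model plus one full layer; without it, the worker ends the forward pass holding every layer in full.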
CLA Signed
opened by myleott 16
FSDP cannot consolidate optimizer state dict with flatten params is False

I'm training a large model with 2.5B parameters using the AdamW optimizer. Due to the known issue with FSDP and activation checkpointing, I'm using FSDP with flatten_parameters=False. When saving a training checkpoint, the model offers two methods, state_dict() and local_state_dict(), which save the full or the sharded model state respectively. Is it possible to save the full (not sharded) optimizer state in a single file as well?

I saw the gather_full_optim_state_dict method but there is an assertion that requires flatten_parameters=True

opened by ShenglongZ 3

clip_grad_norm_ from fairscale downcasts to bf16 before all reduce

Copied from: https://github.com/fairinternal/xlformers/issues/117

Shouldn't we remove the .to(dtype=parameters[0].dtype) from this line? https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/internal/params.py#L75 It seems weird (and it results in inaccuracies) to convert partial gradient norms to fp16/bf16 before summing them.
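To make the precision concern concrete, here is a minimal sketch that emulates bfloat16 rounding with the standard library only. The to_bf16 helper is written for this illustration (it is not a fairscale or PyTorch function) and handles only finite positive values; the per-rank norms are made-up numbers.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Illustration only: NaN, inf and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# bfloat16 keeps 8 significant bits, so values in [512, 1024) are spaced
# 4 apart and values in [1024, 2048) are spaced 8 apart.
print(to_bf16(725.3))   # 724.0
print(to_bf16(1130.0))  # 1128.0

# Hypothetical per-rank gradient norms, combined as sqrt(sum of squares).
local_norms = [361.3, 362.9, 360.4, 363.7]
exact = math.sqrt(sum(n * n for n in local_norms))
downcast = math.sqrt(sum(to_bf16(n) ** 2 for n in local_norms))
print(exact, downcast)  # the two totals differ once partial norms are rounded
```

With this spacing, any norm between 512 and 1024 that passes through bfloat16 is a multiple of 4, and any norm between 1024 and 2048 is a multiple of 8.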

Context:

We use: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L621

which calculates grad norms via: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/internal/params.py#L59

which downcasts to param dtype via: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/internal/params.py#L75

before the allreduce: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L672

Spotted from looking at how unusually even grad norms look at each training step:

"g_norm": 5.6875
"g_norm": 11.1875
"g_norm": 23.0
"g_norm": 45.25
"g_norm": 89.5
"g_norm": 176.0
"g_norm": 360.0
"g_norm": 704.0
"g_norm": 720.0
"g_norm": 724.0
"g_norm": 728.0
"g_norm": 716.0
"g_norm": 724.0
"g_norm": 728.0
"g_norm": 752.0
"g_norm": 736.0
"g_norm": 728.0
"g_norm": 728.0
"g_norm": 736.0
"g_norm": 728.0
"g_norm": 728.0
"g_norm": 724.0
"g_norm": 724.0
"g_norm": 724.0
"g_norm": 732.0
"g_norm": 764.0
"g_norm": 720.0
"g_norm": 728.0
"g_norm": 728.0
"g_norm": 740.0
"g_norm": 732.0
"g_norm": 736.0
"g_norm": 704.0
"g_norm": 700.0
"g_norm": 728.0
"g_norm": 740.0
"g_norm": 724.0
"g_norm": 752.0
"g_norm": 712.0
"g_norm": 716.0
"g_norm": 724.0
"g_norm": 744.0
"g_norm": 728.0
"g_norm": 736.0
"g_norm": 720.0
"g_norm": 716.0
"g_norm": 724.0
"g_norm": 716.0
"g_norm": 720.0
"g_norm": 712.0
"g_norm": 744.0
"g_norm": 724.0
"g_norm": 708.0
"g_norm": 708.0
"g_norm": 716.0
"g_norm": 704.0
"g_norm": 712.0
"g_norm": 724.0
"g_norm": 708.0
"g_norm": 708.0
"g_norm": 728.0
"g_norm": 720.0
"g_norm": 724.0
"g_norm": 716.0
"g_norm": 712.0
"g_norm": 704.0
"g_norm": 700.0
"g_norm": 688.0
"g_norm": 692.0
"g_norm": 696.0
"g_norm": 732.0
"g_norm": 620.0
"g_norm": 1168.0
"g_norm": 1152.0
"g_norm": 1144.0
"g_norm": 1112.0
"g_norm": 1128.0
"g_norm": 1136.0
"g_norm": 1128.0
"g_norm": 1128.0
"g_norm": 1104.0
"g_norm": 1112.0
"g_norm": 1088.0
"g_norm": 1112.0
"g_norm": 1112.0
"g_norm": 1120.0
"g_norm": 1112.0
"g_norm": 1064.0
"g_norm": 1040.0
"g_norm": 1024.0
"g_norm": 1056.0
"g_norm": 1032.0
"g_norm": 1032.0
"g_norm": 1024.0
"g_norm": 1048.0
"g_norm": 1016.0
"g_norm": 1040.0
"g_norm": 1016.0
"g_norm": 936.0
"g_norm": 828.0
"g_norm": 764.0
"g_norm": 732.0
"g_norm": 692.0
"g_norm": 676.0
"g_norm": 1376.0
"g_norm": 1360.0
"g_norm": 1328.0
"g_norm": 1360.0
"g_norm": 1360.0
"g_norm": 1312.0
"g_norm": 1328.0
"g_norm": 1264.0
"g_norm": 1304.0
"g_norm": 1280.0
"g_norm": 1296.0
"g_norm": 1224.0
"g_norm": 1256.0
"g_norm": 1264.0
"g_norm": 1224.0
"g_norm": 1152.0
"g_norm": 1160.0
"g_norm": 1184.0
"g_norm": 1184.0
"g_norm": 1144.0
"g_norm": 1128.0
"g_norm": 1112.0
"g_norm": 1080.0
"g_norm": 1072.0
"g_norm": 1048.0
"g_norm": 1040.0
"g_norm": 1040.0
"g_norm": 1072.0
"g_norm": 1032.0
"g_norm": 1024.0
"g_norm": 996.0
"g_norm": 976.0
"g_norm": 988.0
"g_norm": 976.0
"g_norm": 956.0
"g_norm": 988.0
"g_norm": 944.0
"g_norm": 924.0
"g_norm": 924.0
"g_norm": 904.0
"g_norm": 1840.0
"g_norm": 1872.0
"g_norm": 1816.0
"g_norm": 1760.0
"g_norm": 1752.0
"g_norm": 1808.0

opened by glample 3

Can't load optimizer state due to `state_steps`
Hi, I recently upgraded to PyTorch 1.12 and have had issues loading a saved optimizer state with FSDP. The issue seems to be the one addressed in the comments here - https://github.com/facebookresearch/fairscale/blob/4975b05e89aaa29923b72c23b7b0f45118e4252f/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L2436

From what I understand, Adam's step state changed into a singleton tensor and when I call gather_full_optim_state_dict() this step is converted to an int.

Sample saving dict code:

model = FSDP(model, ...)
# call on all ranks
optim_state = model.gather_full_optim_state_dict(optimizer)
if rank == 0:
    # save only on rank 0
    checkpoint = {
        'optimizer': optim_state,
        ...
    }
    # torch.save needs a destination; snapshot_name is the path loaded later
    torch.save(checkpoint, snapshot_name)

Now when I load this optim state dict back - I do the following:

model = FSDP(model, ...)
torch.distributed.barrier()
# on all ranks
checkpoint = torch.load(snapshot_name)
curr_opt_state_dict = checkpoint["optimizer"]
optim_shard_dict = model.get_shard_from_optim_state_dict(curr_opt_state_dict)
optimizer.load_state_dict(optim_shard_dict)

This always fails the assertion in the Adam code - https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L204 because I imagine the step was converted to an int within FSDP and Adam expects it to be a singleton tensor.

My question is am I saving the state dict correctly? Do I need to call optimizer.state_dict() on top of model.gather_full_optim_state_dict()?

A workaround I'm using to bypass the assertion is to convert the ints back to singleton tensors in the adamw function; however, that does not seem safe. Any thoughts?
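One alternative to editing the adamw function would be to convert the step values in the state dict itself, just before calling optimizer.load_state_dict(). The sketch below assumes the standard torch.optim layout, with a "state" mapping from parameter id to per-parameter state. restore_step_tensors is a hypothetical helper, not a fairscale or PyTorch API, and the tensor constructor is passed in so the snippet runs without PyTorch installed.

```python
from typing import Any, Callable, Dict


def restore_step_tensors(
    optim_state_dict: Dict[str, Any],
    make_tensor: Callable[[float], Any],
) -> Dict[str, Any]:
    """Replace plain-number `step` entries with whatever make_tensor returns.

    Mutates and returns optim_state_dict. Entries that are already
    tensor-like (not int/float) are left untouched.
    """
    for param_state in optim_state_dict.get("state", {}).values():
        step = param_state.get("step")
        if isinstance(step, (int, float)):
            param_state["step"] = make_tensor(float(step))
    return optim_state_dict


# Stand-in state dict; a list plays the role of a tensor here.
sd = {"state": {0: {"step": 10, "exp_avg": [0.1]}}, "param_groups": []}
restore_step_tensors(sd, make_tensor=lambda s: [s])
print(sd["state"][0]["step"])  # [10.0]

# With PyTorch available, the call might look like (assumption, untested here):
#   restore_step_tensors(optim_shard_dict, lambda s: torch.tensor(s))
```

Whether a float singleton tensor is the right target type depends on the PyTorch version and optimizer in use, so that part should be checked against the installed torch.optim code.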

Apologies if my understanding is incorrect, I followed some of the discussion here - https://github.com/facebookresearch/fairscale/issues/776 for the state_dict saving logic.
opened by rowhanm 10
[FSDP] fix for high GPU reserved memory (v2)

v2 for https://github.com/facebookresearch/fairscale/pull/972

The general idea is to try to make a guess on the FSDP module execution order and store that in two lists (one for the forward and another for the backward pass) during the first pass. Then, as opposed to letting the CPU run free and scheduling all GPU operations ahead of time (while reserving GPU memory for each), we only schedule the all-gather for the next module and wait until computations for the current module are finished before continuing. By scheduling the all-gather for the next module, we attempt to keep a good parallelism between data transfer and compute streams.
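The record-then-prefetch idea described above could be sketched as follows. This is a simplified, hypothetical stand-in (ExecOrderTracker and its methods are invented for illustration and are not the code in the PR): it records module names during the first pass, then answers which module should have its all-gather scheduled next. A second instance could be used for the backward order.

```python
from typing import List, Optional


class ExecOrderTracker:
    """Records module execution order once, then suggests the next module."""

    def __init__(self) -> None:
        self.order: List[str] = []
        self.recorded = False

    def record(self, name: str) -> None:
        # Only the first pass contributes to the guessed order.
        if not self.recorded:
            self.order.append(name)

    def finish_first_pass(self) -> None:
        self.recorded = True

    def next_to_prefetch(self, current: str) -> Optional[str]:
        """Return the module expected to run after `current`, if known."""
        if not self.recorded or current not in self.order:
            return None
        idx = self.order.index(current)
        return self.order[idx + 1] if idx + 1 < len(self.order) else None


tracker = ExecOrderTracker()
for name in ["embed", "block0", "block1", "head"]:
    tracker.record(name)  # first forward pass
tracker.finish_first_pass()

print(tracker.next_to_prefetch("block0"))  # block1
print(tracker.next_to_prefetch("head"))    # None
```

If the real execution order deviates from the recorded one, the lookup returns a wrong or missing suggestion and the module's all-gather is simply not scheduled ahead of time.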

Given the lack of a global/static execution graph in PyTorch, my understanding is that the best we can do here is a heuristic based on local information. PyTorch Distributed is also facing a similar issue and, given there is no perfect solution, arguments for/against each approach largely depend on the models you pick to measure success.

Advantages of the current approach: (1) Based on several comments in the original PR, it has been shown to significantly help different large-scale runs that are memory bound. I'm attaching some profiles for a ~370M param transformer showing the new behaviour. As long as we get the execution order right and there is not a lot of variance in how long each module takes to run, the parallelism across compute and data transfer is maintained. (2) We only wait until computations are finished for the modules where reshard_after_forward = True, with the assumption that if this is not set then memory is not a limitation and we should let the CPU continue. This should help prevent side effects on models that do not care about memory.

Disadvantages: (1) We may not always schedule the correct all-gather, which can cause run delays. One example is activation checkpointing, where the execution needs to go further back in the model and execute a forward to recompute activations. The first module in this forward is not all-gathered ahead of time. Another example is an execution order that changes across different passes. So it is theoretically possible for this PR to cause performance degradations in some scenarios. (2) We may under-schedule all-gathers, which can also cause performance degradation. This could happen on models where there is a big variation in execution time across different modules, and we end up having to wait for a long all-gather to finish after running a module that had a very short computation time.

One alternative heuristic proposed by Min was to use the amount of available memory to decide whether to wait for the current module to finish or continue execution. However, I found it hard to find a justifiable memory threshold (either relative or absolute) that would work well across a large variety of cases, especially without a comprehensive benchmark to experiment with. Given we've already seen examples of the approach in this PR working well in practice, it seems safer to go with this route and reevaluate if we find evidence to the contrary.

Original behavior: computation stream is always active, however CPU schedules everything at once.

New behavior: computation stream still mostly active, with CPU scheduling one module at a time.
CLA Signed

opened by ruanslv 7
Running stats with gradient checkpointing
According to the patch_batchnorm source code, if a layer that collects running stats (e.g. BatchNorm) is checkpointed, it accumulates statistics only when grad is enabled (on the backward pass). This introduces an inconsistency:

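import torch
from torch import nn
from fairscale.nn import checkpoint_wrapper

# Build two identically initialized models; only the second is checkpointed.
torch.manual_seed(1337)
seq = nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4))
torch.manual_seed(1337)
seq_checkpointed = checkpoint_wrapper(nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4)))
inp = torch.randn(2, 4, 16, 16)
seq(inp)
seq_checkpointed(inp)
# Compare the BatchNorm running means after one forward pass.
seq[1].running_mean == seq_checkpointed[1].running_mean
# tensor([False, False, False, False])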

I think this behaviour should be changed to accumulate statistics on the first forward pass, or at least be mentioned in the docs.
opened by vovaf709 8

Releases(v0.4.13)

v0.4.13(Dec 11, 2022)

v0.4.12(Oct 5, 2022)

v0.4.11(Sep 30, 2022)

v0.4.10(Sep 23, 2022)

v0.4.9(Sep 7, 2022)

v0.4.8(Jul 26, 2022)

v0.4.7(Jul 26, 2022)

v0.4.6(Mar 9, 2022)

v0.4.5(Jan 14, 2022)

v0.4.4(Dec 21, 2021)

v0.4.3(Nov 18, 2021)
What's Changed

[docs][fix] Update example to use offload_model by @anj-s in https://github.com/facebookresearch/fairscale/pull/806

Switch default branch from master to main by @tmarkstrum in https://github.com/facebookresearch/fairscale/pull/807

[FairScale] Remove refs to "cpu_offload" in code comments by @rohan-varma in https://github.com/facebookresearch/fairscale/pull/814

[chore] Remove deprecated THCudaCheck by @anj-s in https://github.com/facebookresearch/fairscale/pull/818

[feat] layer memory tracking by @QuentinDuval in https://github.com/facebookresearch/fairscale/pull/808

[chore] Add log for the new experimental memory tracker feature. by @anj-s in https://github.com/facebookresearch/fairscale/pull/819

[chore] Update the PyTorch version that we run CPU tests with by @anj-s in https://github.com/facebookresearch/fairscale/pull/809

[chore] Update the PyTorch version that we run benchmarks with. by @anj-s in https://github.com/facebookresearch/fairscale/pull/823

Extend auto shard capabilities to work around torch.fx edge cases. by @EugenHotaj in https://github.com/facebookresearch/fairscale/pull/817

[fix] Update golden data for account for the speed regression by @anj-s in https://github.com/facebookresearch/fairscale/pull/825

[chore] Fix main breakage temporarily by relaxing constraints by @anj-s in https://github.com/facebookresearch/fairscale/pull/828

Use correct node names for param counting in auto_shard. by @EugenHotaj in https://github.com/facebookresearch/fairscale/pull/830

[chore] Update requirements file to reflect latest config by @anj-s in https://github.com/facebookresearch/fairscale/pull/832

[fix]: Fixes an issue with pre_backward hook registering by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/833

[feature] Skip creating the CPU grad tensor when training by @anj-s in https://github.com/facebookresearch/fairscale/pull/821

[test] improve a test's coverage by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/798

[fix] Decouple move_params_to_cpu from the mixed_precision. by @anj-s in https://github.com/facebookresearch/fairscale/pull/822

[fix] fix test on main by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/835

[feature] Add the low level SSD APIs by @anj-s in https://github.com/facebookresearch/fairscale/pull/829

[feat] [FSDP]: add experimental support to shared weights by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/836

update nightly torch and test the flaky test by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/837

[chore] Fix broken main due to updated github URL requirements by @anj-s in https://github.com/facebookresearch/fairscale/pull/838

[chore] Update Sphinx version in docs requirements file by @vtantia in https://github.com/facebookresearch/fairscale/pull/841

[feat] experimental MEVO layer by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/840

[feat] Gossip/SlowMo by @blefaudeux in https://github.com/facebookresearch/fairscale/pull/378

[feature]Add support for SSD offload with FSDP for eval workloads by @anj-s in https://github.com/facebookresearch/fairscale/pull/839

[chore] 0.4.2 release by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/846

CI config changes by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/847

Setup pre-commit github action and apply pre-commit to all files by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/849

Allow sharded grad scaler to cpu offload with FSDP by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/831

Update changelog, removed meta.yml and requirements cleanup by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/853

[feature] Add a OffloadConfig object to specify offloading params to disk. by @anj-s in https://github.com/facebookresearch/fairscale/pull/855

[POC] Testing Manual dispatch by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/859

[fix] [MEVO]: make mevo work with eval and optim_state checkpointing by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/851

[chore] 0.4.3 release by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/860

New Contributors

@rohan-varma made their first contribution in https://github.com/facebookresearch/fairscale/pull/814

@EugenHotaj made their first contribution in https://github.com/facebookresearch/fairscale/pull/817

@vtantia made their first contribution in https://github.com/facebookresearch/fairscale/pull/841

Full Changelog: https://github.com/facebookresearch/fairscale/compare/v0.4.1...v0.4.3
v0.4.2(Nov 8, 2021)

v0.4.1(Sep 20, 2021)

Released version 0.4.1 for FairScale.
v0.4.0(Aug 12, 2021)

v0.3.9(Aug 12, 2021)

v0.3.8(Jul 12, 2021)

v0.3.7(May 18, 2021)

v0.3.6(May 18, 2021)

v0.3.5(May 18, 2021)

v0.3.4(Apr 13, 2021)
[0.3.4] - 2021-04-13

Added

FSDP: Add no broadcast optim state option (#560)

Fixed

ShardedDDP: Properly handle .eval() mode (#587)

ShardedDDP: Handle model being moved back to CPU prior to state consolidation (#573)

FSDP: much faster state consolidation (#595)

FSDP: Add gradient pre-divide to prevent overflow with large world sizes (#565)

Offload: (experimental) Fix activation offloading to CPU (#588)

v0.3.3(Apr 2, 2021)

v0.3.2(Apr 2, 2021)

v0.3.1(Apr 2, 2021)

v0.3.0(Feb 23, 2021)
[0.3.0] - 2021-02-22

Added

FullyShardedDataParallel (FSDP) (#413)

ShardedDDP fp16 grad reduction option (#402)

Expose experimental algorithms within the pip package (#410)

Fixed

Catch corner case when the model is too small with respect to the world size, and shards are empty (#406)

Memory leak in checkpoint_wrapper (#412)

v0.1.7(Feb 19, 2021)
Fixed

ShardedDDP and OSS handle model trainability changes during training (#369)

ShardedDDP state dict load/save bug (#386)

ShardedDDP handle train/eval modes (#393)

AdaScale handling custom scaling factors (#401)

Added

ShardedDDP manual reduce option for checkpointing (#389)

v0.1.6(Feb 11, 2021)
Added

Checkpointing model wrapper (#376)

Faster OSS, flatbuffers (#371)

Small speedup in OSS clipgradnorm (#363)

Fixed

Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)

Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)

Better pip integration / resident pytorch (#375)

v0.1.5(Feb 3, 2021)
Added

Pytorch compatibility for OSS checkpoints (#310)

Elastic checkpoints for OSS, world size can vary in between save and loads (#310)

Tensor views for OSS bucketing, reduced CPU use (#300)

Bucket calls in ShardedDDP, for faster inter node communications (#327)

FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)

AMPnet experimental support (#304)

Fixed

ShardedDDP properly handles device changes via .to() (#353)

Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)

v0.1.4(Jan 7, 2021)
Fixed

Missing cu files in the pip package

v0.1.3(Jan 5, 2021)

Same as 0.1.2, but with the correct numbering in the source code (see __init__.py)
v0.1.2(Jan 4, 2021)
Added

AdaScale: Added gradient accumulation feature (#202)

AdaScale: Added support of torch.lr_scheduler (#229)

Fixed

AdaScale: smoothing factor value fixed when using gradient accumulation (#235)

Pipe: documentation on balancing functions (#243)

ShardedDDP: handle typical NLP models

ShardedDDP: better partitioning when finetuning
