PyTorch extensions for high performance and large scale training.

Overview

FairScale Logo

Description

FairScale is a PyTorch extension library for high performance and large scale training on one or multiple machines/nodes. This library extends basic PyTorch capabilities while adding new experimental ones.

FairScale supports:

  • Parallelism:
    • Pipeline parallelism (fairscale.nn.pipe)
    • Asynchronous Pipeline parallelism (fairscale.nn.async_pipe)
    • Model Parallelism (fairscale.nn.model_parallel.layers)
    • experimental AmpNet (fairscale.experimental.nn.ampnet_pipe)
  • Sharded training:
    • Optimizer state sharding (fairscale.optim.OSS)
    • Sharded Data Parallel (SDP) (fairscale.nn.ShardedDataParallel)
    • Fully Sharded Data Parallel (FSDP) (fairscale.nn.FullyShardedDataParallel) (PyTorch >= 1.6)
  • Optimization at scale:
    • AdaScale SGD (fairscale.optim.AdaScale)
  • GPU memory optimization:
    • Activation checkpointing wrapper (fairscale.nn.misc.checkpoint_wrapper)
  • GPU speed optimization:
    • Sharded grad scaler - automatic mixed precision (fairscale.optim.grad_scaler); see the sketch after this list
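
As a quick illustration of the last two items, here is a minimal sketch (assuming a typical AMP training setup; the module shapes and the commented training loop are placeholders, not part of FairScale):

import torch
from fairscale.nn.misc import checkpoint_wrapper
from fairscale.optim.grad_scaler import ShardedGradScaler

# Recompute this block's activations during the backward pass instead of storing them.
block = checkpoint_wrapper(torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()))

# ShardedGradScaler mirrors the torch.cuda.amp.GradScaler API, for use with sharded models/optimizers.
scaler = ShardedGradScaler()
# A typical AMP step would then look like (model, optimizer, loss_fn, inputs, targets defined elsewhere):
#   with torch.cuda.amp.autocast():
#       loss = loss_fn(model(inputs), targets)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()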

Requirements

  • PyTorch >= 1.5.1

Installation

Normal installation:

pip install fairscale

Development mode:

cd fairscale
pip install -r requirements.txt
pip install -e .

If either of the above fails, add --no-build-isolation to the pip install command (this could be a problem with recent versions of pip).

Getting Started

The full documentation (https://fairscale.readthedocs.io/) contains instructions for getting started and extending fairscale.

Examples

Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

import torch

import fairscale

# a, b, c and d are arbitrary nn.Module layers, e.g.:
a, b, c, d = (torch.nn.Linear(10, 10) for _ in range(4))
model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)

Optimizer state sharding (ZeRO)

See a more complete example here, but a minimal example could look like the following:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD # pick any pytorch compliant optimizer here
    base_optimizer_arguments = {} # pass any optimizer specific arguments here, or directly below when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Assuming WORLD_SIZE and EPOCHS are defined elsewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )

AdaScale SGD

AdaScale can be used to wrap an SGD optimizer, either in DDP (Distributed Data Parallel) training or in non-DDP training with gradient accumulation. The benefit is being able to re-use the same LR schedule from a baseline batch size when the effective batch size is bigger.

Note that AdaScale does not help increase per-GPU batch size.

from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy. (However, training time might be longer compared to training without AdaScale.)

At a high level, we want ML researchers to:

  • go parallel more easily (i.e. no need to find new learning rate schedules)
  • not worry about losing accuracy
  • potentially get higher GPU efficiency (fewer steps, less networking overhead, etc.)

Testing

We use CircleCI to test on PyTorch versions 1.6.0, 1.7.1, and 1.8.0. Please create an issue if you are having trouble with installation.

Contributors

See the CONTRIBUTING file for how to help out.

License

fairscale is licensed under the BSD-3-Clause License.

fairscale.nn.pipe is forked from torchgpipe, Copyright 2019, Kakao Brain, licensed under Apache License.

fairscale.nn.model_parallel is forked from Megatron-LM, Copyright 2020, NVIDIA CORPORATION, licensed under Apache License.

fairscale.optim.adascale is forked from AdaptDL, Copyright 2020, Petuum, Inc., licensed under Apache License.

fairscale.nn.misc.flatten_params_wrapper is forked from PyTorch-Reparam-Module, Copyright 2018, Tongzhou Wang, licensed under MIT License.

References

Here is a list of all authors on relevant research papers this work is based on:

  • torchgpipe: Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim. [Paper] [Code]
  • ZeRO: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. [Paper] [Code]
  • Megatron-LM: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. [Paper][Code]
  • AdaScale SGD: Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin. [Paper]
  • GShard: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. [Paper]
  • AMPNet: Alexander L. Gaunt, Matthew A. Johnson, Maik Riechert, Daniel Tarlow, Ryota Tomioka, Dimitrios Vytiniotis, Sam Webster. [Paper]
Comments
  • [ShardedDDP] Handle transition to eval + parameter change

    πŸ› Bug

    Reported by @SeanNaren: after a successful training run, the switch to eval() is not properly taken into account, and the grads are marked as "waiting to be reduced" when they should not be (we're in eval).

    opened by blefaudeux 26
  • FSDP: issues with inferencing

    ❓ Questions and Help

    I am trying to integrate FSDP into my code and I have questions related to optimizer sharding. Does FSDP automatically shard the gradients, optimizer state, and parameters, or do I need to call OSS to shard the optimizer? Also, can anyone suggest the right way to do mixed precision with autocast and FSDP? And does validation/test happen on GPU rank 0 or on all the nodes?

    FSDP 
    opened by HITESHLPATEL 24
  • [feat] Add context manager to FSDP for easier child module wrapping

    What does this PR do?

    As discussed in https://github.com/PyTorchLightning/pytorch-lightning/pull/6152#issuecomment-785950642 this adds a context manager that assists in making child modules with similar defaults.

    import torch
    from fairscale.nn.misc import enable_wrap, wrap
    
    with enable_wrap(**handleful_of_important_params):
        layer_1 = wrap(torch.nn.Linear(5, 5))
        layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True) # Override parameters if you'd like
    
    ...
    
    # outside the enable_wrap context, wrap() is a no-op and simply returns the Linear layer
    layer_1 = wrap(torch.nn.Linear(5, 5))
    

    If not within the FSDP context, this would be a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.

    Before submitting

    • [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • [x] Did you read the contributor guideline?
    • [ ] Did you make sure to update the docs?
    • [x] Did you write any new necessary tests?

    PR review

    Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

    Did you have fun?

    Make sure you had fun coding πŸ™ƒ

    CLA Signed 
    opened by SeanNaren 24
  • [Fix][FSDP] Don't remove post backward hooks for multiple backward fix

    fixes #918

    I am quite confident that we don't need to remove backward hooks even after finalizing. They will be automatically removed when the leaf variables go out of scope and the CUDA autograd graph cleans up.

    Most tests were succeeding, apart from one related to CPU offload locally; I will debug that.

    CLA Signed 
    opened by ngoyal2707 22
  • [feat][OSS] elastic and pytorch compatible checkpoints

    Before submitting

    • [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • [x] Did you read the contributor guideline?
    • [ ] Did you make sure to update the docs?
    • [x] Did you write any new necessary tests?

    What does this PR do?

    Fixes #164 and makes the saved state PyTorch-compliant (no extra keyword). The number of ranks can change between saving and loading a checkpoint; it will automatically adapt by repartitioning at load time. Adds a new unit test which checks reproducibility (cc @joshim5).

    PR review

    Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged. @joshim5 @stas00 @SeanNaren this breaks compatibility with old checkpoints, is that a big issue? I could add some scaffolding to move old checkpoints to the new form.

    @mannatsingh I know you mentioned this a long time ago; it's finally there. Not sure how you would rate the complexity of doing this (masking the sharding when rebuilding a pytorch-compatible state dict), but it now works out of the box with this PR.

    Did you have fun?

    Make sure you had fun coding πŸ™ƒ

    CLA Signed 
    opened by blefaudeux 22
  • [RFC] FSDP arguments refactoring

    cc @myleott @sshleifer @QuentinDuval @prigoyal @anj-s @blefaudeux @msbaines @tmarkstrum @SeanNaren

    Motivations

    Currently FSDP has a long list of arguments. They deal with flattening, sharding, mixed precision, and CPU offloading. The long list makes it harder for users to see at a glance which combinations are supported and which are not, and harder for developers to reason about interactions and to unit test all combinations.

    Proposal

    I propose we separate the params into different groups, as follows:

    parameter_handling: enum of "flatten", "original"
    sharding: enum of "none", "full", "shard_after_backward"
    mixed_precision: enum of "full", "amp_w32_b32_g32", "amp_w16_b16_g16", ...
    cpu_offloading: enum of "none", "grad", "param", "grad_and_param"
    

    The combination space above is still very big but at least the related arguments are grouped and it is easier to see what's supported and what's not.

    Implementation

    We can first put the new arguments in place and deprecate the older ones. Some of the options may raise a NotImplementedError, and we will add implementation support gradually.

    I learned from Benjamin that we can use an enum class that's also a string, like:

    from enum import Enum

    class Foo(str, Enum):
        V1 = "val1"
    

    This way, we can have both yaml config and enum based argument passing.
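
    For instance, here is a minimal sketch of how such a string-backed enum accepts plain strings coming from a yaml config (Sharding and its values are illustrative names only, not the final API):

    from enum import Enum

    class Sharding(str, Enum):  # illustrative, mirroring the proposal above
        NONE = "none"
        FULL = "full"

    assert Sharding("full") is Sharding.FULL  # a raw yaml string maps directly onto the enum member
    assert Sharding.FULL == "full"            # and the member still compares equal to the plain string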

    Any comments or suggestions are most welcome from the cc'ed folks and the community. Please add people if I missed anyone.

    opened by min-xu-ai 21
  • ShardedDataParallel  doesn't work with multiple nodes

    πŸ› Bug

    ShardedDataParallel works with 8 GPUs x 1 node, but I got the following error with 8 GPUs x 2 nodes:

    AssertionError: A bucket failed to be sent, probably unused parameters.Either remove the unused parameter or de-activate ShardedDDP buckets -set reduce_buffer_size to 0-
    

    However, there are obviously no unused parameters. Note that torch.nn.DistributedDataParallel works in the same environment.

    To Reproduce

    import os
    import sys 
    
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from fairscale.optim.oss import OSS 
    from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
    
    
    def train(rank, local_rank, world_size, init_method):
        print("DDP init", world_size, rank, local_rank, init_method, file=sys.stderr)
        dist.init_process_group(backend='nccl', init_method=init_method, rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)
    
        model = torch.nn.Linear(3, 3).cuda()
        base_optimizer = torch.optim.Adam
        base_optimizer_arguments = {}
    
        optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)
        model = ShardedDDP(model, optimizer)
    
        print("Train", file=sys.stderr)
        model.train()
        model.zero_grad()
        outputs = model(torch.randn(2,3).cuda())
        loss = outputs.sum()
        loss.backward()
        optimizer.step()
        print("finish", file=sys.stderr)
    

    I used shared file system initialization for init_method.

    init_method="file:..."
    

    Environment

    Note that I tested both TCP and InfiniBand connections.

    PyTorch version: 1.7.1
    Is debug build: False
    CUDA used to build PyTorch: 11.0
    ROCM used to build PyTorch: N/A
    
    OS: CentOS Linux 7 (Core) (x86_64)
    GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
    Clang version: Could not collect
    CMake version: version 2.8.12.2
    
    Python version: 3.8 (64-bit runtime)
    Is CUDA available: True
    CUDA runtime version: Could not collect
    GPU models and configuration: 
    GPU 0: Tesla V100-SXM2-32GB
    GPU 1: Tesla V100-SXM2-32GB
    GPU 2: Tesla V100-SXM2-32GB
    GPU 3: Tesla V100-SXM2-32GB
    GPU 4: Tesla V100-SXM2-32GB
    GPU 5: Tesla V100-SXM2-32GB
    GPU 6: Tesla V100-SXM2-32GB
    GPU 7: Tesla V100-SXM2-32GB
    
    Nvidia driver version: 450.51.06
    cuDNN version: Probably one of the following:
    /usr/lib64/libcudnn.so.8.0.4
    /usr/lib64/libcudnn_adv_infer.so.8.0.4
    /usr/lib64/libcudnn_adv_train.so.8.0.4
    /usr/lib64/libcudnn_cnn_infer.so.8.0.4
    /usr/lib64/libcudnn_cnn_train.so.8.0.4
    /usr/lib64/libcudnn_ops_infer.so.8.0.4
    /usr/lib64/libcudnn_ops_train.so.8.0.4
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    
    Versions of relevant libraries:
    [pip3] numpy==1.19.2
    [pip3] pytorch-ranger==0.1.1
    [pip3] pytorch-wpe==0.0.0
    [pip3] torch==1.7.1
    [pip3] torch-complex==0.2.0
    [pip3] torch-optimizer==0.0.1a17
    [pip3] torchaudio==0.7.2
    [conda] blas                      1.0                         mkl  
    [conda] cudatoolkit               11.0.221             h6bb024c_0  
    [conda] mkl                       2020.2                      256  
    [conda] mkl-service               2.3.0            py38he904b0f_0  
    [conda] mkl_fft                   1.2.0            py38h23d657b_0  
    [conda] mkl_random                1.1.1            py38h0573a6f_0  
    [conda] numpy                     1.19.2           py38h54aff64_0  
    [conda] numpy-base                1.19.2           py38hfa32c7d_0  
    [conda] pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
    [conda] pytorch-ranger            0.1.1                    pypi_0    pypi
    [conda] pytorch-wpe               0.0.0                    pypi_0    pypi
    [conda] torch-complex             0.2.0                    pypi_0    pypi
    [conda] torch-optimizer           0.0.1a17                 pypi_0    pypi
    [conda] torchaudio                0.7.2                    pypi_0    pypi
    
    bug 
    opened by kamo-naoyuki 19
  • [feat] ShardedDataParallel with autoreduce

    Before submitting

    • [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • [x] Did you read the contributor guideline?
    • [x] Did you make sure to update the docs?
    • [x] Did you write any new necessary tests?

    What does this PR do?

    A stopgap solution until PyTorch's DDP becomes flexible enough to accommodate different reduction patterns out of the box: another DDP dedicated to the sharded optimizer, which automatically reduces gradients to the appropriate ranks and releases the grad buffers.

    Key features that this PR brings or maintains in a different form:

    • [x] automatic gradient reduction to the appropriate ranks
    • [x] overlap the gradient reduction and the backward pass
    • [x] keep the tunable bucketing of small gradients
    • [x] keep the reduce calls asynchronous (non-blocking)

    cc @mrshenli

    PR review

    Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

    Did you have fun?

    Make sure you had fun coding πŸ™ƒ

    CLA Signed 
    opened by blefaudeux 19
  • [feat] Gossip/SlowMo

    Before submitting

    • [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • [ ] Did you read the contributor guideline?
    • [ ] Did you make sure to update the docs?
    • [ ] Did you write any new necessary tests?

    What does this PR do?

    Disclaimer: I (@lefaudeux) am not the author, Vinayak (@vtantia) is. Just testing the CI and putting up a draft PR.

    TODOs:

    • [x] Write documentation
    • [x] Fix the licensing
    • [x] Make sure that the unit tests run with the global pytest runner
    • [x] Factorize the unit tests a little, cleanup/autogenerate
    • [ ] Add tutorial

    PR review

    Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

    Did you have fun?

    Make sure you had fun coding πŸ™ƒ

    CLA Signed 
    opened by blefaudeux 18
  • Fail on install - CUDA 11.1.0

    πŸ› Bug

    Hi, pip install on the environment below throws an error. I'm happy to provide more info if it would be useful. Thanks!

    fatal error: multi_tensor_apply.cuh: No such file or directory

    Environment (NVIDIA-Python Docker: 20.10)

    Python: 3.6 PyTorch: 1.7.0 CUDA: 11.1.0 cuDNN: 8.0.4

    Error

    Error:
    b'  ERROR: Command errored out with exit status 1:
       command: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpl6xin7ge
           cwd: /tmp/pip-install-0_xiegmi/fairscale
      Complete output (158 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.6
      creating build/lib.linux-x86_64-3.6/fairscale
      copying fairscale/__init__.py -> build/lib.linux-x86_64-3.6/fairscale
      creating build/lib.linux-x86_64-3.6/fairscale/optim
      copying fairscale/optim/grad_scaler.py -> build/lib.linux-x86_64-3.6/fairscale/optim
      copying fairscale/optim/oss.py -> build/lib.linux-x86_64-3.6/fairscale/optim
      copying fairscale/optim/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/optim
      copying fairscale/optim/utils.py -> build/lib.linux-x86_64-3.6/fairscale/optim
      copying fairscale/optim/adam.py -> build/lib.linux-x86_64-3.6/fairscale/optim
      creating build/lib.linux-x86_64-3.6/fairscale/nn
      copying fairscale/nn/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn
      creating build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/dependency.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/pipe.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/stream.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/batchnorm.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/pipeline.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/checkpoint.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/copy.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/phony.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/worker.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      copying fairscale/nn/pipe/microbatch.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe
      creating build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/random.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/initialize.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/layers.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/cross_entropy.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/mappings.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      copying fairscale/nn/model_parallel/utils.py -> build/lib.linux-x86_64-3.6/fairscale/nn/model_parallel
      creating build/lib.linux-x86_64-3.6/fairscale/nn/data_parallel
      copying fairscale/nn/data_parallel/sharded_ddp.py -> build/lib.linux-x86_64-3.6/fairscale/nn/data_parallel
      copying fairscale/nn/data_parallel/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/data_parallel
      creating build/lib.linux-x86_64-3.6/fairscale/nn/moe
      copying fairscale/nn/moe/top2gate.py -> build/lib.linux-x86_64-3.6/fairscale/nn/moe
      copying fairscale/nn/moe/moelayer.py -> build/lib.linux-x86_64-3.6/fairscale/nn/moe
      copying fairscale/nn/moe/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/moe
      creating build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
      copying fairscale/nn/pipe/balance/profile.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
      copying fairscale/nn/pipe/balance/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
      copying fairscale/nn/pipe/balance/blockpartition.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/balance
      creating build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      copying fairscale/nn/pipe/skip/skippable.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      copying fairscale/nn/pipe/skip/__init__.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      copying fairscale/nn/pipe/skip/tracker.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      copying fairscale/nn/pipe/skip/layout.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      copying fairscale/nn/pipe/skip/portal.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      copying fairscale/nn/pipe/skip/namespace.py -> build/lib.linux-x86_64-3.6/fairscale/nn/pipe/skip
      running egg_info
      writing fairscale.egg-info/PKG-INFO
      writing dependency_links to fairscale.egg-info/dependency_links.txt
      writing requirements to fairscale.egg-info/requires.txt
      writing top-level names to fairscale.egg-info/top_level.txt
      reading manifest file \'fairscale.egg-info/SOURCES.txt\'
      reading manifest template \'MANIFEST.in\'
      writing manifest file \'fairscale.egg-info/SOURCES.txt\'
      creating build/lib.linux-x86_64-3.6/fairscale/clib
      creating build/lib.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
      copying fairscale/clib/fused_adam_cuda/fused_adam_cuda.cpp -> build/lib.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
      copying fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu -> build/lib.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
      running build_ext
      building \'fairscale.fused_adam_cuda\' extension
      creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6
      creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale
      creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib
      creating /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda
      Emitting ninja build file /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      [1/2] /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c -c /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu -o /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options \'\'"\'"\'-fPIC\'"\'"\'\' -O3 --use_fast_math -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -std=c++14
      FAILED: /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.o
      /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c -c /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu -o /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options \'\'"\'"\'-fPIC\'"\'"\'\' -O3 --use_fast_math -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -std=c++14
      /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda_kernel.cu:12:10: fatal error: multi_tensor_apply.cuh: No such file or directory
       #include "multi_tensor_apply.cuh"
                ^~~~~~~~~~~~~~~~~~~~~~~~
      compilation terminated.
      [2/2] c++ -MMD -MF /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda.o.d -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c -c /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda.cpp -o /tmp/pip-install-0_xiegmi/fairscale/build/temp.linux-x86_64-3.6/fairscale/clib/fused_adam_cuda/fused_adam_cuda.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
      In file included from /usr/local/lib/python3.6/dist-packages/torch/include/ATen/Parallel.h:149:0,
                       from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                       from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                       from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                       from /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                       from /usr/local/lib/python3.6/dist-packages/torch/include/torch/extension.h:4,
                       from /tmp/pip-install-0_xiegmi/fairscale/fairscale/clib/fused_adam_cuda/fused_adam_cuda.cpp:1:
      /usr/local/lib/python3.6/dist-packages/torch/include/ATen/ParallelOpenMP.h:84:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
       #pragma omp parallel for if ((end - begin) >= grain_size)
      
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 1522, in _run_ninja_build
          env=env)
        File "/usr/lib/python3.6/subprocess.py", line 438, in run
          output=stdout, stderr=stderr)
      subprocess.CalledProcessError: Command \'[\'ninja\', \'-v\']\' returned non-zero exit status 1.
    
    bug 
    opened by johncookds 17
  • Add FullyShardedDataParallel (FSDP)

    Co-authored-by: @min-xu-ai and @sshleifer

    Overview

    Recent work by Microsoft and Google has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new FullyShardedDataParallel (FSDP) wrapper, which is a drop-in replacement for PyTorch's DistributedDataParallel (DDP) wrapper.

    Compared to PyTorch DDP:

    • FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
    • FSDP with reshard_after_forward=False has the same communication cost as PyTorch DDP and is similar to ZeRO-2
    • FSDP with reshard_after_forward=True increases total communication by 50% and is similar to ZeRO-3:
      • all-gather parameters at start of forward pass and start of backward pass
      • reduce-scatter grads at end of backward pass
    • in practice, FSDP is faster than PyTorch DDP because the optimizer step is sharded, and the extra communication can be overlapped with the forward pass
    • FSDP enables training 13B parameter models on 8 GPUs and 175B parameter models on 128 GPUs. When using the cpu_offload=True option, it's possible to train 1T parameter models on 256 GPUs.

    General usage notes

    • for best memory efficiency wrap each layer in your network with FSDP and set reshard_after_forward=True
    • for best training speed set reshard_after_forward=False (wrapping each layer is not required, but will improve speed further)
    • if you're using torch.cuda.amp.autocast for mixed precision, that's fully compatible with the FSDP wrapper, just set mixed_precision=True
    • if combining with activation checkpointing, prefer FSDP(checkpoint_wrapper(module)) over checkpoint_wrapper(FSDP(module)). The latter will result in more communication and will be slower (see the sketch after this list).
    • this is fully compatible with pointwise Optimizers, e.g., Adam, AdamW, Adadelta, Adamax, SGD, etc. However, the sharding will result in slightly different results when using non-pointwise Optimizers, e.g., Adagrad, Adafactor, LAMB, etc.
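
    Putting a few of these notes together, a minimal sketch (hypothetical toy layers; assumes torch.distributed is already initialized and a FairScale version with FSDP is installed):

    import torch
    from fairscale.nn import FullyShardedDataParallel as FSDP
    from fairscale.nn.misc import checkpoint_wrapper

    # Wrap each (checkpointed) layer, then wrap the whole model.
    # reshard_after_forward=True trades extra all-gathers for lower peak memory;
    # mixed_precision=True is meant to be combined with torch.cuda.amp.autocast (see the notes above).
    layer_1 = FSDP(checkpoint_wrapper(torch.nn.Linear(1024, 1024)), reshard_after_forward=True)
    layer_2 = FSDP(checkpoint_wrapper(torch.nn.Linear(1024, 1024)), reshard_after_forward=True)
    model = FSDP(torch.nn.Sequential(layer_1, layer_2), mixed_precision=True)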

    How it works

    In standard distributed data parallel (DDP) training every worker processes a separate batch and the gradients are summed across workers using an all-reduce operation. While DDP has become very popular, it wastes GPU memory because the model weights and optimizer states are replicated across all DDP workers.

    The key insight to unlock full parameter sharding is that we can decompose the all-reduce operation in DDP into separate all-gather and reduce-scatter operations:

    [Figure: the all-reduce in DDP decomposed into a reduce-scatter followed by an all-gather]
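
    As a rough sketch of that equivalence (illustrative code, not FairScale internals; assumes torch.distributed is initialized and the tensor length is divisible by the world size):

    import torch
    import torch.distributed as dist

    def all_reduce_decomposed(x: torch.Tensor) -> torch.Tensor:
        # Sum x across ranks via reduce-scatter + all-gather instead of a single all-reduce.
        world_size = dist.get_world_size()
        shards = list(x.chunk(world_size))                   # one equal-sized shard per rank
        my_shard = torch.zeros_like(shards[dist.get_rank()])
        dist.reduce_scatter(my_shard, shards)                # each rank ends up with the sum of its shard
        gathered = [torch.zeros_like(s) for s in shards]
        dist.all_gather(gathered, my_shard)                  # every rank reassembles the full reduced tensor
        return torch.cat(gathered)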

    Then, we can rearrange the reduce-scatter + all-gather so that each DDP worker only needs to store a single shard of parameters and optimizer state. The figure below illustrates standard DDP training (left) and fully sharded training (right):

    [Figure: standard DDP training (left) vs. fully sharded training (right)]

    To maximize memory efficiency we can discard the full weights after each layer's forward pass, saving memory for subsequent layers. This can be implemented by applying the FSDP wrapper to every layer in your network (with reshard_after_forward=True). In pseudo-code:

    FSDP forward pass:
        for layer_i in layers:
            all-gather full weights for layer_i
            forward pass for layer_i
            discard full weights for layer_i
    FSDP backward pass:
        for layer_i in layers:
            all-gather full weights for layer_i
            backward pass for layer_i
            discard full weights for layer_i
            reduce-scatter gradients for layer_i
    
    CLA Signed 
    opened by myleott 16
  • FSDP cannot consolidate optimizer state dict with flatten params is False

    I'm now training a large model with 2.5B parameters with the AdamW optimizer. Due to the known issue with FSDP and activation checkpointing, I'm using FSDP with flatten_parameters=False. When saving the training checkpoint, the model has two methods, state_dict() and local_state_dict(), which distinguish between saving full or sharded model states. Is it possible to save the full (not sharded) optimizer state in a single file as well?

    I saw the gather_full_optim_state_dict method, but there is an assertion that requires flatten_parameters=True.

    opened by ShenglongZ 3
  • clip_grad_norm_ from fairscale downcasts to bf16 before all reduce

    Copied from: https://github.com/fairinternal/xlformers/issues/117

    Shouldn't we remove the .to(dtype=parameters[0].dtype) from this line? https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/internal/params.py#L75 It seems weird (and it results in inaccuracies) to convert partial gradient norms to fp16/bf16 before summing them.

    Context:

    We use: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L621

    which calculates grad norms via: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/internal/params.py#L59

    which downcasts to param dtype via: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/internal/params.py#L75

    before the allreduce: https://github.com/facebookresearch/fairscale/blob/ee647b976cf4c8fdd37bc9ae3fd6331d225ba2a0/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L672
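
    A small numerical illustration of why this loses precision (illustrative code, not FairScale's implementation; the shard values are made up): bfloat16 spaces values in [512, 1024) four apart, which matches the suspiciously round g_norm values logged below.

    import torch

    shard_norms = torch.tensor([512.3, 417.8, 305.6, 123.4])  # hypothetical per-rank partial norms (fp32)
    total_fp32 = (shard_norms * shard_norms).sum().sqrt()
    bf16 = shard_norms.to(torch.bfloat16)                      # downcast before combining, as in the line above
    total_bf16 = (bf16 * bf16).sum().sqrt()
    print(total_fp32.item(), total_bf16.float().item())        # the bf16 total snaps to a coarse grid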

    Spotted from looking at how unusually even grad norms look at each training step:

    "g_norm": 5.6875
    "g_norm": 11.1875
    "g_norm": 23.0
    "g_norm": 45.25
    "g_norm": 89.5
    "g_norm": 176.0
    "g_norm": 360.0
    "g_norm": 704.0
    "g_norm": 720.0
    "g_norm": 724.0
    "g_norm": 728.0
    "g_norm": 716.0
    "g_norm": 724.0
    "g_norm": 728.0
    "g_norm": 752.0
    "g_norm": 736.0
    "g_norm": 728.0
    "g_norm": 728.0
    "g_norm": 736.0
    "g_norm": 728.0
    "g_norm": 728.0
    "g_norm": 724.0
    "g_norm": 724.0
    "g_norm": 724.0
    "g_norm": 732.0
    "g_norm": 764.0
    "g_norm": 720.0
    "g_norm": 728.0
    "g_norm": 728.0
    "g_norm": 740.0
    "g_norm": 732.0
    "g_norm": 736.0
    "g_norm": 704.0
    "g_norm": 700.0
    "g_norm": 728.0
    "g_norm": 740.0
    "g_norm": 724.0
    "g_norm": 752.0
    "g_norm": 712.0
    "g_norm": 716.0
    "g_norm": 724.0
    "g_norm": 744.0
    "g_norm": 728.0
    "g_norm": 736.0
    "g_norm": 720.0
    "g_norm": 716.0
    "g_norm": 724.0
    "g_norm": 716.0
    "g_norm": 720.0
    "g_norm": 712.0
    "g_norm": 744.0
    "g_norm": 724.0
    "g_norm": 708.0
    "g_norm": 708.0
    "g_norm": 716.0
    "g_norm": 704.0
    "g_norm": 712.0
    "g_norm": 724.0
    "g_norm": 708.0
    "g_norm": 708.0
    "g_norm": 728.0
    "g_norm": 720.0
    "g_norm": 724.0
    "g_norm": 716.0
    "g_norm": 712.0
    "g_norm": 704.0
    "g_norm": 700.0
    "g_norm": 688.0
    "g_norm": 692.0
    "g_norm": 696.0
    "g_norm": 732.0
    "g_norm": 620.0
    "g_norm": 1168.0
    "g_norm": 1152.0
    "g_norm": 1144.0
    "g_norm": 1112.0
    "g_norm": 1128.0
    "g_norm": 1136.0
    "g_norm": 1128.0
    "g_norm": 1128.0
    "g_norm": 1104.0
    "g_norm": 1112.0
    "g_norm": 1088.0
    "g_norm": 1112.0
    "g_norm": 1112.0
    "g_norm": 1120.0
    "g_norm": 1112.0
    "g_norm": 1064.0
    "g_norm": 1040.0
    "g_norm": 1024.0
    "g_norm": 1056.0
    "g_norm": 1032.0
    "g_norm": 1032.0
    "g_norm": 1024.0
    "g_norm": 1048.0
    "g_norm": 1016.0
    "g_norm": 1040.0
    "g_norm": 1016.0
    "g_norm": 936.0
    "g_norm": 828.0
    "g_norm": 764.0
    "g_norm": 732.0
    "g_norm": 692.0
    "g_norm": 676.0
    "g_norm": 1376.0
    "g_norm": 1360.0
    "g_norm": 1328.0
    "g_norm": 1360.0
    "g_norm": 1360.0
    "g_norm": 1312.0
    "g_norm": 1328.0
    "g_norm": 1264.0
    "g_norm": 1304.0
    "g_norm": 1280.0
    "g_norm": 1296.0
    "g_norm": 1224.0
    "g_norm": 1256.0
    "g_norm": 1264.0
    "g_norm": 1224.0
    "g_norm": 1152.0
    "g_norm": 1160.0
    "g_norm": 1184.0
    "g_norm": 1184.0
    "g_norm": 1144.0
    "g_norm": 1128.0
    "g_norm": 1112.0
    "g_norm": 1080.0
    "g_norm": 1072.0
    "g_norm": 1048.0
    "g_norm": 1040.0
    "g_norm": 1040.0
    "g_norm": 1072.0
    "g_norm": 1032.0
    "g_norm": 1024.0
    "g_norm": 996.0
    "g_norm": 976.0
    "g_norm": 988.0
    "g_norm": 976.0
    "g_norm": 956.0
    "g_norm": 988.0
    "g_norm": 944.0
    "g_norm": 924.0
    "g_norm": 924.0
    "g_norm": 904.0
    "g_norm": 1840.0
    "g_norm": 1872.0
    "g_norm": 1816.0
    "g_norm": 1760.0
    "g_norm": 1752.0
    "g_norm": 1808.0
    
    opened by glample 3
  • Can't load optimizer state due to `state_steps`

    Hi, I recently upgraded to PyTorch 1.12 and have had issues with loading a saved optimizer state using FSDP here and the issue seems something that is addressed in comments here - https://github.com/facebookresearch/fairscale/blob/4975b05e89aaa29923b72c23b7b0f45118e4252f/fairscale/nn/data_parallel/fully_sharded_data_parallel.py#L2436

    From what I understand, Adam's step state changed into a singleton tensor and when I call gather_full_optim_state_dict() this step is converted to an int.

    Sample saving dict code:

    model = FSDP(model, ...)
    # call on all ranks
    optim_state = model.gather_full_optim_state_dict(optimizer)
    if rank == 0:
        # save only on rank 0
        checkpoint = {
            'optimizer': optim_state,
            ...
        }
        torch.save(checkpoint)
    

    Now when I load this optim state dict back - I do the following:

    model = FSDP(model, ...)
    torch.distributed.barrier()
    # on all ranks
    checkpoint = torch.load(snapshot_name)
    curr_opt_state_dict = checkpoint["optimizer"]
    optim_shard_dict = model.get_shard_from_optim_state_dict(curr_opt_state_dict)
    optimizer.load_state_dict(optim_shard_dict)
    

    This always fails the assertion in the Adam code - https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L204 because I imagine the step was converted to an int within FSDP and Adam expects it to be a singleton tensor.

    My question is: am I saving the state dict correctly? Do I need to call optimizer.state_dict() on top of model.gather_full_optim_state_dict()?

    A workaround I'm using to bypass the assertion is to convert the ints back to singleton tensors in the adamw function; however, that does not seem safe. Any thoughts?
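
    A hedged alternative sketch (user-side only, not a FairScale API; assumes the standard torch optimizer state-dict layout): convert integer step entries back to singleton tensors on the sharded state dict before calling load_state_dict, instead of patching adamw.

    import torch

    def tensorize_steps(optim_state_dict):
        # Torch >= 1.12 Adam/AdamW expect `step` to be a singleton tensor rather than a plain int.
        for param_state in optim_state_dict["state"].values():
            step = param_state.get("step")
            if isinstance(step, (int, float)):
                param_state["step"] = torch.tensor(float(step))
        return optim_state_dict

    optim_shard_dict = tensorize_steps(optim_shard_dict)
    optimizer.load_state_dict(optim_shard_dict)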

    Apologies if my understanding is incorrect, I followed some of the discussion here - https://github.com/facebookresearch/fairscale/issues/776 for the state_dict saving logic.

    opened by rowhanm 10
  • [FSDP] fix for high GPU reserved memory (v2)

    v2 for https://github.com/facebookresearch/fairscale/pull/972

    The general idea is to try to make a guess on the FSDP module execution order and store that in two lists (one for the forward and another for the backward pass) during the first pass. Then, as opposed to letting the CPU run free and scheduling all GPU operations ahead of time (while reserving GPU memory for each), we only schedule the all-gather for the next module and wait until computations for the current module are finished before continuing. By scheduling the all-gather for the next module, we attempt to keep a good parallelism between data transfer and compute streams.

    Given the lack of a global/static execution graph in PyTorch, my understanding is that the best we can do here is a heuristic based on local information. PyTorch Distributed is also facing a similar issue and, given there is no perfect solution, arguments for/against each approach largely depend on the models you pick to measure success.

    Advantages of the current approach: (1) Based on several comments in the original PR, it was shown to significantly help different large-scale runs that are memory bound. I'm attaching some profiles for a ~370M param transformer showing the new behaviour. As long as we get the execution order right and there is not a lot of variance in how long each module takes to run, the parallelism across compute and data transfer is maintained. (2) We only wait until computations are finished for the modules where reshard_after_forward = True, with the assumption that if this is not set then memory is not a limitation and we should let the CPU continue. This should help prevent side effects on models that do not care about memory.

    Disadvantages: (1) We may not always schedule the correct all-gather, which can cause run delays. One example is with activation checkpointing, where the execution needs to go further back in the model and execute a forward to recompute activations. The first module in this forward is not all-gathered ahead of time. Another example is if your execution order changes across different passes. So it is theoretically possible for this PR to cause performance degradations in some scenarios. (2) We may under-schedule all-gathers, which can also cause performance degradation. This could happen on models where there is a big variation on execution time across different modules, and we end up having to wait for a long all-gather to finish after running a module that had very short computation time.

    One alternative heuristic proposed by Min was to use the amount of available memory to make a call on whether to wait for the current module to finish or continue execution. However, I found it hard to find a justifiable memory threshold (either relative or absolute) that would work well across a large variety of cases, especially with a lack of a comprehensive benchmark to experiment with. Given we've already seen examples of the approach in this PR working well in practice, it seems safer to just go with this route instead and reevaluate if we find evidence of the contrary.

    Original behavior: the computation stream is always active, but the CPU schedules everything at once. [profiler screenshot]

    New behavior: the computation stream is still mostly active, with the CPU scheduling one module at a time. [profiler screenshot]

    CLA Signed 
    opened by ruanslv 7
  • Running stats with gradient checkpointing

    According to the patch_batchnorm source code, if a layer that collects running stats (e.g. BatchNorm) is checkpointed, it will accumulate statistics only when grad is enabled (i.e. on the backward pass). This causes an inconsistency:

    import torch
    import torch.nn as nn
    from fairscale.nn.misc import checkpoint_wrapper

    torch.manual_seed(1337)
    seq = nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4))
    torch.manual_seed(1337)
    seq_checkpointed = checkpoint_wrapper(nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4)))

    inp = torch.randn(2, 4, 16, 16)

    seq(inp)
    seq_checkpointed(inp)

    seq[1].running_mean == seq_checkpointed[1].running_mean
    # -> tensor([False, False, False, False])
    

    I think this behaviour should be modified to accumulate statistics on the first forward pass, or at least be mentioned in the docs.

    opened by vovaf709 8
Releases
  • v0.4.13(Dec 11, 2022)

  • v0.4.11(Sep 30, 2022)

  • v0.4.10(Sep 23, 2022)

  • v0.4.3(Nov 18, 2021)

    What's Changed

    • [docs][fix] Update example to use offload_model by @anj-s in https://github.com/facebookresearch/fairscale/pull/806
    • Switch default branch from master to main by @tmarkstrum in https://github.com/facebookresearch/fairscale/pull/807
    • [FairScale] Remove refs to "cpu_offload" in code comments by @rohan-varma in https://github.com/facebookresearch/fairscale/pull/814
    • [chore] Remove deprecated THCudaCheck by @anj-s in https://github.com/facebookresearch/fairscale/pull/818
    • [feat] layer memory tracking by @QuentinDuval in https://github.com/facebookresearch/fairscale/pull/808
    • [chore] Add log for the new experimental memory tracker feature. by @anj-s in https://github.com/facebookresearch/fairscale/pull/819
    • [chore] Update the PyTorch version that we run CPU tests with by @anj-s in https://github.com/facebookresearch/fairscale/pull/809
    • [chore] Update the PyTorch version that we run benchmarks with. by @anj-s in https://github.com/facebookresearch/fairscale/pull/823
    • Extend auto shard capabilities to work around torch.fx edge cases. by @EugenHotaj in https://github.com/facebookresearch/fairscale/pull/817
    • [fix] Update golden data for account for the speed regression by @anj-s in https://github.com/facebookresearch/fairscale/pull/825
    • [chore] Fix main breakage temporarily by relaxing constraints by @anj-s in https://github.com/facebookresearch/fairscale/pull/828
    • Use correct node names for param counting in auto_shard. by @EugenHotaj in https://github.com/facebookresearch/fairscale/pull/830
    • [chore] Update requirements file to reflect latest config by @anj-s in https://github.com/facebookresearch/fairscale/pull/832
    • [fix]: Fixes an issue with pre_backward hook registering by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/833
    • [feature] Skip creating the CPU grad tensor when training by @anj-s in https://github.com/facebookresearch/fairscale/pull/821
    • [test] improve a test's coverage by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/798
    • [fix] Decouple move_params_to_cpu from the mixed_precision. by @anj-s in https://github.com/facebookresearch/fairscale/pull/822
    • [fix] fix test on main by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/835
    • [feature] Add the low level SSD APIs by @anj-s in https://github.com/facebookresearch/fairscale/pull/829
    • [feat] [FSDP]: add experimental support to shared weights by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/836
    • update nightly torch and test the flaky test by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/837
    • [chore] Fix broken main due to updated github URL requirements by @anj-s in https://github.com/facebookresearch/fairscale/pull/838
    • [chore] Update Sphinx version in docs requirements file by @vtantia in https://github.com/facebookresearch/fairscale/pull/841
    • [feat] experimental MEVO layer by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/840
    • [feat] Gossip/SlowMo by @blefaudeux in https://github.com/facebookresearch/fairscale/pull/378
    • [feature]Add support for SSD offload with FSDP for eval workloads by @anj-s in https://github.com/facebookresearch/fairscale/pull/839
    • [chore] 0.4.2 release by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/846
    • CI config changes by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/847
    • Setup pre-commit github action and apply pre-commit to all files by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/849
    • Allow sharded grad scaler to cpu offload with FSDP by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/831
    • Update changelog, removed meta.yml and requirements cleanup by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/853
    • [feature] Add a OffloadConfig object to specify offloading params to disk. by @anj-s in https://github.com/facebookresearch/fairscale/pull/855
    • [POC] Testing Manual dispatch by @anupambhatnagar in https://github.com/facebookresearch/fairscale/pull/859
    • [fix] [MEVO]: make mevo work with eval and optim_state checkpointing by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/851
    • [chore] 0.4.3 release by @min-xu-ai in https://github.com/facebookresearch/fairscale/pull/860

    New Contributors

    • @rohan-varma made their first contribution in https://github.com/facebookresearch/fairscale/pull/814
    • @EugenHotaj made their first contribution in https://github.com/facebookresearch/fairscale/pull/817
    • @vtantia made their first contribution in https://github.com/facebookresearch/fairscale/pull/841

    Full Changelog: https://github.com/facebookresearch/fairscale/compare/v0.4.1...v0.4.3

  • v0.4.1(Sep 20, 2021)

  • v0.3.4(Apr 13, 2021)

    [0.3.4] - 2021-04-13

    Added

    • FSDP: Add no broadcast optim state option (#560)

    Fixed

    • ShardedDDP: Properly handle .eval() mode (#587)
    • ShardedDDP: Handle model being moved back to CPU prior to state consolidation (#573)
    • FSDP: much faster state consolidation (#595)
    • FSDP: Add gradient pre-divide to prevent overflow with large world sizes (#565)
    • Offload: (experimental) Fix activation offloading to CPU (#588)
  • v0.3.0(Feb 23, 2021)

    [0.3.0] - 2021-02-22

    Added

    • FullyShardedDataParallel (FSDP) (#413)
    • ShardedDDP fp16 grad reduction option (#402)
    • Expose experimental algorithms within the pip package (#410)

    Fixed

    • Catch corner case when the model is too small with respect to the world size, and shards are empty (#406)
    • Memory leak in checkpoint_wrapper (#412)
  • v0.1.7(Feb 19, 2021)

    Fixed

    • ShardedDDP and OSS handle model trainability changes during training (#369)
    • ShardedDDP state dict load/save bug (#386)
    • ShardedDDP handle train/eval modes (#393)
    • AdaScale handling custom scaling factors (#401)

    Added

    • ShardedDDP manual reduce option for checkpointing (#389)
  • v0.1.6(Feb 11, 2021)

    Added

    • Checkpointing model wrapper (#376)
    • Faster OSS, flatbuffers (#371)
    • Small speedup in OSS clipgradnorm (#363)

    Fixed

    • Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
    • Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
    • Better pip integration / resident pytorch (#375)
  • v0.1.5(Feb 3, 2021)

    Added

    • Pytorch compatibility for OSS checkpoints (#310)
    • Elastic checkpoints for OSS, world size can vary in between save and loads (#310)
    • Tensor views for OSS bucketing, reduced CPU use (#300)
    • Bucket calls in ShardedDDP, for faster inter node communications (#327)
    • FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
    • AMPnet experimental support (#304)

    Fixed

    • ShardedDDP properly handles device changes via .to() (#353)
    • Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)
  • v0.1.4(Jan 7, 2021)

  • v0.1.3(Jan 5, 2021)

  • v0.1.2(Jan 4, 2021)

    Added

    • AdaScale: Added gradient accumulation feature (#202)
    • AdaScale: Added support of torch.lr_scheduler (#229)

    Fixed

    • AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
    • Pipe: documentation on balancing functions (#243)
    • ShardedDDP: handle typical NLP models
    • ShardedDDP: better partitioning when finetuning