DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

Microsoft

Last update: Dec 30, 2022

Overview

10x Larger Models

10x Faster Training

Minimal Code Change

DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:

Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x larger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models.
Extremely long sequence length: DeepSpeed's sparse attention supports input sequences an order of magnitude longer and runs up to 6x faster than dense transformers.
Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam reduces communication volume by up to 5x while achieving convergence efficiency similar to Adam, which allows scaling to different types of GPU clusters and networks.

Early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category.

DeepSpeed is an important part of Microsoft’s new AI at Scale initiative, which aims to enable next-generation AI capabilities at scale.

For further documentation, tutorials, and technical deep-dives please see deepspeed.ai!

News

[2021/03/16] 1-bit Adam v2: NCCL-based implementation and more
[2021/03/08] ZeRO-3 Offload: Scale your models to trillion parameters without code changes while leveraging both CPUs & GPUs
[2020/11/12] Simplified install, JIT compiled ops, PyPI releases, and reduced dependencies
[2020/11/10] Efficient and robust compressed training through progressive layer dropping
[2020/09/10] DeepSpeed v0.3: Extreme-scale model training for everyone
[2020/08/07] DeepSpeed Microsoft Research Webinar is now available on-demand

Section	Description
Why DeepSpeed?	DeepSpeed overview
Install	Installation details
Features	Feature list and overview
Further Reading	Documentation, tutorials, etc.
Contributing	Instructions for contributing
Publications	Publications related to DeepSpeed
Videos	Videos related to DeepSpeed

Why DeepSpeed?

Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Even then, they may not achieve the desired system performance and convergence rate. Large models are even more challenging: a large model easily runs out of memory with pure data parallelism, and model parallelism is difficult to use. DeepSpeed addresses these challenges to accelerate model development and training.

Installation

The quickest way to get started with DeepSpeed is via pip, which installs the latest release of DeepSpeed; this release is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

Note: PyTorch must be installed before installing DeepSpeed.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

ds_report

If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling) or install pre-compiled ops via PyPI please see our advanced installation instructions.
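The PyTorch prerequisite noted above can also be checked from Python before or after running pip. The following is a minimal sketch using only the standard library; the helper names `is_installed` and `install_status` are illustrative and not part of DeepSpeed.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level `package` can be found in this environment."""
    return importlib.util.find_spec(package) is not None


def install_status() -> dict:
    """Report whether PyTorch and DeepSpeed are importable (hypothetical helper)."""
    return {name: is_installed(name) for name in ("torch", "deepspeed")}


if __name__ == "__main__":
    status = install_status()
    if not status["torch"]:
        print("PyTorch not found: install it before running `pip install deepspeed`.")
    elif not status["deepspeed"]:
        print("PyTorch found; DeepSpeed is not installed yet.")
    else:
        print("PyTorch and DeepSpeed are importable; run `ds_report` for op details.")
```

This only confirms that the packages are discoverable; `ds_report` remains the tool for checking which extensions/ops are compatible with the machine.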

Features

Below is a brief feature list; see our detailed feature overview for descriptions and usage.

Distributed Training with Mixed Precision
- 16-bit mixed precision
- Single-GPU/Multi-GPU/Multi-Node
Model Parallelism
- Support for Custom Model Parallelism
- Integration with Megatron-LM
Pipeline Parallelism
- 3D Parallelism
The Zero Redundancy Optimizer (ZeRO)
- Optimizer State and Gradient Partitioning
- Activation Partitioning
- Constant Buffer Optimization
- Contiguous Memory Optimization
ZeRO-Offload
- Leverage both CPU/GPU memory for model training
- Support 10B model training on a single GPU
Ultra-fast dense transformer kernels
Sparse attention
- Memory- and compute-efficient sparse kernels
- Support 10x longer sequences than dense
- Flexible support to different sparse structures
1-bit Adam
- Custom communication collective
- Up to 5x communication volume saving
Additional Memory and Bandwidth Optimizations
- Smart Gradient Accumulation
- Communication/Computation Overlap
Training Features
- Simplified training API
- Gradient Clipping
- Automatic loss scaling with mixed precision
Training Optimizers
- Fused Adam optimizer and arbitrary torch.optim.Optimizer
- Memory bandwidth optimized FP16 Optimizer
- Large Batch Training with LAMB Optimizer
- Memory efficient Training with ZeRO Optimizer
- CPU-Adam
Training Agnostic Checkpointing
Advanced Parameter Search
- Learning Rate Range Test
- 1Cycle Learning Rate Schedule
Simplified Data Loader
Performance Analysis and Debugging
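
Several of the features above (mixed precision, gradient clipping, ZeRO, ZeRO-Offload) are switched on through the DeepSpeed JSON configuration. The sketch below builds such a configuration as a Python dict. The key names follow the DeepSpeed configuration documentation as best understood here and should be verified against the version in use; the values, the GPU count, and the helper `effective_batch_size` are illustrative assumptions.

```python
import json


def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Global batch size implied by the per-GPU micro batch, accumulation steps, and GPU count."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


NUM_GPUS = 8  # assumption: one node with 8 GPUs

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    # Kept consistent with the two values above and the GPU count.
    "train_batch_size": effective_batch_size(4, 2, NUM_GPUS),
    "gradient_clipping": 1.0,
    # loss_scale 0 selects dynamic loss scaling, starting at 2**initial_scale_power.
    "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 16},
    # ZeRO stage 2 with optimizer state offloaded to CPU memory (ZeRO-Offload).
    "zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}},
}

if __name__ == "__main__":
    print(json.dumps(ds_config, indent=2))
```

The printed JSON can be saved to a file and passed to a training script as its DeepSpeed config; see the DeepSpeed JSON Configuration article for the authoritative list of options.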

Further Reading

Article	Description
DeepSpeed Features	DeepSpeed features
Getting Started	First steps with DeepSpeed
DeepSpeed JSON Configuration	Configuring DeepSpeed
API Documentation	Generated DeepSpeed API documentation
CIFAR-10 Tutorial	Getting started with CIFAR-10 and DeepSpeed
Megatron-LM Tutorial	Train GPT2 with DeepSpeed and Megatron-LM
BERT Pre-training Tutorial	Pre-train BERT with DeepSpeed
Learning Rate Range Test Tutorial	Faster training with large learning rates
1Cycle Tutorial	SOTA learning schedule in DeepSpeed

Contributing

DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Publications

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840.
Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888.

Videos

DeepSpeed KDD 2020 Tutorial
1. Overview
2. ZeRO + large model training
3. 17B T-NLG demo
4. Fastest BERT training + RScan tuning
5. DeepSpeed hands on deep dive: part 1, part 2, part 3
6. FAQ
Microsoft Research Webinar
- Registration is free and all videos are available on-demand.
- ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed.
DeepSpeed on AzureML

Comments

[BUG] some docs have broken formatting

Describe the bug

The API arguments docs aren't formatted, e.g. these ones (but there are probably more of those):

https://deepspeed.readthedocs.io/en/latest/training.html#model-saving https://deepspeed.readthedocs.io/en/latest/training.html#gradient-accumulation https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html (most of the page)

e.g. have a look at: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#loading-training-checkpoints

here all the args are piled into one para, instead of being itemized and nothing is formatted:

Load training checkpoint :param load_dir: Required. Directory to load the checkpoint from :param tag: Checkpoint tag used as a unique identifier for checkpoint, if not provided will attempt to load tag in ‘latest’ file :param load_module_strict: Optional. Boolean to strictly enforce that the keys in state_dict of module and checkpoint match. :param load_optimizer_states: Optional. Boolean to load the training optimizer states from Checkpoint. Ex. ADAM’s momentum and variance :param load_lr_scheduler_states: Optional. Boolean to add the learning rate scheduler states from Checkpoint. :param load_module_only: Optional. Boolean to load only the model weights from the checkpoint. Ex. warmstarting. :param custom_load_fn: Optional. Custom model load function.

bug training

opened by stas00 0
[GatheredParameters] fix memory leak

Currently, on exit from GatheredParameters with modifier_rank=None, memory is leaked because the gathered param remains gathered (the leak persists until the param is gathered again, most likely in the first forward).

This PR fixes this problem by re-partitioning the param on exit from the GatheredParameters context.

A new test is supplied that reproduces this scenario and which fails prior to this PR.

@tjruwase

opened by stas00 0
[GatheredParameters] add support for any iterable
This PR extends GatheredParameters to support any iterable of parameters.

Currently there is an issue if someone does:

with deepspeed.zero.GatheredParameters(model.parameters(), ...):

it gets silently skipped and no gathering happens.

I raised this issue here: https://github.com/microsoft/DeepSpeed/issues/2658. It can be a huge problem if this happens during model weights init, which 99% of the time will silently do nothing on 0-length vectors, leaving the user none the wiser that their training is going to break because of it. This is a very important issue. Please kindly give it extra attention. I ran into it myself and have had users report the same issue.

So this PR at least makes the most obvious mistake no longer a mistake, as intuitively model.parameters() should just work and not require the user to remember to do list(model.parameters()), especially since there is no assert if it's not done this way.

I modified one of the tests to ensure this case is tested; the list case is an obvious sub-case of the generator one, but I can fork the test and do each explicitly if you prefer that.

I changed the API doc to match the new reality. The tutorials/docs don't seem to discuss GatheredParameters's args, so there is nothing to change there.
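
To illustrate why a generator such as model.parameters() can be silently skipped, here is a minimal, self-contained sketch. It does not reproduce DeepSpeed's code: `FakeParam`, `naive_as_param_list`, and `as_param_list` are hypothetical names, and the "list-only" check is an assumption about how such a bug could arise.

```python
from collections.abc import Iterable


class FakeParam:
    """Stand-in for a model parameter (hypothetical; not a DeepSpeed or PyTorch class)."""

    def __init__(self, name: str):
        self.name = name


def naive_as_param_list(params):
    """Assumed buggy variant: anything that is not a list is wrapped as ONE item."""
    return params if isinstance(params, list) else [params]


def as_param_list(params):
    """Accept a single parameter or any iterable of parameters and return a list."""
    if isinstance(params, FakeParam):
        return [params]
    if isinstance(params, Iterable):
        return list(params)  # consumes a generator exactly once
    raise TypeError(f"expected a parameter or an iterable of parameters, got {type(params)!r}")


def make_generator():
    """Mimics model.parameters(), which yields parameters lazily."""
    return (FakeParam(f"w{i}") for i in range(3))


if __name__ == "__main__":
    wrapped = naive_as_param_list(make_generator())
    # The single item is the generator itself, so no real parameter is seen.
    print(len(wrapped), isinstance(wrapped[0], FakeParam))
    print([p.name for p in as_param_list(make_generator())])
```

With the naive variant, a later check such as "is any item a partitioned parameter?" finds none and the context does nothing, which matches the silent skip described above.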

@tjruwase
opened by stas00 0
[fp16] lower `initial_scale_power` to `16`

I'm proposing to change the default initial_scale_power to 16 from the current 32. Here is why:

From wikipedia

The minimum strictly positive (subnormal) value is 2**−24 ≈ 5.96 × 10**−8. The minimum positive normal value is 2**−14 ≈ 6.10 × 10**−5. The maximum representable value is (2 − 2**−10) × 2**15 = 65504.

So I guess if the loss were to be 2**−24 then the maximum possible loss scale could be 2**40 (24+16) before it overflows, so 2**32 mathematically passes as a legit loss scale, except it's fantastically improbable.

But practically, have you ever seen loss<1? And loss>1 then takes us to initial_scale_power=16 as the practical starting point, which is likely to lead to just a few skipped optimizer steps.

(This might sound like a pointless change: who cares about a few skipped steps, which are likely to be totally insignificant when training for thousands of steps? But these things affect situations like writing tests or debugging a training run that fails from the very beginning.)

I hope this is not a backward compatibility breaking change, as someone not specifying initial_scale_power explicitly and relying on the default 32 will now have a slightly different outcome as they will start training sooner (less skipping). If it is, then we should leave the default 32 but change the doc to use 16 and add a note to why.

So please kindly discuss among your colleagues whether the proposed change is a good idea as is, or whether it'd be safer to not change the default, but the docs only. Thank you.
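
The arithmetic behind this proposal can be checked with a short sketch. It follows the simplification used in the reasoning above: overflow is judged from the scaled loss value alone, the loss stays constant, and the scale is halved after every overflowing step. A real dynamic loss scaler inspects gradients and may behave differently, so the function names and the step counts are illustrative only.

```python
FP16_MAX = (2 - 2**-10) * 2**15  # largest finite fp16 value, 65504.0


def largest_safe_scale_power(loss: float) -> int:
    """Largest k such that loss * 2**k still fits in fp16 (assumes 0 < loss <= FP16_MAX)."""
    power = 0
    while loss * 2 ** (power + 1) <= FP16_MAX:
        power += 1
    return power


def skipped_steps(initial_scale_power: int, loss: float) -> int:
    """Count overflowing steps when the scale is halved after each overflow."""
    skipped = 0
    power = initial_scale_power
    while loss * 2**power > FP16_MAX:
        power -= 1
        skipped += 1
    return skipped


if __name__ == "__main__":
    print(largest_safe_scale_power(1.0))      # loss of 1
    print(largest_safe_scale_power(2**-24))   # smallest positive fp16 value
    print(skipped_steps(32, 1.0), skipped_steps(16, 1.0))
```

Under these assumptions a loss of 1 tolerates a scale of at most 2**15, so starting at 2**32 skips 17 steps while starting at 2**16 skips one.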

@tjruwase

opened by stas00 0

Fix INT8-quantization for BLOOM, OPT, and Neo-X

This PR addresses https://github.com/microsoft/DeepSpeed/issues/2616 and https://github.com/microsoft/DeepSpeed/issues/2379

Also, this adds support for INT8 inference of the different model architectures, quantizing from the HF checkpoint directly. Here is an example using the DeepSpeedExamples inference test-suite, running facebook/opt-30b on a single 32GB NVIDIA V100 card:

deepspeed --num_nodes 1 --num_gpus 1 inference-test.py --ds_inference --use_kernel --name facebook/opt-30b --use_meta_tensor --checkpoint_path ~/.cache/huggingface/hub/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/ --dtype int8

producing the following text:

------------------------------------------------------
Free memory : 0.238525 (GigaBytes)  
Total memory: 31.748535 (GigaBytes)  
Requested memory: 0.140137 (GigaBytes) 
Setting maximum total tokens (input + output) to 82 
------------------------------------------------------
generation time is 10.450812101364136 sec

in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for large-scale, complex data

DeepSpeed is a machine learning framework specifically designed to solve some of the most complex and large-scale problems. The goal of DeepSpeed is to provide a rich infrastructure on top of which researchers can build highly
------------------------------------------------------------
[2023-01-04 11:23:05,806] [INFO] [launch.py:350:main] Process 33466 exits successfully.

Note that memory is very tight here; however, we can still generate 50 tokens from the input text!

opened by RezaYazdaniAminabadi 0

[Bug Fixed] torch.cuda.is_available -> torch.cuda.is_available()
Hi there, torch.cuda.is_available is a function rather than a bool value, so it has to be called. I fixed the bug.

>>> torch.cuda.is_available
<function is_available at 0x7f779bdac3a0>
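
The reason this bug goes unnoticed is that a function object is always truthy in Python. The sketch below shows the effect without requiring PyTorch; `is_available` here is a hypothetical stand-in for torch.cuda.is_available on a machine without CUDA, and the two `pick_device_*` helpers are illustrative.

```python
def is_available() -> bool:
    """Hypothetical stand-in for torch.cuda.is_available on a machine without CUDA."""
    return False


def pick_device_buggy() -> str:
    # Bug: the function object itself is tested, and function objects are always truthy.
    return "cuda" if is_available else "cpu"


def pick_device_fixed() -> str:
    # Fix: call the function and test its boolean result.
    return "cuda" if is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device_buggy())
    print(pick_device_fixed())
```

The buggy variant selects "cuda" even though the check would return False, which is the behavior the fix removes.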
opened by wkcn 0

Releases(v0.7.7)

v0.7.7(Dec 12, 2022)
What's Changed

Update the locator for Megatron-LM by @rapsealk in https://github.com/microsoft/DeepSpeed/pull/2564

use get_global_rank if available by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2567

Add Determined to open-source DL frameworks by @sirredbeard in https://github.com/microsoft/DeepSpeed/pull/2573

Support fp32 gradaccum for bf16 model by @delock in https://github.com/microsoft/DeepSpeed/pull/2566

Drop Maxwell Support by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2574

Fix quantized-inference & Add generic support of checkpoint loading by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2547

Fix MegatronLayerPolicy to have megatron_v2=True by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2579

Update barrier and reduce_scatter_base to conform to PyTorch signatures by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2570

Support N-dimension input in quantization kernel by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2575

Add checkpoint sharding unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2561

Updating docs README by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2587

Updating API docs by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2586

Fix issues w. python 3.6 + add py-version checks to CI by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2589

[benchmarks] get mask token from tokenizer by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2592

New Contributors

@rapsealk made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2564

@sirredbeard made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2573

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.6...v0.7.7
v0.7.6(Dec 1, 2022)
What's Changed

DeepSpeed inference config. (#2459) by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2472

Update docs to autogenerate pydantic config model docs by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2509

Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2508

Deepspeed quantization library v0.1 by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2450

Fix backward compatibility for InferenceConfig by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2516

Add missing Inference sub-configs by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2518

Add note about nvcc/hipcc requirement by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2519

Update codeowners by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2525

Dequantization Utils Library by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2521

Fixes for torch 1.14 due to new torch.numel return type by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2522

Ensure MOE is initialized for SD by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2534

Make DS-Inference config readable from JSON by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2537

Add MII tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2533

Remove mutable default parameter in init_inference() by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2540

Change Where DS/Triton is Used in Stable Diffusion by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2536

Pass down the new DS inference config to replace_transformer_layer. by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2539

Adding Gradient Accumulation Data Type Config by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2512

Report progress at gradient accumulation boundary by @ShijieZZZZ in https://github.com/microsoft/DeepSpeed/pull/2553

encoded ds config into command line argument when launching child processes in autotuning by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2524

Add missing MoE fields to inference config for backward compatibility by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2556

Abstract accelerator (step 1) by @delock in https://github.com/microsoft/DeepSpeed/pull/2504

Fix invalid check of recorded parameter orders in zero stage3. by @inkcherry in https://github.com/microsoft/DeepSpeed/pull/2550

New Contributors

@ShijieZZZZ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2553

@delock made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2504

@inkcherry made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2550

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.5...v0.7.6
v0.7.5(Nov 14, 2022)
What's Changed

Fix Bug #2319 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2438

update pytorch pool operator function signiture by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2443

Fix build issues on Windows by @eltonzheng in https://github.com/microsoft/DeepSpeed/pull/2428

rollback ds config changes by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2395

Use CUDA events for inference model profiling by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2371

Fixing a config mismatch in unit test. by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2447

Reduction Kernel Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2436

deepspeed/launcher/launch.py: add option enable_each_rank_log by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2409

Fixes for various CI problems by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2457

Cache Allocation and Softmax Fixes by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2433

Fix checkpoint loading at inference-engine by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2429

Create a new folder structure to isolate model-specific code in DS by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2464

don't gather partitioned activations for mp size 1 by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2454

Updating autotune json default in docs. by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2476

Added MLFLOW environment variables for logging metrics within trainig… by @savitamittal1 in https://github.com/microsoft/DeepSpeed/pull/2477

fix accelerate link in README by @kyoto7250 in https://github.com/microsoft/DeepSpeed/pull/2481

Fix Stable-Diffusion: Add correct memory-allocation at DeepSpeed-Attention by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2474

Fix CI issues related to cupy install by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2483

Add scale_attn_by_inverse_layer_idx feature by @hyunwoongko in https://github.com/microsoft/DeepSpeed/pull/2486

Stable Diffusion Enhancements by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2491

stage_1_and_2.py: no allreduce needed when mp size is 1 by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2494

Make bf16_optimizer work for non pipeline parallelism by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2470

Fix nightly CI tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2493

Make data contiguous before the inplace reshape-copy_ function. by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2489

Fix typos: deepseed -> deepspeed by @jinyouzhi in https://github.com/microsoft/DeepSpeed/pull/2499

New Contributors

@guoyejun made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2409

@savitamittal1 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2477

@kyoto7250 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2481

@lokoppakmsft made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2489

@jinyouzhi made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2499

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.4...v0.7.5
v0.7.4(Oct 21, 2022)
What's Changed

MOE residual matmult unit test by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2323

MOE matmult with memaccess by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2336

Refactor residual add kernels by @arashb in https://github.com/microsoft/DeepSpeed/pull/2333

mem access for quantize kernel by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2331

increase min pre-commit versions by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2346

Extend scratch buffer for long prompts by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2212

[docs] fix zero docs by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2350

Staging profile inference v1 (#2348) by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2349

Kernel Data Conversion Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2327

Add Onebit Optimizers in init by @l4d2boomer in https://github.com/microsoft/DeepSpeed/pull/2340

docs(mixture-of-experts-inference): fix typo in tuto by @jqueguiner in https://github.com/microsoft/DeepSpeed/pull/2345

Use blob storage for datasets in unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2342

Refactor gptj_residual_add kernels for better readability by @arashb in https://github.com/microsoft/DeepSpeed/pull/2358

Updated issue templates by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2363

fix cuda invalid config error in dequant kernel by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2362

Add missing pytest fixture scope by @arashb in https://github.com/microsoft/DeepSpeed/pull/2353

Extend residual_add kernel tests to cover pre_attn_norm by @arashb in https://github.com/microsoft/DeepSpeed/pull/2354

Refactor fused_bias_residual kernels for better readability by @arashb in https://github.com/microsoft/DeepSpeed/pull/2356

Capture error message during sweep tests by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2351

Fix an exception when auto-casting dicts to fp16 by @mjksmith in https://github.com/microsoft/DeepSpeed/pull/2370

Refactor remaining distributed tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2216

Fix the MLP output tensor's shape by @arashb in https://github.com/microsoft/DeepSpeed/pull/2380

add 11.8 to cuda_minor_mismatch_ok to allow building with current CUDA by @Thomas-MMJ in https://github.com/microsoft/DeepSpeed/pull/2390

Pin Transformers test version by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2402

Change type to tuple in replace_wo_policy isinstance check by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2387

Checkpoint backwards-compatbility workaround by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2384

Add Predicated Global Load to Memory Access Utils by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2373

MII blog post by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2418

Fix figure reference by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2419

Add SLURM Multinode Runner by @dashstander in https://github.com/microsoft/DeepSpeed/pull/2404

Fix issue with corrupted output on long generation for GPT by @andrewchernyh in https://github.com/microsoft/DeepSpeed/pull/2359

Fix GPT Neo-X multi-gpu inference by @andrewchernyh in https://github.com/microsoft/DeepSpeed/pull/2401

CI fixes related to triton by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2422

[docs] update mii blog title by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2423

add SD injection policy by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2381

Fix checkpoint loading when it is a dictionary by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2425

Make error regex more generic in collect_results.py by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2415

fixes #2389 by @clumsy in https://github.com/microsoft/DeepSpeed/pull/2411

Fix for inference gpt-j test by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2430

Fixing bug 2361 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2410

Universal checkpoint for zero stage 1 by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2284

only add deps if extra is explicitly called by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2432

Add TestInjectionPolicy inference unittest class for testing custom injection policies by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2426

[memory estimators] new config args sync by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2431

parallelize writing of layer checkpoint files across data parallel instances by @adammoody in https://github.com/microsoft/DeepSpeed/pull/1419

Fix broken link to DeepSpeed Megatron fork by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2440

New Contributors

@l4d2boomer made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2340

@jqueguiner made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2345

@mjksmith made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2370

@Thomas-MMJ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2390

@lekurile made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2387

@dashstander made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2404

@andrewchernyh made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2359

@clumsy made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2411

@jomayeri made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2410

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.3...v0.7.4
v0.7.3(Sep 19, 2022)
What's Changed

Add blob storage to CI runners by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2260

Update replace_module.py, test-gptj.py related fix by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2269

Fix OrderedDict import for python3.6 by @Dipet in https://github.com/microsoft/DeepSpeed/pull/2267

Ds inference/fix mp2 by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2270

Trajepl: nebula load fix by @trajepl in https://github.com/microsoft/DeepSpeed/pull/2182

Prevent torch ext folder mkdir at tmp by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2274

Ds-inference Int8 support through ZeroQuant technology by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2217

add a new unit test for cuda ops by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2278

Addition to code owners file by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2279

Memory Access Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2276

Fp32 accuracy bug fix by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2285

Refactor universal checkpointing and tensor fragments by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2253

[ds-inference] fix progress bar by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2286

Offload all gradients to nvme by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2282

fused bias relu unittest by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2297

Fix for pytest picking up wrong deepspeed by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2299

Fix for Zero3 when MP>1 by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2289

Unit test for bias add kernel by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2298

Update relu.cu with mem_access_utils by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2306

Add tensor parallel inference unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2232

Fix the residual add mp scaling for GPTNeoX by @arashb in https://github.com/microsoft/DeepSpeed/pull/2310

Add unit tests for residual_add kernel by @arashb in https://github.com/microsoft/DeepSpeed/pull/2307

add inference eval scripts by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2303

Upgrade P40 tests to torch 1.8 by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2316

ZeRO-Inference blog by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2271

ZeRO-Inference blog - wrap up by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2321

ZeRO-Inference blog - Update README by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2322

Refactor relu bias add with mem_access utils by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2317

add quant unit test by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2315

only override forward if using cuda-graph by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2291

Add more options to inference benchmark by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2325

New Contributors

@molly-smith made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2269

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.2...v0.7.3
v0.7.2(Aug 25, 2022)
What's Changed

Enable contiguous gradients with Z1+MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2250

Correctly detect CPU optimizer usage by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2257

Update Half Precision Kernel Compatibility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2261

fix #2240: wrong time unit in flops_profiler by @yzs981130 in https://github.com/microsoft/DeepSpeed/pull/2241

New Contributors

@cmikeh2 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2261

@yzs981130 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2241

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.1...v0.7.2
v0.7.1(Aug 23, 2022)
What's Changed

Fix for distributed tests on pytorch>=1.12 by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2141

delay torch import for inference compatibility check by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2167

Fix wrong unit of latency in flops-profiler (#2090) by @zionwu in https://github.com/microsoft/DeepSpeed/pull/2095

[docs] adoption updates by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2173

Update for AMD CI workflow by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2172

[docs] update offload docs to include stage 1 by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2178

Fixing model partitioning without injection by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2179

Match compute and reduce dtype by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2145

Enable fused_lamb_cuda_kernel on ROCm by @rraminen in https://github.com/microsoft/DeepSpeed/pull/2148

Update README to latest Composer version by @hanlint in https://github.com/microsoft/DeepSpeed/pull/2177

[deepspeed/autotuner] Missing hjson import by @rahilbathwal5 in https://github.com/microsoft/DeepSpeed/pull/2175

[docs] add more models to adoption by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2189

[CI] fix lightning tests by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2190

Fix typos on README.md by @gasparitiago in https://github.com/microsoft/DeepSpeed/pull/2192

Fix the layer-past for GPT based models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2196

Add gradient_average flag support for sparse grads by @Dipet in https://github.com/microsoft/DeepSpeed/pull/2188

Adding the compression tutorial on GPT distillation and quantization by @minjiaz in https://github.com/microsoft/DeepSpeed/pull/2197

Log user config exactly by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2201

Fix the tensor-slicing copy for qkv parameters by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2198

Refactor Distributed Tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2180

fix table syntax by @kamalkraj in https://github.com/microsoft/DeepSpeed/pull/2204

Correctly detect offload configuration by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2208

add cuda 11.7 by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2211

use torch 1.9 in accelerate tests by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2215

[zero-3] print warning once and support torch parameter by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2127

Add support of OPT models by @arashb in https://github.com/microsoft/DeepSpeed/pull/2205

fix typos in readme. by @zhjohnchan in https://github.com/microsoft/DeepSpeed/pull/2218

Fix regression w. dist_init_required by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2225

add doc for new bert example by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2224

Remove the random-generator from context during inference by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2228

allow saving ckpt w/o ckpt json + bloom copy fix by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2237

Correctly detect zero_offload by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2213

[docs] update community videos by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2249

Refactor dist tests: Checkpointing by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2202

Make OPT policy backward compatible with pre-OPT transformers versions by @arashb in https://github.com/microsoft/DeepSpeed/pull/2254

fix ds-inference without policy by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2247

New Contributors

@zionwu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2095

@hanlint made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2177

@rahilbathwal5 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2175

@gasparitiago made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2192

@arashb made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2205

@zhjohnchan made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2218

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.0...v0.7.1
v0.7.0(Aug 1, 2022)
New features

DeepSpeed Compression: https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/

What's Changed

Adding DeepSpeed Compression Composer by @yaozhewei in https://github.com/microsoft/DeepSpeed/pull/2105

Remove hardcoded ROCm install path by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2093

Fix softmax dim of Residual MoE implementation in moe/layer.py by @hero007feng in https://github.com/microsoft/DeepSpeed/pull/2110

reduce ds-inference log verbosity by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2111

DeepSpeed Compression announcement by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2114

Checkpoint reshaping by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1953

Fix init_process_group by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2121

DS Benchmarks QoL Improvements by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2120

[ROCm] Wrong command broke ROCm build. by @jpvillam-amd in https://github.com/microsoft/DeepSpeed/pull/2118

DeepSpeed Communication Profiling and Logging by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2012

Add flake8 to pre-commit checks by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2051

Fix conflict between Tutel and top-2 gate in MoE layer by @yetiansh in https://github.com/microsoft/DeepSpeed/pull/2053

adding HF Accelerate+DS tests workflow by @pacman100 in https://github.com/microsoft/DeepSpeed/pull/2134

[inference tests] turn off time check for now by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2142

Allow turning off loss scaling wrt GAS + update tput calculator by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2140

Refactor ZeRO configs to use Pydantic by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2004

Add purely-local sliding window sparse attention config by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1962

Trajepl/nebula ckpt engine by @trajepl in https://github.com/microsoft/DeepSpeed/pull/2085

Graceful exit on failures for multi-node runs by @jerrymannil in https://github.com/microsoft/DeepSpeed/pull/2008

fix: fix BF16_Optimizer compatibility issue by @shjwudp in https://github.com/microsoft/DeepSpeed/pull/2152

Fix random token-generation issue + MP-checkpoint loading/saving by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2132

Added retain_graph as a kwarg to the main engine backward function by @ncilfone in https://github.com/microsoft/DeepSpeed/pull/1149

Elastic Training support in DeepSpeed by @aj-prime in https://github.com/microsoft/DeepSpeed/pull/2156

prevent cuda 10 builds of inference kernels on ampere by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2157

[zero-3] shutdown zero.Init from within ds.init by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2150

enable fp16 input autocasting by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2158

Release swap buffers for persisted params by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2089

Tensor parallelism for Mixture of Experts by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2074

New Contributors

@hero007feng made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2110

@jpvillam-amd made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2118

@yetiansh made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2053

@pacman100 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2134

@jimwu6 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2144

@trajepl made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2085

@ncilfone made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1149

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.7...v0.7.0
v0.6.7(Jul 19, 2022)
What's Changed

Add Inference support for running the BigScience-BLOOM Architecture by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2083

[ds-inference] checkpoint loading => tqdm by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2107

Don't overwrite hook handles in flop profiler by @Sanger2000 in https://github.com/microsoft/DeepSpeed/pull/2106

Support HuggingFace NeoX injection policy by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2087

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.6...v0.6.7
v0.6.6(Jul 18, 2022)
What's Changed

[docs] add 530b paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1979

small fix for the HF Bert models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1984

Add unit test for various model families and inference tasks by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1981

Fix for lightning tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1988

fix typo when getting kernel dim in conv calculation by @cli99 in https://github.com/microsoft/DeepSpeed/pull/1989

Add torch-latest and torch-nightly CI workflows by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1990

[bug] Add user-defined launcher args for MPI launcher by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1933

Propagate max errorcode to deepspeed when using PDSH launcher by @jerrymannil in https://github.com/microsoft/DeepSpeed/pull/1994

[docs] add new build badges to landing page by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1998

DeepSpeed Comm. Backend v1 by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/1985

Relax DeepSpeed MoE ZeRO-1 Assertion by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2007

update CODEOWNERS by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2017

[CI] force upgrade HF dependencies & output py env by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2015

[inference] test suite for ds-kernels (bert, roberta, gpt2, gpt-neo, gpt-j) by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1992

DeepSpeed examples refresh by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2021

Fix transformer API for training-evaluation pipeline by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2018

DataLoader Length Fix by @Sanger2000 in https://github.com/microsoft/DeepSpeed/pull/1718

DeepSpeed Monitor Module (Master) by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2013

Use partition numel by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2011

fix import errors by @KMFODA in https://github.com/microsoft/DeepSpeed/pull/2026

Fix inference unit test import error catching by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2024

Retain available params until last use by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2016

Split parameter offload from z3 by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2009

Fix flops profiler print statements by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2038

Add compression papers by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2042

Fix the half-precision version of CPU-Adam by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2032

Fix for AMD unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2047

Wrong partition_id while copying fp32_params -> fp16 params in Z2 for MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2058

Fix missing import in replace_module.py by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2050

Comms Benchmarks by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2040

add ds inference paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2072

Comments for better understanding of zero stage1_2 by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/2027

[docs] fix broken read-the-docs build by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2075

Fix building package without a GPU by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2049

Fix partition id in the fp32->fp16 param copying step for z2+cpu-offload by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2059

Codeowner addendum and fix to small model debugging script by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2076

remove require grad in params count by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2065

Add missing newline for ZeroOneAdam parameter table by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/2088

fixed "None type has no len()" by @xiazeyu in https://github.com/microsoft/DeepSpeed/pull/2091

Improving memory utilization of Z2+MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2079

New Contributors

@jerrymannil made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1994

@Sanger2000 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1718

@KMFODA made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2026

@siddharth9820 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2058

@samadejacobs made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2076

@xiazeyu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2091

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.5...v0.6.6
v0.6.5(May 25, 2022)
What's Changed

GatheredParameters - accept a tuple of params by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1941

Update partition_parameters.py by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/1943

fix step in adam by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1823

[pipe] prevent deadlock with multiple evals sequence by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1944

Fairseq support by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1915

DeepSpeed needs to start cleaning up by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1947

trivial fix by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/1954

Enabling CUDA-graph for the bert-type models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1952

Add loss scale guard to avoid inf loop by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1958

[launcher] add option to bypass ssh check by @liamcli in https://github.com/microsoft/DeepSpeed/pull/1957

Bump nokogiri from 1.13.4 to 1.13.6 in /docs by @dependabot in https://github.com/microsoft/DeepSpeed/pull/1965

Fix typo in timer.py by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1964

[docs] fix dependabot version issue by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1966

Don't add curand on rocm by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1968

Add Unidirectional Sparse Attention Type to BigBird and BSLongformer by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1959

Fix: Sparse tensors not updating by @Dipet in https://github.com/microsoft/DeepSpeed/pull/1914

Fixing several bugs in the inference-api and the kernels by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1951

New Contributors

@Quentin-Anthony made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1958

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.4...v0.6.5
v0.6.4(May 6, 2022)
What's Changed

[fix] Windows installs cannot import fcntl by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1921

[build] explicitly add op_builder to manifest by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1920

Enable DeepSpeed inference on ROCm by @rraminen in https://github.com/microsoft/DeepSpeed/pull/1922

bf16 inference by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1917

spell err by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/1929

[ZeRO-3] Rename confusing log message by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1932

[bug] Fix time log error in PipelineEngine by @Codle in https://github.com/microsoft/DeepSpeed/pull/1934

Improve z3 trace management by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1916

New Contributors

@kisseternity made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1929

@Codle made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1934

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.3...v0.6.4
v0.6.3(Apr 27, 2022)
What's Changed

Fix setup.py crash when torch is not installed. by @PaperclipBadger in https://github.com/microsoft/DeepSpeed/pull/1866

Add support for AWS SageMaker. by @matherit in https://github.com/microsoft/DeepSpeed/pull/1868

Fix broken links by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1873

[docs] add amd blog to website by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1874

[docs] add moe paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1875

Supporting multiple modules injection with a single policy when they … by @samyam in https://github.com/microsoft/DeepSpeed/pull/1869

[docs] fix dead links by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1877

add now required -lcurand to solve undefined symbol: curandCreateGenerator by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1879

Bug fix for flops profilers output by @VisionTheta in https://github.com/microsoft/DeepSpeed/pull/1885

Bump nokogiri from 1.13.3 to 1.13.4 in /docs by @dependabot in https://github.com/microsoft/DeepSpeed/pull/1889

[docs] fix commonmarker security issue by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1892

bf16+pipeline parallelism by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1801

fix file ordering by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1822

Use f-strings where possible by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/1900

[partition_parameters.py] better diagnostics by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1887

comm backend: cast bool when not supported by torch2cupy by @conglongli in https://github.com/microsoft/DeepSpeed/pull/1894

Use cuda events to improve timing for multi-stream execution by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1881

Fix multiple zero 3 tracing errors by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1901

Improve ds_report output for HIP/ROCm by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1906

Fix launcher for reading env vars by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1907

Fix OOM and type mismatch by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1884

New Contributors

@PaperclipBadger made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1866

@matherit made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1868

@VisionTheta made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1885

@szhengac made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1822

Misc

v0.6.2 was skipped due to a build/deploy issue with that release

Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.1...v0.6.3
v0.6.1(Mar 28, 2022)

v0.6.0(Mar 7, 2022)
DeepSpeed v0.6.0

Release notes

New features

Advancing MoE inference and training to power next-generation AI scale

MoE inference

PR-MoE model support

AMD support (#1430)

Various ZeRO Stage3 Optimizations + Improvements (#1453)

Special thanks to our contributors in this release

@stas00, @jithunnair-amd, @rraminen, @jeffdaily, @okakarpa, @jfc4050, @raamjad, @aphedges, @SeanNaren, @liamcli, @andriyor, @manuelciosici
v0.5.10(Jan 19, 2022)

v0.5.9(Jan 4, 2022)

v0.5.8(Dec 1, 2021)

v0.5.7(Dec 1, 2021)

v0.5.6(Nov 11, 2021)

v0.5.5(Nov 5, 2021)

v0.5.4(Oct 6, 2021)

v0.5.3(Sep 18, 2021)

v0.5.2(Sep 14, 2021)

v0.5.1(Aug 26, 2021)

v0.5.0(Aug 17, 2021)
Mixture of Experts (MoE) support

Curriculum learning

v0.4.5(Aug 10, 2021)

v0.4.4(Jul 30, 2021)

v0.4.3(Jul 13, 2021)

v0.4.2(Jul 1, 2021)
