Overview

03/2021: DeepSpeed is hiring! Come join us: SDE 2, Sr. SDE, Sr. Researcher

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

10x Larger Models

10x Faster Training

Minimal Code Change

DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:

  • Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
  • Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models.
  • Extremely long sequence length: DeepSpeed's sparse attention powers input sequences an order of magnitude longer and obtains up to 6x faster execution compared with dense transformers.
  • Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam and 1-bit LAMB reduce communication volume by up to 5x while achieving convergence efficiency similar to Adam and LAMB, allowing scaling to different types of GPU clusters and networks (a sample 1-bit Adam configuration follows this list).
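
As an illustration of the last point, 1-bit Adam is selected through the optimizer section of the DeepSpeed JSON config. A minimal sketch follows; the lr and freeze_step values are placeholders to be tuned per workload, and freeze_step is the number of warm-up steps run with dense Adam before compression kicks in:

    {
      "optimizer": {
        "type": "OneBitAdam",
        "params": {
          "lr": 0.0001,
          "freeze_step": 1000,
          "cuda_aware": false
        }
      }
    }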

Early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category.

DeepSpeed is an important part of Microsoft’s new AI at Scale initiative to enable next-generation AI capabilities at scale; more information is available here.

For further documentation, tutorials, and technical deep-dives please see deepspeed.ai!

Table of Contents

Section Description
Why DeepSpeed? DeepSpeed overview
Install Installation details
Features Feature list and overview
Further Reading Documentation, tutorials, etc.
Contributing Instructions for contributing
Publications Publications related to DeepSpeed
Videos Videos related to DeepSpeed

Why DeepSpeed?

Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Even then, scientists may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory with pure data parallelism, and it is difficult to use model parallelism. DeepSpeed addresses these challenges to accelerate model development and training.
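
To make the "Minimal Code Change" point concrete, here is a rough sketch of the core workflow. The model, config values, and random data are placeholders; the script assumes a CUDA device and is meant to be launched with the deepspeed launcher:

    import torch
    import deepspeed

    # A placeholder model; any torch.nn.Module works the same way.
    model = torch.nn.Linear(10, 10)

    # deepspeed.initialize wraps the model in an engine that handles
    # distributed training, mixed precision, and ZeRO under the hood.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config={
            "train_batch_size": 8,
            "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
            "fp16": {"enabled": True},
        },
    )

    for _ in range(10):
        x = torch.randn(8, 10, device=engine.device, dtype=torch.half)
        loss = engine(x).float().pow(2).mean()  # engine forwards to the model
        engine.backward(loss)  # replaces loss.backward()
        engine.step()          # replaces optimizer.step()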

Installation

The quickest way to get started with DeepSpeed is via pip. This will install the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

Note: PyTorch must be installed before installing DeepSpeed.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

ds_report

If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling them) or install pre-compiled ops via PyPI, please see our advanced installation instructions.
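
For example, pre-installation is driven by DS_BUILD_* environment variables at pip-install time; typical invocations (flag names per the advanced installation docs) look like:

    DS_BUILD_OPS=1 pip install deepspeed       # pre-build all compatible ops
    DS_BUILD_CPU_ADAM=1 pip install deepspeed  # or pre-build a single op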

Features

Below we provide a brief feature list; see our detailed feature overview for descriptions and usage.

Further Reading

All DeepSpeed documentation can be found on our website: deepspeed.ai

Article Description
DeepSpeed Features DeepSpeed features
Getting Started First steps with DeepSpeed
DeepSpeed JSON Configuration Configuring DeepSpeed
API Documentation Generated DeepSpeed API documentation
CIFAR-10 Tutorial Getting started with CIFAR-10 and DeepSpeed
Megatron-LM Tutorial Train GPT2 with DeepSpeed and Megatron-LM
BERT Pre-training Tutorial Pre-train BERT with DeepSpeed
Learning Rate Range Test Tutorial Faster training with large learning rates
1Cycle Tutorial SOTA learning schedule in DeepSpeed

Contributing

DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Publications

  1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
  2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
  3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
  4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840.
  5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888.
  6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857.
  7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069.

Videos

  1. DeepSpeed KDD 2020 Tutorial
    1. Overview
    2. ZeRO + large model training
    3. 17B T-NLG demo
    4. Fastest BERT training + RScan tuning
    5. DeepSpeed hands on deep dive: part 1, part 2, part 3
    6. FAQ
  2. Microsoft Research Webinar
  3. DeepSpeed on AzureML
  4. Community Tutorials
Comments
  • [REQUEST] how to Wrap normalization layers like LayerNorm in FP32 when use zero (fp16 or bf16)?

    enhancement 
    opened by xiaohu2015 0
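
    In other words, the request is to keep normalization layers in fp32 while the rest of the model runs in fp16/bf16. A minimal sketch of the general idea in plain PyTorch (this is not a DeepSpeed API, and it ignores how ZeRO partitions parameters):

    import torch.nn as nn

    def normalization_to_fp32(model: nn.Module) -> nn.Module:
        # After casting a model to half precision, cast only the
        # normalization layers back to fp32 for numerical stability.
        for module in model.modules():
            if isinstance(module, (nn.LayerNorm, nn.GroupNorm)):
                module.float()
        return model

    model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8)).half()
    normalization_to_fp32(model)  # Linear stays fp16; LayerNorm is fp32 again
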
  • [BUG] some docs have broken formatting

    Describe the bug

    The API argument docs aren't formatted, e.g. these (but there are probably more):

    https://deepspeed.readthedocs.io/en/latest/training.html#model-saving
    https://deepspeed.readthedocs.io/en/latest/training.html#gradient-accumulation
    https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html (most of the page)

    For example, have a look at https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#loading-training-checkpoints

    Here all the args are piled into one paragraph instead of being itemized, and nothing is formatted:

    Load training checkpoint :param load_dir: Required. Directory to load the checkpoint from :param tag: Checkpoint tag used as a unique identifier for checkpoint, if not provided will attempt to load tag in ‘latest’ file :param load_module_strict: Optional. Boolean to strictly enforce that the keys in state_dict of module and checkpoint match. :param load_optimizer_states: Optional. Boolean to load the training optimizer states from Checkpoint. Ex. ADAM’s momentum and variance :param load_lr_scheduler_states: Optional. Boolean to add the learning rate scheduler states from Checkpoint. :param load_module_only: Optional. Boolean to load only the model weights from the checkpoint. Ex. warmstarting. :param custom_load_fn: Optional. Custom model load function.

    bug training 
    opened by stas00 0
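
    For reference, the arguments in the garbled block above map onto the engine checkpoint API roughly as follows (a sketch; engine is assumed to come from deepspeed.initialize, and the directory and tag are placeholders):

    # Save, then restore, a training checkpoint.
    engine.save_checkpoint("checkpoints", tag="step1000")

    load_path, client_state = engine.load_checkpoint(
        "checkpoints",
        tag="step1000",              # omit to fall back to the 'latest' file
        load_module_strict=True,     # module and checkpoint keys must match
        load_optimizer_states=True,  # e.g. Adam's momentum and variance
        load_lr_scheduler_states=True,
        load_module_only=False,      # True loads only weights (warm start)
    )
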
  • [GatheredParameters] fix memory leak

    Currently, on exit from GatheredParameters with modifier_rank=None, memory is leaked: the gathered param remains gathered (the leak persists until the param is gathered again, most likely at the first forward).

    This PR fixes this problem by re-partitioning the param on exit from the GatheredParameters context.

    A new test is supplied that reproduces this scenario and which fails prior to this PR.

    @tjruwase

    opened by stas00 0
  • [GatheredParameters] add support for any iterable

    This PR extends GatheredParameters to support any iterable of parameters.

    Currently there is an issue if someone does:

    with deepspeed.zero.GatheredParameters(model.parameters(), ...):
    

    it gets silently skipped and no gathering happens.

    I raised this issue here https://github.com/microsoft/DeepSpeed/issues/2658, as it can be a huge problem if it happens during model weight init, which 99% of the time will silently do nothing on 0-length vectors, leaving the user none the wiser that their training is going to break because of it. This is a very important issue; please kindly give it extra attention. I ran into it myself and had users report the same issue.

    So this PR at least makes the most obvious mistake no longer a mistake: intuitively, model.parameters() should just work, without requiring the user to remember to write list(model.parameters()), since there is no assert to catch the omission.

    I modified one of the tests to ensure this case is tested; the list case is an obvious sub-case of the generator, but I can fork the test and cover each case explicitly if you prefer.

    I changed the API doc to match the new reality. The tutorials/docs don't seem to discuss GatheredParameters's args, so there is nothing to change there.

    @tjruwase

    opened by stas00 0
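
    Taken together with the previous PR, the intended pattern looks roughly like this (a sketch; model is assumed to be a ZeRO stage-3 partitioned model running under the deepspeed launcher):

    import torch
    import deepspeed

    # With this PR, the generator model.parameters() can be passed directly;
    # previously it had to be materialized, e.g. via list(model.parameters()).
    with deepspeed.zero.GatheredParameters(model.parameters(), modifier_rank=0):
        # Parameters are temporarily gathered inside the context; rank 0 may
        # modify them, and they are re-partitioned on exit.
        if torch.distributed.get_rank() == 0:
            for p in model.parameters():
                p.data.normal_(mean=0.0, std=0.02)  # example in-place init
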
  • [fp16] lower `initial_scale_power` to `16`

    I'm proposing to change the default initial_scale_power to 16 from the current 32. Here is why:

    From Wikipedia:

    The minimum strictly positive (subnormal) value is 2**-24 ≈ 5.96 × 10**-8. The minimum positive normal value is 2**-14 ≈ 6.10 × 10**-5. The maximum representable value is (2 - 2**-10) × 2**15 = 65504.

    So if the loss were as small as 2**-24, the maximum possible loss scale before overflow would be 2**40 (24+16); thus 2**32 mathematically passes as a legitimate loss scale, except that such a tiny loss is fantastically improbable.

    But practically, have you ever seen a loss smaller than 1? And a loss greater than 1 takes us to initial_scale_power=16 as the practical starting point, which is likely to lead to just a few skipped optimizer steps.

    (While this might sound like a pointless change, since a few skipped steps are likely to be insignificant when training for thousands of steps, these things matter in situations like writing tests or debugging a training run that fails from the very beginning.)

    I hope this is not a backward-compatibility-breaking change: someone not specifying initial_scale_power explicitly and relying on the default 32 will now see a slightly different outcome, as they will start training sooner (less skipping). If it is, we should leave the default at 32 but change the doc to use 16 and add a note explaining why.

    So please kindly discuss among your colleagues whether the proposed change is a good idea as is, or whether it'd be safer to not change the default, but the docs only. Thank you.

    @tjruwase

    opened by stas00 0
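
    For reference, the setting under discussion lives in the fp16 section of the DeepSpeed JSON config; with the proposal, a minimal config would look like:

    {
      "fp16": {
        "enabled": true,
        "initial_scale_power": 16
      }
    }
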
  • Fix INT8-quantization for BLOOM, OPT, and Neo-X

    This PR addresses https://github.com/microsoft/DeepSpeed/issues/2616 and https://github.com/microsoft/DeepSpeed/issues/2379

    Also, this adds support for INT8 inference of different model architectures, quantizing from the HF checkpoint directly. Here is an example using the DeepSpeedExamples inference test-suite, running facebook/opt-30b on a single 32GB NVIDIA V100 card:

    deepspeed --num_nodes 1 --num_gpus 1 inference-test.py --ds_inference --use_kernel --name facebook/opt-30b --use_meta_tensor --checkpoint_path ~/.cache/huggingface/hub/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/ --dtype int8
    

    producing the following text:

    ------------------------------------------------------
    Free memory : 0.238525 (GigaBytes)  
    Total memory: 31.748535 (GigaBytes)  
    Requested memory: 0.140137 (GigaBytes) 
    Setting maximum total tokens (input + output) to 82 
    ------------------------------------------------------
    generation time is 10.450812101364136 sec
    
    in=DeepSpeed is a machine learning framework
    out=DeepSpeed is a machine learning framework for large-scale, complex data
    
    DeepSpeed is a machine learning framework specifically designed to solve some of the most complex and large-scale problems. The goal of DeepSpeed is to provide a rich infrastructure on top of which researchers can build highly
    ------------------------------------------------------------
    [2023-01-04 11:23:05,806] [INFO] [launch.py:350:main] Process 33466 exits successfully.
    

    Note that memory is very tight here; nevertheless, we can still generate 50 tokens from the input text!

    opened by RezaYazdaniAminabadi 0
Releases (v0.7.7)
  • v0.7.7(Dec 12, 2022)

    What's Changed

    • Update the locator for Megatron-LM by @rapsealk in https://github.com/microsoft/DeepSpeed/pull/2564
    • use get_global_rank if available by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2567
    • Add Determined to open-source DL frameworks by @sirredbeard in https://github.com/microsoft/DeepSpeed/pull/2573
    • Support fp32 gradaccum for bf16 model by @delock in https://github.com/microsoft/DeepSpeed/pull/2566
    • Drop Maxwell Support by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2574
    • Fix quantized-inference & Add generic support of checkpoint loading by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2547
    • Fix MegatronLayerPolicy to have megatron_v2=True by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2579
    • Update barrier and reduce_scatter_base to conform to PyTorch signatures by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2570
    • Support N-dimension input in quantization kernel by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2575
    • Add checkpoint sharding unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2561
    • Updating docs README by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2587
    • Updating API docs by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2586
    • Fix issues w. python 3.6 + add py-version checks to CI by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2589
    • [benchmarks] get mask token from tokenizer by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2592

    New Contributors

    • @rapsealk made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2564
    • @sirredbeard made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2573

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.6...v0.7.7

  • v0.7.6(Dec 1, 2022)

    What's Changed

    • DeepSpeed inference config. (#2459) by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2472
    • Update docs to autogenerate pydantic config model docs by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2509
    • Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2508
    • Deepspeed quantization library v0.1 by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2450
    • Fix backward compatibility for InferenceConfig by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2516
    • Add missing Inference sub-configs by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2518
    • Add note about nvcc/hipcc requirement by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2519
    • Update codeowners by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2525
    • Dequantization Utils Library by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2521
    • Fixes for torch 1.14 due to new torch.numel return type by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2522
    • Ensure MOE is initialized for SD by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2534
    • Make DS-Inference config readable from JSON by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2537
    • Add MII tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2533
    • Remove mutable default parameter in init_inference() by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2540
    • Change Where DS/Triton is Used in Stable Diffusion by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2536
    • Pass down the new DS inference config to replace_transformer_layer. by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2539
    • Adding Gradient Accumulation Data Type Config by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2512
    • Report progress at gradient accumulation boundary by @ShijieZZZZ in https://github.com/microsoft/DeepSpeed/pull/2553
    • encoded ds config into command line argument when launching child processes in autotuning by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2524
    • Add missing MoE fields to inference config for backward compatibility by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2556
    • Abstract accelerator (step 1) by @delock in https://github.com/microsoft/DeepSpeed/pull/2504
    • Fix invalid check of recorded parameter orders in zero stage3. by @inkcherry in https://github.com/microsoft/DeepSpeed/pull/2550

    New Contributors

    • @ShijieZZZZ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2553
    • @delock made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2504
    • @inkcherry made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2550

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.5...v0.7.6

  • v0.7.5(Nov 14, 2022)

    What's Changed

    • Fix Bug #2319 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2438
    • update pytorch pool operator function signiture by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2443
    • Fix build issues on Windows by @eltonzheng in https://github.com/microsoft/DeepSpeed/pull/2428
    • rollback ds config changes by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2395
    • Use CUDA events for inference model profiling by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2371
    • Fixing a config mismatch in unit test. by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2447
    • Reduction Kernel Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2436
    • deepspeed/launcher/launch.py: add option enable_each_rank_log by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2409
    • Fixes for various CI problems by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2457
    • Cache Allocation and Softmax Fixes by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2433
    • Fix checkpoint loading at inference-engine by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2429
    • Create a new folder structure to isolate model-specific code in DS by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2464
    • don't gather partitioned activations for mp size 1 by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2454
    • Updating autotune json default in docs. by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2476
    • Added MLFLOW environment variables for logging metrics within trainig… by @savitamittal1 in https://github.com/microsoft/DeepSpeed/pull/2477
    • fix accelerate link in README by @kyoto7250 in https://github.com/microsoft/DeepSpeed/pull/2481
    • Fix Stable-Diffusion: Add correct memory-allocation at DeepSpeed-Attention by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2474
    • Fix CI issues related to cupy install by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2483
    • Add scale_attn_by_inverse_layer_idx feature by @hyunwoongko in https://github.com/microsoft/DeepSpeed/pull/2486
    • Stable Diffusion Enhancements by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2491
    • stage_1_and_2.py: no allreduce needed when mp size is 1 by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2494
    • Make bf16_optimizer work for non pipeline parallelism by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2470
    • Fix nightly CI tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2493
    • Make data contiguous before the inplace reshape-copy_ function. by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2489
    • Fix typos: deepseed -> deepspeed by @jinyouzhi in https://github.com/microsoft/DeepSpeed/pull/2499

    New Contributors

    • @guoyejun made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2409
    • @savitamittal1 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2477
    • @kyoto7250 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2481
    • @lokoppakmsft made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2489
    • @jinyouzhi made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2499

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.4...v0.7.5

  • v0.7.4(Oct 21, 2022)

    What's Changed

    • MOE residual matmult unit test by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2323
    • MOE matmult with memaccess by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2336
    • Refactor residual add kernels by @arashb in https://github.com/microsoft/DeepSpeed/pull/2333
    • mem access for quantize kernel by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2331
    • increase min pre-commit versions by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2346
    • Extend scratch buffer for long prompts by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2212
    • [docs] fix zero docs by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2350
    • Staging profile inference v1 (#2348) by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2349
    • Kernel Data Conversion Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2327
    • Add Onebit Optimizers in init by @l4d2boomer in https://github.com/microsoft/DeepSpeed/pull/2340
    • docs(mixture-of-experts-inference): fix typo in tuto by @jqueguiner in https://github.com/microsoft/DeepSpeed/pull/2345
    • Use blob storage for datasets in unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2342
    • Refactor gptj_residual_add kernels for better readability by @arashb in https://github.com/microsoft/DeepSpeed/pull/2358
    • Updated issue templates by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2363
    • fix cuda invalid config error in dequant kernel by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2362
    • Add missing pytest fixture scope by @arashb in https://github.com/microsoft/DeepSpeed/pull/2353
    • Extend residual_add kernel tests to cover pre_attn_norm by @arashb in https://github.com/microsoft/DeepSpeed/pull/2354
    • Refactor fused_bias_residual kernels for better readability by @arashb in https://github.com/microsoft/DeepSpeed/pull/2356
    • Capture error message during sweep tests by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2351
    • Fix an exception when auto-casting dicts to fp16 by @mjksmith in https://github.com/microsoft/DeepSpeed/pull/2370
    • Refactor remaining distributed tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2216
    • Fix the MLP output tensor's shape by @arashb in https://github.com/microsoft/DeepSpeed/pull/2380
    • add 11.8 to cuda_minor_mismatch_ok to allow building with current CUDA by @Thomas-MMJ in https://github.com/microsoft/DeepSpeed/pull/2390
    • Pin Transformers test version by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2402
    • Change type to tuple in replace_wo_policy isinstance check by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2387
    • Checkpoint backwards-compatbility workaround by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2384
    • Add Predicated Global Load to Memory Access Utils by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2373
    • MII blog post by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2418
    • Fix figure reference by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2419
    • Add SLURM Multinode Runner by @dashstander in https://github.com/microsoft/DeepSpeed/pull/2404
    • Fix issue with corrupted output on long generation for GPT by @andrewchernyh in https://github.com/microsoft/DeepSpeed/pull/2359
    • Fix GPT Neo-X multi-gpu inference by @andrewchernyh in https://github.com/microsoft/DeepSpeed/pull/2401
    • CI fixes related to triton by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2422
    • [docs] update mii blog title by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2423
    • add SD injection policy by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2381
    • Fix checkpoint loading when it is a dictionary by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2425
    • Make error regex more generic in collect_results.py by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2415
    • fixes #2389 by @clumsy in https://github.com/microsoft/DeepSpeed/pull/2411
    • Fix for inference gpt-j test by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2430
    • Fixing bug 2361 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2410
    • Universal checkpoint for zero stage 1 by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2284
    • only add deps if extra is explicitly called by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2432
    • Add TestInjectionPolicy inference unittest class for testing custom injection policies by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2426
    • [memory estimators] new config args sync by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2431
    • parallelize writing of layer checkpoint files across data parallel instances by @adammoody in https://github.com/microsoft/DeepSpeed/pull/1419
    • Fix broken link to DeepSpeed Megatron fork by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2440

    New Contributors

    • @l4d2boomer made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2340
    • @jqueguiner made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2345
    • @mjksmith made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2370
    • @Thomas-MMJ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2390
    • @lekurile made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2387
    • @dashstander made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2404
    • @andrewchernyh made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2359
    • @clumsy made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2411
    • @jomayeri made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2410

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.3...v0.7.4

  • v0.7.3(Sep 19, 2022)

    What's Changed

    • Add blob storage to CI runners by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2260
    • Update replace_module.py, test-gptj.py related fix by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2269
    • Fix OrderedDict import for python3.6 by @Dipet in https://github.com/microsoft/DeepSpeed/pull/2267
    • Ds inference/fix mp2 by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2270
    • Trajepl: nebula load fix by @trajepl in https://github.com/microsoft/DeepSpeed/pull/2182
    • Prevent torch ext folder mkdir at tmp by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2274
    • Ds-inference Int8 support through ZeroQuant technology by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2217
    • add a new unit test for cuda ops by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2278
    • Addition to code owners file by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2279
    • Memory Access Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2276
    • Fp32 accuracy bug fix by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2285
    • Refactor universal checkpointing and tensor fragments by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2253
    • [ds-inference] fix progress bar by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2286
    • Offload all gradients to nvme by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2282
    • fused bias relu unittest by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2297
    • Fix for pytest picking up wrong deepspeed by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2299
    • Fix for Zero3 when MP>1 by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2289
    • Unit test for bias add kernel by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2298
    • Update relu.cu with mem_access_utils by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2306
    • Add tensor parallel inference unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2232
    • Fix the residual add mp scaling for GPTNeoX by @arashb in https://github.com/microsoft/DeepSpeed/pull/2310
    • Add unit tests for residual_add kernel by @arashb in https://github.com/microsoft/DeepSpeed/pull/2307
    • add inference eval scripts by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2303
    • Upgrade P40 tests to torch 1.8 by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2316
    • ZeRO-Inference blog by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2271
    • ZeRO-Inference blog - wrap up by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2321
    • ZeRO-Inference blog - Update README by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2322
    • Refactor relu bias add with mem_access utils by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2317
    • add quant unit test by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2315
    • only override forward if using cuda-graph by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2291
    • Add more options to inference benchmark by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2325

    New Contributors

    • @molly-smith made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2269

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.2...v0.7.3

  • v0.7.2(Aug 25, 2022)

    What's Changed

    • Enable contiguous gradients with Z1+MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2250
    • Correctly detect CPU optimizer usage by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2257
    • Update Half Precision Kernel Compatibility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2261
    • fix #2240: wrong time unit in flops_profiler by @yzs981130 in https://github.com/microsoft/DeepSpeed/pull/2241

    New Contributors

    • @cmikeh2 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2261
    • @yzs981130 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2241

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.1...v0.7.2

  • v0.7.1(Aug 23, 2022)

    What's Changed

    • Fix for distributed tests on pytorch>=1.12 by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2141
    • delay torch import for inference compatability check by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2167
    • Fix wrong unit of latency in flops-profiler (#2090) by @zionwu in https://github.com/microsoft/DeepSpeed/pull/2095
    • [docs] adoption updates by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2173
    • Update for AMD CI workflow by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2172
    • [docs] update offload docs to include stage 1 by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2178
    • Fixing model partitioning without injection by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2179
    • Match compute and reduce dtype by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2145
    • Enable fused_lamb_cuda_kernel on ROCm by @rraminen in https://github.com/microsoft/DeepSpeed/pull/2148
    • Update README to latest Composer version by @hanlint in https://github.com/microsoft/DeepSpeed/pull/2177
    • [deepspeed/autotuner] Missing hjson import by @rahilbathwal5 in https://github.com/microsoft/DeepSpeed/pull/2175
    • [docs] add more models to adoption by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2189
    • [CI] fix lightning tests by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2190
    • Fix typos on README.md by @gasparitiago in https://github.com/microsoft/DeepSpeed/pull/2192
    • Fix the layer-past for GPT based models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2196
    • Add gradient_average flag support for sparse grads by @Dipet in https://github.com/microsoft/DeepSpeed/pull/2188
    • Adding the compression tutorial on GPT distillation and quantization by @minjiaz in https://github.com/microsoft/DeepSpeed/pull/2197
    • Log user config exactly by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2201
    • Fix the tensor-slicing copy for qkv parameters by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2198
    • Refactor Distributed Tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2180
    • fix table syntax by @kamalkraj in https://github.com/microsoft/DeepSpeed/pull/2204
    • Correctly detect offload configuration by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2208
    • add cuda 11.7 by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2211
    • use torch 1.9 in accelerate tests by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2215
    • [zero-3] print warning once and support torch parameter by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2127
    • Add support of OPT models by @arashb in https://github.com/microsoft/DeepSpeed/pull/2205
    • fix typos in readme. by @zhjohnchan in https://github.com/microsoft/DeepSpeed/pull/2218
    • Fix regression w. dist_init_required by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2225
    • add doc for new bert example by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2224
    • Remove the random-generator from context during inference by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2228
    • allow saving ckpt w/o ckpt json + bloom copy fix by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2237
    • Correctly detect zero_offload by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2213
    • [docs] update community videos by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2249
    • Refactor dist tests: Checkpointing by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2202
    • Make OPT policy backward compatible with pre-OPT transformers versions by @arashb in https://github.com/microsoft/DeepSpeed/pull/2254
    • fix ds-inference without policy by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2247

    New Contributors

    • @zionwu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2095
    • @hanlint made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2177
    • @rahilbathwal5 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2175
    • @gasparitiago made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2192
    • @arashb made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2205
    • @zhjohnchan made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2218

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.0...v0.7.1

  • v0.7.0(Aug 1, 2022)

    New features

    • DeepSpeed Compression: https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/

    What's Changed

    • Adding DeepSpeed Compression Composer by @yaozhewei in https://github.com/microsoft/DeepSpeed/pull/2105
    • Remove hardcoded ROCm install path by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2093
    • Fix softmax dim of Residual MoE implementation in moe/layer.py by @hero007feng in https://github.com/microsoft/DeepSpeed/pull/2110
    • reduce ds-inference log verbosity by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2111
    • DeepSpeed Compression announcement by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2114
    • Checkpoint reshaping by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1953
    • Fix init_process_group by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2121
    • DS Benchmarks QoL Improvements by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2120
    • [ROCm] Wrong command broke ROCm build. by @jpvillam-amd in https://github.com/microsoft/DeepSpeed/pull/2118
    • DeepSpeed Communication Profiling and Logging by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2012
    • Add flake8 to pre-commit checks by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2051
    • Fix conflict between Tutel and top-2 gate in MoE layer by @yetiansh in https://github.com/microsoft/DeepSpeed/pull/2053
    • adding HF Accelerate+DS tests workflow by @pacman100 in https://github.com/microsoft/DeepSpeed/pull/2134
    • [inference tests] turn off time check for now by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2142
    • Allow turning off loss scaling wrt GAS + update tput calculator by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2140
    • Refactor ZeRO configs to use Pydantic by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2004
    • Add purely-local sliding window sparse attention config by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1962
    • Trajepl/nebula ckpt engine by @trajepl in https://github.com/microsoft/DeepSpeed/pull/2085
    • Graceful exit on failures for multi-node runs by @jerrymannil in https://github.com/microsoft/DeepSpeed/pull/2008
    • fix: fix BF16_Optimizer compatibility issue by @shjwudp in https://github.com/microsoft/DeepSpeed/pull/2152
    • Fix random token-generation issue + MP-checkpoint loading/saving by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2132
    • Added retain_graph as a kwarg to the main engine backward function by @ncilfone in https://github.com/microsoft/DeepSpeed/pull/1149
    • Elastic Training support in DeepSpeed by @aj-prime in https://github.com/microsoft/DeepSpeed/pull/2156
    • prevent cuda 10 builds of inference kernels on ampere by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2157
    • [zero-3] shutdown zero.Init from within ds.init by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2150
    • enable fp16 input autocasting by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2158
    • Release swap buffers for persisted params by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2089
    • Tensor parallelism for Mixture of Experts by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2074

    New Contributors

    • @hero007feng made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2110
    • @jpvillam-amd made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2118
    • @yetiansh made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2053
    • @pacman100 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2134
    • @jimwu6 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2144
    • @trajepl made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2085
    • @ncilfone made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1149

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.7...v0.7.0

  • v0.6.7(Jul 19, 2022)

    What's Changed

    • Add Inference support for running the BigScience-BLOOM Architecture by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2083
    • [ds-inference] checkpoint loading => tqdm by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2107
    • Dont overwrite hook handles in flop profiler by @Sanger2000 in https://github.com/microsoft/DeepSpeed/pull/2106
    • Support HuggingFace NeoX injection policy by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2087

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.6...v0.6.7

  • v0.6.6(Jul 18, 2022)

    What's Changed

    • [docs] add 530b paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1979
    • small fix for the HF Bert models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1984
    • Add unit test for various model families and inference tasks by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1981
    • Fix for lightning tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1988
    • fix typo when getting kernel dim in conv calculation by @cli99 in https://github.com/microsoft/DeepSpeed/pull/1989
    • Add torch-latest and torch-nightly CI workflows by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1990
    • [bug] Add user-defined launcher args for MPI launcher by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1933
    • Propagate max errorcode to deepspeed when using PDSH launcher by @jerrymannil in https://github.com/microsoft/DeepSpeed/pull/1994
    • [docs] add new build badges to landing page by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1998
    • DeepSpeed Comm. Backend v1 by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/1985
    • Relax DeepSpeed MoE ZeRO-1 Assertion by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2007
    • update CODEOWNERS by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2017
    • [CI] force upgrade HF dependencies & output py env by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2015
    • [inference] test suite for ds-kernels (bert, roberta, gpt2, gpt-neo, gpt-j) by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1992
    • DeepSpeed examples refresh by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2021
    • Fix transformer API for training-evaluation pipeline by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2018
    • DataLoader Length Fix by @Sanger2000 in https://github.com/microsoft/DeepSpeed/pull/1718
    • DeepSpeed Monitor Module (Master) by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2013
    • Use partition numel by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2011
    • fix import errors by @KMFODA in https://github.com/microsoft/DeepSpeed/pull/2026
    • Fix inference unit test import error catching by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2024
    • Retain available params until last use by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2016
    • Split parameter offload from z3 by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2009
    • Fix flops profiler print statements by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2038
    • Add compression papers by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2042
    • Fix the half-precision version of CPU-Adam by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2032
    • Fix for AMD unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2047
    • Wrong partition_id while copying fp32_params -> fp16 params in Z2 for MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2058
    • Fix missing import in replace_module.py by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2050
    • Comms Benchmarks by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2040
    • add ds inference paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2072
    • Comments for better understanding of zero stage1_2 by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/2027
    • [docs] fix broken read-the-docs build by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2075
    • Fix building package without a GPU by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2049
    • Fix partition id in the fp32->fp16 param copying step for z2+cpu-offload by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2059
    • Codeowner addendum and fix to small model debugging script by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2076
    • remove require grad in params count by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2065
    • Add missing newline for ZeroOneAdam parameter table by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/2088
    • fixed "None type has no len()" by @xiazeyu in https://github.com/microsoft/DeepSpeed/pull/2091
    • Improving memory utilization of Z2+MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2079

    New Contributors

    • @jerrymannil made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1994
    • @Sanger2000 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1718
    • @KMFODA made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2026
    • @siddharth9820 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2058
    • @samadejacobs made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2076
    • @xiazeyu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2091

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.5...v0.6.6

  • v0.6.5(May 25, 2022)

    What's Changed

    • GatheredParameters - accept a tuple of params by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1941
    • Update partition_parameters.py by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/1943
    • fix step in adam by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1823
    • [pipe] prevent deadlock with multiple evals sequence by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1944
    • Fairseq support by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1915
    • DeepSpeed needs to start cleaning up by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1947
    • trivial fix by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/1954
    • Enabling CUDA-graph for the bert-type models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1952
    • Add loss scale guard to avoid inf loop by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1958
    • [launcher] add option to bypass ssh check by @liamcli in https://github.com/microsoft/DeepSpeed/pull/1957
    • Bump nokogiri from 1.13.4 to 1.13.6 in /docs by @dependabot in https://github.com/microsoft/DeepSpeed/pull/1965
    • Fix typo in timer.py by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1964
    • [docs] fix dependabot version issue by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1966
    • Don't add curand on rocm by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1968
    • Add Unidirectional Sparse Attention Type to BigBird and BSLongformer by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1959
    • Fix: Sparse tensors not updating by @Dipet in https://github.com/microsoft/DeepSpeed/pull/1914
    • Fixing several bugs in the inference-api and the kernels by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1951

    New Contributors

    • @Quentin-Anthony made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1958

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.4...v0.6.5

  • v0.6.4(May 6, 2022)

    What's Changed

    • [fix] Windows installs cannot import fcntl by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1921
    • [build] explicitly add op_builder to manifest by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1920
    • Enable DeepSpeed inference on ROCm by @rraminen in https://github.com/microsoft/DeepSpeed/pull/1922
    • bf16 inference by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1917
    • spell err by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/1929
    • [ZeRO-3] Rename confusing log message by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1932
    • [bug] Fix time log error in PipelineEngine by @Codle in https://github.com/microsoft/DeepSpeed/pull/1934
    • Improve z3 trace management by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1916

    New Contributors

    • @kisseternity made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1929
    • @Codle made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1934

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.3...v0.6.4

  • v0.6.3(Apr 27, 2022)

    What's Changed

    • Fix setup.py crash when torch is not installed. by @PaperclipBadger in https://github.com/microsoft/DeepSpeed/pull/1866
    • Add support for AWS SageMaker. by @matherit in https://github.com/microsoft/DeepSpeed/pull/1868
    • Fix broken links by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1873
    • [docs] add amd blog to website by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1874
    • [docs] add moe paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1875
    • Supporting multiple modules injection with a single policy when they … by @samyam in https://github.com/microsoft/DeepSpeed/pull/1869
    • [docs] fix dead links by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1877
    • add now required -lcurand to solve undefined symbol: curandCreateGenerator by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1879
    • Bug fix for flops profilers output by @VisionTheta in https://github.com/microsoft/DeepSpeed/pull/1885
    • Bump nokogiri from 1.13.3 to 1.13.4 in /docs by @dependabot in https://github.com/microsoft/DeepSpeed/pull/1889
    • [docs] fix commonmarker security issue by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1892
    • bf16+pipeline parallelism by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1801
    • fix file ordering by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1822
    • Use f-strings where possible by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/1900
    • [partition_parameters.py] better diagnostics by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1887
    • comm backend: cast bool when not supported by torch2cupy by @conglongli in https://github.com/microsoft/DeepSpeed/pull/1894
    • Use cuda events to improve timing for multi-stream execution by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1881
    • Fix multiple zero 3 tracing errors by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1901
    • Improve ds_report output for HIP/ROCm by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1906
    • Fix launcher for reading env vars by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1907
    • Fix OOM and type mismatch by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1884

    New Contributors

    • @PaperclipBadger made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1866
    • @matherit made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1868
    • @VisionTheta made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1885
    • @szhengac made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1822

    Misc

    • v0.6.2 was skipped due to a build/deploy issue with that release

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.1...v0.6.3

  • v0.6.0(Mar 7, 2022)

  • v0.5.10(Jan 19, 2022)

  • v0.5.0(Aug 17, 2021)
