Official code for "Distributed Deep Learning in Open Collaborations" (NeurIPS 2021)

Overview

Distributed Deep Learning in Open Collaborations

This repository contains the code for the NeurIPS 2021 paper

"Distributed Deep Learning in Open Collaborations"

Michael Diskin*, Alexey Bukhtiyarov*, Max Ryabinin*, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko

Link: arXiv

Note

This repository contains a snapshot of the code used to conduct experiments in the paper.

Please use the up-to-date version of our library if you want to try out collaborative training and/or set up your own experiment. It contains many substantial improvements, including better documentation and bug fixes.

Installation

Before running the experiments, please set up the environment by following the steps below:

  • Prepare an environment with Python 3.7-3.9. Anaconda is recommended, but not required
  • Install the hivemind library from the master branch or by running pip install hivemind==0.9.9.post1

For all distributed experiments, the installation procedure must be repeated on every machine that participates in the experiment. We recommend using machines with at least 2 CPU cores, 16 GB RAM and, when applicable, a low/mid-tier NVIDIA GPU.
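As a quick sanity check (our suggestion, not part of the original setup steps), you can verify that hivemind imports cleanly and reports the expected version:

    import hivemind

    # Should print 0.9.9.post1 if installed from the pinned release
    print(hivemind.__version__)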

Experiments

The code is divided into several sections matching the corresponding experiments:

  • albert contains the code for controlled experiments with ALBERT-large on WikiText-103;
  • swav is for training SwAV on ImageNet data;
  • sahajbert contains the code used to conduct a public collaborative experiment for the Bengali language ALBERT;
  • p2p is a step-by-step tutorial that explains decentralized NAT traversal and circuit relays.

We recommend running the albert experiments first: the other experiments build on top of its code and may require a more careful setup (e.g., for public participation). Furthermore, for this experiment, we provide a script for launching training on preemptible GPUs in the cloud.
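For orientation, below is a rough sketch of what collaborative training with hivemind looks like, assuming the 0.9.x-era API that this snapshot was built against; the peer address, experiment prefix, and batch sizes are illustrative placeholders, and the actual experiment scripts (e.g., albert/run_trainer.py) configure many more options:

    import torch
    import hivemind

    # Stand-in for the real model and optimizer used in the experiments
    model = torch.nn.Linear(512, 512)
    base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Join the experiment's DHT through an existing peer (placeholder address)
    dht = hivemind.DHT(initial_peers=["1.2.3.4:1337"], start=True)

    # Wrap the local optimizer: peers accumulate gradients locally and take a
    # synchronized global step once the collaboration reaches target_batch_size
    opt = hivemind.CollaborativeOptimizer(
        opt=base_opt,
        dht=dht,
        prefix="my_experiment",      # placeholder experiment name
        target_batch_size=4096,
        batch_size_per_step=32,
        verbose=True,
        start=True,
    )

Every participant runs the same training loop with its own local batches; gradient averaging and progress tracking happen in the background through the DHT.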

Acknowledgements

This project is the result of a collaboration between Yandex, Hugging Face, MIPT, HSE University, University of Toronto, Vector Institute, and Neuropark.

We also thank Stas Bekman, Dmitry Abulkhanov, Roman Zhytar, Alexander Ploshkin, Vsevolod Plokhotnyuk, and Roman Kail for their invaluable help with building the training infrastructure. In addition, we thank Abhishek Thakur for helping with downstream evaluation, and Tanmoy Sarkar and Omar Sanseviero, who helped us organize the collaborative experiment and gave regular status updates to the participants over the course of the training run.

Contacts

Feel free to ask any questions in our Discord chat or by email.

Citation

@inproceedings{diskin2021distributed,
    title = {Distributed Deep Learning In Open Collaborations},
    author = {Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitriy Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan},
    year = {2021},
    url = {https://openreview.net/forum?id=FYHktcK-7v}
}
Comments
  • Problems when trying to run the albert example

    Hi! Thank you for this amazing project! I'm trying to reproduce the experimental results from the paper but ran into some problems:

    I'm using Python 3.9 and following the instructions in the README.

    1. Data pre-processing

    When I tried to run python tokenize_wikitext103.py, it failed with an error like:

    Traceback (most recent call last):                                                                                                       
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker                           
        result = (True, func(*args, **kwds))                                                                                                 
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper                     
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 485, in wrapper                     
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/fingerprint.py", line 411, in wrapper                       
        out = func(self, *args, **kwargs)                                                                                                    
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single                
        batch = apply_function_on_filtered_inputs(                                                                                           
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2357, in apply_function_on_filtered_inputs
        processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)                                                                 
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2052, in decorated                  
        result = f(decorated_item, *args, **kwargs)                                                                                          
      File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 82, in tokenize_function                                                   
        instances = create_instances_from_document(tokenizer, text, max_seq_length=512)                                                      
      File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 24, in create_instances_from_document                                      
        segmented_sents = list(nltk.sent_tokenize(document))                                                                                 
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize               
        return tokenizer.tokenize(text)                                                                                                      
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize                      
        return list(self.sentences_from_text(text, realign_boundaries))                                                                      
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text           
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>                    
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize                 
        for sentence in slices:                                                                                                              
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
        for sentence1, sentence2 in _pair_iter(slices):
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
        prev = next(iterator)
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
        for match, context in self._match_potential_end_contexts(text):
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
        before_words[match] = split[-1]
    IndexError: list index out of range
    

    2. The API URL does not exist

    When I tried to run the GPU trainer, it failed with this error:

    Traceback (most recent call last):
      File "/home/su/DeDLOC/albert/run_trainer.py", line 297, in <module>
        main()
      File "/home/su/DeDLOC/albert/run_trainer.py", line 225, in main
        tokenizer = AlbertTokenizerFast.from_pretrained(dataset_args.tokenizer_path, cache_dir=dataset_args.cache_dir)
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
        fast_tokenizer_file = get_fast_tokenizer_file(
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
        all_files = get_list_of_files(
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/file_utils.py", line 2103, in get_list_of_files
        return list_repo_files(path_or_repo, revision=revision, token=token)
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
        info = self.model_info(
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 586, in model_info
        r.raise_for_status()
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer
    
    
    opened by soodoshll 3
  • Example Colab notebooks for collaborative training participants, from Appendix H1?

    Appendix H1 of the paper mentions Colab notebooks that participants can use to join the training process. Can examples of these be released?

    I am very fascinated by the potential of DeDLOC for decentralized organizations like Masakhane, so I'm curious how hard it might be to, say, set up the training of large translation models!

    opened by cdleong 3
  • fp16 error when running sahajbert participant snippet on Colab Pro

    When running the snippet from the sahajbert example on Colab Pro, I get this error:

    ValueError: Mixed precision training with AMP or APEX (`--fp16`) and FP16 evaluation can only be used on CUDA devices.
    

    But when I add this to the arguments, it goes away:

    --fp16 0
    

    Should we update the snippet like so?

    HIVEMIND_THREADS=128 python run_trainer.py \
    --output_dir ./outputs_trainer --overwrite_output_dir  --logging_dir ./logs_trainer \
    --logging_first_step --logging_steps 100   --initial_peers COORDINATOR_IP:COORDINATOR_PORT \
    --experiment_prefix NAME_YOUR_EXPERIMENT --seed 42 --averaging_timeout 120  --bandwidth 1000 \
    --fp16 0
    
    opened by cdleong 2
  • NAT-traversal Tutorial for public access nodes

    Dear all,

    When I follow the NAT-traversal tutorial and run this script on both AWS instances (with the security group configured) and Google Colab:

    from hivemind.p2p.p2p_daemon import P2P
    import asyncio
    import nest_asyncio

    nest_asyncio.apply()  # allow a nested event loop inside Colab/Jupyter

    async def test():
        node = await P2P.create()  # spawn a libp2p daemon for this peer
        # print the peer ID and the multiaddrs the daemon is listening on
        print(await node._client.identify())

    loop = asyncio.get_event_loop()
    loop.run_until_complete(test())
    

    I only get output like this:

    (<libp2p.peer.id.ID (QmcNoM62vZoExYZbwuQFBGrRuQkZnqQ8FWejBwNirZtt8p)>, (<Multiaddr /ip4/127.0.0.1/tcp/35651>,))
    

    I am not sure why the output does not include public addresses like udp/public-ip. Could you provide more details about public access nodes and how to obtain their advertised addresses? Thanks!

    opened by HuYang719 3