Official code for "Distributed Deep Learning in Open Collaborations" (NeurIPS 2021)

Overview

Distributed Deep Learning in Open Collaborations

This repository contains the code for the NeurIPS 2021 paper

"Distributed Deep Learning in Open Collaborations"

Michael Diskin*, Alexey Bukhtiyarov*, Max Ryabinin*, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko

Link: arXiv

Note

This repository contains a snapshot of the code used to conduct experiments in the paper.

Please use the up-to-date version of our library if you want to try out collaborative training and/or set up your own experiment. It contains many substantial improvements, including better documentation and bug fixes.

Installation

Before running the experiments, please set up the environment by following the steps below:

  • Prepare an environment with Python 3.7-3.9. Anaconda is recommended, but not required
  • Install the hivemind library from the master branch or by running pip install hivemind==0.9.9.post1

For all distributed experiments, the installation procedure must be repeated on every machine that participates in the experiment. We recommend using machines with at least 2 CPU cores, 16 GB RAM and, when applicable, a low/mid-tier NVIDIA GPU.
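As a quick sanity check (our suggestion, not part of the original setup steps), you can verify that hivemind imports cleanly and reports the expected version:

    import hivemind

    # Should print 0.9.9.post1 if installed from the pinned release
    print(hivemind.__version__)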

Experiments

The code is divided into several sections matching the corresponding experiments:

  • albert contains the code for controlled experiments with ALBERT-large on WikiText-103;
  • swav is for training SwAV on ImageNet data;
  • sahajbert contains the code used to conduct a public collaborative experiment for the Bengali language ALBERT;
  • p2p is a step-by-step tutorial that explains decentralized NAT traversal and circuit relays.

We recommend running the albert experiments first: the other experiments build on top of its code and may require a more careful setup (e.g., for public participation). Furthermore, for this experiment, we provide a script for launching training on preemptible GPUs in the cloud.
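For orientation, below is a rough sketch of what collaborative training with hivemind looks like, assuming the 0.9.x-era API that this snapshot was built against; the peer address, experiment prefix, and batch sizes are illustrative placeholders, and the actual experiment scripts (e.g., albert/run_trainer.py) configure many more options:

    import torch
    import hivemind

    # Stand-in for the real model and optimizer used in the experiments
    model = torch.nn.Linear(512, 512)
    base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Join the experiment's DHT through an existing peer (placeholder address)
    dht = hivemind.DHT(initial_peers=["1.2.3.4:1337"], start=True)

    # Wrap the local optimizer: peers accumulate gradients locally and take a
    # synchronized global step once the collaboration reaches target_batch_size
    opt = hivemind.CollaborativeOptimizer(
        opt=base_opt,
        dht=dht,
        prefix="my_experiment",      # placeholder experiment name
        target_batch_size=4096,
        batch_size_per_step=32,
        verbose=True,
        start=True,
    )

Every participant runs the same training loop with its own local batches; gradient averaging and progress tracking happen in the background through the DHT.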

Acknowledgements

This project is the result of a collaboration between Yandex, Hugging Face, MIPT, HSE University, University of Toronto, Vector Institute, and Neuropark.

We also thank Stas Bekman, Dmitry Abulkhanov, Roman Zhytar, Alexander Ploshkin, Vsevolod Plokhotnyuk, and Roman Kail for their invaluable help with building the training infrastructure. In addition, we thank Abhishek Thakur for helping with downstream evaluation, and Tanmoy Sarkar and Omar Sanseviero, who helped us organize the collaborative experiment and gave regular status updates to the participants over the course of the training run.

Contacts

Feel free to ask any questions in our Discord chat or by email.

Citation

@inproceedings{diskin2021distributed,
    title = {Distributed Deep Learning In Open Collaborations},
    author = {Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitriy Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan},
    year = {2021},
    url = {https://openreview.net/forum?id=FYHktcK-7v}
}
Comments
  • Problems when trying to run the albert example

    Hi! Thank you for this amazing project! I'm trying to reproduce the experimental results from the paper but ran into some problems:

    I'm using Python 3.9 and following the instructions in the README.

    1. Data pre-processing

    When I tried to run python tokenize_wikitext103.py, it failed with an error like:

    Traceback (most recent call last):                                                                                                       
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker                           
        result = (True, func(*args, **kwds))                                                                                                 
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper                     
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 485, in wrapper                     
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/fingerprint.py", line 411, in wrapper                       
        out = func(self, *args, **kwargs)                                                                                                    
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single                
        batch = apply_function_on_filtered_inputs(                                                                                           
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2357, in apply_function_on_filtered_inputs
        processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)                                                                 
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2052, in decorated                  
        result = f(decorated_item, *args, **kwargs)                                                                                          
      File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 82, in tokenize_function                                                   
        instances = create_instances_from_document(tokenizer, text, max_seq_length=512)                                                      
      File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 24, in create_instances_from_document                                      
        segmented_sents = list(nltk.sent_tokenize(document))                                                                                 
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize               
        return tokenizer.tokenize(text)                                                                                                      
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize                      
        return list(self.sentences_from_text(text, realign_boundaries))                                                                      
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text           
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>                    
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize                 
        for sentence in slices:                                                                                                              
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
        for sentence1, sentence2 in _pair_iter(slices):
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
        prev = next(iterator)
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
        for match, context in self._match_potential_end_contexts(text):
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
        before_words[match] = split[-1]
    IndexError: list index out of range
    

    2. The API URL does not exist

    When I tried to run the GPU trainer, it failed with this error:

    Traceback (most recent call last):
      File "/home/su/DeDLOC/albert/run_trainer.py", line 297, in <module>
        main()
      File "/home/su/DeDLOC/albert/run_trainer.py", line 225, in main
        tokenizer = AlbertTokenizerFast.from_pretrained(dataset_args.tokenizer_path, cache_dir=dataset_args.cache_dir)
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
        fast_tokenizer_file = get_fast_tokenizer_file(
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
        all_files = get_list_of_files(
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/file_utils.py", line 2103, in get_list_of_files
        return list_repo_files(path_or_repo, revision=revision, token=token)
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
        info = self.model_info(
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 586, in model_info
        r.raise_for_status()
      File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer
    
    
    opened by soodoshll 3
  • Example Colab notebooks for collaborative training participants, from Appendix H1?

    Appendix H1 of the paper mentions Colab notebooks that participants can use to join the training process. Can examples of these be released?

    I am very fascinated by the potential of DeDLOC for decentralized organizations like Masakhane, so I'm curious how hard it might be to, say, set up the training of large translation models!

    opened by cdleong 3
  • fp16 error when running sahajbert participant snippet on Colab Pro

    When running the snippet from the sahajbert example on Colab Pro, I get this error:

    ValueError: Mixed precision training with AMP or APEX (`--fp16`) and FP16 evaluation can only be used on CUDA devices.
    

    But when I add this to the arguments, it goes away:

    --fp16 0
    

    Should we update the snippet like so?

    HIVEMIND_THREADS=128 python run_trainer.py \
    --output_dir ./outputs_trainer --overwrite_output_dir  --logging_dir ./logs_trainer \
    --logging_first_step --logging_steps 100   --initial_peers COORDINATOR_IP:COORDINATOR_PORT \
    --experiment_prefix NAME_YOUR_EXPERIMENT --seed 42 --averaging_timeout 120  --bandwidth 1000 \
    --fp16 0
    
    opened by cdleong 2
  • NAT-traversal Tutorial for public access nodes

    Dear all,

    When I follow the NAT-traversal tutorial and run this script on both AWS instances (with the security group configured) and Google Colab:

    from hivemind.p2p.p2p_daemon import P2P
    import asyncio
    import nest_asyncio

    nest_asyncio.apply()  # allow a nested event loop inside Colab/Jupyter

    async def test():
        node = await P2P.create()  # spawn a libp2p daemon for this peer
        # print the peer ID and the multiaddrs the daemon is listening on
        print(await node._client.identify())

    loop = asyncio.get_event_loop()
    loop.run_until_complete(test())
    

    I only get output like this:

    (<libp2p.peer.id.ID (QmcNoM62vZoExYZbwuQFBGrRuQkZnqQ8FWejBwNirZtt8p)>, (<Multiaddr /ip4/127.0.0.1/tcp/35651>,))
    

    I am not sure why the output does not include public addresses like udp/public-ip. Could you provide more details about public access nodes and how to obtain their advertised addresses? Thanks!

    opened by HuYang719 3