Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Overview

Hivemind: decentralized deep learning in PyTorch


Hivemind is a PyTorch library to train large neural networks across the Internet. Its intended usage is training a single Transformer model on hundreds of computers from different universities, companies, and volunteers.


Key Features

  • Train neural networks of arbitrary size: parts of their layers are distributed across the participants.
  • Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized network.
  • Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too long to respond.
  • Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to synchronize across the entire network.

To learn more about the ideas behind this library, see https://learning-at-home.github.io or read the NeurIPS 2020 paper.
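To give a feel for the API, here is a minimal sketch of collaborative training with hivemind.Optimizer, loosely following the quickstart; the run_id, batch sizes, and toy model below are placeholder values, not prescribed settings:

import torch
import hivemind

model = torch.nn.Linear(16, 2)                 # toy model, stands in for a real network
dht = hivemind.DHT(start=True)                 # first peer; others would pass initial_peers=...
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

opt = hivemind.Optimizer(
    dht=dht,
    run_id="demo_run",            # peers with the same run_id train together
    batch_size_per_step=32,       # samples processed locally per opt.step()
    target_batch_size=4096,       # global samples to accumulate before averaging
    optimizer=base_opt,
)

for _ in range(100):
    x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()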

Installation

Before installing hivemind, make sure that your environment has Python 3.7+ and PyTorch 1.6.0 or newer.

To start using this library, you can either use the pip package manager or build it from source. Since the release cycle is not yet established, we recommend installing hivemind from source to keep up with the latest bugfixes and improvements.

With pip

If your versions of Python and PyTorch match the requirements, you can install hivemind with pip:

pip install hivemind

From source

To install hivemind from source, clone the repository and install it:

git clone https://github.com/learning-at-home/hivemind.git
cd hivemind
pip install .

If you would like to verify that your installation is working properly, you can install with pip install -e .[dev] instead. Then, you can run the tests with pytest tests/.
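For reference, the development install and test run from the repository root look like this:

pip install -e .[dev]
pytest tests/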

Documentation

Contributing

Hivemind is currently in active development, and we welcome all contributions. Everything, from bug fixes and documentation improvements to entirely new features, is equally appreciated.

If you want to contribute to hivemind but don't know where to start, take a look at the unresolved issues. Open a new issue or join our chat room if you want to discuss new functionality or report a possible bug. Bug fixes are always welcome, but new features should preferably be discussed with maintainers beforehand.

If you want to start contributing to the source code of hivemind, please see the contributing guidelines first. To learn more about other ways to contribute, read our guide.

Citation

If you found hivemind useful for your experiments, you can cite the paper that inspired it:

@inproceedings{ryabinin2020crowdsourced,
 author = {Ryabinin, Max and Gusev, Anton},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 pages = {3659--3672},
 publisher = {Curran Associates, Inc.},
 title = {Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts},
 url = {https://proceedings.neurips.cc/paper/2020/file/25ddc0f8c9d3e22e03d3076f98d83cb2-Paper.pdf},
 volume = {33},
 year = {2020}
}

The initial implementation of hivemind used for the paper is available at mryab/learning-at-home.

In the documentation, we list several related projects and acknowledgements.

Comments
  • Fine-tuning BERT on GLUE with hivemind

    Fine-tuning BERT on GLUE with hivemind

    Describe the bug While running the hivemind albert experiment, we have one monitor peer and two worker peers. One of the nodes is working fine.

    But the other peer is stuck at downloading parameters from a peer. My guess is that the reason is the training speed: if the first node trains too fast, the other node cannot join and stays stuck downloading parameters. Can we limit the training speed or force the first node to wait for the others to join?

    To Reproduce If applicable, please create a minimal script that reproduces the problem for you. It would be great to include script outputs as well.

    If we change albert to bert in the example, each iteration becomes faster, and then the new worker cannot join the training.

    Environment Please list:

    • python version (e.g. 3.8.1); 3.8
    • hivemind.version; 1.1.0.dev0
    discussion 
    opened by elricwan 32
  • Convert hivemind.server to libp2p backend

    Convert hivemind.server to libp2p backend

    #242 In this PR we are trying to get rid of gRPC in the MoE module of hivemind. I am opening this draft now to start the review process and go through it piece by piece, because some core things are done and some results have been achieved (I believe). There is still work to be done (it is mentioned at the end of this message).

    What is currently done:

    • RemoteExperts are able to communicate through libp2p
    • Some throughput optimizations were made: we achieved ~2 GiB/sec in the ffn_forward benchmark using a GTX 3060. This is done by enabling load balancing across multiple handlers of the same protocol in p2pd, so we can pack/unpack messages for forward/backward in parallel. Results are at the end of the message.
    • Examples from hivemind tutorial are working
    • tests/test_dht_experts.py are passing

    What are topics to discuss:

    • [x] The current implementation of RemoteExpert is quite heavy compared to the gRPC version. The gRPC version was in fact just a few string fields containing endpoints, nothing more. The current version contains a heavy object: a connection to the p2p daemon, which is not serializable. This is probably not the best decision.
    • [x] The current moe/server/Server will probably not work if it has no DHT inside; however, the old API documentation says that a DHT is not mandatory for a Server instance. I see two options: make the DHT mandatory or create a P2P instance inside the Server. With the second option, the DHT inside the Server might no longer be useful.

    What is yet to be done:

    • [x] Separate RemoteExpertInfo (with endpoints) from RemoteExpert and add a function to create an expert from its info
    • [x] Separate the thread and queue for async actions from _RemoteModuleCall
    • [x] Make tests/test_moe.py and tests/test_training.py pass (and probably some other tests)
    • [x] Rewrite RemoteMixtureOfExperts to P2P (as with RemoteExpert)
    • [x] Refactor the Server instance after the discussion mentioned above
    • [x] Wildly benchmark every possible scenario
    • [x] Fix documentation for tutorials
    • [x] Add tests for scenarios that might not be covered ~~After this PR is done we have to discuss dht.replicate_p2p() because it is not fork-safe, and not in an obvious way~~

    Current benchmarks results:

    Current benchmarks were performed with a GTX 3060 and --preset ffn_forward. Each experiment was run at least 5 times and the results below are averages. I can provide more detailed data on demand. It is also worth mentioning that these results may change during the review process.

    | Branch     | Batch size | Number of handlers | Throughput    |
    | ---------- | ---------- | ------------------ | ------------- |
    | server-p2p | 1024       | 1                  | ~838 MiB/sec  |
    | server-p2p | 1024       | 5                  | ~2085 MiB/sec |
    | server-p2p | 1024       | 10                 | ~2055 MiB/sec |
    | master     | 1024       | *default           | ~1526 MiB/sec |
    | master     | 2048       | *default           | ~2248 MiB/sec |

    *default in the Number of handlers column: TL;DR, it is 64. It comes from the formula max(1, num_handlers or num_clients // 2), where num_handlers = None and num_clients = 128.
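    To make that arithmetic explicit, here is the same formula evaluated in plain Python (values as stated above):

    num_handlers = None
    num_clients = 128
    # `num_handlers or num_clients // 2` falls back to 64 because num_handlers is None
    effective_handlers = max(1, num_handlers or num_clients // 2)
    print(effective_handlers)  # 64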

    What can be done after merging this

    Things discovered during review. They are not blocking this PR, but it is better to do them.

    • [ ] Get rid of multiaddrs everywhere. P2P daemon should be able to communicate using PeerID only
    • [ ] In some places there are CPU-bound things happening inside async tasks. It is better to move them into thread executors, for example forward/backward in hivemind/moe/client/expert.py
    • [ ] Currently hivemind.Server does not check that inputs are correct. If a user sends malformed inputs, it may OOM the server. We should check for that in some future PR. See #3
    • [ ] If a client sends a tensor of shape [0, 123], it will be split into zero messages and the uid will not be passed. The server will receive uid=None and fail with a cryptic KeyError(None). We should either forbid this on the client side or ensure that zero-element tensors are serialized into a stream with an initial empty message.
    • [ ] Test load balancing for unary handlers on the Python side
    server mixture-of-experts p2p 
    opened by GreenFatGuy 12
  • Averaging is extremely slow in some setups

    Averaging is extremely slow in some setups

    Error log from client-mode peer:

    [2021/08/16 22:54:50.049][INFO][optim.collaborative.step:229] Beginning global optimizer step 0
    [2021/08/16 22:54:50.253][INFO][optim.collaborative.fetch_collaboration_state:444] Collaboration accumulated 3696 samples from 2 peers; ETA 0.00 seconds (refresh in 0.50s.)
    [2021/08/16 22:54:50.445][INFO][optim.collaborative.fetch_collaboration_state:444] Collaboration accumulated 3696 samples from 2 peers; ETA 0.00 seconds (refresh in 0.50s.)
    /usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:87: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
      return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
    [2021/08/16 22:56:57.350][ERROR][averaging.averager._run_allreduce:426] 
    Traceback (most recent call last):
      File "/content/hivemind/hivemind/averaging/averager.py", line 416, in _run_allreduce
        averaging_outputs = [output async for output in allreduce]
      File "/content/hivemind/hivemind/averaging/averager.py", line 416, in <listcomp>
        averaging_outputs = [output async for output in allreduce]
      File "/content/hivemind/hivemind/averaging/allreduce.py", line 132, in run
        async for averaged_tensor_delta in self.tensor_part_container.iterate_output_tensors():
      File "/content/hivemind/hivemind/averaging/partition.py", line 134, in iterate_output_tensors
        await self._output_part_available[peer_index].wait()
      File "/usr/lib/python3.7/asyncio/locks.py", line 293, in wait
        await fut
    concurrent.futures._base.CancelledError
    [2021/08/16 22:56:57.352][INFO][optim.collaborative.step:250] Skipped averaging: averaging round failed with TimeoutError().
    [2021/08/16 22:56:57.368][INFO][optim.collaborative.step:266] Optimizer step: done!
    

    Error log from regular peer:

    [2021/08/16 22:55:12.202][INFO][optim.collaborative.step:229] Beginning global optimizer step 0
    [2021/08/16 22:55:12.212][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 3856 samples from 2 peers; ETA 0.00 seconds (refresh in 0.50s.)
    [2021/08/16 22:56:57.266][ERROR][averaging.averager._run_allreduce:426]
    Traceback (most recent call last):
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in _run_allreduce
        averaging_outputs = [output async for output in allreduce]
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in <listcomp>
        averaging_outputs = [output async for output in allreduce]
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/allreduce.py", line 132, in run
        async for averaged_tensor_delta in self.tensor_part_container.iterate_output_tensors():
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/partition.py", line 134, in iterate_output_tensors
        await self._output_part_available[peer_index].wait()
      File "/home/jheuristic/anaconda3/envs/TPU/lib/python3.9/asyncio/locks.py", line 226, in wait
        await fut
    asyncio.exceptions.CancelledError
    [2021/08/16 22:56:57.268][ERROR][averaging.averager._step:365] Averager caught MatchmakingException('Unable to run All-Reduce: ')
    Traceback (most recent call last):
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in _run_allreduce
        averaging_outputs = [output async for output in allreduce]
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 416, in <listcomp>
        averaging_outputs = [output async for output in allreduce]
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/allreduce.py", line 132, in run
        async for averaged_tensor_delta in self.tensor_part_container.iterate_output_tensors():
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/partition.py", line 134, in iterate_output_tensors
        await self._output_part_available[peer_index].wait()
      File "/home/jheuristic/anaconda3/envs/TPU/lib/python3.9/asyncio/locks.py", line 226, in wait
        await fut
    asyncio.exceptions.CancelledError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 348, in _step
        await asyncio.wait_for(
      File "/home/jheuristic/anaconda3/envs/TPU/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
        fut.result()
      File "/storage/hdd1/jheuristic/exp/testTPU/hivemind/hivemind/averaging/averager.py", line 427, in _run_allreduce
        raise MatchmakingException(f"Unable to run All-Reduce: {e}")
    hivemind.averaging.matchmaking.MatchmakingException: Unable to run All-Reduce:
    [2021/08/16 22:56:57.270][INFO][optim.collaborative.step:250] Skipped averaging: averaging round failed with MatchmakingException('Unable to run All-Reduce: ').
    [2021/08/16 22:56:57.301][INFO][optim.collaborative.step:266] Optimizer step: done!
    
    bug averaging 
    opened by yhn112 10
  • [BUG] Deadlock when 'Downloading parameters' takes too much time

    [BUG] Deadlock when 'Downloading parameters' takes too much time

    Describe the bug While running the hivemind albert experiment, we have one monitor peer and two worker peers. One of the nodes is working fine.

    But the other peer is stuck at downloading parameters from a peer. The peer log is:

    [2021/11/01 07:21:50.962][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
    [2021/11/01 07:28:17.871][INFO][averaging.averager._load_state_from_peers:597] Finished downloading state from QmYQsw4kqPujvWv52sFsCorZs69LNhkxAsgBhBwGbaFfez
    
    
    /opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1347: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
    
      nn.utils.clip_grad_norm_(
    
    [2021/11/01 07:28:18.759][INFO][__main__.on_step_end:153] Step 0
    
    [2021/11/01 07:28:18.760][INFO][__main__.on_step_end:154] Your current contribution: 0 samples
    
    [2021/11/01 07:28:18.760][INFO][__main__.on_step_end:155] Performance: 0.002546124167199564 samples per second.
    
    [2021/11/01 07:28:18.760][INFO][__main__.on_step_end:157] Local loss: 11.4107
    
    [2021/11/01 07:28:18.986][INFO][optim.collaborative.fetch_collaboration_state:442] Collaboration accumulated 81 samples from 1 peers; ETA 36.99 seconds (refresh in 9.25s.)
    
    [2021/11/01 07:28:19.004][INFO][optim.collaborative.step:208] Peer is out of sync.
    
    [2021/11/01 07:28:20.243][INFO][averaging.averager._load_state_from_peers:577] Downloading parameters from peer QmXpVXnAY6L7WqeW4pzstGK18S1LySDonPmrxQka3GztJa
    

    To Reproduce the monitor running script: python run_training_monitor.py --host_maddrs '/ip4/0.0.0.0/tcp/38888' --experiment_prefix albert --wandb_project albert

    the worker peer script: python run_trainer.py --experiment_prefix albert --host_maddrs '/ip4/0.0.0.0/tcp/39997' --initial_peers [INITIAL_PEERS_FROM_MONITOR] --seed 42 --logging_first_step --logging_steps 100 --output_dir /train --overwrite_output_dir --logging_dir /train --target_batch_size 1024 --averaging_expiration 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1

    Environment I was running this experiment in a Docker container. Please list:

    • python version 3.9.7
    • hivemind.version; 0.10.0
    • Please copy and paste the output from pytorch [environment collection script]
    Collecting environment information...
    PyTorch version: 1.10.0
    Is debug build: False
    CUDA used to build PyTorch: 11.1
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 20.04.3 LTS (x86_64)
    GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.31
    
    Python version: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.31
    Is CUDA available: True
    CUDA runtime version: Could not collect
    GPU models and configuration: GPU 0: GeForce RTX 3090
    Nvidia driver version: 460.91.03
    cuDNN version: Could not collect
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    
    Versions of relevant libraries:
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.21.3
    [pip3] pytorch-ranger==0.1.1
    [pip3] torch==1.10.0
    [pip3] torch-optimizer==0.3.0
    [pip3] torchaudio==0.10.0
    [pip3] torchvision==0.11.1
    [conda] blas                      1.0                         mkl  
    [conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
    [conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
    [conda] mkl                       2021.3.0           h06a4308_520  
    [conda] mkl-service               2.4.0            py39h7f8727e_0  
    [conda] mkl_fft                   1.3.1            py39hd3c417c_0  
    [conda] mkl_random                1.2.2            py39h51133e4_0  
    [conda] mypy-extensions           0.4.3                    pypi_0    pypi
    [conda] numpy                     1.21.3                   pypi_0    pypi
    [conda] numpy-base                1.21.2           py39h79a1101_0  
    [conda] pytorch                   1.10.0          py3.9_cuda11.1_cudnn8.0.5_0    pytorch
    [conda] pytorch-mutex             1.0                        cuda    pytorch
    [conda] pytorch-ranger            0.1.1                    pypi_0    pypi
    [conda] torch                     1.10.0                   pypi_0    pypi
    [conda] torch-optimizer           0.3.0                    pypi_0    pypi
    [conda] torchaudio                0.10.0                   pypi_0    pypi
    [conda] torchvision               0.11.1                   pypi_0    pypi
    

    Considering the file transfer speed, I tested the bandwidth with iperf:

    ------------------------------------------------------------
    Client connecting to 10.8.0.4, TCP port 5001
    TCP window size: 45.0 KByte (default)
    ------------------------------------------------------------
    [  3] local 10.8.0.5 port 39674 connected with 10.8.0.4 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.4 sec  21.0 MBytes  16.9 Mbits/sec
    
    bug 
    opened by finger92 7
  • Gating function averaging

    Gating function averaging

    In our preliminary experiments, all peers have independent gating functions and we can only synchronize them manually. It would be great to implement some sort of built-in averaging mechanism.

    For instance, every T seconds, assemble peers into groups at random, then perform all-reduce within each group. In case of failure, rollback and repeat T seconds later.
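    A minimal sketch of that proposal, with hypothetical helpers (run_allreduce and the peer list are placeholders, not hivemind APIs):

    import random
    import time

    def periodic_gating_averaging(local_params, peers, group_size=4, period=30.0):
        # Every `period` seconds, assemble a random group and average within it;
        # on failure, roll back the local parameters and retry at the next period.
        while True:
            time.sleep(period)
            group = random.sample(peers, k=min(group_size, len(peers)))
            backup = {name: tensor.clone() for name, tensor in local_params.items()}
            try:
                run_allreduce(local_params, group)  # placeholder all-reduce helper
            except Exception:
                local_params.update(backup)  # rollback, repeat T seconds later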

    enhancement help wanted 
    opened by justheuristic 7
  • GPU lost

    GPU lost

    Hi there,

    In some experiments, I face a situation where one GPU is lost during training, and I have to restart the job. Have you ever encountered that issue? Thank you.

    opened by elricwan 6
  • Convert hivemind.Server/RemoteModuleCall/RemoteCallMany to libp2p backend

    Convert hivemind.Server/RemoteModuleCall/RemoteCallMany to libp2p backend

    [depends on #238 being merged] After we've implemented P2P transport with NAT traversal, we should switch the main components to the libp2p backend to take advantage of this new transport.

    One of the three main components is hivemind.server.Server and its counterpart hivemind.client.RemoteExpert.

    On a client side, hivemind creates a RemoteExpert pytorch module that calls experts via _RemoteModuleCall (and _RemoteCallMany for DMoE)

    A server receives incoming connections with several ConnectionHandler processes running in parallel. These processes run gRPC servers and hence should be switched to libp2p.

    • Checklist
      • [x] find some way to attach several processes to one RPC (as in server/connection_handler.py)
      • [ ] make sure it passes tests/test_moe.py
      • [ ] make sure it passes tests/test_training.py
      • [x] tune performance in tests/benchmark_throughput.py
    enhancement server 
    opened by justheuristic 6
  • [BUG] Loss did not decrease in Albert example after 125000 max step.

    [BUG] Loss did not decrease in Albert example after 125000 max step.

    Describe the bug I ran the albert example with wikitext data. I used one peer and default settings (target_batch_size=4096, train_batch_size=4, max_step=125000, lr=0.00176), but the loss did not decrease during training: it starts at 11 and finishes at 11.

    Jan 15 10:30:14.734 [INFO] Step #1 loss = 11.04938
    Jan 15 10:32:14.842 [INFO] Step #2 loss = 11.05589
    Jan 15 10:34:14.975 [INFO] Step #3 loss = 11.06803
    Jan 15 10:36:15.093 [INFO] Step #4 loss = 11.06271
    Jan 15 10:38:15.228 [INFO] Step #5 loss = 11.06433
    Jan 15 10:40:15.337 [INFO] Step #6 loss = 11.05447
    Jan 15 10:41:45.401 [INFO] Step #7 loss = 11.06115
    Jan 15 10:43:45.541 [INFO] Step #8 loss = 11.06025
    ..........
    Jan 15 18:09:13.117 [INFO] Step #238 loss = 11.05597
    Jan 15 18:11:13.233 [INFO] Step #239 loss = 11.06724
    Jan 15 18:13:13.369 [INFO] Step #240 loss = 11.06289
    Jan 15 18:15:13.494 [INFO] Step #241 loss = 11.05922
    Jan 15 18:16:43.577 [INFO] Step #242 loss = 11.05226
    Jan 15 18:18:43.691 [INFO] Step #243 loss = 11.05418
    Jan 15 18:20:43.843 [INFO] Step #244 loss = 11.05638

    To Reproduce Run the script in the albert example. For the monitor, I run:

    python run_training_monitor.py \
    --experiment_prefix albert_experiment \
    --wandb_project albert_wandb

    For trainer, I run:

    IP=/ip4/192.168.0.188/tcp/45731/p2p/QmSRerwCPUSreHhwMuTLHoVHqTfWuT8J57w3sXFZtU8ECo

    WANDB_DISABLED=true CUDA_VISIBLE_DEVICES=0 python run_trainer.py \
    --experiment_prefix albert_experiment \
    --initial_peers $IP \
    --logging_first_step \
    --output_dir ./outputs \
    --overwrite_output_dir \
    --logging_dir ./logs \
    --dataset_path="/home/protago/Xiangpeng/hivemind/examples/albert/data/albert_tokenized_wikitext" \
    --per_device_train_batch_size 4 \
    --learning_rate 0.00176 \
    --num_train_epochs=5 \
    --save_steps=60000

    Environment Please list:

    If the script doesn't work, please report pytorch and numpy versions manually. We also encourage you to include any additional information that you believe can help us solve the issue.

    bug 
    opened by elricwan 5
  • Delayed Parameter Update when step(wait=False)

    Delayed Parameter Update when step(wait=False)

    Is your feature request related to a problem? Please describe.

    Eh, this could be a question. I'm trying to use TrainingAverager with step(wait=False). That requires data_lock and use_old_local_tensor=True follows.

    When use_old_local_tensor=True, is it correct to simply add the weight difference between the local model and the all-reduced model to the new model parameters? The gradients calculated from the old model parameters are being added to the new model parameters. That doesn't seem quite right.

    Describe the solution you'd like

    https://arxiv.org/abs/2101.06840 proposes Delayed Parameter Update (DPU): the parameter update is delayed by one step. Apparently, it makes little difference in the training curve if DPU is applied after 40 iterations in BERT-large training.

    I think that to implement DPU, you simply have to copy the averaged tensors back to the model at the beginning of step().
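    A rough sketch of that DPU idea, assuming a hypothetical averager object with has_result/pop_result/launch_allreduce methods (illustrative names, not the actual hivemind interface):

    import torch

    def dpu_step(model, optimizer, averager, loss):
        # Apply the previous round's averaged parameters first (update delayed by one step)
        if averager.has_result():
            with torch.no_grad():
                for param, averaged in zip(model.parameters(), averager.pop_result()):
                    param.copy_(averaged)
        # Launch the next asynchronous all-reduce on a snapshot of the current weights
        averager.launch_allreduce([p.detach().clone() for p in model.parameters()])
        # Meanwhile, keep taking local optimizer steps as usual
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()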

    Describe alternatives you've considered

    I understand that if the weight difference is not added back, the local steps taken before the asynchronous all-reduce completes are wasted. Not only does this defeat the purpose of asynchronous all-reduce (if local updates are going to be wasted until the async step completes, why not just go synchronous), but it also skips over input data, which could hurt training.

    enhancement help wanted 
    opened by bgyoon 5
  • Set default DHT num_workers = 4

    Set default DHT num_workers = 4

    This change seems to speed up (a) DHT get requests by 3.6x and (b) DHT creation by 1.2x (probably due to speeding up the communication with initial nodes).

    benchmark_dht.py

    nora, this PR, max_workers = 8:

    (benchmark screenshot)

    nora, master (fb4813347a18a01d2c780232a5f86266bbd49d26, see #318), max_workers = 1:

    (benchmark screenshot)
    opened by borzunov 5
  • Tutorial: ALBERT-large collaborative training

    Tutorial: ALBERT-large collaborative training

    Let's implement a basic example for collaborative training with ALBERT

    • core training code ( @leshanbog )

      • [x] implement basic training scripts (run_first_peer/run_trainer) based on mryab/collaborative-training
      • [x] achieve exact match with old training code
      • [x] test fault tolerance against common network failures
    • update metric logging code (@yhn112 )

      • [x] tune first peer's wandb to avoid crashes (or restart on crashes)
      • [x] use the same DHT key prefix for metrics and averaging (aka self.prefix)
      • [x] make the prototype pep8-compliant
    • add basic security layer: (@borzunov )

      • [x] protect value types in:
        • [x] progress
        • [x] averaging
        • [x] metrics
      • [x] ensure that the averager validates reasonable min/max values (e.g. for batches_processed)
      • [x] make sure it supports DataParallel on peers
      • [x] make sure it will save/load scheduler state dict correctly with optimizer
    • add full description in README.md

    opened by justheuristic 5
  • [BUG] Unable to train a bfloat16-compressed model

    [BUG] Unable to train a bfloat16-compressed model

    Describe the bug

    Jan 04 22:30:14.302 [INFO] test-run-1112b accumulated 10 samples for epoch #0 from 2 peers. ETA 0.00 sec (refresh in 0.50 sec)
    Jan 04 22:30:14.476 [INFO] Beginning optimizer step #0
    Jan 04 22:31:26.924 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
        torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
    RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
    Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_P' did not finish.
    Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_Q' did not finish.
    Jan 04 22:31:26.925 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
    Jan 04 22:35:47.094 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
        torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
    RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
    Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_P' did not finish.
    Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_Q' did not finish.
    Jan 04 22:35:47.095 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
    Jan 04 22:40:07.221 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
    
    

    To Reproduce

    git clone https://github.com/the-beee/naifu-diffusion
    cd naifu-diffusion
    pip install -r requirements.txt
    python trainer.py
    

    Please update config/distributed.yaml to include the peer's address in the hivemind section before starting the second peer.

    Environment

    Collecting environment information...
    PyTorch version: 1.13.1+cu117
    Is debug build: False
    CUDA used to build PyTorch: 11.7
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 22.04.1 LTS (x86_64)
    GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.35
    
    Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
    Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
    Is CUDA available: False
    CUDA runtime version: No CUDA
    CUDA_MODULE_LOADING set to: N/A
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] numpy==1.24.1
    [pip3] pytorch-lightning==1.8.6
    [pip3] torch==1.13.1
    [pip3] torch-ema==0.3
    [pip3] torchmetrics==0.11.0
    [pip3] torchvision==0.14.1
    [pip3] hivemind==1.1.4
    [conda] Could not collect
    
    bug 
    opened by the-beee 0
  • Add Codespell to CI, fix typos

    Add Codespell to CI, fix typos

    This PR applies Codespell to the repo and attempts to fix most of the typos found by this tool; the rest are debatable. It also adds Codespell to CI to prevent (or at least highlight) future typos; you can see that it works by navigating the PR diff or the diff for this commit.

    opened by mryab 2
  • Mismatched protobuf versions in sub-dependencies

    Mismatched protobuf versions in sub-dependencies

    When installing hivemind (as a dependency of petals) using pipenv, pipenv failed to resolve a valid version for protobuf. Could not find a version that matches protobuf<4.0.0,<4.0dev,<5.0dev,>=3.12.2,>=3.20.3,>=4.21.6

    Here's the trimmed dependency graph for hivemind to show the conflicts:

    - hivemind [required: ==1.1.3, installed: 1.1.3]
      - grpcio-tools [required: >=1.33.2, installed: 1.51.1]
        - protobuf [required: <5.0dev,>=4.21.6, installed: 3.20.3]
      - protobuf [required: <4.0.0,>=3.12.2, installed: 3.20.3]
    

    I haven't tested if this causes any actual issues, but it looks risky.

    opened by briansemrau 2
  • [BUG] Cyclic references in TaskPool

    [BUG] Cyclic references in TaskPool

    Found in https://github.com/bigscience-workshop/petals/pull/150/files by @borzunov

    TL;DR: ModuleBackends contain TaskPools as properties, but TaskPools refer to the ModuleBackend's instance methods (e.g. self.forward).

    This is harmless for run_server, but it can potentially cause memory leaks if the server is deleted and recreated.
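    An illustrative sketch of the cycle with simplified stand-in classes (not the actual hivemind implementation):

    class TaskPool:
        def __init__(self, process_fn):
            # a bound method keeps a strong reference to the object that owns it
            self.process_fn = process_fn

    class ModuleBackend:
        def __init__(self):
            # backend -> pool -> backend.forward -> backend: a reference cycle,
            # so these objects survive until the cycle collector runs (if ever)
            self.forward_pool = TaskPool(self.forward)

        def forward(self, batch):
            return batch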

    bug 
    opened by justheuristic 0
  • Read {run_id}_progress from DHT manually throws exceptions

    Read {run_id}_progress from DHT manually throws exceptions

    Hi,

    I can't seem to be able to read the training information (like here) out of the DHT that was created by hivemind.

    I can connect to the DHT and run the following:

    > dht.store("key", "value", expiration=get_dht_time() + 600)
    > dht.get("key")
    ValueWithExpiration(value='value', expiration_time=1670845892.2483625)
    

    However, when training with hivemind, I can't get the data; I see two different behaviors when calling the get function twice in a row.

    Only the second call shows some actual training progress data, but it is incomplete (1 out of 4 peers) and not in a form I can access as described in the documentation.

    It seems there is some issue with the get call being run asynchronously and failing to decode the returned LocalTrainingProgress.

    How does the tutorial data get/store differ from what hivemind does with the LocalTrainingProgress?

    First call to get

    >>> dht.get("hivemind-123_progress")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py", line 173, in get
        return future if return_future else future.result()
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/utils/mpfuture.py", line 257, in result
        return super().result(timeout)
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 446, in result
        return self.__get_result()
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
        raise self._exception
    msgpack.exceptions.ExtraData: unpack(b) received extra data.
    

    Second call to get

    >>> dht.get("hivemind-123_progress")
    Dec 12 12:43:20.841 [ERROR] [asyncio._run:129] Task exception was never retrieved
    future: <Task finished name='Task-13381' coro=<DHT._get() done, defined at /home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py:175> exception=ExtraData({'peer_id': b"\x12 W\xb23\xa4\x85\xd0\xfa\xad\n[t\xec\xc7\xfe'\xed\x1d\x94\x03\n\xf6\x11e\xf4\xe3j,\xf7\xae\xd5h\xca", 'epoch': 24, 'samples_accumulated': 0, 'samples_per_second': 10.078083213276257, 'time': 1670842945.1815588, 'client_mode': False}, b'[signature:P3NGbBDc4ujJwy2afKJSEXD/lsM1s7icix+h5LoxGk1K6ZFvq5vaf7vs4mokUm0TmYbeGMq85DV1M3nr/+lrVg/WGAtC3moq9iiigaKiNnhszcZPx1ls+UOoIbZXGh35kdIzCIr2qsV9GxheuPaohErMoEzxN+kAytZ+wEtxoxEgOCAXEdOGVmee0Dx6eIQVzs96d7aIEpucNLGRu8ylOvgjcZNOu+MMyqVTom3R6yvl8RRTh3Dj/0cS7a0ajo+osIx7ENIadL8Zh8Vqmw+evLR2dZhAULYhN/wq1C/8dNYZzM1C2spbjG9hMYlD33RUhmD0gE+rWP0OKHA7vUPtSA==]')>
    Traceback (most recent call last):
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py", line 177, in _get
        result = await self._node.get(key, latest=latest, **kwargs)
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 543, in get
        result = await self.get_many([key], **kwargs)
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 565, in get_many
        results_by_id = await self.get_many_by_id(key_ids, sufficient_expiration_time, **kwargs)
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 620, in get_many_by_id
        search_results[key_id].add_candidate(self.protocol.storage.get(key_id), source_node_id=self.node_id)
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 844, in add_candidate
        self.finish_search()
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/node.py", line 873, in finish_search
        self.serializer.loads(value_bytes), item_expiration_time
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/utils/serializer.py", line 72, in loads
        return msgpack.loads(buf, ext_hook=cls._decode_ext_types, raw=False)
      File "msgpack/_unpacker.pyx", line 201, in msgpack._cmsgpack.unpackb
    msgpack.exceptions.ExtraData: unpack(b) received extra data.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/dht/dht.py", line 173, in get
        return future if return_future else future.result()
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/site-packages/hivemind/utils/mpfuture.py", line 257, in result
        return super().result(timeout)
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 446, in result
        return self.__get_result()
      File "/home/ubuntu/miniconda3/envs/conda-hivemind/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
        raise self._exception
    msgpack.exceptions.ExtraData: unpack(b) received extra data.
    
    opened by cirquit 1
Releases(1.1.4)
  • 1.1.4(Dec 2, 2022)

    What's Changed

    • Update p2pd to v0.3.13 by @borzunov in https://github.com/learning-at-home/hivemind/pull/527

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.3...1.1.4

  • 1.1.3(Nov 29, 2022)

    What's Changed

    • Update moe.md by @cirquit in https://github.com/learning-at-home/hivemind/pull/516
    • Fix "unable to open shared memory" while using MPFuture by @borzunov in https://github.com/learning-at-home/hivemind/pull/517
    • Fix MPFuture failing outside inference mode by @borzunov in https://github.com/learning-at-home/hivemind/pull/521
    • Bump torch to >=1.9.0 by @borzunov in https://github.com/learning-at-home/hivemind/pull/522
    • Fix P2PDaemon's idle timeout by @borzunov in https://github.com/learning-at-home/hivemind/pull/523
    • Support torch.bfloat16 in hivemind.compression by @borzunov in https://github.com/learning-at-home/hivemind/pull/524
    • Remove stale PeerIDs in hivemind-dht's routing table by @borzunov in https://github.com/learning-at-home/hivemind/pull/525

    New Contributors

    • @cirquit made their first contribution in https://github.com/learning-at-home/hivemind/pull/516

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.2...1.1.3

  • 1.1.2(Oct 19, 2022)

    What's Changed

    • Forbid protobuf 4.x in requirements by @justheuristic in https://github.com/learning-at-home/hivemind/pull/508
    • Check if identity is already taken by @borzunov in https://github.com/learning-at-home/hivemind/pull/511
    • Add Petals to "Example Use Cases" by @borzunov in https://github.com/learning-at-home/hivemind/pull/512
    • Follow up #501 and #511 with minor fixes by @borzunov in https://github.com/learning-at-home/hivemind/pull/513
    • Update bitsandbytes, relax its version constraint by @mryab in https://github.com/learning-at-home/hivemind/pull/510

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.1...1.1.2

  • 1.1.1(Sep 13, 2022)

    What's Changed

    • Handle errors in Runtime by @justheuristic in https://github.com/learning-at-home/hivemind/pull/489
    • metadata type changed to bytes by @GreenFatGuy in https://github.com/learning-at-home/hivemind/pull/491
    • fix: Parameter Averaging quickstart clarification by @IAL32 in https://github.com/learning-at-home/hivemind/pull/492
    • Make DHT ignore SIGINT by @dbaranchuk in https://github.com/learning-at-home/hivemind/pull/493
    • Update README with latest projects and publications by @mryab in https://github.com/learning-at-home/hivemind/pull/494
    • Add links to "Example Use Cases" by @borzunov in https://github.com/learning-at-home/hivemind/pull/497
    • Support bfloat16 for autograd by @dbaranchuk in https://github.com/learning-at-home/hivemind/pull/499
    • Remove libp2p handlers when ConnectionHandler, DHT, and DecentralizedAverager are shut down by @borzunov in https://github.com/learning-at-home/hivemind/pull/501
    • Fix PyTorch warning suppression by @borzunov in https://github.com/learning-at-home/hivemind/pull/502
    • Fix a potential deadlock in await_asynchronously with nested locks by @justheuristic in https://github.com/learning-at-home/hivemind/pull/503
    • Require TaskPoolBase to implement load_batch_to_runtime by @justheuristic in https://github.com/learning-at-home/hivemind/pull/506
    • Change runtime.py to choose tasks with lowest (instead of highest) priority by @justheuristic in https://github.com/learning-at-home/hivemind/pull/505
    • Add support for quantization with bitsandbytes by @mryab in https://github.com/learning-at-home/hivemind/pull/490

    New Contributors

    • @IAL32 made their first contribution in https://github.com/learning-at-home/hivemind/pull/492
    • @dbaranchuk made their first contribution in https://github.com/learning-at-home/hivemind/pull/493

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.1.0...1.1.1

  • 1.1.0(Jun 20, 2022)

    Release highlights

    • Starting from this release, all components of hivemind.moe use libp2p for communication. This comes with the same benefits as in averaging and DHT previously (simplified NAT traversal, better performance, etc.) and marks the end of gRPC usage in hivemind. The user API is mostly the same: if you were using abstractions like RemoteMixtureOfExperts, the code should not be changed, although cross-release training is not possible.
    • If you need another way to reduce the network footprint during training with hivemind.Optimizer, you can now use PowerSGD for gradient averaging. This method decreases the communication costs by factorizing the gradients of the model and aggregating the factorized versions. To enable this method in your code, pass grad_averager_factory=partial(PowerSGDGradientAverager, averager_rank=RANK) when creating an instance of Optimizer. Here, RANK denotes the factorization rank; lower values give higher compression at the cost of the reconstruction quality (see the sketch after this list).
    • Similarly to hivemind-server, it is now possible to launch a dedicated DHT instance with a command-line tool. The tool, available via hivemind-dht, can be used to quickly create a lightweight peer that is used mostly for connecting others to the DHT (for example, on a publicly available server) or for DHT metadata replication.
    • Previously, restarting a libp2p instance required generating a new P2P identity, which resulted in a new multiaddress. Thus, it was difficult to use the same command to connect to a peer in case of repeated launches, which is often the case during debugging. Now, you can store the persistent peer identity of a peer in a file and reuse it between launches: this is done by specifying the --identity_path argument, available both in the ALBERT example and CLI tools of hivemind.
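    As an illustration of the PowerSGD option above, a minimal setup might look like the sketch below; the toy model, run_id, batch sizes, and averager_rank are placeholder values, and the import path is an assumption based on the hivemind.optim.power_sgd_averager module:

    from functools import partial

    import torch
    import hivemind
    from hivemind.optim.power_sgd_averager import PowerSGDGradientAverager

    model = torch.nn.Linear(16, 2)  # toy model standing in for a real network
    dht = hivemind.DHT(start=True)  # or pass initial_peers=... to join an existing run

    optimizer = hivemind.Optimizer(
        dht=dht,
        run_id="demo_run",
        batch_size_per_step=32,
        target_batch_size=4096,
        optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
        grad_averager_factory=partial(PowerSGDGradientAverager, averager_rank=32),
    )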

    Deprecations

    • The parameters quic, use_relay_hop, and use_relay_discovery of hivemind.P2P are deprecated since our update of the libp2p dependency in the p2p daemon. They will be removed in the 1.2.0 release of hivemind

    What's Changed

    • Pin pytest version in requirements-dev, use file_descriptor in tests by @justheuristic in https://github.com/learning-at-home/hivemind/pull/454
    • Pin isort version, bump black by @mryab in https://github.com/learning-at-home/hivemind/pull/456
    • Clean compression/init.py by @borzunov in https://github.com/learning-at-home/hivemind/pull/460
    • Do not use offload_optimizer with local_updates by deafult by @foksly in https://github.com/learning-at-home/hivemind/pull/462
    • Add PowerSGD for compressed gradient averaging by @artek0chumak in https://github.com/learning-at-home/hivemind/pull/432
    • Bump Black to 22.3.0, pin Golang version by @mryab in https://github.com/learning-at-home/hivemind/pull/466
    • use_local_updates in optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/468
    • Update p2pd to v0.3.8 (and libp2p to v0.17.0) by @borzunov in https://github.com/learning-at-home/hivemind/pull/469
    • Generate new private key if identity file doesn't exist by @borzunov in https://github.com/learning-at-home/hivemind/pull/473
    • Convert hivemind.server to libp2p backend by @GreenFatGuy in https://github.com/learning-at-home/hivemind/pull/470
    • Implement a CLI for hivemind.DHT by @mryab in https://github.com/learning-at-home/hivemind/pull/465
    • Use PeerID exclusively to address MoE experts by @justheuristic in https://github.com/learning-at-home/hivemind/pull/479
    • Remove deprecated code in hivemind.optim and hivemind.averaging before the 1.1.0 release by @mryab in https://github.com/learning-at-home/hivemind/pull/480
    • Fix shape validation in GradientAverager by @mryab in https://github.com/learning-at-home/hivemind/pull/481
    • Change expiration time in declare_experts, fix update_period discrepancy by @justheuristic in https://github.com/learning-at-home/hivemind/pull/482
    • Add identity_path option for MoE.Server runners by @GreenFatGuy in https://github.com/learning-at-home/hivemind/pull/484
    • Simplify ExpertBackend interface by @justheuristic in https://github.com/learning-at-home/hivemind/pull/483
    • Clean up imports, remove unused utils by @mryab in https://github.com/learning-at-home/hivemind/pull/486
    • finish renaming experts -> module_backends in ConnectionHandler by @justheuristic in https://github.com/learning-at-home/hivemind/pull/487
    • Remove gRPC services and grpcio requirement by @mryab in https://github.com/learning-at-home/hivemind/pull/485

    New Contributors

    • @GreenFatGuy made their first contribution in https://github.com/learning-at-home/hivemind/pull/470

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.0.1...1.1.0

  • 1.0.1(Feb 7, 2022)

    What's Changed

    • Improve user-friendliness and fix misc errors in Optimizer, Averager and P2P by @justheuristic @pr-Mais @borzunov @mrseeker @mryab in https://github.com/learning-at-home/hivemind/pull/428
    • Skip gradient averaging if there are no other peers by @justheuristic @soodoshll @borzunov in https://github.com/learning-at-home/hivemind/pull/440
    • Move hivemind.Server from init, streamline imports by @mryab in https://github.com/learning-at-home/hivemind/pull/441
    • Change make_empty to make_zeros for TensorDescriptor by @mryab in https://github.com/learning-at-home/hivemind/pull/442
    • Fix offloaded optimizer with single peer by @justheuristic @elricwan @borzunov in https://github.com/learning-at-home/hivemind/pull/450
    • Fix "too many open files" issue by @yhn112 in https://github.com/learning-at-home/hivemind/pull/444

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/1.0.0...1.0.1

  • 1.0.0(Dec 20, 2021)

    What's Changed

    • Fix averager speed for TCP connections by @borzunov in https://github.com/learning-at-home/hivemind/pull/373
    • Fix "Too many open files" and load state freezing by @justheuristic in https://github.com/learning-at-home/hivemind/pull/371
    • Prefetch while reading rpc_aggregate_part() outputs by @borzunov in https://github.com/learning-at-home/hivemind/pull/370
    • Use ModeClient in libp2p DHT in case of --client_mode by @borzunov in https://github.com/learning-at-home/hivemind/pull/374
    • Integrate p2pd logs and outputs into hivemind logging by @borzunov in https://github.com/learning-at-home/hivemind/pull/375
    • Split compression strategies into separate classes by @justheuristic in https://github.com/learning-at-home/hivemind/pull/366
    • Implement colored logs by @borzunov in https://github.com/learning-at-home/hivemind/pull/377
    • Parametrize max message size for persistent connections by @deniskamazur in https://github.com/learning-at-home/hivemind/pull/376
    • Make log handlers configurable, shorten entries by @borzunov in https://github.com/learning-at-home/hivemind/pull/378
    • Enable log handler in benchmarks and run_server by @borzunov in https://github.com/learning-at-home/hivemind/pull/380
    • Fix step_tolerance in CollaborativeOptimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/383
    • Fix pickle vulnerability by @deniskamazur in https://github.com/learning-at-home/hivemind/pull/386
    • Remove arguments with default values from example instructions by @borzunov in https://github.com/learning-at-home/hivemind/pull/388
    • Implement weight as part of the allreduce protocol, not matchmaking by @justheuristic in https://github.com/learning-at-home/hivemind/pull/384
    • Support different AMP & buffer configurations in one experiment, fix minor bugs by @justheuristic in https://github.com/learning-at-home/hivemind/pull/389
    • Fix codecov_in_develop_mode with pip>=21.2 by @justheuristic in https://github.com/learning-at-home/hivemind/pull/393
    • Fix minor issues in documentation by @borzunov in https://github.com/learning-at-home/hivemind/pull/392
    • Apply averager updates asynchronously by @justheuristic in https://github.com/learning-at-home/hivemind/pull/395
    • Fix schema typing by @justheuristic in https://github.com/learning-at-home/hivemind/pull/396
    • backport PerformanceEMA from server_side_averaging by @justheuristic in https://github.com/learning-at-home/hivemind/pull/397
    • Add an option to pre-schedule averaging by @justheuristic in https://github.com/learning-at-home/hivemind/pull/398
    • Move DHT to dht/dht.py, update DHT figure by @justheuristic in https://github.com/learning-at-home/hivemind/pull/399
    • [hotfix] replace StepControl.can_modify with began_allreduce by @justheuristic in https://github.com/learning-at-home/hivemind/pull/402
    • move PerformanceEMA to utils, TrainingAverager to optim, update utils by @justheuristic in https://github.com/learning-at-home/hivemind/pull/405
    • Add GradientAverager with support for delayed averaging by @justheuristic in https://github.com/learning-at-home/hivemind/pull/404
    • [hivemind.Optimizer] TrainingStateAverager by @justheuristic in https://github.com/learning-at-home/hivemind/pull/407
    • Catch OSError in MPFuture by @artek0chumak in https://github.com/learning-at-home/hivemind/pull/409
    • [hivemind.Optimizer] ProgressTracker by @justheuristic in https://github.com/learning-at-home/hivemind/pull/408
    • Fix minor bugs in GradientAverager by @justheuristic in https://github.com/learning-at-home/hivemind/pull/410
    • Make target group size optional by @justheuristic in https://github.com/learning-at-home/hivemind/pull/412
    • Prepare GradScaler for hivemind.Optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/413
    • Patch recursive cancel in StepControl by @justheuristic in https://github.com/learning-at-home/hivemind/pull/411
    • Replace the invalid link to discord by @artek0chumak in https://github.com/learning-at-home/hivemind/pull/414
    • Implement state sharing priority by @justheuristic in https://github.com/learning-at-home/hivemind/pull/415
    • Implement core functionality of hivemind.Optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/403
    • DHT Benchmark with asynchronous w/r by @MuXauJl11110 in https://github.com/learning-at-home/hivemind/pull/406
    • Hotfix: load_state_from_peers with offload_optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/417
    • Improve Optimizer docs, update quickstart to use Optimizer by @justheuristic in https://github.com/learning-at-home/hivemind/pull/416
    • Quickstart: typos and references by @justheuristic in https://github.com/learning-at-home/hivemind/pull/420
    • Remove trailing dots in log messages and errors by @borzunov in https://github.com/learning-at-home/hivemind/pull/419
    • Do not log caller for INFO messages by @borzunov in https://github.com/learning-at-home/hivemind/pull/418
    • Improve hivemind.optim.experimental and averager stability by @borzunov in https://github.com/learning-at-home/hivemind/pull/421
    • Add minor tweaks learned from the NeurIPS demo run by @justheuristic in https://github.com/learning-at-home/hivemind/pull/422
    • Improve All-Reduce fault-tolerance by @justheuristic in https://github.com/learning-at-home/hivemind/pull/423
    • Fix All-Reduce fault-tolerance: catch Exception instead of BaseException by @justheuristic in https://github.com/learning-at-home/hivemind/pull/424
    • Fix Task was destroeyd but is pending (put items) by @justheuristic in https://github.com/learning-at-home/hivemind/pull/427
    • Use hivemind.Optimizer in examples/albert by @mryab in https://github.com/learning-at-home/hivemind/pull/426

    New Contributors

    • @artek0chumak made their first contribution in https://github.com/learning-at-home/hivemind/pull/409
    • @MuXauJl11110 made their first contribution in https://github.com/learning-at-home/hivemind/pull/406

    Full Changelog: https://github.com/learning-at-home/hivemind/compare/0.10.0...1.0.0

  • 0.10.0(Aug 26, 2021)

    This release contains the following new features and bugfixes:

    • Fix deadlocks in DecentralizedAverager and MPFuture (#331) (@borzunov @justheuristic)
    • Resolve deadlock in MPFuture (#337) (@justheuristic @borzunov @yhn112)
    • Convert averager to libp2p backend (#323) (@borzunov @mryab)
    • Refactor naming and serialization for PeerIDs (#339) (@borzunov)
    • Set default DHT num_workers = 4 (#342) (@borzunov @deniskamazur @justheuristic @mryab)
    • Fix typo in dht.md (#345) (@justheuristic)
    • Fix some warnings related to asyncio (#346) (@borzunov)
    • Speed up P2P client creation (#343) (@deniskamazur @borzunov)
    • Propagate startup errors from DHT and averager processes (#347) (@borzunov)
    • Add less comparator for PeerID (#353) (@deniskamazur @borzunov)
    • Fix minor asyncio issues in averager (#356) (@borzunov @justheuristic)
    • Optimize unary handlers with persistent connections to P2P daemon (#328) (@deniskamazur)
    • Fix import error breaking AllReduceRunner._send_error_to_peer() (#360) (@borzunov)
    • Fix logger warning in P2P (#361) (@borzunov)
    • Disable QUIC (#355) (@borzunov)
    • Disable elasticity for averaging, add error handling (#362) (@justheuristic @mryab)
    • Improve Matchmaking finalizers (#357) (@borzunov)
    • Allow to specify P2P identity file (#363) (@borzunov)
    • Fix loglevel for a message in _read_from_persistent_conn() (#364) (@borzunov)
  • 0.9.10(Jul 16, 2021)

    This release contains the following features and bugfixes:

    • Add p2pd to package_data (#287) (@mryab)
    • Add per-tensor compression, make All-Reduce faster and more flexible (#272) (@justheuristic @mponty @mryab @yhn112 @borzunov)
    • Fix race condition while reserving ports in P2P (#299) (@borzunov)
    • Add graceful shutdown to DHT and Averager (#301) (@justheuristic @mryab)
    • Make checkpointing optional in example (#303) (@yhn112)
    • Refactor MPFuture to use a single pipe/thread per process (#298) (@justheuristic @borzunov @mryab @yhn112)
    • Split hivemind.client into hivemind.averaging and hivemind.moe (#304) (@mryab)
    • Update readthedocs with hivemind.optim (#288) (@yhn112 @justheuristic)
    • Minor fixes in examples/albert (#308) (@yhn112)
    • Upload the model with push_to_hub in examples (#297) (@leshanbog @mryab @justheuristic)
    • Account for multi-gpu devices in examples/albert (#309) (@justheuristic)
    • Convert DHT to libp2p backend (#296) (@borzunov @skobellev)
    • Simplify argument parsing, update docs in ALBERT example (#315) (@mryab @justheuristic @yhn112)
    • Improve P2P handler throughput and interface (#316) (@borzunov)
    • Remove shared memory from MPFuture, fix minor bugs (#317) (@justheuristic @borzunov @mryab)
    • Implement protobuf-based stream handlers over libp2p backend (#318) (@borzunov)
    • Refactor for v0.9.10 and fix example (#319) (@justheuristic @borzunov)
    • Update quickstart tutorials and acknowledgements (#307) (@justheuristic @yhn112 @borzunov @mryab)
    Source code(tar.gz)
    Source code(zip)
  • 0.9.9(Jun 22, 2021)

    This release contains the following improvements and bugfixes:

    • Add relay options to P2P (#268) (@deniskamazur)
    • Add packaging to requirements (#269) (@deniskamazur)
    • Disable p2pd compilation by default (#270) (@yhn112 @justheuristic)
    • Measure testing coverage on pull request (#271) (@yhn112)
    • Update p2pd md5 checksum (#273) (@deniskamazur)
    • Use logging in benchmarks, fix libp2p-related issues (#280) (@justheuristic)
    • Add BibTeX reference for the library to README (#283) (@mryab)
    • Fix Codecov (#282) (@yhn112)
    • Remove use of packaging module (#284) (@borzunov)
    • Support auxiliary peers in CollaborativeOptimizer (#279) (@yhn112 @justheuristic @mryab)
    Source code(tar.gz)
    Source code(zip)
  • 0.9.8(Jun 7, 2021)

    This release contains the following improvements and bugfixes:

    • Implement combining validators (#249) (@borzunov)
    • Decentralized adaptive optimizers (#243) (@nevec)
    • Add nltk to ALBERT example's requirements (#251) (@borzunov)
    • Protect training progress and metrics with signatures and DHT schema validation (#250) (@borzunov)
    • Add state checkpointing and uploading in coordinator (#241) (@leshanbog @mryab)
    • Fix random freezes in averager.step, improve error handling (#254) (@justheuristic @yhn112 @borzunov @mryab)
    • Fix device in Switch-MoE, overhaul Server architecture (#256) (@mryab)
    • Log more stats for user, move performance stats to examples (#257) (@yhn112)
    • Implement authorization for a moderated Hivemind network (#255) (@borzunov)
    • Improve error handling, remove deprecated functionality (#261) (@justheuristic @mryab)
    • Log correct loss in examples/albert/run_first_peer.py (#265) (@borzunov)
    • Fix NaN when compressing a tensor of zeros (#266) (@Vsevolod-pl)
    • Support auxiliary participants in AllReduceProtocol (#260) (@foksly)
    • Log collaboration step to Wandb, store metrics only if peer is synchronized (#267) (@borzunov @yhn112 @justheuristic)
    • Add initial support for connecting via libp2p (#238) (@MaximKsh @deniskamazur @skobellev @leshanbog @borzunov @mryab @yhn112)
    Source code(tar.gz)
    Source code(zip)
  • 0.9.7(Apr 27, 2021)

    This release contains the following improvements and bugfixes:

    • Add RSA signature protection for DHT records (#187) (@borzunov)
    • Improve Runtime exception handling (#207) (@mryab)
    • Implement basic decentralized optimizers (#210) (@justheuristic, @mryab)
    • Add gradient clipping support to ExpertBackend (#214) (@mryab)
    • Convert SerializerBase to an abstract class (#212) (@mryab)
    • Replace FeedforwardBlock with a correct implementation (#211) (@mryab)
    • Disentangle DecentralizedAverager components, add averaging weights (#217) (@justheuristic @mryab)
    • Add CollaborativeOptimizer, TrainingAverager (#215) (@leshanbog @nevec @mryab)
    • Move compression-related code to hivemind.utils.compression (#213) (@mryab)
    • Prevent DecentralizedSGD from accidentally skipping a fraction of training batches (#218) (@ploshkin)
    • Add uniform compression (#202) (@mponty)
    • Add gradient buffers to CollaborativeOptimizer (#220) (@justheuristic)
    • Improve zero_grad behavior in CollaborativeOptimizer (#221) (@justheuristic)
    • Reset gradient buffers when synchronizing with peers (#222) (@justheuristic)
    • Add tool for custom user experts (#189) (@romakail @justheuristic)
    • Delta gradients transmission (#225) (@Vsevolod-pl)
    • Statistics averaging (#229) (@nevec)
    • Ensure version-consistent result rounding in load_balance_peers (#230) (@justheuristic @mryab)
    • Add Switch Transformers-like RemoteMixtureOfExperts (#228) (@mryab)
    • Add example for collaborative ALBERT training (#226) (@leshanbog @yhn112 @nevec @mryab)
    • Fix loss metric calculation (#240) (@yhn112)
    • Add DHT schema validator (#227) (@borzunov)
    • Fix server hanging in certain cases when connection is lost (#247) (@justheuristic)
    • Add Dockerfile, refactor tests (#245) (@mryab)
    • Fix incorrect data types/values in RemoteSwitchMixtureOfExperts (#246) (@mryab)
    Source code(tar.gz)
    Source code(zip)
  • 0.9.6(Apr 2, 2021)

    This release adds several new features:

    • Client-only averaging in AllReduce (#176)
    • Expert learning rate scheduling (#196)
    • Quantile compression (#182)

    Also, this release contains the following fixes and improvements:

    • Fix scalar deserialization (#190)
    • Extract expert-specific methods from DHT (#192)
    Source code(tar.gz)
    Source code(zip)
  • 0.9.5(Mar 5, 2021)

    This release fixes several known bugs and security vulnerabilities:

    • Copytree implementation for py37 compatibility (#162)
    • Remove pickle.loads in Averager (#160)
    • Support edge cases for DHT key/subkey/value (#167)
    • Fix the remaining tests for py37 (#166)
    • Move Averager metadata serialization out of user scope (#168)
    • Handle edge cases in DecentralizedAverager (#171)
    • Fix a typo in quickstart.md (#174)
    • Serialize DHTID source with msgpack (#172)
    • Move CLI server launch script to hivemind/hivemind_cli (#173)
    Source code(tar.gz)
    Source code(zip)
  • 0.9.0(Feb 28, 2021)

    • Implement DecentralizedAverager for averaging model parameters & statistics across DHT peers (#119 #123 #134 #140 #141) (see the sketch after this list)
    • Accelerate RemoteMixtureOfExperts beam search with new key structure (#97 #101 #103 #109)
    • Implement lossy compression algorithms for tensors (#102 #106 #112)
    • Detect anomalies in RemoteMixtureOfExperts (#132)
    • Configure gRPC channels for long-term stability (#129 #131)
    • Load expert checkpoints on server startup (#138)
    • Support attention mask in example TransformerEncoder layer (#126)
    • Add the contribution guide (#156)
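
    DecentralizedAverager lets each peer keep a local copy of its tensors and periodically average them within small, dynamically formed groups instead of running a global all-reduce. A minimal sketch, assuming two or more peers share one DHT (the prefix and group size are illustrative):

    import torch
    import hivemind

    dht = hivemind.DHT(start=True)

    # Each peer holds its own copy of the tensors it wants to average
    local_tensors = [torch.randn(8), torch.randn(4, 4)]

    averager = hivemind.DecentralizedAverager(
        averaged_tensors=local_tensors,   # tensors to average with other peers
        dht=dht,
        prefix="demo_averaging",          # peers with the same prefix average together
        target_group_size=4,              # form groups of up to this many peers
        start=True,
    )

    # Run one averaging round; blocks until a group is formed or the timeout expires
    averager.step(timeout=30.0)

    # Read the (now averaged) tensors back under the averager's lock
    with averager.get_tensors() as tensors:
        print([tensor.shape for tensor in tensors])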

    Bugfixes:

    • Fix wrong getattr in hivemind.Server (#122)

    Enhancements:

    • Support Python 3.9 and torch 1.7 (#142)
    • Blacklist nonresponsive peers with exponential backoff (#114)
    • Reuse grpc channels between calls (#120)
    • Verify DHT peer accessibility and local clock (#137)
    • Improve logging, remove duplicate log entries (#135)
    • Improve test coverage (#116)
    Source code(tar.gz)
    Source code(zip)
  • 0.8.2(Aug 28, 2020)

  • 0.8.1(Aug 27, 2020)

    Minor update:

    • You can now create a minimalistic hivemind server via ./script/run_server.py (@Vsevolod-pl)
    • ./script/run_server.py can now sample experts from a pre-defined grid, e.g. expert.[0:256].[0:256]
    • Added a quickstart tutorial (@justheuristic)
    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(Aug 23, 2020)

    • Speed up tests, shutdown threads in server via threading.Event
    • Compile protobuf in setup.py
    • Update circleci pipelines
    • Update RTD pipeline
    • Refactor custom build_ext into install and develop
    Source code(tar.gz)
    Source code(zip)