Pytorch Lightning Distributed Accelerators using Ray

Overview

Distributed PyTorch Lightning Training on Ray

This library adds new PyTorch Lightning accelerators for distributed training using the Ray distributed computing framework.

These PyTorch Lightning Accelerators on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either PyTorch Distributed Data Parallel or Horovod.

Once you add your accelerator to the PyTorch Lightning Trainer, you can parallelize training across all the cores on your laptop, or across a massive multi-node, multi-GPU cluster, with no additional code changes.

This library also comes with an integration with Ray Tune for distributed hyperparameter tuning experiments.

Installation

You can install the master branch of ray_lightning_accelerators like so:

pip install git+https://github.com/ray-project/ray_lightning_accelerators#ray_lightning
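
To verify the install, a minimal sanity check (an illustrative snippet, not part of the official docs) is to start a local Ray runtime and construct an accelerator:

import ray
from ray_lightning import RayAccelerator

# Start a local Ray runtime; pass an address here to attach to an existing cluster instead.
ray.init()

# A small CPU-only accelerator, just to confirm the package imports and constructs correctly.
accelerator = RayAccelerator(num_workers=2, cpus_per_worker=1, use_gpu=False)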

PyTorch Distributed Data Parallel Accelerator on Ray

The RayAccelerator provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

Here is a simplified example:

import pytorch_lightning as pl
from ray_lightning import RayAccelerator

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
accelerator = RayAccelerator(num_workers=4, cpus_per_worker=1, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., gpus=1, accelerator=accelerator)
trainer.fit(ptl_model)
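
The accelerator above launches num_workers Ray actors, each reserving cpus_per_worker CPUs and, with use_gpu=True, one GPU. As a rough sketch (not a library API), you can confirm that the connected Ray runtime has enough resources before calling fit:

import ray

ray.init()  # or ray.init(address="auto") to attach to a running cluster

# The accelerator above asks for 4 workers x (1 CPU + 1 GPU).
resources = ray.cluster_resources()
assert resources.get("CPU", 0) >= 4
assert resources.get("GPU", 0) >= 4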

Because Ray is used to launch the worker processes (instead of the same script being called multiple times), you can use this accelerator even in cases where the standard DDPAccelerator cannot be used, such as:

  • Jupyter Notebooks, Google Colab, Kaggle
  • Calling fit or test multiple times in the same script
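
For example, fit and test can be called back to back in the same script, a pattern that script-relaunch DDP does not handle. A minimal sketch continuing the example above:

# Both calls work in one script because Ray, not script relaunching, starts the worker processes.
trainer.fit(ptl_model)
trainer.test(ptl_model)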

Horovod Accelerator on Ray

Or if you prefer to use Horovod as the distributed training protocol, use the HorovodRayAccelerator instead.

import pytorch_lightning as pl
from ray_lightning import HorovodRayAccelerator

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)

# 2 nodes, 4 workers per node, each using 1 CPU and 1 GPU.
accelerator = HorovodRayAccelerator(num_hosts=2, num_slots=4, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_slots``.
trainer = pl.Trainer(..., gpus=1, accelerator=accelerator)
trainer.fit(ptl_model)
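
With the settings above, Horovod runs num_hosts x num_slots = 8 workers in total, each using 1 CPU and 1 GPU. For a quick single-machine run, a sketch along these lines (argument semantics assumed from the example above) should be enough:

# 1 host with 4 worker slots, CPU-only training on a single machine.
accelerator = HorovodRayAccelerator(num_hosts=1, num_slots=4, use_gpu=False)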

Multi-node Distributed Training

Using the same examples above, you can run distributed training on a multi-node cluster with just 2 simple steps.

  1. Use Ray's cluster launcher to start a Ray cluster: ray up my_cluster_config.yaml.
  2. Execute your Python script on the Ray cluster: ray submit my_cluster_config.yaml train.py. This will rsync your training script to the head node and execute it on the Ray cluster.

You no longer have to set environment variables or configurations and run your training script on every single node.

Hyperparameter Tuning with Ray Tune

ray_lightning also integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed model training. You can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration and each itself parallelized across workers. All you have to do is move your training code into a function, pass the function to tune.run, and make sure to add the appropriate callback (either TuneReportCallback or TuneReportCheckpointCallback) to your PyTorch Lightning Trainer.

Example using ray_lightning with Tune:

import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback
from ray_lightning import RayAccelerator


def train_mnist(config):
    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune reporting callback.
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]

    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        accelerator=RayAccelerator(num_workers=4, use_gpu=False))
    trainer.fit(model)


config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Number of hyperparameter configurations to sample.
num_samples = 4

# Make sure to specify how many actors each training run will create
# via the "extra_cpu" field.
analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=num_samples,
    resources_per_trial={
        "cpu": 1,
        "extra_cpu": 4
    },
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)

FAQ

RaySGD already has a Pytorch Lightning integration. What's the difference between this integration and that one?

The key difference is which Trainer you'll be interacting with. In this library, you will still be using Pytorch Lightning's Trainer. You'll be able to leverage all the features of Pytorch Lightning, and Ray is used just as a backend to handle distributed training.

With RaySGD's integration, you'll be converting your LightningModule to be RaySGD compatible, and will be interacting with RaySGD's TorchTrainer. RaySGD's TorchTrainer is not as feature-rich or as easy to use as Pytorch Lightning's Trainer (no built-in support for logging, early stopping, etc.). However, it does have built-in support for fault-tolerant and elastic training. If these are hard requirements for you, then RaySGD's integration with PTL might be a better option.

I see that RayAccelerator is based on Pytorch Lightning's DDPSpawnAccelerator. However, doesn't the PTL team discourage the use of spawn?

As discussed here, using a spawn approach instead of a launch approach is not all that detrimental. The original reasons for discouraging spawn were:

  1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
  2. not being able to use multiple workers for data loading.

Neither of these should be an issue with the RayAccelerator due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this accelerator, your model does have to be serializable/pickleable.
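
If you are unsure whether your model meets that requirement, a quick check (an illustrative snippet, not a library API) is to round-trip the model from the examples above through Ray's pickle implementation before handing it to the Trainer:

from ray import cloudpickle

# Raises an error if the LightningModule (or anything it references) cannot be pickled.
cloudpickle.loads(cloudpickle.dumps(ptl_model))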

Comments
  • Unable to use both GPUs

    Unable to use both GPUs

    Hello, thanks for this amazing library for Lightning!

    I am trying to run Lightning and Ray Tune on a system with 2 GPUs. As a start, I want to use both GPUs to train 1 trial at a time.

    However, when I use

    def train_model(config):
        ...
        trainer = pl.Trainer(
            gpus=2,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=1, use_gpu=True)],
            precision=16,
        )
        trainer.fit(model, dm)
    
    if __name__ == "__main__":
    
        ray.init()
    
        config = {"batch_size": 256}
    
        analysis = tune.run(
            train_model,
            metric="loss",
            mode="min",
            config=config,
            num_samples=1,
            resources_per_trial={"gpu": 2},
            name="test",
        )
    

    I get an error about the actor or task not being able to be scheduled.

    == Status ==
    Memory usage on this node: 17.5/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 2/2 GPUs, 0.0/70.46 GiB heap, 0.0/23.58 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=109446) GPU available: True, used: True
    (pid=109446) TPU available: None, using: 0 TPU cores
    (pid=109446) Using native 16bit precision.
    (pid=109446) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
    2021-03-30 01:12:21,546 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff609630d00bec4e0790a0da3f01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 70.556641 GiB/70.556641 GiB memory, 0.000000/2.000000 GPU, 23.583984 GiB/23.583984 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.159, 1.000000/1.000000 accelerator_type:RTX}
    . In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    

    Firstly, I tried changing the Trainer to use gpus=1 and num_workers=2, but I still get the same error:

        trainer = pl.Trainer(
            gpus=1,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=2, use_gpu=True)],
            precision=16,
        )
    
    == Status ==
    Memory usage on this node: 17.1/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 2/2 GPUs, 0.0/70.07 GiB heap, 0.0/23.44 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=111559) GPU available: True, used: True
    (pid=111559) TPU available: None, using: 0 TPU cores
    (pid=111559) Using native 16bit precision.
    (pid=111559) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
    2021-03-30 01:14:17,641 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff98dc17f283c4962e8f312ee401000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 70.068359 GiB/70.068359 GiB memory, 0.000000/2.000000 GPU, 1.000000/1.000000 accelerator_type:RTX, 1.000000/1.000000 node:192.168.1.159, 23.437500 GiB/23.437500 GiB object_store_memory}
    . In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    

    Next, I reduced num_workers=2 to num_workers=1, and the error remains

        trainer = pl.Trainer(
            val_check_interval=0.1,
            gpus=1,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=1, use_gpu=True)],
            precision=16,
            progress_bar_refresh_rate=1000,  # refresh every 1000 iterations
        )
    
    == Status ==
    Memory usage on this node: 19.4/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 2/2 GPUs, 0.0/68.7 GiB heap, 0.0/23.05 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=113817) GPU available: True, used: True
    (pid=113817) TPU available: None, using: 0 TPU cores
    (pid=113817) Using native 16bit precision.
    (pid=113817) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
    2021-03-30 01:16:23,959 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffffa8b7c70cbc343023efcc60f501000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 68.701172 GiB/68.701172 GiB memory, 0.000000/2.000000 GPU, 23.046875 GiB/23.046875 GiB object_store_memory, 1.000000/1.000000 accelerator_type:RTX, 1.000000/1.000000 node:192.168.1.159}
    . In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    

    Finally, I reduced resources_per_trial to resources_per_trial={"gpu": 1}, and it now runs, but appears to be using only 1 GPU, not 2.

        trainer = pl.Trainer(
            gpus=1,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=1, use_gpu=True)],
            precision=16,
        )
    
        analysis = tune.run(
            train_model,
            metric="loss",
            mode="min",
            config=config,
            num_samples=1,
            resources_per_trial={"gpu": 1},
            name="test",
        )
    
    == Status ==
    Memory usage on this node: 18.8/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 1/2 GPUs, 0.0/69.24 GiB heap, 0.0/23.19 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=119147) GPU available: True, used: True
    (pid=119147) TPU available: None, using: 0 TPU cores
    (pid=119147) Using native 16bit precision.
    (pid=119147) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    (pid=119167) initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
    (pid=119167) 
    (pid=119167)   | Name  | Type          | Params
    (pid=119167) ----------------------------------------
    (pid=119167) 0 | model | TestPLModel00 | 3.7 K 
    (pid=119167) ----------------------------------------
    (pid=119167) 3.7 K     Trainable params
    (pid=119167) 0         Non-trainable params
    (pid=119167) 3.7 K     Total params
    (pid=119167) 0.015     Total estimated model params size (MB)
    

    I am using

    • pytorch 1.7.0
    • pytorch-lightning 1.2.5
    • ray 1.2.0
    • ray-lightning 0.0.1

    What should be the correct way to train a single trial using both GPU devices?
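
    A hedged sketch based on the "extra_cpu" note in the README above: the trial itself only needs a CPU slot, while the RayPlugin workers claim the GPUs, so a resource request along these lines should let both GPUs be scheduled (the "extra_gpu" key is an assumption about this Ray version's resource spec):

        # Assumes plugins=[RayPlugin(num_workers=2, use_gpu=True)] in the Trainer.
        analysis = tune.run(
            train_model,
            metric="loss",
            mode="min",
            config=config,
            num_samples=1,
            # 1 CPU for the trial driver, plus 2 CPUs and 2 GPUs for the workers.
            resources_per_trial={"cpu": 1, "extra_cpu": 2, "extra_gpu": 2},
            name="test",
        )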

    opened by nyxynyx 36
  • NCCL peer access is not supported error

    NCCL peer access is not supported error

    Just as the training was about to start (nvidia-smi shows the GPU memory filling up), there's a new error:

    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8
    

    Starting the Python training script using NCCL_IB_DISABLE=1 python tune.py does not help.

    I enabled debug messages using NCCL_DEBUG="INFO" NCCL_IB_DISABLE=1 python tune.py, and the following new messages appeared:

    (pid=909597) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Channel 00/02 :    0   1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Channel 01/02 :    0   1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Channel 00 : 0[2d000] -> 1[2e000] via P2P/IPC
    (pid=909608) 
    (pid=909608) z-pc:909608:910337 [0] transport/p2p.cc:238 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO transport.cc:68 -> 1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO init.cc:766 -> 1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO init.cc:840 -> 1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO Channel 00 : 1[2e000] -> 0[2d000] via P2P/IPC
    (pid=909611) 
    (pid=909611) z-pc:909611:910338 [0] transport/p2p.cc:238 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO transport.cc:68 -> 1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO init.cc:766 -> 1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO init.cc:840 -> 1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
    2021-03-30 18:20:32,548 ERROR trial_runner.py:616 -- Trial train_model_190f1_00000: Error processing event.
    $ nvidia-smi topo -m
    	GPU0	GPU1	CPU Affinity	NUMA Affinity
    GPU0	 X 	PHB	0-31		N/A
    GPU1	PHB	 X 	0-31		N/A
    

    Originally posted by @nyxynyx

    opened by amogkam 22
  • question: Do I need to use ray.init() before using the Ray Accelerator?

    question: Do I need to use ray.init() before using the Ray Accelerator?

    Hey, Thank you for creating a needed library.

    I am very new to using Ray, and I already had a project built around PL. I looked around for how to add a Ray distributed training backend to my project, and I found this library, which does not force me to give up the PL Trainer.

    Now I am trying to use the accelerator on my local machine, but I have failed to do so. I think it's a really simple issue caused by my lack of knowledge.

    This is the bit where I add the accelerator:

    if accelerator_use:
        ray.init()
        accelerator = RayAccelerator(num_workers=4, cpus_per_worker=1, use_gpu=True)
    else:
        accelerator = None
    

    I tried without using ray.init() and got an error, and when I add ray.init() I get this:

    2021-03-06 14:54:45,217 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff63964fa4841d4a2ecb45751801000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {7.000000/8.000000 CPU, 7.177734 GiB/7.177734 GiB memory, 0.000000/1.000000 GPU, 1.000000/1.000000 node:172.20.10.2, 2.441406 GiB/2.441406 GiB object_store_memory}
    . In total there are 0 pending tasks and 6 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    
    opened by MohammedAljahdali 12
  • Compatibility with PTL 1.6

    Compatibility with PTL 1.6

    Todos:

    • [x] Check if we need to_state_stream / load_state_stream P(0)
    • [x] Check multi node (P0)
    • [x] Check multi GPU/multi node (P0)
    • [x] Fix / change tests (P0)
    • [x] Check that recent PRs are included, e.g. https://github.com/ray-project/ray_lightning/pull/156 P(0.5-1)
    • [x] Check Ray client (P1)
    • [x] Check fractional GPUs (P2)
    • [x] DDP sharded (P2)
    opened by krfricke 11
  • ray ddp fails with 2 gpu workers

    ray ddp fails with 2 gpu workers

      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
        self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
        dirpath = self.strategy.broadcast(dirpath)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
        torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1817, in broadcast_object_list
        broadcast(object_sizes_tensor, src=src, group=group)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
        work = default_pg.broadcast([tensor], opts)
    RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
    ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruptio
    

    Use this branch: https://github.com/sxjscience/autogluon/tree/kaggle_california_house and install autogluon via bash full_install.sh. Afterwards, try this script: https://gist.github.com/sxjscience/53bc799e37cc0680ca9e53c2fea75cd7. Internally, the ray strategy is constructed here: https://github.com/sxjscience/autogluon/blob/59f01b95381fba5651db17fd98fa84164ad168c2/multimodal/src/autogluon/multimodal/predictor.py#L1036-L1052

    opened by JiahaoYao 10
  • [Tune] Ray Tune + Ray Lightning too many tasks warning

    [Tune] Ray Tune + Ray Lightning too many tasks warning

    I noticed this warning being logged constantly while using Ray Tune + Ray Lightning. For example: Warning: More than 20000 tasks are pending submission to actor 386ebf690ec87ad0d825174701000000. To reduce memory usage, wait for these tasks to finish before sending more.

    Do I need to worry about it?

    opened by yinweisu 10
  • CUDA devices are not exposed when running in DDP mode with multiple GPUs

    CUDA devices are not exposed when running in DDP mode with multiple GPUs

    Hi all!

    First of all thanks for this great project!

    To my issue: when I tried your example for hyperparameter tuning, I discovered that it only worked when using the CPU. After some digging, I found out that the problem is related to get_tune_ddp_resources. Since the head_bundle only requests CPU, it does not expose the required CUDA devices for the child bundles. Therefore, Lightning fails to run on GPU(s) since CUDA_VISIBLE_DEVICES is not set.

    As a workaround, I have added

    if use_gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
            [str(i) for i in range(num_workers)])
    

    in __init__ of RayPlugin. It gets the job done but seems incredibly hacky. Is there some other way around this?

    Thanks in advance!

    opened by MarkusSpanring 10
  • Does not appear to be compatible with the current version of Lightning

    Does not appear to be compatible with the current version of Lightning

    I was excited to try this out but the code appears to not be working due to a missing import:

    Traceback (most recent call last):
      File "...", line 9, in <module>
        from ray_lightning import RayAccelerator
      File ".../lib/python3.8/site-packages/ray_lightning/__init__.py", line 1, in <module>
        from ray_lightning.ray_ddp import RayAccelerator
      File ".../lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 8, in <module>
        from pytorch_lightning.accelerators import DDPSpawnAccelerator
    ImportError: cannot import name 'DDPSpawnAccelerator' from 'pytorch_lightning.accelerators' (.../lib/python3.8/site-packages/pytorch_lightning/accelerators/__init__.py)
    
    opened by import-antigravity 10
  • [Windows] RuntimeError: Distributed package doesn't have NCCL built in

    [Windows] RuntimeError: Distributed package doesn't have NCCL built in

    Hey, I am having an issue: when I run trainer.fit with the accelerator, I get the following error:

    2021-03-08 13:45:49,085 INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8265
    GPU available: True, used: True
    TPU available: None, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    Using native 16bit precision.
    Global seed set to 1234
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
    Traceback (most recent call last):
      File "train.py", line 26, in cli_main
        train(None, cfg)
      File "train.py", line 102, in train
        trainer.fit(model, datamodule=dm)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 510, in fit
        results = self.accelerator_backend.train()
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 184, in train
        results = process_results(futures, queue)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\util.py", line 103, in process_results
        ray.get(ready)
      File "C:\Users\Mohammed\AppData\Roaming\Python\Python38\site-packages\ray\_private\client_mode_hook.py", line 47, in wrapper
        return func(*args, **kwargs)
      File "C:\Users\Mohammed\AppData\Roaming\Python\Python38\site-packages\ray\worker.py", line 1456, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27620, ip=192.168.8.100)
      File "python\ray\_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python\ray\_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "C:\Users\Mohammed\AppData\Roaming\Python\Python38\site-packages\ray\function_manager.py", line 556, in actor_method_executor
        return method(__ray_actor, *args, **kwargs)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 31, in execute
        return fn(*args, **kwargs)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 218, in train_remote
        super(RayAccelerator, self).ddp_train(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\pytorch_lightning\accelerators\ddp_spawn_accelerator.py", line 127, in ddp_train
        self.init_ddp_connection(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 232, in init_ddp_connection
        torch.distributed.init_process_group(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
        _update_default_pg(_new_process_group_helper(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\torch\distributed\distributed_c10d.py", line 597, in _new_process_group_helper
        raise RuntimeError("Distributed package doesn't have NCCL "
    RuntimeError: Distributed package doesn't have NCCL built in
    
    opened by MohammedAljahdali 9
  • Optimising with respect to the epoch that scored highest for a trial (instead of the last epoch)

    Optimising with respect to the epoch that scored highest for a trial (instead of the last epoch)

    Hi,

    I am using metric="acc" for ray.tune, with mode="max". However, I think that the score for the last epoch is being used as the "best" score for the trial.

    E.g., for trial train_mnist_c1a02_00000 from the following console output, the reported acc is 0.942736:

    +-------------------------+------------+-------+--------------+-----------+-----------+-------------+--------+------------------+----------+----------+
    | Trial name              | status     | loc   |   batch_size |   layer_1 |   layer_2 |          lr |   iter |   total time (s) |      loss |      acc |
    |-------------------------+------------+-------+--------------+-----------+-----------+-------------+--------+------------------+-----------+----------|
    | train_mnist_c1a02_00000 | TERMINATED |       |           64 |       128 |       256 | 0.000120742 |     16 |    166.36  | -0.938587 | 0.942736 |
    | train_mnist_c1a02_00001 | TERMINATED |       |          128 |       128 |        64 | 0.000120068 |     16 |    138.23  | -0.923084 | 0.929161 |
    | train_mnist_c1a02_00002 | TERMINATED |       |           64 |        32 |       256 | 0.000308457 |     16 |    168.73  | -0.942267 | 0.945811 |
    | train_mnist_c1a02_00003 | TERMINATED |       |           64 |        32 |       256 | 0.0927983   |     16 |    162.749 | -0.103807 | 0.103807 |
    +-------------------------+------------+-------+--------------+-----------+-----------+-------------+--------+------------------+-----------+----------+
    

    However, when looking at progress.csv in tune_mnist/train_mnist_c1a02_00000_0_batch_size=64,layer_1=128,layer_2=256,lr=0.00012074_2021-08-25_20-18-14, the highest acc is 0.9441488981246948 (where 0.942736 is the score for the last epoch):

    loss,acc,time_this_iter_s,done,timesteps_total,episodes_total,training_iteration,experiment_id,date,timestamp,time_total_s,pid,hostname,node_ip,time_since_restore,timesteps_since_restore,iterations_since_restore,trial_id
    -0.7927070260047913,0.8189826607704163,28.36010980606079,False,,,1,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-18-49,1629886729,28.36010980606079,6438,b078,10.141.1.144,28.36010980606079,0,1,c1a02_00000
    -0.8159223198890686,0.829454779624939,9.054965734481812,False,,,2,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-18-58,1629886738,37.4150755405426,6438,b078,10.141.1.144,37.4150755405426,0,2,c1a02_00000
    -0.8259921669960022,0.835106372833252,9.04287576675415,False,,,3,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-07,1629886747,46.45795130729675,6438,b078,10.141.1.144,46.45795130729675,0,3,c1a02_00000
    -0.831539511680603,0.8390957117080688,8.813407182693481,False,,,4,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-16,1629886756,55.271358489990234,6438,b078,10.141.1.144,55.271358489990234,0,4,c1a02_00000
    -0.8353389501571655,0.8413397073745728,9.657632112503052,False,,,5,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-26,1629886766,64.92899060249329,6438,b078,10.141.1.144,64.92899060249329,0,5,c1a02_00000
    -0.896843671798706,0.9082446694374084,9.233526706695557,False,,,6,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-35,1629886775,74.16251730918884,6438,b078,10.141.1.144,74.16251730918884,0,6,c1a02_00000
    -0.9138997197151184,0.9224567413330078,9.06982707977295,False,,,7,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-44,1629886784,83.23234438896179,6438,b078,10.141.1.144,83.23234438896179,0,7,c1a02_00000
    -0.919984757900238,0.9273603558540344,9.12305760383606,False,,,8,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-53,1629886793,92.35540199279785,6438,b078,10.141.1.144,92.35540199279785,0,8,c1a02_00000
    -0.9245654940605164,0.9311003684997559,9.59031629562378,False,,,9,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-03,1629886803,101.94571828842163,6438,b078,10.141.1.144,101.94571828842163,0,9,c1a02_00000
    -0.9273290038108826,0.9330950379371643,9.259892702102661,False,,,10,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-12,1629886812,111.20561099052429,6438,b078,10.141.1.144,111.20561099052429,0,10,c1a02_00000
    -0.9301624298095703,0.935339093208313,9.178678750991821,False,,,11,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-21,1629886821,120.38428974151611,6438,b078,10.141.1.144,120.38428974151611,0,11,c1a02_00000
    -0.9327710270881653,0.9389959573745728,9.190800189971924,False,,,12,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-30,1629886830,129.57508993148804,6438,b078,10.141.1.144,129.57508993148804,0,12,c1a02_00000
    -0.9332801103591919,0.9382479786872864,9.453409433364868,False,,,13,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-40,1629886840,139.0284993648529,6438,b078,10.141.1.144,139.0284993648529,0,13,c1a02_00000
    -0.9350691437721252,0.9409075379371643,8.920868873596191,False,,,14,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-49,1629886849,147.9493682384491,6438,b078,10.141.1.144,147.9493682384491,0,14,c1a02_00000
    -0.9383015632629395,0.9441488981246948,9.268950700759888,False,,,15,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-58,1629886858,157.21831893920898,6438,b078,10.141.1.144,157.21831893920898,0,15,c1a02_00000
    -0.9385870099067688,0.942736029624939,9.141263961791992,False,,,16,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-21-07,1629886867,166.35958290100098,6438,b078,10.141.1.144,166.35958290100098,0,16,c1a02_00000
    
    

    I see that this was an issue here: https://github.com/ray-project/ray/issues/5174, but it isn't clear what the fix was.

    Thanks.

    Source:

    """Simple example using RayAccelerator and Ray Tune"""
    
    from pl_bolts.datamodules.mnist_datamodule import MNISTDataModule
    from ray import tune
    from ray_lightning.tests.utils import LightningMNISTClassifier
    from ray_lightning.tune import TuneReportCallback, get_tune_ddp_resources
    from ray_lightning import RayPlugin
    import os
    import pytorch_lightning as pl
    import ray
    
    DATA_DIR = "/datasets/work/hb-mlaifsp-mm/source/Datasets/mnist"
    NUM_WORKERS = 1
    NUM_SAMPLES = 4
    MAX_EPOCHS = 16
    USE_GPU = True
    
    def train_mnist(config):
    
    
        model = LightningMNISTClassifier(config, DATA_DIR)
    
        metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
        callbacks = [TuneReportCallback(metrics, on="validation_end")]
    
        trainer = pl.Trainer(
            max_epochs=MAX_EPOCHS,
            callbacks=callbacks,
            progress_bar_refresh_rate=0,
            plugins=[RayPlugin(num_workers=NUM_WORKERS, use_gpu=USE_GPU)],
        )
    
        dm = MNISTDataModule(data_dir=DATA_DIR, num_workers=NUM_WORKERS, batch_size=config["batch_size"])
        trainer.fit(model, dm)
    
    def tune_mnist():
        config = {
            "layer_1": tune.choice([32, 64, 128]),
            "layer_2": tune.choice([64, 128, 256]),
            "lr": tune.loguniform(1e-4, 1e-1),
            "batch_size": tune.choice([32, 64, 128]),
        }
    
        ray.init()
        analysis = tune.run(
            train_mnist,
            metric="acc",
            mode="max",
            local_dir=os.getcwd(),
            config=config,
            num_samples=NUM_SAMPLES,
            resources_per_trial=get_tune_ddp_resources(num_workers=NUM_WORKERS, use_gpu=USE_GPU),
            name="tune_mnist",
        )
        print("Best hyperparameters found were: ", analysis.best_config)
    
    if __name__ == "__main__":
        tune_mnist()
    
    opened by anicolson 8
  • Cloudpickle Dataset deserialization error

    Cloudpickle Dataset deserialization error

    Hi,

    When I try to run the code with RayPlugin in my tests, I get the following error:

    (pid=2127144) 2021-07-14 06:16:52,345   ERROR serialization.py:250 -- No module named 'test_runner'
    (pid=2127144) Traceback (most recent call last):
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 248, in deserialize_objects
    (pid=2127144)     obj = self._deserialize_object(data, metadata, object_ref)
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 190, in _deserialize_object
    (pid=2127144)     return self._deserialize_msgpack_data(data, metadata_fields)
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 168, in _deserialize_msgpack_data
    (pid=2127144)     python_objects = self._deserialize_pickle5_data(pickle5_data)
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 158, in _deserialize_pickle5_data
    (pid=2127144)     obj = pickle.loads(in_band)
    (pid=2127144) ModuleNotFoundError: No module named 'test_runner'
    

    test_runner is the name of my testing script, but I'm not sure why you would need to serialize anything inside it. It just loads data and calls Runner, which is my wrapper around pl.Trainer.

    How should I properly launch the training?

    opened by Rizhiy 8
  • Bump pytorch-lightning from 1.6.4 to 1.8.6

    Bump pytorch-lightning from 1.6.4 to 1.8.6

    Bumps pytorch-lightning from 1.6.4 to 1.8.6.

    Release notes

    Sourced from pytorch-lightning's releases.

    Weekly patch release

    App

    Added

    • Added partial support for fastapi Request annotation in configure_api handlers (#16047)
    • Added a nicer UI with URL and examples for the autoscaler component (#16063)
    • Enabled users to have more control over scaling out/in intervals (#16093)
    • Added more datatypes to the serving component (#16018)
    • Added work.delete method to delete the work (#16103)
    • Added display_name property to LightningWork for the cloud (#16095)
    • Added ColdStartProxy to the AutoScaler (#16094)
    • Added status endpoint, enable ready (#16075)
    • Implemented ready for components (#16129)

    Changed

    • The default start_method for creating Work processes locally on macOS is now 'spawn' (previously 'fork') (#16089)
    • The utility lightning.app.utilities.cloud.is_running_in_cloud now returns True during the loading of the app locally when running with --cloud (#16045)
    • Updated Multinode Warning (#16091)
    • Updated app testing (#16000)
    • Changed overwrite to True (#16009)
    • Simplified messaging in cloud dispatch (#16160)
    • Added annotations endpoint (#16159)

    Fixed

    • Fixed PythonServer messaging "Your app has started" (#15989)
    • Fixed auto-batching to enable batching for requests coming even after the batch interval but is in the queue (#16110)
    • Fixed a bug where AutoScaler would fail with min_replica=0 (#16092
    • Fixed a non-thread safe deepcopy in the scheduler (#16114)
    • Fixed HTTP Queue sleeping for 1 sec by default if no delta was found (#16114)
    • Fixed the endpoint info tab not showing up in the AutoScaler UI (#16128)
    • Fixed an issue where an exception would be raised in the logs when using a recent version of streamlit (#16139)
    • Fixed e2e tests (#16146)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.5.post0...1.8.6

    Minor patch release

    App

    • Fixed install/upgrade - removing single quote (#16079)
    • Fixed bug where components that are re-instantiated several times failed to initialize if they were modifying self.lightningignore (#16080)
    • Fixed a bug where apps that had previously been deleted could not be run again from the CLI (#16082)

    Pytorch

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 0
  • Make `_GPUAccelerator.get_parallel_devices` fit `GPUAccelerator` API

    Make `_GPUAccelerator.get_parallel_devices` fit `GPUAccelerator` API

    Closes https://github.com/ray-project/ray_lightning/issues/235

    The PTL Accelerator API is expected to return a List, and not None. This PR updates our _GPUAccelerator abstraction to fit this API.

    opened by amogkam 0
  • Support string based GPU ids

    Support string based GPU ids

    GPU device ids can be specified with an integer index, but may also be specified as strings.

    This PR ensures that both cases are supported by root_device. The code is taken from what is being done in Ray Train: https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/train/torch/train_loop_utils.py?L470-498

    Closes https://github.com/ray-project/ray_lightning/issues/236

    opened by amogkam 0
  • Update protobuf requirement from <=3.20.1 to <4.21.13

    Update protobuf requirement from <=3.20.1 to <4.21.13

    Updates the requirements on protobuf to permit the latest version.

    Commits

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 0
  • Multi-GPU training fails with `ValueError` on systems with UUID GPU IDs

    Multi-GPU training fails with `ValueError` on systems with UUID GPU IDs

    I'm currently trying to use ray_lightning to distribute model training over the resources in my ray cluster, like so:

    ngpu = int(ray.cluster_resources().get("GPU", 0))
    use_gpu = ngpu > 0
    num_workers = ngpu
    ncpu = 8
    strategy = RayStrategy(num_workers, ncpu, use_gpu, find_unused_parameters=False)
    # define dataloaders
    # define callbacks
    trainer = PlTrainer(
        logger=False,
        max_epochs=50,
        callbacks=callbacks,
        gpus=1,
        enable_model_summary=False,
        enable_checkpointing=False,
        strategy=strategy,
    )
    trainer.fit(lit_model, train_dataloader, val_dataloader)
    

    However, this code results in a ValueError:

      File "/home/gridsan/dgraff/molpal/molpal/models/mpnmodels.py", line 207, in train
        trainer.fit(lit_model, train_dataloader, val_dataloader)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
        self._call_and_handle_interrupt(
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
        return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 58, in launch
        ray_output = self.run_function_on_workers(
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 249, in run_function_on_workers
        results = process_results(self._futures, self.tune_queue)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/util.py", line 64, in process_results
        ray.get(ready)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
        return func(*args, **kwargs)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=49053, ip=172.31.130.105, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f392469a6d0>)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
        return fn(*args, **kwargs)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 295, in _wrapping_function
        self._strategy._worker_setup(process_idx=global_rank)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 170, in _worker_setup
        self._process_group_backend = self._get_process_group_backend()
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 166, in _get_process_group_backend
        or get_default_process_group_backend_for_device(self.root_device)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 295, in root_device
        cuda_visible_list = [
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 296, in <listcomp>
        int(dev) for dev in cuda_visible_str.split(",")
    ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'
    

    It seems like the internal code relies on an ordinal GPU device naming scheme. I.e.,

    $ echo $CUDA_VISIBLE_DEVICES
    0,1
    

    which seems reasonable, given that's what I typically encounter on most systems. But on my system, the GPU device naming looks something like this:

    $ echo $CUDA_VISIBLE_DEVICES
    GPU-23c5e712-9b16-e21a-df00-7dab564ade42,GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1
    

    So it seems like there are two options:

    1. I could ask my sys-admins to rename the GPUs on the cluster to the more "standard" ordinal scheme. They'll probably tell me "No." and reference the CUDA_VISIBLE_DEVICES specification, where it states that device names of the form GPU-<UUID> are a valid option in addition to integer indices
    2. This block of code in ray_lightning/ray_ddp.py#L292 is altered:
    gpu_id = ray.get_gpu_ids()[0]  # NOTE: this value is cast to `int(...)` on the main branch, so the code would break _here_ on main, but in v0.3 it breaks later on
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = [
            int(dev) for dev in cuda_visible_str.split(",")
        ]
        device_id = cuda_visible_list.index(gpu_id)
        return torch.device("cuda", device_id)
    

    I think the block should be changed to:

    gpu_id = ray.get_gpu_ids()[0]
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = list(cuda_visible_str.split(","))
        device_id = cuda_visible_list.index(gpu_id)
        return torch.device("cuda", device_id)
    

    Thanks for the great work so far!

    opened by davidegraff 1
  • TypeError in a SLURM environment due to internal API break

    TypeError in a SLURM environment due to internal API break

    Using the master branch of ray-lightning with pytorch-lightning v1.6 in a SLURM environment leads to the following exception:

    ray.exceptions.RayTaskError(TypeError): ray::ImplicitFunc.train() (pid=117539, ip=10.181.76.37, repr=train)
      File ".../lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
        raise skipped from exception_cause(skipped)
      File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
        return self._trainable_func(
      File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
        output = fn()
      File ".../random_search.py", line 122, in train
        trainer = Trainer(
      File ".../lib/python3.9/site-packages/pytorch_lightning/utilities/argparse.py", line 339, in insert_env_defaults
        return fn(self, **kwargs)
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in __init__
        self._accelerator_connector = AcceleratorConnector(
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 204, in __init__
        self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 549, in _choose_and_init_cluster_environment
        if self._is_slurm_managing_tasks():
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 562, in _is_slurm_managing_tasks
        total_requested_devices = len(self._parallel_devices) * self._num_nodes_flag
    TypeError: object of type 'NoneType' has no len()
    

    The _GPUAccelerator.get_parallel_devices method breaks the internal Pytorch Lightning API by returning None in some cases; is this intentional? Returning an empty list instead of None fixes my issue, but I don't know if None is required in other ray-lightning use cases.

    I would be more than happy to provide a PR if you think the fix is fine.

    Thank you for this very convenient package and keep up the fantastic work!

    opened by dcfidalgo 2
Releases(v0.3.0)
  • v0.3.0(Aug 23, 2022)

    What's Changed

    • Bump version for development by @amogkam in https://github.com/ray-project/ray_lightning/pull/122
    • Update README to render on Ray docs by @amogkam in https://github.com/ray-project/ray_lightning/pull/135
    • Fix bash code block in Readme by @Yard1 in https://github.com/ray-project/ray_lightning/pull/136
    • Fix for fractional GPU by @amogkam in https://github.com/ray-project/ray_lightning/pull/125
    • Update broken PTL link by @amogkam in https://github.com/ray-project/ray_lightning/pull/137
    • Fix hanging trainer.test() by @amogkam in https://github.com/ray-project/ray_lightning/pull/142
    • Fix ray_ddp_sharded_example by @chongxiaoc in https://github.com/ray-project/ray_lightning/pull/153
    • Pop kwargs to support LightningCLI by @amogkam in https://github.com/ray-project/ray_lightning/pull/154
    • ray_ddp: support logged_metrics as part of remote worker return value by @chongxiaoc in https://github.com/ray-project/ray_lightning/pull/156
    • Support PyTorch Lightning 1.6 by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/163
    • Fix docs formatting by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/188
    • fix issue #189 by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/190
    • [Ray lightning 1.6] update the change according to the comment in #163 by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/195

    New Contributors

    • @Yard1 made their first contribution in https://github.com/ray-project/ray_lightning/pull/136
    • @chongxiaoc made their first contribution in https://github.com/ray-project/ray_lightning/pull/153
    • @JiahaoYao made their first contribution in https://github.com/ray-project/ray_lightning/pull/163

    Full Changelog: https://github.com/ray-project/ray_lightning/compare/0.2.0...v0.3.0

  • 0.2.0(Feb 2, 2022)

    • Support for PyTorch Lightning v1.5 (#115, #121)!
    • Update HorovodRayPlugin API to match the new Horovod on Ray API. num_hosts and num_slots args have been deprecated in favor of a generic num_workers arg (#71).
    • get_tune_ddp_resources has been renamed to get_tune_resources and can now be used for both RayPlugin and HorovodRayPlugin (#71).
    • Rename the cpus_per_worker arg in get_tune_resources utility to num_cpus_per_worker to match the arg name in RayPlugin (#96).
    • Annotate the APIs as beta (#88).
  • 0.1.1(Aug 20, 2021)

  • 0.1.0(Aug 12, 2021)
