TorchX

TorchX is a library containing standard DSLs for authoring and running PyTorch-related components for an E2E production ML pipeline.

For the latest documentation, please refer to our website.

Requirements

TorchX SDK (torchx):

  • python3 (3.8+)
  • torch

TorchX Kubeflow Pipelines Support (torchx-kfp):

  • torchx
  • kfp

Installation

# install torchx sdk and CLI
pip install torchx

# install torchx kubeflow pipelines (kfp) support
pip install "torchx[kfp]"

Quickstart

See the quickstart guide.
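
A minimal smoke test after installation, as a sketch: it assumes the builtin utils.echo component and the local_cwd scheduler that ship with TorchX, and <app_id> stands for the app handle printed by the run command.

# run a trivial builtin component locally
torchx run --scheduler local_cwd utils.echo --msg "hello world"

# check its status using the handle printed above
torchx status local_cwd://torchx/<app_id>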

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

TorchX is BSD licensed, as found in the LICENSE file.

Comments
  • cli: defer loading schedulers until used

    cli: defer loading schedulers until used

    This updates the CLI, schedulers, and runners so that schedulers are only imported when they need to be used. This means only the relevant scheduler is loaded, which vastly improves responsiveness.

    • torchx --help 900ms -> 300ms
    • torchx status local_cwd:// 1.38s -> 840ms

    Breaking changes:

    • schedulers.get_schedulers() has been removed in favor of schedulers.get_scheduler_factories(), since eagerly instantiating every scheduler is too dangerous
    • Runner.run_opts() now takes a specific scheduler name instead of returning runopts for all schedulers, since that requires loading all schedulers.

    get_schedulers is an internal interface; downstream users should be using the runner interface, so changing this shouldn't be an issue.

    Runner.run_opts() is part of the user-facing Runner interface, but it's only really practical for the CLI, so I doubt there's any OSS usage of it.

    Test plan:

    (torchx) tristanr@tristanr-arch2 ~/D/torchx (deferload)> time torchx --help
    usage: torchx [-h] [--log_level LOG_LEVEL] [--version] {builtins,cancel,configure,describe,log,run,runopts,status} ...
    
    torchx CLI
    
    optional arguments:
      -h, --help            show this help message and exit
      --log_level LOG_LEVEL
                            Python logging log level
      --version             show program's version number and exit
    
    sub-commands:
      Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}
    
      {builtins,cancel,configure,describe,log,run,runopts,status}
    
    ________________________________________________________
    Executed in  300.70 millis    fish           external
       usr time  280.99 millis  916.00 micros  280.07 millis
       sys time   16.46 millis    0.00 micros   16.46 millis
    
    (torchx) tristanr@tristanr-arch2 ~/D/torchx (deferload)> time torchx status local_docker://torchx/sh-lsgcv92jm13ps
    torchx 2022-06-24 15:55:58 INFO     AppDef:
      State: SUCCEEDED
      Num Restarts: -1
    Roles:
     *sh[0]:SUCCEEDED
      sh[1]:SUCCEEDED
      sh[2]:SUCCEEDED
    
    ________________________________________________________
    Executed in  507.45 millis    fish           external
       usr time  472.78 millis  926.00 micros  471.86 millis
       sys time   19.77 millis    0.00 micros   19.77 millis
    
    CLA Signed 
    opened by d4l3k 17
  • Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Summary: Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Differential Revision: D30980471

    CLA Signed fb-exported 
    opened by aivanou 17
  • API to list jobs by scheduler

    API to list jobs by scheduler

    Given a scheduler, torchx list -s <scheduler_name> lists the jobs scheduled on it. This is an MVP that supports listing jobs for the Kubernetes scheduler, to address https://github.com/pytorch/torchx/issues/503. More features, such as listing only jobs started by torchx, listing active jobs, and filtering options, will be implemented in follow-up PRs.

    Test plan:

    (venv38) ubuntu@ip:~/wp/torchx$ torchx list -s kubernetes
    kubernetes://default/default:cifar-trainer-a5qvfhe1hyb1b
    kubernetes://default/default:cifar-trainer-d796ei2tdf4bc
    kubernetes://default/default:cifar-trainer-em0iao2m9agog
    kubernetes://default/default:cifar-trainer-ew33oxmdg7t0c
    kubernetes://default/default:cifar-trainer-grjsnfxeinqbd
    kubernetes://default/default:cifar-trainer-p4bwlewvwmt1f
    kubernetes://default/default:cifar-trainer-pwy4c4omfufff
    kubernetes://default/default:cifar-trainer-qyan5cp6vz5od
    kubernetes://default/default:cifar-trainer-rbrd5krzkyz3c
    kubernetes://default/default:cifar-trainer-rw9kn5rtau6se
    kubernetes://default/default:cifar-trainer-tz45941r7u1b
    kubernetes://default/default:cifar-trainer-uycdlshurnf6c
    kubernetes://default/default:cifar-trainer-vstsr7qtecaif
    kubernetes://default/default:cifar-trainer-xnqu2mxdc2a5b
    .
    .
    .
    .
    kubernetes://default/default:train-zhqz9fhr7h0tsc
    kubernetes://default/default:trainer-bwhcp3vw2khftc
    kubernetes://default/default:trainer-m3mxh6dcxgwtv
    
    CLA Signed 
    opened by priyaramani 15
  • ray_scheduler: workspace + fixed no role logging

    ray_scheduler: workspace + fixed no role logging

    This updates Ray to have proper workspace support.

    • -c working_dir=... is deprecated in favor of torchx run --workspace=...
    • -c requirements=... is optional and requirements.txt will be automatically read from the workspace if present
    • torchx log ray://foo/bar works without requiring /ray/0

    Test plan:

    (torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx run -s ray --wait --log dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size
    torchx 2022-05-18 16:55:31 INFO     Checking for changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray`...
    torchx 2022-05-18 16:55:31 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
    torchx 2022-05-18 16:55:31 INFO     Built new image `/tmp/torchx_workspacebe6331jv` based on original image `ghcr.io/pytorch/torchx:0.2.0dev0` and changes in workspace `file:///home/tristanr/Developer/torch
    rec/examples/ray` for role[0]=compute_world_size.
    torchx 2022-05-18 16:55:31 WARNING  The Ray scheduler does not support port mapping.
    torchx 2022-05-18 16:55:31 INFO     Uploading package gcs://_ray_pkg_63a39f7096dfa0bd.zip.
    torchx 2022-05-18 16:55:31 INFO     Creating a file package for local directory '/tmp/torchx_workspacebe6331jv'.
    ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    torchx 2022-05-18 16:55:31 INFO     Launched app: ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    torchx 2022-05-18 16:55:31 INFO     AppStatus:
      msg: PENDING
      num_restarts: -1
      roles:
      - replicas:
        - hostname: <NONE>
          id: 0
          role: ray
          state: !!python/object/apply:torchx.specs.api.AppState
          - 2
          structured_error_msg: <NONE>
        role: ray
      state: PENDING (2)
      structured_error_msg: <NONE>
      ui_url: null
    
    torchx 2022-05-18 16:55:31 INFO     Job URL: None
    torchx 2022-05-18 16:55:31 INFO     Waiting for the app to finish...
    torchx 2022-05-18 16:55:31 INFO     Waiting for app to start before logging...
    torchx 2022-05-18 16:55:43 INFO     Job finished: SUCCEEDED
    (torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    ray/0 Waiting for placement group to start.
    ray/0 running ray.wait on [ObjectRef(8f2664c081ffc268e1c4275021ead9801a8d33861a00000001000000), ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
    ray/0 (CommandActor pid=494377)   entrypoint       : scripts.compute_world_size
    ray/0 (CommandActor pid=494377)   min_nodes        : 2
    ray/0 (CommandActor pid=494377)   max_nodes        : 2
    ray/0 (CommandActor pid=494377)   nproc_per_node   : 1
    ray/0 (CommandActor pid=494377)   run_id           : compute_world_size-mpr03nzqvvg3td
    ray/0 (CommandActor pid=494377)   rdzv_backend     : c10d
    ray/0 (CommandActor pid=494377)   rdzv_endpoint    : localhost:29500
    ray/0 (CommandActor pid=494377)   rdzv_configs     : {'timeout': 900}
    ray/0 (CommandActor pid=494377)   max_restarts     : 0
    ray/0 (CommandActor pid=494377)   monitor_interval : 5
    ray/0 (CommandActor pid=494377)   log_dir          : None
    ray/0 (CommandActor pid=494377)   metrics_cfg      : {}
    ray/0 (CommandActor pid=494377)
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
    ray/0 (CommandActor pid=494406)   entrypoint       : scripts.compute_world_size
    ray/0 (CommandActor pid=494406)   min_nodes        : 2
    ray/0 (CommandActor pid=494406)   max_nodes        : 2
    ray/0 (CommandActor pid=494406)   nproc_per_node   : 1
    ray/0 (CommandActor pid=494406)   run_id           : compute_world_size-mpr03nzqvvg3td
    ray/0 (CommandActor pid=494406)   rdzv_backend     : c10d
    ray/0 (CommandActor pid=494406)   rdzv_endpoint    : 172.26.20.254:29500
    ray/0 (CommandActor pid=494406)   rdzv_configs     : {'timeout': 900}
    ray/0 (CommandActor pid=494406)   max_restarts     : 0
    ray/0 (CommandActor pid=494406)   monitor_interval : 5
    ray/0 (CommandActor pid=494406)   log_dir          : None
    ray/0 (CommandActor pid=494406)   metrics_cfg      : {}
    ray/0 (CommandActor pid=494406)
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
    ray/0 (CommandActor pid=494377)   restart_count=0
    ray/0 (CommandActor pid=494377)   master_addr=tristanr-arch2
    ray/0 (CommandActor pid=494377)   master_port=48089
    ray/0 (CommandActor pid=494377)   group_rank=1
    ray/0 (CommandActor pid=494377)   group_world_size=2
    ray/0 (CommandActor pid=494377)   local_ranks=[0]
    ray/0 (CommandActor pid=494377)   role_ranks=[1]
    ray/0 (CommandActor pid=494377)   global_ranks=[1]
    ray/0 (CommandActor pid=494377)   role_world_sizes=[2]
    ray/0 (CommandActor pid=494377)   global_world_sizes=[2]
    ray/0 (CommandActor pid=494377)
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t/attempt_0/0/error.json
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
    ray/0 (CommandActor pid=494406)   restart_count=0
    ray/0 (CommandActor pid=494406)   master_addr=tristanr-arch2
    ray/0 (CommandActor pid=494406)   master_port=48089
    ray/0 (CommandActor pid=494406)   group_rank=0
    ray/0 (CommandActor pid=494406)   group_world_size=2
    ray/0 (CommandActor pid=494406)   local_ranks=[0]
    ray/0 (CommandActor pid=494406)   role_ranks=[0]
    ray/0 (CommandActor pid=494406)   global_ranks=[0]
    ray/0 (CommandActor pid=494406)   role_world_sizes=[2]
    ray/0 (CommandActor pid=494406)   global_world_sizes=[2]
    ray/0 (CommandActor pid=494406)
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p/attempt_0/0/error.json
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.000942230224609375 seconds
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0013003349304199219 seconds
    ray/0 (CommandActor pid=494377) [0]:initializing `gloo` process group
    ray/0 (CommandActor pid=494377) [0]:successfully initialized process group
    ray/0 (CommandActor pid=494377) [0]:rank: 1, actual world_size: 2, computed world_size: 2
    ray/0 (CommandActor pid=494406) [0]:initializing `gloo` process group
    ray/0 (CommandActor pid=494406) [0]:successfully initialized process group
    ray/0 (CommandActor pid=494406) [0]:rank: 0, actual world_size: 2, computed world_size: 2
    ray/0 running ray.wait on [ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
    
    CLA Signed 
    opened by d4l3k 14
  • Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)

    Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)

    🐛 Bug

    Not really a TorchX bug per se, but opening this issue to keep track of this investigation.

    There seems to be a perf drop when running the same fairseq trainer on AWS Batch on p3dn (w/ EFA) versus running on the same host on bare metal (no Docker).

    Module (check all that applies):

    • [ ] torchx.spec
    • [ ] torchx.component
    • [ ] torchx.apps
    • [ ] torchx.runtime
    • [ ] torchx.cli
    • [ ] torchx.schedulers
    • [ ] torchx.pipelines
    • [ ] torchx.aws
    • [ ] torchx.examples
    • [x] other

    Below is a table summarizing the different runs:

    Common Params:

    1. Instance Type: p3dn.24xlarge
    2. Num Nodes: 2
    3. Nodes in same placement group: Yes
    4. Environment Variables: LOGLEVEL=INFO, NCCL_SOCKET_IFNAME=eth0, NCCL_ALGO=RING, NCCL_PROTO=simple, FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1
    5. In all cases RDMA is not enabled (seems to not be available on p3dn see: https://github.com/aws/aws-ofi-nccl/issues/104)

    | Scheduler | Docker | EFA picked up by NCCL | NCCL_ALGO | Performance | Log |
    |-----------|:-------|:----------------------|:----------|:------------|:----|
    | BareMetal | No | No | Tree | baseline (~110k wpm) | gist-link |
    | BareMetal | Yes | Yes | Tree | WIP | |
    | AWS Batch | Yes | Yes | Ring | ~44k wpm | |
    | AWS Batch | Yes | Yes | Tree | ~70k wpm | |

    • Scheduler BareMetal: ssh onto the nodes and kick off the trainer manually
    • Docker Yes/No: whether the trainer runs in Docker (e.g. docker run) versus directly on the host (python -m torch.distributed.launch)

    Logs

    To Reproduce

    Steps to reproduce the behavior:

    Expected behavior

    Environment

    • torchx version (e.g. 0.1.0rc1): torchx-nightly
    • Python version: 3.8+
    • OS (e.g., Linux): Amazon Linux2
    • How you installed torchx (conda, pip, source, docker): pip
    • Docker image and tag (if using docker):
    • Git commit (if installed from source): N/A
    • Execution environment (on-prem, AWS, GCP, Azure etc): AWS Batch
    • Any other relevant information:

    Additional context

    bug aws_batch 
    opened by kiukchung 14
  • schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

    schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

    This adds a scheduler that allows for launching TorchX jobs directly on AWS Batch. This requires almost no infrastructure setup (just the UI) and provides a fairly sane Docker job engine.
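
    As a rough sketch of a launch (assuming an existing Batch job queue; the queue runopt name and its value are illustrative and may differ by torchx version):

    # hypothetical job queue name; utils.echo is a builtin component
    torchx run -s aws_batch -cfg queue=my-batch-queue utils.echo --msg "hello from batch"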

    Test plan:

    scripts/awsbatchint.sh
    pytest torchx/schedulers/test/aws_batch_scheduler_test.py
    pyre
    
    CLA Signed 
    opened by d4l3k 14
  • Ray scheduler driver and job api

    Ray scheduler driver and job api

    To run the distributed PyTorch test:

    cd torch/torchx/schedulers/test
    python -m unittest ray_scheduler_test.py
    
    2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
    2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
    2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
    status: PENDING
    status: PENDING
    status: RUNNING
    status: RUNNING
    status: RUNNING
    status: RUNNING
    status: SUCCEEDED
    2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
    (CommandActor pid=3575) initializing `gloo` process group
    (CommandActor pid=3574) initializing `gloo` process group
    (CommandActor pid=3575) successfully initialized process group
    (CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
    (CommandActor pid=3574) successfully initialized process group
    (CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2
    

    To run the distributed PyTorch test using the torchx CLI:

    # Setup cluster and get a HEAD NODE IP
    ray up -y ray_cluster.yaml
    pip install torchx[dev]
    
    # Get a job ID from deployed job
    torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=aivanou_test utils.binary --entrypoint ray_simple.py
    
    # Use job ID to get logs or job status
    torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
    torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
    
    CLA Signed 
    opened by msaroufim 14
  • Move `examples` under `torchx` module

    Move `examples` under `torchx` module

    Summary: This diff moves examples under the torchx namespace, removes the examples Dockerfile, and makes the torchx image use dev-requirements.

    Differential Revision: D31464358

    fb-exported 
    opened by aivanou 14
  • github/kfp: run integration tests even for external users

    github/kfp: run integration tests even for external users

    This splits the kfp integration tests into two steps.

    1. without secrets on PR branch: builds the pipeline + containers and saves them as a GitHub artifact
    2. with secrets from master branch: loads the artifact and launches it on the KFP cluster

    Test plan:

    scripts/kfpint.py --path /tmp/foo --save
    scripts/kfpint.py --path /tmp/foo --load
    

    CI

    CLA Signed Merged 
    opened by d4l3k 14
  • add LSF scheduler

    add LSF scheduler

    I prototyped an LSF scheduler for torchx. It supports native, Docker, and Singularity runtimes with a shared filesystem at the moment. I confirmed it works with Gloo and NCCL on small VPC V100 clusters.

    Note: the torchx log command is available only when the torchx host shares the filesystem with the cluster nodes (e.g., NFS).

    In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., bsub). For distributed apps, it creates multiple bsub submissions. I also added lsf to scripts/component_integration_tests.py. Here is the log output with my three-node LSF cluster; you can find dryrun results there.

    component_integration_tests.lsf.txt

    Regarding Singularity image compatibility, Singularity already automates converting Docker images into its own image format, so all we have to do is generate singularity exec arguments from torchx requests. Note that users still need to prefix image names with docker:// if they want to use Docker images.
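
    For reference, the Docker-to-Singularity conversion mentioned above is handled by Singularity itself; a minimal sketch (the image name is just an example):

    # pull a Docker image and convert it to a local .sif file
    singularity pull docker://alpine:latest
    # or execute directly against the Docker registry reference
    singularity exec docker://alpine:latest echo hello_world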

    The following are example commands.

    Example: native hello_world and CLI utils

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
    lsf://torchx/echo-pxc3gn5ct061k
    $ torchx list -s lsf
    $ torchx status lsf://torchx/echo-pxc3gn5ct061k
    $ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
    $ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
    

    Example: Docker hello_world

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
    

    Example: Singularity hello_world

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
    

    Example: Docker Distributed

    $ cp scripts/dist_app.py /mnt/data/dist/
    $ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
    

    Example: Singularity Distributed

    $ cp scripts/dist_app.py /mnt/data/dist/
    $ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
    
    CLA Signed 
    opened by takeshi-yoshimura 13
  • c10d communication failure on k8s cluster

    c10d communication failure on k8s cluster

    🐛 Bug

    cross-referencing the issue posted here:

    https://github.com/pytorch/pytorch/issues/80772

    which is about a race condition between the rank 0 pod registering with DNS and the c10d service trying to look it up.

    To Reproduce

    Steps to reproduce the behavior:

    1. run the code with torchx run --scheduler kubernetes dist.ddp -j 8x1 --script cifar_dist.py
    2. see the pods scheduled and then erroring

    Code:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data.distributed import DistributedSampler
    from torch.utils.data import DataLoader
    
    from torchvision.models import resnet18
    from torchvision.datasets import CIFAR10
    import torchvision.transforms as transforms
    
    from torch.nn.parallel import DistributedDataParallel as DDP
    import random
    import numpy as np
    from time import sleep
    
    model_dir = '/model'
    model_filename = 'mymodel'
    batch_size = 128
    
    
    def set_random_seeds(seed=0):
    
        torch.manual_seed(seed)
        # torch.backends.cudnn.deterministic = True
        # torch.backends.cudnn.benchmark = False
        np.random.seed(seed)
        random.seed(seed)
    
    
    def evaluate(model, device, test_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data in test_loader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        accuracy = correct / total
        return accuracy
    
    
    dist.init_process_group("gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    
    if torch.cuda.is_available():
        device = torch.cuda.device(f'cuda:{rank}')
    else:
        device = torch.device('cpu')
    
    
    print(f"Running basic DDP example on rank {rank} of {world_size}")
    
    if rank == 0:
        model_filepath = os.path.join(model_dir, model_filename)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
    
    set_random_seeds()
    
    print('create model...')
    # create model
    model = resnet18(pretrained=False)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        device = torch.cuda.device(f'cuda:{rank}')
        ddp_model = DDP(model, device_ids=[rank], output_device=rank)
    else:
        device = torch.device('cpu')
        ddp_model = DDP(model)
    
    print('prepare dataset...')
    # Prepare dataset and dataloader
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    
    train_set = CIFAR10(root="data", train=True, download=True, transform=transform) 
    test_set = CIFAR10(root="data", train=False, download=True, transform=transform)
    
    train_sampler = DistributedSampler(dataset=train_set)
    
    train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=0)
    # Test loader does not have to follow distributed sampling strategy
    test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=0)
    
    print('setup model...')
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
    
    for epoch in range(1000):
        print(f'Epoch {epoch}')
        if epoch % 10 == 0:
            if rank == 0:
                accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
                torch.save(ddp_model.state_dict(), model_filepath)
                print("-" * 75)
                print(f"Epoch: {epoch}, Accuracy: {accuracy}")
                print("-" * 75)
            else:
                print("-" * 75)
                print(f"Epoch: {epoch}")
                print("-" * 75)
    
        ddp_model.train()
    
        i = 0
        for data in train_loader:
            print(i)
            i += 1
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        
    

    Expected behavior

    I expect pods to connect to c10d properly and run the code.

    Environment

    Collecting environment information...
    PyTorch version: 1.11.0
    Is debug build: False
    CUDA used to build PyTorch: 11.3
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 18.04.6 LTS (x86_64)
    GCC version: Could not collect
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.27
    
    Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-5.4.17-2136.306.1.3.el8uek.x86_64-x86_64-with-glibc2.17
    Is CUDA available: False
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] botorch==0.6.0
    [pip3] gpytorch==1.6.0
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.21.2
    [pip3] pytorch-lightning==1.5.10
    [pip3] torch==1.11.0
    [pip3] torch-model-archiver==0.6.0
    [pip3] torchelastic==0.2.2
    [pip3] torchmetrics==0.9.1
    [pip3] torchserve==0.6.0
    [pip3] torchtext==0.12.0
    [pip3] torchvision==0.12.0
    [pip3] torchx==0.2.0
    [conda] blas                      1.0                         mkl  
    [conda] botorch                   0.6.0                    pypi_0    pypi
    [conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
    [conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
    [conda] gpytorch                  1.6.0                    pypi_0    pypi
    [conda] mkl                       2021.4.0           h06a4308_640  
    [conda] mkl-service               2.4.0            py38h7f8727e_0  
    [conda] mkl_fft                   1.3.1            py38hd3c417c_0  
    [conda] mkl_random                1.2.2            py38h51133e4_0  
    [conda] numpy                     1.21.2           py38h20f2e39_0  
    [conda] numpy-base                1.21.2           py38h79a1101_0  
    [conda] pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
    [conda] pytorch-lightning         1.5.10                   pypi_0    pypi
    [conda] pytorch-mutex             1.0                        cuda    pytorch
    [conda] torch-model-archiver      0.6.0                    pypi_0    pypi
    [conda] torchelastic              0.2.2                    pypi_0    pypi
    [conda] torchmetrics              0.9.1                    pypi_0    pypi
    [conda] torchserve                0.6.0                    pypi_0    pypi
    [conda] torchtext                 0.12.0                     py38    pytorch
    [conda] torchvision               0.12.0               py38_cu113    pytorch
    [conda] torchx                    0.2.0                    pypi_0    pypi
    
    • Execution environment: Oracle Cloud OKE cluster k8s version 1.23.4
    • dns: CoreDNS-1.8.6 linux/amd64, go1.16.7 BoringCrypto, 69a006c9f1a
    opened by streamnsight 13
  • Find default namespace from kube_config.

    Find default namespace from kube_config.

    For the Kubernetes scheduler, read the default namespace from kube_config rather than using the hard-coded "default" value.

    Right now the default namespace for the list method is hard-coded as "default", which does not work for multi-user scenarios on a Kubernetes cluster. The default namespace can instead be read from the current context of the kube configuration.

    Test plan:

    Tested manually with a generated kube config.
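
    For reference, a sketch of how the current-context namespace can be inspected with kubectl (prints nothing when the context does not set a namespace):

    kubectl config view --minify --output 'jsonpath={..namespace}'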

    CLA Signed 
    opened by liuwenchao 3
  • (torchx/aws_batch) Enable Region Selection in TorchConfig

    (torchx/aws_batch) Enable Region Selection in TorchConfig

    QOL improvements for AWS region support.

    Current behavior: use the default region specified in the .aws config.

    Proposed behavior: allow a manual override in .torchxconfig. If there is no override and no default, throw an error.

    Future improvements: add profile support; add Fargate support.
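
    A hypothetical sketch of what the proposed override could look like in .torchxconfig (the region key does not exist yet; the name is illustrative):

    cat >> .torchxconfig <<'EOF'
    [aws_batch]
    region = us-west-2
    EOF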

    CLA Signed 
    opened by ashvinnihalani 3
  • Kubernetes: support default scheduler instead of volcano for autoscaling

    Kubernetes: support default scheduler instead of volcano for autoscaling

    Description

    Right now Volcano doesn't handle autoscaling correctly, since it won't schedule jobs if there aren't enough resources. For single-node jobs it would be nice to support the standard (non-Volcano) Kubernetes scheduler so the Pods are still created and can trigger cluster autoscaling.

    Detailed Proposal

    Add a scheduler setting to the Kubernetes scheduler that allows setting schedulerName to "default-scheduler" to use the standard scheduling logic.

    https://volcano.sh/en/docs/vcjob/#schedulername

    There's some prototype work for TPU nodes in https://github.com/pytorch/torchx/commit/66e934c59b376a8c0b9aa6523fde34e81a673eb2 which required the same default-scheduler logic.
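
    A hypothetical invocation if the proposed runopt were added (the scheduler cfg key is illustrative and does not exist yet; assumes a Volcano queue named default):

    torchx run -s kubernetes -cfg queue=default,scheduler=default-scheduler utils.echo --msg "autoscale me"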

    Alternatives

    Additional context/links

    opened by d4l3k 0
  • [torchx/schedulers - aws batch] Only display jobs with torchx tag when listing

    [torchx/schedulers - aws batch] Only display jobs with torchx tag when listing

    Description

    Add a filter on the existence of the tag key torchx.pytorch.org/version when listing jobs.

    Motivation/Background

    Currently, when the Batch compute environment contains jobs that were not launched via torchx and we list jobs via:

    torchx list -s aws_batch
    

    It lists ALL the jobs (not just the ones submitted via torchx). Since we already tag the Batch jobs launched with torchx with the tag key torchx.pytorch.org/version (see the screenshot below), we should only list the jobs that have this tag (see also the sketch after the list below).

    [screenshot: AWS Batch job detail showing the torchx.pytorch.org/version tag]

    Furthermore, we can also:

    1. Add a unix-username tag so that we only display the jobs relevant to the user
    2. Add a --filter option to either show all (*) jobs or filter by user or job queue or other useful criteria
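
    To illustrate the motivation, a sketch of inspecting job tags with the AWS CLI (the job queue and job id are placeholders):

    # list job ids in a queue
    aws batch list-jobs --job-queue <job_queue> --status RUNNING
    # show the tags of a specific job; torchx-launched jobs carry torchx.pytorch.org/version
    aws batch describe-jobs --jobs <job_id> --query 'jobs[].tags'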

    Detailed Proposal

    See motivation above.

    Alternatives

    N/A

    Additional context/links

    N/A

    opened by kiukchung 0
  • [RFC][torchx/schedulers - aws batch] support wildcard job queue selection

    [RFC][torchx/schedulers - aws batch] support wildcard job queue selection

    Description

    For the AWS Batch scheduler integration in torchx, support wildcards in the job_queue runopt with a simple (or configurable or extendable) queue selection logic. For instance, assume that my organization uses a job queue naming convention of the form

    ${TEAM_NAME}-${CLUSTER_TYPE}-${REGION}
    
    Example:
    
    pt_r2p-gpu_cluster-us-east-1a
    pt_r2p-gpu_cluster-us-east-1b
    pt_r2p-gpu_cluster-us-east-1c
    pt_r2p-cpu_cluster-us-east-1a
    pt_r2p-cpu_cluster-us-east-1b
    pt_r2p-trainium_cluster-us-east-1a
    pt_r2p-trainium_cluster-us-east-1c
    
    pt_core-gpu_cluster-us-east-1a
    pt_core-cpu_cluster-us-east-1a
    

    If I'm in the pt_r2p team and want to submit a job to any gpu compute environment that has free capacity regardless of region, then I can use a wildcard on the ${REGION} portion of the job queue name as:

    [aws_batch]
    job_queue=pt_r2p-gpu_cluster-*
    

    Motivation/Background

    Ideally, with AWS Batch we create a single job queue (JQ) connected to multiple compute environments (CEs) and always submit to the same JQ, letting Batch figure out which CE the job should go to. With Fair Share (FS) scheduling, announced a year ago (see announcement), this is theoretically possible. However, many users of AWS Batch are still on the originally supported FIFO scheduling policy, in which case having a single JQ is impractical in a multi-team scenario, since users from other teams may affect the scheduling overhead of my team. This starts escalating pretty quickly in cases where teams BYO (bring your own) capacity.

    Detailed Proposal

    Support wild-cards for job_queue names for the aws_batch scheduler with the following MVP queue selection algorithm (greedy):

    1. Find all job queues that match the wild-card expression
    2. For each job queue pull the details of the CE that it is hooked up to
    3. Filter the CEs down to the ones that actually support the host type + quantity that the job needs
    4. For each filtered CE look at the free resources and rank them by most-free -> least-free
    5. Pick the job queue that has the most CEs with the highest rank

    This algorithm effectively chooses the JQ to submit the job to so as to yield the least wait time in the queue.
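
    As a rough illustration of step 1, the matching queues could be enumerated with the AWS CLI (the JMESPath filter is illustrative):

    aws batch describe-job-queues \
      --query "jobQueues[?starts_with(jobQueueName, 'pt_r2p-gpu_cluster-')].jobQueueName"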

    To actually implement the greedy algorithm above, I suggest that we add chain-able selection algorithms. For instance the algorithm above can be expressed as a chain of primitives:

    jqs = get_matching("pt_r2p-gpu_cluster-*")
    jqs = filter_resource(jqs, role.resource)
    jqs = order_by(jqs, Ordering.FREE_CAPACITY, desc=True)
    
    jqs[0] # <-- select this one to run
    

    Similarly a "first-match" algorithm can be implemented as:

    get_matching("pt_r2p-gpu_cluster-*")[0]
    

    We can follow torchdata's datapipe interface such that each function in the chain has the signature:

    def fn(jqs: List[JobQueue], *args, **kwargs) -> List[JobQueue]:
        """
        Returns a sorted/filtered/manipulated list of job queues to pass to the next chained fn.
        """
        pass
    

    Alternatives

    1. Resources/guidelines/scripts to migrate from FIFO queues to Fair-Share.

    Additional context/links

    N/A

    opened by kiukchung 1
Releases (v0.4.0)
  • v0.4.0(Dec 30, 2022)

  • v0.3.0(Oct 27, 2022)

  • v0.2.0(Jun 15, 2022)

  • v0.1.2(Mar 29, 2022)

    Full Changelog: https://github.com/pytorch/torchx/compare/v0.1.1...v0.1.2

    Milestone: https://github.com/pytorch/torchx/milestones/3

    • PyTorch 1.11 Support
    • Python 3.10 Support
    • torchx.workspace
      • TorchX now supports a concept of workspaces. This enables seamless launching of jobs using changes present in your local workspace. For Docker based schedulers, we automatically build a new docker container on job launch making it easier than ever to run experiments. #333
    • torchx.schedulers
      • Ray #329
        • Newly added Ray scheduler makes it easy to launch jobs on Ray.
        • https://pytorch.medium.com/large-scale-distributed-training-with-torchx-and-ray-1d09a329aacb
      • AWS Batch #381
        • Newly added AWS Batch scheduler makes it easy to launch jobs in AWS with minimal infrastructure setup.
      • Slurm
        • Slurm jobs will by default launch in the current working directory to match local_cwd and workspace behavior. #372
        • Replicas now have their own log files and can be accessed programmatically. #373
        • Support for comment, mail-user and constraint fields. #391
        • Workspace support (prototype) - Slurm jobs can now be launched in isolated experiment directories. #416
      • Kubernetes
        • Support for running jobs under service accounts. #408
        • Support for specifying instance types. #433
      • All Docker-based Schedulers (Kubernetes, Batch, Docker)
        • Added bind mount and volume supports #420, #426
        • Bug fix: Better shm support for large dataloader #429
        • Support for .dockerignore and custom Dockerfiles #401
      • Local Scheduler
        • Automatically set CUDA_VISIBLE_DEVICES #383
        • Improved log ordering #366
    • torchx.components
      • dist.ddp
        • Rendezvous works out of the box on all schedulers #400
        • Logs are now prefixed with local ranks #412
        • Can specify resources via the CLI #395
        • Can specify environment variables via the CLI #399
      • HPO
        • Ax runner now lives in the Ax repo https://github.com/facebook/Ax/commit/8e2e68f21155e918996bda0b7d97b5b9ef4e0cba
    • torchx.cli
      • .torchxconfig
        • You can now specify component argument defaults in .torchxconfig https://github.com/pytorch/torchx/commit/c37cfd7846d5a0cb527dd19c8c95e881858f8f0a
        • ~/.torchxconfig can now be used to set user level defaults. #378
        • --workspace can be configured #397
      • Color change and bug fixes #419
    • torchx.runner
      • Now supports workspace interfaces. #360
      • Returned lines now preserve whitespace to provide support for progress bars #425
      • Events are now logged to torch.monitor when available. #379
    • torchx.notebook (prototype)
      • Added new workspace interface for developing models and launching jobs via a Jupyter Notebook. #356
    • Docs
      • Improvements to clarify TorchX usage w/ workspaces and general cleanups.
      • #374, #402, #404, #407, #434
  • v0.1.1(Nov 18, 2021)

  • v0.1.0(Oct 21, 2021)

  • v0.1.0rc1(Oct 18, 2021)

  • v0.1.0rc0(Oct 5, 2021)

  • v0.1.0b0(Jun 29, 2021)

  • v0.1.0.dev2(Jun 24, 2021)

  • v0.1.0.dev1(Jun 17, 2021)

  • v0.1.0.dev0(Jun 16, 2021)
