TorchX is a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline.

Last update: Dec 22, 2022

Overview

TorchX

For the latest documentation, please refer to our website.

Requirements

TorchX SDK (torchx):

python3 (3.8+)
torch

TorchX Kubeflow Pipelines Support (torchx-kfp):

torchx
kfp

Installation

# install torchx sdk and CLI
pip install torchx

# install torchx kubeflow pipelines (kfp) support
pip install "torchx[kfp]"

Quickstart

See the quickstart guide.

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

TorchX is BSD licensed, as found in the LICENSE file.

Comments

cli: defer loading schedulers until used

This updates the cli, schedulers and runners so that the schedulers are only imported when they need to be used. This means only the relevant scheduler is loaded which vastly improves responsiveness.

torchx --help: 900ms -> 300ms
torchx status local_cwd://: 1.38s -> 840ms

Breaking changes:

schedulers.get_schedulers() has been removed in favor of schedulers.get_scheduler_factories() since it's too dangerous
Runner.run_opts() now takes a specific scheduler name instead of returning the runopts of all schedulers, since that requires loading all schedulers.

get_schedulers is an internal interface, and downstream users should be using the runner interface, so changing this shouldn't be an issue.

Runner.run_opts() is part of the user-facing Runner interface, but it's only really practical for the CLI, so I doubt there's any OSS usage of it.
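The deferred-loading idea described above can be illustrated with a minimal sketch. It assumes a registry that maps scheduler names to import paths and only imports a module when its loader is called. The registry contents and the loader shape are hypothetical and use standard-library modules as stand-ins; they are not the actual TorchX implementation of get_scheduler_factories().

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical registry: scheduler name -> "module:attribute" of its factory.
# Standard-library targets stand in for real scheduler modules so the sketch
# runs anywhere; none of these entries come from TorchX.
_FACTORY_PATHS: Dict[str, str] = {
    "json_like": "json:dumps",
    "csv_like": "csv:writer",
}


def get_scheduler_factories() -> Dict[str, Callable[[], Any]]:
    """Return cheap loader callables; nothing is imported until one is called."""

    def make_loader(path: str) -> Callable[[], Any]:
        module_name, _, attr = path.partition(":")

        def load() -> Any:
            # The import cost is paid here, and only for the requested scheduler.
            module = importlib.import_module(module_name)
            return getattr(module, attr)

        return load

    return {name: make_loader(path) for name, path in _FACTORY_PATHS.items()}


factories = get_scheduler_factories()
create = factories["json_like"]()  # imports only the "json" module
print(create({"scheduler": "json_like"}))
```

Listing the available names stays cheap because building the dictionary never touches the underlying modules, which is the property that makes a command like torchx --help faster.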

Test plan:

(torchx) tristanr@tristanr-arch2 ~/D/torchx (deferload)> time torchx --help
usage: torchx [-h] [--log_level LOG_LEVEL] [--version] {builtins,cancel,configure,describe,log,run,runopts,status} ...

torchx CLI

optional arguments:
  -h, --help            show this help message and exit
  --log_level LOG_LEVEL
                        Python logging log level
  --version             show program's version number and exit

sub-commands:
  Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}

  {builtins,cancel,configure,describe,log,run,runopts,status}

________________________________________________________
Executed in  300.70 millis    fish           external
   usr time  280.99 millis  916.00 micros  280.07 millis
   sys time   16.46 millis    0.00 micros   16.46 millis

(torchx) tristanr@tristanr-arch2 ~/D/torchx (deferload)> time torchx status local_docker://torchx/sh-lsgcv92jm13ps
torchx 2022-06-24 15:55:58 INFO     AppDef:
  State: SUCCEEDED
  Num Restarts: -1
Roles:
 *sh[0]:SUCCEEDED
  sh[1]:SUCCEEDED
  sh[2]:SUCCEEDED

________________________________________________________
Executed in  507.45 millis    fish           external
   usr time  472.78 millis  926.00 micros  471.86 millis
   sys time   19.77 millis    0.00 micros   19.77 millis

CLA Signed

opened by d4l3k 17

Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

Summary: Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

Differential Revision: D30980471
CLA Signed fb-exported

opened by aivanou 17

API to list jobs by scheduler

Given a scheduler, torchx list -s <scheduler_name> lists the jobs scheduled on it. This is an MVP that supports listing jobs for the kubernetes scheduler, to address https://github.com/pytorch/torchx/issues/503. More features, such as listing only jobs started by torchx, listing active jobs, and filtering options, will be implemented in follow-up PRs.

Test plan:

(venv38) ubuntu@ip:~/wp/torchx$ torchx list -s kubernetes
kubernetes://default/default:cifar-trainer-a5qvfhe1hyb1b
kubernetes://default/default:cifar-trainer-d796ei2tdf4bc
kubernetes://default/default:cifar-trainer-em0iao2m9agog
kubernetes://default/default:cifar-trainer-ew33oxmdg7t0c
kubernetes://default/default:cifar-trainer-grjsnfxeinqbd
kubernetes://default/default:cifar-trainer-p4bwlewvwmt1f
kubernetes://default/default:cifar-trainer-pwy4c4omfufff
kubernetes://default/default:cifar-trainer-qyan5cp6vz5od
kubernetes://default/default:cifar-trainer-rbrd5krzkyz3c
kubernetes://default/default:cifar-trainer-rw9kn5rtau6se
kubernetes://default/default:cifar-trainer-tz45941r7u1b
kubernetes://default/default:cifar-trainer-uycdlshurnf6c
kubernetes://default/default:cifar-trainer-vstsr7qtecaif
kubernetes://default/default:cifar-trainer-xnqu2mxdc2a5b
.
.
.
.
kubernetes://default/default:train-zhqz9fhr7h0tsc
kubernetes://default/default:trainer-bwhcp3vw2khftc
kubernetes://default/default:trainer-m3mxh6dcxgwtv

CLA Signed

opened by priyaramani 15

ray_scheduler: workspace + fixed no role logging

This updates Ray to have proper workspace support.

-c working_dir=... is deprecated in favor of torchx run --workspace=...
-c requirements=... is optional and requirements.txt will be automatically read from the workspace if present
torchx log ray://foo/bar works without requiring /ray/0
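The requirements behavior in the list above (an explicit file is optional, and requirements.txt is picked up from the workspace when present) might look roughly like the sketch below. The function name and its arguments are hypothetical and only illustrate the lookup order; this is not the Ray scheduler's code.

```python
from pathlib import Path
from typing import List, Optional


def resolve_requirements(workspace: str, explicit: Optional[str] = None) -> List[str]:
    """Return requirement lines, preferring an explicit file over the workspace default."""
    # Hypothetical lookup order: explicit path first, else <workspace>/requirements.txt.
    path = Path(explicit) if explicit else Path(workspace) / "requirements.txt"
    if not path.is_file():
        return []  # no requirements file found, so nothing extra to install
    lines = (line.strip() for line in path.read_text().splitlines())
    # Drop blank lines and comment lines.
    return [line for line in lines if line and not line.startswith("#")]
```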

Test plan:

(torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx run -s ray --wait --log dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size
torchx 2022-05-18 16:55:31 INFO     Checking for changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray`...
torchx 2022-05-18 16:55:31 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2022-05-18 16:55:31 INFO     Built new image `/tmp/torchx_workspacebe6331jv` based on original image `ghcr.io/pytorch/torchx:0.2.0dev0` and changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray` for role[0]=compute_world_size.
torchx 2022-05-18 16:55:31 WARNING  The Ray scheduler does not support port mapping.
torchx 2022-05-18 16:55:31 INFO     Uploading package gcs://_ray_pkg_63a39f7096dfa0bd.zip.
torchx 2022-05-18 16:55:31 INFO     Creating a file package for local directory '/tmp/torchx_workspacebe6331jv'.
ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
torchx 2022-05-18 16:55:31 INFO     Launched app: ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
torchx 2022-05-18 16:55:31 INFO     AppStatus:
  msg: PENDING
  num_restarts: -1
  roles:
  - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 2
      structured_error_msg: <NONE>
    role: ray
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-05-18 16:55:31 INFO     Job URL: None
torchx 2022-05-18 16:55:31 INFO     Waiting for the app to finish...
torchx 2022-05-18 16:55:31 INFO     Waiting for app to start before logging...
torchx 2022-05-18 16:55:43 INFO     Job finished: SUCCEEDED
(torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
ray/0 Waiting for placement group to start.
ray/0 running ray.wait on [ObjectRef(8f2664c081ffc268e1c4275021ead9801a8d33861a00000001000000), ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
ray/0 (CommandActor pid=494377) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
ray/0 (CommandActor pid=494377)   entrypoint       : scripts.compute_world_size
ray/0 (CommandActor pid=494377)   min_nodes        : 2
ray/0 (CommandActor pid=494377)   max_nodes        : 2
ray/0 (CommandActor pid=494377)   nproc_per_node   : 1
ray/0 (CommandActor pid=494377)   run_id           : compute_world_size-mpr03nzqvvg3td
ray/0 (CommandActor pid=494377)   rdzv_backend     : c10d
ray/0 (CommandActor pid=494377)   rdzv_endpoint    : localhost:29500
ray/0 (CommandActor pid=494377)   rdzv_configs     : {'timeout': 900}
ray/0 (CommandActor pid=494377)   max_restarts     : 0
ray/0 (CommandActor pid=494377)   monitor_interval : 5
ray/0 (CommandActor pid=494377)   log_dir          : None
ray/0 (CommandActor pid=494377)   metrics_cfg      : {}
ray/0 (CommandActor pid=494377)
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
ray/0 (CommandActor pid=494406) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
ray/0 (CommandActor pid=494406)   entrypoint       : scripts.compute_world_size
ray/0 (CommandActor pid=494406)   min_nodes        : 2
ray/0 (CommandActor pid=494406)   max_nodes        : 2
ray/0 (CommandActor pid=494406)   nproc_per_node   : 1
ray/0 (CommandActor pid=494406)   run_id           : compute_world_size-mpr03nzqvvg3td
ray/0 (CommandActor pid=494406)   rdzv_backend     : c10d
ray/0 (CommandActor pid=494406)   rdzv_endpoint    : 172.26.20.254:29500
ray/0 (CommandActor pid=494406)   rdzv_configs     : {'timeout': 900}
ray/0 (CommandActor pid=494406)   max_restarts     : 0
ray/0 (CommandActor pid=494406)   monitor_interval : 5
ray/0 (CommandActor pid=494406)   log_dir          : None
ray/0 (CommandActor pid=494406)   metrics_cfg      : {}
ray/0 (CommandActor pid=494406)
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
ray/0 (CommandActor pid=494377)   restart_count=0
ray/0 (CommandActor pid=494377)   master_addr=tristanr-arch2
ray/0 (CommandActor pid=494377)   master_port=48089
ray/0 (CommandActor pid=494377)   group_rank=1
ray/0 (CommandActor pid=494377)   group_world_size=2
ray/0 (CommandActor pid=494377)   local_ranks=[0]
ray/0 (CommandActor pid=494377)   role_ranks=[1]
ray/0 (CommandActor pid=494377)   global_ranks=[1]
ray/0 (CommandActor pid=494377)   role_world_sizes=[2]
ray/0 (CommandActor pid=494377)   global_world_sizes=[2]
ray/0 (CommandActor pid=494377)
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t/attempt_0/0/error.json
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
ray/0 (CommandActor pid=494406)   restart_count=0
ray/0 (CommandActor pid=494406)   master_addr=tristanr-arch2
ray/0 (CommandActor pid=494406)   master_port=48089
ray/0 (CommandActor pid=494406)   group_rank=0
ray/0 (CommandActor pid=494406)   group_world_size=2
ray/0 (CommandActor pid=494406)   local_ranks=[0]
ray/0 (CommandActor pid=494406)   role_ranks=[0]
ray/0 (CommandActor pid=494406)   global_ranks=[0]
ray/0 (CommandActor pid=494406)   role_world_sizes=[2]
ray/0 (CommandActor pid=494406)   global_world_sizes=[2]
ray/0 (CommandActor pid=494406)
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p/attempt_0/0/error.json
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.000942230224609375 seconds
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0013003349304199219 seconds
ray/0 (CommandActor pid=494377) [0]:initializing `gloo` process group
ray/0 (CommandActor pid=494377) [0]:successfully initialized process group
ray/0 (CommandActor pid=494377) [0]:rank: 1, actual world_size: 2, computed world_size: 2
ray/0 (CommandActor pid=494406) [0]:initializing `gloo` process group
ray/0 (CommandActor pid=494406) [0]:successfully initialized process group
ray/0 (CommandActor pid=494406) [0]:rank: 0, actual world_size: 2, computed world_size: 2
ray/0 running ray.wait on [ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]

CLA Signed

opened by d4l3k 14

Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)
🐛 Bug

Not really a TorchX bug per-se, but opening this issue to keep track of this investigation.

There seems to be a perf drop when running the same fairseq trainer on AWS Batch on p3dn (w/ EFA) versus running on the same host baremetal (no Docker).

Module (check all that applies):

[ ] torchx.spec

[ ] torchx.component

[ ] torchx.apps

[ ] torchx.runtime

[ ] torchx.cli

[ ] torchx.schedulers

[ ] torchx.pipelines

[ ] torchx.aws

[ ] torchx.examples

[x] other

Below is a table summarizing the different runs:

Common Params:

Instance Type: p3dn.24xlarge

Num Nodes: 2

Nodes in same placement group: Yes

Environment Variables:

a. LOGLEVEL=INFO
b. NCCL_SOCKET_IFNAME=eth0
c. NCCL_ALGO=RING
d. NCCL_PROTO=simple
e. FI_PROVIDER=efa
f. FI_EFA_USE_DEVICE_RDMA=1

In all cases RDMA is not enabled (it seems to be unavailable on p3dn, see: https://github.com/aws/aws-ofi-nccl/issues/104).

| Scheduler | Docker | EFA picked up by NCCL | NCCL_ALGO | Performance | Log |
|:----------|:-------|:----------------------|:----------|:------------|:----|
| BareMetal | No | No | Tree | baseline (~110k wpm) | gist-link |
| BareMetal | Yes | Yes | Tree | WIP | |
| AWS Batch | Yes | Yes | Ring | ~44k wpm | |
| AWS Batch | Yes | Yes | Tree | ~70k wpm | |

Scheduler BareMetal: ssh onto the nodes and kick off the trainer manually

Docker Yes/No: the trainer runs in Docker (e.g. docker run) versus directly on the host (python -m torch.distributed.launch)

Logs

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment

torchx version (e.g. 0.1.0rc1): torchx-nightly

Python version: 3.8+

OS (e.g., Linux): Amazon Linux2

How you installed torchx (conda, pip, source, docker): pip

Docker image and tag (if using docker):

Git commit (if installed from source): N/A

Execution environment (on-prem, AWS, GCP, Azure etc): AWS Batch

Any other relevant information:

Additional context

bug aws_batch

opened by kiukchung 14

schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

This adds a scheduler that allows for launching TorchX jobs directly on AWS Batch. This requires almost no infrastructure setup (just UI) and provides a fairly sane Docker job engine.

Test plan:

scripts/awsbatchint.sh
pytest torchx/schedulers/test/aws_batch_scheduler_test.py
pyre

CLA Signed

opened by d4l3k 14

Ray scheduler driver and job api

To run distributed pytorch test

cd torch/torchx/schedulers/test
python -m unittest ray_scheduler_test.py

2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
status: PENDING
status: PENDING
status: RUNNING
status: RUNNING
status: RUNNING
status: RUNNING
status: SUCCEEDED
2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
(CommandActor pid=3575) initializing `gloo` process group
(CommandActor pid=3574) initializing `gloo` process group
(CommandActor pid=3575) successfully initialized process group
(CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
(CommandActor pid=3574) successfully initialized process group
(CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2

To run the distributed PyTorch test using the TorchX CLI

# Setup cluster and get a HEAD NODE IP
ray up -y ray_cluster.yaml
pip install "torchx[dev]"

# Get a job ID from deployed job
torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=aivanou_test utils.binary --entrypoint ray_simple.py

# Use job ID to get logs or job status
torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW

CLA Signed

opened by msaroufim 14

Move `examples` under `torchx` module

Summary: The diff moves examples under the torchx namespace, removes the examples Dockerfile, and makes the torchx image use dev-requirements.

Differential Revision: D31464358
fb-exported

opened by aivanou 14
github/kfp: run integration tests even for external users
This splits the kfp integration tests into two steps.

without secrets on PR branch: builds the pipeline + containers and saves them as a GitHub artifact

with secrets from master branch: loads the artifact and launches it on the KFP cluster

Test plan:

scripts/kfpint.py --path /tmp/foo --save
scripts/kfpint.py --path /tmp/foo --load

CI
CLA Signed Merged
opened by d4l3k 14
add LSF scheduler
I prototyped the LSF scheduler for torchx. It supports native, Docker, and Singularity as runtime with a shared filesystem at this moment. I confirmed it worked with Gloo and NCCL on small VPC V100 clusters.

Note: torchx log command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., bsub). For distributed apps, it creates multiple bsub jobs. I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster; the dryrun results are included there.

component_integration_tests.lsf.txt
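As a rough illustration of the translation described above, the sketch below builds one bsub command line per replica. The function name, the job naming scheme, and the output file layout are assumptions made for this example and are not taken from the PR; only the bsub options -J (job name), -cwd (working directory), -o (stdout file), and -e (stderr file) are standard LSF flags.

```python
import shlex
from typing import List


def build_bsub_commands(app_id: str, role: str, entrypoint: List[str],
                        num_replicas: int, jobdir: str) -> List[str]:
    """Build one bsub submission per replica (hypothetical naming and layout)."""
    commands = []
    for replica in range(num_replicas):
        job_name = f"{app_id}-{role}-{replica}"
        argv = [
            "bsub",
            "-J", job_name,                    # LSF job name
            "-cwd", jobdir,                    # working directory on the shared filesystem
            "-o", f"{jobdir}/{job_name}.out",  # where LSF writes stdout
            "-e", f"{jobdir}/{job_name}.err",  # where LSF writes stderr
            *entrypoint,                       # the command the replica runs
        ]
        commands.append(shlex.join(argv))
    return commands


for cmd in build_bsub_commands("echo-abc", "echo", ["echo", "hello_world"], 3, "/mnt/data/torchx"):
    print(cmd)
```

Writing stdout and stderr under the job directory is one way a log command could later read them back, which would also explain why a shared filesystem is needed for torchx log.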

Regarding Singularity image compatibility, Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate singularity-exec arguments from torchx requests. Note that users still need to set the docker:// prefix on image names if they want to use Docker images.

The following are example commands.

Example: native hello_world and CLI utils

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0

Example: Docker hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3

Example: Singularity hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3

Example: Docker Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

Example: Singularity Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
CLA Signed
opened by takeshi-yoshimura 13

c10d communication failure on k8s cluster

🐛 Bug

cross-referencing the issue posted here:

https://github.com/pytorch/pytorch/issues/80772

which is about a race condition between the rank 0 pod registering with DNS and the c10d service trying to look it up.
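To make the race concrete: if a worker tries to resolve the rank 0 hostname before the DNS record exists, the lookup fails immediately. One possible mitigation, shown here only as a sketch and not as the fix adopted upstream, is to poll name resolution with a timeout before starting rendezvous. The function name and its default values are hypothetical.

```python
import socket
import time


def wait_for_dns(hostname: str, port: int,
                 timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until `hostname` resolves, or return False once the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            socket.getaddrinfo(hostname, port)
            return True  # the name is registered and resolvable
        except socket.gaierror:
            # Not resolvable yet: give up if out of time, otherwise retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A worker could call this with the rendezvous endpoint host and port before calling dist.init_process_group, and fail with a clear message if it returns False.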

To Reproduce

Steps to reproduce the behavior:

run the code with torchx run --scheduler kubernetes dist.ddp -j 8x1 --script cifar_dist.py
see the pods scheduled and then erroring

Code:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader

from torchvision.models import resnet18
from torchvision.datasets import CIFAR10
import torchvision.transforms as transforms

from torch.nn.parallel import DistributedDataParallel as DDP
import random
import numpy as np
from time import sleep

model_dir = '/model'
model_filename = 'mymodel'
batch_size = 128


def set_random_seeds(seed=0):

    torch.manual_seed(seed)
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)


def evaluate(model, device, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = correct / total
    return accuracy


dist.init_process_group("gloo")
rank = dist.get_rank()
world_size = dist.get_world_size()

if torch.cuda.is_available():
    # torch.cuda.device is a context manager; torch.device builds the device object
    device = torch.device(f'cuda:{rank}')
else:
    device = torch.device('cpu')


print(f"Running basic DDP example on rank {rank} of {world_size}")

if rank == 0:
    model_filepath = os.path.join(model_dir, model_filename)
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)

set_random_seeds()

print('create model...')
# create model
model = resnet18(pretrained=False)
if torch.cuda.is_available():
    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')
    # DDP expects the module to already live on the device listed in device_ids
    model = model.to(device)
    ddp_model = DDP(model, device_ids=[rank], output_device=rank)
else:
    device = torch.device('cpu')
    ddp_model = DDP(model)

print('prepare dataset...')
# Prepare dataset and dataloader
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

train_set = CIFAR10(root="data", train=True, download=True, transform=transform) 
test_set = CIFAR10(root="data", train=False, download=True, transform=transform)

train_sampler = DistributedSampler(dataset=train_set)

train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=0)
# Test loader does not have to follow distributed sampling strategy
test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=0)

print('setup model...')

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)

for epoch in range(1000):
    print(f'Epoch {epoch}')
    if epoch % 10 == 0:
        if rank == 0:
            accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
            torch.save(ddp_model.state_dict(), model_filepath)
            print("-" * 75)
            print(f"Epoch: {epoch}, Accuracy: {accuracy}")
            print("-" * 75)
        else:
            print("-" * 75)
            print(f"Epoch: {epoch}")
            print("-" * 75)

    ddp_model.train()

    i = 0
    for data in train_loader:
        print(i)
        i += 1
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = ddp_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Expected behavior

I expect pods to connect to c10d properly and run the code.

Environment

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.17-2136.306.1.3.el8uek.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] botorch==0.6.0
[pip3] gpytorch==1.6.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.11.0
[pip3] torch-model-archiver==0.6.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.9.1
[pip3] torchserve==0.6.0
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0
[pip3] torchx==0.2.0
[conda] blas                      1.0                         mkl  
[conda] botorch                   0.6.0                    pypi_0    pypi
[conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] gpytorch                  1.6.0                    pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.1            py38hd3c417c_0  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.21.2           py38h20f2e39_0  
[conda] numpy-base                1.21.2           py38h79a1101_0  
[conda] pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.5.10                   pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch-model-archiver      0.6.0                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchmetrics              0.9.1                    pypi_0    pypi
[conda] torchserve                0.6.0                    pypi_0    pypi
[conda] torchtext                 0.12.0                     py38    pytorch
[conda] torchvision               0.12.0               py38_cu113    pytorch
[conda] torchx                    0.2.0                    pypi_0    pypi

Execution environment: Oracle Cloud OKE cluster k8s version 1.23.4
dns: CoreDNS-1.8.6 linux/amd64, go1.16.7 BoringCrypto, 69a006c9f1a

opened by streamnsight 13

Find default namespace from kube_config.

For the kubernetes scheduler, find the default namespace from kube_config rather than using the hard-coded "default" value.

Right now the default namespace for the list method is hard-coded as "default", which does not hold in a multi-user Kubernetes cluster. The default namespace can be read from the current context of the kube configuration.
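A minimal sketch of that lookup follows, assuming the kubeconfig has already been parsed into a dictionary with the standard current-context and contexts keys. The function name is hypothetical and this is not the PR's implementation.

```python
from typing import Any, Dict


def default_namespace(kube_config: Dict[str, Any], fallback: str = "default") -> str:
    """Return the namespace of the current context, or `fallback` if none is set."""
    current = kube_config.get("current-context")
    for entry in kube_config.get("contexts") or []:
        if entry.get("name") == current:
            # A context may omit "namespace", in which case we use the fallback.
            return (entry.get("context") or {}).get("namespace") or fallback
    return fallback


# Example kubeconfig content (made-up values) after YAML parsing.
example = {
    "current-context": "team-a",
    "contexts": [
        {"name": "team-a", "context": {"cluster": "c1", "user": "u1", "namespace": "team-a-ns"}},
        {"name": "team-b", "context": {"cluster": "c1", "user": "u2"}},
    ],
}
print(default_namespace(example))
```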

Test plan:

Tested manually with a generated kube config.
CLA Signed

opened by liuwenchao 3
(torchx/aws_batch) Enable Region Selection in TorchConfig

QOL improvements for AWS region support.

Current Behavior: Use the default region specified in the .aws config.

Proposed Behavior: Allow a manual override in .torchxconfig. If there is neither an override nor a default, throw an error.
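The proposed resolution order could be sketched as follows. The region key under an [aws_batch] section is an assumption made for illustration, as is the function name; the actual option name in .torchxconfig may differ.

```python
import configparser
from typing import Optional


def resolve_region(torchxconfig_text: str, aws_default_region: Optional[str]) -> str:
    """Pick the region: .torchxconfig override first, then the AWS default, else error."""
    parser = configparser.ConfigParser()
    parser.read_string(torchxconfig_text)
    # "region" under [aws_batch] is a hypothetical key used only in this sketch.
    override = parser.get("aws_batch", "region", fallback=None)
    if override:
        return override
    if aws_default_region:
        return aws_default_region
    raise ValueError("no AWS region found: set one in .torchxconfig or in the AWS config")


print(resolve_region("[aws_batch]\nregion = us-west-2\n", "us-east-1"))
```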

Future Improvements: Add profile support. Add fargate support.
CLA Signed

opened by ashvinnihalani 3
Kubernetes: support default scheduler instead of volcano for autoscaling

Description

Right now Volcano doesn't handle autoscaling correctly since it won't schedule jobs if there's not enough resources. For single node jobs it would be nice to support the non-volcano K8s batch scheduler so the Pods will be created and can autoscale the cluster.

Detailed Proposal

Add a setting scheduler to the kubernetes scheduler to allow setting schedulerName to "default-scheduler" to use the standard logic.

https://volcano.sh/en/docs/vcjob/#schedulername

There's some prototype work that was done for TPU node prototyping in https://github.com/pytorch/torchx/commit/66e934c59b376a8c0b9aa6523fde34e81a673eb2 which required the same default scheduler logic

Alternatives

Additional context/links

opened by d4l3k 0
[torchx/schedulers - aws batch] Only display jobs with torchx tag when listing
Description

Add a tag filter on the existence of the tag key torchx.pytorch.org/version when listing jobs.

Motivation/Background

Currently, when the Batch compute environment has jobs that were not launched via torchx and we list jobs via:

torchx list -s aws_batch

It lists ALL the jobs (not just the ones that are submitted via torchx). Since we already tag the batch jobs launched with torchx with the tag key torchx.pytorch.org/version (see screenshot below) we should only list the jobs that have this tag.
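The filter itself is small. The sketch below assumes each job is available as a dictionary with a tags mapping; that shape and the function name are assumptions for illustration, not the AWS Batch response format or the scheduler's code.

```python
from typing import Any, Dict, Iterable, List

TORCHX_VERSION_TAG = "torchx.pytorch.org/version"


def only_torchx_jobs(jobs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only jobs that carry the TorchX version tag key, whatever its value."""
    return [job for job in jobs if TORCHX_VERSION_TAG in (job.get("tags") or {})]


# Made-up job records for demonstration.
jobs = [
    {"name": "trainer-1", "tags": {TORCHX_VERSION_TAG: "0.4.0"}},
    {"name": "nightly-etl", "tags": {"team": "data"}},
    {"name": "untagged"},
]
print([job["name"] for job in only_torchx_jobs(jobs)])
```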

Furthermore, we can also:

Add a unix-username tag so that we only display the jobs relevant to the user

Add a --filter option to either show all (*) jobs or filter by user or job queue or other useful criteria

Detailed Proposal

See motivation above.

Alternatives

N/A

Additional context/links

N/A
opened by kiukchung 0
[RFC][torchx/schedulers - aws batch] support wildcard job queue selection
Description

For the AWS Batch scheduler integration in torchx, support wildcards in the job_queue runopt with a simple (or configurable or extendable) queue selection logic. For instance, assume that my organization uses a job queue naming convention of the form

${TEAM_NAME}-${CLUSTER_TYPE}-${REGION}

Example:

pt_r2p-gpu_cluster-us-east-1a
pt_r2p-gpu_cluster-us-east-1b
pt_r2p-gpu_cluster-us-east-1c
pt_r2p-cpu_cluster-us-east-1a
pt_r2p-cpu_cluster-us-east-1b
pt_r2p-trainium_cluster-us-east-1a
pt_r2p-trainium_cluster-us-east-1c
pt_core-gpu_cluster-us-east-1a
pt_core-cpu_cluster-us-east-1a

If I'm on the pt_r2p team and want to submit a job to any GPU compute environment that has free capacity regardless of region, then I can use a wildcard on the ${REGION} portion of the job queue name as:

[aws_batch]
job_queue=pt_r2p-gpu_cluster-*

Motivation/Background

Ideally, with AWS Batch we create a single job queue (JQ) connected to multiple compute environments (CE) and always submit to the same JQ to have Batch figure out which CE the job needs to be submitted to. With Fair Share (FS) scheduling announced a year ago (see announcement) this is theoretically possible. However many users of AWS Batch are still using FIFO (originally supported) scheduling policy in which case having a single JQ is impractical in a multi-team use case scenario since users from other teams may affect the scheduling overhead of my team. This starts escalating pretty quickly in cases where teams BYO (bring your own) capacity.

Detailed Proposal

Support wild-cards for job_queue names for the aws_batch scheduler with the following MVP queue selection algorithm (greedy):

Find all job queues that match the wild-card expression

For each job queue pull the details of the CE that it is hooked up to

Filter the CEs down to the ones that actually support the host type + quantity that the job needs

For each filtered CE look at the free resources and rank them by most-free -> least-free

Pick the job queue that has the most CEs with the highest rank

This algorithm effectively chooses the JQ that will yield the least wait time in the queue for the job.
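The five steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the ComputeEnv and JobQueue dataclasses, their fields, and select_queue are hypothetical stand-ins for data that would come from AWS Batch, and the last step is read as "pick the queue whose best eligible CE has the most free capacity, breaking ties by the number of eligible CEs", which is one possible interpretation of the proposal.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class ComputeEnv:
    """Hypothetical, simplified view of a compute environment."""
    name: str
    instance_types: List[str]
    max_instances: int   # upper bound the CE can scale to
    free_instances: int  # currently unused capacity


@dataclass
class JobQueue:
    """Hypothetical, simplified view of a job queue and its attached CEs."""
    name: str
    compute_envs: List[ComputeEnv] = field(default_factory=list)


def select_queue(pattern: str, queues: List[JobQueue],
                 instance_type: str, num_instances: int) -> Optional[JobQueue]:
    best, best_key = None, None
    for jq in queues:
        # Step 1: keep only queues matching the wildcard expression.
        if not fnmatchcase(jq.name, pattern):
            continue
        # Steps 2-3: look at the queue's CEs and keep those that support
        # the requested host type and quantity.
        eligible = [ce for ce in jq.compute_envs
                    if instance_type in ce.instance_types
                    and ce.max_instances >= num_instances]
        if not eligible:
            continue
        # Step 4: rank eligible CEs from most free to least free.
        ranked = sorted(eligible, key=lambda ce: ce.free_instances, reverse=True)
        # Step 5 (assumed reading): compare queues by their best CE,
        # then by how many eligible CEs they have.
        key = (ranked[0].free_instances, len(ranked))
        if best_key is None or key > best_key:
            best, best_key = jq, key
    return best


queues = [
    JobQueue("pt_r2p-gpu_cluster-us-east-1a",
             [ComputeEnv("ce-1a", ["p4d.24xlarge"], 8, 2)]),
    JobQueue("pt_r2p-gpu_cluster-us-east-1b",
             [ComputeEnv("ce-1b", ["p4d.24xlarge"], 8, 6)]),
    JobQueue("pt_r2p-cpu_cluster-us-east-1a",
             [ComputeEnv("ce-cpu", ["c5.18xlarge"], 64, 60)]),
]
chosen = select_queue("pt_r2p-gpu_cluster-*", queues, "p4d.24xlarge", 4)
print(chosen.name if chosen else None)
```

The function returns None when no matching queue has an eligible CE, so a caller can decide whether to fail or fall back to a default queue.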

To actually implement the greedy algorithm above, I suggest that we add chain-able selection algorithms. For instance the algorithm above can be expressed as a chain of primitives:

jqs = get_matching("pt_r2p-gpu_cluster-*")
jqs = filter_resource(jqs, role.resource)
jqs = order_by(jqs, Ordering.FREE_CAPACITY, desc=True)
jqs[0]  # <-- select this one to run

Similarly a "first-match" algorithm can be implemented as:

get_matching("pt_r2p-gpu_cluster-*")[0]

We can follow torchdata's datapipe interface such that each function in the chain has the signature:

def fn(jqs: List[JobQueue], *args, **kwargs) -> List[JobQueue]:
    """
    Returns a sorted/filtered/manipulated list of job queues
    to pass to the next chained fn.
    """
    pass
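One way such a chain could be wired together is sketched below. Everything here is an assumption made for illustration: the JobQueue fields, the simplified filter_resource and order_by_free signatures, and the chain helper are not torchx APIs. Unlike the pseudocode above, get_matching also takes the candidate list as its first argument so that every step shares the same signature.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from functools import partial
from typing import Callable, List


@dataclass(frozen=True)
class JobQueue:
    """Hypothetical, flattened view of a queue for this sketch."""
    name: str
    gpus_per_host: int
    free_hosts: int


# Every step maps a list of job queues to a new list of job queues.
Step = Callable[[List[JobQueue]], List[JobQueue]]


def get_matching(jqs: List[JobQueue], pattern: str) -> List[JobQueue]:
    return [jq for jq in jqs if fnmatchcase(jq.name, pattern)]


def filter_resource(jqs: List[JobQueue], min_gpus: int) -> List[JobQueue]:
    return [jq for jq in jqs if jq.gpus_per_host >= min_gpus]


def order_by_free(jqs: List[JobQueue], desc: bool = True) -> List[JobQueue]:
    return sorted(jqs, key=lambda jq: jq.free_hosts, reverse=desc)


def chain(jqs: List[JobQueue], *steps: Step) -> List[JobQueue]:
    """Apply each step in order, feeding its output to the next one."""
    for step in steps:
        jqs = step(jqs)
    return jqs


all_queues = [
    JobQueue("pt_r2p-gpu_cluster-us-east-1a", gpus_per_host=8, free_hosts=1),
    JobQueue("pt_r2p-gpu_cluster-us-east-1b", gpus_per_host=8, free_hosts=5),
    JobQueue("pt_r2p-gpu_cluster-us-east-1c", gpus_per_host=4, free_hosts=9),
    JobQueue("pt_r2p-cpu_cluster-us-east-1a", gpus_per_host=0, free_hosts=20),
]

# Greedy selection: match, filter by resource, order by free capacity.
greedy = chain(
    all_queues,
    partial(get_matching, pattern="pt_r2p-gpu_cluster-*"),
    partial(filter_resource, min_gpus=8),
    partial(order_by_free, desc=True),
)
print(greedy[0].name)

# First-match selection is the same chain with a single step.
first = chain(all_queues, partial(get_matching, pattern="pt_r2p-gpu_cluster-*"))
print(first[0].name)
```

Keeping each step a plain list-to-list function means new selection policies can be added by writing one more function, without changing the chain helper.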

Alternatives

Resources/guidelines/script to migrate from FIFO queues to Fair-Share.

Additional context/links

N/A
opened by kiukchung 1

Releases(v0.4.0)

v0.4.0(Dec 30, 2022)

See: https://github.com/pytorch/torchx/blob/main/CHANGELOG.md#torchx-040
v0.3.0(Oct 27, 2022)

See https://github.com/pytorch/torchx/blob/main/CHANGELOG.md?plain=1#L3
v0.2.0(Jun 15, 2022)

See https://github.com/pytorch/torchx/blob/main/CHANGELOG.md?plain=1#L3
v0.1.2(Mar 29, 2022)
Full Changelog: https://github.com/pytorch/torchx/compare/v0.1.1...v0.1.2

Milestone: https://github.com/pytorch/torchx/milestones/3

PyTorch 1.11 Support

Python 3.10 Support

torchx.workspace

TorchX now supports a concept of workspaces. This enables seamless launching of jobs using changes present in your local workspace. For Docker based schedulers, we automatically build a new docker container on job launch making it easier than ever to run experiments. #333

torchx.schedulers

Ray #329

Newly added Ray scheduler makes it easy to launch jobs on Ray.

https://pytorch.medium.com/large-scale-distributed-training-with-torchx-and-ray-1d09a329aacb

AWS Batch #381

Newly added AWS Batch scheduler makes it easy to launch jobs in AWS with minimal infrastructure setup.

Slurm

Slurm jobs will by default launch in the current working directory to match local_cwd and workspace behavior. #372

Replicas now have their own log files and can be accessed programmatically. #373

Support for comment, mail-user and constraint fields. #391

Workspace support (prototype) - Slurm jobs can now be launched in isolated experiment directories. #416

Kubernetes

Support for running jobs under service accounts. #408

Support for specifying instance types. #433

All Docker-based Schedulers (Kubernetes, Batch, Docker)

Added bind mount and volume support #420, #426

Bug fix: Better shm support for large dataloaders #429

Support for .dockerignore and custom Dockerfiles #401

Local Scheduler

Automatically set CUDA_VISIBLE_DEVICES #383

Improved log ordering #366

torchx.components

dist.ddp

Rendezvous works out of the box on all schedulers #400

Logs are now prefixed with local ranks #412

Can specify resources via the CLI #395

Can specify environment variables via the CLI #399

HPO

Ax runner now lives in the Ax repo https://github.com/facebook/Ax/commit/8e2e68f21155e918996bda0b7d97b5b9ef4e0cba

torchx.cli

.torchxconfig

You can now specify component argument defaults in .torchxconfig https://github.com/pytorch/torchx/commit/c37cfd7846d5a0cb527dd19c8c95e881858f8f0a

~/.torchxconfig can now be used to set user level defaults. #378

--workspace can be configured #397

Color change and bug fixes #419

torchx.runner

Now supports workspace interfaces. #360

Returned lines now preserve whitespace to provide support for progress bars #425

Events are now logged to torch.monitor when available. #379

torchx.notebook (prototype)

Added new workspace interface for developing models and launching jobs via a Jupyter Notebook. #356

Docs

Improvements to clarify TorchX usage w/ workspaces and general cleanups.

#374, #402, #404, #407, #434

v0.1.1(Nov 18, 2021)

See: https://github.com/pytorch/torchx/blob/main/CHANGELOG.md#torchx-011
v0.1.0(Oct 21, 2021)

See: https://github.com/pytorch/torchx/blob/master/CHANGELOG.md
v0.1.0rc1(Oct 18, 2021)

See: https://github.com/pytorch/torchx/blob/master/CHANGELOG.md
v0.1.0rc0(Oct 5, 2021)

See: https://github.com/pytorch/torchx/blob/master/CHANGELOG.md
v0.1.0b0(Jun 29, 2021)

See: https://github.com/pytorch/torchx/blob/master/CHANGELOG.md#torchx-010b0
v0.1.0.dev2(Jun 24, 2021)

v0.1.0.dev2 release
v0.1.0.dev1(Jun 17, 2021)

v0.1.0.dev1 release.
v0.1.0.dev0(Jun 16, 2021)

v0.1.0.dev0 release.


Owner

《Learning an Intrinsic Garment Space for Interactive Authoring of Garment Animation》

magiCARP: Contrastive Authoring+Reviewing Pretraining

E2e music remastering system - End-to-end Music Remastering System Using Self-supervised and Adversarial Training

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

A library of multi-agent reinforcement learning components and systems

PyTorch3D is FAIR's library of reusable components for deep learning with 3D data

Python library containing BART query generation and BERT-based Siamese models for neural retrieval.

PyTorch CZSL framework containing GQA, the open-world setting, and the CGE and CompCos methods.

A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization components are included and optional.

Siamese-nn-semantic-text-similarity - A repository containing comprehensive Neural Networks based PyTorch implementations for the semantic text similarity task

BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond

Angular & Electron desktop UI framework. Angular components for native looking and behaving macOS desktop UI (Electron/Web)

banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services.

Ranger deep learning optimizer rewrite to use newest components

AlphaBot2 Pi Core software for interfacing with the various components.

Dcf-game-infrastructure-public - Contains all the components necessary to run a DC finals (attack-defense CTF) game from OOO