I prototyped the LSF scheduler for torchx. It supports native, Docker, and Singularity as runtime with a shared filesystem at this moment. I confirmed it worked with Gloo and NCCL on small VPC V100 clusters.
Note: torchx log
command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).
In a nutshell, the LSF scheduler translates a torchx request to be LSF job submissions (i.e., bsub
). For distributed apps, it creates multiple bsub
. I also added lsf to scripts/component_integration_tests.py. Here is the log output with my three-node LSF cluster and you can find dryrun results there.
component_integration_tests.lsf.txt
Regarding Singularity image compatibility, it already automates to convert docker images into singularity image format, and so, only we have to do is to generate singularity-exec arguments from torchx requests. Note that users still need to set prefix docker:// for image names if they want to use docker images.
The following are example commands.
Example: native hello_world and CLI utils
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
Example: Docker hello_world
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
Example: Singularity hello_world
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
Example: Docker Distributed
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
Example: Singularity Distributed
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
CLA Signed