A Lightweight Cluster/Cloud VM Job Management Tool 🚀

Overview

Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is meant to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

  • MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
  • MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the notebook blog or the example scripts 📖

Example scripts are provided for Colab, local, Slurm, Grid Engine, SSH, and GCP setups.

Installation

pip install mle-scheduler
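
The latest development version can presumably also be installed straight from source (assuming the repository lives at github.com/RobertTLange/mle-scheduler, the author handle referenced in the development section below):

pip install git+https://github.com/RobertTLange/mle-scheduler.git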

Managing a Single Job with MLEJob Locally 🚀

from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1
)

_ = job.run()
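
MLEJob composes and launches the command shown in the comment above; the training script itself is yours to provide. A minimal, hypothetical train.py that parses the flags passed by the scheduler might look like this (the exact flag names, e.g. -seed_id vs. -seed, follow the commands printed in the comments):

import argparse

if __name__ == "__main__":
    # Parse the flags that MLEJob/MLEQueue append to the launch command
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", type=str, default="base_config_1.yaml")
    parser.add_argument("-exp_dir", type=str, default="logs_single")
    parser.add_argument("-seed_id", type=int, default=0)
    args = parser.parse_args()

    # ... load the YAML config, seed your RNGs and run the training loop,
    # writing all checkpoints/logs into args.exp_dir
    print(f"Config: {args.config} | Seed: {args.seed_id} | Dir: {args.exp_dir}")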

Managing a Queue of Jobs with MLEQueue Locally 🚀 ... 🚀

from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<timestamp>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<timestamp>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<timestamp>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<timestamp>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue"
)

queue.run()
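
After the queue finishes, the logs of the individual runs can be merged and post-processed. Below is a hedged sketch based on the merge_configs usage quoted in the comments further down; it assumes the runs wrote their logs with the companion mle-logging package (which provides load_meta_log) and that the merged file ends up in the experiment directory:

from mle_logging import load_meta_log  # assumption: logs were written via mle-logging

# Merge logs across configs & random seeds into a single meta log
queue.merge_configs(merge_seeds=True)

# Path is an assumption based on experiment_dir="logs_queue" above
meta_log = load_meta_log("logs_queue/meta_log.hdf5")
print(queue.mle_run_ids)  # One run id per launched config/seed combination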

Launching Slurm Cluster-Based Jobs 🐒

", # Partition to schedule jobs on "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True, # Whether to use anaconda venv "num_logical_cores": 5, # Number of requested CPU cores per job "num_gpus": 1, # Number of requested GPUs per job "gpu_type": "V100S", # GPU model requested for each job "modules_to_load": "nvidia/cuda/10.0" # Modules to load at start-up } queue = MLEQueue( resource_to_run="slurm-cluster", job_filename="train.py", job_arguments=job_args, config_filenames=["base_config_1.yaml", "base_config_2.yaml"], experiment_dir="logs_slurm", random_seeds=[0, 1] ) queue.run() ">
# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "
   
    "
   ,  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()

Launching GridEngine Cluster-Based Jobs 🐘

", # Queue to schedule jobs on "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True, # Whether to use anaconda venv "num_logical_cores": 5, # Number of requested CPU cores per job "num_gpus": 1, # Number of requested GPUs per job "gpu_type": "V100S", # GPU model requested for each job "gpu_prefix": "cuda" #$ -l {gpu_prefix}="{num_gpus}" } queue = MLEQueue( resource_to_run="slurm-cluster", job_filename="train.py", job_arguments=job_args, config_filenames=["base_config_1.yaml", "base_config_2.yaml"], experiment_dir="logs_grid_engine", random_seeds=[0, 1] ) queue.run() ">
# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "
   
    "
   ,  # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "gpu_prefix": "cuda"  #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1]
)
queue.run()

Launching SSH Server-Based Jobs 🦊

", # SSH server user name "pkey_path": " ", # Private key path (e.g. ~/.ssh/id_rsa) "main_server": " ", # SSH Server address "jump_server": '', # Jump host address "ssh_port": 22, # SSH port "remote_dir": "mle-code-dir", # Dir to sync code to on server "start_up_copy_dir": True, # Whether to copy code to server "clean_up_remote_dir": True # Whether to delete remote_dir on exit } job_args = { "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True # Whether to use anaconda venv } queue = MLEQueue( resource_to_run="ssh-node", job_filename="train.py", config_filenames=["base_config_1.yaml", "base_config_2.yaml"], random_seeds=[0, 1], experiment_dir="logs_ssh_queue", job_arguments=job_args, ssh_settings=ssh_settings) queue.run() ">
ssh_settings = {
    "user_name": "
     
      "
     ,  # SSH server user name
    "pkey_path": "
     
      "
     ,  # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "
     
      "
     ,  # SSH Server address
    "jump_server": '',  # Jump host address
    "ssh_port": 22,  # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,  # Whether to copy code to server
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True  # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings)

queue.run()
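
MLEJob itself also supports remote resources (see the overview), so the same ssh_settings and job_arguments can presumably be reused for a single run. A hedged sketch - whether MLEJob accepts these exact keyword arguments is an assumption based on the MLEQueue call above:

from mle_scheduler import MLEJob

# Hypothetical single-job variant of the SSH queue above - assumes MLEJob
# takes the same ssh_settings/job_arguments keywords as MLEQueue
job = MLEJob(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_ssh_single",
    seed_id=0,
    job_arguments=job_args,
    ssh_settings=ssh_settings,
)
job.run()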

Launching GCP VM-Based Jobs 🦄

", # Name of your GCP project "bucket_name": " ", # Name of your GCS bucket "remote_dir": " ", # Name of code dir in bucket "start_up_copy_dir": True, # Whether to copy code to bucket "clean_up_remote_dir": True # Whether to delete remote_dir on exit } job_args = { "num_gpus": 0, # Number of requested GPUs per job "gpu_type": None, # GPU requested e.g. "nvidia-tesla-v100" "num_logical_cores": 1, # Number of requested CPU cores per job } queue = MLEQueue( resource_to_run="gcp-cloud", job_filename="train.py", config_filenames=["base_config_1.yaml", "base_config_2.yaml"], random_seeds=[0, 1], experiment_dir="logs_gcp_queue", job_arguments=job_args, cloud_settings=cloud_settings, ) queue.run() ">
cloud_settings = {
    "project_name": "
     
      "
     ,  # Name of your GCP project
    "bucket_name": "
     
      "
     , # Name of your GCS bucket
    "remote_dir": "
     
      "
     ,  # Name of code dir in bucket
    "start_up_copy_dir": True,  # Whether to copy code to bucket
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,  # Number of requested GPUs per job
    "gpu_type": None,  # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()

Development & Milestones for Next Release

You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗. In future releases I plan on implementing the following:

  • Clean up TPU GCP VM & JAX dependencies case
  • Add local launching of cluster jobs via SSH to headnode
  • Add Docker/Singularity container setup support
  • Add Azure support
  • Add AWS support
Comments
  • use sys.executable instead of 'python'

    On some systems (like mine, when I run locally with conda), the Python executable is not "python". I used a global variable here - not sure if that's the best way, but it allows for cases where we don't want the executable to be the same as sys.executable (e.g. if we want to execute the job on a different Python interpreter than the one we are using).

    opened by boazbk 4
  • Handle case when experiment_dir is not provided

    At the moment if "experiment_dir" is None, then cmd_line_args is not initialized, and hence future lines like cmd_line_args += " -config " + self.config_filename will fail.

    The proposed change just initializes cmd_line_args to the empty string, and then adds all options to it later.

    opened by boazbk 2
  • [Feature] Make `meta_log` accessible from queue

    Instead of having to ...

    # Merge logs of random seeds & configs -> load & get final scores
    queue.merge_configs(merge_seeds=True)
    meta_log = load_meta_log("logs_search/meta_log.hdf5")
    test_scores = [meta_log[r].stats.test_loss.mean[-1] for r in queue.mle_run_ids]
    

    it would be great if load_meta_log were already called within the MLEQueue when merge_configs is invoked.

    opened by RobertTLange 1
  • Handling Errors thrown in GCP VMs

    Complete newbie to using VMs, so I'm guessing this will be a rookie question.

    If an error is encountered when executing a job on a GCP VM, what are the best practices for handling them? I'm not even sure how to know if there was an error, which obviously complicates the debugging process.

    opened by wbrenton 0
  • Cmd capture

    • Adds MLEQueue option to delete config after job has finished
    • Adds debug_mode option to store stdout & stderr to files - partially addresses #3
    • Adds merging/loading of generated logs in MLEQueue w. automerge_configs option
    • Use system executable python version
    opened by RobertTLange 0
  • What environment does it depend on?

    It's great that you have built such a nice tool for job scheduling. I want to know what environment it depends on, and whether it can run in a Kubernetes/Docker environment. Thanks!

    opened by kongjibai 0
Releases(v0.0.5)
  • v0.0.5(Jan 5, 2022)

    • Adds MLEQueue option to delete config after job has finished (delete_config)
    • Adds debug_mode option to store stdout & stderr to files
    • Adds merging/loading of generated logs in MLEQueue w. automerge_configs option
    • Use system executable python version
  • v0.0.4(Dec 7, 2021)

    • [x] Track config base strings for auto-merging of mle-logs & add merge_configs
    • [x] Allow scheduling on multiple partitions via -p <part1>,<part2> & queues via -q <queue1>,<queue2>
  • v0.0.1(Nov 12, 2021)

    First release 🤗 implementing core API of MLEJob and MLEQueue

    # Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
    job_args = {
        "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
        "env_name": "mle-toolbox",  # Env to activate at job start-up
        "use_conda_venv": True,  # Whether to use anaconda venv
        "num_logical_cores": 5,  # Number of requested CPU cores per job
        "num_gpus": 1,  # Number of requested GPUs per job
        "gpu_type": "V100S",  # GPU model requested for each job
        "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
    }
    
    queue = MLEQueue(
        resource_to_run="slurm-cluster",
        job_filename="train.py",
        job_arguments=job_args,
        config_filenames=["base_config_1.yaml",
                          "base_config_2.yaml"],
        experiment_dir="logs_slurm",
        random_seeds=[0, 1]
    )
    queue.run()
    