A Lightweight Cluster/Cloud VM Job Management Tool 🚀

Overview

Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is meant to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

  • MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
  • MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the notebook blog or the example scripts 📖

Example scripts are provided for Colab, local, Slurm, Grid Engine, SSH, and GCP setups.

Installation

pip install mle-scheduler
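
The latest development version can presumably also be installed straight from source (assuming the repository lives at github.com/RobertTLange/mle-scheduler, the author handle referenced in the development section below):

pip install git+https://github.com/RobertTLange/mle-scheduler.git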

Managing a Single Job with MLEJob Locally 🚀

from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1
)

_ = job.run()
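
MLEJob composes and launches the command shown in the comment above; the training script itself is yours to provide. A minimal, hypothetical train.py that parses the flags passed by the scheduler might look like this (the exact flag names, e.g. -seed_id vs. -seed, follow the commands printed in the comments):

import argparse

if __name__ == "__main__":
    # Parse the flags that MLEJob/MLEQueue append to the launch command
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", type=str, default="base_config_1.yaml")
    parser.add_argument("-exp_dir", type=str, default="logs_single")
    parser.add_argument("-seed_id", type=int, default=0)
    args = parser.parse_args()

    # ... load the YAML config, seed your RNGs and run the training loop,
    # writing all checkpoints/logs into args.exp_dir
    print(f"Config: {args.config} | Seed: {args.seed_id} | Dir: {args.exp_dir}")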

Managing a Queue of Jobs with MLEQueue Locally 🚀 ... 🚀

from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<timestamp>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<timestamp>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<timestamp>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<timestamp>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue"
)

queue.run()
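
After the queue finishes, the logs of the individual runs can be merged and post-processed. Below is a hedged sketch based on the merge_configs usage quoted in the comments further down; it assumes the runs wrote their logs with the companion mle-logging package (which provides load_meta_log) and that the merged file ends up in the experiment directory:

from mle_logging import load_meta_log  # assumption: logs were written via mle-logging

# Merge logs across configs & random seeds into a single meta log
queue.merge_configs(merge_seeds=True)

# Path is an assumption based on experiment_dir="logs_queue" above
meta_log = load_meta_log("logs_queue/meta_log.hdf5")
print(queue.mle_run_ids)  # One run id per launched config/seed combination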

Launching Slurm Cluster-Based Jobs 🐒

", # Partition to schedule jobs on "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True, # Whether to use anaconda venv "num_logical_cores": 5, # Number of requested CPU cores per job "num_gpus": 1, # Number of requested GPUs per job "gpu_type": "V100S", # GPU model requested for each job "modules_to_load": "nvidia/cuda/10.0" # Modules to load at start-up } queue = MLEQueue( resource_to_run="slurm-cluster", job_filename="train.py", job_arguments=job_args, config_filenames=["base_config_1.yaml", "base_config_2.yaml"], experiment_dir="logs_slurm", random_seeds=[0, 1] ) queue.run() ">
# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "
   
    "
   ,  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()

Launching GridEngine Cluster-Based Jobs 🐘

", # Queue to schedule jobs on "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True, # Whether to use anaconda venv "num_logical_cores": 5, # Number of requested CPU cores per job "num_gpus": 1, # Number of requested GPUs per job "gpu_type": "V100S", # GPU model requested for each job "gpu_prefix": "cuda" #$ -l {gpu_prefix}="{num_gpus}" } queue = MLEQueue( resource_to_run="slurm-cluster", job_filename="train.py", job_arguments=job_args, config_filenames=["base_config_1.yaml", "base_config_2.yaml"], experiment_dir="logs_grid_engine", random_seeds=[0, 1] ) queue.run() ">
# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "
   
    "
   ,  # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "gpu_prefix": "cuda"  #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1]
)
queue.run()

Launching SSH Server-Based Jobs 🦊

", # SSH server user name "pkey_path": " ", # Private key path (e.g. ~/.ssh/id_rsa) "main_server": " ", # SSH Server address "jump_server": '', # Jump host address "ssh_port": 22, # SSH port "remote_dir": "mle-code-dir", # Dir to sync code to on server "start_up_copy_dir": True, # Whether to copy code to server "clean_up_remote_dir": True # Whether to delete remote_dir on exit } job_args = { "env_name": "mle-toolbox", # Env to activate at job start-up "use_conda_venv": True # Whether to use anaconda venv } queue = MLEQueue( resource_to_run="ssh-node", job_filename="train.py", config_filenames=["base_config_1.yaml", "base_config_2.yaml"], random_seeds=[0, 1], experiment_dir="logs_ssh_queue", job_arguments=job_args, ssh_settings=ssh_settings) queue.run() ">
ssh_settings = {
    "user_name": "
     
      "
     ,  # SSH server user name
    "pkey_path": "
     
      "
     ,  # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "
     
      "
     ,  # SSH Server address
    "jump_server": '',  # Jump host address
    "ssh_port": 22,  # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,  # Whether to copy code to server
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True  # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings)

queue.run()
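
MLEJob itself also supports remote resources (see the overview), so the same ssh_settings and job_arguments can presumably be reused for a single run. A hedged sketch - whether MLEJob accepts these exact keyword arguments is an assumption based on the MLEQueue call above:

from mle_scheduler import MLEJob

# Hypothetical single-job variant of the SSH queue above - assumes MLEJob
# takes the same ssh_settings/job_arguments keywords as MLEQueue
job = MLEJob(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_ssh_single",
    seed_id=0,
    job_arguments=job_args,
    ssh_settings=ssh_settings,
)
job.run()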

Launching GCP VM-Based Jobs 🦄

", # Name of your GCP project "bucket_name": " ", # Name of your GCS bucket "remote_dir": " ", # Name of code dir in bucket "start_up_copy_dir": True, # Whether to copy code to bucket "clean_up_remote_dir": True # Whether to delete remote_dir on exit } job_args = { "num_gpus": 0, # Number of requested GPUs per job "gpu_type": None, # GPU requested e.g. "nvidia-tesla-v100" "num_logical_cores": 1, # Number of requested CPU cores per job } queue = MLEQueue( resource_to_run="gcp-cloud", job_filename="train.py", config_filenames=["base_config_1.yaml", "base_config_2.yaml"], random_seeds=[0, 1], experiment_dir="logs_gcp_queue", job_arguments=job_args, cloud_settings=cloud_settings, ) queue.run() ">
cloud_settings = {
    "project_name": "
     
      "
     ,  # Name of your GCP project
    "bucket_name": "
     
      "
     , # Name of your GCS bucket
    "remote_dir": "
     
      "
     ,  # Name of code dir in bucket
    "start_up_copy_dir": True,  # Whether to copy code to bucket
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,  # Number of requested GPUs per job
    "gpu_type": None,  # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()

Development & Milestones for Next Release

You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗. In future releases I plan on implementing the following:

  • Clean up TPU GCP VM & JAX dependencies case
  • Add local launching of cluster jobs via SSH to headnode
  • Add Docker/Singularity container setup support
  • Add Azure support
  • Add AWS support
Comments
  • use sys.executable instead of 'python'

    On some systems (like mine, when I run locally with conda), the Python executable is not "python". I used a global variable here - not sure if that's the best way, but it allows for cases where we don't want the executable to be the same as sys.executable (e.g. if we want to execute the job on a different Python interpreter than the one we are using).

    opened by boazbk 4
  • Handle case when experiment_dir is not provided

    At the moment if "experiment_dir" is None, then cmd_line_args is not initialized, and hence future lines like cmd_line_args += " -config " + self.config_filename will fail.

    The proposed change just initializes cmd_line_args to the empty string, and then adds all options to it later.

    opened by boazbk 2
  • [Feature] Make `meta_log` accessible from queue

    Instead of having to ...

    # Merge logs of random seeds & configs -> load & get final scores
    queue.merge_configs(merge_seeds=True)
    meta_log = load_meta_log("logs_search/meta_log.hdf5")
    test_scores = [meta_log[r].stats.test_loss.mean[-1] for r in queue.mle_run_ids]
    

    it would be great if load_meta_log were already called within the MLEQueue when merge_configs is invoked.

    opened by RobertTLange 1
  • Handling Errors thrown in GCP VMs

    Complete newbie to using VMs, so I'm guessing this will be a rookie question.

    If an error is encountered when executing a job on a GCP VM, what are the best practices for handling them? I'm not even sure how to know if there was an error, which obviously complicates the debugging process.

    opened by wbrenton 0
  • Cmd capture

    • Adds MLEQueue option to delete config after job has finished
    • Adds debug_mode option to store stdout & stderr to files - partially addresses #3
    • Adds merging/loading of generated logs in MLEQueue w. automerge_configs option
    • Use system executable python version
    opened by RobertTLange 0
  • What environment does it depend on?

    It's great that you have built such a nice tool for job scheduling. I want to know what environment it depends on, and whether it can run in a Kubernetes/Docker environment. Thanks!

    opened by kongjibai 0
Releases(v0.0.5)
  • v0.0.5(Jan 5, 2022)

    • Adds MLEQueue option to delete config after job has finished (delete_config)
    • Adds debug_mode option to store stdout & stderr to files
    • Adds merging/loading of generated logs in MLEQueue w. automerge_configs option
    • Use system executable python version
  • v0.0.4(Dec 7, 2021)

    • [x] Track config base strings for auto-merging of mle-logs & add merge_configs
    • [x] Allow scheduling on multiple partitions via -p <part1>,<part2> & queues via -q <queue1>,<queue2>
  • v0.0.1(Nov 12, 2021)

    First release 🤗 implementing core API of MLEJob and MLEQueue

    # Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
    job_args = {
        "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
        "env_name": "mle-toolbox",  # Env to activate at job start-up
        "use_conda_venv": True,  # Whether to use anaconda venv
        "num_logical_cores": 5,  # Number of requested CPU cores per job
        "num_gpus": 1,  # Number of requested GPUs per job
        "gpu_type": "V100S",  # GPU model requested for each job
        "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
    }
    
    queue = MLEQueue(
        resource_to_run="slurm-cluster",
        job_filename="train.py",
        job_arguments=job_args,
        config_filenames=["base_config_1.yaml",
                          "base_config_2.yaml"],
        experiment_dir="logs_slurm",
        random_seeds=[0, 1]
    )
    queue.run()
    