A Lightweight Experiment & Resource Monitoring Tool 📺

Last update: Dec 28, 2022


Overview

Lightweight Experiment & Resource Monitoring 📺

"Did I already run this experiment before? How many resources are currently available on my cluster?" If these are common questions you encounter during your daily life as a researcher, then mle-monitor is made for you. It provides a lightweight API for tracking your experiments in a pickle-based protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). It also comes with built-in resource monitoring for Slurm/Grid Engine clusters and local machines/servers.

mle-monitor provides three core functionalities:

MLEProtocol: A composable protocol database API for ML experiments.
MLEResource: A tool for obtaining server/cluster usage statistics.
MLEDashboard: A dashboard visualizing resource usage & experiment protocol.

To get started I recommend checking out the colab notebook and an example workflow.

`MLEProtocol`: Keeping Track of Your Experiments 📝

from mle_monitor import MLEProtocol

# Load protocol database or create new one -> print summary
protocol_db = MLEProtocol("mle_protocol.db", verbose=False)
protocol_db.summary(tail=10, verbose=True)

# Draft data to store in protocol & add it to the protocol
meta_data = {
    "purpose": "Grid search",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "experiment_type": "hyperparameter-search",  # Type of experiment
    "experiment_dir": "experiments/logs",  # Experiment directory
    "num_total_jobs": 10,  # Number of total jobs to run
    # ... (further optional keys)
}
new_experiment_id = protocol_db.add(meta_data)

# ... train your 10 (pseudo) networks/complete respective jobs
for i in range(10):
    protocol_db.update_progress_bar(new_experiment_id)

# Wrap up an experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)
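To make the idea of a pickle-backed protocol concrete, here is a toy stand-in written with the standard library only. It is not mle-monitor's implementation: the class `ToyProtocol`, its methods, and the stored fields (`status`, `completed_jobs`, `start_time`, `stop_time`) are all hypothetical names chosen for illustration. It only sketches how "did I already run this experiment?" can be answered from a small file on disk.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


class ToyProtocol:
    """Toy pickle-backed experiment log (illustrative, not mle-monitor's code)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing entries if the file exists, otherwise start empty
        if self.path.exists():
            with self.path.open("rb") as f:
                self.entries = pickle.load(f)
        else:
            self.entries = {}

    def _save(self):
        # Rewrite the whole file after every change (fine for small logs)
        with self.path.open("wb") as f:
            pickle.dump(self.entries, f)

    def add(self, meta_data):
        """Store a new experiment and return its integer id."""
        new_id = max(self.entries, default=0) + 1
        self.entries[new_id] = {
            **meta_data,
            "status": "running",
            "completed_jobs": 0,
            "start_time": datetime.now().isoformat(timespec="seconds"),
        }
        self._save()
        return new_id

    def update_progress(self, experiment_id):
        """Count one more finished job for the experiment."""
        self.entries[experiment_id]["completed_jobs"] += 1
        self._save()

    def complete(self, experiment_id):
        """Mark the experiment as completed and store the stop time."""
        entry = self.entries[experiment_id]
        entry["status"] = "completed"
        entry["stop_time"] = datetime.now().isoformat(timespec="seconds")
        self._save()

    def find(self, **criteria):
        """Return ids of experiments whose metadata matches all criteria."""
        return [
            eid
            for eid, entry in self.entries.items()
            if all(entry.get(key) == value for key, value in criteria.items())
        ]


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "toy_protocol.pkl"
    db = ToyProtocol(db_path)
    exp_id = db.add({"project_name": "MNIST", "purpose": "Grid search"})
    for _ in range(10):
        db.update_progress(exp_id)
    db.complete(exp_id)

    # Re-open the file to check that the log survives across sessions
    reloaded = ToyProtocol(db_path)
    matches = reloaded.find(project_name="MNIST", purpose="Grid search")
    print(matches)
```

Rewriting the full pickle on every update keeps the sketch short. A real tool would likely need to handle concurrent writers and larger logs more carefully.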

The meta data can contain the following keys:

Key	Description	Default
`purpose`	Purpose of experiment	`'None provided'`
`project_name`	Project name of experiment	`'default'`
`exec_resource`	Resource jobs are run on	`'local'`
`experiment_dir`	Experiment log storage directory	`'experiments'`
`experiment_type`	Type of experiment to run	`'single'`
`base_fname`	Main code script to execute	`'main.py'`
`config_fname`	Config file path of experiment	`'base_config.yaml'`
`num_seeds`	Number of evaluation seeds	1
`num_total_jobs`	Number of total jobs to run	1
`num_job_batches`	Number of sequential job batches	1
`num_jobs_per_batch`	Number of jobs in single batch	1
`time_per_job`	Expected duration per job (days:hours:minutes)	`'00:01:00'`
`num_cpus`	Number of CPUs used in job	1
`num_gpus`	Number of GPUs used in job	0

Additionally, you can synchronize the protocol with a Google Cloud Storage (GCS) bucket by providing cloud_settings. In this case, the results stored in experiment_dir are also uploaded to the GCS bucket when you call protocol_db.complete().

# Define GCS settings - requires 'GOOGLE_APPLICATION_CREDENTIALS' env var.
cloud_settings = {
    "project_name": "mle-toolbox",  # GCP project name
    "bucket_name": "mle-protocol",  # GCS bucket name
    "use_protocol_sync": True,  # Whether to sync the protocol to GCS
    "use_results_storage": True,  # Whether to sync experiment_dir to GCS
}
protocol_db = MLEProtocol("mle_protocol.db", cloud_settings, verbose=True)
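Since the GCS settings rely on the 'GOOGLE_APPLICATION_CREDENTIALS' environment variable, it can help to check for it before enabling the sync. The helper below is a sketch under that assumption. `gcs_sync_possible` is a hypothetical name and not part of mle-monitor, and it only checks that the variable points to an existing file, not that the credentials are valid.

```python
import os
import tempfile


def gcs_sync_possible(cloud_settings, env=None):
    """Return True if a sync is requested and a credentials file is configured."""
    env = os.environ if env is None else env
    wants_sync = cloud_settings.get("use_protocol_sync", False) or cloud_settings.get(
        "use_results_storage", False
    )
    cred_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(wants_sync) and bool(cred_path) and os.path.isfile(cred_path)


settings = {
    "project_name": "mle-toolbox",
    "bucket_name": "mle-protocol",
    "use_protocol_sync": True,
    "use_results_storage": True,
}

# Without the environment variable the check fails
without_creds = gcs_sync_possible(settings, env={})

# With the variable pointing to an existing (here: dummy) file it passes
with tempfile.NamedTemporaryFile(suffix=".json") as key_file:
    fake_env = {"GOOGLE_APPLICATION_CREDENTIALS": key_file.name}
    with_creds = gcs_sync_possible(settings, env=fake_env)

print(without_creds, with_creds)
```

Passing the environment as an argument keeps the check easy to test. In normal use you would call `gcs_sync_possible(cloud_settings)` and let it read `os.environ`.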

The `MLEResource`: Keeping Track of Your Resources 📉

On Your Local Machine

from mle_monitor import MLEResource

# Instantiate local resource and get usage data
resource = MLEResource(resource_name="local")
resource_data = resource.monitor()
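For intuition about what a local usage snapshot could contain, here is a rough standard-library sketch. It does not reproduce what `resource.monitor()` returns: the function `local_snapshot` and its keys are hypothetical, and it only reports the CPU count and disk usage.

```python
import os
import shutil


def local_snapshot(path="."):
    """Collect a few basic usage numbers for the local machine."""
    total, used, free = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),  # may be None if it cannot be determined
        "disk_total_gb": round(total / 1e9, 1),
        "disk_free_gb": round(free / 1e9, 1),
        "disk_used_fraction": used / total if total else 0.0,
    }


snapshot = local_snapshot()
for key, value in snapshot.items():
    print(f"{key}: {value}")
```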

On a Slurm Cluster

resource = MLEResource(
    resource_name="slurm-cluster",
    monitor_config={"partitions": ["<partition-1>", "<partition-2>"]},
)

On a Grid Engine Cluster

resource = MLEResource(
    resource_name="sge-cluster",
    monitor_config={"queues": ["<queue-1>", "<queue-2>"]}
)

The `MLEDashboard`: Dashboard Visualization 🎞️

from mle_monitor import MLEDashboard

# Instantiate dashboard with the protocol and resource objects created above
dashboard = MLEDashboard(protocol_db, resource)

# Get a static snapshot of the protocol & resource utilisation printed in console
dashboard.snapshot()

# Run the live dashboard (monitoring in a while loop)
dashboard.live()

Installation ⏳

A PyPI installation is available via:

pip install mle-monitor

Alternatively, you can clone this repository and afterwards 'manually' install it:

git clone https://github.com/mle-infrastructure/mle-monitor.git
cd mle-monitor
pip install -e .

Development & Milestones for Next Release

You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗 .


Comments

Is the dashboard polling squeue?

Hey, Thanks for publishing the library, the dashboard looks great!

However, I was a bit concerned to see you are using squeue since the official documentation says

"Executing squeue sends a remote procedure call to slurmctld. If enough calls from squeue or other Slurm client commands that send remote procedure calls to the slurmctld daemon come in at once, it can result in a degradation of performance of the slurmctld daemon, possibly resulting in a denial of service.

Do not run squeue or other Slurm client commands that send remote procedure calls to slurmctld from loops in shell scripts or other programs. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather."

Do you poll squeue or is there some other, smarter management of it that I missed?

Thanks, Eliahu

opened by eliahuhorwitz
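One common way to address the concern raised in this issue is to cache the result of an expensive command and refresh it only after a minimum interval. The sketch below illustrates that pattern. It says nothing about how mle-monitor actually calls squeue, and `CachedCommand`, `fake_squeue` and the 30-second interval are hypothetical choices. A stand-in function and a fake clock are used so the example runs without a Slurm installation.

```python
import time


class CachedCommand:
    """Cache the result of an expensive call for `ttl` seconds."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self.fetch = fetch  # callable that performs the expensive query
        self.ttl = ttl  # minimum number of seconds between two real queries
        self.clock = clock  # injectable clock, useful for testing
        self._value = None
        self._fetched_at = None

    def get(self):
        """Return the cached value, refreshing it only if it is too old."""
        now = self.clock()
        if self._fetched_at is None or now - self._fetched_at >= self.ttl:
            self._value = self.fetch()
            self._fetched_at = now
        return self._value


calls = []


def fake_squeue():
    """Stand-in for a real squeue call, counts how often it is invoked."""
    calls.append(1)
    return f"snapshot #{len(calls)}"


fake_time = [0.0]
cached = CachedCommand(fake_squeue, ttl=30.0, clock=lambda: fake_time[0])

first = cached.get()  # no cached value yet, triggers a query
second = cached.get()  # within the interval, served from the cache
fake_time[0] = 31.0  # pretend 31 seconds have passed
third = cached.get()  # cache expired, triggers a second query

print(first, second, third, len(calls))
```

With this pattern a dashboard can redraw as often as it likes while the scheduler is queried at most once per interval.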

Releases (v0.0.1)

v0.0.1 (Dec 9, 2021)

Basic API for MLEProtocol, MLEResource & MLEDashboard.
