A template repository for submitting a job to the Slurm Cluster installed at the DISI - University of Bologna

Last update: Dec 16, 2022

Related tags

Deep Learning slurm-job-template-disi

Overview

Cluster di HPC con GPU per esperimenti di calcolo (draft version 1.0)

Per poter utilizzare il cluster il primo passo è abilitare l'account istituzionale per l'accesso ai sistemi del DISI. Se già attivo, avrai accesso con le credenziali istituzionali, anche in remoto (SSH), a tutte le macchine dei laboratori Ercolani e Ranzani.

La quota studente massima è per ora impostata a 400 MB. In caso di necessità di maggiore spazio potrai ricorrere alla creazione di una cartella in /public/ che viene di norma cancellata ogni prima domenica del mese.

/home/ utente e /public/ sono spazi di archiviazione condivisi tra le macchine, potrai dunque creare l'ambiente di esecuzione e i file necessari all'elaborazione sulla macchina SLURM (slurm.cs.unibo.it) da cui poi avviare il job che verrà eseguito sulle macchine dotate di GPU.

Istruzioni

Una possibile impostazione del lavoro potrebbe essere quella di creare un virtual environment Python inserendo all'interno tutto ciò di cui si ha bisogno e utilizzando pip per l'installazione dei moduli necessari. Le segnalo che per utilizzare Python 3 è necessario invocarlo esplicitamente in quanto sulle macchine il default è Python 2. Nel cluster sono presenti GPU Tesla pilotate con driver Nvidia v. 460.67 e librerie di computazione CUDA 11.2.1, quindi in caso di installazione di pytorch bisognerà utilizzare il comando

pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Il cluster utilizza uno schedulatore SLURM (https://slurm.schedmd.com/overview.html) per la distribuzione dei job. Per sottomettere un job bisogna predisporre nella propria area di lavoro un file di configurzione SLURM (nell'esempio sotto lo abbiamo nominato script.sbatch).

Dopo le direttive SLURM è possibile inserire comandi di script (ad es. BASH).

#!/bin/bash
#SBATCH --job-name=nomejob
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --output=nomeoutput
#SBATCH --gres=gpu:1

. bin/activate  # per attivare il virtual environment python

python test.py # per lanciare lo script python

Nell'esempio precedente:

L'istruzione da tenere immutata è --gres=gpu:1 (ogni nodo di computazione ha un'unica GPU a disposizione e deve essere attivata per poterla utilizzare).
Tutte le altre istruzioni di configurazione per SLURM possono essere personalizzate. Per la definizione di queste e altre direttive si rimanda alla documentazione ufficiale di SLURM (https://slurm.schedmd.com/sbatch.html).
Nell'esempio, dopo le istruzioni di configurazione di SLURM è stato invocato il programma.

Per poter avviare il job sulle macchine del cluster, è necessario:

accedere via SSH alla macchina slurm.cs.unibo.it con le proprie credenziali;
lanciare il comando sbatch <nomescript>.

Alcune note importanti:

saranno inviate e-mail per tutti gli evnti che riguardano il job lanciato, all'indirizzo specificato nelle istruzioni di configurazione (ad esempio al termine del job e nel caso di errori);
i risultati dell'elaborazione saranno presenti nel file <nomeoutput> indicato nelle istruzioni di configurazioni;
l'esecuzione sulle macchine avviene all'interno dello stesso path relativo che, essendo condiviso, viene visto anche dalle macchine dei laboratori e dalla macchina slurm.

Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-identification

6 Nov 25, 2022

Clustergram - Visualization and diagnostics for cluster analysis in Python

Clustergram Visualization and diagnostics for cluster analysis Clustergram is a diagram proposed by Matthias Schonlau in his paper The clustergram: A

96 Dec 26, 2022

Sub-Cluster AdaCos: Learning Representations for Anomalous Sound Detection.

Accompanying code for the paper Sub-Cluster AdaCos: Learning Representations for Anomalous Sound Detection.

6 Dec 1, 2022

Public scripts, services, and configuration for running a smart home K3S network cluster

makerhouse_network Public scripts, services, and configuration for running MakerHouse's home network. This network supports: TODO features here For mo

1 Jan 15, 2022

Template repository for managing machine learning research projects built with PyTorch-Lightning

Tutorial Repository with a minimal example for showing how to deploy training across various compute infrastructure.

3 Feb 11, 2022

Jittor Medical Segmentation Lib -- The assignment of Pattern Recognition course (2021 Spring) in Tsinghua University

THU模式识别2021春 -- Jittor 医学图像分割模型列表本仓库收录了课程作业中同学们采用jittor框架实现的如下模型： UNet SegNet DeepLab V2 DANet EANet HarDNet及其改动HarDNet_alter PSPNet OCNet OCRNet DL

48 Dec 26, 2022

Introduction to AI assignment 1 HCM University of Technology, term 211

Sokoban Bot Introduction to AI assignment 1 HCM University of Technology, term 211 Abstract This is basically a solver for Sokoban game using Breadth-

4 Dec 12, 2022

Code for the AI lab course 2021/2022 of the University of Verona

AI-Lab Code for the AI lab course 2021/2022 of the University of Verona Set-Up the environment for the curse Download Anaconda for your System. Instal

5 Oct 19, 2022

It is the assignment for COMP 576 in Rice University

COMP-576 It is the assignment for COMP 576 in Rice University There are two programming assignments and one Final Project. Assignment 1: It is a MLP a

1 Nov 25, 2021

Comments

Torch distributed su slurm

Sto provando a inizializzare un gruppo con torch distributed, ma non riesco a venirne a capo.

Prendendo spunto da qui, ho scritto questo codice minimale:

import torch
import os
import argparse


def main():
    parser = argparse.ArgumentParser("Test")
    args = parser.parse_args()

    args.ngpus_per_node = torch.cuda.device_count()
    args.rank = int(os.getenv('SLURM_NODEID')) * args.ngpus_per_node
    args.world_size = int(os.getenv('SLURM_NNODES')) * args.ngpus_per_node
    args.dist_url = f'env://'
    torch.multiprocessing.spawn(do_something, (args,), args.ngpus_per_node)  

def do_something(device, args):
    args.rank += device
    torch.distributed.init_process_group(
        backend='nccl', init_method=args.dist_url,
        world_size=args.world_size, rank=args.rank
)  
    
    local_tensor = torch.rand(10, device=device)
    torch.distributed.all_reduce(local_tensor)

if __name__ == "__main__":
    main()

Usando questo sbatch:

#!/bin/bash
#SBATCH --job-name=tests
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[myemail]
#SBATCH --time=05:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=output.txt
#SBATCH --gres=gpu:1

source cvenv/bin/activate

export MASTER_PORT=29500
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

python3 main.py

Il codice funziona se lo lancio su un singolo nodo, altrimenti l'inizializzazione di torch distributed rimane bloccata e ricevo questo errore dopo il time out:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/public.hpc/oyounis/powersgd/cvenv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/public.hpc/oyounis/powersgd/main.py", line 21, in do_something
    world_size=args.world_size, rank=args.rank
  File "/public.hpc/oyounis/powersgd/cvenv/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/public.hpc/oyounis/powersgd/cvenv/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 258, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

Penso sia un problema sul MASTER_ADDR (che con questo codice viene impostato al nome del primo nodo, per esempio ws2gpu2), ma non saprei in che altro modo settarlo; ho provato localhost ma non funziona su più nodi.

opened by younik 1

Aggiunte istruzioni relative all'esecuzione di notebook Jupyter

Buongiorno, sono uno studente della magistrale di AI, e dal momento che molti miei colleghi sono abituati ad eseguire task di ML/DL usando notebook Jupyter, ho pensato potesse essere utile aggiungere le istruzioni per eseguire un server Jupyter sul cluster e connettersi da remoto.

opened by nihil21 0

Owner

PhD in Computer Science, Adjunct Professor @ CS department, Bologna

GitHub

This project uses Template Matching technique for object detecting by detection of template image over base image.

Object Detection Project Using OpenCV This project uses Template Matching technique for object detecting by detection the template image over base ima

7 May 29, 2022

This project uses Template Matching technique for object detecting by detection of template image over base image

Object Detection Project Using OpenCV This project uses Template Matching technique for object detecting by detection the template image over base ima

4 Nov 16, 2021

Repository of Jupyter notebook tutorials for teaching the Deep Learning Course at the University of Amsterdam (MSc AI), Fall 2020

1.1k Jan 7, 2023

Job Assignment System by Real-time Emotion Detection

Emotion-Detection Job Assignment System by Real-time Emotion Detection Emotion is the essential role of facial expression and it could provide a lot o

1 Feb 8, 2022

Job-Recommend-Competition - Vectorwise Interpretable Attentions for Multimodal Tabular Data

SiD - Simple Deep Model Vectorwise Interpretable Attentions for Multimodal Tabul

40 Dec 22, 2022

Serverless proxy for Spark cluster

Hydrosphere Mist Hydrosphere Mist is a serverless proxy for Spark cluster. Mist provides a new functional programming framework and deployment model f

317 Dec 1, 2022

codes for paper Combining Dynamic Local Context Focus and Dependency Cluster Attention for Aspect-level sentiment classification

DLCF-DCA codes for paper Combining Dynamic Local Context Focus and Dependency Cluster Attention for Aspect-level sentiment classification. submitted t

15 Aug 30, 2022

A PyTorch implementation of "Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks" (KDD 2019).

ClusterGCN ⠀⠀ A PyTorch implementation of "Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks" (KDD 2019). A

697 Dec 27, 2022

Multiband spectro-radiometric satellite image analysis with K-means cluster algorithm

Multi-band Spectro Radiomertric Image Analysis with K-means Cluster Algorithm Overview Multi-band Spectro Radiomertric images are images comprising of

6 Mar 16, 2022

Complete-IoU (CIoU) Loss and Cluster-NMS for Object Detection and Instance Segmentation (YOLACT)

Complete-IoU Loss and Cluster-NMS for Improving Object Detection and Instance Segmentation. Our paper is accepted by IEEE Transactions on Cybernetics

290 Dec 25, 2022