OpenDILab RL Kubernetes Custom Resource and Operator Lib

OpenDILab

Last update: Dec 29, 2022


Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

A well-prepared Kubernetes cluster. Follow the instructions to create a Kubernetes cluster, or create a local Kubernetes node with kind or minikube.
Cert-manager. To install it on Kubernetes, refer to the cert-manager docs, or install it with the following command.

kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in di-system namespace.

$ kubectl get pod -n di-system
NAME                               READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refer to the user-guide. For the Chinese version, see 中文手册 (Chinese manual).

Contributing

Refer to the developer-guide.

Comments

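Add cluster information inside the Pod
When a job is submitted as dijob replicas, each pod should be able to see the host information of the whole replica set and its own start order. We propose adding the following environment variables:

FQDNs of all pods in the replica, sorted by start order

FQDN of the current pod

Ordinal of the current pod

DI-engine will establish the corresponding network connections based on these variables, so the attach-to generation logic can be removed from di-orchestrator.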
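The three variables proposed above could be assembled roughly as follows. This is a minimal sketch, not the orchestrator's code: the variable names (`EXAMPLE_REPLICA_FQDNS`, `EXAMPLE_POD_FQDN`, `EXAMPLE_POD_INDEX`), the `EnvVar` stand-in type and the helper functions are all hypothetical, and the FQDN layout assumes pods that set both a hostname and a subdomain backed by a headless service.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// EnvVar is a stand-in for corev1.EnvVar so the sketch has no external dependencies.
type EnvVar struct {
	Name  string
	Value string
}

// Hypothetical variable names; the issue text does not specify them.
const (
	envAllFQDNs = "EXAMPLE_REPLICA_FQDNS"
	envPodFQDN  = "EXAMPLE_POD_FQDN"
	envPodIndex = "EXAMPLE_POD_INDEX"
)

// podFQDN builds <hostname>.<subdomain>.<namespace>.svc.<clusterDomain>,
// the DNS name Kubernetes gives a pod that sets hostname and subdomain.
func podFQDN(hostname, subdomain, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", hostname, subdomain, namespace, clusterDomain)
}

// clusterEnv returns the three variables for the pod at position index.
// hostnames must already be sorted by start order.
func clusterEnv(hostnames []string, index int, subdomain, namespace, clusterDomain string) ([]EnvVar, error) {
	if index < 0 || index >= len(hostnames) {
		return nil, fmt.Errorf("index %d out of range for %d pods", index, len(hostnames))
	}
	fqdns := make([]string, 0, len(hostnames))
	for _, h := range hostnames {
		fqdns = append(fqdns, podFQDN(h, subdomain, namespace, clusterDomain))
	}
	return []EnvVar{
		{Name: envAllFQDNs, Value: strings.Join(fqdns, ",")},
		{Name: envPodFQDN, Value: fqdns[index]},
		{Name: envPodIndex, Value: strconv.Itoa(index)},
	}, nil
}
```

Calling `clusterEnv([]string{"job-0", "job-1"}, 1, "job", "default", "cluster.local")` yields the comma-separated list of both FQDNs, the FQDN of `job-1`, and the index `1`. Joining the list with commas is a choice made for this sketch only.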
enhancement
opened by sailxjx 3

add tasks to dijob spec

1. goal

A dijob defines only one pod template, so we cannot define different commands or resources for different components of di-engine, such as collector, learner and evaluator. We therefore need a more general way to define the dijob custom resource.

2. design *

Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, which allows us to define multiple tasks. Each task can set task.type to label itself as one of collector, learner, evaluator or none. none, the default value, means the task is a general task.

After the change, a dijob can be defined as follows:

apiVersion: diengine.opendilab.org/v2alpha1
kind: DIJob
metadata:
  name: job-with-tasks
spec:
  priority: "normal"  # job priority, which is a reserved field for allocator
  backoffLimit: 0  # restart count
  cleanPodPolicy: "Running"  # the policy to clean pods after job completion
  preemptible: false  # whether the job is preemptible
  minReplicas: 2  
  maxReplicas: 5
  tasks:
  - replicas: 1
    name: "learner"
    type: learner
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label learner xxx
          resources:
            requests:
              cpu: "1"
              nvidia.com/gpu: 1
        restartPolicy: Never
  - replicas: 1
    name: "evaluator"
    type: evaluator
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label evaluator xxx
        restartPolicy: Never
  - replicas: 2
    name: "collector"
    type: collector
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label collector xxx
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job created.
    reason: JobPending
    status: "False"
    type: Pending
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job is starting since all pods are created.
    reason: JobStarting
    status: "False"
    type: Starting
  phase: Starting
  profilings: {}
  readyReplicas: 0
  replicas: 4
  taskStatus:
    learner:
      Pending: 1
    evaluator:
      Pending: 1
    collector:
      Pending: 2
  reschedules: 0
  restarts: 0

task definition:

type Task struct {
	Name string `json:"name,omitempty"`

	Type TaskType `json:"type,omitempty"`

	Replicas int32 `json:"replicas,omitempty"`

	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}

type TaskType string

const (
	TaskTypeLearner TaskType = "learner"

	TaskTypeCollector TaskType = "collector"

	TaskTypeEvaluator TaskType = "evaluator"

	TaskTypeNone TaskType = "none"
)
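
The defaulting rule described in the design (an unset task.type means none) can be sketched as a small normalization step. This is an illustrative sketch under stated assumptions, not the project's validator: `normalizeTasks` is a hypothetical helper, the `Task` type here omits the pod template to stay dependency-free, and the requirement that task names be non-empty and unique is an assumption motivated by status.taskStatus being keyed by task.name.

```go
package main

import "fmt"

type TaskType string

const (
	TaskTypeLearner   TaskType = "learner"
	TaskTypeCollector TaskType = "collector"
	TaskTypeEvaluator TaskType = "evaluator"
	TaskTypeNone      TaskType = "none"
)

// Task mirrors the struct above without the pod template.
type Task struct {
	Name     string
	Type     TaskType
	Replicas int32
}

// normalizeTasks (hypothetical) defaults an empty type to "none" and rejects
// unknown types, empty or duplicate names, and negative replica counts.
func normalizeTasks(tasks []Task) ([]Task, error) {
	seen := make(map[string]bool, len(tasks))
	out := make([]Task, 0, len(tasks))
	for i, t := range tasks {
		if t.Name == "" {
			return nil, fmt.Errorf("tasks[%d]: name is required", i)
		}
		if seen[t.Name] {
			return nil, fmt.Errorf("tasks[%d]: duplicate name %q", i, t.Name)
		}
		seen[t.Name] = true

		switch t.Type {
		case "":
			t.Type = TaskTypeNone // default value per the design
		case TaskTypeLearner, TaskTypeCollector, TaskTypeEvaluator, TaskTypeNone:
			// known type, keep as is
		default:
			return nil, fmt.Errorf("tasks[%d]: unknown type %q", i, t.Type)
		}
		if t.Replicas < 0 {
			return nil, fmt.Errorf("tasks[%d]: replicas must not be negative", i)
		}
		out = append(out, t)
	}
	return out, nil
}
```

The function returns a copy rather than mutating its input, so a caller can compare the submitted spec with the normalized one.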

status.taskStatus definition:

type DIJobStatus struct {
	// Phase defines the observed phase of the job
	// +kubebuilder:default=Pending
	Phase Phase `json:"phase,omitempty"`

	// ...

	// TaskStatus maps task.name to the status of that task
	TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`

	// ...
}

// count of different pod phases
type TaskStatus map[corev1.PodPhase]int32
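
To make the shape of status.taskStatus concrete, the following sketch counts pod phases per task name. It is a simplified illustration, not the operator's reconcile logic: `PodPhase` stands in for corev1.PodPhase, and the `Pod` type with its `TaskName` field is hypothetical (in a real cluster the task name would have to be read from the pod, for example from a label).

```go
package main

// PodPhase is a stand-in for corev1.PodPhase.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// TaskStatus counts pods per phase, as in the definition above.
type TaskStatus map[PodPhase]int32

// Pod is a minimal, hypothetical view of a pod: the task it belongs to and its phase.
type Pod struct {
	TaskName string
	Phase    PodPhase
}

// countTaskStatus groups pods by task name and counts how many are in each phase.
func countTaskStatus(pods []Pod) map[string]TaskStatus {
	statuses := make(map[string]TaskStatus)
	for _, p := range pods {
		if statuses[p.TaskName] == nil {
			statuses[p.TaskName] = make(TaskStatus)
		}
		statuses[p.TaskName][p.Phase]++
	}
	return statuses
}
```

With one pending learner, one pending evaluator and two pending collectors, the result corresponds to the taskStatus section of the YAML example earlier in this issue.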

enhancement

opened by konnase 1

new version for di-engine new architecture
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 1
v0.2.0
[x] split webhook and operator

[x] add dockerfile.dev

[x] update CleanPolicyALL to CleanPolicyAll

[x] remove k8s service related operations from server, and operator is responsible for managing services

[x] add e2e test

enhancement
opened by konnase 1
refactor job spec
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

enhancement
opened by konnase 0
Release/v1.0
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 0
fix: job failed to submit when collector/learner is missing

Job submission failed when collector/learner was missing: the webhook created an empty dijob, and the Go builder added default values to some fields of collector/learner, which resulted in an invalid type error. Solved by making coordinator/collector/learner pointers.
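
The general Go behavior that makes pointers useful for optional sub-structs can be shown with encoding/json. This is only an illustration of the language behavior, not the project's types or its exact failure: `RoleSpec`, `SpecByValue`, `SpecByPointer` and `marshal` are hypothetical names. A struct-valued field is always serialized with zero values, even with `omitempty`, while a nil pointer is omitted.

```go
package main

import "encoding/json"

// RoleSpec is a hypothetical stand-in for a collector/learner spec.
type RoleSpec struct {
	Replicas int `json:"replicas"`
}

// SpecByValue embeds optional roles by value: omitempty has no effect on structs.
type SpecByValue struct {
	Coordinator RoleSpec `json:"coordinator"`
	Collector   RoleSpec `json:"collector,omitempty"`
}

// SpecByPointer uses pointers: a nil role is left out of the output entirely.
type SpecByPointer struct {
	Coordinator *RoleSpec `json:"coordinator,omitempty"`
	Collector   *RoleSpec `json:"collector,omitempty"`
}

// marshal returns the JSON encoding of v, or the error text if encoding fails.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}
```

Marshalling `SpecByValue{Coordinator: RoleSpec{Replicas: 1}}` produces `{"coordinator":{"replicas":1},"collector":{"replicas":0}}`, whereas `SpecByPointer{Coordinator: &RoleSpec{Replicas: 1}}` produces `{"coordinator":{"replicas":1}}`, so a missing collector stays missing instead of turning into a zero-valued object.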
bug

opened by konnase 0
Feat/job create event
add event handler for dijob, and mark job as Created when the job is submitted

mark collector and learner as optional, only coordinator is required (https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job as Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94); this is hard to test because the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted

version -> v0.2.1

enhancement
opened by konnase 0
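Some questions about allocate

1. In the current allocator logic, the initial allocation for a non-preemptible job only uses minreplicas to modify the replicas attribute. Is the node on which the job's pods are deployed then decided entirely by K8S? Also, in allocator.go of the Release1.13 code, the initial allocation for non-preemptible jobs does not seem to be written yet.
2. What exactly does it mean for a job to be preemptible? Is it equivalent to whether the job can be scheduled?
3. The Allocate and Optimize methods of the FitPolicy scheduling policy are not implemented either. When can this part be added?
4. Many places in the documentation do not match the latest code. For example, the DIJob.Spec.Group attribute has been removed from the code, and the job.spec.minreplicas attribute mentioned in the documentation is not in the code but in JobInfo. Could the documentation be updated? Thanks!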

opened by RZ-Q 3

Releases(v1.1.3)

v1.1.3(Aug 22, 2022)
bug fixes

judge which task a pod belongs to according to task name instead of task type (https://github.com/opendilab/DI-orchestrator/pull/27)

v1.1.2(Jul 21, 2022)
bug fixes

global cmd flag error (https://github.com/opendilab/DI-orchestrator/pull/23)

wrong pod subdomain (https://github.com/opendilab/DI-orchestrator/pull/24)

incorrect global rank retrieval (https://github.com/opendilab/DI-orchestrator/pull/25)

v1.1.1(Jul 4, 2022)
update status replicas and task status

add volumes to job spec

update status CompletionTimestamp when job completed

see details in https://github.com/opendilab/DI-orchestrator/pull/22
v1.1.0(Jun 30, 2022)
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

see details in https://github.com/opendilab/DI-orchestrator/pull/21
v1.0.0(Mar 23, 2022)
features

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface see https://github.com/opendilab/DI-orchestrator/pull/18

v0.2.2(Dec 15, 2021)
bug fix

resolve a bug where the job failed to submit when collector/learner was missing (https://github.com/opendilab/DI-orchestrator/pull/14)

v0.2.1(Oct 12, 2021)
feature

add event handler for dijob, and mark job as Created when the job is submitted (https://github.com/opendilab/DI-orchestrator/pull/13)

mark collector and learner as optional, only coordinator is required (https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job as Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94); this is hard to test because the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted

v0.2.0(Sep 28, 2021)
change orchestrator image repository

version -> v0.2.0

v0.2.0-rc.0(Sep 6, 2021)
split webhook and operator

add dockerfile.dev

update CleanPolicyALL to CleanPolicyAll

remove k8s service related operations from server, and operator is responsible for managing services

add e2e test

v0.1.0(Jul 8, 2021)
Features

Define DIJob CRD to support DI jobs' submission

Define AggregatorConfig CRD to support aggregator definition

Add webhook to validate DIJob submission

Provide http service for DI jobs to request for DI modules

Docs to introduce DI-orchestrator architecture


Owner

OpenDILab

Open sourced Decision Intelligence (DI)
