1. goal
A DIJob currently defines only one pod template, so we cannot specify different commands or resources for the different components of di-engine, such as the collector, learner and evaluator. We therefore need a more general way to define the DIJob custom resource.
2. design
Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, which allows us to define multiple tasks. Each task can set task.type to label itself as one of collector, learner, evaluator or none. none, which is the default value, means the task is a general task.
After this change, a DIJob can be defined as follows:
apiVersion: diengine.opendilab.org/v2alpha1
kind: DIJob
metadata:
  name: job-with-tasks
spec:
  priority: "normal"  # job priority, which is a reserved field for allocator
  backoffLimit: 0  # restart count
  cleanPodPolicy: "Running"  # the policy to clean pods after job completion
  preemptible: false  # job is preemptible or not
  minReplicas: 2
  maxReplicas: 5
  tasks:
  - replicas: 1
    name: "learner"
    type: learner
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c"]
          args:
          - |
            ditask --label learner xxx
          resources:
            requests:
              cpu: "1"
              nvidia.com/gpu: 1
        restartPolicy: Never
  - replicas: 1
    name: "evaluator"
    type: evaluator
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c"]
          args:
          - |
            ditask --label evaluator xxx
        restartPolicy: Never
  - replicas: 2
    name: "collector"
    type: collector
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c"]
          args:
          - |
            ditask --label collector xxx
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job created.
    reason: JobPending
    status: "False"
    type: Pending
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job is starting since all pods are created.
    reason: JobStarting
    status: "False"
    type: Starting
  phase: Starting
  profilings: {}
  readyReplicas: 0
  replicas: 4
  taskStatus:
    learner:
      Pending: 1
    evaluator:
      Pending: 1
    collector:
      Pending: 2
  reschedules: 0
  restarts: 0
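In the example above, status.replicas (4) equals the sum of the replicas of the three tasks (1 + 1 + 2). The following is a minimal Go sketch of that bookkeeping, assuming the total is obtained by simply summing the replicas of every entry in spec.tasks. The trimmed-down Task type and the totalReplicas helper are hypothetical and are not taken from the actual controller code.

```go
package main

import "fmt"

// Task is a trimmed-down stand-in for a DIJob task; only the fields
// needed for this illustration are kept (assumption, not the real type).
type Task struct {
    Name     string
    Replicas int32
}

// totalReplicas sums the replicas requested by every task.
// Hypothetical helper used only for this sketch.
func totalReplicas(tasks []Task) int32 {
    var total int32
    for _, t := range tasks {
        total += t.Replicas
    }
    return total
}

func main() {
    // Same task layout as the YAML example: 1 learner, 1 evaluator, 2 collectors.
    tasks := []Task{
        {Name: "learner", Replicas: 1},
        {Name: "evaluator", Replicas: 1},
        {Name: "collector", Replicas: 2},
    }
    fmt.Println("total replicas:", totalReplicas(tasks))
}
```

How this total relates to minReplicas and maxReplicas is not specified here, so the sketch does not check those bounds.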
task definition:
type Task struct {
    Name     string                 `json:"name,omitempty"`
    Type     TaskType               `json:"type,omitempty"`
    Replicas int32                  `json:"replicas,omitempty"`
    Template corev1.PodTemplateSpec `json:"template,omitempty"`
}

type TaskType string

const (
    TaskTypeLearner   TaskType = "learner"
    TaskTypeCollector TaskType = "collector"
    TaskTypeEvaluator TaskType = "evaluator"
    TaskTypeNone      TaskType = "none"
)
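Since none is described as the default task type, an empty type has to be mapped to it somewhere, and unknown values should be rejected. The following is a minimal sketch of that rule in plain Go. The normalizeTaskType helper is hypothetical; a real CRD might instead express the same rule with kubebuilder default and enum validation markers.

```go
package main

import "fmt"

// TaskType and its constants mirror the definition in this document.
type TaskType string

const (
    TaskTypeLearner   TaskType = "learner"
    TaskTypeCollector TaskType = "collector"
    TaskTypeEvaluator TaskType = "evaluator"
    TaskTypeNone      TaskType = "none"
)

// normalizeTaskType is a hypothetical helper: an empty type falls back to
// "none" (the documented default), known types pass through unchanged,
// and any other value is rejected with an error.
func normalizeTaskType(t TaskType) (TaskType, error) {
    switch t {
    case "":
        return TaskTypeNone, nil
    case TaskTypeLearner, TaskTypeCollector, TaskTypeEvaluator, TaskTypeNone:
        return t, nil
    default:
        return "", fmt.Errorf("unknown task type %q", t)
    }
}

func main() {
    // "actor" is a made-up value used to show the rejection path.
    for _, in := range []TaskType{"", "learner", "actor"} {
        out, err := normalizeTaskType(in)
        fmt.Printf("input=%q output=%q err=%v\n", in, out, err)
    }
}
```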
status.taskStatus definition:
type DIJobStatus struct {
    // Phase defines the observed phase of the job
    // +kubebuilder:default=Pending
    Phase Phase `json:"phase,omitempty"`

    // ...

    // TaskStatus maps each task to its status. key: task.name, value: TaskStatus
    TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`

    // ...
}

// TaskStatus is the count of pods in each pod phase
type TaskStatus map[corev1.PodPhase]int32
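To illustrate how status.taskStatus could be filled in, the sketch below groups pods by task name and counts them per phase. It is a minimal sketch under stated assumptions: PodPhase is a local stand-in for corev1.PodPhase so that the code compiles without the Kubernetes API packages, and the pod struct and buildTaskStatus helper are hypothetical, not the controller's actual implementation.

```go
package main

import "fmt"

// PodPhase stands in for corev1.PodPhase (assumption made so the sketch
// has no external dependencies).
type PodPhase string

const (
    PodPending PodPhase = "Pending"
    PodRunning PodPhase = "Running"
)

// TaskStatus counts pods per phase, as in the definition above.
type TaskStatus map[PodPhase]int32

// pod is a hypothetical, minimal view of a pod: the name of the task
// that owns it and its current phase.
type pod struct {
    TaskName string
    Phase    PodPhase
}

// buildTaskStatus groups pods by task name and counts them per phase.
func buildTaskStatus(pods []pod) map[string]TaskStatus {
    status := make(map[string]TaskStatus)
    for _, p := range pods {
        if status[p.TaskName] == nil {
            status[p.TaskName] = TaskStatus{}
        }
        status[p.TaskName][p.Phase]++
    }
    return status
}

func main() {
    // Same situation as the example status: all four pods are Pending.
    pods := []pod{
        {TaskName: "learner", Phase: PodPending},
        {TaskName: "evaluator", Phase: PodPending},
        {TaskName: "collector", Phase: PodPending},
        {TaskName: "collector", Phase: PodPending},
    }
    for name, st := range buildTaskStatus(pods) {
        fmt.Println(name, st)
    }
}
```

Because the map is keyed by task.name, two tasks sharing a name would be merged into one entry, so task names presumably need to be unique within a job.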
enhancement