Hi -
I'm trying to spin up a Kubernetes cluster without the benefit of a managed service like EKS or GKE, and then deploy Polyaxon on that cluster. Currently I'm running into issues on the Polyaxon side of this process.
To deploy the Kubernetes cluster I'm using kubespray. I'm able to deploy the cluster to the point that kubectl get nodes shows the expected nodes in a Ready state, and I'm able to deploy a simple Node.js app as a test. I am not, however, able to successfully install Polyaxon on the cluster.
I've tried on both AWS and on my local machine using Vagrant/Virtualbox. The issues I'm experiencing are different between the two cases, which I find interesting, so I'll document both.
AWS
I deployed Kubernetes by loosely following this tutorial. Things went smoothly for the most part, except that I needed to deal with this issue using this fix. I used 3 t2.large instances running Ubuntu 16.04 and the standard kubespray configuration.
As I mentioned above, I get the expected output from kubectl get nodes, and I'm able to deploy the Node.js app at the end of the tutorial.
At first, the Polyaxon installation/deployment also seems to succeed:
ubuntu@ip-10-1-0-226:~$ helm install polyaxon/polyaxon \
> --name=polyaxon \
> --namespace=polyaxon \
> -f polyaxon_config.yml
NAME: polyaxon
LAST DEPLOYED: Sat Feb 9 00:03:29 2019
NAMESPACE: polyaxon
STATUS: DEPLOYED
RESOURCES:
==> v1/Secret
NAME TYPE DATA AGE
polyaxon-docker-registry-secret Opaque 1 3m4s
polyaxon-postgresql Opaque 1 3m4s
polyaxon-rabbitmq Opaque 2 3m4s
polyaxon-polyaxon-secret Opaque 4 3m4s
==> v1/ConfigMap
NAME DATA AGE
redis-config 1 3m4s
polyaxon-polyaxon-config 141 3m4s
==> v1beta1/ClusterRole
NAME AGE
polyaxon-polyaxon-clusterrole 3m4s
==> v1beta1/DaemonSet
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
polyaxon-polyaxon-resources 2 2 2 2 2 <none> 3m4s
==> v1beta1/Deployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
polyaxon-docker-registry 1 1 1 1 3m4s
polyaxon-postgresql 1 1 1 1 3m4s
polyaxon-rabbitmq 1 1 1 1 3m4s
polyaxon-redis 1 1 1 1 3m4s
polyaxon-polyaxon-api 1 1 1 0 3m4s
polyaxon-polyaxon-beat 1 1 1 1 3m4s
polyaxon-polyaxon-events 1 1 1 1 3m4s
polyaxon-polyaxon-hpsearch 1 1 1 1 3m4s
polyaxon-polyaxon-k8s-events 1 1 1 1 3m4s
polyaxon-polyaxon-monitors 1 1 1 1 3m4s
polyaxon-polyaxon-scheduler 1 1 1 1 3m3s
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
polyaxon-polyaxon-resources-hpbcv 1/1 Running 0 3m4s
polyaxon-polyaxon-resources-m7bjv 1/1 Running 0 3m4s
polyaxon-docker-registry-58bff6f777-vkl6h 1/1 Running 0 3m4s
polyaxon-postgresql-f4fc68c67-25t4p 1/1 Running 0 3m4s
polyaxon-rabbitmq-74c5d87cf6-qlk2b 1/1 Running 0 3m4s
polyaxon-redis-6f7db88668-99qvw 1/1 Running 0 3m4s
polyaxon-polyaxon-api-75c5989cb4-ppv7t 1/2 Running 0 3m4s
polyaxon-polyaxon-beat-759d6f9f96-qdhmd 2/2 Running 0 3m3s
polyaxon-polyaxon-events-86f49f8b78-vvscx 1/1 Running 0 3m4s
polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms 1/1 Running 0 3m3s
polyaxon-polyaxon-k8s-events-555f6c8754-c242k 1/1 Running 0 3m3s
polyaxon-polyaxon-monitors-864dd8fb67-h7s47 2/2 Running 0 3m2s
polyaxon-polyaxon-scheduler-7f4978774d-pm9xz 1/1 Running 0 3m2s
==> v1/ServiceAccount
NAME SECRETS AGE
polyaxon-polyaxon-serviceaccount 1 3m4s
polyaxon-polyaxon-workers-serviceaccount 1 3m4s
==> v1beta1/ClusterRoleBinding
NAME AGE
polyaxon-polyaxon-clusterrole-binding 3m4s
==> v1beta1/Role
NAME AGE
polyaxon-polyaxon-role 3m4s
polyaxon-polyaxon-workers-role 3m4s
==> v1beta1/RoleBinding
NAME AGE
polyaxon-polyaxon-role-binding 3m4s
polyaxon-polyaxon-workers-role-binding 3m4s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
polyaxon-docker-registry NodePort 10.233.42.186 <none> 5000:31813/TCP 3m4s
polyaxon-postgresql ClusterIP 10.233.17.56 <none> 5432/TCP 3m4s
polyaxon-rabbitmq ClusterIP 10.233.33.173 <none> 4369/TCP,5672/TCP,25672/TCP,15672/TCP 3m4s
polyaxon-redis ClusterIP 10.233.31.108 <none> 6379/TCP 3m4s
polyaxon-polyaxon-api LoadBalancer 10.233.36.234 <pending> 80:32050/TCP,1337:31832/TCP 3m4s
After a few minutes all the expected pods are running:
ubuntu@ip-10-1-0-226:~$ kubectl get pods --namespace polyaxon
NAME READY STATUS RESTARTS AGE
polyaxon-docker-registry-58bff6f777-vkl6h 1/1 Running 0 3m49s
polyaxon-polyaxon-api-75c5989cb4-ppv7t 1/2 Running 0 3m49s
polyaxon-polyaxon-beat-759d6f9f96-qdhmd 2/2 Running 0 3m48s
polyaxon-polyaxon-events-86f49f8b78-vvscx 1/1 Running 0 3m49s
polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms 1/1 Running 0 3m48s
polyaxon-polyaxon-k8s-events-555f6c8754-c242k 1/1 Running 0 3m48s
polyaxon-polyaxon-monitors-864dd8fb67-h7s47 2/2 Running 0 3m47s
polyaxon-polyaxon-resources-hpbcv 1/1 Running 0 3m49s
polyaxon-polyaxon-resources-m7bjv 1/1 Running 0 3m49s
polyaxon-polyaxon-scheduler-7f4978774d-pm9xz 1/1 Running 0 3m47s
polyaxon-postgresql-f4fc68c67-25t4p 1/1 Running 0 3m49s
polyaxon-rabbitmq-74c5d87cf6-qlk2b 1/1 Running 0 3m49s
polyaxon-redis-6f7db88668-99qvw 1/1 Running 0 3m49s
The issue in this case arises with the LoadBalancer's external IP, which remains stuck in a pending state:
ubuntu@ip-10-1-0-226:~$ kubectl get --namespace polyaxon svc -w polyaxon-polyaxon-api
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
polyaxon-polyaxon-api LoadBalancer 10.233.52.219 <pending> 80:30684/TCP,1337:31886/TCP 13h
ubuntu@ip-10-1-0-226:~$ kubectl get svc --namespace polyaxon polyaxon-polyaxon-api -o json
{
"apiVersion": "v1",
"kind": "Service",
"metadata": {
"creationTimestamp": "2019-02-09T01:03:11Z",
"labels": {
"app": "polyaxon-polyaxon-api",
"chart": "polyaxon-0.3.8",
"heritage": "Tiller",
"release": "polyaxon",
"role": "polyaxon-api",
"type": "polyaxon-core"
},
"name": "polyaxon-polyaxon-api",
"namespace": "polyaxon",
"resourceVersion": "17172",
"selfLink": "/api/v1/namespaces/polyaxon/services/polyaxon-polyaxon-api",
"uid": "78640925-2c06-11e9-8f3f-121248b9afae"
},
"spec": {
"clusterIP": "10.233.52.219",
"externalTrafficPolicy": "Cluster",
"ports": [
{
"name": "api",
"nodePort": 30684,
"port": 80,
"protocol": "TCP",
"targetPort": 80
},
{
"name": "streams",
"nodePort": 31886,
"port": 1337,
"protocol": "TCP",
"targetPort": 1337
}
],
"selector": {
"app": "polyaxon-polyaxon-api"
},
"sessionAffinity": "None",
"type": "LoadBalancer"
},
"status": {
"loadBalancer": {}
}
}
Looking through the Polyaxon issues, I see that this can happen on minikube, but I wasn't able to find anything that helps me debug my particular case. What conditions need to be met in the Kubernetes deployment for the LoadBalancer IP assignment to succeed?
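In case it's relevant to how I should be debugging this: my current understanding (which may be wrong) is that a Service of type LoadBalancer only gets an external IP if some controller actually provisions one, e.g. the AWS cloud provider integration (kubespray's cloud_provider setting, if I'm reading the docs right) or something like MetalLB on a bare-metal cluster. Without that, I'd expect the service to stay pending forever. Here's a rough sketch of what I was planning to try next, untested on my cluster:

# See whether any controller is reporting events against the service
kubectl describe svc polyaxon-polyaxon-api --namespace polyaxon

# Possible workaround: switch the service to NodePort and reach the API
# at <node-public-ip>:<node-port> instead of waiting for an external LB
kubectl patch svc polyaxon-polyaxon-api --namespace polyaxon \
  -p '{"spec": {"type": "NodePort"}}'

Is that a reasonable direction, or is there a supported way to tell the Polyaxon chart to skip the LoadBalancer entirely?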
Vagrant/Virtualbox
I suspected that my issues might be specific to the AWS environment rather than a general kubespray/Polyaxon problem, so as a second test I deployed the Kubernetes cluster locally using Vagrant and VirtualBox. To do this I used the Vagrantfile in the kubespray repo as described here.
After debugging a couple of kubespray issues, I was able to get the cluster up and running and deploy the Node.js app again.
Deploying Polyaxon, I again saw the LoadBalancer IP getting stuck in a pending state. What was interesting, though, was that a number of pods also failed to run, even though the Helm release ostensibly succeeded:
vagrant@k8s-1:~$ helm ls
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
polyaxon 1 Sat Feb 9 06:01:21 2019 DEPLOYED polyaxon-0.3.8 polyaxon
vagrant@k8s-1:~$ kubectl get pods --namespace polyaxon
NAME READY STATUS RESTARTS AGE
polyaxon-docker-registry-58bff6f777-wlb9p 0/1 Pending 0 36m
polyaxon-polyaxon-api-6bc75ff4ff-v694k 0/2 Pending 0 36m
polyaxon-polyaxon-beat-744c96b9f8-mbz5j 0/2 Pending 0 36m
polyaxon-polyaxon-events-58d9c9cbd6-72skt 0/1 Pending 0 36m
polyaxon-polyaxon-hpsearch-dc9cf6556-8rh78 0/1 Pending 0 36m
polyaxon-polyaxon-k8s-events-9f8cdf5-fvqnx 0/1 Pending 0 36m
polyaxon-polyaxon-monitors-58766747c9-gcf2r 0/2 Pending 0 36m
polyaxon-polyaxon-resources-rnntm 1/1 Running 0 36m
polyaxon-polyaxon-resources-t4pv6 0/1 Pending 0 36m
polyaxon-polyaxon-resources-x9f42 0/1 Pending 0 36m
polyaxon-polyaxon-scheduler-76bfdcfcc7-d9tq4 0/1 Pending 0 36m
polyaxon-postgresql-f4fc68c67-lwgds 1/1 Running 0 36m
polyaxon-rabbitmq-74c5d87cf6-lhvj8 1/1 Running 0 36m
polyaxon-redis-6f7db88668-6wlgs 1/1 Running 0 36m
I'm not quite sure what's going on here. My best guess is that the virtual machines don't have enough resources for these pods to be scheduled, but I haven't confirmed that. It would be interesting to hear the experts weigh in 😄.
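For what it's worth, here's how I was planning to check the resource theory (assuming the Pending pods are simply unschedulable, which I haven't verified yet):

# Look for scheduler events like "Insufficient cpu" / "Insufficient memory"
# (pod name taken from the listing above)
kubectl describe pod polyaxon-polyaxon-api-6bc75ff4ff-v694k --namespace polyaxon

# Compare against what each node has left to allocate
kubectl describe nodes | grep -A 5 "Allocated resources"

If that does turn out to be the problem, I assume bumping the VM memory/CPU in the kubespray Vagrant config would be the fix, but I'd appreciate confirmation.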
Please help!