Machine Learning Platform for Kubernetes

Overview


Reproduce, Automate, Scale your data science.


Welcome to Polyaxon, a platform for building, training, and monitoring large scale deep learning applications. We are making a system to solve reproducibility, automation, and scalability for machine learning applications.

Polyaxon deploys into any data center or cloud provider, or can be hosted and managed by Polyaxon, and it supports all the major deep learning frameworks such as TensorFlow, MXNet, Caffe, Torch, etc.

Polyaxon makes it faster, easier, and more efficient to develop deep learning applications by managing workloads with smart container and node management, and it turns GPU servers into shared, self-service resources for your team or organization.




Install

TL;DR

  • Install CLI

    # Install Polyaxon CLI
    $ pip install -U polyaxon
  • Create a deployment

    # Create a namespace
    $ kubectl create namespace polyaxon
    
    # Add Polyaxon charts repo
    $ helm repo add polyaxon https://charts.polyaxon.com
    
    # Deploy Polyaxon
    $ polyaxon admin deploy -f config.yaml
    
    # Access API
    $ polyaxon port-forward

Please check Polyaxon's installation guide.
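
The deploy step above expects a deployment config file (config.yaml). The sketch below is a minimal, hedged example; the version number, connection name, and volume claim are illustrative assumptions and should be adapted to your cluster.

    # config.yaml -- minimal deployment config sketch (illustrative values)
    deploymentChart: platform
    deploymentVersion: 1.12.2

    # Where run artifacts (logs, outputs, models) are stored;
    # the connection name and the claim are hypothetical examples.
    artifactsStore:
      name: artifacts-store
      kind: volume_claim
      schema:
        mountPath: /plx-artifacts
        volumeClaim: polyaxon-artifacts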

Quick start

TL;DR

  • Start a project

    # Create a project
    $ polyaxon project create --name=quick-start --description='Polyaxon quick start.'
  • Train and track logs & resources

    # Upload code and start experiments
    $ polyaxon run -f experiment.yaml -l
  • Dashboard

    # Start Polyaxon dashboard
    $ polyaxon dashboard
    
    Dashboard page will now open in your browser. Continue? [Y/n]: y
  • Notebook

    # Start Jupyter notebook for your project
    $ polyaxon run --hub notebook
  • Tensorboard

    # Start TensorBoard for a run's output
    $ polyaxon run --hub tensorboard --run-uuid=UUID

[Screenshots: runs comparison, dashboards, TensorBoard]


Please check our quick start guide to start training your first experiment.
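
For reference, the experiment.yaml used above is a regular polyaxonfile. A minimal sketch, mirroring the quick-start example that appears elsewhere on this page (the repo URL, image, and command come from that example), could look like:

    # experiment.yaml -- minimal polyaxonfile sketch for a single training job
    version: 1.1
    kind: component
    name: quick-start-experiment
    run:
      kind: job
      init:
        - git: {url: "https://github.com/polyaxon/polyaxon-quick-start"}
      container:
        image: polyaxon/polyaxon-quick-start
        workingDir: "{{ globals.artifacts_path }}/polyaxon-quick-start"
        command: [python3, model.py]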

Distributed job

Polyaxon supports and simplifies distributed jobs. Depending on the framework you are using, you need to deploy the corresponding operator, adapt your code to enable distributed training, and update your polyaxonfile.

Here are some examples of using distributed training:
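
For instance, a distributed TensorFlow job can be declared with a tfjob runtime, as in the sketch below. This is illustrative only: it assumes the Kubeflow training operator (TFJob) is deployed, and the replica layout, image, and command are placeholder values.

    # Sketch of a distributed TFJob (assumes the TFJob operator is installed)
    version: 1.1
    kind: component
    run:
      kind: tfjob
      worker:
        replicas: 2
        container:
          image: polyaxon/polyaxon-quick-start   # illustrative image
          command: [python3, model.py]
      ps:
        replicas: 1
        container:
          image: polyaxon/polyaxon-quick-start
          command: [python3, model.py]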

Hyperparameters tuning

Polyaxon has a concept, very similar to Google Vizier, for suggesting hyperparameters and managing their results, called experiment groups. An experiment group in Polyaxon defines a search algorithm, a search space, and a model to train.
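
As a hedged sketch of what such a group looks like in a polyaxonfile, a grid search can be declared in a matrix section; the parameter names, values, image, and command below are illustrative:

    # Sketch of a grid-search operation (illustrative parameters)
    version: 1.1
    kind: operation
    matrix:
      kind: grid
      concurrency: 2
      params:
        learning_rate: {kind: choice, value: [0.001, 0.01, 0.1]}
        dropout: {kind: choice, value: [0.2, 0.5]}
    component:
      inputs:
        - {name: learning_rate, type: float}
        - {name: dropout, type: float}
      run:
        kind: job
        container:
          image: polyaxon/polyaxon-quick-start   # illustrative image
          command: [python3, model.py, "--lr={{ learning_rate }}", "--dropout={{ dropout }}"]

Other search algorithms, such as random search, Hyperband, and Bayesian optimization, follow the same matrix structure with a different kind.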

Parallel executions

You can run your processing or model-training jobs in parallel; Polyaxon provides a mapping abstraction to manage concurrent jobs.
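
A minimal sketch of the mapping abstraction, fanning the same component out over a list of value sets; the values, image, and script are illustrative:

    # Sketch of a mapping over data partitions (illustrative values)
    version: 1.1
    kind: operation
    matrix:
      kind: mapping
      concurrency: 2
      values:
        - {partition: "2021-01"}
        - {partition: "2021-02"}
        - {partition: "2021-03"}
    component:
      inputs:
        - {name: partition, type: str}
      run:
        kind: job
        container:
          image: polyaxon/polyaxon-quick-start   # illustrative image
          command: [python3, process.py, "--partition={{ partition }}"]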

DAGs and workflows

Polyaxon DAGs is a tool that provides a container-native engine for running machine learning pipelines. A DAG manages multiple operations with dependencies. Each operation is defined by a component runtime. This means that operations in a DAG can be jobs, services, distributed jobs, parallel executions, or nested DAGs.
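
A minimal DAG sketch with two dependent operations; the component file references and operation names are illustrative:

    # Sketch of a two-step DAG (illustrative component files)
    version: 1.1
    kind: component
    run:
      kind: dag
      operations:
        - name: preprocess
          pathRef: ./preprocess.yaml
        - name: train
          pathRef: ./train.yaml
          dependencies: [preprocess]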

Architecture

[Polyaxon architecture diagram]

Documentation

Check out our documentation to learn more about Polyaxon.

Dashboard

Polyaxon comes with a dashboard that shows the projects and experiments created by you and your team members.

To start the dashboard, run the following command in your terminal:

$ polyaxon dashboard -y

Project status

Polyaxon is stable and runs in production at many startups and Fortune 500 companies.

Contributions

Please follow the contribution guidelines: Contribute to Polyaxon.

Research

If you use Polyaxon in your academic research, we would be grateful if you could cite it.

Feel free to contact us; we would love to learn about your project and see how we can support your custom needs.

Comments
  • Tensorboard error for the quick-start example


    Describe the bug

    I'm running the examples from the quick-start guide and when I tried to start Tensorboard I got the error:

    Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 316, in create_or_update_deployment return self.create_deployment(name=name, body=body), True File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 302, in create_deployment namespace=self.namespace, body=body File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 175, in create_namespaced_deployment (data) = self.create_namespaced_deployment_with_http_info(namespace, body, **kwargs) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 266, in create_namespaced_deployment_with_http_info collection_formats=collection_formats) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api _return_http_data_only, collection_formats, _preload_content, _request_timeout) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api _request_timeout=_request_timeout) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request body=body) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST body=body) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request raise ApiException(http_resp=r) kubernetes.client.rest.ApiException: (403) Reason: Forbidden HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '374'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot create resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"group":"extensions","kind":"deployments"},"code":403} During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 319, in create_or_update_deployment return self.update_deployment(name=name, body=body), False File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 309, in update_deployment name=name, namespace=self.namespace, body=body File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 4089, in patch_namespaced_deployment (data) = self.patch_namespaced_deployment_with_http_info(name, namespace, body, **kwargs) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 4189, in patch_namespaced_deployment_with_http_info collection_formats=collection_formats) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api _return_http_data_only, collection_formats, _preload_content, _request_timeout) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api _request_timeout=_request_timeout) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 393, in request body=body) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 286, in PATCH body=body) File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request raise 
ApiException(http_resp=r) kubernetes.client.rest.ApiException: (403) Reason: Forbidden HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '484'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4\" is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot patch resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"name":"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4","group":"extensions","kind":"deployments"},"code":403} During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/polyaxon/polyaxon/scheduler/tensorboard_scheduler.py", line 53, in start_tensorboard reconcile_url=get_tensorboard_reconcile_url(tensorboard.unique_name)) File "/polyaxon/polyaxon/polypod/tensorboard.py", line 234, in start_tensorboard reraise=True) File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 322, in create_or_update_deployment raise PolyaxonK8SError(e) polyaxon_k8s.exceptions.PolyaxonK8SError: (403) Reason: Forbidden HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '484'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4\" is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot patch resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"name":"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4","group":"extensions","kind":"deployments"},"code":403} 
    

    To Reproduce

    $ git clone https://github.com/polyaxon/polyaxon-quick-start.git
    $ # run create, init, etc.
    $ polyaxon run -f polyaxonfile_hyperparams.yml
    $ # wait..
    $ polyaxon tensorboard -g 1 start
    

    Expected behavior

    No error.

    Environment

    Kubernetes 1.17 using Kubeadm on a local cluster.

    Let me know if you need more info.

    bug area/helm-charts 
    opened by vakker 24
  • Expose configmaps/secrets to build environment


    Hey, I was wondering if I could expose configmaps or secrets to build jobs as well. What I'm trying to do is add some custom apt sources along with a client cert in order to install some internal packages as dependencies. Currently we work around this by installing some packages at runtime.

    opened by Mofef 22
  • No nodes in cluster and experiments fail to build


    I deployed Polyaxon on Minikube (Mac) and am trying to run experiments using the polyaxon quickstart repo (https://github.com/polyaxon/polyaxon-quick-start.git). However, the experiment build keeps failing, and running 'polyaxon cluster' shows no nodes:

    Cluster info:


    major: 1
    minor: 10
    compiler: gc
    platform: linux/amd64
    build_date: 2018-03-26T16:44:10Z
    git_commit: fc32d2f3698e36b93322a3465f63a14e9f0eaead
    go_version: go1.9.3
    git_version: v1.10.0
    git_tree_state: clean


    When I run 'kubectl get pods --all-namespaces', this is the output

    NAMESPACE     NAME                                            READY   STATUS    RESTARTS   AGE
    kube-system   coredns-c4cffd6dc-42gcs                         1/1     Running   0          23h
    kube-system   etcd-minikube                                   1/1     Running   0          23h
    kube-system   kube-addon-manager-minikube                     1/1     Running   0          23h
    kube-system   kube-apiserver-minikube                         1/1     Running   0          23h
    kube-system   kube-controller-manager-minikube                1/1     Running   0          23h
    kube-system   kube-dns-86f4d74b45-652fq                       3/3     Running   0          23h
    kube-system   kube-proxy-npxr5                                1/1     Running   0          23h
    kube-system   kube-scheduler-minikube                         1/1     Running   0          23h
    kube-system   kubernetes-dashboard-6f4cfc5d87-p2z4j           1/1     Running   0          23h
    kube-system   storage-provisioner                             1/1     Running   0          23h
    kube-system   tiller-deploy-778f674bf5-xhmsv                  1/1     Running   0          23h
    polyaxon      polyaxon-docker-registry-78d5499fc9-4wm69       1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-api-7b97bb447d-jl6h6          2/2     Running   0          5h
    polyaxon      polyaxon-polyaxon-beat-77fb6cccc7-lmdhw         2/2     Running   0          5h
    polyaxon      polyaxon-polyaxon-events-79c8ff59d9-2rqcq       1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-hpsearch-9b5589f5-874n5       1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-k8s-events-697cf8bb65-mnjz8   1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-logs-7bf467999-b8755          1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-monitors-57db4f7cd7-7x2j5     2/2     Running   0          5h
    polyaxon      polyaxon-polyaxon-resources-glgwq               1/1     Running   0          5h
    polyaxon      polyaxon-polyaxon-scheduler-76ccf9d665-xb9bg    1/1     Running   0          5h
    polyaxon      polyaxon-postgresql-78d4cff55c-jhcvz            1/1     Running   0          5h
    polyaxon      polyaxon-rabbitmq-6448d76c84-vp5ll              1/1     Running   0          5h
    polyaxon      polyaxon-redis-688468649b-tg6qp                 1/1     Running   0          5h

    I have also tried running 'helm update' and upgraded polyaxon to the latest release (0.3.2). How can I troubleshoot this?

    opened by jonathanlimsc 21
  • deleted flagged missed in initialization


    Describe the bug

    Getting this error with version 1.1.9

    [error screenshot]

    To reproduce

    polyaxon upgrade && polyaxon run -f polyaxonfile

    Expected behavior

    Run completed

    Environment

    polyaxon 1.1.9

    question not-reproducible 
    opened by zeyaddeeb 20
  • Scheduling many jobs at the same time leads to zombie state jobs (possible race condition?)


    Describe the bug

    It's hard to consistently reproduce, but when scheduling many jobs such that the build happens to be at the same time, it seems like we can get the following scenario: K8s correctly schedules the pods according to their requests/limits and the available resources. Polyaxon however believes that some jobs are running although they are unschedulable by K8s. When freeing up resources quickly enough, K8s actually schedules those jobs and nothing else happens. However, if resources are blocked long enough, Polyaxon's heartbeat service will automatically stop these jobs (that it believes are running although they are unschedulable by K8s) and fail them. To me, this could be a critical bug in the scheduler and really seems like some kind of race condition. I haven't tested it with multiple users, but I assume this would occur if many users submit different jobs at the same time (a likely scenario).

    To Reproduce

    1. Create a job with a fairly large build and long running time (>2000 seconds).
    2. Make sure that only two of these jobs can run on the cluster at a time (by requesting resources accordingly).
    3. Run this job many times with polyaxon run -f polyaxonfile.yml (submit this command again as soon as it terminates and repeat 5 times)

    Expected behavior

    The jobs should just be recognized as unschedulable and scheduled when the resources become available again.

    Environment

    Polyaxon 0.5.6, Kubernetes 1.15.4

    opened by MatthiasKohl 20
  • Can't use TPU


    Describe the bug

    I tried to use a Cloud TPU, but I got the error below in Stackdriver Logging, and the experiment failed. It seems that we need to specify the TensorFlow version with an annotation.

    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: admission webhook \"pod-init.cloud-tpus.google.com\" denied the request: TensorFlow version must be specified in annotation \"tf-version.cloud-tpus.google.com\" for pod requesting Cloud TPUs","reason":"InternalError","details":{"causes":[{"message":"admission webhook \"pod-init.cloud-tpus.google.com\" denied the request: TensorFlow version must be specified in annotation \"tf-version.cloud-tpus.google.com\" for pod requesting Cloud TPUs"}]},"code":500}
    

    To Reproduce

    YAML

    ---
    version: 1
    
    kind: experiment
    
    environment:
      resources:
        cpu:
          requests: 4
          limits: 4
        memory:
          requests: 15000
          limits: 15000
        tpu:
          requests: 8
          limits: 8
    
    build:
      image: tensorflow/tensorflow:1.12.0
      build_steps:
        - pip install --no-cache-dir -r requirements.txt
    
    run:
      # this is just a dummy python file.
      cmd: python test.py
    

    requirements.txt

    polyaxon-client==0.3.8
    polyaxon-cli==0.3.8
    jupyter
    google-cloud-storage
    

    Expected behavior

    We can create a TPU.

    Environment

    • Polyaxon: 0.3.8

    Links

    • https://cloud.google.com/tpu/docs/kubernetes-engine-setup
    • https://github.com/tensorflow/tpu/blob/master/models/official/resnet/resnet_k8s.yaml#L28
    bug 
    opened by yu-iskw 20
  • Deploying on Kubernetes cluster created w/ Kubespray


    Hi -

    I'm trying to spin up a Kubernetes cluster without the benefit of a managed service like EKS or GKE, then deploy Polyaxon on that cluster. Currently I'm experiencing some issues on the Polyaxon side of this process.

    To deploy the Kubernetes cluster I'm using kubespray. I'm able to deploy the cluster to the point that kubectl get nodes shows the expected nodes in a ready state, and I'm able to deploy a simple Node.js app as a test. I am not, however, able to successfully install Polyaxon on the cluster.

    I've tried on both AWS and on my local machine using Vagrant/Virtualbox. The issues I'm experiencing are different between the two cases, which I find interesting, so I'll document both.

    AWS

    I deployed Kubernetes by loosely following this tutorial. Things went smoothly for the most part, except that I needed to deal with this issue using this fix. I used 3 t2.large instances running Ubuntu 16.04 and the standard kubespray configuration.

    As I mentioned above, I get the expected output from kubectl get nodes, and I'm able to deploy the Node.js app at the end of the tutorial.

    At first, the Polyaxon installation/deployment also seems to succeed:

    ubuntu@ip-10-1-0-226:~$ helm install polyaxon/polyaxon \
    > --name=polyaxon \
    > --namespace=polyaxon \
    > -f polyaxon_config.yml
    NAME:   polyaxon
    LAST DEPLOYED: Sat Feb  9 00:03:29 2019
    NAMESPACE: polyaxon
    STATUS: DEPLOYED
    
    RESOURCES:
    ==> v1/Secret
    NAME                             TYPE    DATA  AGE
    polyaxon-docker-registry-secret  Opaque  1     3m4s
    polyaxon-postgresql              Opaque  1     3m4s
    polyaxon-rabbitmq                Opaque  2     3m4s
    polyaxon-polyaxon-secret         Opaque  4     3m4s
    
    ==> v1/ConfigMap
    NAME                      DATA  AGE
    redis-config              1     3m4s
    polyaxon-polyaxon-config  141   3m4s
    
    ==> v1beta1/ClusterRole
    NAME                           AGE
    polyaxon-polyaxon-clusterrole  3m4s
    
    ==> v1beta1/DaemonSet
    NAME                         DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
    polyaxon-polyaxon-resources  2        2        2      2           2          <none>         3m4s
    
    ==> v1beta1/Deployment
    NAME                          DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
    polyaxon-docker-registry      1        1        1           1          3m4s
    polyaxon-postgresql           1        1        1           1          3m4s
    polyaxon-rabbitmq             1        1        1           1          3m4s
    polyaxon-redis                1        1        1           1          3m4s
    polyaxon-polyaxon-api         1        1        1           0          3m4s
    polyaxon-polyaxon-beat        1        1        1           1          3m4s
    polyaxon-polyaxon-events      1        1        1           1          3m4s
    polyaxon-polyaxon-hpsearch    1        1        1           1          3m4s
    polyaxon-polyaxon-k8s-events  1        1        1           1          3m4s
    polyaxon-polyaxon-monitors    1        1        1           1          3m4s
    polyaxon-polyaxon-scheduler   1        1        1           1          3m3s
    
    ==> v1/Pod(related)
    NAME                                           READY  STATUS   RESTARTS  AGE
    polyaxon-polyaxon-resources-hpbcv              1/1    Running  0         3m4s
    polyaxon-polyaxon-resources-m7bjv              1/1    Running  0         3m4s
    polyaxon-docker-registry-58bff6f777-vkl6h      1/1    Running  0         3m4s
    polyaxon-postgresql-f4fc68c67-25t4p            1/1    Running  0         3m4s
    polyaxon-rabbitmq-74c5d87cf6-qlk2b             1/1    Running  0         3m4s
    polyaxon-redis-6f7db88668-99qvw                1/1    Running  0         3m4s
    polyaxon-polyaxon-api-75c5989cb4-ppv7t         1/2    Running  0         3m4s
    polyaxon-polyaxon-beat-759d6f9f96-qdhmd        2/2    Running  0         3m3s
    polyaxon-polyaxon-events-86f49f8b78-vvscx      1/1    Running  0         3m4s
    polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms    1/1    Running  0         3m3s
    polyaxon-polyaxon-k8s-events-555f6c8754-c242k  1/1    Running  0         3m3s
    polyaxon-polyaxon-monitors-864dd8fb67-h7s47    2/2    Running  0         3m2s
    polyaxon-polyaxon-scheduler-7f4978774d-pm9xz   1/1    Running  0         3m2s
    
    ==> v1/ServiceAccount
    NAME                                      SECRETS  AGE
    polyaxon-polyaxon-serviceaccount          1        3m4s
    polyaxon-polyaxon-workers-serviceaccount  1        3m4s
    
    ==> v1beta1/ClusterRoleBinding
    NAME                                   AGE
    polyaxon-polyaxon-clusterrole-binding  3m4s
    
    ==> v1beta1/Role
    NAME                            AGE
    polyaxon-polyaxon-role          3m4s
    polyaxon-polyaxon-workers-role  3m4s
    
    ==> v1beta1/RoleBinding
    NAME                                    AGE
    polyaxon-polyaxon-role-binding          3m4s
    polyaxon-polyaxon-workers-role-binding  3m4s
    
    ==> v1/Service
    NAME                      TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)                                AGE
    polyaxon-docker-registry  NodePort      10.233.42.186  <none>       5000:31813/TCP                         3m4s
    polyaxon-postgresql       ClusterIP     10.233.17.56   <none>       5432/TCP                               3m4s
    polyaxon-rabbitmq         ClusterIP     10.233.33.173  <none>       4369/TCP,5672/TCP,25672/TCP,15672/TCP  3m4s
    polyaxon-redis            ClusterIP     10.233.31.108  <none>       6379/TCP                               3m4s
    polyaxon-polyaxon-api     LoadBalancer  10.233.36.234  <pending>    80:32050/TCP,1337:31832/TCP            3m4s
    

    After a few minutes all the expected pods are running:

    ubuntu@ip-10-1-0-226:~$ kubectl get pods --namespace polyaxon
    NAME                                            READY   STATUS    RESTARTS   AGE
    polyaxon-docker-registry-58bff6f777-vkl6h       1/1     Running   0          3m49s
    polyaxon-polyaxon-api-75c5989cb4-ppv7t          1/2     Running   0          3m49s
    polyaxon-polyaxon-beat-759d6f9f96-qdhmd         2/2     Running   0          3m48s
    polyaxon-polyaxon-events-86f49f8b78-vvscx       1/1     Running   0          3m49s
    polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms     1/1     Running   0          3m48s
    polyaxon-polyaxon-k8s-events-555f6c8754-c242k   1/1     Running   0          3m48s
    polyaxon-polyaxon-monitors-864dd8fb67-h7s47     2/2     Running   0          3m47s
    polyaxon-polyaxon-resources-hpbcv               1/1     Running   0          3m49s
    polyaxon-polyaxon-resources-m7bjv               1/1     Running   0          3m49s
    polyaxon-polyaxon-scheduler-7f4978774d-pm9xz    1/1     Running   0          3m47s
    polyaxon-postgresql-f4fc68c67-25t4p             1/1     Running   0          3m49s
    polyaxon-rabbitmq-74c5d87cf6-qlk2b              1/1     Running   0          3m49s
    polyaxon-redis-6f7db88668-99qvw                 1/1     Running   0          3m49s
    

    The issue in this case arises with the LoadBalancer IP, which remains suspended in a pending state:

    ubuntu@ip-10-1-0-226:~$ kubectl get --namespace polyaxon svc -w polyaxon-polyaxon-api
    NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
    polyaxon-polyaxon-api   LoadBalancer   10.233.52.219   <pending>     80:30684/TCP,1337:31886/TCP   13h
    
    ubuntu@ip-10-1-0-226:~$ kubectl get svc --namespace polyaxon polyaxon-polyaxon-api -o json
    {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "creationTimestamp": "2019-02-09T01:03:11Z",
            "labels": {
                "app": "polyaxon-polyaxon-api",
                "chart": "polyaxon-0.3.8",
                "heritage": "Tiller",
                "release": "polyaxon",
                "role": "polyaxon-api",
                "type": "polyaxon-core"
            },
            "name": "polyaxon-polyaxon-api",
            "namespace": "polyaxon",
            "resourceVersion": "17172",
            "selfLink": "/api/v1/namespaces/polyaxon/services/polyaxon-polyaxon-api",
            "uid": "78640925-2c06-11e9-8f3f-121248b9afae"
        },
        "spec": {
            "clusterIP": "10.233.52.219",
            "externalTrafficPolicy": "Cluster",
            "ports": [
                {
                    "name": "api",
                    "nodePort": 30684,
                    "port": 80,
                    "protocol": "TCP",
                    "targetPort": 80
                },
                {
                    "name": "streams",
                    "nodePort": 31886,
                    "port": 1337,
                    "protocol": "TCP",
                    "targetPort": 1337
                }
            ],
            "selector": {
                "app": "polyaxon-polyaxon-api"
            },
            "sessionAffinity": "None",
            "type": "LoadBalancer"
        },
        "status": {
            "loadBalancer": {}
        }
    }
    

    Looking through the Polyaxon issues, I see that this can happen on minikube, but I wasn't able to find anything that helps me debug my particular case. What are the conditions that need to be met in the Kubernetes deployment, in order for the LoadBalancer IP step to succeed?

    Vagrant/Virtualbox

    I was suspicious that my issues might be specific to the AWS environment, rather than a general issue with kubespray/polyaxon, so as a second test I tried deploying the Kubernetes cluster locally using Vagrant and Virtualbox. To do this I used the Vagrantfile in the kubespray repo as described here.

    After debugging a couple kubespray issues, I was able to get the cluster up and running and deploy the Node.js app again.

    Deploying Polyaxon, I again saw the issue w/ the LoadBalancer IP getting stuck in a pending state. What was interesting to me though, was that a number of pods actually failed to run as well, despite the fact that the deployment ostensibly succeeded:

    vagrant@k8s-1:~$ helm ls
    NAME            REVISION        UPDATED                         STATUS          CHART           APP VERSION     NAMESPACE
    polyaxon        1               Sat Feb  9 06:01:21 2019        DEPLOYED        polyaxon-0.3.8                  polyaxon
    
    vagrant@k8s-1:~$ kubectl get pods --namespace polyaxon
    NAME                                           READY   STATUS    RESTARTS   AGE
    polyaxon-docker-registry-58bff6f777-wlb9p      0/1     Pending   0          36m
    polyaxon-polyaxon-api-6bc75ff4ff-v694k         0/2     Pending   0          36m
    polyaxon-polyaxon-beat-744c96b9f8-mbz5j        0/2     Pending   0          36m
    polyaxon-polyaxon-events-58d9c9cbd6-72skt      0/1     Pending   0          36m
    polyaxon-polyaxon-hpsearch-dc9cf6556-8rh78     0/1     Pending   0          36m
    polyaxon-polyaxon-k8s-events-9f8cdf5-fvqnx     0/1     Pending   0          36m
    polyaxon-polyaxon-monitors-58766747c9-gcf2r    0/2     Pending   0          36m
    polyaxon-polyaxon-resources-rnntm              1/1     Running   0          36m
    polyaxon-polyaxon-resources-t4pv6              0/1     Pending   0          36m
    polyaxon-polyaxon-resources-x9f42              0/1     Pending   0          36m
    polyaxon-polyaxon-scheduler-76bfdcfcc7-d9tq4   0/1     Pending   0          36m
    polyaxon-postgresql-f4fc68c67-lwgds            1/1     Running   0          36m
    polyaxon-rabbitmq-74c5d87cf6-lhvj8             1/1     Running   0          36m
    polyaxon-redis-6f7db88668-6wlgs                1/1     Running   0          36m
    

    I'm not quite sure what's going on here. My best guess would be that the virtual machines don't have the necessary resources to run these pods? ... Would be interesting to hear the experts weigh in 😄.

    Please help!

    opened by jayleverett 20
  • polyaxon/polyaxon-api is start but no service on


    docker log

    Running...
    Use default user
    nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
    nginx: configuration file /etc/nginx/nginx.conf test is successful
    Restarting nginx: nginx.
    nginx is running.
    [uWSGI] getting INI configuration from web/uwsgi.nginx.ini
    *** Starting uWSGI 2.0.18 (64bit) on [Tue Aug 18 08:34:22 2020] ***
    compiled with version: 6.3.0 20170516 on 13 August 2020 13:15:05
    os: Linux-4.18.0-193.el8.x86_64 #1 SMP Fri May 8 10:59:10 UTC 2020
    nodename: polyaxon-polyaxon-api-5c8f885949-wjq9p
    machine: x86_64
    clock source: unix
    pcre jit disabled
    detected number of CPU cores: 4
    current working directory: /polyaxon
    detected binary path: /usr/local/bin/uwsgi
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    chdir() to /polyaxon/web/..
    your memory page size is 4096 bytes
    detected max file descriptor number: 1048576
    lock engine: pthread robust mutexes
    thunder lock: enabled
    uwsgi socket 0 bound to UNIX address /polyaxon/web/../web/polyaxon.sock fd 3
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    Python version: 3.7.6 (default, Jan  3 2020, 23:53:24)  [GCC 6.3.0 20170516]
    Python main interpreter initialized at 0x5626c4254800
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    python threads support enabled
    your server socket listen backlog is limited to 100 connections
    your mercy for graceful operations on workers is 60 seconds
    mapped 425960 bytes (415 KB) for 4 cores
    *** Operational MODE: preforking ***
    added /polyaxon/web/../polyaxon/ to pythonpath.
    WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x5626c4254800 pid: 66 (default app)
    uWSGI running as root, you can use --uid/--gid/--chroot options
    *** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
    *** uWSGI is running in multiple interpreter mode ***
    spawned uWSGI master process (pid: 66)
    spawned uWSGI worker 1 (pid: 72, cores: 1)
    spawned uWSGI worker 2 (pid: 73, cores: 1)
    spawned uWSGI worker 3 (pid: 74, cores: 1)
    spawned uWSGI worker 4 (pid: 75, cores: 1)
    

    docker image

    polyaxon/polyaxon-gateway                                        1.1.7                 a52bd2a3a36d        4 days ago          473MB
    polyaxon/polyaxon-api                                            1.1.7                 dc1d59a6bff9        4 days ago          590MB
    polyaxon/polyaxon-cli                                            1.1.7                 5ea8e132a2a0        4 days ago          419MB
    

    kubectl --namespace=polyaxon get pod

    NAME                                          READY   STATUS    RESTARTS   AGE
    polyaxon-polyaxon-api-5c8f885949-wjq9p        0/1     Running   4          30m
    polyaxon-polyaxon-gateway-77c4d46d4d-t85ww    1/1     Running   0          30m
    polyaxon-polyaxon-operator-7f48b54676-mh48l   1/1     Running   0          30m
    polyaxon-polyaxon-streams-7c4876dc54-jh2p6    1/1     Running   0          30m
    polyaxon-postgresql-0                         1/1     Running   0          30m
    

    helm version

    Client: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}
    Server: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}
    

    kubectl version

    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
    
    question 
    opened by zhangchunsheng 19
  • Logs are not displayed correctly in terminal


    Describe the bug

    Unable to see the logs correctly. Unfortunately the only things visible in the terminal are callback errors:

    $ polyaxon experiment -xp X logs
    building -- 
    scheduled -- 
    starting -- 
    running -- 
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
    ...
    error from callback <bound method SocketTransportMixin._on_close of <polyaxon_client.transport.Transport object at 0x7fd723190978>>: _on_close() missing 1 required positional argument: 'ws'
    

    To Reproduce

    Started the experiment with polyaxon run -u and then started the logs view with polyaxon experiment -xp X logs

    Experiment:

    https://github.com/polyaxon/polyaxon-examples/tree/master/tensorflow/cifare10/polyaxonfile.yml

    Expected behavior

    Building -- creating image -
      master.1 -- INFO:tensorflow:Using config: {'_model_dir': '/outputs/root/cifar10/experiments/1', '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_session_config': gpu_options {
      master.1 --   force_gpu_compatible: true
      master.1 -- }
    

    Environment

    Local

    polyaxon is running within a virtualenv using python3.

    Cluster

    OS: Ubuntu 18.04 Kubernetes: 1.12.1

    bug 
    opened by naetherm 19
  • "cluster-admin not found" error while installing polyaxon with helm

    I am using minikube to set up a local kubernetes single node cluster. I have set up helm as described in the docs. But when I try to deploy polyaxon by following the docs, I get an error.

    temp-training:~ shivam.m$ helm install --wait polyaxon/polyaxon Error: release rousing-peahen failed: clusterroles.rbac.authorization.k8s.io "rousing-peahen-polyaxon-ingress-clusterrole" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["configmaps"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["configmaps"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["secrets"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["secrets"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["get"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["list"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["watch"]} PolicyRule{Resources:["events"], APIGroups:[""], Verbs:["create"]} PolicyRule{Resources:["events"], APIGroups:[""], Verbs:["patch"]} PolicyRule{Resources:["ingresses/status"], APIGroups:["extensions"], Verbs:["update"]}] user=&{system:serviceaccount:kube-system:tiller 8e197f15-1373-11e8-9b02-080027bbca2c [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated] map[]} ownerrules=[] ruleResolutionErrors=[clusterroles.rbac.authorization.k8s.io "cluster-admin" not found]

    I tried disabling the rbac and running it again but then I get an error related to port allocation.

    temp-training:~ shivam.m$ helm install --set=rbac.enabled=false polyaxon/polyaxon
    Error: release mortal-gorilla failed: Service "mortal-gorilla-docker-registry" is invalid: spec.ports[0].nodePort: Invalid value: 31813: provided port is already allocated

    bug 
    opened by codophobia 19
  • Unable to run experiments with v1.1.8


    Describe the bug

    Unable to run experiments with new version 1.1.8. "Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f168f918700>: Failed to establish a new connection: [Errno 111] Connection refused')" Seems to be from tracking.init()

    Also when running polyaxon project ls (only the first time):

    Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dbe0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
    Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dc88>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
    Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dd68>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
    Could not connect to remote server to fetch compatibility versions.
    Checking CLI compatibility version ...
    Could get the min/latest versions from compatibility API.
    

    However if I run it again it works as expected.

    To Reproduce

    version: 1.1
    kind: component
    name: simple-experiment
    description: Minimum information to run this TF.Keras example
    tags: [examples]
    run:
      kind: job
      init:
      - git: {url: "https://github.com/polyaxon/polyaxon-quick-start"}
        container:
          env:
            - name: http_proxy
              value: "***"
            - name: https_proxy
              value: "***"
      container:
        image: polyaxon/polyaxon-quick-start
        workingDir: "{{ globals.artifacts_path }}/polyaxon-quick-start"
        command: [python3, model.py]
        env:
          - name: http_proxy
            value: "***"
          - name: https_proxy
            value: "***"
    

    Expected behavior

    A running experiment.

    Environment

    deploymentChart: platform
    deploymentVersion: 1.1.8
    
    artifactsStore:
      name: minio
      kind: s3
      schema: {"bucket": "***"}
      secret:
        name: "***"
    
    connections:
      - name: data
        kind: volume_claim
        schema:
          mountPath: ***
          volumeClaim: ***
          readOnly: true
    
    scheduler:
      enabled: true
    
    streams:
      enabled: true
    
    postgresql:
      persistence:
        enabled: true
        storageClass: nfs
    
    redis:
      enabled: true
      master:
        persistence:
          enabled: true
          storageClass: nfs
      slave:
        persistence:
          enabled: true
          storageClass: nfs
    broker: redis
    
    rabbitmq-ha:
      enabled: false
    
    ui:
      enabled: true
      adminEnabled: true
    
    bug regression 
    opened by ONordander 17
  • Polyaxon Python API - RunClient `watch_logs()` alternate or parameter to stop its execution and return string


    Hi, Context: I have been running some experiments on EKS. It's working great, but my logs disappear after the run execution. Also, while the execution is happening, after an arbitrary time the pod disconnects and previous logs are lost. EKS/polyaxon/mpi recovers the job's execution and the Launcher pod resumes the training from where the disconnect happened.

    Issue: The issue is that I want to retain the logs of my runs. I am not able to use persistent volumes yet, which could be a solution. What I am trying to use is the Polyaxon Python API. More specifically, I am using RunClient and looking at get_logs() and watch_logs(). get_logs() is not returning anything, and I think it's not intended for this. watch_logs() is returning the logs, but the issue is that it's not technically "returning" anything. It seems to be a streaming function, which writes to stdout on the console (Jupyter, shell). In my code I am not able to get the logs with this, as it keeps on printing without stopping.

    Question/Enhancement: Is there another way to get the logs through the Python API? Or can we have an alternative to watch_logs which just returns the logs and finishes its execution? I intend to keep saving snapshots of the logs so that even if a disconnection happens I can then join the log files later. Open to any suggestions. FYI, I have tried the CLI too: polyaxon ops logs -f is giving me encoding issues.

    question 
    opened by QaisarRajput 1
  • Errors related to uploading artifacts while tracking runs are silent


    Current behavior

    From slack:

    Another question about logging metrics to a run through a local jupyter notebook. After our conversation on Nov. 30th. :point_up:, things were working fine. However recently the dashboard has stopped displaying metrics again, and I'm seeing weird behaviour in polyaxon. I don't know what has changed. Looking for advice since I'm out of troubleshooting ideas. Details in the thread... Here's code I have that recreates the problem:

    from polyaxon import tracking
    
    tracking.init(
        owner="owner",
        project="project-name",
        name="test_run",
        run_uuid=None,
        is_new=True
    )
    tracking.set_run_event_logger()
    
    tracking.log_text(name="some_text_metric", text="some text")
    
    for step in range(1, 100):
        tracking.log_metric(name="some_step_metric", value=step/2, step=step)
    
    tracking.log_succeeded()
    tracking.end()
    

    After using some debugging using:

    from polyaxon.logger import configure_logger
    
    configure_logger(verbose=True)
    ...
    

    It turns out that :

    Thank you for that command there. After looking at the logging from that, I realized that my polyaxon cli host had switched from the url of the gateway deployed on our cluster to https://cloud.polyaxon.com/. After some tests, it looks like logging metrics through cloud.polyaxon.com causes the issues I was seeing with artifacts. When I switched the polyaxon host to the url of our gateway, then the dashboard started correctly displaying metrics.

    Enhancement

    As suggested by the user, since the upload happens in a thread, API errors (404/401/403) should be surfaced to help the user debug issues:

    Any chance that you'd update the code to provide a useful error message when someone tries this?

    enhancement area/tracking area/client 
    opened by polyaxon-team 0
  • Add config to support proxy env var with GCS


    Current behavior

    Seems like GCS-FS does not automatically pick up the proxy env vars, see https://github.com/fsspec/gcsfs/pull/491

    Enhancement

    Add trust_env if proxy env vars are used:

    fs = GCSFileSystem(project='my-google-project', session_kwargs={'trust_env': True})
    
    area/cli area/streams area/sidecar area/client 
    opened by polyaxon-team 0
  • Stopping an operation with a pending pod removes the operation but does not delete the pod


    Describe the bug

    Stopping an operation where the pod is pending with an image pull error removes the operation from Polyaxon's table but does not correctly delete the pod.

    bug core 
    opened by polyaxon-team 0
  • CVE-2007-4559 Patch


    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Polyaxon CLI should raise an error for invalid input/output names with dots `.`


    Current behavior

    The CLI currently allows users to pass inputs/outputs with dots (.), and the platform also allows the run to be scheduled. However, the interpolation engine does not allow reusing the param's variable name, especially when using DAGs or Joins, since the parser uses the dot (.) to extract the required variables.

    Enhancement

    There are two options:

    • Add a validation on the parsing level to show an error to the user before they submit the operation to the platform, to prevent any confusion.
    • Allow using [] as an alternative solution to getting params/inputs/outputs values instead of . and update the documentation to show how it can be used.
    enhancement area/specification area/cli 
    opened by polyaxon-team 0