Tools that help Data Scientists and ML engineers train and deploy ML models.

Domino Data Lab

Last update: Oct 17, 2022

Overview

Domino Research

This repo contains projects under active development by the Domino R&D team. We build tools that help Data Scientists and ML engineers train and deploy ML models.

Active Projects

Here’s what we’re working on:

🌉 Bridge - deploy directly from your registry, turning it into a declarative source of truth for your model hosting.
🛂 Checkpoint - adds 'Pull Requests' to your registry to create a better process for promoting models to production.
🎇 Flare - monitor models and get alerts without capturing, storing or processing production inference data.

Comments

fix model version idempotency
The issue is relatively subtle and obvious at the same time:

The current state of the world is read from the endpoints, not from the models. Plus, we cannot update an already creating/updating endpoint. So, when you tag a new version for production while its destination endpoint is creating/updating:

Bridge sees the new version is missing from the current state of the world and creates a model for it in SageMaker

Bridge sees it cannot update the state of the endpoints and skips

On the next iteration, the state of the world has not changed, so Bridge attempts the exact same control loop actions. But now the model already exists, so you get an error. Root cause: model version creation is (a) not idempotent and (b) not done in a 'transaction' with endpoint updates. This PR solves (a) rather than (b).

Testing revealed a similar bug with deletion:

Start with endpoint in S1.

Update to S2, update still in progress.

Update to S3. S2 is no longer active/needed and is deleted, but the endpoint state still has S2. So, on the next control loop run, Bridge tries to delete S2 again. The solution is the same: idempotency.
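A minimal sketch of what idempotent create and delete could look like. It runs against an in-memory stand-in, not the SageMaker client. `FakeBackend`, `ensure_model`, `ensure_model_deleted` and both exception types are hypothetical names used only for illustration; they are not Bridge's actual code.

```python
class ModelAlreadyExists(Exception):
    """Raised by the stand-in backend when the model name is already taken."""


class ModelNotFound(Exception):
    """Raised by the stand-in backend when the model does not exist."""


class FakeBackend:
    """In-memory stand-in for a model hosting API (hypothetical)."""

    def __init__(self):
        self.models = set()

    def create_model(self, name):
        if name in self.models:
            raise ModelAlreadyExists(name)
        self.models.add(name)

    def delete_model(self, name):
        if name not in self.models:
            raise ModelNotFound(name)
        self.models.remove(name)


def ensure_model(backend, name):
    """Create the model if it is missing. Returns True if this call created it."""
    try:
        backend.create_model(name)
        return True
    except ModelAlreadyExists:
        # A previous control loop run already created it: nothing to do.
        return False


def ensure_model_deleted(backend, name):
    """Delete the model if it exists. Returns True if this call deleted it."""
    try:
        backend.delete_model(name)
        return True
    except ModelNotFound:
        # A previous control loop run already deleted it: nothing to do.
        return False


backend = FakeBackend()
created_first = ensure_model(backend, "model-v2")    # first loop iteration
created_again = ensure_model(backend, "model-v2")    # repeated iteration, no error
deleted_first = ensure_model_deleted(backend, "model-v2")
deleted_again = ensure_model_deleted(backend, "model-v2")
```

With this shape, repeating the same control loop actions while the endpoint is still creating/updating is harmless, even though the state read from the endpoints has not caught up yet.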
opened by JoshBroomberg 2
Quick text edits

I added language around "syncing model versions tagged as production" - not sure if this is accurate, but something like it felt necessary to add context around how MLflow tags are consumed and propagated to SageMaker.

opened by jfdesroches 2
RND-222: Create local notebooks project

This configures s3fs, but also provides a good framework for configuring JupyterLab (via config overrides and startup scripts baked into the container image).

I've gone ahead and installed gator, but it's probably not completely set up right.

I ran into some strange issues with AWS rejecting queries for some buckets in regions other than us-east-1, but not others. Just to be safe, I recommend that users use us-east-1.

opened by ddl-kevin 1
RND-150: Checkpoint description scroll bar bug
Ticket Link

Description:

What is this PR about: The description box presented content in what looked like a cut off container. The issue was that the content received scroll bars by default even when no scrolling was required.

What are the changes that were made: Set overflow to auto instead of scroll

Any issues you ran into and how it affected your approach: The main issue was the fact that in local development I had no access to the page with the display component. I ended up copying this component into one I did have access to to troubleshoot and test.

Screenshots if applicable:

Before: Now:

Testing/Reproduction:

How to test this: This should be tested with a PR that:

has no description (no scroll should appear)

has a short description (no scroll should appear)

has a long description (should be able to scroll)
opened by KateDK 1
Analytics
Verified that the reports we need can be generated in Mixpanel. They actually have a pretty cool UI.

Design decisions:

Plan for future deploy targets by adding a kind and id as properties of the DeployTarget class.

Route everything through a central analytics module

Allow analytics to be turned on/off centrally

Allow async event transmission in the future

Per-event tracking helper methods

Encapsulate event formatting/definitions to one file where they can be centrally managed

Expose strongly typed API to callers

Event-oriented design (capture version and routing creation/deletion as individual events not in one state bundle)

Easier analytics - track how many stage changes are happening as a direct count of model_routing_updated events

Only way to do it... we can't actually infer much from the current and desired routing maps. Either need to calculate count-based routing updates in the main app or delegate to the deploy target classes that do actually make this calculation.
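The design decisions above could be sketched roughly as follows. This is an illustration under assumptions, not the module from this PR: the `Analytics` and `Event` classes, the `ANALYTICS_OPT_OUT` variable name, the `model_version_created` event name and the injected `send` callable are all hypothetical. Only `DeployTarget` with its `kind` and `id` properties and the `model_routing_updated` event name come from the notes.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical name for the opt-out environment variable.
OPT_OUT_ENV_VAR = "ANALYTICS_OPT_OUT"


@dataclass(frozen=True)
class DeployTarget:
    kind: str  # e.g. "sagemaker" or "localhost"
    id: str


@dataclass(frozen=True)
class Event:
    name: str
    properties: Dict[str, str]


class Analytics:
    """Central module: every event is formatted and routed through here."""

    def __init__(
        self,
        target: DeployTarget,
        send: Callable[[Event], None],
        enabled: Optional[bool] = None,
    ) -> None:
        if enabled is None:
            # Analytics is on unless the opt-out variable is set.
            enabled = OPT_OUT_ENV_VAR not in os.environ
        self._target = target
        self._send = send  # injected so it can later be made async
        self._enabled = enabled

    # Per-event helpers give callers a typed API.
    def track_model_version_created(self, model: str, version: str) -> None:
        self._track("model_version_created", model=model, version=version)

    def track_model_routing_updated(self, model: str, stage: str) -> None:
        self._track("model_routing_updated", model=model, stage=stage)

    def _track(self, name: str, **properties: str) -> None:
        if not self._enabled:
            return
        base = {
            "deploy_target_kind": self._target.kind,
            "deploy_target_id": self._target.id,
        }
        self._send(Event(name, {**base, **properties}))


sent = []
analytics = Analytics(DeployTarget("localhost", "dev"), sent.append, enabled=True)
analytics.track_model_version_created("churn", "3")
analytics.track_model_routing_updated("churn", "Production")
analytics.track_model_routing_updated("churn", "Staging")

# Stage changes are a direct count of model_routing_updated events.
stage_changes = sum(1 for event in sent if event.name == "model_routing_updated")
```

Because each routing change is its own event, the count needs no comparison of current and desired routing maps.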

TODO:

Document the opt-out environment var in docs + logs

Bake the environment variable into the docker image in quay as a Docker arg (to avoid the api key getting scraped from the open source repo)

Report package version in events so that we can measure version skew
opened by JoshBroomberg 1
RND-142: Remove UI Package
Ticket Link

Description:

What is this PR about: We wanted to simplify the FE template code structure, and decided to get rid of the packages/ui folder that made the build more cumbersome.

What are the changes that were made:

- Components from packages/ui copied over to the root's src directory.
- packages/ui folder was deleted with all its contents.
- Some files in root were merged with the corresponding files from packages/ui.
- package.json in root updated.
- Storybook adjusted.

Successfully rebuilt the code env locally from this branch before opening this PR.

I'd love to get any feedback on this.
opened by KateDK 0
RND-131 One step MLflow

Makes the local MLflow into a one step process. I decided to do this by wrapping up the example model training code into a docker image that is run as a service in docker compose. Using docker gets past the nasty Python environment issue, which would likely trip up many users.

Is introducing seeding into compose smart?

Keeping it in compose has the major advantage that all the credentials config and networking is in one file. Vs a script that would need to be run in a configured environment to have access to the minio bucket etc.

We could still decouple by removing the seeding from the CMD of the image and running it later with docker-compose run. But I think in > 95% of cases where the local mlflow is used, people will want a seeded registry to try one of our tools. So I think just doing it all at once is actually the best approach.

opened by JoshBroomberg 0
Don't use port 6000 in quickstart

Chrome has decided to have a magic list of ports that we cannot use on localhost. This includes 6000. We should move the quickstart to use a different port that isn't on the magic list.

https://superuser.com/questions/188058/which-ports-are-considered-unsafe-by-chrome
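One way to guard against this when choosing a quickstart port is a small check like the sketch below. The set is deliberately partial: it holds a few ports known to be on Chrome's restricted list (including 6000), and Chrome's full list is longer. `is_restricted` and `pick_port` are hypothetical helpers, not part of Checkpoint.

```python
# Partial subset of Chrome's restricted ports; the real list is longer.
KNOWN_RESTRICTED_PORTS = frozenset({21, 22, 23, 25, 6000, *range(6665, 6670)})


def is_restricted(port):
    """True if the port is in the (partial) restricted set above."""
    return port in KNOWN_RESTRICTED_PORTS


def pick_port(candidates):
    """Return the first candidate port that is not known to be restricted."""
    for port in candidates:
        if not is_restricted(port):
            return port
    raise ValueError("all candidate ports are restricted")


chosen = pick_port([6000, 6001, 5000])
```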
🛂 checkpoint

opened by ddl-kevin 0
SIGTERM not propagated to worker threads in localhost deploy target

Killing the container requires repeated ctrl-c signals - looks like the first one kills the main loop in Bridge and the next is caught by a python thread
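A common pattern for this kind of problem is to have the signal handler set a shared `threading.Event` that every worker polls, so a single signal stops everything. The sketch below shows that pattern in isolation. `WorkerPool` and `install_handlers` are hypothetical and do not reflect how Bridge's localhost deploy target is actually structured.

```python
import signal
import threading


class WorkerPool:
    """Workers that all exit once a shared stop event is set."""

    def __init__(self, num_workers):
        self.stop_event = threading.Event()
        self.stopped = [False] * num_workers
        self._threads = [
            threading.Thread(target=self._work, args=(index,))
            for index in range(num_workers)
        ]

    def _work(self, index):
        # wait() returns True as soon as the event is set, ending the loop.
        while not self.stop_event.wait(timeout=0.05):
            pass  # one unit of work would run here
        self.stopped[index] = True

    def start(self):
        for thread in self._threads:
            thread.start()

    def shutdown(self, signum=None, frame=None):
        """Usable directly as a signal handler."""
        self.stop_event.set()

    def join(self, timeout=2.0):
        for thread in self._threads:
            thread.join(timeout)


def install_handlers(pool):
    """Route SIGTERM and SIGINT to the pool. Must run in the main thread."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, pool.shutdown)
```

After `install_handlers(pool)`, the main loop can call `pool.join()` on exit, so one ctrl-c (or one SIGTERM from the container runtime) is enough.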
bug :bridge_at_night: Bridge

opened by JoshBroomberg 0

500s during model update in localhost target

This looks like the error

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    return self.finalize_request(rv)
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1535, in finalize_request
    response = self.make_response(rv)
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1727, in make_response
    raise TypeError(
TypeError: The view function did not return a valid response. The return type must be a string, dict, tuple, Response instance, or WSGI callable, but it was a int.
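The message says a view returned a bare int, which Flask does not accept as a response. Assuming the int was meant as an HTTP status code (the issue does not say which view is responsible), one possible fix is to return a `(body, status)` tuple instead, which Flask does accept. The decorator and the `update_model` view below are hypothetical illustrations, written without importing Flask so the idea can be checked on its own.

```python
from functools import wraps


def status_code_as_response(view):
    """Wrap a view so that a bare int return becomes an ("", status) tuple."""

    @wraps(view)
    def wrapper(*args, **kwargs):
        rv = view(*args, **kwargs)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(rv, int) and not isinstance(rv, bool):
            return "", rv
        return rv

    return wrapper


@status_code_as_response
def update_model():
    # Hypothetical view that used to return only a status code.
    return 204


response = update_model()
```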

bug :bridge_at_night: Bridge

opened by JoshBroomberg 0

Query S3_ENDPOINT from MLflow registry to simplify config

See if we can pull the S3 endpoint that the registry is using so that we can avoid duplicating this config across the registry and Bridge itself when we use a non-s3 backend
enhancement :bridge_at_night: Bridge

opened by JoshBroomberg 2
Model update/synchronization bug
Replication:

Add a couple model versions to registry

Init bridge in a fresh account

Bridge creates an endpoint/model

Before bridge is done creating the endpoint, add and tag a new version destined for the endpoint

Bridge skips the update.

After some time (perhaps when endpoint becomes updateable) bridge attempts update and then fails due to existing model.

bug :bridge_at_night: Bridge
opened by JoshBroomberg 1

Owner

Domino Data Lab

GitHub

Model factory is a ML training platform to help engineers to build ML models at scale

Model Factory Machine learning today is powering many businesses today, e.g., search engine, e-commerce, news or feed recommendation. Training high qu

16 Sep 23, 2022

Tangram makes it easy for programmers to train, deploy, and monitor machine learning models.

Tangram Website | Discord Tangram makes it easy for programmers to train, deploy, and monitor machine learning models. Run tangram train to train a mo

1.4k Jan 5, 2023

MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 7, 2023

Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.

Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models. Solve a variety of tasks with pre-trained models or finetune them in

227 Dec 10, 2022

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Hivemind: decentralized deep learning in PyTorch Hivemind is a PyTorch library to train large neural networks across the Internet. Its intended usage

1.3k Jan 8, 2023

Exemplary lightweight and ready-to-deploy machine learning project

6 Dec 20, 2022

#30DaysOfStreamlit is a 30-day social challenge for you to build and deploy Streamlit apps.

30 Days Of Streamlit This is the official repo of #30DaysOfStreamlit, a 30-day social challenge for you to learn, build and deploy Streamlit apps.

53 Jan 2, 2023

[HELP REQUESTED] Generalized Additive Models in Python

pyGAM Generalized Additive Models in Python. Documentation Official pyGAM Documentation: Read the Docs Building interpretable models with Generalized

747 Jan 5, 2023

Falken provides developers with a service that allows them to train AI that can play their games

Falken provides developers with a service that allows them to train AI that can play their games. Unlike traditional RL frameworks that learn through rewards or batches of offline training, Falken is based on training AI via realtime, human interactions.

223 Jan 3, 2023

Model Validation Toolkit is a collection of tools to assist with validating machine learning models prior to deploying them to production and monitoring them after deployment to production.

25 Dec 28, 2022

ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

ClearML - Auto-Magical Suite of tools to streamline your ML workflow Experiment Manager, MLOps and Data-Management ClearML Formerly known as Allegro T

4k Jan 9, 2023

A collection of interactive machine-learning experiments: 🏋️models training + 🎨models demo

Interactive Machine Learning experiments: 🏋️models training + 🎨models demo

1.4k Jan 6, 2023

Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

519 Jan 3, 2023

Scikit learn library models to account for data and concept drift.

liquid_scikit_learn Scikit learn library models to account for data and concept drift. This python library focuses on solving data drift and concept d

7 Nov 18, 2021

easyNeuron is a simple way to create powerful machine learning models, analyze data and research cutting-edge AI.

5 Jun 18, 2022

Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

366 Jan 3, 2023

A collection of Scikit-Learn compatible time series transformers and tools.

Tools for Optuna, MLflow and the integration of both.

A single Python file with some tools for visualizing machine learning in the terminal.