Tools that help Data Scientists and ML engineers train and deploy ML models.

Overview

Domino Research

This repo contains projects under active development by the Domino R&D team. We build tools that help Data Scientists and ML engineers train and deploy ML models.

Active Projects

Here’s what we’re working on:

  • 🌉 Bridge - deploy directly from your registry, turning it into a declarative source of truth for your model hosting.

  • 🛂 Checkpoint - adds 'Pull Requests' to your registry to create a better process for promoting models to production.

  • 🎇 Flare - monitor models and get alerts without capturing, storing or processing production inference data.

Comments
  • fix model version idempotency

    The issue is relatively subtle and obvious at the same time:

    The current state of the world is read from the endpoints, not from the models. Plus, we cannot update an already creating/updating endpoint. So, when you tag a new version for production while its destination endpoint is creating/updating:

    • Bridge sees the new version is missing from the current state of the world and creates a model for it in SageMaker
    • Bridge sees it cannot update the state of the endpoints and skips
    • On the next iteration, the state of the world has not changed, so Bridge attempts the exact same control loop actions. But now a model exists, so you get an error.

    Root cause: model version creation is (a) not idempotent and (b) not done in a 'transaction' with endpoint updates. This change solves (a) rather than (b).

    Testing revealed a similar bug with deletion:

    • Start with endpoint in S1.
    • Update to S2, update still in progress.
    • Update to S3. S2 is no longer active/needed and is deleted, but the endpoint state still has S2. So, on the next control loop run, Bridge tries to delete S2 again. The solution is the same: idempotency.
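
A minimal sketch of what idempotent create and delete could look like, using an in-memory stand-in for the hosting backend. `FakeModelStore`, `ensure_model`, and `ensure_model_deleted` are hypothetical names for illustration only; they are not Bridge's or SageMaker's actual API.

```python
class ModelAlreadyExists(Exception):
    """Raised by the stand-in store when a model name is reused."""


class ModelNotFound(Exception):
    """Raised by the stand-in store when deleting an unknown model."""


class FakeModelStore:
    """Hypothetical in-memory stand-in for the backend's model API."""

    def __init__(self):
        self.models = set()

    def create(self, name):
        if name in self.models:
            raise ModelAlreadyExists(name)
        self.models.add(name)

    def delete(self, name):
        if name not in self.models:
            raise ModelNotFound(name)
        self.models.remove(name)


def ensure_model(store, name):
    """Create the model if needed; repeating the call is a no-op."""
    try:
        store.create(name)
        return True   # created on this call
    except ModelAlreadyExists:
        return False  # left over from an earlier control loop run


def ensure_model_deleted(store, name):
    """Delete the model if present; repeating the call is a no-op."""
    try:
        store.delete(name)
        return True   # deleted on this call
    except ModelNotFound:
        return False  # already gone


store = FakeModelStore()
created_first = ensure_model(store, "demo-v2")
created_again = ensure_model(store, "demo-v2")      # second loop iteration
deleted_first = ensure_model_deleted(store, "demo-v2")
deleted_again = ensure_model_deleted(store, "demo-v2")
```

With this shape, a control loop that re-runs the same actions against unchanged endpoint state no longer fails on the second pass.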
    opened by JoshBroomberg 2
  • Quick text edits

    I added language around "syncing model versions tagged as production". I'm not sure this is accurate, but something like it felt necessary to add context on how MLflow tags are consumed and synced to SageMaker.

    opened by jfdesroches 2
  • RND-222: Create local notebooks project

    This configures s3fs, but also provides a good framework for configuring JupyterLab (via config overrides and startup scripts baked into the container image).

    I've gone ahead and installed gator, but it's probably not completely set up right.

    I ran into some strange issues with AWS rejecting queries for some buckets in regions other than us-east-1, but not others. Just to be safe, I recommend that users use us-east-1.

    opened by ddl-kevin 1
  • RND-150: Checkpoint description scroll bar bug

    Ticket Link

    Description:

    • What is this PR about: The description box presented content in what looked like a cut off container. The issue was that the content received scroll bars by default even when no scrolling was required.
    • What are the changes that were made: Set overflow to auto instead of scroll
    • Any issues you ran into and how it affected your approach: The main issue was that in local development I had no access to the page with the display component. I ended up copying the component into a page I could access so that I could troubleshoot and test.
    • Screenshots if applicable:

    Before: [Screen Shot 2021-09-28 at 12 45 37 PM]
    Now: [Screen Shot 2021-09-28 at 2 25 37 PM]

    Testing/Reproduction:

    How to test this: This should be tested with a PR that:

    • has no description (no scroll should appear)
    • has a short description (no scroll should appear)
    • has a long description (should be able to scroll)
    opened by KateDK 1
  • Analytics

    Verified that the reports we need can be generated in Mixpanel. They actually have a pretty cool UI.

    Design decisions:

    • Plan for future deploy targets by adding a kind and id as properties of the DeployTarget class.
    • Route everything through a central analytics module
      • Allow analytics to be turned on/off centrally
      • Allow async event transmission in the future
    • Per-event tracking helper methods
      • Encapsulate event formatting/definitions to one file where they can be centrally managed
      • Expose strongly typed API to callers
    • Event-oriented design (capture version and routing creation/deletion as individual events, not in one state bundle)
      • Easier analytics: track how many stage changes are happening as a direct count of model_routing_updated events
      • Only way to do it: we can't actually infer much from the current and desired routing maps. We either need to calculate count-based routing updates in the main app or delegate to the deploy target classes that do actually make this calculation.
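
The design decisions above could be sketched roughly as follows. Only the `model_routing_updated` event name and the `kind`/`id` properties come from the notes; `Analytics`, `DeployTargetInfo`, the `ANALYTICS_OPT_OUT` variable name, and the `model_version_created` event are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical opt-out variable; the real name is not given in these notes.
OPT_OUT_ENV_VAR = "ANALYTICS_OPT_OUT"

Sender = Callable[[Dict[str, Any]], None]


@dataclass(frozen=True)
class DeployTargetInfo:
    """Identifies a deploy target so future targets can be told apart."""
    kind: str
    id: str


class Analytics:
    """Central module: every event is routed through _track."""

    def __init__(self, target: DeployTargetInfo, send: Sender,
                 enabled: Optional[bool] = None) -> None:
        if enabled is None:
            # Analytics stay on unless the opt-out variable is set.
            enabled = OPT_OUT_ENV_VAR not in os.environ
        self.enabled = enabled
        self._target = target
        self._send = send  # could later be swapped for an async sender

    def _track(self, event: str, properties: Dict[str, Any]) -> None:
        if not self.enabled:
            return
        self._send({"event": event, "deploy_kind": self._target.kind,
                    "deploy_id": self._target.id, **properties})

    # Per-event helpers keep event names and formatting in one place.
    def track_model_routing_updated(self, model_name: str, stage: str,
                                    version: str) -> None:
        self._track("model_routing_updated",
                    {"model": model_name, "stage": stage, "version": version})

    def track_model_version_created(self, model_name: str,
                                    version: str) -> None:
        self._track("model_version_created",
                    {"model": model_name, "version": version})


sent = []
analytics = Analytics(DeployTargetInfo(kind="sagemaker", id="demo"),
                      sent.append, enabled=True)
analytics.track_model_routing_updated("demo-model", "Production", "2")
analytics.track_model_routing_updated("demo-model", "Staging", "3")
# Stage changes are a direct count of model_routing_updated events.
stage_changes = sum(1 for e in sent if e["event"] == "model_routing_updated")

muted = []
Analytics(DeployTargetInfo(kind="sagemaker", id="demo"),
          muted.append, enabled=False).track_model_version_created(
              "demo-model", "3")
```

Passing the sender in as a callable keeps the transport replaceable, which is one way to leave room for asynchronous transmission later.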

    TODO:

    • Document the opt-out environment var in docs + logs
    • Bake the environment variable into the docker image in quay as a Docker arg (to avoid the api key getting scraped from the open source repo)
    • Report package version in events so that we can measure version skew
    opened by JoshBroomberg 1
  • RND-142: Remove UI Package

    Ticket Link

    Description:

    • What is this PR about: We wanted to simplify the FE template code structure, and decided to get rid of the packages/ui folder that made the build more cumbersome.
    • What are the changes that were made:
      • Components from packages/ui were copied over to the root's src directory.
      • The packages/ui folder was deleted with all its contents.
      • Some files in root were merged with the corresponding files from packages/ui.
      • package.json in root was updated.
      • Storybook was adjusted.

    Successfully rebuilt the code env locally from this branch before opening this PR.

    I'd love to get any feedback on this.

    opened by KateDK 0
  • RND-131 One step MLflow

    Makes the local MLflow setup a one-step process. I decided to do this by wrapping the example model training code in a Docker image that is run as a service in docker compose. Using Docker gets past the nasty Python environment issue, which would likely trip up many users.

    Is introducing seeding into compose smart?

    Keeping it in compose has the major advantage that all the credentials, config, and networking are in one file, versus a script that would need to be run in a configured environment to have access to the minio bucket etc.

    We could still decouple by removing the seeding from the CMD of the image and running it later with docker-compose run. But I think in > 95% of cases where the local mlflow is used, people will want a seeded registry to try one of our tools. So I think just doing it all at once is actually the best approach.

    opened by JoshBroomberg 0
  • Don't use port 6000 in quickstart

    Chrome has decided to have a magic list of ports that we cannot use on localhost. This includes 6000. We should move the quickstart to use a different port that isn't on the magic list.
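
As a rough illustration, a quickstart could guard its port choice with a check like the one below. The set is a small, partial subset of the ports Chrome blocks, not the full list, and the function name and candidate ports are hypothetical.

```python
# Partial, illustrative subset of Chrome's blocked ports (not the full list).
CHROME_BLOCKED_PORTS_SUBSET = frozenset({22, 25, 6000, *range(6665, 6670)})


def pick_port(candidates):
    """Return the first candidate port that is not in the blocked subset."""
    for port in candidates:
        if port not in CHROME_BLOCKED_PORTS_SUBSET:
            return port
    raise ValueError("every candidate port is in the blocked subset")


# Hypothetical candidates: the current quickstart port, then a fallback.
chosen_port = pick_port([6000, 5000])
```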

    https://superuser.com/questions/188058/which-ports-are-considered-unsafe-by-chrome

    🛂 checkpoint 
    opened by ddl-kevin 0
  • SIGTERM not propagated to worker threads in localhost deploy target

    Killing the container requires repeated Ctrl-C signals. It looks like the first one kills the main loop in Bridge and the next is caught by a Python thread.
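
One common pattern for this kind of problem is to have the signal handler set a shared `threading.Event` that every worker polls. This is a sketch under that assumption, not Bridge's actual code; `install_handlers`, `handle_shutdown`, and `worker` are hypothetical names, and the demo calls the handler directly instead of sending a real signal.

```python
import signal
import threading

stop_event = threading.Event()


def handle_shutdown(signum, frame):
    """Signal handler: ask every worker to stop."""
    stop_event.set()


def install_handlers():
    """Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_shutdown)
    signal.signal(signal.SIGINT, handle_shutdown)


def worker(results, index):
    # wait() returns True as soon as the event is set, ending the loop.
    while not stop_event.wait(timeout=0.05):
        pass  # one unit of work would go here
    results[index] = "stopped"


results = [None, None]
threads = [threading.Thread(target=worker, args=(results, i))
           for i in range(2)]
for t in threads:
    t.start()

# Simulate delivery of SIGTERM by invoking the handler directly.
handle_shutdown(signal.SIGTERM, None)
for t in threads:
    t.join(timeout=2)
```

Because the workers exit on their own once the event is set, a single signal is enough and the main thread can join them before exiting.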

    bug 🌉 Bridge
    opened by JoshBroomberg 0
  • 500s during model update in localhost target

    This looks like the error:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 2070, in wsgi_app
        response = self.full_dispatch_request()
      File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
        return self.finalize_request(rv)
      File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1535, in finalize_request
        response = self.make_response(rv)
      File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1727, in make_response
        raise TypeError(
    TypeError: The view function did not return a valid response. The return type must be a string, dict, tuple, Response instance, or WSGI callable, but it was a int.
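
The traceback shows a view returning a bare int, which Flask rejects; the report does not say which view. Assuming the int was meant as an HTTP status code, a small guard like this would turn it into a `(body, status)` tuple, which Flask does accept. `as_view_return` is a hypothetical helper, not part of Flask or Bridge.

```python
def as_view_return(rv):
    """Convert a bare status code into a (body, status) tuple.

    Any other return value is passed through unchanged.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(rv, int) and not isinstance(rv, bool):
        return ("", rv)
    return rv


status_only = as_view_return(204)
passthrough = as_view_return({"ok": True})
```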
    
    bug 🌉 Bridge
    opened by JoshBroomberg 0
  • Query S3_ENDPOINT from MLflow registry to simplify config

    See if we can pull the S3 endpoint that the registry is using, so that we can avoid duplicating this config across the registry and Bridge itself when we use a non-S3 backend.

    enhancement 🌉 Bridge
    opened by JoshBroomberg 2
  • Model update/synchronization bug

    Replication:

    1. Add a couple of model versions to the registry.
    2. Init Bridge in a fresh account.
    3. Bridge creates an endpoint/model.
    4. Before Bridge is done creating the endpoint, add and tag a new version destined for the endpoint.
    5. Bridge skips the update.
    6. After some time (perhaps when the endpoint becomes updateable), Bridge attempts the update and then fails due to the existing model.
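
The replication steps can be mimicked with a toy control loop. This is a simplified simulation, assuming that the loop reads current state from the endpoint and that model creation raises if the model already exists; `FakeBackend`, `control_loop_once`, and the status strings are illustrative stand-ins, not Bridge's real code.

```python
class ModelExistsError(Exception):
    """Raised when creating a model that already exists."""


class FakeBackend:
    """Toy stand-in: one endpoint that is still being created."""

    def __init__(self):
        self.models = {"v1"}
        self.endpoint_versions = {"v1"}
        self.endpoint_status = "Creating"

    def create_model(self, version):
        if version in self.models:
            raise ModelExistsError(version)
        self.models.add(version)


def control_loop_once(backend, desired):
    # Current state comes from the endpoint, not from the model list.
    missing = desired - backend.endpoint_versions
    for version in sorted(missing):
        backend.create_model(version)  # not idempotent
    if backend.endpoint_status != "InService":
        return "skipped"  # cannot update a creating/updating endpoint
    backend.endpoint_versions = set(desired)
    return "updated"


backend = FakeBackend()
desired = {"v1", "v2"}  # v2 tagged while the endpoint is still creating

first_run = control_loop_once(backend, desired)
try:
    second_run = control_loop_once(backend, desired)
except ModelExistsError:
    second_run = "error: model already exists"
```

The first run creates the model for `v2` and skips the endpoint; the second run sees the same endpoint state, tries to create `v2` again, and fails.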
    bug 🌉 Bridge
    opened by JoshBroomberg 1
Owner
Domino Data Lab