An end-to-end implementation of intent prediction with Metaflow and other cool tools

Jacopo Tagliabue

Last update: Dec 31, 2022

You Don't Need a Bigger Boat

An end-to-end (Metaflow-based) implementation of an intent prediction flow for kids who can't MLOps good and wanna learn to do other stuff good too.

This is a WIP - check back often for updates.

Philosophical Motivations

There are plenty of tutorials and blog posts around the Internet on data pipelines and tooling. However:

they (for good pedagogical reasons) tend to focus on one tool / step at a time, leaving us to wonder how the rest of the pipeline works;
they (for good pedagogical reasons) tend to work in a toy-world fashion, leaving us to wonder what would happen when a real dataset and a real-world problem enter the scene.

This repository (and soon-to-be-drafted written tutorial) aims to fill these gaps. In particular:

we provide open-source working code that glues together what we believe are some of the best tools in the ecosystem, going all the way from raw data to a deployed endpoint serving predictions;
we run the pipeline under a realistic load for companies at "reasonable scale", leveraging a huge open dataset we released in 2021; moreover, we train a model for a real-world use case, and show how to monitor it after deployment.

The repo may also be seen as a (very opinionated) introduction to modern, PaaS-like pipelines; while there is obviously room for disagreement over tool X or tool Y, we believe the general principles to be sound for companies at "reasonable scale": in-between bare-bone infrastructure for Tech Giants, and ready-made solutions for low-code/simple scenarios, there is a world of exciting machine learning at scale for sophisticated practitioners who don't want to waste their time managing cloud resources.

Overview

The repo shows how several (mostly open-source) tools can be effectively combined to run data pipelines. The project currently features:

Metaflow for ML DAGs (Alternatives: Luigi (?))
Snowflake as a data warehouse solution (Alternatives: Redshift)
Prefect as a general orchestrator (Alternatives: Airflow)
dbt for data transformation (Alternatives: ?)
Great Expectations for data quality (Alternatives: dbt-expectations plugin)
Weights&Biases for experiment tracking (Alternatives: Comet)
Gantry for ML monitoring (Alternatives: Aporia)
Sagemaker / Lambda for model serving (Alternatives: many)

The following picture from our Recsys paper (forthcoming) gives a quick overview of such a pipeline:

We provide two versions of the pipeline, depending on the sophistication of the setup:

a Metaflow-only version, which runs from static data files (see below) to Sagemaker as a single Flow, and can be run from a Metaflow-enabled laptop without much additional setup;
a data warehouse version, which runs in a more realistic setup, reading data from Snowflake and using an external orchestrator to run the steps. In this setup, the downside is that Snowflake and Prefect Cloud accounts are required (nonetheless, both are very easy to get); the upside is that the pipeline reflects almost perfectly a real setup, and Metaflow can be used specifically for the ML part of the process.

The parallelism between the two scenarios should be pretty clear from looking at the two projects: if you are familiarizing yourself with all the tools for the first time, we suggest you start from the Metaflow version and then move to the full-scale one when all the pieces of the puzzle are well understood.

Relevant Material

If you want to know more, you can have a look at the following material:

"Serverless MLOps for Reasonable Companies" (video), Data Science Meetup, June 2021;
"You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack" (preprint), RecSys 2021.

TBC

Status Update

July 2021

End-2-end flow working for remote and local projects; started standardizing Prefect agents with Docker and adding other services (monitoring, feature store etc.).

TO-DOs:

dockerize the local flow;
write-up all of this as a blog post;
improve code / readability / docs, add potentially some more pics and some videos;
provide an orchestrator-free version, by using step functions to manage the steps;
finish feature store and gantry integration;
add Github Action flow;
continue improving the DAG card project.

Setup

General Prerequisites (do this first!)

Irrespective of the flow you wish to run, some general tools need to be in place: Metaflow of course, as the heart of our ML practice, but also data and AWS users/roles. Please go through the general items below before tackling the flow-specific instructions.

After you finish the prerequisites below, you can run the flow you desire: each folder - remote and local - contains a specific README which should allow you to quickly run the project end-to-end: please refer to that documentation for flow-specific instructions (check back often for updates).

Dataset

The project leverages the open dataset from the 2021 Coveo Data Challenge: the dataset can be downloaded directly from here (refer to the full README for terms and conditions). Data is freely available under a research-friendly license - for background information on the dataset, the use cases and relevant work in the ML literature, please refer to the accompanying paper.

Once you download and unzip the dataset in a local folder of your choice (the zip contains 3 files, browsing_train.csv, search_train.csv, sku_to_content.csv), write down their location as an absolute path (e.g. /Users/jacopo/Documents/data/train/browsing_train.csv): both projects need to know where the dataset is.
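
As a quick sanity check before running either project, the snippet below is a minimal sketch that verifies the three files are present in the folder you chose and returns their absolute paths. The helper name `resolve_dataset_paths` is hypothetical and not part of the repository; only the three file names come from the description above.

```python
from pathlib import Path

# File names as listed in the dataset description above.
EXPECTED_FILES = ("browsing_train.csv", "search_train.csv", "sku_to_content.csv")


def resolve_dataset_paths(data_dir):
    """Return {file name: absolute path}, raising if any file is missing.

    Hypothetical helper for illustration; not part of the repository.
    """
    base = Path(data_dir).expanduser().resolve()
    missing = [name for name in EXPECTED_FILES if not (base / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing dataset files in {base}: {missing}")
    return {name: str(base / name) for name in EXPECTED_FILES}
```

Calling `resolve_dataset_paths("/Users/jacopo/Documents/data/train")` would then give you the absolute paths to write down for the flow-specific setup.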

AWS

Both projects - remote and local - use AWS services extensively - and by design: this ties back to our philosophy of PaaS-whenever-possible, and plays nicely with our core adoption of Metaflow. While you can set up your users in many functionally equivalent ways, note that if you want to run the pipeline from ingestion to serving you need to be comfortable with the following AWS interactions:

Metaflow stack (see below): we assume you installed the Metaflow stack and can run it with an AWS profile of your choice;
Serverless stack (see below): we assume you can run serverless deploy in your AWS stack;
Sagemaker user: we assume you have an AWS user with permissions to manage Sagemaker endpoints (it may be totally distinct from any other Metaflow user).

TBC

Serverless

We wrap Sagemaker predictions in a serverless REST endpoint provided by AWS Lambda and API Gateway. To manage the lambda stack we use Serverless as a wrapper around AWS infrastructure.
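
To make the idea concrete, here is a minimal sketch of what a Lambda handler sitting between API Gateway and a Sagemaker endpoint could look like. It is not the repository's actual handler: the function name `build_handler`, the choice to forward query parameters as a JSON body, and the response shape are all assumptions for illustration. The client is passed in as an argument and is expected to behave like `boto3.client("sagemaker-runtime")`.

```python
import json


def build_handler(sagemaker_client, endpoint_name):
    """Return a Lambda-style handler that forwards requests to a Sagemaker endpoint.

    Hypothetical sketch: `sagemaker_client` should behave like
    boto3.client("sagemaker-runtime"); the payload format is an assumption.
    """

    def handler(event, context):
        # With API Gateway proxy integration, query parameters arrive in the event.
        params = event.get("queryStringParameters") or {}
        response = sagemaker_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(params),
        )
        # The endpoint response body is a stream; we assume it contains JSON.
        prediction = json.loads(response["Body"].read())
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}

    return handler
```

Injecting the client keeps the handler easy to test locally with a stub; in a deployed function you would build it once at module level with a real boto3 client and an endpoint name read from configuration.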

TBC

Metaflow

Metaflow: Configuration

If you have an AWS profile configured with a Metaflow-friendly user, and you created the Metaflow stack with CloudFormation, you can run the following command with the resources created by CloudFormation to set up Metaflow on AWS:

metaflow configure aws --profile metaflow

Remember to set METAFLOW_PROFILE=metaflow to use this profile when running a flow. Once you have completed the setup, you can run flow_playground.py to test that the AWS setup is working as expected (in particular, that GPU batch jobs can run correctly). To run the flow with the custom profile created, you should do:

METAFLOW_PROFILE=metaflow python flow_playground.py run

Metaflow: Tips & Tricks

Parallelism Safe Guard
- The flag --max-workers should be used to limit the maximum number of parallel steps
- For example METAFLOW_PROFILE=metaflow python flow_playground.py run --max-workers 8 limits the maximum number of parallel tasks to 8
Environment Variables in AWS Batch
- The @environment decorator is used in conjunction with @batch to pass environment variables to AWS Batch, which will not directly have access to env variables on your local machine
- In the local example, we use @environment to pass the Weights & Biases API Key (amongst other things)
Resuming Flows
- Resuming flows is useful during development to avoid re-running compute/time intensive steps such as data preparation
- METAFLOW_PROFILE=metaflow python flow_playground.py resume <STEP_NAME> --origin-run-id <RUN_ID>
Local-Only execution
- It may sometimes be useful to debug locally (e.g. to avoid Batch startup latency): we introduce a wrapper enable_decorator around the @batch decorator, which enables or disables the decorator's functionality
- We use this in conjunction with an environment variable EN_BATCH to toggle the functionality of all @batch decorators.
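
The local-only toggle described in the last tip can be sketched as follows. This is a minimal illustration rather than the repository's exact code: the names `enable_decorator` and `EN_BATCH` come from the text above, while the helper `batch_enabled` and the convention that `EN_BATCH=1` means "on" are assumptions.

```python
import os


def enable_decorator(dec, flag):
    """Apply `dec` only when `flag` is truthy; otherwise return the function unchanged."""

    def wrapper(func):
        return dec(func) if flag else func

    return wrapper


def batch_enabled():
    # Assumption: EN_BATCH=1 turns Batch on; any other value (or unset) keeps steps local.
    return os.getenv("EN_BATCH", "0") == "1"


# Intended usage inside a flow (requires Metaflow, so shown as a comment only):
#
#   @enable_decorator(batch(gpu=1), flag=batch_enabled())
#   @step
#   def train(self):
#       ...
```

Because the flag is read when the flow file is loaded, the same code can run on Batch or locally just by changing the environment variable on the command line.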

FAQ

Both projects deal with data that has already been ingested/transmitted to the pipeline, but are silent on data collection. Any serverless option there as well?

Yes. In e-commerce use cases, for example, pixel tracking is standard (e.g. Google Analytics), so a serverless /collect endpoint can be used to get front-end data and drop it in a pure PaaS pipeline with Firehose and Snowpipe, for example. While a bit outdated in some details, we championed exactly this approach a while ago: if you want to know more, you can start from this Medium post and old code.
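
As a rough illustration of such a /collect endpoint, the sketch below shows a Lambda-style handler that forwards tracking parameters to a Firehose delivery stream. It is not the code from the post linked above: the function name `build_collect_handler`, the record layout, and the 204 response are assumptions. The client is expected to behave like `boto3.client("firehose")`.

```python
import json
import time


def build_collect_handler(firehose_client, stream_name):
    """Return a Lambda-style handler that pushes pixel events to Firehose.

    Hypothetical sketch: `firehose_client` should behave like
    boto3.client("firehose"); the record layout is an assumption.
    """

    def handler(event, context):
        # Tracking pixels typically send their payload as query parameters.
        params = event.get("queryStringParameters") or {}
        record = {"received_at_ms": int(time.time() * 1000), "params": params}
        # Newline-delimited JSON keeps records separable once Firehose batches them.
        data = (json.dumps(record) + "\n").encode("utf-8")
        firehose_client.put_record(
            DeliveryStreamName=stream_name,
            Record={"Data": data},
        )
        return {"statusCode": 204, "body": ""}

    return handler
```

From there, Firehose can land the batched records in storage, and a loader such as Snowpipe can pick them up for the warehouse.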

TBC

How to Cite our Work

If you find our principles, code or data useful, please cite our work:

Paper (forthcoming in RecSys2021)

@inproceedings{10.1145/3460231.3474604,
author = {Tagliabue, Jacopo},
title = {You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3460231.3474604},
doi = {10.1145/3460231.3474604},
series = {RecSys '21}
}

Data

@inproceedings{CoveoSIGIR2021,
author = {Tagliabue, Jacopo and Greco, Ciro and Roy, Jean-Francis and Bianchi, Federico and Cassani, Giovanni and Yu, Bingqing and Chia, Patrick John},
title = {SIGIR 2021 E-Commerce Workshop Data Challenge},
year = {2021},
booktitle = {SIGIR eCom 2021}
}

Comments

Open source options for all stages

Thanks a lot for your work. Love your repos and podcasts. I wanted to suggest adding at least one open source alternative for each stage. It would be helpful for folks that want to quickly build a full system with freely available open source tools. Thanks!

opened by bsridatta 2
Running local flows without GPUs

hi all!

Thank you for these incredible resources :)

how would I run the local flow if I don't have access to any GPUs?

https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat/tree/main/local_flow

opened by hugobowne 2
Fix missing ESCAPED_DQ File Format

The definition of the custom ESCAPED_DQ file format is missing on the sf_connector file, causing issues when a user is trying to upload the data into Snowflake.

Original issue: https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat/issues/6

opened by bigluck 2

File Format Does Not Exist

Hey there!

I was recently looking at the remote_flow workflow in the repo. However at the step where data needs to be uploaded into a Snowflake instance, I have been running into an error when calling make upload:

Traceback (most recent call last):
  File "push_data_to_sf.py", line 83, in <module>
    write_chunks(table=sku_to_content_table,
  File "push_data_to_sf.py", line 64, in write_chunks
    conn.upload_file(f"{output_prefix}*", table_name)
  File "/app/connectors/sf_connector.py", line 79, in upload_file
    self._cs.execute(f"COPY INTO {table} FILE_FORMAT = ESCAPED_DQ")
  File "/usr/local/lib/python3.8/site-packages/snowflake/connector/cursor.py", line 693, in execute
    Error.errorhandler_wrapper(
  File "/usr/local/lib/python3.8/site-packages/snowflake/connector/errors.py", line 258, in errorhandler_wrapper
    cursor.errorhandler(connection, cursor, error_class, error_value)
  File "/usr/local/lib/python3.8/site-packages/snowflake/connector/errors.py", line 188, in default_errorhandler
    raise error_class(
snowflake.connector.errors.ProgrammingError: 002003 (02000): SQL compilation error:
File format 'ESCAPED_DQ' does not exist or not authorized.

It appears that ESCAPED_DQ might be a custom defined file format for Snowflake data ingestion? Where is its definition?

Thanks for putting this repo together!

opened by mihail911 2

only enable pip install decorator if batch enabled

I wasn't running your exact example but I was having issues with running this locally with the pip install decorator. I think it should also be wrapped and only run on batch. ❤️ the clever decorators to install on batch.

opened by JSpenced 2
Comparison between this and something like Kedro or ZenML?

Hey there! Thanks for this amazing work! I was wondering if you /any users here had done a comparison between these projects. Seems like this repo just describes a general set of tools and how they link up together whereas the others take an opinionated stance and do the linking themselves?

I'm in the process of evaluating tools / frameworks.

I'm currently a single person looking to set up the groundwork for things to come. So far my plans have been to stay local for as long as possible before moving to some distributed computing framework (Dask gave me a lot of trouble in the past). I'm also looking to avoid using tools such as AWS or GCP for as long as possible so ideally the discussion would revolve around local machines. I'd love to hear thoughts and opinions.

opened by IanQS 1
only pip install packages if running on batch

I accidentally closed the other pull requests.

This will only pip install packages if on batch. I wasn't sure if to put this at this level or move it before the wrapper decorator.

opened by JSpenced 0
Andrew/standarization of variables
Changes:

Updates to the variable names for SQL

and

using serverless-dotenv-plugin

The plugin allows us to populate the serverless.yml with the contents of the .env file. It seemed to be the approach with less friction.
opened by asutcliffe-coveo 0
Andrew/dbt
This is a basic setup for DBT; the project structure could be reviewed. Currently sigir_dbt is under prototype_flow at the same level as source, which contains my python scripts to unprocess data.

The README in sigir_dbt should contain the information needed for setup. Ping me if you have an issue. If everything works on your side we could merge.

Important notes:

If you want to use the python connector for snowflake we may need to fix the version for pyarrow as it could lead to conflicts with dbt. I remember getting a warning.

Also I will do a clean up of the src folder and merge it in tonight.
opened by asutcliffe-coveo 0
