MLOps with Vertex AI

This example implements the end-to-end MLOps process using the Vertex AI platform and Smart Analytics technology capabilities. The example uses Keras to implement the ML model, TFX to implement the training pipeline, and the Model Builder SDK to interact with Vertex AI.

MLOps lifecycle

Getting started

  1. Set up your MLOps environment on Google Cloud.

  2. Start your AI Notebook instance.

  3. Open JupyterLab, then open a new terminal.

  4. Clone the repository to your AI Notebook instance:

    git clone https://github.com/GoogleCloudPlatform/mlops-with-vertex-ai.git
    cd mlops-with-vertex-ai
    
  5. Install the required Python packages:

    pip install tfx==1.2.0 --user
    pip install -r requirements.txt
    

    NOTE: You can ignore the pip dependency issues; these will be fixed in subsequent TFX versions.


  6. Upgrade the gcloud components:

    sudo apt-get install google-cloud-sdk
    gcloud components update
    

Dataset Management

The Chicago Taxi Trips dataset is one of the public datasets hosted on BigQuery, which includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. The task is to predict whether a given trip will result in a tip > 20%.

The 01-dataset-management notebook covers:

  1. Performing exploratory data analysis on the data in BigQuery.
  2. Creating a Vertex AI Dataset resource using the Python SDK.
  3. Generating the schema for the raw data using TensorFlow Data Validation.
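
As a rough illustration of steps 2 and 3, here is a minimal sketch using the Vertex AI Python SDK and TensorFlow Data Validation; the project, region, display name, and sample DataFrame are placeholders, not the notebook's exact code:

    from google.cloud import aiplatform as vertex_ai
    import tensorflow_data_validation as tfdv

    vertex_ai.init(project="my-project", location="us-central1")  # hypothetical project/region

    # Create a Vertex AI Dataset resource that points at the BigQuery table.
    dataset = vertex_ai.TabularDataset.create(
        display_name="chicago-taxi-tips",
        bq_source="bq://bigquery-public-data.chicago_taxi_trips.taxi_trips",
    )

    # Generate statistics from a sample of the raw data and infer a schema.
    stats = tfdv.generate_statistics_from_dataframe(sample_df)  # sample_df: a pandas sample of the table
    schema = tfdv.infer_schema(statistics=stats)
    tfdv.write_schema_text(schema, "raw_schema/schema.pbtxt")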

ML Development

We experiment with creating a Custom Model using the 02-experimentation notebook, which covers:

  1. Preparing the data using Dataflow.
  2. Implementing a Keras classification model.
  3. Training the Keras model with Vertex AI using a pre-built container.
  4. Uploading the exported model from Cloud Storage to Vertex AI.
  5. Extracting and visualizing experiment parameters from Vertex AI Metadata.
  6. Using Vertex AI for hyperparameter tuning.

We use Vertex TensorBoard and Vertex ML Metadata to track, visualize, and compare ML experiments.
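
As a rough sketch of how an experiment run lands in Vertex ML Metadata via the SDK (the experiment and run names, and the logged values, are illustrative placeholders):

    from google.cloud import aiplatform as vertex_ai

    vertex_ai.init(
        project="my-project",
        location="us-central1",
        experiment="chicago-taxi-experiment",  # hypothetical experiment name
    )

    vertex_ai.start_run("run-001")
    vertex_ai.log_params({"learning_rate": 0.001, "hidden_units": "64,64"})
    # ... train and evaluate the Keras model here ...
    vertex_ai.log_metrics({"val_accuracy": 0.91, "val_loss": 0.23})  # placeholder values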

In addition, the training steps are formalized by implementing a TFX pipeline. The 03-training-formalization notebook covers implementing and testing the pipeline components interactively.
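
Interactive execution follows the standard TFX pattern; a minimal sketch, assuming an upstream example_gen component already exists:

    from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
    from tfx.components import StatisticsGen

    context = InteractiveContext()  # backed by a temporary local ML Metadata store

    # Run one component at a time and inspect its output artifacts inline.
    statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])
    context.run(statistics_gen)
    context.show(statistics_gen.outputs["statistics"])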

Training Operationalization

The 04-pipeline-deployment notebook covers executing the CI/CD steps for the training pipeline deployment using Cloud Build. The CI/CD routine is defined in the pipeline-deployment.yaml file, and consists of the following steps:

  1. Clone the repository to the build environment.
  2. Run unit tests.
  3. Run a local e2e test of the TFX pipeline.
  4. Build the ML container image for pipeline steps.
  5. Compile the pipeline.
  6. Upload the pipeline to Cloud Storage.
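
For step 5, TFX compiles the pipeline into a job spec for Vertex Pipelines; a hedged sketch using TFX 1.2's KubeflowV2DagRunner, where the image name is a placeholder for the one built in step 4:

    from tfx.orchestration.kubeflow.v2 import kubeflow_v2_dag_runner

    runner = kubeflow_v2_dag_runner.KubeflowV2DagRunner(
        config=kubeflow_v2_dag_runner.KubeflowV2DagRunnerConfig(
            default_image="gcr.io/my-project/chicago-taxi-tfx",  # hypothetical image name
        ),
        output_filename="chicago-taxi-pipeline.json",
    )
    runner.run(pipeline)  # pipeline: the TFX pipeline object defined in src/pipelines

The compiled JSON file is what step 6 uploads to Cloud Storage.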

Continuous Training

After testing, compiling, and uploading the pipeline definition to Cloud Storage, the pipeline is executed in response to a trigger. We use Cloud Functions and Cloud Pub/Sub as the triggering mechanism. The Cloud Function listens to the Pub/Sub topic and runs the training pipeline when a message is sent to the topic. The Cloud Function is implemented in src/pipeline_triggering, and is sketched after the list below.

The 05-continuous-training notebook covers:

  1. Creating a Cloud Pub/Sub topic.
  2. Deploying a Cloud Function.
  3. Triggering the pipeline.
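
A minimal sketch of such a trigger function, assuming the Pub/Sub message carries the pipeline parameter values as a JSON payload (the project, bucket, and display names are placeholders):

    import base64
    import json

    from google.cloud import aiplatform

    def trigger_pipeline(event, context):
        """Background Cloud Function: submits the training pipeline for each Pub/Sub message."""
        parameter_values = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

        aiplatform.init(project="my-project", location="us-central1")
        job = aiplatform.PipelineJob(
            display_name="chicago-taxi-training",
            template_path="gs://my-bucket/pipelines/chicago-taxi-pipeline.json",  # compiled spec from CI/CD
            parameter_values=parameter_values,
        )
        job.submit()  # fire-and-forget; job.run() would block until the pipeline finishes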

The end-to-end TFX training pipeline implementation is in the src/pipelines directory, which covers the following steps:

  1. Receive hyperparameters using the hyperparam_gen custom Python component.
  2. Extract data from BigQuery using the BigQueryExampleGen component.
  3. Validate the raw data using the StatisticsGen and ExampleValidator components.
  4. Process the data on Dataflow using the Transform component.
  5. Train a custom model with Vertex AI using the Trainer component.
  6. Evaluate and validate the custom model using the ModelEvaluator component.
  7. Save the blessed model to the model registry location in Cloud Storage using the Pusher component.
  8. Upload the model to Vertex AI using the vertex_model_pusher custom Python component.
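
As a structural sketch (not the repository's exact wiring), the first few steps chain together roughly like this; the query, schema location, and module path are placeholders:

    from tfx import v1 as tfx

    # Step 2: extract examples straight from BigQuery.
    example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
        query=sql_query  # sql_query: source query built from the received hyperparameters
    )

    # Step 3: validate the raw data against the curated schema.
    statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
    schema_importer = tfx.dsl.Importer(
        source_uri="raw_schema/",  # hypothetical schema location
        artifact_type=tfx.types.standard_artifacts.Schema,
    ).with_id("schema_importer")
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs["statistics"],
        schema=schema_importer.outputs["result"],
    )

    # Step 4: transform the data on Dataflow.
    transform = tfx.components.Transform(
        examples=example_gen.outputs["examples"],
        schema=schema_importer.outputs["result"],
        module_file="src/preprocessing/transformations.py",  # hypothetical module path
    )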

Model Deployment

The 06-model-deployment notebook covers executing the CI/CD steps for the model deployment using Cloud Build. The CI/CD routine is defined in the build/model-deployment.yaml file, and consists of the following steps:

  1. Test model interface.
  2. Create an endpoint in Vertex AI.
  3. Deploy the model to the endpoint.
  4. Test the Vertex AI endpoint.
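
A hedged sketch of steps 2 and 3 with the Vertex AI SDK (the display names and machine type are placeholders):

    from google.cloud import aiplatform as vertex_ai

    vertex_ai.init(project="my-project", location="us-central1")

    # Create an endpoint and deploy the previously uploaded model to it.
    endpoint = vertex_ai.Endpoint.create(display_name="chicago-taxi-endpoint")
    model = vertex_ai.Model.list(filter='display_name="chicago-taxi-classifier"')[0]
    endpoint.deploy(
        model=model,
        machine_type="n1-standard-4",
        traffic_percentage=100,
    )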

Prediction Serving

We serve the deployed model for prediction. The 07-prediction-serving notebook covers:

  1. Using the Vertex AI endpoint for online prediction.
  2. Using the uploaded Vertex AI model for batch prediction.
  3. Running batch prediction using Vertex Pipelines.
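
Continuing from the deployment sketch above, the online and batch calls look roughly like this; the feature payload and BigQuery URIs are placeholders:

    # Online prediction against the deployed endpoint.
    prediction = endpoint.predict(instances=[{"trip_miles": 2.5, "trip_seconds": 900}])  # placeholder instance
    print(prediction.predictions)

    # Batch prediction from the uploaded model resource.
    batch_job = model.batch_predict(
        job_display_name="chicago-taxi-batch-predict",
        bigquery_source="bq://my-project.my_dataset.prediction_input",  # hypothetical input table
        bigquery_destination_prefix="bq://my-project.my_dataset",       # hypothetical output location
        machine_type="n1-standard-4",
    )
    batch_job.wait()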

Model Monitoring

After a model is deployed for prediction serving, continuous monitoring is set up to ensure that the model continues to perform as expected. The 08-model-monitoring notebook covers configuring Vertex AI Model Monitoring for skew and drift detection:

  1. Setting skew and drift thresholds.
  2. Creating a monitoring job for all the models under an endpoint.
  3. Listing the monitoring jobs.
  4. Listing the artifacts produced by the monitoring job.
  5. Pausing and deleting the monitoring job.
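
For orientation, a rough sketch of creating such a job with the SDK's model_monitoring helpers, which ship with newer google-cloud-aiplatform releases; the training table, target field, threshold, and e-mail address are placeholders:

    from google.cloud import aiplatform
    from google.cloud.aiplatform import model_monitoring

    # Compare serving traffic against the training data to detect skew.
    skew_config = model_monitoring.SkewDetectionConfig(
        data_source="bq://my-project.my_dataset.training_data",  # hypothetical training table
        target_field="tip_bin",                                  # hypothetical label column
        skew_thresholds={"trip_miles": 0.05},                    # placeholder threshold
    )
    objective_config = model_monitoring.ObjectiveConfig(skew_detection_config=skew_config)

    monitoring_job = aiplatform.ModelDeploymentMonitoringJob.create(
        display_name="chicago-taxi-monitoring",
        endpoint=endpoint,  # covers all models deployed under this endpoint
        objective_configs=objective_config,
        logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
        schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # hours between runs
        alert_config=model_monitoring.EmailAlertConfig(user_emails=["mlops-team@example.com"]),
    )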

Metadata Tracking

You can view the parameters and metrics logged by your experiments, as well as the artifacts and metadata stored by your Vertex Pipelines, in the Cloud Console.
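
The same information is reachable from code; for instance, experiment runs can be pulled into a DataFrame with the SDK (the experiment name is a placeholder):

    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")
    # One row per run, with logged parameters and metrics as columns.
    runs_df = aiplatform.get_experiment_df("chicago-taxi-experiment")
    print(runs_df.head())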

Disclaimer

This is not an official Google product but sample code provided for educational purposes.


Copyright 2021 Google LLC.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • 02-Experimentation: Create classifier fails w/layer requires matching shapes

    ValueError: A Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 2), (None, 4), (7,), (None, 3), (None, 1), (None, 1), (6,), (None, 3), (None, 3), (None, 1), (None, 10)]

    Input Features:
    dropoff_grid_xf <dtype: 'int64'>: [0, 0, 0]
    euclidean_xf <dtype: 'float32'>: [0.669279932975769, -0.8318284749984741, -0.8318284749984741]
    loc_cross_xf <dtype: 'int64'>: [0, 0, 0]
    payment_type_xf <dtype: 'int64'>: [2, 0, 0]
    pickup_grid_xf <dtype: 'int64'>: [0, 0, 0]
    trip_day_of_week_xf <dtype: 'int64'>: [0, 6, 2]
    trip_day_xf <dtype: 'int64'>: [8, 30, 1]
    trip_hour_xf <dtype: 'int64'>: [1, 13, 6]
    trip_miles_xf <dtype: 'float32'>: [2.3255326747894287, -0.22459185123443604, -0.4029441475868225]
    trip_month_xf <dtype: 'int64'>: [3, 1, 0]
    trip_seconds_xf <dtype: 'float32'>: [0.9550504088401794, -0.2630620300769806, -0.24356801807880402]
    target: [0, 0, 0]

    opened by jth1911 3
  • DataTransformer ModuleNotFoundError: No module named 'user_module_0' error

    ModuleNotFoundError: No module named 'user_module_0'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
        work_executor.execute()
      File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
        op.start()
      File "apache_beam/runners/worker/operations.py", line 710, in apache_beam.runners.worker.operations.DoOperation.start
      File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.start
      File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.start
      File "apache_beam/runners/worker/operations.py", line 311, in apache_beam.runners.worker.operations.Operation.start
      File "apache_beam/runners/worker/operations.py", line 317, in apache_beam.runners.worker.operations.Operation.start
      File "apache_beam/runners/worker/operations.py", line 659, in apache_beam.runners.worker.operations.DoOperation.setup
      File "apache_beam/runners/worker/operations.py", line 660, in apache_beam.runners.worker.operations.DoOperation.setup
      File "apache_beam/runners/worker/operations.py", line 292, in apache_beam.runners.worker.operations.Operation.setup
      File "apache_beam/runners/worker/operations.py", line 306, in apache_beam.runners.worker.operations.Operation.setup
      File "apache_beam/runners/worker/operations.py", line 799, in apache_beam.runners.worker.operations.DoOperation._get_runtime_performance_hints
      File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 294, in loads
        return dill.loads(s)
      File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
        return load(file, ignore, **kwds)
      File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
        return Unpickler(file, ignore=ignore, **kwds).load()
      File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
        obj = StockUnpickler.load(self)
      File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 826, in _import_module
        return __import__(import_name)
    ModuleNotFoundError: No module named 'user_module_0'

    Running the transform component of this repository in Vertex AI hits the following error (in the 04-pipeline-deployment.ipynb notebook). Does anyone have a quick fix for this? Have tried specifying setup.py and "save_main_session": True so far with no luck.

    opened by baymears 2
  • Error: 02 - ML Experimentation with Custom Model

    Hi I'm running the tutorial with TFX 1.4.0 and TF 2.7.0.

    When I run the cell:

        classifier = model.create_binary_classifier(tft_output, hyperparams)
        classifier.summary()

    I get the error:


    ValueError                                Traceback (most recent call last)
    in <module>
    ----> 1 classifier = model.create_binary_classifier(tft_output, hyperparams)
          2 classifier.summary()

    ~/mlops-with-vertex-ai/src/model_training/model.py in create_binary_classifier(tft_output, hyperparams)
         83     )
         84
    ---> 85     return _create_binary_classifier(feature_vocab_sizes, hyperparams)

    ~/mlops-with-vertex-ai/src/model_training/model.py in _create_binary_classifier(feature_vocab_sizes, hyperparams)
         62         pass
         63
    ---> 64     joined = keras.layers.Concatenate(name="combines_inputs")(layers)
         65     feedforward_output = keras.Sequential(
         66         [

    /opt/conda/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
         65     except Exception as e:  # pylint: disable=broad-except
         66       filtered_tb = _process_traceback_frames(e.__traceback__)
    ---> 67       raise e.with_traceback(filtered_tb) from None
         68     finally:
         69       del filtered_tb

    /opt/conda/lib/python3.7/site-packages/keras/layers/merge.py in build(self, input_shape)
        514     ranks = set(len(shape) for shape in shape_set)
        515     if len(ranks) != 1:
    --> 516       raise ValueError(err_msg)
        517     # Get the only rank for the set.
        518     (rank,) = ranks

    ValueError: A Concatenate layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 2), (None, 4), (7,), (None, 3), (None, 1), (None, 1), (6,), (None, 3), (None, 3), (None, 1), (None, 10)]

    opened by jarmstrongcorus 1
  • Error during compilation with Cloud Build

    opened by sayakpaul 1
  • A doubt regarding the TFX pipeline associated with Continuous Training

    Hi @ksalama.

    Thank you very much for this amazing resource. It's a mini-book in itself.

    I am referring to the statement written just below the continuous training section:

    The end-to-end TFX training pipeline implementation is in the src/pipelines directory, which covers the following steps:

    Are these pipelines demonstrated in any of the notebooks?

    opened by sayakpaul 1
  • Version an already trained custom model on Model Registry

    Hi, everyone!

    I trained a model using scikit-learn, and saved it as a pickle file, stored in a GCS bucket. I was wondering... how can I version this model using model registry since it is already trained?

    In my case, I have the option of using Cloud Run for this (a cloud run container that runs a retraining task every week), but I want to start doing it on Vertex AI.

    After reading some articles about Vertex AI Model Registry, I concluded that the model must first be trained using a custom training job, and only after that can we begin versioning it in the Model Registry. Is this correct?

    opened by LiviaPimentelCVER 0
  • Error: googleapi: Error 409: The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.

    I literally created a bucket named with a random BTC wallet address to make it unique, as the GCP guide told me to do, and I am still hitting the same error while executing gcs-bucket.tf:

    google_storage_bucket.bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh: Creating...
    ╷
    │ Error: googleapi: Error 409: The requested bucket name is not available. The bucket
    │ namespace is shared by all users of the system. Please select a different name and
    │ try again., conflict
    │
    │   with google_storage_bucket.bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh,
    │   on gcs-bucket.tf line 17, in resource "google_storage_bucket" "bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh":
    │   17: resource "google_storage_bucket" "bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh" {
    │

    opened by Sharaykaka 2
  • In Vertex AI creating Tensorboard instance will drain out all the credits

    I've noticed that the TensorBoard instance created in Notebook 2 uses a lot of credits. Before January 22 this was free; once this instance is enabled, disabling the Vertex AI API won't stop the billing. See the response from GCP: https://www.googlecloudcommunity.com/gc/AI-ML/How-can-I-avoid-being-charged-for-Tensorboard/td-p/180658

    opened by sthubeML2020 0
  • Python markupsafe dependency error in 03-training-formalization.ipynb

    The latest release of the markupsafe Python 3 package is not compatible with from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext:

    import ml_metadata as mlmd
    from ml_metadata.proto import metadata_store_pb2
    from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
    

    Error:

    ImportError                               Traceback (most recent call last)
    /tmp/ipykernel_1/3645963042.py in <module>
          1 import ml_metadata as mlmd
          2 from ml_metadata.proto import metadata_store_pb2
    ----> 3 from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
          4
          5 connection_config = metadata_store_pb2.ConnectionConfig()

    ~/.local/lib/python3.7/site-packages/tfx/orchestration/experimental/interactive/interactive_context.py in <module>
         35
         36 import absl
    ---> 37 import jinja2
         38 import nbformat
         39 from tfx import types

    ~/.local/lib/python3.7/site-packages/jinja2/__init__.py in <module>
         10 from .bccache import FileSystemBytecodeCache
         11 from .bccache import MemcachedBytecodeCache
    ---> 12 from .environment import Environment
         13 from .environment import Template
         14 from .exceptions import TemplateAssertionError

    ~/.local/lib/python3.7/site-packages/jinja2/environment.py in <module>
         23 from .compiler import CodeGenerator
         24 from .compiler import generate
    ---> 25 from .defaults import BLOCK_END_STRING
         26 from .defaults import BLOCK_START_STRING
         27 from .defaults import COMMENT_END_STRING

    ~/.local/lib/python3.7/site-packages/jinja2/defaults.py in <module>
          1 # -*- coding: utf-8 -*-
          2 from ._compat import range_type
    ----> 3 from .filters import FILTERS as DEFAULT_FILTERS  # noqa: F401
          4 from .tests import TESTS as DEFAULT_TESTS  # noqa: F401
          5 from .utils import Cycler

    ~/.local/lib/python3.7/site-packages/jinja2/filters.py in <module>
         11 from markupsafe import escape
         12 from markupsafe import Markup
    ---> 13 from markupsafe import soft_unicode
         14
         15 from ._compat import abc

    ImportError: cannot import name 'soft_unicode' from 'markupsafe' (/home/jupyter/.local/lib/python3.7/site-packages/markupsafe/__init__.py)

    The workaround is to install an older version with pip install markupsafe==2.0.1 and restart the kernel.

    opened by hilliao 0
  • Python IndexError execution error in 03-training-formalization.ipynb

    Python error encountered executing the following lines at [Extract train and eval splits]:

    sql_query = datasource_utils.get_training_source_query(
        PROJECT, REGION, DATASET_DISPLAY_NAME, ml_use='UNASSIGNED', limit=5000)
    

    Observed error:

    IndexError                                Traceback (most recent call last)
    /tmp/ipykernel_1/1584844956.py in <module>
          1 print(DATASET_DISPLAY_NAME)
          2 sql_query = datasource_utils.get_training_source_query(
    ----> 3     PROJECT, REGION, DATASET_DISPLAY_NAME, ml_use='UNASSIGNED', limit=5000)
          4
          5 output_config = example_gen_pb2.Output(

    ~/mlops-with-vertex-ai/src/common/datasource_utils.py in get_training_source_query(project, region, dataset_display_name, ml_use, limit)
         55     dataset = vertex_ai.TabularDataset.list(
         56         filter=f"display_name={dataset_display_name}", order_by="update_time"
    ---> 57     )[-1]
         58     bq_source_uri = dataset.gca_resource.metadata["inputConfig"]["bigquerySource"][
         59         "uri"

    IndexError: list index out of range

    I can't find the .list method for google.cloud.aiplatform's TabularDataset that datasource_utils.py calls.

    opened by hilliao 1