AI and Machine Learning workflows on Anthos Bare Metal.

Overview

Hybrid and Sovereign AI on Anthos Bare Metal

Table of Contents

Overview

AI and Machine Learning workflows using TensorFlow on Anthos Bare Metal. TensorFlow is one of the most popular ML frameworks (10M+ downloads per month) in use today, but at the same time presents a lot of challenges when it comes to setup (GPUs, CUDA Drivers, TF Serving etc), performance tuning, cluster provisioning, maintenance, and model serving. This work will showcase the easy to use guides for ML model serving, training, infrastructure, ML Notebooks, and more on Anthos Bare Metal.

Terraform as IaC Substrate

Terraform is an open-source infrastructure as code software tool, and one of the ways in which Enterprise IT teams create, manage, and update infrastructure resources such as physical machines, VMs, switches, containers, and more. Provisioning the hardware or resources is always the first step in the process and these guides will be using Terraform as a common substrate to create the infrastructure for AI/ML apps. Checkout our upstream contribution to the Google Terraform Provider for GPU support in the instance_template module.

Serving TensorFlow ResNet Model on ABM

In this installation you'll see how to create an end-to-end TensorFlow ML serving ResNet installation on ABM using Google Compute Engine. Once the setup is completed, you'll be able to send image classification requests using GRPC client to ABM ML Serving cluster.

Requirements

  • Google Cloud Platform access and install gcloud SDK
  • Service Account JSON
  • Terraform, Git, Container Image

ResNet SavedModel Image on GCR

Let's create a local directory and download the Deep residual network (ResNet) model.

rm -rf /tmp/resnet
mkdir /tmp/resnet
curl -s http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz

Verify the SavedModel

ls /tmp/resnet/*
saved_model.pb variables

Now we will commit the ResNet serving docker image:

docker run -d --name serving_base tensorflow/serving
docker cp /tmp/resnet serving_base:/models/resnet
docker commit --change "ResNet model" serving_base $USER/resnet_serving
docker kill serving_base
docker rm serving_base

Copy the local docker image to gcr.io

export GCR_IMAGE_PATH="gcr.io/$GCP_PROJECT/abm_serving/resnet"
docker tar $USER/resnet_serving $GCR_IMAGE_PATH
docker push $GCR_IMAGEPATH

ABM GCE Cluster using Terraform

Create GCE demo host and perform few steps to setup the host:

export SERVICE_ACCOUNT_FILE=<FILE_LOCATION>

export DEMO_HOST="abm-demo-host-live"
gcloud compute instances create $DEMO_HOST --zone=us-central1-a
gcloud compute scp $SERVICE_ACCOUNT_FILE $USER@$DEMO_HOST:

Perform ssh login into the demo machine and follow steps below:

gcloud compute ssh $DEMO_HOST --zone=us-central1-a

# Activate Service Account
gcloud auth activate-service-account --key-file=$SERVICE_ACCOUNT_FILE

# Install Git
sudo apt-get install git

# Install Terraform
# v0.14.10
export TERRAFORM_VERSION="0.14.10"

List current Anthos/GKE clusters using hub membership. You can list existing clusters and compare it with newly created clusters.

# List Anthos BM clusters
gcloud container hub memberships list

Install Terraform, and make few minor changes to configuration files:

# Remove any previous versions. You can skip if this is a new instance
sudo apt remove terraform

sudo apt-get install software-properties-common

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform=$TERRAFORM_VERSION

terraform -version

Let's setup some ABM infrastructure on GCE using Terraform

# Git clone ABM Terraform setup
git clone https://github.com/GoogleCloudPlatform/anthos-samples.git
cd anthos-samples
git checkout abm-gcp-tf-demo
cd anthos-bm-gcp-terraform

# Make changes to cluster names and few edits
cp terraform.tfvars.sample terraform.tfvars

Make edits to the variables.tf and terraform.tfvars and also make sure the abm_cluster_id is modified to a unique name

# Change abm_cluster_id and service account name in variables.tf
export CLUSTER_ID=`echo "abm-tensorflow-"$(date +"%m%d%H%M")`
echo $CLUSTER_ID

Create GCE resources using Terraform and verify

# Terraform init and apply
terraform init && terraform plan
terraform apply

# Verify resources using gcloud
gcloud compute instancs list

# Let's create cluster using bmctl and perform pre-flight checks and verify
export KUBECONFIG=$HOME/bmctl-workspace/$CLUSTER_ID/$CLUSTER_ID-kubeconfig

# List ABM clusters
gcloud container hub memberships list

# Listing the details of live-cluster
gcloud container hub memberships describe $LIVE_CLUSTER_NAME

Verify k8s cluster details and check few outputs

kubectl get nodes
kubectl get deployments
kubectl get pods

TensorFlow ResNet model service on ABM Cluster

git clone https://github.com/GoogleCloudPlatform/anthos-ai
cd anthos-ai

kubectl create -f serving/resnet_k8s.yaml

# Let's view deployments and pods
kubectl get deployments
kubectl get pods

kubectl get services
kubectl describe service resnet-abm-service

# Let's send prediction request to ResNet service on ABM
git clone https://github.com/puneith/serving.git
sudo tools/run_in_docker.sh python tensorflow_serving/example/resnet_client_grpc.py $IMAGE_URL --server=10.200.0.51:8500

Return to the demo host and then destroy the demo host

# Destroy resources and demo host
terraform destroy

gcloud compute instances delete $DEMO_HOST
You might also like...
A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.
A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Sentiment Analysis on Yelp's Dataset Author: Roberto Sanchez, Talent Path: D1 Group Docker Deployment: Deployment of this application can be found her

Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

ChatterBot is a machine learning, conversational dialog engine for creating chat bots
ChatterBot is a machine learning, conversational dialog engine for creating chat bots

ChatterBot ChatterBot is a machine-learning based conversational dialog engine build in Python which makes it possible to generate responses based on

ChatterBot is a machine learning, conversational dialog engine for creating chat bots
ChatterBot is a machine learning, conversational dialog engine for creating chat bots

ChatterBot ChatterBot is a machine-learning based conversational dialog engine build in Python which makes it possible to generate responses based on

Knowledge Management for Humans using Machine Learning & Tags
Knowledge Management for Humans using Machine Learning & Tags

HyperTag helps humans intuitively express how they think about their files using tags and machine learning. Represent how you think using tags. Find what you look for using semantic search for your text documents (yes, even PDF's) and images.

Machine learning models from Singapore's NLP research community

SG-NLP Machine learning models from Singapore's natural language processing (NLP) research community. sgnlp is a Python package that allows you to eas

NLP, Machine learning
NLP, Machine learning

Netflix-recommendation-system NLP, Machine learning About Recommendation algorithms are at the core of the Netflix product. It provides their members

NLP - Machine learning
NLP - Machine learning

Flipkart-product-reviews NLP - Machine learning About Product reviews is an essential part of an online store like Flipkart’s branding and marketing.

Learning to Rewrite for Non-Autoregressive Neural Machine Translation
Learning to Rewrite for Non-Autoregressive Neural Machine Translation

RewriteNAT This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressiv

Comments
  • Create guide for ResNet GPU Training using TFJob

    Create guide for ResNet GPU Training using TFJob

    The objective is to show GPU based TensorFlow training on ABM cluster on GCE. The summary of tasks which need to be performed:

    • Create ABM GPU cluster with NVIDIA Drivers and V100 GPU
    • Perform Model Training using TFJob and GPUs
    • Perform ResNet TensorFlow model serving using ABM GPU
    • Author the README guide
    documentation enhancement 
    opened by puneith 1
  • Create a GPU based ABM cluster and install CUDA drivers during Terraform initialization

    Create a GPU based ABM cluster and install CUDA drivers during Terraform initialization

    We need to create a Terraform based ABM cluster setup [1] using GPU on Google Compute Engine. We've made upstream contribution to the Google Terraform Provider for GPU support [2] in the instance_template module release [3]. We need to perform few things --

    • Provide a config based sample for GPU hardware enabled ABM (anthos-samples repo)
    • Install NVIDIA CUDA drivers on ABM cluster (anthos-samples repo)
    • Create a setup guide in this repo on how to create the ABM GPU cluster

    [1] https://github.com/GoogleCloudPlatform/anthos-samples/tree/master/anthos-bm-gcp-terraform [2] https://github.com/terraform-google-modules/terraform-google-vm/pull/160 [3] https://github.com/terraform-google-modules/terraform-google-vm/pull/158

    enhancement 
    opened by puneith 1
  • TensorFlow model training on ABM using Kubeflow TFJob

    TensorFlow model training on ABM using Kubeflow TFJob

    TensorFlow ML model training on Anthos Bare Metal. Create a basic Hello World training app using TFJob.

    TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes. The Kubeflow implementation of TFJob is in tf-operator.

    The model is a simple MNIST model with persistent volume. Once we have a running example on ABM, we'll need to create an end to end ML workflow on ABM and serve the model as well.

    The above example also provides an exploration into Kubeflow on ABM and why such a workflow makes sense. Even though we can run TensorFlow directly on Kubernetes (as shown in the TF Serving example), the TFJob abstraction makes it easy to define TensorFlow deployments. The standard way of performing multi-node training in TensorFlow is using TF_CONFIG environment variable. This environment variable is a JSON string which provides cluster and task information, and is set for each binary running on the cluster. The nodes in the cluster can be worker and ps nodes. In multi-worker training, there is usually one worker that takes on a little more responsibility like saving checkpoint and writing summary file for TensorBoard in addition to what a regular worker does. Such worker is referred to as the chief worker, and it is customary that the worker with index 0 is appointed as the chief worker. Each node in the cluster also has one of the roles worker, ps, or chief.

    The setting of TF_CONFIG environment variable can be a manual process if done outside Kubeflow, and this is the part which TFJob Controller automatically manages.

    enhancement 
    opened by puneith 0
  • Create Edge ML deployment using ABM and ACM

    Create Edge ML deployment using ABM and ACM

    There has been current work going on with ABM Edge large scale rollout here. One of the common workloads we've been seeing on the Edge is the deployment of ML models (like serving ML models for image classification).

    Deploying these pre-trained models on the Edge involves complex software systems. ABM 1.8.0+ provides an Edge profile with minimal vCPU and Mem resource requirements and therefore well suited to deploy appropriate size ML models. Once the rollout work is finished we'd like to replace that work with an ML workload.

    opened by puneith 1
Owner
Google Cloud Platform
Google Cloud Platform
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 3, 2023
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 10.8k Feb 18, 2021
One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

Adobe, Inc. 148 Dec 26, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 7, 2023
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.1k Feb 17, 2021
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

Texar-PyTorch is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar

ASYML 726 Dec 30, 2022
This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

Twitter COVID-19 Sentiment Analysis Members: Christopher Bach | Khalid Hamid Fallous | Jay Hirpara | Jing Tang | Graham Thomas | David Wetherhold Pro

null 4 Oct 15, 2022