Lorien: A Unified Infrastructure for Efficient Deep Learning Workloads Delivery

Overview

Lorien is an infrastructure to massively explore/benchmark the best schedules of given deep learning models. Lorien is deep learning compiler (DLC) agnostic, so one can easily implement a Lorien dialect to support a new DLC.

Motivation

Although auto-tuning frameworks for deep learning compilers (e.g., TVM, Halide) can deliver high-performance operators that match or even beat vendor kernel libraries, auto-tuning a deep learning model can take days or even weeks, especially for models with many workloads such as ResNet-152 or Inception V3.

With such a long tuning time, one key question for maintaining the best user experience during deep learning model development and deployment is: how can we promptly deliver schedules with reasonably good performance upon user request? Accordingly, we designed and implemented Lorien to remove the following obstacles:

Tuning Process Scalability and Stability. Long tuning time affects not only time-to-market but also stability. To the best of our knowledge, none of the existing auto-tuning frameworks is designed for tuning on multiple machines, and none of them considers fault tolerance. Hence, the tuning process has to be restarted manually if it is accidentally interrupted. This matters especially on edge devices, which are less reliable than cloud instances and may fail frequently due to overheating or other factors.
Tuning Result Management. Although almost all auto-tuning frameworks provide mechanisms to serialize tuning results for future use, all of them use file-based mechanisms with different formats. As a result, engineers have additional work to orchestrate the data for efficient usage.
Time to Deliver an Efficient Schedule. Even if a database is constructed to serve most user requests, certain workloads may still be missing. However, modern auto-tuning frameworks usually rely on iterative search algorithms with on-device measurements, which usually take hours to find an efficient schedule for an unseen workload. This expensive querying/tuning overhead makes production deployment impractical.

Lorien is a unified and extensible infrastructure for delivering efficient deep learning workloads upon request. Lorien allows deep learning auto-tuning frameworks to be easily plugged in as dialects, and supports large-scale tuning on both cloud and edge platforms. The tuning results are managed in a NoSQL database with a unified data model that fits all auto-tuning frameworks. While the best schedules managed in the database can be used to compile deep learning models to achieve high performance, the tuning logs managed in a file system can also 1) enable more comprehensive performance analysis on different platforms, and 2) help train a performance cost model with an AutoML solution.

Please visit the official documentation for setup guidelines and tutorials.

System Requirements

Python 3.6+
Amazon DynamoDB (local or AWS): DynamoDB is used for storing and maintaining the tuned schedules. You can choose either of the following:
1. Launch a local version using the JVM on your machine, and specify the endpoint URL (e.g. --db "endpoint_url: http://:8000") when invoking a tuning process.
2. Configure AWS credentials on your machine to use the AWS DynamoDB service directly. In this case, you do not have to specify any argument in the tuning configurations.
AWS S3 (optional): S3 is used to store the full tuning logs (JSON files generated by AutoTVM). If you specify --commit-log-to bucket_name and configure AWS credentials on your machine, then all complete tuning logs will be uploaded to the S3 bucket for debugging or research purposes. Because this requirement is optional, you can omit the --commit-log-to argument if you do not want to keep full tuning logs.
AWS Batch (AWS ECR): You have to set up AWS Batch compute environments, job queues, and job definitions in advance to use the Lorien AWS Batch worker for tuning. See this blog post for reference. You may also need to build and upload Lorien docker images to AWS ECR to serve as the container that runs the AWS Batch jobs.
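
The two DynamoDB options above differ only in whether an endpoint URL is supplied. The following is a minimal sketch of that distinction, not Lorien's actual option parser: the helper name dynamodb_client_kwargs is hypothetical, the host localhost is an assumed placeholder, and the idea that the --db string maps onto client keyword arguments is an assumption made for illustration.

```python
def dynamodb_client_kwargs(db_option=None):
    """Turn a '--db'-style "key: value" string into keyword arguments.

    Hypothetical helper for illustration; how Lorien parses --db is not
    described in this README.
    """
    if not db_option:
        # Option 2: no argument, rely on the AWS credentials on the machine.
        return {}
    key, sep, value = db_option.partition(":")
    if not sep or not value.strip():
        raise ValueError("expected 'key: value', got %r" % db_option)
    # Option 1: point the client at a locally launched DynamoDB.
    return {key.strip(): value.strip()}


# "localhost" is an assumed host for a locally launched DynamoDB.
local_kwargs = dynamodb_client_kwargs("endpoint_url: http://localhost:8000")
aws_kwargs = dynamodb_client_kwargs()

# With boto3 installed, these could be forwarded as (not executed here):
#   boto3.client("dynamodb", **local_kwargs)
```

Keeping the AWS case as an empty dictionary means the same client construction call can serve both options.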

Docker Images

You can directly use the pre-built Lorien docker images on Docker Hub, which include two types of images for CPU and CPU+CUDA platforms. The docker images have TVM deployed, so you can launch a tuning process in the container after cloning Lorien. The docker images are also used for Lorien CI.

Documentation

https://awslabs.github.io/lorien/

Citing Lorien

If you use Lorien in a scientific publication, please cite the following paper:

Cody Hao Yu, Xingjian Shi, Haichen Shen, Zhi Chen, Mu Li, Yida Wang, "Lorien: Efficient Deep Learning Workloads Delivery", Proceedings of the 12th ACM Symposium on Cloud Computing. 2021.

@inproceedings{yu2021lorien,
  title={Lorien: Efficient Deep Learning Workloads Delivery},
  author={Yu, Cody Hao and Shi, Xingjian and Shen, Haichen and Chen, Zhi and Li, Mu and Wang, Yida},
  booktitle={Proceedings of the 12th ACM Symposium on Cloud Computing},
  year={2021}
}

Comments

Why use -cost, which gets the record with the larger latency?

The cost in an auto_schedule measure result means latency (or cost), so the smaller the better. However, in the code below: https://github.com/awslabs/lorien/blob/bcd39132e5f0738ee6f4685676ea8628cb4cea1b/lorien/dialect/tvm_dial/auto_scheduler_dial/result.py#L81 you say that heapq is a min-heap, so the record on top would be the best. Why use -cost, which puts the record with the larger latency on top?

opened by yogurfrul 3
Is there something wrong with the to_list function?

When committing the records to the database, we push the original records in the database and the records in log_file onto the heap. After that, we get the n best results and commit them. https://github.com/awslabs/lorien/blob/bcd39132e5f0738ee6f4685676ea8628cb4cea1b/lorien/tune/result.py#L239 My understanding is that -latency is stored in the heap, so we should get the n largest values from the heap. I therefore think nlargest should be used instead of nsmallest. https://github.com/awslabs/lorien/blob/bcd39132e5f0738ee6f4685676ea8628cb4cea1b/lorien/dialect/tvm_dial/auto_scheduler_dial/result.py#L109 cc: @comaniac

opened by pansn1995 0
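
The convention discussed in the two issues above, storing negated latency in a min-heap, can be illustrated with a standalone sketch using Python's heapq module. This is a general illustration of the technique, not a copy of the linked Lorien code, and it does not settle whether that code has a bug; the function name keep_n_best and the sample latencies are made up for the example.

```python
import heapq


def keep_n_best(latencies, n):
    """Keep the n lowest latencies in a min-heap of negated values."""
    heap = []
    for latency in latencies:
        heapq.heappush(heap, -latency)
        if len(heap) > n:
            # The root is the most negative key, i.e. the highest latency,
            # so popping it evicts the worst record currently kept.
            heapq.heappop(heap)
    return heap


heap = keep_n_best([5.0, 1.0, 4.0, 2.0, 3.0], n=3)

# The top of the heap is the worst record among those kept.
worst_kept = -heap[0]

# With negated keys, the largest keys are the lowest latencies.
best_first = [-key for key in heapq.nlargest(2, heap)]
worst_first = [-key for key in heapq.nsmallest(2, heap)]
```

Under this convention, the negation makes eviction cheap because the worst kept record is always at the root, while reading the best records requires the largest keys rather than the smallest.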

Owner

Amazon Web Services - Labs
