[Paper]
[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning
This is the official PyTorch implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning. If you find this code useful, please cite:
@inproceedings{lee2021help,
title = {HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning},
author = {Lee, Hayeon and Lee, Sewoong and Chong, Song and Hwang, Sung Ju},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2021}
}
Overview
For deployment, neural architecture search (NAS) should be hardware-aware, in order to satisfy device-specific constraints (e.g., memory usage, latency, and energy consumption) and enhance model efficiency. Existing hardware-aware NAS methods collect a large number of samples (e.g., accuracy and latency) from a target device and build either a lookup table or a latency estimator. However, such an approach is impractical in real-world scenarios, as there exist numerous devices with different hardware specifications, and collecting samples from such a large number of devices would incur a prohibitive computational and monetary cost. To overcome these limitations, we propose the Hardware-adaptive Efficient Latency Predictor (HELP), which formulates device-specific latency estimation as a meta-learning problem, so that the latency of a model for a given task on an unseen device can be estimated with only a few samples. To this end, we introduce novel hardware embeddings that represent any device as a black-box function outputting latencies, and we meta-learn the hardware-adaptive latency predictor in a device-dependent manner using these embeddings. We validate HELP's latency estimation on unseen platforms, where it achieves high estimation performance with as few as 10 measurement samples, outperforming all relevant baselines. We also validate end-to-end NAS frameworks using HELP against ones without it, and show that it largely reduces the total time cost of the base NAS method in latency-constrained settings.
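To make the black-box view concrete, below is a minimal sketch of a predictor conditioned on a hardware embedding built from the latencies of a fixed set of reference architectures on the target device. The module and tensor names are hypothetical; the actual implementation in this repository differs in detail.

```python
import torch
import torch.nn as nn

class HardwareConditionedPredictor(nn.Module):
    """Toy latency predictor conditioned on a hardware embedding.

    The device embedding is the vector of latencies that a fixed set of
    reference architectures produces on the target device, i.e., the
    device is treated purely as a black-box function of architectures.
    """
    def __init__(self, arch_feat_dim, num_reference_archs, hidden=128):
        super().__init__()
        self.device_encoder = nn.Sequential(
            nn.Linear(num_reference_archs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.predictor = nn.Sequential(
            nn.Linear(arch_feat_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, arch_feats, reference_latencies):
        # arch_feats: (B, arch_feat_dim) encodings of candidate architectures
        # reference_latencies: (num_reference_archs,) measured latencies of
        # the reference architectures on the target device
        h_dev = self.device_encoder(reference_latencies)
        h_dev = h_dev.expand(arch_feats.size(0), -1)  # share across the batch
        return self.predictor(torch.cat([arch_feats, h_dev], dim=-1))
```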
Prerequisites
- Python 3.8 (Anaconda)
- PyTorch 1.8.1
- CUDA 10.2
Hardware spec used for meta-training the proposed HELP model
- GPU: A single Nvidia GeForce RTX 2080Ti
- CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Installation
$ conda create --name help python=3.8
$ conda activate help
$ conda install pytorch==1.8.1 torchvision cudatoolkit=10.2 -c pytorch
$ pip install nas-bench-201
$ pip install tqdm
$ conda install scipy
$ conda install pyyaml
$ conda install tensorboard
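After installation, a quick sanity check that the intended PyTorch and CUDA versions are active:

```python
import torch

print(torch.__version__)          # expected: 1.8.1
print(torch.version.cuda)         # expected: 10.2
print(torch.cuda.is_available())  # True if the GPU is visible
```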
Contents
1. Experiments on NAS-Bench-201 Search Space
2. Experiments on FBNet Search Space
3. Experiments on OFA Search Space
4. Experiments on HAT Search Space
1. Reproduce Main Results on NAS-Bench-201 Search Space
We provide the code to reproduce the main results on NAS-Bench-201 search space as follows:
- Computing architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices (Table 3).
- Latency-constrained NAS Results with MetaD2A + HELP on unseen devices (Table 4).
- Meta-Training HELP model.
1.1. Data Preparation and Model Checkpoint
We include all required datasets and checkpoints in this GitHub repository.
1.2. [Meta-Test] Architecture ranking correlation
You can compute the architecture ranking correlation between latencies estimated by HELP and the true measured latencies on unseen devices in the NAS-Bench-201 search space (Table 3):
$ python main.py --search_space nasbench201 \
--mode 'meta-test' \
--num_samples 10 \
--num_meta_train_sample 900 \
--load_path [Path of Checkpoint File] \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
--meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss'
You can use the checkpoint file provided in this GitHub repository (./data/nasbench201/checkpoint/help_max_corr.pt) as follows:
$ python main.py --search_space nasbench201 \
--mode 'meta-test' \
--num_samples 10 \
--num_meta_train_sample 900 \
--load_path './data/nasbench201/checkpoint/help_max_corr.pt' \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
--meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss'
or you can use the provided script:
$ bash script/run_meta_test_nasbench201.sh [GPU_NUM]
Architecture Ranking Correlation Results (Table 3)
| Method | # of Training Samples from Target Device | Desktop GPU (Titan RTX, Batch 256) | Desktop CPU (Intel Gold 6226) | Mobile Pixel2 | Raspi4 | ASIC | FPGA | Mean |
|---|---|---|---|---|---|---|---|---|
| FLOPS | - | 0.950 | 0.826 | 0.765 | 0.846 | 0.437 | 0.900 | 0.787 |
| Layer-wise Predictor | - | 0.667 | 0.866 | - | - | - | - | 0.767 |
| BRP-NAS | 900 | 0.814 | 0.796 | 0.666 | 0.847 | 0.811 | 0.801 | 0.789 |
| BRP-NAS (+extra samples) | 3200 | 0.822 | 0.805 | 0.693 | 0.853 | 0.830 | 0.828 | 0.805 |
| HELP (Ours) | 10 | 0.987 | 0.989 | 0.802 | 0.890 | 0.940 | 0.985 | 0.932 |
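The ranking correlation above is Spearman's rank correlation between predicted and measured latencies over the same set of architectures. It can be computed with scipy as in this minimal sketch (the numbers are illustrative, not data from the paper):

```python
from scipy import stats

# Illustrative example: predicted vs. measured latencies (ms) for the
# same five architectures on one unseen device.
predicted = [5.2, 8.9, 3.1, 12.4, 7.7]
measured = [5.0, 9.3, 3.4, 11.8, 7.5]

rho, p_value = stats.spearmanr(predicted, measured)
print(f"Spearman rank correlation: {rho:.3f} (p={p_value:.3g})")
```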
1.3. [Meta-Test] Efficient Latency-constrained NAS combined with MetaD2A
You can reproduce the latency-constrained NAS results with MetaD2A + HELP on unseen devices in the NAS-Bench-201 search space (Table 4):
$ python main.py --search_space nasbench201 --mode 'nas' \
--load_path [Path of Checkpoint File] \
--sampled_arch_path 'data/nasbench201/arch_generated_by_metad2a.txt' \
--nas_target_device [Device] \
--latency_constraint [Latency Constraint]
For example, if you use the checkpoint file provided in this GitHub repository, the checkpoint path is ./data/nasbench201/checkpoint/help_max_corr.pt. If you set the target device to CPU Intel Gold 6226 (gold_6226) with batch size 256 and the target latency constraint to 11.0 ms, the command is as follows:
$ python main.py --search_space nasbench201 --mode 'nas' \
--load_path './data/nasbench201/checkpoint/help_max_corr.pt' \
--sampled_arch_path 'data/nasbench201/arch_generated_by_metad2a.txt' \
--nas_target_device gold_6226 \
--latency_constraint 11.0
or you can use the provided script:
$ bash script/run_nas_metad2a.sh [GPU_NUM]
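Conceptually, this NAS step scores the candidate architectures generated by MetaD2A with the adapted latency predictor and keeps the best candidate that satisfies the constraint. A minimal sketch with hypothetical predictor callables (the repository's selection logic may differ in detail):

```python
def latency_constrained_search(candidates, predict_latency, predict_accuracy,
                               latency_constraint_ms):
    """Return the candidate with the best predicted accuracy among those
    whose predicted latency meets the constraint (None if none does)."""
    feasible = [(arch, predict_accuracy(arch)) for arch in candidates
                if predict_latency(arch) <= latency_constraint_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda pair: pair[1])[0]
```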
Efficient Latency-constrained NAS Results (Table 4)
| Device | # of Training Samples from Target Device | Latency Constraint (ms) | Latency (ms) | Accuracy (%) | Neural Architecture Config |
|---|---|---|---|---|---|
| GPU Titan RTX (Batch 256) titan_rtx_256 | 10 | 18.0 | 17.8 | 69.7 | link |
| | | 21.0 | 18.9 | 71.5 | link |
| | | 25.0 | 24.2 | 71.8 | link |
| CPU Intel Gold 6226 gold_6226 | 10 | 8.0 | 8.0 | 67.3 | link |
| | | 11.0 | 10.7 | 70.2 | link |
| | | 14.0 | 14.3 | 72.1 | link |
| Mobile Pixel2 pixel2 | 10 | 14.0 | 13.0 | 69.7 | link |
| | | 18.0 | 19.0 | 71.8 | link |
| | | 22.0 | 25.0 | 73.2 | link |
| ASIC-Eyeriss eyeriss | 10 | 5.0 | 3.9 | 71.5 | link |
| | | 7.0 | 5.1 | 71.8 | link |
| | | 9.0 | 9.1 | 73.5 | link |
| FPGA fpga | 10 | 4.0 | 3.8 | 70.2 | link |
| | | 5.0 | 4.7 | 71.8 | link |
| | | 6.0 | 7.4 | 73.5 | link |
1.4. Meta-Training HELP model
Note that this process is performed only once for all NAS results.
$ python main.py --search_space nasbench201 \
--mode 'meta-train' \
--num_samples 10 \
--num_meta_train_sample 900 \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_256,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_256,gold_6240' \
--meta_test_devices 'titan_rtx_256,gold_6226,fpga,pixel2,raspi4,eyeriss' \
--exp_name [EXP_NAME] \
--seed 3  # e.g., 1, 2, 3
or you can use the provided script:
$ bash script/run_meta_training_nasbench201.sh [GPU_NUM]
The results (checkpoint file, log file, etc.) are saved in
./results/nasbench201/[EXP_NAME]
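Schematically, meta-training treats each device as a task: the predictor is adapted on a few (architecture, latency) support pairs from that device, and the loss of the adapted predictor on held-out query pairs drives the meta-update. Below is a minimal MAML-style sketch of one such episode; functional_forward is a hypothetical helper that runs the model with an explicit parameter dictionary, and the actual HELP objective (hardware embeddings, meta-learned inner learning rates) differs:

```python
import torch
import torch.nn.functional as F

def meta_train_step(model, meta_opt, device_tasks, inner_lr=0.01, inner_steps=2):
    """One episodic meta-update over a batch of device tasks.

    Each task is (support_x, support_y, query_x, query_y), where x are
    architecture encodings and y are latencies measured on that device.
    Assumes model.functional_forward(x, weights) runs the network with
    the given parameter dict (hypothetical helper).
    """
    meta_opt.zero_grad()
    for support_x, support_y, query_x, query_y in device_tasks:
        # Inner loop: adapt the parameters on the support set.
        fast_weights = dict(model.named_parameters())
        for _ in range(inner_steps):
            loss = F.mse_loss(model.functional_forward(support_x, fast_weights),
                              support_y)
            grads = torch.autograd.grad(loss, list(fast_weights.values()),
                                        create_graph=True)
            fast_weights = {n: w - inner_lr * g
                            for (n, w), g in zip(fast_weights.items(), grads)}
        # Outer loop: the adapted predictor's query loss backpropagates
        # through the inner updates into the meta-parameters.
        query_loss = F.mse_loss(model.functional_forward(query_x, fast_weights),
                                query_y)
        query_loss.backward()
    meta_opt.step()
```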
2. Reproduce Main Results on FBNet Search Space
We provide the code to reproduce the main results on FBNet search space as follows:
- Computing architecture ranking correlation between latencies estimated by HELP and true measured latencies on unseen devices (Table 2).
- Meta-Training HELP model.
2.1. Data Preparation and Model Checkpoint
We include all required datasets and checkpoints in this GitHub repository.
2.2. [Meta-Test] Architecture ranking correlation
You can compute the architecture ranking correlation between latencies estimated by HELP and the true measured latencies on unseen devices in the FBNet search space (Table 2):
$ python main.py --search_space fbnet \
--mode 'meta-test' \
--num_samples 10 \
--num_episodes 4000 \
--num_meta_train_sample 4000 \
--load_path './data/fbnet/checkpoint/help_max_corr.pt' \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_64,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_64,gold_6240' \
--meta_test_devices 'fpga,raspi4,eyeriss'
or you can use the provided script:
$ bash script/run_meta_test_fbnet.sh [GPU_NUM]
Architecture Ranking Correlation Results (Table 2)
| Method | Raspi4 | ASIC | FPGA | Mean |
|---|---|---|---|---|
| MAML | 0.718 | 0.763 | 0.727 | 0.736 |
| Meta-SGD | 0.821 | 0.822 | 0.776 | 0.806 |
| HELP (Ours) | 0.887 | 0.943 | 0.892 | 0.910 |
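The MAML and Meta-SGD baselines above differ only in the inner-loop update rule: MAML uses one fixed scalar inner learning rate, whereas Meta-SGD meta-learns a learning rate for every parameter. A minimal sketch of the two update rules (the weights, grads, and alphas containers are hypothetical, not the repository's code):

```python
def maml_inner_step(weights, grads, alpha=0.01):
    # MAML: one fixed scalar inner learning rate shared by all parameters.
    return {n: w - alpha * g for (n, w), g in zip(weights.items(), grads)}

def meta_sgd_inner_step(weights, grads, alphas):
    # Meta-SGD: a meta-learned learning rate per parameter (alphas[n] has
    # the same shape as weights[n]), so the update direction is learned too.
    return {n: w - alphas[n] * g for (n, w), g in zip(weights.items(), grads)}
```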
2.3. Meta-Training HELP model
Note that this process is performed only once for all results.
$ python main.py --search_space fbnet \
--mode 'meta-train' \
--num_samples 10 \
--num_episodes 4000 \
--num_meta_train_sample 4000 \
--exp_name [EXP_NAME] \
--meta_train_devices '1080ti_1,1080ti_32,1080ti_64,silver_4114,silver_4210r,samsung_a50,pixel3,essential_ph_1,samsung_s7' \
--meta_valid_devices 'titanx_1,titanx_32,titanx_64,gold_6240' \
--meta_test_devices 'fpga,raspi4,eyeriss' \
--seed 3  # e.g., 1, 2, 3
or you can use the provided script:
$ bash script/run_meta_training_fbnet.sh [GPU_NUM]
The results (checkpoint file, log file, etc.) are saved in
./results/fbnet/[EXP_NAME]
3. Reproduce Main Results on OFA Search Space
We provide the code to reproduce the main results on OFA search space as follows:
- Latency-constrained NAS Results with accuracy predictor of OFA + HELP on unseen devices (Table 5).
- Validating the obtained neural architectures on ImageNet-1K.
- Meta-Training HELP model.
3.1. Data Preparation and Model Checkpoint
We include all required datasets and checkpoints in this GitHub repository, except ImageNet-1K. To validate an obtained neural architecture on ImageNet-1K, you should download ImageNet-1K (2012 version) yourself.
3.2. [Meta-Test] Efficient Latency-constrained NAS combined with accuracy predictor of OFA
You can reproduce the latency-constrained NAS results with the accuracy predictor of OFA + HELP on unseen devices in the OFA search space (Table 5):
$ python main.py \
--search_space ofa \
--mode nas \
--num_samples 10 \
--seed 3 \
--num_meta_train_sample 4000 \
--load_path './data/ofa/checkpoint/help_max_corr.pt' \
--nas_target_device [DEVICE_NAME] \
--latency_constraint [LATENCY_CONSTRAINT] \
--exp_name 'nas' \
--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
--meta_test_devices 'titan_rtx_64'
For example:
$ python main.py \
--search_space ofa \
--mode nas \
--num_samples 10 \
--seed 3 \
--num_meta_train_sample 4000 \
--load_path './data/ofa/checkpoint/help_max_corr.pt' \
--nas_target_device titan_rtx_64 \
--latency_constraint 20 \
--exp_name 'nas' \
--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
--meta_test_devices 'titan_rtx_64'
or you can use the provided script:
$ bash script/run_nas_ofa.sh [GPU_NUM]
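At meta-test time this search needs no training and no on-device measurement: OFA's pretrained accuracy predictor scores candidates, while the adapted HELP predictor screens them against the latency constraint. A toy random-search sketch under those assumptions; sample_arch, predict_accuracy, and predict_latency are hypothetical callables, and the repository's search strategy may differ:

```python
import random

def ofa_help_search(sample_arch, predict_accuracy, predict_latency,
                    latency_constraint_ms, num_trials=1000, seed=3):
    """Random search over a space with two cheap surrogate predictors."""
    rng = random.Random(seed)
    best_arch, best_acc = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_arch(rng)
        if predict_latency(arch) > latency_constraint_ms:
            continue  # infeasible under the target latency constraint
        acc = predict_accuracy(arch)
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc
```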
Efficient Latency-constrained NAS Results (Table 5)
| Device | # of Samples from Target Device | Latency Constraint (ms) | Latency (ms) | Accuracy (%) | Architecture Config |
|---|---|---|---|---|---|
| GPU Titan RTX (Batch 64) | 10 | 20 | 20.3 | 76.0 | link |
| | | 23 | 23.1 | 76.8 | link |
| | | 28 | 28.6 | 77.9 | link |
| CPU Intel Gold 6226 | 20 | 170 | 147 | 77.6 | link |
| | | 190 | 171 | 78.1 | link |
| Jetson AGX Xavier | 10 | 65 | 67.4 | 75.9 | link |
| | | 70 | 76.4 | 76.4 | link |
3.3. Validating obtained neural architecture on ImageNet-1K
$ python validate_imagenet.py \
--config_path [Path of neural architecture config file] \
--imagenet_save_path [Path of ImageNet-1K]
For example:
$ python validate_imagenet.py \
--config_path 'data/ofa/architecture_config/gpu_titan_rtx_64/latency_28.6ms_accuracy_77.9.json' \
--imagenet_save_path './ILSVRC2012'
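For reference, the evaluation amounts to a standard top-1 accuracy loop over the ImageNet validation split. Below is a generic torchvision-based sketch (assuming the usual ImageFolder layout and an available GPU), not the script itself:

```python
import torch
from torchvision import datasets, transforms

@torch.no_grad()
def validate_top1(model, imagenet_root, batch_size=128, image_size=224):
    """Generic ImageNet-1K top-1 accuracy loop (ImageFolder layout)."""
    preprocess = transforms.Compose([
        transforms.Resize(int(image_size * 256 / 224)),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    val_set = datasets.ImageFolder(f"{imagenet_root}/val", preprocess)
    loader = torch.utils.data.DataLoader(val_set, batch_size=batch_size,
                                         num_workers=4, pin_memory=True)
    model.eval().cuda()
    correct = total = 0
    for images, labels in loader:
        logits = model(images.cuda(non_blocking=True))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total
```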
3.4. Meta-training HELP model
Note that this process is performed only once for all results.
$ python main.py --search_space ofa \
--mode 'meta-train' \
--num_samples 10 \
--num_meta_train_sample 4000 \
--exp_name [EXP_NAME] \
--meta_train_devices '2080ti_1,2080ti_32,2080ti_64,titan_xp_1,titan_xp_32,titan_xp_64,v100_1,v100_32,v100_64' \
--meta_valid_devices 'titan_rtx_1,titan_rtx_32' \
--meta_test_devices 'titan_rtx_64' \
--seed 3  # e.g., 1, 2, 3
or you can use the provided script:
$ bash script/run_meta_training_ofa.sh [GPU_NUM]
4. Main Results on HAT Search Space
We provide the neural architecture configurations to reproduce the machine translation results (WMT'14 En-De task) on the HAT search space.
Efficient Latency-constrained NAS Results
| Task | Device | # of Samples from Target Device | Latency (ms) | BLEU Score | Architecture Config |
|---|---|---|---|---|---|
| WMT'14 En-De | GPU NVIDIA Titan RTX | 10 | 74.0 | 27.19 | link |
| WMT'14 En-De | GPU NVIDIA Titan RTX | 10 | 106.5 | 27.44 | link |
| WMT'14 En-De | CPU Intel Xeon Gold 6240 | 10 | 159.6 | 27.20 | link |
| WMT'14 En-De | CPU Intel Xeon Gold 6240 | 10 | 343.2 | 27.52 | link |
You can evaluate these models by computing their BLEU scores and latencies.
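For instance, BLEU on detokenized system outputs can be computed with the sacrebleu package (a generic example; the HAT evaluation scripts may use their own pipeline):

```python
import sacrebleu

# Hypothetical system outputs and references for WMT'14 En-De.
hypotheses = ["Das Haus ist klein .", "Er liest ein Buch ."]
references = [["Das Haus ist klein .", "Er liest ein Buch ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```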
Reference
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML 2017)
Meta-SGD: Learning to Learn Quickly for Few-Shot Learning (arXiv 2017)
Once-for-All: Train One Network and Specialize it for Efficient Deployment (ICLR 2020)
NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search (ICLR 2020)
BRP-NAS: Prediction-based NAS using GCNs (NeurIPS 2020)
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing (ACL 2020)
Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets (ICLR 2021)
HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark (ICLR 2021)