AdaFocus: Adaptive Focus for Efficient Video Recognition (ICCV 2021)

Overview


This repo contains the official code and pre-trained models for AdaFocus.

Reference

If you find our code or paper useful for your research, please cite:

@InProceedings{Wang_2021_ICCV,
author = {Wang, Yulin and Chen, Zhaoxi and Jiang, Haojun and Song, Shiji and Han, Yizeng and Huang, Gao},
title = {Adaptive Focus for Efficient Video Recognition},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}

Introduction

In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency. We observe that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model patch localization as a sequential decision task and propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus). Specifically, a lightweight ConvNet first quickly processes the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions. The selected patches are then processed by a high-capacity network to produce the final prediction. During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel, which is efficient on modern GPU devices. In addition, we demonstrate that the proposed method can easily be extended to also exploit temporal redundancy, e.g., by dynamically skipping less valuable frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, and Something-Something V1&V2, demonstrate that our method is significantly more efficient than the competitive baselines.
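
The following is a minimal conceptual sketch of the glance-then-focus pipeline described above. It is illustrative only: the component names and interfaces (glancer, policy_rnn, focuser, classifier, crop_patch) are hypothetical placeholders, not the repository's actual API.

# Conceptual sketch of the AdaFocus pipeline (hypothetical interfaces, not the repo's code).
import torch

def crop_patch(frame, cx, cy, size):
    # Crop a size x size patch centred at (cx, cy), clamped inside the frame.
    _, h, w = frame.shape
    x0 = int(max(0, min(cx - size // 2, w - size)))
    y0 = int(max(0, min(cy - size // 2, h - size)))
    return frame[:, y0:y0 + size, x0:x0 + size]

def adafocus_forward(video, glancer, policy_rnn, focuser, classifier, patch_size=128):
    # video: (T, C, H, W). Step 1: cheap global glance over every (downsampled) frame.
    global_feats = glancer(video)                                  # assumed shape (T, D_g)
    # Step 2: the recurrent policy picks one patch centre per frame, frame by frame.
    centres, hidden = [], None
    for t in range(video.shape[0]):
        (cx, cy), hidden = policy_rnn(global_feats[t:t + 1], hidden)
        centres.append((cx, cy))
    # Step 3: the expensive local CNN only sees the selected patches. Once the
    # locations are known, this step can be batched/parallelised at inference time.
    patches = torch.stack([crop_patch(f, cx, cy, patch_size)
                           for f, (cx, cy) in zip(video, centres)])
    local_feats = focuser(patches)                                 # assumed shape (T, D_l)
    # Step 4: fuse global and local features and average the per-frame predictions.
    logits = classifier(torch.cat([global_feats, local_feats], dim=1))
    return logits.mean(dim=0)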

Results

  • ActivityNet

  • Something-Something V1&V2

  • Visualization

Requirements

  • python 3.8
  • pytorch 1.7.0
  • torchvision 0.8.0
  • hydra 1.1.0

Datasets

  1. Please get the train/test split files for each dataset from Google Drive and put them in PATH_TO_DATASET.
  2. Download the videos from the following links, or contact the corresponding authors for access. Save them to PATH_TO_DATASET/videos.
  3. Extract frames using ops/video_jpg.py; the frames will be saved to PATH_TO_DATASET/frames. Minor modifications to the file paths are needed when extracting frames from a different dataset (see the stand-alone sketch below).
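
ops/video_jpg.py is the reference implementation for frame extraction. If you need to adapt it or just want to check the expected output layout, a minimal stand-alone equivalent using ffmpeg (assumed to be on your PATH) could look like the sketch below; the *.mp4 glob and the img_%05d.jpg naming are illustrative assumptions and should be checked against ops/video_jpg.py and the data loader.

# Minimal, illustrative frame-extraction sketch (not the repository's script).
import subprocess
from pathlib import Path

def extract_frames(videos_dir, frames_dir):
    # One sub-directory of JPEG frames per video, named after the video file.
    for video in sorted(Path(videos_dir).glob("*.mp4")):
        out_dir = Path(frames_dir) / video.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # Frame naming is an assumption; match whatever the data loader expects.
        subprocess.run(["ffmpeg", "-i", str(video), str(out_dir / "img_%05d.jpg")],
                       check=True)

extract_frames("PATH_TO_DATASET/videos", "PATH_TO_DATASET/frames")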

Pre-trained Models

Please download the pretrained weights and checkpoints from Google Drive; a quick way to inspect a downloaded checkpoint is sketched after the following list.

  • globalcnn.pth.tar: pretrained weights for global CNN (MobileNet-v2).
  • localcnn.pth.tar: pretrained weights for local CNN (ResNet-50).
  • 128checkpoint.pth.tar: checkpoint of stage 1 for patch size 128x128.
  • 160checkpoint.pth.tar: checkpoint of stage 1 for patch size 160x160.
  • 192checkpoint.pth.tar: checkpoint of stage 1 for patch size 192x192.
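
A quick, generic PyTorch way to inspect what a downloaded checkpoint contains before passing it to the training commands below; the exact keys depend on how each file was saved, so treat the printed names as informational only.

# Generic checkpoint inspection (standard PyTorch, no repo-specific assumptions).
import torch

ckpt = torch.load("128checkpoint.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))   # typically model/optimizer state, epoch, metrics, etc.
else:
    print(type(ckpt))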

Training

  • Here we take training a model with patch size 128x128 on the ActivityNet dataset as an example.

  • All logs and checkpoints will be saved in the directory: ./outputs/YYYY-MM-DD/HH-MM-SS

  • Note that we store a set of default parameters in conf/default.yaml, which can be overridden through the command line. You can also use your own config files.

  • Before training, please initialize the Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models provided by PyTorch, using the following commands:

For the Global CNN:

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=true

For the Local CNN:

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=false
  • Training stage 1 (pretrained weights for the Global CNN and Local CNN are required):
CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=1 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrained_glancer=PATH_TO_CHECKPOINTS pretrained_focuser=PATH_TO_CHECKPOINTS
  • Training stage 2 (a stage-1 checkpoint is required):
CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=2 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false
  • Training stage 3 (a stage-2 checkpoint is required):
CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false

Contact

If you have any questions, feel free to contact the authors or raise an issue. Yulin Wang: [email protected].

Acknowledgement

We use the implementations of MobileNet-v2 and ResNet from the PyTorch source code. We also borrow some code for dataset preparation from AR-Net and the PPO implementation from here.

Comments
  • Independent spatial focus along temporal dimension

    Thanks for your great work, which will motivate a lot of work in this area! After checking your code, I found that it seems to assign an identical spatial sampling location to all frames in a video. Is that true? If so, where do the locations in Fig. 7, which are independent for each frame, come from?

    opened by hulianyuyy 1
  • Train on UCF101

    I use the following parameters and take MobileNet and ResNet-50 trained by TSN as the pre-trained models, but the training results are strange: from the beginning, the training accuracy reaches 100%, while the test accuracy is basically unchanged.
    CUDA_VISIBLE_DEVICES=4 python stage1.py
    dataset=ucf101
    data_dir=/data/ymy/data/
    train_stage=1
    batch_size=32
    num_segments_glancer=8
    num_segments_focuser=12
    glance_size=224
    patch_size=144
    random_patch=True
    epochs=50
    backbone_lr=0.001
    fc_lr=0.01
    lr_type=step
    dropout=0.5
    load_pretrained_focuser_fc=False
    dist_url=tcp://127.0.0.1:8816
    eval_freq=1
    start_eval=0
    print_freq=25
    workers=16
    pretrained_glancer='/AdaFocus-main/new_mobile.tar'
    pretrained_focuser='/AdaFocus-main/new_resnet.tar'

    Epoch: [5][ 0/298] Time 43.183 (43.183) Data 42.607 (42.607) Loss 1.1841e-03 (1.1841e-03) Acc@1 100.00 (100.00) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][ 25/298] Time 0.674 ( 2.839) Data 0.107 ( 2.276) Loss 1.7993e-03 (8.2321e-03) Acc@1 100.00 ( 99.76) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][ 50/298] Time 1.080 ( 2.122) Data 0.526 ( 1.560) Loss 1.7797e-02 (1.1389e-02) Acc@1 100.00 ( 99.63) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][ 75/298] Time 0.615 ( 1.833) Data 0.048 ( 1.272) Loss 2.5565e-04 (1.1153e-02) Acc@1 100.00 ( 99.63) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][100/298] Time 0.624 ( 1.724) Data 0.056 ( 1.163) Loss 1.6186e-03 (9.6181e-03) Acc@1 100.00 ( 99.72) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][125/298] Time 0.640 ( 1.601) Data 0.082 ( 1.041) Loss 6.2654e-02 (9.9088e-03) Acc@1 96.88 ( 99.68) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][150/298] Time 0.618 ( 1.596) Data 0.061 ( 1.036) Loss 1.9718e-04 (9.0484e-03) Acc@1 100.00 ( 99.71) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][175/298] Time 0.673 ( 1.526) Data 0.107 ( 0.965) Loss 1.8096e-03 (9.6376e-03) Acc@1 100.00 ( 99.70) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][200/298] Time 0.630 ( 1.523) Data 0.061 ( 0.962) Loss 2.6468e-03 (9.3167e-03) Acc@1 100.00 ( 99.72) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][225/298] Time 11.313 ( 1.514) Data 10.754 ( 0.952) Loss 9.3352e-03 (9.5301e-03) Acc@1 100.00 ( 99.72) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][250/298] Time 0.643 ( 1.475) Data 0.086 ( 0.913) Loss 1.7089e-03 (1.0416e-02) Acc@1 100.00 ( 99.70) Acc@5 100.00 ( 99.99) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][275/298] Time 0.604 ( 1.472) Data 0.046 ( 0.910) Loss 1.3999e-03 (9.9850e-03) Acc@1 100.00 ( 99.73) Acc@5 100.00 ( 99.99) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][297/298] Time 0.647 ( 1.410) Data 0.094 ( 0.848) Loss 1.0134e-03 (1.0606e-02) Acc@1 100.00 ( 99.72) Acc@5 100.00 ( 99.98) Focuser BackBone LR: 0.001 FC LR: 0
    Test: [ 0/119] Time 21.262 (21.262) Loss 6.7998e-01 (6.7998e-01) Acc@1 81.25 ( 81.25) Acc@5 100.00 (100.00)
    Test: [ 25/119] Time 0.381 ( 1.223) Loss 2.3228e-01 (6.1051e-01) Acc@1 93.75 ( 85.10) Acc@5 100.00 ( 97.60)
    Test: [ 50/119] Time 0.366 ( 0.818) Loss 4.2509e-01 (8.5970e-01) Acc@1 93.75 ( 81.07) Acc@5 96.88 ( 95.22)
    Test: [ 75/119] Time 0.406 ( 0.680) Loss 2.0299e-01 (1.0306e+00) Acc@1 93.75 ( 78.12) Acc@5 100.00 ( 93.09)
    Test: [100/119] Time 0.362 ( 0.609) Loss 3.9213e-01 (9.9937e-01) Acc@1 96.88 ( 78.53) Acc@5 96.88 ( 93.56)
    Test: [118/119] Time 0.122 ( 0.571) Loss 1.6555e+00 (9.4728e-01) Acc@1 28.57 ( 79.33) Acc@5 100.00 ( 94.00)
    Testing Results: Prec@1 79.329 Prec@5 93.999 Loss 0.94728

    opened by Morning-YU 0
  • Hi, the segment_indices_glancer is different from segment_indices_focuser in Something. What is the purpose?

    And the number of num_segments_glancer is different from the number of num_segments_focuser. However, in Figure 2 (Overview of AdaFocus) of the paper, the frames fed into the fG network and the fL network correspond one-to-one. Do the frames fed into fG and fL need to correspond one-to-one?

    opened by xusong-20 1
  • About other datasets

    I'm very interested in your work, but when I experimented with the UCF101 dataset, the results were not encouraging (just around 1%). Looking forward to your reply. Thanks!

    The parameters of the experiment are as follows:

    CUDA_VISIBLE_DEVICES=0,3,4,5 python stage1.py
    dataset=ucf101
    data_dir=/data/ymy/data/
    train_stage=1
    batch_size=32
    num_segments_glancer=8
    num_segments_focuser=12
    glance_size=224
    patch_size=144
    random_patch=True
    epochs=10
    backbone_lr=0.00001
    fc_lr=0.01
    lr_type=cos
    dropout=0.5
    load_pretrained_focuser_fc=False
    dist_url=tcp://127.0.0.1:8816
    eval_freq=1
    start_eval=0
    print_freq=25
    workers=16
    pretrained_glancer='/data/AdaFocus-main/mobilenetv2_segment8.pth.tar'
    pretrained_focuser='/data/AdaFocus-main/resnet50_segment12.pt.tar' # load the pretrained model

    opened by Morning-YU 0
  • About eval with SCSampler

    A wonderful work!

    But I have a problem with evaluation. I can't find the code related to the paper SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition. How can I evaluate these two models on a given dataset?

    opened by Kaeless 0
  • Cannot reproduce 75.0 mAP with 128x128 patch

    With the same settings and the same checkpoint (128s3_checkpoint.pth.tar), I cannot reproduce 75.0 mAP in my environment (I achieved 74.4). The only difference I am aware of is that I use 1-FPS frames, while the provided data list seems to be at 30 FPS. However, as far as I know, the FPS should not make such a big difference.

    opened by LawrenceXia2008 1