AdaFocus: Adaptive Focus for Efficient Video Recognition (ICCV 2021)

Overview


This repo contains the official code and pre-trained models for AdaFocus.

Reference

If you find our code or paper useful for your research, please cite:

@InProceedings{Wang_2021_ICCV,
author = {Wang, Yulin and Chen, Zhaoxi and Jiang, Haojun and Song, Shiji and Han, Yizeng and Huang, Gao},
title = {Adaptive Focus for Efficient Video Recognition},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}

Introduction

In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency. We observe that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model patch localization as a sequential decision task and propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus). Specifically, a lightweight ConvNet first quickly processes the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions. The selected patches are then processed by a high-capacity network to produce the final prediction. During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel, which is efficient on modern GPU devices. In addition, we demonstrate that the proposed method can easily be extended to also exploit temporal redundancy, e.g., by dynamically skipping less valuable frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, and Something-Something V1&V2, demonstrate that our method is significantly more efficient than the competitive baselines.
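
The following is a minimal conceptual sketch of the glance-then-focus pipeline described above. It is illustrative only: the component names and interfaces (glancer, policy_rnn, focuser, classifier, crop_patch) are hypothetical placeholders, not the repository's actual API.

# Conceptual sketch of the AdaFocus pipeline (hypothetical interfaces, not the repo's code).
import torch

def crop_patch(frame, cx, cy, size):
    # Crop a size x size patch centred at (cx, cy), clamped inside the frame.
    _, h, w = frame.shape
    x0 = int(max(0, min(cx - size // 2, w - size)))
    y0 = int(max(0, min(cy - size // 2, h - size)))
    return frame[:, y0:y0 + size, x0:x0 + size]

def adafocus_forward(video, glancer, policy_rnn, focuser, classifier, patch_size=128):
    # video: (T, C, H, W). Step 1: cheap global glance over every (downsampled) frame.
    global_feats = glancer(video)                                  # assumed shape (T, D_g)
    # Step 2: the recurrent policy picks one patch centre per frame, frame by frame.
    centres, hidden = [], None
    for t in range(video.shape[0]):
        (cx, cy), hidden = policy_rnn(global_feats[t:t + 1], hidden)
        centres.append((cx, cy))
    # Step 3: the expensive local CNN only sees the selected patches. Once the
    # locations are known, this step can be batched/parallelised at inference time.
    patches = torch.stack([crop_patch(f, cx, cy, patch_size)
                           for f, (cx, cy) in zip(video, centres)])
    local_feats = focuser(patches)                                 # assumed shape (T, D_l)
    # Step 4: fuse global and local features and average the per-frame predictions.
    logits = classifier(torch.cat([global_feats, local_feats], dim=1))
    return logits.mean(dim=0)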

Results

  • ActivityNet

  • Something-Something V1&V2

  • Visualization

Requirements

  • python 3.8
  • pytorch 1.7.0
  • torchvision 0.8.0
  • hydra 1.1.0

Datasets

  1. Please get the train/test split files for each dataset from Google Drive and put them in PATH_TO_DATASET.
  2. Download the videos from the following links, or contact the corresponding authors for access. Save them to PATH_TO_DATASET/videos.
  3. Extract frames using ops/video_jpg.py; the frames will be saved to PATH_TO_DATASET/frames. Minor modifications to the file paths are needed when extracting frames from a different dataset (see the stand-alone sketch below).
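
ops/video_jpg.py is the reference implementation for frame extraction. If you need to adapt it or just want to check the expected output layout, a minimal stand-alone equivalent using ffmpeg (assumed to be on your PATH) could look like the sketch below; the *.mp4 glob and the img_%05d.jpg naming are illustrative assumptions and should be checked against ops/video_jpg.py and the data loader.

# Minimal, illustrative frame-extraction sketch (not the repository's script).
import subprocess
from pathlib import Path

def extract_frames(videos_dir, frames_dir):
    # One sub-directory of JPEG frames per video, named after the video file.
    for video in sorted(Path(videos_dir).glob("*.mp4")):
        out_dir = Path(frames_dir) / video.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # Frame naming is an assumption; match whatever the data loader expects.
        subprocess.run(["ffmpeg", "-i", str(video), str(out_dir / "img_%05d.jpg")],
                       check=True)

extract_frames("PATH_TO_DATASET/videos", "PATH_TO_DATASET/frames")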

Pre-trained Models

Please download the pretrained weights and checkpoints from Google Drive; a quick way to inspect a downloaded checkpoint is sketched after the following list.

  • globalcnn.pth.tar: pretrained weights for global CNN (MobileNet-v2).
  • localcnn.pth.tar: pretrained weights for local CNN (ResNet-50).
  • 128checkpoint.pth.tar: checkpoint of stage 1 for patch size 128x128.
  • 160checkpoint.pth.tar: checkpoint of stage 1 for patch size 160x160.
  • 192checkpoint.pth.tar: checkpoint of stage 1 for patch size 192x192.
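
A quick, generic PyTorch way to inspect what a downloaded checkpoint contains before passing it to the training commands below; the exact keys depend on how each file was saved, so treat the printed names as informational only.

# Generic checkpoint inspection (standard PyTorch, no repo-specific assumptions).
import torch

ckpt = torch.load("128checkpoint.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))   # typically model/optimizer state, epoch, metrics, etc.
else:
    print(type(ckpt))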

Training

  • Here we take training a model with patch size 128x128 on the ActivityNet dataset as an example.

  • All logs and checkpoints will be saved in the directory: ./outputs/YYYY-MM-DD/HH-MM-SS

  • Note that we store a set of default parameters in conf/default.yaml, which can be overridden through the command line. You can also use your own config files.

  • Before training, please initialize the Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models provided by PyTorch, using the following commands:

For the Global CNN:

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=true

For the Local CNN:

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=false
  • Training stage 1 (pretrained weights for the Global CNN and Local CNN are required):
CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=1 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrained_glancer=PATH_TO_CHECKPOINTS pretrained_focuser=PATH_TO_CHECKPOINTS
  • Training stage 2 (a stage-1 checkpoint is required):
CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=2 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false
  • Training stage 3 (a stage-2 checkpoint is required):
CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false

Contact

If you have any questions, feel free to contact the authors or raise an issue. Yulin Wang: [email protected].

Acknowledgement

We use the implementations of MobileNet-v2 and ResNet from the PyTorch source code. We also borrow some code for dataset preparation from AR-Net and the PPO implementation from here.

Comments
  • Independent spatial focus along temporal dimension

    Thanks for your great work, which will motivate a lot of work in this area! After checking your code, I found that it seems to assign an identical spatial sampling location to all frames in a video. Is that true? If so, where do the locations in Fig. 7, which are independent for each frame, come from?

    opened by hulianyuyy 1
  • Train on UCF101

    I use the following parameters and take MobileNet and ResNet-50 trained by TSN as the pre-trained models, but the training results are strange: from the beginning, the training accuracy reaches 100%, while the test accuracy is basically unchanged.
    CUDA_VISIBLE_DEVICES=4 python stage1.py
    dataset=ucf101
    data_dir=/data/ymy/data/
    train_stage=1
    batch_size=32
    num_segments_glancer=8
    num_segments_focuser=12
    glance_size=224
    patch_size=144
    random_patch=True
    epochs=50
    backbone_lr=0.001
    fc_lr=0.01
    lr_type=step
    dropout=0.5
    load_pretrained_focuser_fc=False
    dist_url=tcp://127.0.0.1:8816
    eval_freq=1
    start_eval=0
    print_freq=25
    workers=16
    pretrained_glancer='/AdaFocus-main/new_mobile.tar'
    pretrained_focuser='/AdaFocus-main/new_resnet.tar'

    Epoch: [5][ 0/298] Time 43.183 (43.183) Data 42.607 (42.607) Loss 1.1841e-03 (1.1841e-03) Acc@1 100.00 (100.00) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][ 25/298] Time 0.674 ( 2.839) Data 0.107 ( 2.276) Loss 1.7993e-03 (8.2321e-03) Acc@1 100.00 ( 99.76) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][ 50/298] Time 1.080 ( 2.122) Data 0.526 ( 1.560) Loss 1.7797e-02 (1.1389e-02) Acc@1 100.00 ( 99.63) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][ 75/298] Time 0.615 ( 1.833) Data 0.048 ( 1.272) Loss 2.5565e-04 (1.1153e-02) Acc@1 100.00 ( 99.63) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][100/298] Time 0.624 ( 1.724) Data 0.056 ( 1.163) Loss 1.6186e-03 (9.6181e-03) Acc@1 100.00 ( 99.72) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][125/298] Time 0.640 ( 1.601) Data 0.082 ( 1.041) Loss 6.2654e-02 (9.9088e-03) Acc@1 96.88 ( 99.68) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][150/298] Time 0.618 ( 1.596) Data 0.061 ( 1.036) Loss 1.9718e-04 (9.0484e-03) Acc@1 100.00 ( 99.71) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][175/298] Time 0.673 ( 1.526) Data 0.107 ( 0.965) Loss 1.8096e-03 (9.6376e-03) Acc@1 100.00 ( 99.70) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][200/298] Time 0.630 ( 1.523) Data 0.061 ( 0.962) Loss 2.6468e-03 (9.3167e-03) Acc@1 100.00 ( 99.72) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][225/298] Time 11.313 ( 1.514) Data 10.754 ( 0.952) Loss 9.3352e-03 (9.5301e-03) Acc@1 100.00 ( 99.72) Acc@5 100.00 (100.00) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][250/298] Time 0.643 ( 1.475) Data 0.086 ( 0.913) Loss 1.7089e-03 (1.0416e-02) Acc@1 100.00 ( 99.70) Acc@5 100.00 ( 99.99) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][275/298] Time 0.604 ( 1.472) Data 0.046 ( 0.910) Loss 1.3999e-03 (9.9850e-03) Acc@1 100.00 ( 99.73) Acc@5 100.00 ( 99.99) Focuser BackBone LR: 0.001 FC LR: 0
    Epoch: [5][297/298] Time 0.647 ( 1.410) Data 0.094 ( 0.848) Loss 1.0134e-03 (1.0606e-02) Acc@1 100.00 ( 99.72) Acc@5 100.00 ( 99.98) Focuser BackBone LR: 0.001 FC LR: 0
    Test: [ 0/119] Time 21.262 (21.262) Loss 6.7998e-01 (6.7998e-01) Acc@1 81.25 ( 81.25) Acc@5 100.00 (100.00)
    Test: [ 25/119] Time 0.381 ( 1.223) Loss 2.3228e-01 (6.1051e-01) Acc@1 93.75 ( 85.10) Acc@5 100.00 ( 97.60)
    Test: [ 50/119] Time 0.366 ( 0.818) Loss 4.2509e-01 (8.5970e-01) Acc@1 93.75 ( 81.07) Acc@5 96.88 ( 95.22)
    Test: [ 75/119] Time 0.406 ( 0.680) Loss 2.0299e-01 (1.0306e+00) Acc@1 93.75 ( 78.12) Acc@5 100.00 ( 93.09)
    Test: [100/119] Time 0.362 ( 0.609) Loss 3.9213e-01 (9.9937e-01) Acc@1 96.88 ( 78.53) Acc@5 96.88 ( 93.56)
    Test: [118/119] Time 0.122 ( 0.571) Loss 1.6555e+00 (9.4728e-01) Acc@1 28.57 ( 79.33) Acc@5 100.00 ( 94.00)
    Testing Results: Prec@1 79.329 Prec@5 93.999 Loss 0.94728

    opened by Morning-YU 0
  • Hi, the segment_indices_glancer is different from segment_indices_focuser in Something. What is the purpose?

    And the number of num_segments_glancer is different from the number of num_segments_focuser. However, in Figure 2 (Overview of AdaFocus) of the paper, the frames fed into the fG network and the fL network correspond one-to-one. Do the frames fed into fG and fL need to correspond one-to-one?

    opened by xusong-20 1
  • About other datasets

    I'm very interested in your work, but when I experimented with the UCF101 dataset, the results were not encouraging (just around 1%). Looking forward to your reply. Thanks!

    The parameters of the experiment are as follows:

    CUDA_VISIBLE_DEVICES=0,3,4,5 python stage1.py
    dataset=ucf101
    data_dir=/data/ymy/data/
    train_stage=1
    batch_size=32
    num_segments_glancer=8
    num_segments_focuser=12
    glance_size=224
    patch_size=144
    random_patch=True
    epochs=10
    backbone_lr=0.00001
    fc_lr=0.01
    lr_type=cos
    dropout=0.5
    load_pretrained_focuser_fc=False
    dist_url=tcp://127.0.0.1:8816
    eval_freq=1
    start_eval=0
    print_freq=25
    workers=16
    pretrained_glancer='/data/AdaFocus-main/mobilenetv2_segment8.pth.tar'
    pretrained_focuser='/data/AdaFocus-main/resnet50_segment12.pt.tar' # load the pretrained model

    opened by Morning-YU 0
  • About eval with SCSampler

    A wonderful work!

    But I have a problem with evaluation. I can't find the code related to the paper SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition. How can I evaluate these two models on a given dataset?

    opened by Kaeless 0
  • Cannot reproduce 75.0 mAP with 128x128 patch

    With the same settings and the same checkpoint (128s3_checkpoint.pth.tar), I cannot reproduce 75.0 mAP in my environment (I achieved 74.4). The only difference I am aware of is that I use 1-FPS frames, while the provided data list seems to be at 30 FPS. However, as far as I know, the FPS should not make such a big difference.

    opened by LawrenceXia2008 1