Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

Overview

Focal Transformer


This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transformers", by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.

Introduction

(Figure: focal-transformer teaser)

Our Focal Transformer introduces a new self-attention mechanism, called focal self-attention, for vision transformers. In this mechanism, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity, and can therefore capture both short- and long-range visual dependencies efficiently and effectively.
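
The snippet below is a minimal, self-contained sketch of this idea, not the official implementation: a single attention head where each local window of queries attends to its own fine-grained tokens plus one average-pooled coarse summary of the whole map. All shapes, names, and the single pooling level are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def focal_attention_sketch(x, window_size=7, pool_size=4):
    """Toy single-head focal attention on a (B, H, W, C) feature map.

    Each window of queries attends to (a) the fine-grained tokens inside its
    own window and (b) a coarse, average-pooled summary of the whole map.
    Assumes H and W are divisible by window_size and pool_size; the real
    model additionally handles padding and multiple focal levels.
    """
    B, H, W, C = x.shape
    ws = window_size
    n_windows = (H // ws) * (W // ws)

    # Fine-grained keys/values: partition the map into non-overlapping windows.
    fine = x.reshape(B, H // ws, ws, W // ws, ws, C)
    fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (B*nW, ws*ws, C)

    # Coarse keys/values: pool the whole map and share it with every window.
    coarse = F.avg_pool2d(x.permute(0, 3, 1, 2), pool_size)          # (B, C, H/p, W/p)
    coarse = coarse.flatten(2).transpose(1, 2)                       # (B, (H/p)*(W/p), C)
    coarse = coarse.unsqueeze(1).expand(-1, n_windows, -1, -1).reshape(-1, coarse.shape[1], C)

    # Queries are the fine tokens; keys/values mix fine local and coarse global tokens.
    q = fine
    kv = torch.cat([fine, coarse], dim=1)
    attn = torch.softmax((q @ kv.transpose(-2, -1)) * C ** -0.5, dim=-1)
    out = attn @ kv                                                  # (B*nW, ws*ws, C)
    return out.reshape(B, H // ws, W // ws, ws, ws, C)

# Tiny smoke test on random features: 28x28 map, 7x7 windows, 4x4 pooling.
print(focal_attention_sketch(torch.randn(2, 28, 28, 16)).shape)
# -> torch.Size([2, 4, 4, 7, 7, 16])
```

The official blocks in classification/focal_transformer.py additionally use multiple focal levels and sub-window pooling; this sketch only illustrates the fine-local plus coarse-global attention pattern.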

With our Focal Transformers, we achieve superior performance over state-of-the-art vision Transformers on a range of public benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M parameters and a larger size of 89.8M parameters achieve 83.6 and 84.0 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art across 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAP and 50.9/51.3 mask mAP on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation.

Benchmarking

Image Classification on ImageNet-1K

| Model | Pretrain | Use Conv | Resolution | acc@1 | acc@5 | #params | FLOPs | Checkpoint | Config |
|-------|----------|----------|------------|-------|-------|---------|-------|------------|--------|
| Focal-T | IN-1K | No | 224 | 82.2 | 95.9 | 28.9M | 4.9G | download | yaml |
| Focal-T | IN-1K | Yes | 224 | 82.7 | 96.1 | 30.8M | 4.9G | download | yaml |
| Focal-S | IN-1K | No | 224 | 83.6 | 96.2 | 51.1M | 9.4G | download | yaml |
| Focal-B | IN-1K | No | 224 | 84.0 | 96.5 | 89.8M | 16.4G | download | yaml |
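
The released checkpoints and yaml configs follow the conventions of the Swin-Transformer codebase this repo builds on (see Acknowledgement). Below is a minimal, hedged sketch for inspecting a downloaded classification checkpoint; the local file name and the "model" key are assumptions, not guarantees about the released files.

```python
import torch

# Hypothetical local path to a checkpoint downloaded from the table above.
ckpt = torch.load("focal_tiny_patch4_window7_224.pth", map_location="cpu")

# Swin-style training checkpoints typically nest the weights under a "model" key;
# fall back to the raw object if this file is stored differently.
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(f"{len(state_dict)} parameter tensors")
```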

Object Detection and Instance Segmentation on COCO

Mask R-CNN

| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | mask mAP |
|----------|----------|---------|---------|-------|---------|----------|
| Focal-T | ImageNet-1K | 1x | 49M | 291G | 44.8 | 41.0 |
| Focal-T | ImageNet-1K | 3x | 49M | 291G | 47.2 | 42.7 |
| Focal-S | ImageNet-1K | 1x | 71M | 401G | 47.4 | 42.8 |
| Focal-S | ImageNet-1K | 3x | 71M | 401G | 48.8 | 43.8 |
| Focal-B | ImageNet-1K | 1x | 110M | 533G | 47.8 | 43.2 |
| Focal-B | ImageNet-1K | 3x | 110M | 533G | 49.0 | 43.7 |

RetinaNet

| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP |
|----------|----------|---------|---------|-------|---------|
| Focal-T | ImageNet-1K | 1x | 39M | 265G | 43.7 |
| Focal-T | ImageNet-1K | 3x | 39M | 265G | 45.5 |
| Focal-S | ImageNet-1K | 1x | 62M | 367G | 45.6 |
| Focal-S | ImageNet-1K | 3x | 62M | 367G | 47.3 |
| Focal-B | ImageNet-1K | 1x | 101M | 514G | 46.3 |
| Focal-B | ImageNet-1K | 3x | 101M | 514G | 46.9 |

Other detection methods

| Backbone | Pretrain | Method | Lr Schd | #params | FLOPs | box mAP |
|----------|----------|--------|---------|---------|-------|---------|
| Focal-T | ImageNet-1K | Cascade Mask R-CNN | 3x | 87M | 770G | 51.5 |
| Focal-T | ImageNet-1K | ATSS | 3x | 37M | 239G | 49.5 |
| Focal-T | ImageNet-1K | RepPointsV2 | 3x | 45M | 491G | 51.2 |
| Focal-T | ImageNet-1K | Sparse R-CNN | 3x | 111M | 196G | 49.0 |

Semantic Segmentation on ADE20K

| Backbone | Pretrain | Method | Resolution | Iters | #params | FLOPs | mIoU | mIoU (MS) |
|----------|----------|--------|------------|-------|---------|-------|------|-----------|
| Focal-T | ImageNet-1K | UPerNet | 512x512 | 160k | 62M | 998G | 45.8 | 47.0 |
| Focal-S | ImageNet-1K | UPerNet | 512x512 | 160k | 85M | 1130G | 48.0 | 50.0 |
| Focal-B | ImageNet-1K | UPerNet | 512x512 | 160k | 126M | 1354G | 49.0 | 50.5 |
| Focal-L | ImageNet-22K | UPerNet | 640x640 | 160k | 240M | 3376G | 54.0 | 55.4 |

Getting Started

Citation

If you find this repo useful for your project, please consider citing it with the following bib entry:

@misc{yang2021focal,
    title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, 
    author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},
    year={2021},
    eprint={2107.00641},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

Our codebase is built on top of Swin-Transformer. We thank the authors for their nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Comments
  • Link expires?

    The link https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-base-useconv-is224-ws7.pth appears to have expired. Can you reupload this model?

    opened by my462 4
  • Relationship between focal window size and focal region size

    I was confused by the relationship between the focal window size and the focal region size. Can you explain it more clearly? Take stage 1, level 0 as an example: sw = 1 and sr = 13, so sw*sr does not evenly divide the output size 56. I could not understand why sr is 13. Thanks a lot if you could help me.

    opened by liyiersan 3
  • How to get q, k, v?

    In your code, I do not understand how q, k, v are obtained from x and x_pooled. I have been confused by the roll operation on k_windows and the unfold operation on k_pooled_k for several days. Take stage 1 as an example: for level 0, since sw is 1, I think sr should be window_size//sw, that is 7; and for level 1, since sw is 7, I think sr should be output_size//sw, that is 8. Therefore the number of keys should be 7*7 + 8*8 = 113. But in your paper, you set sr to 13 at level 0 and sr to 7 at level 1. Why 13 and 7? And in your code, the number of keys is 7*7 + 4*7*7 - 4*(7-3)*(7-3) + 7*7 = 230, which is different from 7*7 + 13*13 = 218. As a suggestion, the window attention should be written more clearly, and more comments are needed in the code. Thanks a lot. If there is something wrong with what I said, please forgive me.

    opened by liyiersan 2
  • num_heads value

    Hi, I'm sorry to bother you. I would like to know whether the num_heads values in each of the four stages in your code are the same, and whether the num_heads value in each stage is fixed or related to something else.

    opened by kimjisoo12 1
  • about pool_method

    Hi, I have a question about sub-window pooling: if I want to follow the sub-window pooling method introduced in the paper, which pooling method should I select?

    Thank you very much.

    opened by kimjisoo12 1
  • Focal Transformer on 1-D Data

    Great work!

    My question is: can this method be run on 1-D features? The pooling operations done before flattening assume a 2-D representation.

    opened by sauradip 1
  • Link error

    https://projects4jw.blob.core.windows.net/model/focal-transformer/imagenet1k/focal-tiny-is224-ws7.pth returns ResourceNotFound: The specified resource does not exist. RequestId:933ca8ee-601e-0028-4af4-f455bc000000 Time:2022-11-10T11:05:52.4511547Z

    opened by MONSTER012 0
  • Confusion about window size at different focal levels

    Thanks for your great work! But I don't understand the meaning of the following code (class FocalTransformerBlock in ./classification/focal_transformer.py) at Link

                for k in range(self.focal_level-1):
                    window_size_glo = math.floor(self.window_size_glo / (2 ** k))
                    pooled_h = math.ceil(H / self.window_size) * (2 ** k)
                    pooled_w = math.ceil(W / self.window_size) * (2 ** k)
                    H_pool = pooled_h * window_size_glo
                    W_pool = pooled_w * window_size_glo

    I guess the purpose of this is to make the H and W of x_level_k a multiple of the window size and to facilitate the pooling operation. But how does this actually work?

    • Why calculate window_size_glo?
    • What is the meaning of pooled_h? (math.ceil(H / self.window_size) does not change over iterations.)
    • Why not change window_size over iterations directly?

    Could you please explain this to me in detail? Thanks in advance.

    opened by where2go947 1
  • Some questions about focal_transformer_v2.py

    Hello, should the range_h and range_w on lines 176 and 177 of the file (https://github.com/microsoft/Focal-Transformer/blob/main/classification/focal_transformer_v2.py) be different at different focal levels? The same-sized range_h and range_w are used at all focal levels in the code, which puzzles me.

    opened by DQiaole 0
  • How to load the pretrained classification model when training on segmentation?

    The positional embedding is related to the input size, so is it possible to load the models pretrained on classification? By the way, it seems that torch.unfold is faster than torch.roll on GPUs.

    opened by liyiersan 0
  • Some problems with reproducing the focal block

    Hello, first of all, thank you for providing the Focal Transformer module.

    I have some questions:

    1. The image you process is 224x224 with window size = 7. Is it reasonable for me to change the window size to 8 so that it divides evenly when my input is 512x512?

    2. Since you didn't release the segmentation code, it seems that you set the focal level to 2 in your demo. Should I try level values of 1, 2, and 3 to find the best fit?

    3. As for num_heads: I see that after patch embedding the number of channels becomes 96, and in focal attention num_heads = 2 so that it divides evenly.

    4. Suppose the size of my input image is 32x32; since patch_size = 4, the sequence length is 64. If the input size increases, e.g. to 64x64 or 128x128, should I also increase patch_size so that the final sequence length stays 64?

    If you can help me, I will be very grateful.

    opened by kimjisoo12 0