[CVPR22] Official codebase of Semantic Segmentation by Early Region Proxy.

Yifan

Last update: Nov 29, 2022

RegionProxy

Figure 2. Performance vs. GFLOPs on ADE20K val split.

Semantic Segmentation by Early Region Proxy

Yifan Zhang, Bo Pang, Cewu Lu

CVPR 2022 (Poster) [arXiv]

Installation

Note: We recommend using the exact package versions listed below to avoid runtime issues.

Install PyTorch 1.7.1 and torchvision 0.8.2 following the official guide.
Install timm 0.4.12 and einops:
```
pip install timm==0.4.12 einops
```
This project depends on mmsegmentation 0.17 and mmcv 1.3.13, so you may follow the mmsegmentation instructions to set up the environment and prepare the datasets.
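To confirm that an environment matches the versions listed above, a small check along the following lines may help. This is a minimal sketch, not part of the official codebase: it assumes Python 3.8+ (for `importlib.metadata`), and the distribution names in `EXPECTED` are assumptions (for example, mmcv is often installed as `mmcv-full`, in which case the key should be changed).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in the installation notes above.
# The distribution names are assumptions; adjust them to your setup.
EXPECTED = {
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "timm": "0.4.12",
    "mmcv": "1.3.13",
    "mmsegmentation": "0.17",
}


def matches(installed: str, expected: str) -> bool:
    """True if `installed` starts with the dotted components of `expected`.

    A local build suffix such as "+cu110" is ignored, so "1.7.1+cu110"
    matches "1.7.1" and "0.17.0" matches "0.17".
    """
    want = expected.split(".")
    have = installed.split("+")[0].split(".")
    return have[:len(want)] == want


def check_versions(expected=EXPECTED, lookup=version):
    """Return {package: (expected, installed or None)} for every mismatch."""
    problems = {}
    for name, want in expected.items():
        try:
            have = lookup(name)
        except PackageNotFoundError:
            have = None  # package is not installed at all
        if have is None or not matches(have, want):
            problems[name] = (want, have)
    return problems


if __name__ == "__main__":
    for name, (want, have) in check_versions().items():
        print(f"{name}: expected {want}, found {have or 'not installed'}")
```

An empty result from `check_versions()` means every listed package was found with a matching version prefix.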

Models

ADE20K

backbone	Resolution	FLOPs	#params.	mIoU	mIoU (ms+flip)	FPS	download
ViT-Ti/16	512x512	3.9G	5.8M	42.1	43.1	38.9	[model]
ViT-S/16	512x512	15G	22M	47.6	48.5	32.1	[model]
R26+ViT-S/32	512x512	16G	36M	47.8	49.1	28.5	[model]
ViT-B/16	512x512	59G	87M	49.8	50.5	20.1	[model]
R50+ViT-L/32	640x640	82G	323M	51.0	51.7	12.7	[model]
ViT-L/16	640x640	326G	306M	52.9	53.4	6.6	[model]

Cityscapes

backbone	Resolution	FLOPs	#params.	mIoU	mIoU (ms+flip)	download
ViT-Ti/16	768x768	69G	6M	76.5	77.7	[model]
ViT-S/16	768x768	270G	23M	79.8	81.5	[model]
ViT-B/16	768x768	1064G	88M	81.0	82.2	[model]
ViT-L/16	768x768	-	307M	81.4	82.7	[model]

Evaluation

You may evaluate the model on a single GPU by running:

python test.py \
	--config configs/regproxy_ade20k/regproxy-t16-sub4+implicit-mid-4+512x512+160k+adamw-poly+ade20k.py \
	--checkpoint /path/to/ckpt \
	--eval mIoU

To evaluate on multiple GPUs, run:

python -m torch.distributed.launch --nproc_per_node 8 test.py \
	--launcher pytorch \
	--config configs/regproxy_ade20k/regproxy-t16-sub4+implicit-mid-4+512x512+160k+adamw-poly+ade20k.py \
	--checkpoint /path/to/ckpt \
	--eval mIoU

You may add --aug-test to enable multi-scale + flip evaluation. The test.py script is mostly copy-pasted from mmsegmentation. Please refer to this link for more usage (e.g., visualization).

Training

The first step is to prepare the pre-trained weights. Following Segmenter, we use AugReg pre-trained weights for our tiny, small and large models, and DeiT pre-trained weights for our base models. Follow these steps to prepare the pre-trained weights for model initialization:

For the DeiT weights, simply download them from this link. For the AugReg weights, first acquire the timm-style models:
```
import timm
m = timm.create_model('vit_tiny_patch16_384', pretrained=True)
```
The full list of entries can be found here (vanilla ViTs) and here (hybrid models).
Convert the timm models to mmsegmentation style using this script.

We train all models on 8 V100 GPUs. For example, to train RegProxy-Ti/16, run:

python -m torch.distributed.launch --nproc_per_node 8 train.py \
	--launcher pytorch \
	--config configs/regproxy_ade20k/regproxy-t16-sub4+implicit-mid-4+512x512+160k+adamw-poly+ade20k.py \
	--work-dir /path/to/workdir \
	--options model.pretrained=/path/to/pretrained/model

You may need to adjust data.samples_per_gpu if you plan to train on fewer GPUs. Please refer to this link for more training options.
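If the goal is to keep the total batch size unchanged when using fewer GPUs (an assumption; the README does not state it explicitly), the per-GPU value can be worked out as below. This is a minimal sketch: the helper name and the reference value of 2 samples per GPU are hypothetical, so check the config file for the actual setting.

```python
def samples_per_gpu_for(total_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size that keeps `total_batch` fixed across `num_gpus`."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch % num_gpus:
        raise ValueError(
            f"total batch {total_batch} is not divisible by {num_gpus} GPUs"
        )
    return total_batch // num_gpus


# Hypothetical reference setup: 8 GPUs with 2 samples each.
reference_total = 8 * 2

# Moving to 4 GPUs while keeping the same total batch size.
print(samples_per_gpu_for(reference_total, 4))  # 4
```

The resulting value could presumably be passed the same way as the pre-trained path above, e.g. `--options data.samples_per_gpu=4`, though this override has not been verified here.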

Citation

@article{zhang2022semantic,
  title={Semantic Segmentation by Early Region Proxy},
  author={Zhang, Yifan and Pang, Bo and Lu, Cewu},
  journal={arXiv preprint arXiv:2203.14043},
  year={2022}
}

Comments

About Region Encoder
Hi, YiF. I have been reading the paper alongside your open source code over the past two days, and have the following questions about the Region Encoder part:

H × W = N, where N is the number of tokens. Do h and w change as H and W change?

For a 512 × 512 image, is H × h = 512 and W × w = 512?

h and w are both 4 in your source code, so for a 512 × 512 image, H = W = 128 and N = 16384. Is that too many tokens?

question
opened by wanghr-git 2
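The arithmetic in the question above can be reproduced with a few lines. This is only a sketch under the reading given in the question (image size = token grid size × stride, with h = w = 4); the function name is hypothetical and it is not taken from the official code.

```python
def token_grid(image_hw, stride_hw=(4, 4)):
    """Return (H, W, N) for an image of size `image_hw` and stride (h, w)."""
    (img_h, img_w), (h, w) = image_hw, stride_hw
    if img_h % h or img_w % w:
        raise ValueError("image size must be divisible by the stride")
    H, W = img_h // h, img_w // w  # token grid height and width
    return H, W, H * W             # N = H * W tokens


print(token_grid((512, 512)))  # (128, 128, 16384)
```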
AttributeError: 'PatchEmbed' object has no attribute 'DH'

Dear authors, thank you very much for your contribution! I ran into the following error when running test.py:

File "/home/disk/xxx/git/RegionProxy/models/vit.py", line 113, in forward
    x, hw_shape = self.patch_embed(inputs), (self.patch_embed.DH,
File "/home/xxx/anaconda3/envs/region_proxy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'PatchEmbed' object has no attribute 'DH'

The arguments I ran it with are:

--config configs/regproxy_cityscapes/regproxy-s16-sub4+implicit-mid-2+768x768+80k+adamw-poly+cityscapes.py --checkpoint checkpoints/regproxy-s16-sub4+implicit-mid-2+768x768+80k+adamw-poly+cityscapes.pth --eval mIoU

My directory structure is as follows:

├── checkpoints
│   └── regproxy-s16-sub4+implicit-mid-2+768x768+80k+adamw-poly+cityscapes.pth
├── configs
│   ├── _base_
│   │   ├── datasets
│   │   ├── default_runtime.py
│   │   ├── models
│   │   └── schedules
│   ├── regproxy_ade20k
│   │   ├── regproxy-b16-sub4+implicit-mid-4+512x512+160k+adamw-cr+ade20k.py
│   │   ├── regproxy-l16-sub4+implicit-mid-9+640x640+160k+adamw-cr+ade20k.py
│   │   ├── regproxy-r26-s32-sub4+implicit-mid-n1+512x512+160k+adamw-poly+ade20k.py
│   │   ├── regproxy-r50-l32-sub4+implicit-mid-n1+640x640+160k+adamw-cr+ade20k.py
│   │   ├── regproxy-s16-sub4+implicit-mid-4+512x512+160k+adamw-poly+ade20k.py
│   │   └── regproxy-t16-sub4+implicit-mid-4+512x512+160k+adamw-poly+ade20k.py
│   └── regproxy_cityscapes
│       ├── regproxy-b16-sub4+implicit-mid-2+768x768+80k+adamw-cr+cityscapes.py
│       ├── regproxy-l16-sub4+implicit-mid-5+768x768+80k+adamw-cr+cityscapes.py
│       ├── regproxy-s16-sub4+implicit-mid-2+768x768+80k+adamw-poly+cityscapes.py
│       └── regproxy-t16-sub4+implicit-mid-2+768x768+80k+adamw-poly+cityscapes.py
├── data
│   └── cityscapes
│       ├── gtFine
│       ├── leftImg8bit
│       ├── test.txt
│       ├── train.txt
│       └── val.txt
├── LICENSE
├── models
│   ├── __init__.py
│   ├── proxy_head.py
│   ├── segmentors.py
│   └── vit.py
├── README.md
├── test.py
├── train.py
└── utils
    ├── checkpoint.py
    ├── __init__.py
help wanted

opened by hollow-503 2
About the handle borders

Hi, thanks for your code! There is some code for handling borders, and I'm not very clear on its purpose. Could you please explain it? Thank you very much!

opened by GuoQingqing 1
How do h, w in the paper and the F.unfold() function in the code work?

1. About h, w: The sentence "(Hh) × (Ww) matches the size of the output segmentation map and (h, w) is the relative stride of the initial token grid" in the paper indicates that h and w are the downsampling strides of the segmentation map. But when reading the code, I am confused about how this works: by rearranging token_logits and applying a matrix multiplication, we get the final segmentation map, which is as large as the input image. So why do you set the extra parameters h and w, and how do h and w relate to the stride?

2. About F.unfold():

Official implementation code:

token_logits = F.unfold(token_logits, kernel_size=3, padding=1).reshape(B, -1, 9, H, W) # (B, C, 9, H, W)

Pseudocode in the paper:

# get neighbors for each cell
y = rar(y, "B N K -> B K H W")
nb = im2col(y, kernel_size=3, padding=1)
nb = rar(nb, "B (K n) (H W) -> B H W n K")

My other question is what F.unfold() does in the code. In the paper, you show the proxy head process in pseudocode and say that im2col (i.e., F.unfold()) is used to get the neighbors of each cell. I do not understand this well either.

Looking forward to your reply!!! Thank you ~~~

opened by stte0v0 0
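As a general illustration of what a 3×3 `F.unfold` with `padding=1` gathers (not a statement about the authors' design), the toy example below collects the nine neighbors of every cell of a small tensor. The tensor sizes and the helper name are assumptions chosen for readability; only the `F.unfold(...).reshape(B, -1, 9, H, W)` pattern comes from the code quoted in the question.

```python
import torch
import torch.nn.functional as F


def gather_neighbors(x: torch.Tensor) -> torch.Tensor:
    """Return the 3x3 neighborhood of every cell, shape (B, C, 9, H, W)."""
    B, C, H, W = x.shape
    # unfold yields (B, C * 9, H * W): one column per output cell, holding
    # the 9 values of its 3x3 window for each channel (zeros at the border).
    cols = F.unfold(x, kernel_size=3, padding=1)
    return cols.reshape(B, C, 9, H, W)


# Toy input: values start at 1 so that padded zeros are easy to spot.
B, C, H, W = 1, 2, 3, 4
x = torch.arange(1, B * C * H * W + 1, dtype=torch.float32).reshape(B, C, H, W)
nb = gather_neighbors(x)

print(nb.shape)                    # torch.Size([1, 2, 9, 3, 4])
print(torch.equal(nb[:, :, 4], x)) # True: index 4 is the window center
print(nb[0, 0, :, 0, 0])           # neighbors of the top-left cell
```

Index 4 along the third dimension is the cell itself, while indices 0 to 3 and 5 to 8 are its eight surrounding cells in row-major order; cells on the border see zeros where the window extends past the input.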
Visualization of regions

Dear authors, thanks for sharing your great work. I would like to know how to visualize the regions as shown in Fig. 6 and Fig. 8. Could you release the code?

Thanks

opened by dingjiansw101 0

Unofficial implementation of RegionProxy based on Pytorch

Hi, YiF. Based on your open source code, I wrote an unofficial implementation of RegionProxy that depends only on PyTorch. It is available at https://github.com/wanghr-git/RegionProxy. Could you check it for correctness if you have time?

opened by wanghr-git 0

About pretrained model

Hi, YiF (@YiF-Zhang). If my dataset has a resolution of 512 × 512, how can I use the pre-trained model? Can a model pre-trained at 224 or 384 resolution be used directly?

opened by wanghr-git 2
