(CVPR 2022) PyTorch implementation of "Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut"

Overview

(CVPR 2022) TokenCut

PyTorch implementation of TokenCut:

Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut

Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L. Crowley, Dominique Vaufreydaz

[Project page] [Paper] Colab demo Hugging Face Spaces

TokenCut teaser

If our project is helpful for your research, please consider citing:

@inproceedings{wang2022tokencut,
  title={Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut},
  author={Wang, Yangtao and Shen, Xi and Hu, Shell Xu and Yuan, Yuan and Crowley, James L. and Vaufreydaz, Dominique},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Table of Contents

1. Updates

03/10/2022 Created a 480p demo using Gradio. Try out the Web Demo: Hugging Face Spaces

Internet image results:

TokenCut visualizations TokenCut visualizations TokenCut visualizations TokenCut visualizations

02/26/2022 Integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo: Hugging Face Spaces

02/26/2022 A simple TokenCut Colab Demo is available.

02/21/2022 Initial commit: the code of TokenCut is released, including evaluation of unsupervised object discovery, unsupervised salient object detection, and weakly supervised object localization.

2. Installation

2.1 Dependencies

This code was implemented with Python 3.7, PyTorch 1.7.1, and CUDA 11.2. Please refer to the official PyTorch installation instructions. If CUDA 10.2 has been properly installed:

pip install torch==1.7.1 torchvision==0.8.2

In order to install the additional dependencies, please launch the following command:

pip install -r requirements.txt
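
As a quick sanity check (not part of the repository), you can verify that the installed versions match those above and that the GPU is visible:

# Quick environment check: confirm the PyTorch / torchvision versions
# installed above and that CUDA sees a GPU.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 1.7.1
print("torchvision:", torchvision.__version__)  # expected: 0.8.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))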

2.2 Data

We provide quick download commands in DOWNLOAD_DATA.md for VOC2007, VOC2012, COCO, CUB, ImageNet, ECSSD, DUTS and DUT-OMRON as well as DINO checkpoints.

3. Quick Start

3.1 Detecting an object in one image

We provide TokenCut visualizations of the bounding box prediction and the attention map. Use `all` to save all visualization results.

python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize pred
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize attn
python main_tokencut.py --image_path examples/VOC07_000036.jpg --visualize all 
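
For readers curious about what main_tokencut.py computes internally, below is a minimal, self-contained sketch of the graph-cut step described in the paper. The function name tokencut_partition and the defaults tau=0.2 and eps are illustrative (taken from the paper's description), not the repository's actual API; feats is assumed to be the (N, d) array of DINO patch (key) features.

import numpy as np
from scipy.linalg import eigh

def tokencut_partition(feats, tau=0.2, eps=1e-5):
    # Cosine similarity between every pair of patch features.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    # Edges with similarity above tau get weight 1, the others eps.
    W = np.where(sim > tau, 1.0, eps)
    # Relaxed normalized cut: second smallest generalized eigenvector of
    # (D - W) x = lambda D x (the Fiedler vector).
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
    fiedler = vecs[:, 0]
    # Bipartition around the mean; keep the side containing the patch with
    # the largest absolute eigenvector value as foreground (as in the paper).
    fg = fiedler > fiedler.mean()
    if not fg[np.argmax(np.abs(fiedler))]:
        fg = ~fg
    return fg  # boolean foreground mask over the N patches

The resulting foreground mask can be reshaped to the (H/P, W/P) patch grid, and the tightest box around the foreground region gives the predicted bounding box.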

3.2 Segmenting a salient region in one image

We provide TokenCut segmentation results as follows:

cd unsupervised_saliency_detection 
python get_saliency.py --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --vit-arch small --patch-size 16 --img-path ../examples/VOC07_000036.jpg --out-dir ./output
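
As a rough illustration of the segmentation output (assumptions: a patch-level foreground mask fg as in the sketch of Section 3.1, a patch grid size grid_hw, and an image size image_hw), the coarse patch mask is simply upsampled to image resolution before any bilateral-solver refinement:

import numpy as np
from PIL import Image

def patch_mask_to_pixels(fg, grid_hw, image_hw):
    # fg: boolean mask over patches; grid_hw = (H // patch_size, W // patch_size).
    h, w = grid_hw
    coarse = fg.reshape(h, w).astype(np.uint8) * 255
    # Nearest-neighbour upsampling to the original (H, W) resolution;
    # PIL's resize expects (width, height).
    mask = Image.fromarray(coarse).resize((image_hw[1], image_hw[0]), Image.NEAREST)
    return np.array(mask) > 127  # boolean pixel-level mask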

4. Evaluation

The following steps reproduce the TokenCut results presented in the paper.

4.1 Unsupervised object discovery

TokenCut visualizations TokenCut visualizations TokenCut visualizations

PASCAL-VOC

In order to apply TokenCut and compute CorLoc results (VOC07 68.8, VOC12 72.1), please launch:

python main_tokencut.py --dataset VOC07 --set trainval
python main_tokencut.py --dataset VOC12 --set trainval
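
For reference, CorLoc counts an image as correct when the single predicted box overlaps at least one ground-truth box with IoU >= 0.5. A minimal sketch of this metric under its standard definition (not necessarily the repository's exact evaluation code):

def box_iou(a, b):
    # Boxes in (x1, y1, x2, y2) format.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes_per_image):
    # pred_boxes: one predicted box per image;
    # gt_boxes_per_image: list of ground-truth boxes for each image.
    hits = sum(any(box_iou(p, g) >= 0.5 for g in gts)
               for p, gts in zip(pred_boxes, gt_boxes_per_image))
    return hits / len(pred_boxes)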

If you want to extract the DINO features (the key features in DINO), run:

mkdir features
python main_lost.py --dataset VOC07 --set trainval --save-feat-dir features/VOC2007

COCO

Results are reported using the 2014 annotations, following previous work. The following command produces results on the 20k-image subset of the COCO dataset (CorLoc 58.8), following previous literature. Note that these 20k images are a subset of the train set.

python main_tokencut.py --dataset COCO20k --set train

Different models

We have tested the method on different setups of the ViT model; CorLoc results are presented in the following table (more can be found in the paper).

arch       pre-training   VOC07   VOC12   COCO20k
ViT-S/16   DINO           68.8    72.1    58.8
ViT-S/8    DINO           67.3    71.6    60.7
ViT-B/16   DINO           68.8    72.4    59.0

The above results on VOC07 can be obtained by launching:

python main_tokencut.py --dataset VOC07 --set trainval                  # ViT-S/16
python main_tokencut.py --dataset VOC07 --set trainval --patch_size 8   # ViT-S/8
python main_tokencut.py --dataset VOC07 --set trainval --arch vit_base  # ViT-B/16

4.2 Unsupervised saliency detection

TokenCut visualizations TokenCut visualizations TokenCut visualizations

To evaluate on the ECSSD, DUTS, and DUT-OMRON datasets:

python get_saliency.py --out-dir ECSSD --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset ECSSD

python get_saliency.py --out-dir DUTS --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset DUTS

python get_saliency.py --out-dir DUT --sigma-spatial 16 --sigma-luma 16 --sigma-chroma 8 --nb-vis 1 --vit-arch small --patch-size 16 --dataset DUT

This should give:

Method          ECSSD                  DUTS                   DUT-OMRON
                maxF   IoU    Acc      maxF   IoU    Acc      maxF   IoU    Acc
TokenCut        80.3   71.2   91.8     67.2   57.6   90.3     60.0   53.3   88.0
TokenCut + BS   87.4   77.2   93.4     75.5   62.4   91.4     69.7   61.8   89.7
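
For reference, maxF, IoU, and Acc follow the standard saliency-evaluation definitions; a minimal sketch is given below (the repository's evaluation code may differ in details such as the threshold grid or the value of beta):

import numpy as np

def saliency_metrics(pred, gt, beta2=0.3):
    # pred: float saliency map in [0, 1]; gt: boolean ground-truth mask.
    binary = pred >= 0.5
    acc = (binary == gt).mean()                    # pixel accuracy
    inter = np.logical_and(binary, gt).sum()
    union = np.logical_or(binary, gt).sum()
    iou = inter / union if union > 0 else 0.0      # IoU of the binarized map
    # Maximal F-measure over a sweep of binarization thresholds.
    max_f = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        b = pred >= t
        tp = np.logical_and(b, gt).sum()
        precision = tp / max(b.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        if precision + recall > 0:
            f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
            max_f = max(max_f, f)
    return max_f, iou, acc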

4.3 Weakly supervised object localization

TokenCut visualizations TokenCut visualizations TokenCut visualizations

Finetune DINO small on CUB

To finetune ViT-S/16 on CUB on a single node with 4 GPUs for 1000 epochs, run:

python -m torch.distributed.launch --nproc_per_node=4 main.py --data_path /path/to/data --batch_size_per_gpu 256 --dataset cub --weight_decay 0.005 --pretrained_weights ./dino_deitsmall16_pretrain.pth --epoch 1000 --output_dir ./path/to/checkpoin --lr 2e-4 --warmup-epochs 50 --num_labels 200 --num_workers 16 --n_last_blocks 1 --avgpool_patchtokens true --arch vit_small --patch_size 16

Evaluation on CUB

To evaluate a fine-tuned ViT-S/16 on CUB val with a single GPU, run:

python eval.py --pretrained_weights ./path/to/checkpoint --dataset cub --data_path ./path/to/data --batch_size_per_gpu 1 --no_center_crop

This should give:

Top1_cls: 79.12, top5_cls: 94.80, gt_loc: 0.914, top1_loc: 0.723
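
These numbers follow the usual weakly supervised localization metrics: gt_loc counts a hit when the predicted box has IoU >= 0.5 with the ground-truth box (class ignored), and top1_loc additionally requires the top-1 class prediction to be correct. A sketch under these standard definitions (reusing box_iou from the CorLoc sketch in Section 4.1; not necessarily the exact logic in eval.py):

def wsol_metrics(records):
    # records: (pred_class, gt_class, pred_box, gt_box) per image;
    # box_iou is the helper defined in the CorLoc sketch above.
    n = len(records)
    top1_cls = sum(p == g for p, g, _, _ in records) / n
    gt_loc = sum(box_iou(pb, gb) >= 0.5 for _, _, pb, gb in records) / n
    top1_loc = sum(p == g and box_iou(pb, gb) >= 0.5
                   for p, g, pb, gb in records) / n
    return top1_cls, gt_loc, top1_loc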

Evaluation on ImageNet

To evaluate a fine-tuned ViT-S/16 on ImageNet val with a single GPU, run:

python eval.py --pretrained_weights /path/to/checkpoint --classifier_weights /path/to/linear_weights --dataset imagenet --data_path ./path/to/data --batch_size_per_gpu 1 --num_labels 1000 --no_center_crop --input_size 256 --tau 0.2 --patch_size 16 --arch vit_small

5. Acknowledgement

The TokenCut code is built on top of LOST, DINO, SegSwap, and Bilateral_Solver. We would like to sincerely thank these authors for their great work.

Comments
  • Wants to improve the failed images that come out of this code

    I have run this code for unsupervised saliency detection on the ECSSD dataset and found that some of the images do not generate a proper mask. Varying tau and taking the eigenvector without the absolute value gives somewhat improved results.

    What could be a generalised solution for this type of image?

    opened by srilekhapanda 4
  • Results about unsupervised salient object detection

    I notice that the reported results of DeepUSPS are significantly lower than those of other methods, even traditional HS. I have tested the performance of this method myself and obtained much better performance than the scores reported in this paper. In addition, [1] also suggests that DeepUSPS should not get such low scores.

    [1] A Causal Debiasing Framework for Unsupervised Salient Object Detection. AAAI 2022.

    opened by moothes 4
  • Re-implementation of unsupervised saliency detection

    Hi, thanks for the interesting work!

    I cannot obtain the "TokenCut + BS" results reported in the paper.

    E.g., I get "ECSSD: IoU (0.621), Acc (0.891), F-max (0.751)", which is much lower than the claimed results.

    As for "TokenCut", I have gained the same results, so I guess the problems may occur at "bilateral solver". Could you please check and update the codes?

    Thanks in advance!

    Best,

    opened by ZHANG-Jun-Pu 3
  • Files required for running ImageNet dataset

    Hi @YangtaoWANG95, I am a Ph.D. student from Purdue University. I am trying to run TokenCut with ImageNet but am not able to figure out how to get the text files under the ILSVRC/detection folder (i.e., train.txt, val.txt and wnids.txt). Can you please let me know where I can find these files? Any help would be highly appreciated.

    Thanks, Aparna

    opened by aparna-aketi 2
  • Support returning bounding box coordinates

    Hi there,

    Amazing work with TokenCut! I'm impressed at how well this appears to work on the dataset I'm playing with.

    It would be great if there was a convenient method for returning bounding box predictions, instead of just images annotated with the bounding boxes. If that already exists, please point me in the right direction.

    opened by lextoumbourou 2
  • Add simple Colab demo

    Just a suggestion - so it's easy to try out with one click :) Note that this pull request depends upon #1

    (A Huggingface/Gradio demo would be even better! ... @AK391)

    opened by josephrocca 2
  • Why does this paper use NCut given that the edge weights are all non-negative?

    In the paper, the graph is constructed so that all edge weights are either 1 or eps (almost zero), making the optimal value of the NCut energy 0. The problem then has the trivial solution A = all nodes and B = the empty set. In fact, there are many trivial solutions, such as A = a connected component of the graph and B = the remaining nodes. In this case, how can we tell which solution is desired? And since the graph is collapsed, the problem becomes very easy and does not need the relaxed version of NCut at all.

    If I have any misunderstandings, could you please explain that to me?

    opened by Helicopt 1
  • How to prepare the ImageNet dataset

    Sincere thanks for this great work! I got the images and box annotations from the ImageNet official website. However, how should I prepare the ImageNet dataset as follows?
    ./datasets/ImageNet/
    ├── ILSVRC
    ├── Annotations
    ├── Data
    ├── Detection
    ├── ImageSets
    ├── LOC_synset_mapping.txt
    ├── LOC_val_solution.csv

                        ...
    
    opened by gazelxu 1
  • CVPR2022 call for demos

    Hi, there is a call for demos this year for cvpr 2022

    https://cvpr2022.thecvf.com/call-demos

    where a demo can be added to the Hugging Face organization here: https://huggingface.co/cvpr

    As there is already a Gradio demo for Tokencut on Hugging Face https://huggingface.co/spaces/akhaliq/TokenCut, would you be interested in submitting a demo for this?

    opened by AK391 1
  • Fix saliency detection for `patch_size != 16`

    The unsupervised saliency detection code currently does not work with patch size 8 because the feature extractor is always instantiated with patch size 16.

    This PR fixes that.

    opened by Callidior 0
  • Finetune DINO small on CUB

    When using ViT-S to fine-tune DINO on CUB, the descriptions in the code and in the README are not consistent. In the code, avgpool_patchtokens is typically set to False for ViT-Small, while the README gives --avgpool_patchtokens true.

    opened by zaiquanyang 0
Owner: YANGTAO WANG (PhD, Computer Vision, Deep Learning)