source code of “Visual Saliency Transformer” (ICCV2021)

Last update: Dec 21, 2022

Overview

Visual Saliency Transformer (VST)

Source code for our ICCV 2021 paper “Visual Saliency Transformer” by Nian Liu, Ni Zhang, Kaiyuan Wan, Junwei Han, and Ling Shao.

created by Ni Zhang, email: [email protected]

Requirement

PyTorch 1.6.0
Torchvision 0.7.0

RGB VST for RGB Salient Object Detection

Data Preparation

Training Set

We use the DUTS training set to train our VST for RGB SOD. In addition, we follow EGNet to generate contour maps of the DUTS training set for training. You can directly download the generated contour maps, DUTS-TR-Contour, from [baidu pan fetch code: ow76 | Google drive] and put them into the RGB_VST/Data folder.

Testing Set

We use the testing sets of DUTS, ECSSD, HKU-IS, PASCAL-S, DUT-O, and SOD to test our VST. After downloading, put them into the RGB_VST/Data folder.

Your RGB_VST/Data folder should look like this:

-- Data
   |-- DUTS
   |   |-- DUTS-TR
   |   |   |-- DUTS-TR-Image
   |   |   |-- DUTS-TR-Mask
   |   |   |-- DUTS-TR-Contour
   |   |-- DUTS-TE
   |   |   |-- DUTS-TE-Image
   |   |   |-- DUTS-TE-Mask
   |-- ECSSD
   |   |-- images
   |   |-- GT
   ...
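
Before training, it can help to confirm that the folders are where the tree above places them. The following is a minimal sketch, not part of the repository: the helper name missing_dirs and the list EXPECTED_DIRS are hypothetical, and the nesting (for example DUTS/DUTS-TR/DUTS-TR-Image) is read off the tree above. Only the folders shown there are listed; the other test sets would need to be added by hand.

```python
from pathlib import Path

# Sub-folders read off the layout above (hypothetical list; extend it for
# the other test sets, whose sub-folder names are not shown in the tree).
EXPECTED_DIRS = [
    "DUTS/DUTS-TR/DUTS-TR-Image",
    "DUTS/DUTS-TR/DUTS-TR-Mask",
    "DUTS/DUTS-TR/DUTS-TR-Contour",
    "DUTS/DUTS-TE/DUTS-TE-Image",
    "DUTS/DUTS-TE/DUTS-TE-Mask",
    "ECSSD/images",
    "ECSSD/GT",
]


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Calling missing_dirs("RGB_VST/Data") from the repository root returns an empty list when every listed folder is present, and otherwise the relative paths that still need to be created or downloaded.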

Training, Testing, and Evaluation

cd RGB_VST
Download the pretrained T2T-ViT_t-14 model [baidu pan fetch code: 2u34 | Google drive] and put it into the pretrained_model/ folder.
Run python train_test_eval.py --Training True --Testing True --Evaluation True for training, testing, and evaluation. The predictions are saved in the preds/ folder and the evaluation results in the result.txt file.
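
The command above passes each stage flag as a string such as True. How train_test_eval.py parses these values is not stated here. If it relies on argparse with type=bool (an assumption worth checking in the script), then passing False would still enable the stage, because Python treats any non-empty string as true. The sketch below demonstrates that behaviour with a hypothetical parser that only mirrors the three flag names; build_parser and str2bool are illustrative helpers, not functions from the repository.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings explicitly instead of using bool()."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser(flag_type):
    """Hypothetical parser exposing the three stage flags named in this README."""
    parser = argparse.ArgumentParser()
    for stage in ("Training", "Testing", "Evaluation"):
        parser.add_argument(f"--{stage}", default=False, type=flag_type)
    return parser


argv = ["--Training", "False", "--Testing", "True"]
naive = build_parser(bool).parse_args(argv)
strict = build_parser(str2bool).parse_args(argv)

print(naive.Training)    # bool("False") is True: non-empty strings are truthy
print(strict.Training)   # explicit parsing yields False
print(naive.Evaluation)  # flag omitted, so the default (False) applies
```

Under that assumption, the safe way to skip a stage is to omit its flag entirely, as the testing-only command in the next section does.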

Testing on Our Pretrained RGB VST Model

cd RGB_VST
Download our pretrained RGB_VST.pth [baidu pan fetch code: pe54 | Google drive] and put it into the checkpoint/ folder.
Run python train_test_eval.py --Testing True --Evaluation True for testing and evaluation. The predictions are saved in the preds/ folder and the evaluation results in the result.txt file.

Our saliency maps can be downloaded from [baidu pan fetch code: 92t0 | Google drive].

SOTA Saliency Maps for Comparison

The saliency maps of the state-of-the-art methods in our paper can be downloaded from [baidu pan fetch code: de4k | Google drive].

RGB-D VST for RGB-D Salient Object Detection

Data Preparation

Training Set

We use 1,485 images from NJUD, 700 images from NLPR, and 800 images from DUTLF-Depth to train our VST for RGB-D SOD. In addition, we follow EGNet to generate the corresponding contour maps for training. You can directly download the whole training set from [baidu pan fetch code: 7vsw | Google drive] and put it into the RGBD_VST/Data folder.

Testing Set

NJUD [baidu pan fetch code: 7mrn | Google drive]
NLPR [baidu pan fetch code: tqqm | Google drive]
DUTLF-Depth [baidu pan fetch code: 9jac | Google drive]
STERE [baidu pan fetch code: 93hl | Google drive]
LFSD [baidu pan fetch code: l2g4 | Google drive]
RGBD135 [baidu pan fetch code: apzb | Google drive]
SSD [baidu pan fetch code: j3v0 | Google drive]
SIP [baidu pan fetch code: q0j5 | Google drive]
ReDWeb-S

After downloading, put them into the RGBD_VST/Data folder.

Your RGBD_VST/Data folder should look like this:

-- Data
   |-- NJUD
   |   |-- trainset
   |   |   |-- RGB
   |   |   |-- depth
   |   |   |-- GT
   |   |   |-- contour
   |   |-- testset
   |   |   |-- RGB
   |   |   |-- depth
   |   |   |-- GT
   |-- STERE
   |   |-- RGB
   |   |-- depth
   |   |-- GT
   ...
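
Each RGB-D split keeps the RGB image, depth map, and ground truth in parallel sub-folders, so a missing file in one of them is easy to overlook. The following is a minimal sketch for spotting such gaps. It assumes that corresponding files share the same name apart from the extension, which this README does not state; unmatched_stems and stems_in are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def stems_in(folder):
    """Return the set of file names (without extension) found in folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return set()
    return {p.stem for p in folder.iterdir() if p.is_file()}


def unmatched_stems(split_dir, subfolders=("RGB", "depth", "GT")):
    """Map each sub-folder to the sorted stems that are missing from it.

    A stem is expected everywhere if it appears in at least one sub-folder.
    """
    root = Path(split_dir)
    found = {name: stems_in(root / name) for name in subfolders}
    all_stems = set().union(*found.values())
    return {name: sorted(all_stems - stems) for name, stems in found.items()}
```

For example, unmatched_stems("RGBD_VST/Data/NJUD/testset") returns a dictionary with one entry per sub-folder; every list is empty when each image has a matching depth map and ground truth. For a training split, pass subfolders=("RGB", "depth", "GT", "contour").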

Training, Testing, and Evaluation

cd RGBD_VST
Download the pretrained T2T-ViT_t-14 model [baidu pan fetch code: 2u34 | Google drive] and put it into the pretrained_model/ folder.
Run python train_test_eval.py --Training True --Testing True --Evaluation True for training, testing, and evaluation. The predictions are saved in the preds/ folder and the evaluation results in the result.txt file.

Testing on Our Pretrained RGB-D VST Model

cd RGBD_VST
Download our pretrained RGBD_VST.pth [baidu pan fetch code: zt0v | Google drive] and put it into the checkpoint/ folder.
Run python train_test_eval.py --Testing True --Evaluation True for testing and evaluation. The predictions are saved in the preds/ folder and the evaluation results in the result.txt file.

Our saliency maps can be downloaded from [baidu pan fetch code: jovk | Google drive].

SOTA Saliency Maps for Comparison

The saliency maps of the state-of-the-art methods in our paper can be downloaded from [baidu pan fetch code: i1we | Google drive].

Acknowledgement

We thank the authors of EGNet for providing the code for generating contour maps. We also thank Zhao Zhang for providing the efficient evaluation tool.

Citation

If you find our work helpful, please cite:

@inproceedings{liu2021VST, 
  title={Visual Saliency Transformer}, 
  author={Liu, Nian and Zhang, Ni and Wan, Kaiyuan and Han, Junwei and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2021}
}

Comments

Testing without Masks

Hello, and thank you for this latest work. I apologize if there is an easy answer that I have missed in the code. The inclusion of the evaluator is super helpful, but I was curious whether it is possible to amend the code so that I can test an image for which I have no mask, only the image itself, and still output the predicted mask.

opened by LBNord 4
Generate contour maps for other dataset

Hi, how can we generate contour maps for other datasets? I checked EGNet but didn't find anything about creating contour maps. They mention this sentence: "We use the sal2edge.m to generate the edge label for training." Are the edge labels the same as the contour maps? @nnizhang

opened by kiashann 1
Why we need "Saliency Token"?

Hi, firstly, thanks for your amazing work!

I have a question about the model. I don't understand why, in the decoder, we need to prepare a "saliency token" for the transformer. If I simply remove the contour branch, keep only the saliency branch, and delete the saliency token, the model does not work. I also don't understand the function "saliency_token_inference": why do we use the feature as the query but the token as k and v?

Would you mind explaining a bit?

Thanks

opened by BarCodeReader 0
Pretrained T2T-ViT model can't be opened.

Thanks for your hard work.

I found a problem in your project on GitHub: the pretrained T2T-ViT_t-14 model can't be opened. It always fails, regardless of whether I use Windows or Linux.

opened by wanghaitong-q 1
A question regarding the token based multi-task prediction

Excellent work! Thanks very much for the repo.

I have a question regarding Equation (5) in the paper. Given that the output of sigmoid() is the attention (i.e., As, of size l1 x 1) between the task-specific token and all patch tokens, what does As*Vs mean if Vs is a value of the task-specific token? Why not use the values of the patch tokens?

opened by lianxxx 0
model small

Hello, author. I changed the distributed training in the code into single-step training and set the batch size to 8. The trained model is smaller than the one you provided, 174165KB. The test images are all gray. Can you tell me what's going on here?

opened by cherryolg 0
