Deep ViT Features as Dense Visual Descriptors

Overview

dino-vit-features

[paper] [project page]

Official implementation of the paper "Deep ViT Features as Dense Visual Descriptors".

[Teaser figure]

We demonstrate the effectiveness of deep features extracted from a self-supervised, pre-trained ViT model (DINO-ViT) as dense patch descriptors via real-world vision tasks: (a-b) co-segmentation & part co-segmentation: given a set of input images (e.g., 4 input images), we automatically co-segment semantically common foreground objects (e.g., animals), and then further partition them into common parts; (c-d) point correspondence: given a pair of input images, we automatically extract a sparse set of corresponding points. We tackle these tasks by applying only lightweight, simple methodologies such as clustering or binning to deep ViT features.

Setup

Our code is developed in PyTorch and requires the following modules: tqdm, faiss, timm, matplotlib, pydensecrf, opencv, scikit-learn. We use python=3.9, but our code should be runnable on any version above 3.6. We recommend running our code on a CUDA-supported GPU for faster performance. We recommend setting up the runtime environment via Anaconda by running the following commands:

$ conda env create -f env/dino-vit-feats-env.yml
$ conda activate dino-vit-feats-env

Otherwise, run the following commands in your conda environment:

$ conda install pytorch torchvision torchaudio cudatoolkit=11 -c pytorch
$ conda install tqdm
$ conda install -c conda-forge faiss
$ conda install -c conda-forge timm 
$ conda install matplotlib
$ pip install opencv-python
$ pip install git+https://github.com/lucasb-eyer/pydensecrf.git
$ conda install -c anaconda scikit-learn

ViT Extractor

We provide a wrapper class for a ViT model to extract dense visual descriptors in extractor.py. You can extract descriptors to .pt files using the following command:

python extractor.py --image_path <image_path> --output_path <output_path>

You can specify the pretrained model using the --model flag with the following options:

  • dino_vits8, dino_vits16, dino_vitb8, dino_vitb16 from the DINO repo.
  • vit_small_patch8_224, vit_small_patch16_224, vit_base_patch8_224, vit_base_patch16_224 from the timm repo.

You can specify the stride of the patch extraction layer using the --stride flag; a smaller stride yields denser (higher-resolution) descriptors.
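The same extraction can also be done from Python. Below is a minimal sketch using the ViTExtractor wrapper class; the method names and arguments follow the reference extractor.py and the image path is only a placeholder, so check your copy for the exact signatures:

import torch
from extractor import ViTExtractor  # wrapper class provided in this repo

# Sketch only -- class/method names assume the reference extractor.py.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
extractor = ViTExtractor(model_type='dino_vits8', stride=4, device=device)

# Preprocess an image (placeholder path) and extract per-patch descriptors
# from the keys of the last attention layer.
image_batch, image_pil = extractor.preprocess('images/cat.jpg', load_size=224)
descriptors = extractor.extract_descriptors(image_batch.to(device), layer=11, facet='key', bin=False)
torch.save(descriptors, 'cat_descriptors.pt')  # same kind of .pt artifact as the command above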

Part Co-segmentation

We provide a notebook for running on a single example in part_cosegmentation.ipynb.

To run on several image sets, arrange each set in its own directory inside a data root directory:

<root_dir>
|
|_ <set1_dir>
|  |
|  |_ img1.png
|  |_ img2.png
|
|_ <set2_dir>
   |
   |_ img1.png
   |_ img2.png
   |_ img3.png
...
The following command will produce results in the specified <save_dir>:

python part_cosegmentation.py --root_dir <root_dir> --save_dir <save_dir>

Note: The default configuration in part_cosegmentation.ipynb is suited for running on small sets (e.g., fewer than 10 images), while the default configuration in part_cosegmentation.py is suited for larger sets (e.g., well over 10 images). Increase num_crop_augmentations for more stable results, at the cost of longer runtime.
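For example, a possible invocation with more crop augmentations (this assumes num_crop_augmentations is exposed as a command-line flag; check the argparse options of part_cosegmentation.py for the exact name and default):

python part_cosegmentation.py --root_dir <root_dir> --save_dir <save_dir> --num_crop_augmentations 20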

Co-segmentation

We provide a notebook for running on a single example in cosegmentation.ipynb.
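Under the hood, the co-segmentation relies on clustering the extracted descriptors across the whole image set; the full pipeline in cosegmentation.py also uses saliency maps and CRF refinement. As a toy sketch of the clustering idea only, assuming per-image descriptor arrays have already been extracted as above (the helper below is illustrative and not part of the repo):

import numpy as np
from sklearn.cluster import KMeans

def toy_cluster_descriptors(descriptors_per_image, n_clusters=10):
    # descriptors_per_image: list of (num_patches, dim) arrays, one per image.
    stacked = np.concatenate(descriptors_per_image, axis=0)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(stacked)
    # Split the flat label vector back into one label map per image.
    splits = np.cumsum([d.shape[0] for d in descriptors_per_image])[:-1]
    return np.split(labels, splits)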

To run on several image sets, arrange each set in its own directory inside a data root directory:

<root_dir>
|
|_ <set1_dir>
|  |
|  |_ img1.png
|  |_ img2.png
|
|_ <set2_dir>
   |
   |_ img1.png
   |_ img2.png
   |_ img3.png
...

The following command will produce results in the specified <save_dir>:

python cosegmentation.py --root_dir <root_dir> --save_dir <save_dir>

Point Correspondences

We provide a notebook for running on a single example in correspondences.ipynb.
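At its core, the correspondence extraction matches descriptors between the two images and keeps mutual (two-way) nearest neighbours; the full correspondences.py pipeline additionally bins descriptors and filters by saliency. A minimal sketch of mutual nearest-neighbour matching between two descriptor sets (the function below is illustrative, not part of the repo):

import torch

def mutual_nearest_neighbors(desc_a, desc_b):
    # Toy mutual-NN matching between (Na, d) and (Nb, d) descriptor tensors.
    a = torch.nn.functional.normalize(desc_a, dim=-1)
    b = torch.nn.functional.normalize(desc_b, dim=-1)
    sim = a @ b.t()                      # cosine similarity matrix
    nn_ab = sim.argmax(dim=1)            # best match in B for each patch of A
    nn_ba = sim.argmax(dim=0)            # best match in A for each patch of B
    idx_a = torch.arange(a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a       # keep pairs that pick each other
    return idx_a[mutual], nn_ab[mutual]  # matched patch indices in A and B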

To run on several image pairs, arrange each image pair in its own directory inside a data root directory:

<root_dir>
|
|_ <pair1_dir>
|  |
|  |_ img1.png
|  |_ img2.png
|
|_ <pair2_dir>
   |
   |_ img1.png
   |_ img2.png
...

The following command will produce results in the specified <save_dir>:

python correspondences.py --root_dir <root_dir> --save_dir <save_dir>

Citation

If you found this repository useful, please consider starring and citing:

@article{amir2021deep,
    author    = {Shir Amir and Yossi Gandelsman and Shai Bagon and Tali Dekel},
    title     = {Deep ViT Features as Dense Visual Descriptors},
    journal   = {arXiv preprint arXiv:2112.05814},
    year      = {2021}
}
Comments
• parameter tuning for custom dataset

    I found your method sensitive to the choice of parameters (thresh, elbow coefficient, etc.). Instead of tuning them manually and assessing the results qualitatively, is there a way to do a grid search and assess quantitatively? For example, can I search on the training set, evaluate on the validation set, and use landmark regression results to select the best parameters? If so, could you upload your evaluation scripts so that I can do it this way? Thank you.

    opened by HHenryD 4
  • PCK Evaluation

    I am unable to replicate the results in Table 4 (Correspondence Evaluation on SPair-71k). Since in the case of PCK evaluation keypoints are provided for the source image, I find the closest point (according to the binned descriptor) in the second image within the "salient region". The numbers I get are close to zero, so there might be a mistake in my code. Are there any additional heuristics that you apply for this one-way correspondence?

    opened by kampta 3
  • CUDA out of memory

    I have GPUs with 11 GB of memory, and I get an out-of-memory error when I load more than three images (when computing the ViT attention: attn = (q @ k.transpose(-2, -1)) * self.scale).

    I think I can increase the stride or decrease the load size, but that would also degrade the performance.

    I found that the code only processes a single image at a time, so I would like to ask whether I can run the program across multiple GPUs.

    opened by Reagan1311 2
  • Use of the previous KMeans instance

    I wonder if using the previous KMeans instance at this point is intentional. I mean the part_algorithm was trained on normalized_all_fg_sampled_descriptors as opposed to common_part_algorithm, which is trained on normalized_all_common_sampled_descriptors. https://github.com/ShirAmir/dino-vit-features/blob/4b023eca1ac0bd462a68fcd03ccbdcb5aed40cb1/part_cosegmentation.py#L273

    opened by mateusz-politycki-wttech 1
  • Indexing error when using high resolution saliency map

    Hi,

    When running cosegmentation.py and part_cosegmentation.py, turning low_res_saliency_maps off leads to an indexing error. It appears that saliency_map is batched with shape 1xN. So something like:

    if not low_res_saliency_maps:
        saliency_map = saliency_map[0]
    

    is a sufficient fix.

    Traceback (most recent call last):
      File "cosegmentation.py", line 523, in <module>
        seg_masks, pil_images = find_cosegmentation(
      File "cosegmentation.py", line 257, in find_cosegmentation
        label_saliency = saliency_map[image_labels[:, 0] == label].mean()
    IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 1705
    
    opened by jasonyzhang 1
  • The choice of head_idx

    https://github.com/ShirAmir/dino-vit-features/blob/79c1289b5a83960b85ca8e268bc569f48975fddb/extractor.py#L313

    Is there any reason why you choose these heads?

    opened by kwea123 1
  • DINO vs MAE

    Hi, thanks for your amazing work. The study is very interesting. You are using DINO as the feature extractor in your work, and I was just wondering if you have tried using MAE or a different method, and whether you get the same or similar results? Thanks for your time.

    opened by fabienbaradel 1
  • Extractor feature OOM

    Hi there, I tried the code and at the beginning everything seems fine, but then I tried to use the extractor on my own images, more specifically, to extract features from high-resolution pictures and visualize the PCA picture. I tried the code on 100 images of size 800 x 800 with load_size=224 and stride=2, and it seems fine, so does the code maybe run the images separately?

    How should I estimate the required GPU memory, and could the extractor be modified to run on multiple GPUs?

    opened by StarsTesla 2