Joint Versus Independent Multiview Hashing for Cross-View Retrieval[J] (IEEE TCYB 2021, PyTorch Code)

Related tags

Deep Learning DCHN
Overview

2021-IEEE TCYB-DCHN

Peng Hu, Xi Peng, Hongyuan Zhu, Jie Lin, Liangli Zhen, Dezhong Peng, Joint Versus Independent Multiview Hashing for Cross-View Retrieval[J]. IEEE Transactions on Cybernetics, vol. 51, no. 10, pp. 4982-4993, Oct. 2021. (PyTorch Code)

Abstract

Thanks to the low storage cost and high query speed, cross-view hashing (CVH) has been successfully used for similarity search in multimedia retrieval. However, most existing CVH methods use all views to learn a common Hamming space, thus making it difficult to handle the data with increasing views or a large number of views. To overcome these difficulties, we propose a decoupled CVH network (DCHN) approach which consists of a semantic hashing autoencoder module (SHAM) and multiple multiview hashing networks (MHNs). To be specific, SHAM adopts a hashing encoder and decoder to learn a discriminative Hamming space using either a few labels or the number of classes, that is, the so-called flexible inputs. After that, MHN independently projects all samples into the discriminative Hamming space that is treated as an alternative ground truth. In brief, the Hamming space is learned from the semantic space induced from the flexible inputs, which is further used to guide view-specific hashing in an independent fashion. Thanks to such an independent/decoupled paradigm, our method could enjoy high computational efficiency and the capacity of handling the increasing number of views by only using a few labels or the number of classes. For a newly coming view, we only need to add a view-specific network into our model and avoid retraining the entire model using the new and previous views. Extensive experiments are carried out on five widely used multiview databases compared with 15 state-of-the-art approaches. The results show that the proposed independent hashing paradigm is superior to the common joint ones while enjoying high efficiency and the capacity of handling newly coming views.

Framework

DCHN

Figure 1. Framework of the proposed DCHN method. g is the output of the corresponding view (i.e., image, text, video, etc.). o is the semantic hash code that is computed by the corresponding label y and semantic hashing transformation W. W is computed by the proposed semantic hashing autoencoder module (SHAM). sgn is an elementwise sign function. ℒR and ℒH are hash reconstruction and semantic hashing functions, respectively. In the training stage, first, W is used to recast the label y as a ground-truth hash code o. Then, the obtained hash code is used to guide view-specific networks with a semantic hashing reconstruction regularizer. Such a learning scheme makes the v view-specific neural networks (one network for each view) can be trained separately since they are decoupled and do not share any trainable parameters. Therefore, our DCHN can be easy to scale to a large number of views. In the inference stage, each trained view-specific network fk(xk, Θk) is used to compute the hash code of the sample xk.

SHAM

Figure 1. Proposed SHAM utilizes the semantic information (e.g., labels or classes) to learn an encoder W and a decoder WT by mutually converting the semantic and Hamming spaces. SHAM is one key component of our independent hashing paradigm.

Usage

First, to train SHAM wtih 64 bits on MIRFLICKR-25K, just run trainSHAM.py as follows:

python trainSHAM.py --datasets mirflickr25k --output_shape 64 --gama 1 --available_num 100

Then, to train a model for image modality wtih 64 bits on MIRFLICKR-25K, just run main_DCHN.py as follows:

python main_DCHN.py --mode train --epochs 100 --view 0 --datasets mirflickr25k --output_shape 64 --alpha 0.02 --gama 1 --available_num 100 --gpu_id 0

For text modality:

python main_DCHN.py --mode train --epochs 100 --view 1 --datasets mirflickr25k --output_shape 64 --alpha 0.02 --gama 1 --available_num 100 --gpu_id 1

To evaluate the trained models, you could run main_DCHN.py as follows:

python main_DCHN.py --mode eval --view -1 --datasets mirflickr25k --output_shape 64 --alpha 0.02 --gama 1 --available_num 100 --num_workers 0

Comparison with the State-of-the-Art

Table 1: Performance comparison in terms of MAP scores on the MIRFLICKR-25K and IAPR TC-12 datasets. The highest MAP score is shown in bold.

   Method    MIRFLICKR-25K IAPR TC-12
Image → Text Text → Image Image → Text Text → Image
16 32 64 128 16 32 64 128 16 32 64 128 16 32 64 128
Baseline 0.581 0.520 0.553 0.573 0.578 0.544 0.556 0.579 0.329 0.292 0.309 0.298 0.332 0.295 0.311 0.304
SePH [21] 0.729 0.738 0.744 0.750 0.753 0.762 0.764 0.769 0.467 0.476 0.486 0.493 0.463 0.475 0.485 0.492
SePHlr [12] 0.729 0.746 0.754 0.763 0.760 0.780 0.785 0.793 0.410 0.434 0.448 0.463 0.461 0.495 0.515 0.525
RoPH [34] 0.733 0.744 0.749 0.756 0.757 0.759 0.768 0.771 0.457 0.481 0.493 0.500 0.451 0.478 0.488 0.495
LSRH [22] 0.756 0.780 0.788 0.800 0.772 0.786 0.791 0.802 0.474 0.490 0.512 0.522 0.474 0.492 0.511 0.526
KDLFH [23] 0.734 0.755 0.770 0.771 0.764 0.780 0.794 0.797 0.306 0.314 0.351 0.357 0.307 0.315 0.350 0.356
DLFH [23] 0.721 0.743 0.760 0.767 0.761 0.788 0.805 0.810 0.306 0.314 0.326 0.340 0.305 0.315 0.333 0.353
MTFH [13] 0.581 0.571 0.645 0.543 0.584 0.556 0.633 0.531 0.303 0.303 0.307 0.300 0.303 0.303 0.308 0.302
DJSRH [14] 0.620 0.630 0.645 0.660 0.620 0.626 0.645 0.649 0.368 0.396 0.419 0.439 0.370 0.400 0.423 0.437
DCMH [9] 0.737 0.754 0.763 0.771 0.753 0.760 0.763 0.770 0.423 0.439 0.456 0.463 0.449 0.464 0.476 0.481
SSAH [20] 0.797 0.809 0.810 0.802 0.782 0.797 0.799 0.790 0.501 0.503 0.496 0.479 0.504 0.530 0.554 0.565
DCHN0 0.806 0.823 0.836 0.842 0.797 0.808 0.823 0.827 0.487 0.492 0.550 0.573 0.481 0.488 0.543 0.567
DCHN100 0.813 0.816 0.823 0.840 0.808 0.803 0.814 0.830 0.533 0.558 0.582 0.596 0.527 0.557 0.582 0.595

Table 2: Performance comparison in terms of MAP scores on the NUS-WIDE and MS-COCO datasets. The highest MAP score is shown in bold.

   Method    NUS-WIDE MS-COCO
Image → Text Text → Image Image → Text Text → Image
16 32 64 128 16 32 64 128 16 32 64 128 16 32 64 128
Baseline 0.281 0.337 0.263 0.341 0.299 0.339 0.276 0.346 0.362 0.336 0.332 0.373 0.348 0.341 0.347 0.359
SePH [21] 0.644 0.652 0.661 0.664 0.654 0.662 0.670 0.673 0.586 0.598 0.620 0.628 0.587 0.594 0.618 0.625
SePHlr [12] 0.607 0.624 0.644 0.651 0.630 0.649 0.665 0.672 0.527 0.571 0.592 0.600 0.555 0.596 0.618 0.621
RoPH [34] 0.638 0.656 0.662 0.669 0.645 0.665 0.671 0.677 0.592 0.634 0.649 0.657 0.587 0.628 0.643 0.652
LSRH [22] 0.622 0.650 0.659 0.690 0.600 0.662 0.685 0.692 0.580 0.563 0.561 0.567 0.580 0.611 0.615 0.632
KDLFH [23] 0.323 0.367 0.364 0.403 0.325 0.365 0.368 0.408 0.373 0.403 0.451 0.542 0.370 0.400 0.449 0.542
DLFH [23] 0.316 0.367 0.381 0.404 0.319 0.379 0.386 0.415 0.352 0.398 0.455 0.443 0.359 0.393 0.456 0.442
MTFH [13] 0.265 0.473 0.434 0.445 0.243 0.418 0.414 0.485 0.288 0.264 0.311 0.413 0.301 0.284 0.310 0.406
DJSRH [14] 0.433 0.453 0.467 0.442 0.457 0.468 0.468 0.501 0.478 0.520 0.544 0.566 0.462 0.525 0.550 0.567
DCMH [9] 0.569 0.595 0.612 0.621 0.548 0.573 0.585 0.592 0.548 0.575 0.607 0.625 0.568 0.595 0.643 0.664
SSAH [20] 0.636 0.636 0.637 0.510 0.653 0.676 0.683 0.682 0.550 0.577 0.576 0.581 0.552 0.578 0.578 0.669
DCHN0 0.648 0.660 0.669 0.683 0.662 0.677 0.685 0.697 0.602 0.658 0.682 0.706 0.591 0.652 0.669 0.696
DCHN100 0.654 0.671 0.681 0.691 0.668 0.683 0.697 0.707 0.662 0.701 0.703 0.720 0.650 0.689 0.693 0.714

Citation

If you find DCHN useful in your research, please consider citing:

@article{hu2021joint,
  author={Hu, Peng and Peng, Xi and Zhu, Hongyuan and Lin, Jie and Zhen, Liangli and Peng, Dezhong},
  journal={IEEE Transactions on Cybernetics}, 
  title={Joint Versus Independent Multiview Hashing for Cross-View Retrieval}, 
  year={2021},
  volume={51},
  number={10},
  pages={4982-4993},
  doi={10.1109/TCYB.2020.3027614}}
}
You might also like...
pytorch implementation of
pytorch implementation of "Contrastive Multiview Coding", "Momentum Contrast for Unsupervised Visual Representation Learning", and "Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination"

Unofficial implementation: MoCo: Momentum Contrast for Unsupervised Visual Representation Learning (Paper) InsDis: Unsupervised Feature Learning via N

A Joint Video and Image Encoder for End-to-End Retrieval
A Joint Video and Image Encoder for End-to-End Retrieval

Frozen️ in Time ❄️ ️️️️ ⏳ A Joint Video and Image Encoder for End-to-End Retrieval project page | arXiv | webvid-data Repository containing the code,

Code for 'Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning', ICCV 2021
Code for 'Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning', ICCV 2021

CMIC-Retrieval Code for Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning. ICCV 2021. Introduction In this wo

PyTorch code for the paper "Complementarity is the King: Multi-modal and Multi-grained Hierarchical Semantic Enhancement Network for Cross-modal Retrieval".

Complementarity is the King: Multi-modal and Multi-grained Hierarchical Semantic Enhancement Network for Cross-modal Retrieval (M2HSE) PyTorch code fo

CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

UC2 UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu,

This repo contains the official code of our work SAM-SLR which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.
This repo contains the official code of our work SAM-SLR which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.

Skeleton Aware Multi-modal Sign Language Recognition By Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li and Yun Fu. Smile Lab @ Northeastern

Generative Models as a Data Source for Multiview Representation Learning
Generative Models as a Data Source for Multiview Representation Learning

GenRep Project Page | Paper Generative Models as a Data Source for Multiview Representation Learning Ali Jahanian, Xavier Puig, Yonglong Tian, Phillip

Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance
Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance

Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance Project Page | Paper | Data This repository contains an implementatio

Multiview 3D object detection on MultiviewC dataset through moft3d.
Multiview 3D object detection on MultiviewC dataset through moft3d.

Multiview Orthographic Feature Transformation for 3D Object Detection Multiview 3D object detection on MultiviewC dataset through moft3d. Introduction

Owner
https://penghu-cs.github.io/
null
Fake videos detection by tracing the source using video hashing retrieval.

Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos ??️ ?? Directory Introduction VTL Trace Samples and Acc of Hash

null 56 Dec 22, 2022
Tracing Versus Freehand for Evaluating Computer-Generated Drawings (SIGGRAPH 2021)

Tracing Versus Freehand for Evaluating Computer-Generated Drawings (SIGGRAPH 2021) Zeyu Wang, Sherry Qiu, Nicole Feng, Holly Rushmeier, Leonard McMill

Zach Zeyu Wang 23 Dec 9, 2022
《Where am I looking at? Joint Location and Orientation Estimation by Cross-View Matching》(CVPR 2020)

This contains the codes for cross-view geo-localization method described in: Where am I looking at? Joint Location and Orientation Estimation by Cross-View Matching, CVPR2020.

null 41 Oct 27, 2022
This is the official implementation of 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection, built on SECOND.

3D-CVF This is the official implementation of 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object

YecheolKim 97 Dec 20, 2022
Code for "Single-view robot pose and joint angle estimation via render & compare", CVPR 2021 (Oral).

Single-view robot pose and joint angle estimation via render & compare Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic CVPR: Conference on C

Yann Labbé 51 Oct 14, 2022
Joint Learning of 3D Shape Retrieval and Deformation, CVPR 2021

Joint Learning of 3D Shape Retrieval and Deformation Joint Learning of 3D Shape Retrieval and Deformation Mikaela Angelina Uy, Vladimir G. Kim, Minhyu

Mikaela Uy 38 Oct 18, 2022
Losslandscapetaxonomy - Taxonomizing local versus global structure in neural network loss landscapes

Taxonomizing local versus global structure in neural network loss landscapes Int

Yaoqing Yang 8 Dec 30, 2022
Blender add-on: Add to Cameras menu: View → Camera, View → Add Camera, Camera → View, Previous Camera, Next Camera

Blender add-on: Camera additions In 3D view, it adds these actions to the View|Cameras menu: View → Camera : set the current camera to the 3D view Vie

German Bauer 11 Feb 8, 2022
(CVPR 2022 - oral) Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry Official implementation of the paper Multi-View Depth Est

Bae, Gwangbin 138 Dec 28, 2022
[CVPR'21] Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation

Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation Weixiang Yang, Qi Li, Wenxi Liu, Yuanlong Yu, Y

null 118 Dec 26, 2022