CAMoE + Dual SoftMax Loss (DSL): Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

This is the official implementation of "Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss".

Open source project

We intend to release the dual softmax loss first; the full version will be available before the end of this year.

Abstract

Employing the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend that outperforms previous VTR methods. However, due to the heterogeneity of structure and content between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two kinds of heterogeneity. CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding parts of the text. At this stage, we conduct extensive explorations of the feature extraction module and the feature alignment module. DSL is proposed to avoid the one-way optimum match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser that corrects the similarity matrix and achieves the dual optimal match. DSL is easy to implement with only one line of code, yet it improves performance significantly. The results show that the proposed CAMoE and DSL are highly effective, and each of them is capable of achieving state-of-the-art (SOTA) results individually on various benchmarks such as MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet. Furthermore, with both of them combined, performance improves substantially, surpassing the previous SOTA methods by around 4~5% R@1 on average across various VTR datasets.
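
As a rough illustration of the idea described in the abstract, the following sketch revises a text-by-video similarity matrix with a softmax prior taken along the opposite axis, then applies an ordinary softmax cross-entropy. It is written in plain Python for readability and is not the released implementation; the function names, the `temperature` and `logit_scale` values, and the toy matrix are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax_revise(sim, temperature=0.01):
    """Rescale sim[i][j] (text i vs. video j) by a prior computed per column.

    The prior for column j is a softmax over all texts, so a video that
    clearly belongs to one text is down-weighted for every other text.
    `temperature` is an assumed value, not one taken from the paper.
    """
    columns = [softmax([s / temperature for s in col]) for col in zip(*sim)]
    return [
        [sim[i][j] * columns[j][i] for j in range(len(sim[i]))]
        for i in range(len(sim))
    ]


def text_to_video_loss(sim, logit_scale=100.0):
    """Mean cross-entropy over rows, with the true pair on the diagonal."""
    losses = []
    for i, row in enumerate(sim):
        probs = softmax([logit_scale * s for s in row])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


def best_match(row):
    """Index of the highest-scoring candidate in a row."""
    return max(range(len(row)), key=row.__getitem__)


# Toy batch (hypothetical scores): both texts initially rank video 0 first.
raw = [[0.30, 0.20],
       [0.28, 0.25]]
revised = dual_softmax_revise(raw)

print([best_match(r) for r in raw])      # [0, 0]
print([best_match(r) for r in revised])  # [0, 1]
print(text_to_video_loss(raw) > text_to_video_loss(revised))  # True
```

In this toy batch the second text initially prefers video 0, which is also the best match for the first text; after the revision it prefers video 1. The video-to-text direction would apply the same steps to the transposed matrix. With a tensor library the revision step reduces to a single element-wise product, which is presumably what the one-line claim refers to.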

Experimental Results

Dataset       T2V R@1 (%)   V2T R@1 (%)
MSR-VTT-1k    48.8          50.3
MSR-VTT       32.9          59.8
MSVD          51.8          69.3
DiDeMo        43.8          45.5
ActivityNet   51.0          49.9

Note that all of these results achieve SOTA.
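
For readers unfamiliar with the metric, R@1 is the percentage of queries whose ground-truth match is ranked first. Below is a minimal sketch, assuming a square similarity matrix in which query i is paired with candidate i (as in a one-caption-per-video split); `recall_at_k`, `transpose`, and the toy scores are illustrative and not taken from the repository.

```python
def recall_at_k(sim, k=1):
    """Percentage of rows whose diagonal entry ranks within the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the true pair.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap the query and candidate axes."""
    return [list(col) for col in zip(*sim)]


# Hypothetical scores: rows are texts, columns are videos.
sim = [[0.9, 0.1, 0.2],
       [0.3, 0.8, 0.4],
       [0.7, 0.2, 0.6]]

t2v_r1 = recall_at_k(sim, k=1)             # text-to-video
v2t_r1 = recall_at_k(transpose(sim), k=1)  # video-to-text
print(round(t2v_r1, 1), round(v2t_r1, 1))  # 66.7 100.0
```

Splits with several captions per video need an explicit ground-truth mapping instead of the diagonal assumption used here.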

Citation

Please cite this article as:

@misc{cheng2021improving,
      title={Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss}, 
      author={Xing Cheng and Hezheng Lin and Xiangyu Wu and Fan Yang and Dong Shen},
      year={2021},
      eprint={2109.04290},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Comments
  • Some users have raised valuable questions, and we'd like to share them here.

    Hi, I'm trying to evaluate the effectiveness of the DSL loss. I've tested it with CLIP4Clip, but the improvement was not consistent with your paper. I got 44.1% V2T-R@1 with the original loss and 44.6% V2T-R@1 with the DSL loss. I revised the similarity matrix at both training and inference time. Can you help me?

    opened by idealwei 33
  • Does DSL only work when captions and videos are paired along the diagonal?

    It seems that captions and videos must be paired one-to-one along the diagonal.

    I am trying to evaluate DSL on the MSRVTT full split (2990 videos and 2990*20 captions), but DSL didn't work. However, on the MSRVTT 1k split (1000 videos and 1000 captions), it works well (49.0% V2T-R@1 and 47.8% T2V-R@1). My model is CLIP4Clip.

    Therefore, the video-text matching information needs to be known in advance. Could you report a comparative experiment with random shuffling at evaluation time? If random shuffling invalidates DSL, I would suspect a data leak.

    opened by fly-dragon211 2
  • DSL temperature

    Thanks for your great work; it looks like a very strong baseline to explore. A simple question about DSL: what temperature did you use in your experiments? I can't find it anywhere in your paper.

    opened by miziha-zp 2
  • Contradiction between code and formula

    Hi, I have read your paper, but I have a question. According to the code you released, your formula is inconsistent with the code. Which one should I use?

    opened by char-Fyzhao 1
  • DSL in evaluation

    Thanks for your work! I am unable to reproduce the results using DSL. Are you using DSL in both the training and evaluation stages, or only in the training stage? Thanks.

    opened by liuyuyuil 1
  • Question about License

    Hello,

    Open source project: "We intended to publish the dual softmax loss first; the entire version will be available before the end of this year."

    I'm looking forward to the full version!

    By the way, are you going to specify an open-source license for this project?

    opened by kondounagi 0
  • Question about this retrieval setup

    Hi, thanks for your work. I read the paper, and the boost from DSL is substantial, so this is a worthwhile finding. However, my main criticisms when using this in practice would be: a) inference requires all text queries to be processed together in order to use the prior; b) the prior that there is a one-to-one mapping between test-set queries and videos does not always hold in the real world. You could do this for classification tasks if you knew all classes had equal frequency, but in practice this is not the case. So I think this is an unrealistic setup for text-to-video retrieval: a user could spam the text query "boy running" 100 times, and this would cause catastrophic results for DSL.

    Do you have results when this is used during training but not at test time? If it helps in that case, it would be good to know.

    opened by m-bain 5
Owner
程星
Master's student at ICT, majoring in CV.