[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Overview

[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Official Pytorch implementation of Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding (AAAI 2022).

Paper is at https://arxiv.org/pdf/2109.04872.pdf.

Paper explanation in Zhihu (in Chinese) is at https://zhuanlan.zhihu.com/p/446203594.

Abstract

Temporal grounding aims to localize a video moment which is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation with the research focus on designing complicated prediction heads or fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN), to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that our MMN achieves highly competitive performance compared with the state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winner solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space.

Updates

Dec, 2021 - We uploaded the code and trained weights for Charades-STA, ActivityNet-Captions and TACoS datasets.

Todo: The code for spatio-temporal video grounding (HC-STVG dataset) will be available soon.

Datasets

  • Download the video feature and the groundtruth provided by 2D-TAN.
  • Extract and put them in a dataset folder in the same directory as train_net.py. For configurations of feature/groundtruth's paths, please refer to ./mmn/config/paths_catalog.py. (ann_file is annotation, feat_file is the video feature)

Dependencies

Our code is developed on the third-party implementation of 2D-TAN, so we have similar dependencies with it, such as:

yacs h5py terminaltables tqdm pytorch transformers 

Quick Start

We provide scripts for simplifying training and inference. For training our model, we provide a script for each dataset (e.g., ./scripts/tacos_train.sh). For evaluating the performance, we provide ./scripts/eval.sh.

For example, for training model in TACoS dataset in tacos_train.sh, we need to select the right config in config and decide the GPU by yourself in gpus (gpu id in your server) and gpun (total number of gpus).

# find all configs in configs/
config=pool_tacos_128x128_k5l8
# set your gpu id
gpus=0,1
# number of gpus
gpun=2
# please modify it with different value (e.g., 127.0.0.2, 29502) when you run multi mmn task on the same machine
master_addr=127.0.0.3
master_port=29511

Similarly, to evaluate the model, just change the information in eval.sh. Our trained weights for three datasets are in the Google Drive.

Citation

If you find our code useful, please generously cite our paper. (AAAI version bibtex will be updated later)

@article{DBLP:journals/corr/abs-2109-04872,
  author    = {Zhenzhi Wang and
               Limin Wang and
               Tao Wu and
               Tianhao Li and
               Gangshan Wu},
  title     = {Negative Sample Matters: {A} Renaissance of Metric Learning for Temporal
               Grounding},
  journal   = {CoRR},
  volume    = {abs/2109.04872},
  year      = {2021}
}

Contact

For any question, please raise an issue (preferred) or contact

Zhenzhi Wang: [email protected]

Acknowledgement

We appreciate 2D-TAN for video feature and configurations, and the third-party implementation of 2D-TAN for its implementation with DistributedDataParallel. Disclaimer: the performance gain of this third-party implementation is due to a tiny mistake of adding val set into training, yet our reproduced result is similar to the reported result in 2D-TAN paper.

Comments
  • Experimental results are not the same when run the code multiple times

    Experimental results are not the same when run the code multiple times

    Hi,

    It's a great work in moment localization and achieves significant results! I have some questions about the results when running codes for multiple times. It seems that for the same code and the same hyper parameters, experimental results are not the same when run the code twice.

    Have you meet the same problem?Is there any solutions?

    Thanks!

    opened by LLLddddd 6
  • 关于论文中,只用BCE loss在activitinet上面效果的一点疑问

    关于论文中,只用BCE loss在activitinet上面效果的一点疑问

    您好, 您的工作提出了一个很好的针对video grounding的组织对比学习的范式。再各个数据集上都表现的很惊艳,令人印象深刻。 关于您论文里面消融实验部分,我有一点点疑问, 好像您论文里面只用BCEloss,在activitynet上面就可以达到(R@1,IoU=0.5)= 46.75,这个比原先2D-TAN高了两个点。是不是可以认为这两个点是bert带来的呢?

    opened by starmemda 3
  • I did not reproduce the scores in the paper, what is your environment when training?

    I did not reproduce the scores in the paper, what is your environment when training?

    Thank you for proposing a very interesting work. On Charades, since the original number of GPUs is 4 and the original batchsize is 48, I set batchsize as 24 in two 3090 for keeping the same samples on each GPU. Other configurations remain the same. However, I get the score are

    R@1,[email protected] = 45.35 (47.31 in paper)
    R@1,[email protected] = 26.30 (27.28 in paper)
    R@5,[email protected] = 84.21 (83.74 in paper)
    R@5,[email protected] = 57.02 (58.41 in paper)
    

    The excessive gap confuses me. So, what was your training environment, and if I don't have 4 GPUs, is there any way to get the score in the paper? Looking forward to your reply.

    opened by daidaiershidi 2
  • About the final prediction score s

    About the final prediction score s

    Hi, thanks for sharing your great work. In the paper, the final prediction score for a candidate moment is the product of s_iou and s_mm, but s is the cosine similarity result mentioned in Section 3.3, so the range of s is [-1, 1]. Do you actually mean the final matching score is s_iou * s_mm? Or you are trying to say the final score is the product of s after sigmoid function?

    opened by vin30731 2
  • about ending automatically after training

    about ending automatically after training

    Hi, when the model is finished training it cannot stop itself, I have to terminaterminate the taste the task, what causes this? How can I modify it so that it ends automatically after training?

    opened by menghuaa 1
  • How to choose the best trained model

    How to choose the best trained model

    Hi, how do I select a model for testing when I have trained it? Does it rely on the loss or the test effect on the validation set? Your code does not give how to choose the best model.

    opened by menghuaa 1
  • Great work. I just make it runnable out of the box.

    Great work. I just make it runnable out of the box.

    I fixed some small bugs to make it runnable.

    • rename dataset/Charades-STA to dataset/Charades_STA since all configs using dataset/Charades_STA.
    • set dataset root variable DATA_DIR to empty string since all configs already had prefix dataset/.
    • make train_net.py always enable distributed training even if the number of gpu is one. This can avoid the error message that the set_epoch used in multi-gpu does not exist in the samplers used by single GPU training.
    opened by w86763777 0
Owner
Multimedia Computing Group, Nanjing University
Multimedia Computing Group, Nanjing University
《LightXML: Transformer with dynamic negative sampling for High-Performance Extreme Multi-label Text Classification》(AAAI 2021) GitHub:

LightXML: Transformer with dynamic negative sampling for High-Performance Extreme Multi-label Text Classification

null 76 Dec 5, 2022
This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales

Intro This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales Vehicle Sam

null 39 Jul 21, 2022
Official PyTorch implementation of "Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning" (AAAI 2021)

Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning Official PyTorch implementation of "Proxy Synthesis: Learning with Synthetic

NAVER/LINE Vision 30 Dec 6, 2022
The Pytorch code of "Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification", CVPR 2022 (Oral).

DeepBDC for few-shot learning        Introduction In this repo, we provide the implementation of the following paper: "Joint Distribution Matters: Dee

FeiLong 116 Dec 19, 2022
Yolox-bytetrack-sample - Python sample of MOT (Multiple Object Tracking) using YOLOX and ByteTrack

yolox-bytetrack-sample YOLOXとByteTrackを用いたMOT(Multiple Object Tracking)のPythonサン

KazuhitoTakahashi 12 Nov 9, 2022
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

null 49 Nov 23, 2022
naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply copy/paste wherever you wish.

Max Halford 24 Dec 20, 2022
Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

[TCSVT] Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization LPN [Paper] NEWs Prerequisites Python 3.6 GPU Memory >= 8G Numpy > 1.

null 46 Dec 14, 2022
Normalization Matters in Weakly Supervised Object Localization (ICCV 2021)

Normalization Matters in Weakly Supervised Object Localization (ICCV 2021) 99% of the code in this repository originates from this link. ICCV 2021 pap

Jeesoo Kim 10 Feb 1, 2022
Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Duong H. Le 18 Jun 13, 2022
Code for One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (AAAI 2022)

One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (AAAI 2022) Paper | Demo Requirements Python >= 3.6 , Pytorch >

FuxiVirtualHuman 84 Jan 3, 2023
[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

LBYL-Net This repo implements paper Look Before You Leap: Learning Landmark Features For One-Stage Visual Grounding CVPR 2021. Getting Started Prerequ

SVIP Lab 45 Dec 12, 2022
Imposter-detector-2022 - HackED 2022 Team 3IQ - 2022 Imposter Detector

HackED 2022 Team 3IQ - 2022 Imposter Detector By Aneeljyot Alagh, Curtis Kan, Jo

Joshua Ji 3 Aug 20, 2022
CVPR2021: Temporal Context Aggregation Network for Temporal Action Proposal Refinement

Temporal Context Aggregation Network - Pytorch This repo holds the pytorch-version codes of paper: "Temporal Context Aggregation Network for Temporal

Zhiwu Qing 63 Sep 27, 2022
Implementation of temporal pooling methods studied in [ICIP'20] A Comparative Evaluation Of Temporal Pooling Methods For Blind Video Quality Assessment

Implementation of temporal pooling methods studied in [ICIP'20] A Comparative Evaluation Of Temporal Pooling Methods For Blind Video Quality Assessment

Zhengzhong Tu 5 Sep 16, 2022
Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

This repository is the official PyTorch implementation of Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

hippopmonkey 4 Dec 11, 2022
Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval This repo provides personal implementation of paper Approximate Ne

John 8 Oct 7, 2022
Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".

multilingual-mrc-isdg Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph". This r

Liyan 5 Dec 7, 2022
ShuttleNet: Position-aware Fusion of Rally Progress and Player Styles for Stroke Forecasting in Badminton (AAAI 2022)

ShuttleNet: Position-aware Rally Progress and Player Styles Fusion for Stroke Forecasting in Badminton (AAAI 2022) Official code of the paper ShuttleN

Wei-Yao Wang 11 Nov 30, 2022