[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Multimedia Computing Group, Nanjing University

Last update: Dec 26, 2022

Official PyTorch implementation of Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding (AAAI 2022).

The paper is available at https://arxiv.org/pdf/2109.04872.pdf.

An explanation of the paper on Zhihu (in Chinese) is available at https://zhuanlan.zhihu.com/p/446203594.

Abstract

Temporal grounding aims to localize a video moment which is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation with the research focus on designing complicated prediction heads or fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN), to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that our MMN achieves highly competitive performance compared with the state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winner solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space.
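
As a rough illustration of the metric-learning view described in the abstract, the sketch below scores a query against a moment with cosine similarity and applies a generic InfoNCE-style contrastive loss in both directions (query to moment and moment to query). It is not the repository's implementation: the function names, the temperature value, and the exact form of the loss are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive pair among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]

def mutual_matching_loss(query, moment, other_queries, other_moments,
                         temperature=0.1):
    """Assumed two-way loss: each modality is matched against the other."""
    # Query as anchor: negatives are non-matching moments
    # (from the same video or from other videos).
    query_to_moment = info_nce(query, moment, other_moments, temperature)
    # Moment as anchor: negatives are non-matching queries.
    moment_to_query = info_nce(moment, query, other_queries, temperature)
    return query_to_moment + moment_to_query
```

With this form, a well-aligned query and moment pair yields a loss close to zero, while a pair that is more similar to its negatives than to each other yields a large loss.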

Updates

Dec, 2021 - We uploaded the code and trained weights for Charades-STA, ActivityNet-Captions and TACoS datasets.

Todo: The code for spatio-temporal video grounding (HC-STVG dataset) will be available soon.

Datasets

Download the video features and ground-truth annotations provided by 2D-TAN.
Extract them into a dataset folder in the same directory as train_net.py. The feature and ground-truth paths are configured in ./mmn/config/paths_catalog.py (ann_file is the annotation file, feat_file is the video feature file).
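
Before training, it can help to confirm that every configured file actually exists. The sketch below is a minimal check under stated assumptions: the dictionary layout, the dataset name, and the file paths are hypothetical and only mirror the ann_file / feat_file keys mentioned above; the real entries live in ./mmn/config/paths_catalog.py.

```python
import os

def missing_dataset_files(catalog, root="."):
    """Return (dataset, key, path) triples whose file does not exist under root."""
    missing = []
    for name, entry in catalog.items():
        for key in ("ann_file", "feat_file"):
            path = os.path.join(root, entry[key])
            if not os.path.exists(path):
                missing.append((name, key, path))
    return missing

# Hypothetical entry for illustration only; it is not copied from the repository.
example_catalog = {
    "example_train": {
        "ann_file": "dataset/example/train.json",
        "feat_file": "dataset/example/features.hdf5",
    },
}

if __name__ == "__main__":
    for name, key, path in missing_dataset_files(example_catalog):
        print(f"{name}: {key} not found at {path}")
```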

Dependencies

Our code is built on the third-party implementation of 2D-TAN, so its dependencies are similar, including:

yacs h5py terminaltables tqdm pytorch transformers
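
A quick way to see which of these packages are missing from the current environment is sketched below. It uses only the standard library; the only assumption is the import names, where pytorch is imported as torch and the others are assumed to be imported under their package names.

```python
import importlib.util

# Import names for the packages listed above; "pytorch" is imported as "torch".
REQUIRED_MODULES = ["yacs", "h5py", "terminaltables", "tqdm", "torch", "transformers"]

def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```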

Quick Start

We provide scripts to simplify training and inference. To train the model, use the script for the corresponding dataset (e.g., ./scripts/tacos_train.sh). To evaluate performance, use ./scripts/eval.sh.

For example, to train the model on the TACoS dataset with tacos_train.sh, select the appropriate config in config and choose the GPUs yourself via gpus (the GPU ids on your server) and gpun (the total number of GPUs).

# find all configs in configs/
config=pool_tacos_128x128_k5l8
# set your gpu id
gpus=0,1
# number of gpus
gpun=2
# change these to different values (e.g., 127.0.0.2, 29502) when running multiple MMN tasks on the same machine
master_addr=127.0.0.3
master_port=29511
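
When several tasks share one machine, each needs its own master_port. Instead of picking a number by hand, one option is to ask the operating system for an unused port, as in the sketch below. The helper name is hypothetical and the scripts in this repository do not call it; the returned value would still have to be copied into master_port manually.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is currently unused on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]

if __name__ == "__main__":
    print("master_port candidate:", find_free_port())
```

The port is released as soon as the function returns, so another process could in principle claim it before training starts; for a handful of local runs this is usually acceptable.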

Similarly, to evaluate the model, edit the corresponding settings in eval.sh. Our trained weights for the three datasets are available on Google Drive.

Citation

If you find our code useful, please cite our paper. (The BibTeX entry for the AAAI version will be added later.)

@article{DBLP:journals/corr/abs-2109-04872,
  author    = {Zhenzhi Wang and
               Limin Wang and
               Tao Wu and
               Tianhao Li and
               Gangshan Wu},
  title     = {Negative Sample Matters: {A} Renaissance of Metric Learning for Temporal
               Grounding},
  journal   = {CoRR},
  volume    = {abs/2109.04872},
  year      = {2021}
}

Contact

For any questions, please open an issue (preferred) or contact

Zhenzhi Wang: [email protected]

Acknowledgement

We thank 2D-TAN for its video features and configurations, and the third-party implementation of 2D-TAN for its DistributedDataParallel implementation. Disclaimer: the performance gain of this third-party implementation comes from a small mistake of adding the val set to training; nevertheless, our reproduced result is similar to the result reported in the 2D-TAN paper.

Comments

Experimental results differ when running the code multiple times

Hi,

This is great work on moment localization and it achieves significant results! I have a question about the results when running the code multiple times. With the same code and the same hyperparameters, the experimental results are not the same when the code is run twice.

Have you met the same problem? Is there any solution?

Thanks!

opened by LLLddddd 6
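A question about the ActivityNet result obtained with only the BCE loss in the paper

Hello, your work proposes a very good paradigm for organizing contrastive learning for video grounding. Its performance on every dataset is impressive. I have a small question about the ablation study in your paper: it seems that with only the BCE loss, ActivityNet already reaches (R@1, IoU=0.5) = 46.75, which is about two points higher than the original 2D-TAN. Can these two points be attributed to BERT?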

opened by starmemda 3
I could not reproduce the scores in the paper. What is your training environment?
Thank you for proposing very interesting work. On Charades, since the original setup uses 4 GPUs with a batch size of 48, I set the batch size to 24 on two 3090s to keep the same number of samples on each GPU. Other configurations remain the same. However, the scores I get are:

R@1, IoU@0.5 = 45.35 (47.31 in paper)
R@1, IoU@0.7 = 26.30 (27.28 in paper)
R@5, IoU@0.5 = 84.21 (83.74 in paper)
R@5, IoU@0.7 = 57.02 (58.41 in paper)

The size of the gap confuses me. What was your training environment, and if I don't have 4 GPUs, is there any way to reach the scores in the paper? Looking forward to your reply.
opened by daidaiershidi 2
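
The batch-size arithmetic in the comment above can be checked with a short sketch. It assumes, as the comment does, that the configured batch size is the total across all GPUs and is split evenly; the function name is hypothetical.

```python
def per_gpu_batch(total_batch, num_gpus):
    """Samples processed by each GPU per step when the batch is split evenly."""
    if total_batch % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch // num_gpus

paper_setup = per_gpu_batch(48, 4)    # 4 GPUs, total batch size 48
two_gpu_setup = per_gpu_batch(24, 2)  # 2 GPUs, total batch size 24
print(paper_setup, two_gpu_setup)     # both are 12
```

The per-GPU load is the same in both setups, but the total number of samples per optimization step drops from 48 to 24. Whether this accounts for the reported gap is not established here; it is only one difference between the two setups.
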
About the final prediction score s

Hi, thanks for sharing your great work. In the paper, the final prediction score for a candidate moment is the product of s_iou and s_mm, but s is the cosine similarity mentioned in Section 3.3, so the range of s is [-1, 1]. Do you actually mean that the final matching score is s_iou * s_mm? Or do you mean that the final score is the product of the scores after a sigmoid function?

opened by vin30731 2
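
The concern in the comment above can be made concrete with a small sketch. It does not state which variant the repository uses; it only contrasts the two readings raised in the question, with hypothetical function names.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def product_raw(s_iou, s_mm):
    """Product of two raw cosine similarities, each in [-1, 1]."""
    return s_iou * s_mm

def product_after_sigmoid(s_iou, s_mm):
    """Product of the two scores after each is passed through a sigmoid."""
    return sigmoid(s_iou) * sigmoid(s_mm)

poor_moment = (-0.8, -0.9)  # both scores say "no match"
fair_moment = (0.5, 0.6)    # both scores say "likely match"

# Raw product: two negatives multiply to a positive, so the poor moment wins.
print(product_raw(*poor_moment) > product_raw(*fair_moment))
# Sigmoid first: both factors are positive, so the ranking is preserved.
print(product_after_sigmoid(*poor_moment) < product_after_sigmoid(*fair_moment))
```

Both comparisons evaluate to True, which shows why multiplying raw values in [-1, 1] can rank a doubly mismatched moment above a reasonable one.
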
About ending automatically after training

Hi, when the model finishes training, the process does not stop by itself and I have to terminate the task. What causes this? How can I modify the code so that it ends automatically after training?

opened by menghuaa 1
How to choose the best trained model

Hi, how do I select a model for testing after training? Should the choice rely on the loss or on the performance on the validation set? Your code does not show how to choose the best model.

opened by menghuaa 1
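
One common practice for the question above is to evaluate every saved checkpoint on a validation split and keep the one with the highest target metric. The sketch below illustrates only that selection step; the checkpoint names, metric key, and numbers are made up, and the repository may handle this differently.

```python
def select_best_checkpoint(val_results, metric):
    """Return the checkpoint name with the highest value of the given metric."""
    if not val_results:
        raise ValueError("no validation results given")
    return max(val_results, key=lambda name: val_results[name][metric])

# Made-up checkpoint names and scores, for illustration only.
val_results = {
    "checkpoint_a": {"R@1,IoU=0.5": 40.0},
    "checkpoint_b": {"R@1,IoU=0.5": 42.5},
    "checkpoint_c": {"R@1,IoU=0.5": 41.0},
}

print(select_best_checkpoint(val_results, "R@1,IoU=0.5"))  # checkpoint_b
```
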
Great work. I just made it runnable out of the box.
I fixed some small bugs to make it runnable.

Rename dataset/Charades-STA to dataset/Charades_STA, since all configs use dataset/Charades_STA.

Set the dataset root variable DATA_DIR to an empty string, since all configs already have the prefix dataset/.

Make train_net.py always enable distributed training, even if there is only one GPU. This avoids the error that set_epoch, which is used in multi-GPU training, does not exist in the samplers used for single-GPU training.
opened by w86763777 0
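
The set_epoch error described above arises because PyTorch's DistributedSampler defines set_epoch while the samplers used for single-GPU loading do not. A different way to avoid the error, not the fix proposed in the comment, is to call set_epoch only when the sampler provides it. The sketch below uses stand-in classes instead of real PyTorch samplers, and the helper name is hypothetical.

```python
def maybe_set_epoch(sampler, epoch):
    """Call sampler.set_epoch(epoch) if the sampler defines it; report whether it did."""
    set_epoch = getattr(sampler, "set_epoch", None)
    if callable(set_epoch):
        set_epoch(epoch)
        return True
    return False

class FakeDistributedSampler:
    """Stand-in for a sampler that reshuffles per epoch, as DistributedSampler does."""
    def __init__(self):
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

class FakePlainSampler:
    """Stand-in for a single-GPU sampler without set_epoch."""

distributed = FakeDistributedSampler()
print(maybe_set_epoch(distributed, 3), distributed.epoch)
print(maybe_set_epoch(FakePlainSampler(), 3))
```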
