UMT is a unified and flexible framework that handles different combinations of input modalities and outputs video moment retrieval and/or highlight detection results.

Applied Research Center (ARC), Tencent PCG

Last update: Jan 4, 2023


Overview

Unified Multi-modal Transformers

This repository maintains the official implementation of the paper UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection by Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, and Xiaohu Qie, which has been accepted by CVPR 2022.

Installation

We use the environment listed below. If automatic installation fails, you may install these packages manually.

CUDA 11.5.0
CUDNN 8.3.2.44
Python 3.10.0
PyTorch 1.11.0
NNCore 0.3.6
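
To compare a local setup against the versions above, a small check along these lines may help. It is only a sketch: the distribution names `torch` and `nncore` are assumptions about how the packages are registered with pip, and CUDA and cuDNN are not covered because they are not Python packages.

```python
from importlib import metadata

# Versions listed in the environment above. The distribution names
# ("torch", "nncore") are assumptions; adjust them if your install differs.
EXPECTED = {"torch": "1.11.0", "nncore": "0.3.6"}


def check_versions(expected=EXPECTED):
    """Map each package to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "1.11.0+cu115" still count as a match.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions().items():
        print(f"{name}: installed={found} expected={EXPECTED[name]} ok={ok}")
```

A package that is not installed is reported as `(None, False)` rather than raising an error, so the check can run before the dependencies are in place.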

Install from source

Clone the repository from GitHub.

git clone https://github.com/TencentARC/UMT.git
cd UMT

Install dependencies.

pip install -r requirements.txt

Getting Started

Download and prepare the datasets

Download and extract the datasets.

Prepare the files in the following structure.

UMT
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── *features
│   │   ├── highlight_{train,val,test}_release.jsonl
│   │   └── subs_train.jsonl
│   ├── charades
│   │   ├── *features
│   │   └── charades_sta_{train,test}.txt
│   ├── youtube
│   │   ├── *features
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── *features
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
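
Before training, it can be useful to confirm that the `data` folder matches the tree above. The following sketch lists whatever is missing. The helper name `missing_entries` is hypothetical, and `*features` is kept as a wildcard because the exact feature folder names are not spelled out in the listing.

```python
from pathlib import Path

# Expected entries under data/, transcribed from the tree above.
# "*features" stays a glob pattern: the concrete folder names are unknown here.
EXPECTED = {
    "qvhighlights": [
        "*features",
        "highlight_train_release.jsonl",
        "highlight_val_release.jsonl",
        "highlight_test_release.jsonl",
        "subs_train.jsonl",
    ],
    "charades": ["*features", "charades_sta_train.txt", "charades_sta_test.txt"],
    "youtube": ["*features", "youtube_anno.json"],
    "tvsum": ["*features", "tvsum_anno.json"],
}


def missing_entries(data_root, datasets=None):
    """Return the expected paths (as strings) that are absent under data_root."""
    root = Path(data_root)
    missing = []
    for name in datasets or EXPECTED:
        for pattern in EXPECTED[name]:
            # glob() handles both literal names and the "*features" wildcard.
            if not any((root / name).glob(pattern)):
                missing.append(f"{name}/{pattern}")
    return missing


if __name__ == "__main__":
    for entry in missing_entries("data"):
        print("missing:", entry)
```

Pass a subset such as `["tvsum"]` as the second argument if only some datasets have been downloaded.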

Train a model

Run the following command to train a model using a specified config.

# Single GPU
python tools/launch.py ${path-to-config}

# Multiple GPUs
torchrun --nproc_per_node=${num-gpus} tools/launch.py ${path-to-config}

Test a model and evaluate results

Run the following command to test a model and evaluate results.

python tools/launch.py ${path-to-config} --checkpoint ${path-to-checkpoint} --eval
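
When the commands above are driven from a script (for example, to sweep over several configs), they can be assembled programmatically. This is a sketch only: `launch_command` is a hypothetical helper, it uses just the flags shown in this README, and combining `torchrun` with `--eval` is permitted by the helper but not documented here.

```python
import shlex


def launch_command(config, num_gpus=1, checkpoint=None, evaluate=False):
    """Build the argument list for tools/launch.py from the commands above."""
    if num_gpus > 1:
        cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "tools/launch.py", config]
    else:
        cmd = ["python", "tools/launch.py", config]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    if evaluate:
        cmd.append("--eval")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it. To execute, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(launch_command("path/to/config.py", num_gpus=4)))
```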

Pre-train with ASR captions on QVHighlights

Run the following command to pre-train a model using ASR captions on QVHighlights.

torchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py

Model Zoo

We provide multiple pre-trained models and training logs here. All the models are trained with a single NVIDIA Tesla V100-FHHL-16GB GPU and are evaluated using the default metrics of the datasets.

Dataset	Model	Type	MR mAP		HD mAP		Download
Dataset	Model	Type	R1@0.5	R1@0.7	R5@0.5	R5@0.7	Download
QVHighlights	UMT-B	—	38.59		39.85		model | metrics
QVHighlights	UMT-B	w/ PT	39.26		40.10		model | metrics
Charades-STA	UMT-B	V + A	48.31	29.25	88.79	56.08	model | metrics
Charades-STA	UMT-B	V + O	49.35	26.16	89.41	54.95	model | metrics
YouTube Highlights	UMT-S	Dog	—		65.93		model | metrics
YouTube Highlights	UMT-S	Gymnastics	—		75.20		model | metrics
YouTube Highlights	UMT-S	Parkour	—		81.64		model | metrics
YouTube Highlights	UMT-S	Skating	—		71.81		model | metrics
YouTube Highlights	UMT-S	Skiing	—		72.27		model | metrics
YouTube Highlights	UMT-S	Surfing	—		82.71		model | metrics
TVSum	UMT-S	VT	—		87.54		model | metrics
TVSum	UMT-S	VU	—		81.51		model | metrics
TVSum	UMT-S	GA	—		88.22		model | metrics
TVSum	UMT-S	MS	—		78.81		model | metrics
TVSum	UMT-S	PK	—		81.42		model | metrics
TVSum	UMT-S	PR	—		86.96		model | metrics
TVSum	UMT-S	FM	—		75.96		model | metrics
TVSum	UMT-S	BK	—		86.89		model | metrics
TVSum	UMT-S	BT	—		84.42		model | metrics
TVSum	UMT-S	DS	—		79.63		model | metrics

Here, w/ PT means the model is initialized with weights pre-trained on ASR captions. V, A, and O denote video, audio, and optical flow, respectively.

Citation

If you find this project useful for your research, please kindly cite our paper.

@inproceedings{liu2022umt,
  title={UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection},
  author={Liu, Ye and Li, Siyuan and Wu, Yang and Chen, Chang Wen and Shan, Ying and Qie, Xiaohu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Comments

Feature extraction (I3D and optical flow)

Hello, which codebase did you use for the I3D and optical flow feature extraction mentioned in the dataset paper? I would like to reproduce it and then test on my own videos.

opened by Lvqin001 16

Model test

1. How can I test on highlight_test_release.jsonl, and how can I output a prediction like this?

2. Can you explain the metrics that appear for the validation set? "MR-long-mAP", "MR-middle-mAP", "MR-short-mAP", "HL-min-Fair-mAP", "HL-min-Fair-Hit1", "HL-min-Good-mAP"

opened by Yangaiei 14

How to align the audio and video at the clip level

Your paper says "Visual and audio features are temporally aligned at clip level". For example, in the YouTube Highlights dataset, the video is divided into clips of 100 frames with 50% overlap. I extract audio features with the codebase you provided. How should the audio and video be aligned at the clip level? How did you do it?

opened by Lynneyyq 8
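
As an illustration of the clip layout described in this question (100-frame clips with 50% overlap), the sketch below enumerates the frame windows and converts them to time spans that could be used to cut the audio. It is not confirmed to be the authors' procedure: the function names are hypothetical, the frame rate must be supplied by the user, and trailing frames that do not fill a whole clip are simply dropped.

```python
def clip_windows(num_frames, clip_len=100, overlap=0.5):
    """Return (start, end) frame indices of sliding clips; end is exclusive."""
    stride = int(clip_len * (1 - overlap))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    # Trailing frames that cannot fill a whole clip are dropped (an assumption).
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]


def audio_spans(windows, fps):
    """Convert frame windows to (start_sec, end_sec) spans for audio slicing."""
    return [(start / fps, end / fps) for start, end in windows]


if __name__ == "__main__":
    windows = clip_windows(250)
    print(windows)
    print(audio_spans(windows, fps=25))
```

Extracting one audio feature per time span yields the same number of audio and video clips by construction.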

How to align the audio feature and video feature?

If the video feature has size [14, 2048], I need to extract an audio feature of size [14, 2048].

Following your instructions, I use the PANN_inference project to extract audio features from the raw wave file. Because of the video clipping and overlap, the first dimension of the video feature is 14. How do I align the audio feature with the video feature?

I found that the size of the audio feature depends on the sample rate, window size, hop size, and other parameters. How should I set them? I would like to know more details about how to extract the audio features. Thank you.

opened by Xuguozi 7

Bug? if (num_gt := sum(label)) == 0:

    if (num_gt := sum(label)) == 0:
               ^

SyntaxError: invalid syntax

I modified the code as follows:

    num_gt = sum(label)  ## add
    if num_gt == 0:
        print("????????")
        collected.append(0)
        continue

but the mAP is different for each evaluation. Why?
opened by Xuguozi 7
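
For context on this report: the `:=` assignment expression was introduced in Python 3.8, so older interpreters reject it with a SyntaxError, while the environment listed under Installation (Python 3.10.0) accepts it. The sketch below shows the equivalent plain-assignment form. The surrounding loop and the function name are illustrative assumptions, not the repository's evaluation code, and it does not address the varying mAP mentioned above.

```python
def positives_per_video(labels):
    """Count ground-truth positives per video, recording 0 where there are none."""
    collected = []
    for label in labels:
        # Python 3.8+ form quoted in the report:
        #     if (num_gt := sum(label)) == 0:
        num_gt = sum(label)  # plain assignment, valid on any Python 3
        if num_gt == 0:
            collected.append(0)
            continue
        collected.append(num_gt)
    return collected


if __name__ == "__main__":
    print(positives_per_video([[0, 0], [1, 0, 1]]))
```

Both forms bind `num_gt` and test it against zero, so the rewrite by itself does not change the result.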

Metric methods

Hello, I am very interested in your research. In the evaluation method, after sorting the predicted scores you no longer use them; only the ground-truth labels, reordered by the sorted scores, enter the computation. Why? Also, the paper gives no detailed explanation of how mAP is computed. Is there a reference?

opened by oomq 6
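
The behaviour described in this question matches the standard definition of average precision, where the scores only decide the ranking and the value is then computed from the labels in ranked order. The sketch below implements that textbook definition. It is not taken from the repository, whose evaluation code may differ in details such as tie handling, and returning 0 when a video has no positives is an assumption.

```python
def average_precision(scores, labels):
    """Average precision of binary labels ranked by descending score."""
    # The scores matter only for the ordering; after sorting, the
    # computation reads nothing but the labels in ranked order.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ranked = [lab for _, lab in pairs]
    num_gt = sum(ranked)
    if num_gt == 0:
        return 0.0  # assumption: videos without positives contribute 0
    hits, total = 0, 0.0
    for rank, lab in enumerate(ranked, start=1):
        if lab:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / num_gt


if __name__ == "__main__":
    print(average_precision([0.9, 0.8, 0.1], [1, 0, 1]))
```

Averaging this value over all videos (or queries) gives mAP.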

Text embedding on Charades-STA dataset and some minor questions

Hi, first of all, thanks for your great work!

I plan to run experiments on the Charades-STA dataset with the features you provide, but I notice that there are no text embedding files, although the other features (video, optical flow, audio) are available.

Can you provide the text embedding you used for your experiments?

Also, I have some minor questions about data preprocessing. Similar to the Issue in https://github.com/TencentARC/UMT/issues/29#issue-1410050684, I found the length of optical flow features is different from that of video features.

Did you simply crop the feature using a shorter length, same as the audio features?

Thanks,

opened by hsi1032 5

Questions about text feature extraction

Hello, I have two questions about text feature extraction.

1. Is ViT-B/32 the model loaded when using CLIP to extract text features?

2. Is the text input to CLIP the value of "query" in the highlight_train_release.jsonl file?

opened by Yangaiei 5
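
Whatever text encoder is used, the query strings first have to be read from the annotation file. A minimal sketch, assuming only that each line of the `.jsonl` file is one JSON object containing the "query" key mentioned in this question (other fields are not relied on, and `load_queries` is a hypothetical helper):

```python
import json


def load_queries(path, key="query"):
    """Read a JSON Lines file and return the value of `key` from each record."""
    queries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            queries.append(json.loads(line)[key])
    return queries
```

For example, `load_queries("data/qvhighlights/highlight_train_release.jsonl")` would return the list of query strings in file order.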

Retrieve a video in real time

Hello, can this method retrieve a video in real time?

The paper says that "On YouTube Highlights and TVSum, we obtain clip-level visual features using an I3D [4] pre-trained on Kinetics 400 [13]". Does this mean that, to process an unseen video, the audio and video features must be extracted offline separately? How can the highlighted part of a video be retrieved in real time?

Thanks a lot.

opened by Lynneyyq 3

How do I make my dataset

I have a couple of questions.

1. I want to build a dataset similar to QVHighlights for my research direction. What do I need to do?

2. What method was used to extract the QVHighlights text features?

opened by Yangaiei 3

Misalignment between video and audio for QVHighlights

Thank you for the great work.

However, when I use the features provided by this repo, some video and audio features have different sequence lengths.

An example is given below in the order "vid, video shape, audio shape":

B3yOejNbNks_210.0_360.0 torch.Size([71, 2816]) torch.Size([70, 2048])

Can you explain how to align these features?

Thank you. Best regards

opened by wjun0830 2
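
One simple workaround for an off-by-one length mismatch such as 71 versus 70 clips is to trim both sequences to the shorter length, which is also the approach another question above asks about. Whether the authors do this is not confirmed here. The helper name is hypothetical, and the function works on any sequence that supports `len()` and slicing along the first dimension, including lists, NumPy arrays, and PyTorch tensors.

```python
def truncate_to_common_length(video, audio):
    """Trim two per-clip feature sequences to the shorter of their lengths."""
    n = min(len(video), len(audio))
    return video[:n], audio[:n]


if __name__ == "__main__":
    # Stand-ins for feature sequences with 71 and 70 clips.
    video = [[0.0] * 4 for _ in range(71)]
    audio = [[0.0] * 2 for _ in range(70)]
    v, a = truncate_to_common_length(video, audio)
    print(len(v), len(a))
```

Trimming discards the trailing clip of the longer sequence, so it is only reasonable when the mismatch is small.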
