UMT is a unified and flexible framework that can handle different combinations of input modalities and produce video moment retrieval and/or highlight detection results.

Overview

Unified Multi-modal Transformers


This repository maintains the official implementation of the paper UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection by Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, and Xiaohu Qie, which has been accepted by CVPR 2022.

Installation

Please refer to the following environment settings that we use. You may need to install these packages manually if you run into problems during the automatic installation. An optional version check is sketched at the end of this section.

  • CUDA 11.5.0
  • CUDNN 8.3.2.44
  • Python 3.10.0
  • PyTorch 1.11.0
  • NNCore 0.3.6

Install from source

  1. Clone the repository from GitHub.
git clone https://github.com/TencentARC/UMT.git
cd UMT
  2. Install dependencies.
pip install -r requirements.txt
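
To verify that the environment roughly matches the versions listed above, you can run a short check such as the one below. This is a minimal sketch of our own (the file name check_env.py is hypothetical, not part of the repository); it only prints version strings for comparison.

# check_env.py (illustrative sketch, not part of the official tooling)
import importlib.metadata
import torch

print('PyTorch:', torch.__version__)                   # expected: 1.11.0
print('CUDA (build):', torch.version.cuda)              # expected: 11.x
print('cuDNN:', torch.backends.cudnn.version())         # expected: 8xxx
print('NNCore:', importlib.metadata.version('nncore'))  # expected: 0.3.6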

Getting Started

Download and prepare the datasets

  1. Download and extract the datasets.
  2. Prepare the files in the following structure. A minimal script for sanity-checking the layout is sketched after the tree.
UMT
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── *features
│   │   ├── highlight_{train,val,test}_release.jsonl
│   │   └── subs_train.jsonl
│   ├── charades
│   │   ├── *features
│   │   └── charades_sta_{train,test}.txt
│   ├── youtube
│   │   ├── *features
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── *features
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
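
To sanity-check the layout before training, you can run a short script such as the sketch below. It is our own illustration (not part of the official tooling) and only looks for the QVHighlights annotation files named in the tree above.

# check_data.py (illustrative sketch; only verifies the QVHighlights annotations)
import json
from pathlib import Path

root = Path('data/qvhighlights')
for split in ('train', 'val', 'test'):
    anno = root / f'highlight_{split}_release.jsonl'
    if not anno.exists():
        print(f'missing: {anno}')
        continue
    with open(anno) as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f'{anno.name}: {len(records)} annotations')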

Train a model

Run the following command to train a model using a specified config.

# Single GPU
python tools/launch.py ${path-to-config}

# Multiple GPUs
torchrun --nproc_per_node=${num-gpus} tools/launch.py ${path-to-config}

Test a model and evaluate results

Run the following command to test a model and evaluate results.

python tools/launch.py ${path-to-config} --checkpoint ${path-to-checkpoint} --eval

Pre-train with ASR captions on QVHighlights

Run the following command to pre-train a model using ASR captions on QVHighlights.

torchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py

Model Zoo

We provide multiple pre-trained models and training logs here. All the models are trained with a single NVIDIA Tesla V100-FHHL-16GB GPU and are evaluated using the default metrics of the datasets.

| Dataset | Model | Type | MR R1@0.5 / R1@0.7 | MR R5@0.5 / R5@0.7 | MR mAP | HD mAP | Download |
|---|---|---|---|---|---|---|---|
| QVHighlights | UMT-B | - | - | - | 38.59 | 39.85 | model / metrics |
| QVHighlights | UMT-B w/ PT | - | - | - | 39.26 | 40.10 | model / metrics |
| Charades-STA | UMT-B | V + A | 48.31 / 29.25 | 88.79 / 56.08 | - | - | model / metrics |
| Charades-STA | UMT-B | V + O | 49.35 / 26.16 | 89.41 / 54.95 | - | - | model / metrics |
| YouTube Highlights | UMT-S | Dog | - | - | - | 65.93 | model / metrics |
| YouTube Highlights | UMT-S | Gymnastics | - | - | - | 75.20 | model / metrics |
| YouTube Highlights | UMT-S | Parkour | - | - | - | 81.64 | model / metrics |
| YouTube Highlights | UMT-S | Skating | - | - | - | 71.81 | model / metrics |
| YouTube Highlights | UMT-S | Skiing | - | - | - | 72.27 | model / metrics |
| YouTube Highlights | UMT-S | Surfing | - | - | - | 82.71 | model / metrics |
| TVSum | UMT-S | VT | - | - | - | 87.54 | model / metrics |
| TVSum | UMT-S | VU | - | - | - | 81.51 | model / metrics |
| TVSum | UMT-S | GA | - | - | - | 88.22 | model / metrics |
| TVSum | UMT-S | MS | - | - | - | 78.81 | model / metrics |
| TVSum | UMT-S | PK | - | - | - | 81.42 | model / metrics |
| TVSum | UMT-S | PR | - | - | - | 86.96 | model / metrics |
| TVSum | UMT-S | FM | - | - | - | 75.96 | model / metrics |
| TVSum | UMT-S | BK | - | - | - | 86.89 | model / metrics |
| TVSum | UMT-S | BT | - | - | - | 84.42 | model / metrics |
| TVSum | UMT-S | DS | - | - | - | 79.63 | model / metrics |

Here, w/ PT means that the model is initialized with weights pre-trained on ASR captions. V, A, and O indicate video, audio, and optical flow, respectively.
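
If you want to inspect a downloaded checkpoint before evaluation, a plain torch.load is usually sufficient. The sketch below is our own illustration: the file name and the 'state_dict' key are assumptions and may not match the actual checkpoint layout.

# inspect_ckpt.py (illustrative sketch; file name and keys are assumptions)
import torch

ckpt = torch.load('umt_base_qvhighlights.pth', map_location='cpu')  # hypothetical file name
state = ckpt.get('state_dict', ckpt)  # assumes a dict-like checkpoint
print(f'{len(state)} entries')
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))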

Citation

If you find this project useful for your research, please kindly cite our paper.

@inproceedings{liu2022umt,
  title={UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection},
  author={Liu, Ye and Li, Siyuan and Wu, Yang and Chen, Chang Wen and Shan, Ying and Qie, Xiaohu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Comments
  • feature extraction (i3d and optical flow)

    Hello, I would like to ask which codebase you use for the I3D and optical flow feature extraction mentioned in the dataset paper. I want to reproduce it and then test my own videos.

    opened by Lvqin001 16
  • model test

    1. How do I test on highlight_test_release.jsonl, and how do I output a prediction like this?

    2. Can you explain the metrics reported on the validation set: "MR-long-mAP", "MR-middle-mAP", "MR-short-mAP", "HL-min-Fair-mAP", "HL-min-Fair-Hit1", "HL-min-Good-mAP"?

    opened by Yangaiei 14
  • How to align the audio and video at the clip level

    Your paper says "Visual and audio features are temporally aligned at clip level". For example, in the YouTube Highlights dataset, the video is divided into clips of 100 frames with 50% overlap. I extracted the audio features using the codebase you provided. How do you align the audio and video at the clip level?

    opened by Lynneyyq 8
  • how to align the audio feature and video feature?

    If the size of the video feature is [14, 2048], I need to extract an audio feature of the same size, [14, 2048].

    Following your approach, I use the PANN_inference project to extract audio features from the raw wave file. Because of the video clipping and overlap operations, the first dimension of the video feature is 14. How do I align the audio feature with the video feature?

    I found that the size of the audio feature depends on the sample rate, window size, hop size, and so on. What should I set these parameters to? I would like to know more details about how you extract the audio features. Thank you.

    opened by Xuguozi 7
  • bug??          if (num_gt := sum(label)) == 0:

    if (num_gt := sum(label)) == 0:
               ^

    SyntaxError: invalid syntax

    I modified the code as follows:

    num_gt = sum(label)  # added

    if num_gt == 0:
        print("????????")
        collected.append(0)
        continue

    But the mAP of each evaluation run is different. Why?

    opened by Xuguozi 7
  • metric methods

    Hello, I am very interested in your research. In the evaluation code, after you sort the predicted scores, you do not use them any further; you only use the ground-truth labels corresponding to the sorted scores. Why? Also, the paper does not explain the mAP metric in detail. Is there any reference?

    opened by oomq 6
  • Text embedding on charadesSTA dataset and some minor questions

    Hi, first of all, thanks for your great work!

    I plan to run experiments on the Charades-STA dataset with the features you provide, but I notice that there are no text embedding files, although the other features (video, optical flow, audio) are available.

    Can you provide the text embeddings you used for your experiments?

    I also have some minor questions about data preprocessing. Similar to the issue at https://github.com/TencentARC/UMT/issues/29#issue-1410050684, I found that the length of the optical flow features differs from that of the video features.

    Did you simply crop the features to the shorter length, as with the audio features?

    Thanks,

    opened by hsi1032 5
  • Hello, questions about text feature extraction.

    1. When using CLIP to extract text features, is the loaded model ViT-B/32? 2. When extracting text features with CLIP, is the text input the value of "query" in the "highlight_train_release.jsonl" file?

    opened by Yangaiei 5
  •  retrieve a video in real time

    Hello, can this method retrieve a video in real time?

    The paper says "On YouTube Highlights and TVSum, we obtain clip-level visual features using an I3D [4] pre-trained on Kinetics 400 [13]", which means that if an unseen video is to be evaluated, should the audio and video features be extracted offline separately? How can the highlighted parts of a video be retrieved in real time?

    Thanks a lot.

    opened by Lynneyyq 3
  • How do I make my dataset

    I have a few questions. 1. I want to build a dataset similar to QVHighlights for my research direction. What do I need to do?

    2. What method was used to extract the text features for QVHighlights?

    opened by Yangaiei 3
  • Misalignment between video and audio for QVhighlight

    Thank you for the great work.

    However, when I use the features provided by this repo, some video and audio features are misaligned in their temporal length.

    An example is given below, in the order "vid, video shape, audio shape": B3yOejNbNks_210.0_360.0 torch.Size([71, 2816]) torch.Size([70, 2048])

    Can you explain how to align these features?

    Thank you. Best regards

    opened by wjun0830 2
Owner
Applied Research Center (ARC), Tencent PCG