
Overview

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

This repository contains the code for our ICCV 2021 paper:

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
Sangho Lee*, Jiwan Chung*, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song (*: equal contribution)
[paper]

@inproceedings{lee2021acav100m,
    title="{ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning}",
    author={Sangho Lee and Jiwan Chung and Youngjae Yu and Gunhee Kim and Thomas Breuel and Gal Chechik and Yale Song},
    booktitle={ICCV},
    year=2021
}

System Requirements

  • Python >= 3.8.5
  • FFmpeg 4.3.1

Installation

  1. Install PyTorch 1.6.0, torchvision 0.7.0 and torchaudio 0.6.0 for your environment, following the instructions HERE.

  2. Install the other required packages.

pip install -r requirements.txt
python -m nltk.downloader 'punkt'
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/<cuda version>/torch1.6/index.html
pip install git+https://github.com/jiwanchung/slowfast
pip install torch-scatter==2.0.5 -f https://pytorch-geometric.com/whl/torch-1.6.0+<cuda version>.html

For example, for CUDA 10.2, replace <cuda version> with cu102.
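The two CUDA-dependent commands then read:

pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.6/index.html
pip install torch-scatter==2.0.5 -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html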

Input File Structure

  1. Create the data directory.

mkdir data

  2. Prepare the input file.

data/metadata.tsv should be structured as follows. We provide an example input file in examples/metadata.tsv.

YOUTUBE_ID\t{"LatestDAFeature": {"Title": TITLE, "Description": DESCRIPTION, "YouTubeCategory": YOUTUBE_CATEGORY, "VideoLength": VIDEO_LENGTH}, "MediaVersionList": [{"Duration": DURATION}]}

Data Curation Pipeline

One-Liner

bash ./run.sh

To enable GPU computation, modify the CUDA_VISIBLE_DEVICES environment variable accordingly. For example, run the above command as export CUDA_VISIBLE_DEVICES=2,3; bash ./run.sh.

Step-by-Step

  1. Filter the videos with metadata.
bash ./metadata_filtering/code/run.sh

The above command will build the data/filtered.tsv file.
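The actual filtering criteria are implemented in metadata_filtering/code; purely as a toy illustration of the mechanics, the sketch below keeps only rows whose Duration falls in an arbitrary range (the seconds unit is an assumption).

import json

# Toy metadata filter: keep videos whose duration lies in a chosen range.
MIN_SEC, MAX_SEC = 60, 600

with open("data/metadata.tsv") as fin, open("data/filtered.tsv", "w") as fout:
    for line in fin:
        youtube_id, meta_json = line.rstrip("\n").split("\t", 1)
        meta = json.loads(meta_json)
        duration = float(meta["MediaVersionList"][0]["Duration"])
        if MIN_SEC <= duration <= MAX_SEC:
            fout.write(line)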

  2. Download the actual video files from YouTube.
bash ./video_download/code/run.sh

Although we provide a simple download script, we recommend more scalable solutions for downloading large-scale data.

The above command will download the files to the data/videos/raw directory.
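As one sketch of a more scalable direction, the snippet below parallelizes per-video downloads with youtube-dl (assumed to be on PATH). The bundled run.sh remains the supported path; at full scale a distributed downloader is preferable.

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: download each filtered video with youtube-dl,
# several at a time. Unavailable videos are skipped, not fatal.
def download(youtube_id: str) -> None:
    subprocess.run(
        ["youtube-dl", "-f", "mp4",
         "-o", f"data/videos/raw/{youtube_id}.mp4",
         f"https://www.youtube.com/watch?v={youtube_id}"],
        check=False,
    )

with open("data/filtered.tsv") as f:
    ids = [line.split("\t", 1)[0] for line in f]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(download, ids))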

  3. Segment the videos into 10-second clips.
bash ./clip_segmentation/code/run.sh

The above command will save the segmented clips to the data/videos directory.
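What this step does can be approximated with plain FFmpeg calls. Below is a rough Python sketch: the clip naming follows the FILENAME pattern in output.csv, but the repo's script may use different encoding flags.

import subprocess

# Rough approximation of the segmentation step: cut a raw video into
# consecutive non-overlapping 10-second clips with FFmpeg.
def segment(youtube_id: str, duration_sec: float) -> None:
    start = 0.0
    while start + 10.0 <= duration_sec:
        subprocess.run(
            ["ffmpeg", "-ss", f"{start:.3f}", "-t", "10",
             "-i", f"data/videos/raw/{youtube_id}.mp4",
             "-y", f"data/videos/{youtube_id}_{int(start)}.mp4"],
            check=True,
        )
        start += 10.0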

  4. Extract features from the clips.
bash ./feature_extraction/code/run.sh

The above command will save the extracted features to the data/features directory.

This step requires a GPU for faster computation.
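The extractors that run.sh actually uses live in feature_extraction/code (note the SlowFast dependency installed above, plus an audio model). Purely as an illustration of the idea, here is a generic clip-level visual feature extractor built on torchvision's r3d_18, not the repo's models:

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Generic illustration only: pool a 16-frame clip into a 512-d vector.
model = r3d_18(pretrained=True)
model.fc = nn.Identity()  # drop the classification head
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
with torch.no_grad():
    feature = model(clip)
print(feature.shape)  # torch.Size([1, 512])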

  5. Perform clustering with the extracted features.
bash ./clustering/code/run.sh

The above command will save the clustering results to the data/clusters directory.

This step requires a GPU for faster computation.
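Conceptually, this step assigns each clip's audio and visual features to clusters, one clustering per modality. A CPU toy version with scikit-learn k-means follows; the feature arrays and cluster count are placeholders, and the repo's GPU implementation is in clustering/code.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder features standing in for the arrays under data/features.
visual_feats = np.random.randn(10000, 512).astype(np.float32)
audio_feats = np.random.randn(10000, 512).astype(np.float32)

# One k-means per modality; each clip gets a visual and an audio label.
visual_labels = MiniBatchKMeans(n_clusters=256, random_state=0).fit_predict(visual_feats)
audio_labels = MiniBatchKMeans(n_clusters=256, random_state=0).fit_predict(audio_feats)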

  6. Select a subset with high audio-visual correspondence using the clustering results.
bash ./subset_selection/code/run.sh

The above command will save the selected clip indices to the data/datasets directory.

This step requires a GPU for faster computation.
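The selection objective in the paper maximizes the mutual information between audio and visual cluster assignments over the chosen subset. As a toy illustration of how one candidate subset could be scored (the labels below are random placeholders, and the paper's greedy optimization over multiple clusterings is considerably more involved):

import numpy as np
from sklearn.metrics import mutual_info_score

# Toy scoring: mutual information between the audio and visual cluster
# labels of a candidate subset.
visual_labels = np.random.randint(0, 256, size=10000)
audio_labels = np.random.randint(0, 256, size=10000)

subset = np.random.choice(10000, size=1000, replace=False)
score = mutual_info_score(visual_labels[subset], audio_labels[subset])
print(f"audio-visual MI of candidate subset: {score:.4f}")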

The final output is saved to the data/output.csv file.

Output File Structure

output.csv is structured as follows. We provide an example output file at examples/output.csv.

# SHARD_NAME,FILENAME,YOUTUBE_ID,SEGMENT
shard-000009,qpxektwhzra_292.mp4,qpxektwhzra,"[292.3329999997, 302.3329999997]"
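For reference, here is one way to read the file back from Python; the SEGMENT column parses as a JSON list holding the clip's start and end time in seconds.

import csv
import json

# Iterate over output.csv, skipping the leading comment line.
with open("data/output.csv") as f:
    for row in csv.reader(f):
        if row[0].startswith("#"):
            continue
        shard, filename, youtube_id, segment = row
        start, end = json.loads(segment)
        print(shard, filename, youtube_id, start, end)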

Evaluation

Instructions on downstream evaluation are provided in Evaluation.

Correspondence Retrieval

Instructions on correspondence retrieval experiments are provided in Correspondence Retrieval.
