Awesome-Video-Datasets
- Contributions are most welcome; if you have any suggestions or improvements, please create an issue or open a pull request.
- Our group website: VIS Lab, University of Amsterdam.
Contents
- Action Recognition
- Video Classification
- Egocentric View
- Video Object Segmentation
- Object Detection
- Group Activity Recognition
- Movie
- Video Captioning
- 360 Videos
- Activity Localization
- Video and Language
- Action Segmentation
- Repetition Counting
- Audiovisual Learning
- Video Indexing
- Skill Determination
- Video Retrieval
- Single Object Tracking
- Multiple Objects Tracking
- Video Relation Detection
- Anomaly Detection
- Pose Estimation
- Dynamic Texture Classification
- Physics
Action Recognition
- HOLLYWOOD2: Actions in Context (CVPR 2009)
[Paper][Homepage]
12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies
- HMDB: A Large Video Database for Human Motion Recognition (ICCV 2011)
[Paper][Homepage]
51 classes, 7,000 clips
- UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
[Paper][Homepage]
101 classes, 13k clips
- Sports-1M: Large-scale Video Classification with Convolutional Neural Networks
[Paper][Homepage]
1,000,000 videos, 487 classes
- ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding (CVPR 2015)
[Paper][Homepage]
203 classes, 137 untrimmed videos per class, 1.41 activity instances per video
- MPII-Cooking: Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data (IJCV 2015)
[Paper][Homepage]
67 fine-grained activities, 59 composite activities, 14,105 clips, 273 videos
- Kinetics
[Kinetics-400/Kinetics-600/Kinetics-700/Kinetics-700-2020] [Homepage]
400/600/700/700 classes, at least 400/600/600/700 clips per class
- Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
[Paper][Homepage]
9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes
- Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018)
[Paper][Homepage]
112 people, 4,000 paired videos, 157 action classes
- 20BN-jester: The Jester Dataset: A Large-Scale Video Dataset of Human Gestures (ICCVW 2019)
[Paper][Homepage]
148,092 videos, 27 classes, 1,376 actors
- Moments in Time Dataset: one million videos for event understanding (TPAMI 2019)
[Paper][Homepage]
over 1,000,000 labelled videos for 339 Moment classes, the average number of labeled videos per class is 1,757 with a median of 2,775
- Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
[Paper][Homepage]
1.02 million videos, 313 action classes, 553,535 videos are annotated with more than one label and 257,491 videos are annotated with three or more labels
- 20BN-SOMETHING-SOMETHING: The "something something" video database for learning and evaluating visual common sense
[Paper][Homepage]
100,000 videos across 174 classes
- EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
[Paper][Homepage]
100 hours, 37 participants, 20M frames, 90K action segments, 700 variable-length videos, 97 verb classes, 300 noun classes, 4,053 action classes
- HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021)
[Paper][Homepage]
27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships
- MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding (ICCV 2019)
[Paper][Homepage]
36k video clips, 37 action classes, RGB+Keypoints+Acc+Gyro+Ori+Wi-Fi+Pressure
- LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities (ECCV 2020)
[Paper][Homepage]
RGB-D, 641 action classes, 11,781 action segments, 4.6M frames
- NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis (CVPR 2016, TPAMI 2019)
[Paper][Homepage]
106 distinct subjects, more than 114 thousand video samples, 8 million frames, 120 action classes
- Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020)
[Paper][Homepage]
10K videos, 0.4M objects, 1.7M visual relationships
- TITAN: Future Forecast using Action Priors (CVPR 2020)
[Paper][Homepage]
700 labeled video clips, 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes
- PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding (ACM Multimedia Workshop)
[Paper][Homepage]
1,076 long video sequences, 51 action categories, performed by 66 subjects in three camera views, 20,000 action instances, 5.4 million frames, RGB+Depth+Infrared Radiation+Skeleton
- HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
[Paper][Homepage]
HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories
- Oops!: Predicting Unintentional Action in Video (CVPR 2020)
[Paper][Homepage]
20,338 videos, 7,368 annotated for training, 6,739 annotated for testing
- RareAct: A video dataset of unusual interactions
[Paper][Homepage]
122 different actions, 7,607 clips, 905 videos, 19 verbs, 38 nouns
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (CVPR 2020)
[Paper][Homepage]
10 event categories, including 6 male events and 4 female events, 530 element categories
- THUMOS: The THUMOS challenge on action recognition for videos “in the wild”
[Paper][Homepage]
101 actions, train: 13,000 temporally trimmed videos, validation: 2,100 temporally untrimmed videos with temporal annotations of actions, background: 3,000 relevant videos, test: 5,600 temporally untrimmed videos with withheld ground truth
- MultiTHUMOS: Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos (IJCV 2017)
[Paper][Homepage]
400 videos, 38,690 annotations of 65 action classes, 10.5 action classes per video
- Hierarchical Action Search: Searching for Actions on the Hyperbole (CVPR 2020)
[Paper][Homepage]
Hierarchical-ActivityNet, Hierarchical-Kinetics, and Hierarchical-Moments derived from ActivityNet, mini-Kinetics, and Moments-in-Time; provides action hierarchies and action splits for unseen action search
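
For quick experimentation with the classic benchmarks above, torchvision provides dataset wrappers for several of them (e.g. HMDB51, UCF101, Kinetics). Below is a minimal sketch for UCF101, assuming the videos and the official train/test split files have already been downloaded; the local paths, clip length, and batch size are placeholder choices, not part of the dataset release.

```python
# Minimal sketch (not an official loader): reading UCF101 clips via torchvision.
# Assumes the .avi files and the "ucfTrainTestlist" split files are available
# locally; requires a video backend such as PyAV. Paths below are placeholders.
import torch
from torchvision.datasets import UCF101

dataset = UCF101(
    root="data/UCF101/videos",                       # placeholder path to the videos
    annotation_path="data/UCF101/ucfTrainTestlist",  # placeholder path to split files
    frames_per_clip=16,       # frames per sampled clip
    step_between_clips=16,    # stride between consecutive clips
    fold=1,
    train=True,
)

# Each item is (video, audio, label); videos are uint8 tensors of shape (T, H, W, C).
# A custom collate keeps only video and label, since audio lengths vary per clip.
def collate(batch):
    videos = torch.stack([video for video, _, _ in batch])
    labels = torch.tensor([label for _, _, label in batch])
    return videos, labels

loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate)
videos, labels = next(iter(loader))
print(videos.shape, labels.shape)  # e.g. torch.Size([8, 16, 240, 320, 3]) torch.Size([8])
```
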
Video Classification
- COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis (CVPR 2019)
[Paper][Homepage]
11,827 videos, 180 tasks, 12 domains, 46,354 annotated segments
- VideoLT: Large-scale Long-tailed Video Recognition
[Paper][Homepage]
256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution
- Youtube-8M: A Large-Scale Video Classification Benchmark
[Paper][Homepage]
8,000,000 videos, 4,000 visual entities
- HVU: Large Scale Holistic Video Understanding (ECCV 2020)
[Paper][Homepage]
572k videos in total with 9 million annotations for the training, validation and test sets spanning 3,142 labels; semantic aspects defined on categories of scenes, objects, actions, events, attributes and concepts
- VLOG: From Lifestyle Vlogs to Everyday Interactions (CVPR 2018)
[Paper][Homepage]
114K video clips, 10.7K participants, annotations: Hand/Semantic Object, Hand Contact State, Scene Classification
- EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video
[Paper][Homepage]
each video is annotated at 6 Hz with 15 continuous evoked expression labels, 36.7 million annotations of viewer facial reactions to 23,574 videos (1,700 hours)
Egocentric View
- Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018)
[Paper][Homepage]
112 people, 4,000 paired videos, 157 action classes
- EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
[Paper][Homepage]
100 hours, 37 participants, 20M frames, 90K action segments, 700 variable-length videos, 97 verb classes, 300 noun classes, 4,053 action classes
- HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021)
[Paper][Homepage]
27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships
Video Object Segmentation
- DAVIS: A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation (CVPR 2016)
[Paper][Homepage]
50 sequences, 3,455 annotated frames
- SegTrack v2: Video Segmentation by Tracking Many Figure-Ground Segments (ICCV 2013)
[Paper][Homepage]
1,000 frames with pixel-level annotations
- UVO: Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation
[Paper][Homepage]
1,200 videos, 108k frames, 12.29 objects per video
- VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild (CVPR 2021)
[Paper][Homepage]
3,536 videos, 251,632 pixel-level labeled frames, 124 categories, pixel-level annotations provided at 15 fps, each shot lasts 5 seconds on average
Object Detection
- ImageNet VID
[Paper][Homepage]
30 categories, train: 3,862 video snippets, validation: 555 snippets
- YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
[Paper][Homepage]
380,000 video segments about 19s long, 5.6M bounding boxes, 23 types of objects
- Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations (CVPR 2021)
[Paper][Homepage]
15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes; manually annotated 3D bounding boxes for each object
- Water detection through spatio-temporal invariant descriptors
[Paper][Dataset]
260 videos
Dynamic Texture Classification
- Dynamic Texture: A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding (ECCV 2018)
[Paper][Homepage]
over 10,000 videos
- YUVL: Spacetime Texture Representation and Recognition Based on a Spatiotemporal Orientation Analysis (TPAMI 2012)
[Paper][Homepage][Dataset]
610 spacetime texture samples
- UCLA: Dynamic Texture Recognition (CVPR 2001)
[Paper][Dataset]
76 dynamic textures
Group Activity Recognition
- Volleyball: A Hierarchical Deep Temporal Model for Group Activity Recognition
[Paper][Homepage]
4,830 clips, 8 group activity classes, 9 individual actions
- Collective: What are they doing?: Collective activity classification using spatio-temporal relationship among people
[Paper][Homepage]
5 different collective activities, 44 clips
Movie
- HOLLYWOOD2: Actions in Context (CVPR 2009)
[Paper][Homepage]
12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies
- HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do
[Paper][Homepage]
10 movies from Romance, Drama, Fantasy, Adventure, Comedy
- MPII-MD: A Dataset for Movie Description
[Paper][Homepage]
94 videos, 68,337 clips, 68,375 descriptions
- MovieNet: A Holistic Dataset for Movie Understanding (ECCV 2020)
[Paper][Homepage]
1,100 movies, 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style
- MovieQA: Story Understanding Benchmark (CVPR 2016)
[Paper][Homepage]
14,944 questions, 408 movies
- MovieGraphs: Towards Understanding Human-Centric Situations from Videos (CVPR 2018)
[Paper][Homepage]
7,637 movie clips, 51 movies, annotations: scene, situation, description, graph (Character, Attributes, Relationship, Interaction, Topic, Reason, Time stamp)
- Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020)
[Paper][Homepage]
33,976 captioned clips from 3,605 movies, 400K+ face-tracks, 8K+ labelled characters, 20K+ subtitles, densely pre-extracted features for each clip (RGB, Motion, Face, Subtitles, Scene)
360 Videos
- Pano2Vid: Automatic Cinematography for Watching 360° Videos (ACCV 2016)
[Paper][Homepage]
20 of the 86 360° videos are labeled for testing; 9,171 normal videos captured by humans are used for training; topics: Soccer, Mountain Climbing, Parade, and Hiking
- Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos (CVPR 2017)
[Paper][Homepage]
342 360° videos, topics: basketball, parkour, BMX, skateboarding, and dance
- YT-ALL: Self-Supervised Generation of Spatial Audio for 360° Video (NeurIPS 2018)
[Paper][Homepage]
1,146 videos, half of the videos are live music performances
- YT360: Learning Representations from Audio-Visual Spatial Alignment (NeurIPS 2020)
[Paper][Homepage]
topics: musical performances, vlogs, sports, and others
Activity Localization
- Hollywood2Tubes: Spot On: Action Localization from Pointly-Supervised Proposals
[Paper][Dataset]
train: 823 videos, 1,026 action instances, 16,411 annotations; test: 884 videos, 1,086 action instances, 15,835 annotations
- DALY: Human Action Localization with Sparse Spatial Supervision
[Paper][Homepage]
10 actions, 3.3M frames, 8,133 clips
- Action Completion: A temporal model for Moment Detection (BMVC 2018)
[Paper][Homepage]
completion moments of 16 actions from three datasets: HMDB, UCF101, RGBD-AC
- RGBD-Action-Completion: Beyond Action Recognition: Action Completion in RGB-D Data (BMVC 2016)
[Paper][Homepage]
414 complete/incomplete object interaction sequences, spanning six actions and captured using an RGB-D camera
- AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
[Paper][Homepage]
80 atomic visual actions in 430 15-minute video clips, 1.58M action labels with multiple labels per person occurring frequently
- AVA-Kinetics: The AVA-Kinetics Localized Human Actions Video Dataset
[Paper][Homepage]
230k clips, 80 AVA action classes
- HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
[Paper][Homepage]
HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories
- CommonLocalization: Localizing the Common Action Among a Few Videos (ECCV 2020)
[Paper][Homepage]
few-shot common action localization, revised ActivityNet1.3 and Thumos14
- CommonSpaceTime: Few-Shot Transformation of Common Actions into Time and Space (CVPR 2021)
[Paper][Homepage]
revised AVA and UCF101-24
- MUSES: Multi-shot Temporal Event Localization: a Benchmark (CVPR 2021)
[Paper][Homepage]
31,477 event instances, 716 video hours, 19 shots per instance, 176 shots per video, 25 categories, 3,697 videos
- MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection (WACV 2021)
[Paper][Homepage]
144 annotated hours covering 37 activity types, with bounding boxes for actors and props, captured by 38 RGB and thermal IR cameras
- TVSeries: Online Action Detection (ECCV 2016)
[Paper][Homepage]
27 episodes from 6 popular TV series, 30 action classes, 6,231 action instances
Video Captioning
- VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events
[Paper][Homepage]
45,826 videos and their descriptions obtained by harvesting YouTube
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
[Paper][Homepage]
10K web video clips, 200K clip-sentence pairs
- VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019)
[Paper][Homepage]
41,250 videos, 825,000 captions in both English and Chinese, over 206,000 English-Chinese parallel translation pairs
- ActivityNet Captions: Dense-Captioning Events in Videos (ICCV 2017)
[Paper][Homepage]
20k videos, 100k sentences
- ActivityNet Entities: Grounded Video Description
[Paper][Homepage]
14,281 annotated videos, 52k video segments with at least one noun phrase annotated per segment, augments the ActivityNet Captions dataset with 158k bounding boxes
- VTW: Title Generation for User Generated Videos (ECCV 2016)
[Paper][Homepage]
18,100 video clips with an average duration of 1.5 minutes per clip
- Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
[Paper][Homepage]
9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes
Video and Language
- Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017)
[Paper][Homepage]
natural language descriptions of the target object
- MPII-MD: A Dataset for Movie Description
[Paper][Homepage]
94 videos, 68,337 clips, 68,375 descriptions
- Narrated Instruction Videos: Unsupervised Learning from Narrated Instruction Videos
[Paper][Homepage]
150 videos, 800,000 frames, five tasks: Making a coffee, Changing car tire, Performing cardiopulmonary resuscitation (CPR), Jumping a car and Repotting a plant
- YouCook: A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching (CVPR 2013)
[Paper][Homepage]
88 YouTube cooking videos
- HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)
[Paper][Homepage]
136 million video clips sourced from 1.22M narrated instructional web videos, 23k different visual tasks
- How2: A Large-scale Dataset for Multimodal Language Understanding (NeurIPS 2018)
[Paper][Homepage]
80,000 clips, word-level time alignments to the ground-truth English subtitles
- Breakfast: The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities
[Paper][Homepage]
52 participants, 10 distinct cooking activities captured in 18 different kitchens, 48 action classes, 11,267 clips
- EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
[Paper][Homepage]
100 hours, 37 participants, 20M frames, 90K action segments, 700 variable-length videos, 97 verb classes, 300 noun classes, 4,053 action classes
- YouCook2: YouCookII Dataset
[Paper][Homepage]
2,000 long untrimmed videos, 89 cooking recipes, each recipe includes 5 to 16 steps, each step is described with one sentence
- QuerYD: A video dataset with textual and audio narrations (ICASSP 2021)
[Paper][Homepage]
1,400+ narrators, 200+ video hours, 70+ description hours
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference (CVPR 2020)
[Paper][Homepage]
95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video
- CrossTask: weakly supervised learning from instructional videos (CVPR 2019)
[Paper][Homepage]
4.7K videos, 83 tasks
Action Segmentation
- A2D: Can Humans Fly? Action Understanding with Multiple Classes of Actors (CVPR 2015)
[Paper][Homepage]
3,782 videos, actors: adult, baby, bird, cat, dog, ball and car; actions: climbing, crawling, eating, flying, jumping, rolling, running, and walking
- J-HMDB: Towards understanding action recognition (ICCV 2013)
[Paper][Homepage]
31,838 annotated frames, 21 categories involving a single person in action: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, wave
- A2D Sentences & J-HMDB Sentences: Actor and Action Video Segmentation from a Sentence (CVPR 2018)
[Paper][Homepage]
A2D Sentences: 6,656 sentences, including 811 different nouns, 225 verbs and 189 adjectives; J-HMDB Sentences: 928 sentences, including 158 different nouns, 53 verbs and 23 adjectives
Audiovisual Learning
- Audio Set: An ontology and human-labeled dataset for audio events (ICASSP 2017)
[Paper][Homepage]
632 audio event classes, 2,084,320 human-labeled 10-second sound clips
- MUSIC: The Sound of Pixels (ECCV 2018)
[Paper][Homepage]
685 untrimmed videos, 11 instrument categories
- AudioSet ZSL: Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos (WACV 2020)
[Paper][Homepage]
33 classes, 156,416 videos
- Kinetics-Sound: Look, Listen and Learn (ICCV 2017)
[Paper]
34 action classes from Kinetics
- EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
[Paper][Homepage]
100 hours, 37 participants, 20M frames, 90K action segments, 700 variable-length videos, 97 verb classes, 300 noun classes, 4,053 action classes
- SoundNet: Learning Sound Representations from Unlabeled Video (NIPS 2016)
[Paper][Homepage]
2+ million videos
- AVE: Audio-Visual Event Localization in Unconstrained Videos (ECCV 2018)
[Paper][Homepage]
4,143 10-second videos, 28 audio-visual events
- LLP: Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing (ECCV 2020)
[Paper][Homepage]
11,849 YouTube video clips, 25 event categories
- VGG-Sound: A large scale audio-visual dataset
[Paper][Homepage]
200k videos, 309 audio classes
- YouTube-ASMR-300K: Telling Left from Right: Learning Spatial Correspondence of Sight and Sound (CVPR 2020)
[Paper][Homepage]
300K 10-second video clips with spatial audio
- XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020)
[Paper][Homepage]
4,754 untrimmed videos
- VGG-SS: Localizing Visual Sounds the Hard Way (CVPR 2021)
[Paper][Homepage]
5K videos, 200 categories
- VoxCeleb: Large-scale speaker verification in the wild
[Paper][Homepage]
a million ‘real-world’ utterances, over 7,000 speakers
- EmoVoxCeleb: Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
[Paper][Homepage]
1,251 speakers
- Speech2Gesture: Learning Individual Styles of Conversational Gesture (CVPR 2019)
[Paper][Homepage]
144 hours of person-specific video, 10 speakers
- AVSpeech: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
[Paper][Homepage]
150,000 distinct speakers, 290k YouTube videos
- LRW: Lip Reading in the Wild (ACCV 2016)
[Paper][Homepage]
1,000 utterances of 500 different words
- LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild (FG 2019)
[Paper][Homepage]
718,018 video samples of 1,000 Mandarin words from 2,000+ individual speakers
- LRS2: Deep Audio-Visual Speech Recognition (TPAMI 2018)
[Paper][Homepage]
thousands of natural sentences from British television
- LRS3-TED: a large-scale dataset for visual speech recognition
[Paper][Homepage]
thousands of spoken sentences from TED and TEDx videos
- CMLR: A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading (ACM MM Asia 2019)
[Paper][Homepage]
102,072 spoken sentences from 11 speakers on China's national news program (CCTV)
- Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021)
[Paper][Homepage]
1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV
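
Most of the audiovisual datasets listed above ship as ordinary video files whose audio track carries the second modality. As a minimal sketch (the file path is a placeholder, not part of any dataset release), torchvision.io.read_video returns the visual frames and the audio waveform of a single clip separately:

```python
# Minimal sketch: splitting one downloaded clip into its visual and audio streams.
# The path is a placeholder; requires a video backend such as PyAV.
import torchvision

video, audio, info = torchvision.io.read_video(
    "clips/example_clip.mp4",  # placeholder path to a local audiovisual clip
    pts_unit="sec",            # interpret timestamps in seconds
)

print(video.shape)  # (T, H, W, C) uint8 frames
print(audio.shape)  # (channels, samples) waveform
print(info)         # e.g. {'video_fps': 30.0, 'audio_fps': 44100}
```
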
Repetition Counting
- QUVA Repetition: Real-World Repetition Estimation by Div, Grad and Curl (CVPR 2018)
[Paper][Homepage]
100 videos
- YTSegments: Live Repetition Counting (ICCV 2015)
[Paper][Homepage]
100 videos
- UCFRep: Context-Aware and Scale-Insensitive Temporal Repetition Counting (CVPR 2020)
[Paper][Homepage]
526 videos
- Countix: Counting Out Time: Class Agnostic Video Repetition Counting in the Wild (CVPR 2020)
[Paper][Homepage]
8,757 videos
- Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021)
[Paper][Homepage]
1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV
Video Indexing
- MediaMill: The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia
[Paper][Homepage]
manually annotated concept lexicon
Skill Determination
- EPIC-Skills: Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination (CVPR 2018)
[Paper][Homepage]
3 tasks, 113 videos, 1,000 pairwise ranking annotations
- BEST: The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos (CVPR 2019)
[Paper][Homepage]
5 tasks, 500 videos, 13,000 pairwise ranking annotations
Video Retrieval
- TRECVID Challenge: TREC Video Retrieval Evaluation
[Homepage]
sources: YFCC100M, Flickr, etc.
- Video Browser Showdown – The Video Retrieval Competition
[Homepage]
- TRECVID-VTT: TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval
[Paper][Homepage]
9,185 videos with captions
- V3C - A Research Video Collection
[Paper][Homepage]
7,475 Vimeo videos, 1,082,657 short video segments
- IACC: Creating a web-scale video collection for research
[Paper][Homepage]
4,600 Internet Archive videos
- TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)
[Paper][Homepage]
108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment
Single Object Tracking
- Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017)
[Paper][Homepage]
natural language descriptions of the target object
- OxUvA: Long-term Tracking in the Wild: A Benchmark (ECCV 2018)
[Paper][Homepage]
366 sequences spanning 14 hours of video
- LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking
[Paper][Homepage]
1,400 sequences with more than 3.5M frames, each frame is annotated with a bounding box
- TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild (ECCV 2018)
[Paper][Homepage]
30K videos with more than 14 million dense bounding box annotations, a new benchmark composed of 500 novel videos
- ALOV300+: Visual Tracking: An Experimental Survey (TPAMI 2014)
[Paper][Homepage][Dataset]
315 videos
- NUS-PRO: A New Visual Tracking Challenge (TPAMI 2015)
[Paper][Homepage]
365 image sequences
- UAV123: A Benchmark and Simulator for UAV Tracking (ECCV 2016)
[Paper][Homepage]
123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective
- OTB2013: Online Object Tracking: A Benchmark (CVPR 2013)
[Paper][Homepage]
50 video sequences
- OTB2015: Object Tracking Benchmark (TPAMI 2015)
[Paper][Homepage]
100 video sequences
- VOT Challenge
[Homepage]
Multiple Objects Tracking
- MOT Challenge
[Homepage]
- VisDrone: Vision Meets Drones: A Challenge
[Paper][Homepage]
- TAO: A Large-Scale Benchmark for Tracking Any Object
[Paper][Homepage]
2,907 videos, 833 classes, 17,287 tracks
- GMOT-40: A Benchmark for Generic Multiple Object Tracking
[Paper][Homepage]
40 carefully annotated sequences evenly distributed among 10 object categories
- BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning (CVPR 2020)
[Paper][Homepage]
100K videos and 10 tasks
Video Relation Detection
- KIEV: Interactivity Proposals for Surveillance Videos
[Paper][Homepage]
introduces a new task of spatio-temporal interactivity proposals
- ImageNet-VidVRD: Video Visual Relation Detection
[Paper][Homepage]
1,000 videos, 35 common subject/object categories and 132 relationships
- VidOR: Annotating Objects and Relations in User-Generated Videos
[Paper][Homepage]
10,000 videos selected from the YFCC100M collection, 80 object categories and 50 predicate categories
- Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks (CVPR 2020)
[Paper][Homepage]
annotations for 180,049 videos from the Something-Something Dataset
- Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020)
[Paper][Homepage]
10K videos, 0.4M objects, 1.7M visual relationships
- VidSitu: Visual Semantic Role Labeling for Video Understanding (CVPR 2021)
[Paper][Homepage]
29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds
Anomaly Detection
- XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020)
[Paper][Homepage]
4,754 untrimmed videos
- UCF-Crime: Real-world Anomaly Detection in Surveillance Videos
[Paper][Homepage]
1,900 videos
Pose Estimation
- YouTube Pose: Personalizing Human Video Pose Estimation (CVPR 2016)
[Paper][Homepage]
50 videos, 5,000 annotated frames