Awesome Video Datasets

Overview

  • Contributions are most welcome. If you have any suggestions or improvements, please create an issue or open a pull request.
  • Our group website: VIS Lab, University of Amsterdam.

Contents

Action Recognition

  • HOLLYWOOD2: Actions in Context (CVPR 2009)
    [Paper][Homepage]
    12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies

  • HMDB: A Large Video Database for Human Motion Recognition (ICCV 2011)
    [Paper][Homepage]
    51 classes, 7,000 clips

  • UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
    [Paper][Homepage]
    101 classes, 13k clips
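
Both HMDB51 and UCF101 above have ready-made loaders in torchvision. A minimal sketch, assuming the videos and the official train/test split files have already been downloaded to the placeholder paths shown:

```python
# Minimal sketch: sampling fixed-length clips from UCF101 with torchvision.
# The paths are placeholders; torchvision.datasets.HMDB51 works the same way.
from torchvision.datasets import UCF101

dataset = UCF101(
    root="data/UCF101/videos",             # directory containing the .avi files
    annotation_path="data/UCF101/splits",  # official train/test split files
    frames_per_clip=16,                    # clip length in frames
    step_between_clips=16,                 # stride between consecutive clips
    train=True,
)
video, audio, label = dataset[0]
print(video.shape, label)  # video is a (T, H, W, C) uint8 tensor
```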

  • Sports-1M: Large-scale Video Classification with Convolutional Neural Networks
    [Paper][Homepage]
    1,000,000 videos, 487 classes

  • ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding (CVPR 2015)
    [Paper][Homepage]
    203 classes, an average of 137 untrimmed videos per class, 1.41 activity instances per video

  • MPII-Cooking: Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data (IJCV 2015)
    [Paper][Homepage]
    67 fine-grained activities, 59 composite activities, 14,105 clips, 273 videos

  • Kinetics
    [Kinetics-400/Kinetics-600/Kinetics-700/Kinetics-700-2020] [Homepage]
    400/600/700/700 classes, at least 400/600/600/700 clips per class
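
Kinetics is distributed as annotation files plus YouTube identifiers rather than raw videos. A hedged sketch of parsing one of the commonly distributed CSVs (the "label,youtube_id,time_start,time_end,split" column layout is an assumption — verify it against the release you download):

```python
# Sketch: iterating over a Kinetics annotation CSV and reconstructing the
# YouTube URL and 10-second clip boundaries for each entry.
import csv

with open("kinetics400/train.csv") as f:
    for row in csv.DictReader(f):
        url = f"https://www.youtube.com/watch?v={row['youtube_id']}"
        start = float(row["time_start"])
        end = float(row["time_end"])
        print(row["label"], url, start, end)
        break  # remove to process the full file
```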

  • Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
    [Paper][Homepage]
    9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes

  • Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018)
    [Paper][Homepage]
    112 people, 4000 paired videos, 157 action classes

  • 20BN-jester: The Jester Dataset: A Large-Scale Video Dataset of Human Gestures (ICCVW 2019)
    [Paper][Homepage]
    148,092 videos, 27 classes, 1376 actors

  • Moments in Time Dataset: one million videos for event understanding (TPAMI 2019)
    [Paper][Homepage]
    over 1,000,000 labeled videos for 339 Moment classes; the average number of labeled videos per class is 1,757, with a median of 2,775

  • Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
    [Paper][Homepage]
    1.02 million videos, 313 action classes, 553,535 videos are annotated with more than one label and 257,491 videos are annotated with three or more labels

  • 20BN-SOMETHING-SOMETHING: The "something something" video database for learning and evaluating visual common sense
    [Paper][Homepage]
    100,000 videos across 174 classes

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes
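
The EPIC-KITCHENS action segments ship as CSV annotations. A hedged sketch, assuming the column names used in the public epic-kitchens-100-annotations repository (check them against the release you use):

```python
# Sketch: reading EPIC-KITCHENS-100 action segments with pandas.
import pandas as pd

train = pd.read_csv("epic-kitchens-100-annotations/EPIC_100_train.csv")
seg = train.iloc[0]
# Each row pairs a temporal segment with a verb class and a noun class.
print(seg["video_id"], seg["start_frame"], seg["stop_frame"],
      seg["verb_class"], seg["noun_class"])
```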

  • HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021)
    [Paper][Homepage]
    27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships

  • MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding (ICCV 2019)
    [Paper][Homepage]
    36k video clips, 37 action classes, modalities: RGB + keypoints + accelerometer + gyroscope + orientation + Wi-Fi + pressure

  • LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities (ECCV 2020)
    [Paper][Homepage]
    RGB-D, 641 action classes, 11,781 action segments, 4.6M frames

  • NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis (CVPR 2016, TPAMI 2019)
    [Paper][Homepage]
    120 action classes, 106 distinct subjects, more than 114,000 video samples, 8 million frames

  • Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020)
    [Paper][Homepage]
    10K videos, 0.4M objects, 1.7M visual relationships

  • TITAN: Future Forecast using Action Priors (CVPR 2020)
    [Paper][Homepage]
    700 labeled video-clips, 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes

  • PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding (ACM Multimedia Workshop)
    [Paper][Homepage]
    1,076 long video sequences, 51 action categories performed by 66 subjects in three camera views, 20,000 action instances, 5.4 million frames, modalities: RGB + depth + infrared + skeleton

  • HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
    [Paper][Homepage]
    HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories

  • Oops!: Predicting Unintentional Action in Video (CVPR 2020)
    [Paper][Homepage]
    20,338 videos, 7,368 annotated for training, 6,739 annotated for testing

  • RareAct: A video dataset of unusual interactions
    [Paper][Homepage]
    122 different actions, 7,607 clips, 905 videos, 19 verbs, 38 nouns

  • FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (CVPR 2020)
    [Paper][Homepage]
    10 event categories, including 6 male events and 4 female events, 530 element categories

  • THUMOS: The THUMOS challenge on action recognition for videos “in the wild”
    [Paper][Homepage]
    101 actions; train: 13,000 temporally trimmed videos; validation: 2,100 temporally untrimmed videos with temporal annotations of actions; background: 3,000 relevant videos; test: 5,600 temporally untrimmed videos with withheld ground truth

  • MultiTHUMOS: Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos (IJCV 2017)
    [Paper][Homepage]
    400 videos, 38,690 annotations of 65 action classes, 10.5 action classes per video

  • Hierarchical Action Search: Searching for Actions on the Hyperbole (CVPR 2020)
    [Paper][Homepage]
    Hierarchical-ActivityNet, Hierarchical-Kinetics, and Hierarchical-Moments, derived from ActivityNet, mini-Kinetics, and Moments-in-Time; provides action hierarchies and action splits for unseen action search

Video Classification

  • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis (CVPR 2019)
    [Paper][Homepage]
    11,827 videos, 180 tasks, 12 domains, 46,354 annotated segments

  • VideoLT: Large-scale Long-tailed Video Recognition
    [Paper][Homepage]
    256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution

  • Youtube-8M: A Large-Scale Video Classification Benchmark
    [Paper][Homepage]
    8,000,000 videos, 4000 visual entities
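
YouTube-8M is released as TFRecord files of precomputed features rather than raw videos. A hedged sketch for the video-level records, assuming the feature names used by the official starter code ('id', 'labels', 'mean_rgb', 'mean_audio'):

```python
# Sketch: parsing YouTube-8M video-level TFRecords with TensorFlow.
import tensorflow as tf

features = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),   # video-level visual feature
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),  # video-level audio feature
}
for record in tf.data.TFRecordDataset("train0000.tfrecord").take(1):
    ex = tf.io.parse_single_example(record, features)
    print(ex["id"].numpy(), tf.sparse.to_dense(ex["labels"]).numpy())
```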

  • HVU: Large Scale Holistic Video Understanding (ECCV 2020)
    [Paper][Homepage]
    572k videos in total with 9 million annotations for the training, validation, and test sets, spanning 3,142 labels; semantic aspects defined over categories of scenes, objects, actions, events, attributes, and concepts

  • VLOG: From Lifestyle Vlogs to Everyday Interactions (CVPR 2018)
    [Paper][Homepage]
    114K video clips, 10.7K participants, Annotations: Hand/Semantic Object, Hand Contact State, Scene Classification

  • EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video
    [Paper][Homepage]
    Each video is annotated at 6 Hz with 15 continuous evoked expression labels, 36.7 million annotations of viewer facial reactions to 23,574 videos (1,700 hours)

Egocentric View

  • Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018)
    [Paper][Homepage]
    112 people, 4000 paired videos, 157 action classes

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021)
    [Paper][Homepage]
    27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships

Video Object Segmentation

  • DAVIS: A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation (CVPR 2016)
    [Paper][Homepage]
    50 sequences, 3455 annotated frames
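
DAVIS scores region similarity J, the intersection-over-union between predicted and ground-truth masks (the official toolkit additionally reports boundary accuracy F). A minimal sketch of J:

```python
import numpy as np

def mask_iou(pred, gt):
    """Region similarity J between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:           # both masks empty: count as a perfect match
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```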

  • SegTrack v2: Video Segmentation by Tracking Many Figure-Ground Segments (ICCV 2013)
    [Paper][Homepage]
    1,000 frames with pixel-level annotations

  • UVO: Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation
    [Paper][Homepage]
    1200 videos, 108k frames, 12.29 objects per video

  • VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild (CVPR 2021)
    [Paper][Homepage]
    3,536 videos, 251,632 pixel-level labeled frames, 124 categories; pixel-level annotations are provided at 15 fps, with a complete shot lasting 5 seconds on average

Object Detection

  • ImageNet VID
    [Paper][Homepage]
    30 categories, train: 3,862 video snippets, validation: 555 snippets

  • YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
    [Paper][Homepage]
    380,000 video segments about 19s long, 5.6M bounding boxes, 23 types of objects

  • Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations (CVPR 2021)
    [Paper][Homepage]
    15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes; manually annotated 3D bounding boxes for each object

  • Water detection through spatio-temporal invariant descriptors
    [Paper][Dataset]
    260 videos

Dynamic Texture Classification

  • Dynamic Texture: A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding (ECCV 2018)
    [Paper][Homepage]
    over 10,000 videos

  • YUVL: Spacetime Texture Representation and Recognition Based on a Spatiotemporal Orientation Analysis (TPAMI 2012)
    [Paper][Homepage][Dataset]
    610 spacetime texture samples

  • UCLA: Dynamic Texture Recognition (CVPR 2001)
    [Paper][Dataset]
    76 dynamic textures

Group Activity Recognition

  • Volleyball: A Hierarchical Deep Temporal Model for Group Activity Recognition
    [Paper][Homepage]
    4,830 clips, 8 group activity classes, 9 individual actions

  • Collective: What are they doing?: Collective activity classification using spatio-temporal relationship among people
    [Paper][Homepage]
    5 different collective activities, 44 clips

Movie

  • HOLLYWOOD2: Actions in Context (CVPR 2009)
    [Paper][Homepage]
    12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies

  • HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do
    [Paper][Homepage]
    10 movies from Romance, Drama, Fantasy, Adventure, Comedy

  • MPII-MD: A Dataset for Movie Description
    [Paper][Homepage]
    94 videos, 68,337 clips, 68,375 descriptions

  • MovieNet: A Holistic Dataset for Movie Understanding (ECCV 2020)
    [Paper][Homepage]
    1,100 movies, 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style

  • MovieQA: Story Understanding Benchmark (CVPR 2016)
    [Paper][Homepage]
    14,944 questions, 408 movies

  • MovieGraphs: Towards Understanding Human-Centric Situations from Videos (CVPR 2018)
    [Paper][Homepage]
    7,637 movie clips, 51 movies, annotations: scene, situation, description, graph (Character, Attributes, Relationship, Interaction, Topic, Reason, Time stamp)

  • Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020)
    [Paper][Homepage]
    33,976 captioned clips from 3,605 movies, 400K+ face-tracks, 8K+ labelled characters, 20K+ subtitles, densely pre-extracted features for each clip (RGB, Motion, Face, Subtitles, Scene)

360 Videos

  • Pano2Vid: Automatic Cinematography for Watching 360° Videos (ACCV 2016)
    [Paper][Homepage]
    20 of the 86 360° videos are labeled for testing; 9,171 normal-field-of-view videos captured by humans for training; topics: Soccer, Mountain Climbing, Parade, and Hiking

  • Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos (CVPR 2017)
    [Paper][Homepage]
    342 360° videos, topics: basketball, parkour, BMX, skateboarding, and dance

  • YT-ALL: Self-Supervised Generation of Spatial Audio for 360° Video (NeurIPS 2018)
    [Paper][Homepage]
    1,146 videos, half of the videos are live music performances

  • YT360: Learning Representations from Audio-Visual Spatial Alignment (NeurIPS 2020)
    [Paper][Homepage]
    topics: musical performances, vlogs, sports, and others

Activity Localization

  • Hollywood2Tubes: Spot On: Action Localization from Pointly-Supervised Proposals
    [Paper][Dataset]
    train: 823 videos, 1,026 action instances, 16,411 annotations; test: 884 videos, 1,086 action instances, 15,835 annotations

  • DALY: Human Action Localization with Sparse Spatial Supervision
    [Paper][Homepage]
    10 actions, 3.3M frames, 8,133 clips

  • Action Completion: A temporal model for Moment Detection (BMVC 2018)
    [Paper][Homepage]
    completion moments of 16 actions from three datasets: HMDB, UCF101, RGBD-AC

  • RGBD-Action-Completion: Beyond Action Recognition: Action Completion in RGB-D Data (BMVC 2016)
    [Paper][Homepage]
    414 complete/incomplete object interaction sequences, spanning six actions and captured using an RGB-D camera

  • AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
    [Paper][Homepage]
    80 atomic visual actions in 430 15-minute video clips, 1.58M action labels with multiple labels per person occurring frequently

  • AVA-Kinetics: The AVA-Kinetics Localized Human Actions Video Dataset
    [Paper][Homepage]
    230k clips, 80 AVA action classes

  • HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
    [Paper][Homepage]
    HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories

  • CommonLocalization: Localizing the Common Action Among a Few Videos (ECCV 2020)
    [Paper][Homepage]
    few-shot common action localization; revised ActivityNet 1.3 and THUMOS14

  • CommonSpaceTime: Few-Shot Transformation of Common Actions into Time and Space (CVPR 2021)
    [Paper][Homepage]
    revised AVA and UCF101-24

  • MUSES: Multi-shot Temporal Event Localization: a Benchmark (CVPR 2021)
    [Paper][Homepage]
    3,697 videos, 31,477 event instances, 716 video hours, 25 categories, on average 19 shots per instance and 176 shots per video

  • MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity (WACV 2021)
    [Paper][Homepage]
    144 hours annotated for 37 activity types, with bounding boxes marked for actors and props; 38 RGB and thermal IR cameras

  • TVSeries: Online Action Detection (ECCV 2016)
    [Paper][Homepage]
    27 episodes from 6 popular TV series, 30 action classes, 6,231 action instances
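
Benchmarks in this section are typically scored with mean average precision at temporal intersection-over-union (tIoU) thresholds. A minimal, generic sketch of the core quantity (not any benchmark's official evaluation code):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 7.0), (4.0, 9.0)))  # 3 / 7 ≈ 0.43
```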

Video Captioning

  • VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events
    [Paper][Homepage]
    45,826 videos and their descriptions obtained by harvesting YouTube

  • MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
    [Paper][Homepage]
    10K web video clips, 200K clip-sentence pairs

  • VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019)
    [Paper][Homepage]
    41,250 videos, 825,000 captions in both English and Chinese, over 206,000 English-Chinese parallel translation pairs

  • ActivityNet Captions: Dense-Captioning Events in Videos (ICCV 2017)
    [Paper][Homepage]
    20k videos, 100k sentences

  • ActivityNet Entities: Grounded Video Description
    [Paper][Homepage]
    14,281 annotated videos, 52k video segments with at least one noun phrase annotated per segment; augments the ActivityNet Captions dataset with 158k bounding boxes

  • VTW: Title Generation for User Generated Videos (ECCV 2016)
    [Paper][Homepage]
    18,100 video clips with an average duration of 1.5 minutes per clip

  • Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
    [Paper][Homepage]
    9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes

Video and Language

  • Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017)
    [Paper][Homepage]
    natural language descriptions of the target object

  • MPII-MD: A Dataset for Movie Description
    [Paper][Homepage]
    94 videos, 68,337 clips, 68,375 descriptions

  • Narrated Instruction Videos: Unsupervised Learning from Narrated Instruction Videos
    [Paper][Homepage]
    150 videos, 800,000 frames, five tasks: Making a coffee, Changing car tire, Performing cardiopulmonary resuscitation (CPR), Jumping a car and Repotting a plant

  • YouCook: A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching (CVPR 2013)
    [Paper][Homepage]
    88 YouTube cooking videos

  • HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)
    [Paper][Homepage]
    136 million video clips sourced from 1.22M narrated instructional web videos, 23k different visual tasks

  • How2: A Large-scale Dataset for Multimodal Language Understanding (NeurIPS 2018)
    [Paper][Homepage]
    80,000 clips, word-level time alignments to the ground-truth English subtitles

  • Breakfast: The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities
    [Paper][Homepage]
    52 participants, 10 distinct cooking activities captured in 18 different kitchens, 48 action classes, 11,267 clips

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • YouCook2: YouCookII Dataset
    [Paper][Homepage]
    2,000 long untrimmed videos, 89 cooking recipes; each recipe includes 5 to 16 steps, and each step is described with one sentence

  • QuerYD: A video dataset with textual and audio narrations (ICASSP 2021)
    [Paper][Homepage]
    1,400+ narrators, 200+ video hours, 70+ description hours

  • VIOLIN: A Large-Scale Dataset for Video-and-Language Inference (CVPR 2020)
    [Paper][Homepage]
    95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video

  • CrossTask: weakly supervised learning from instructional videos (CVPR 2019)
    [Paper][Homepage]
    4.7K videos, 83 tasks

Action Segmentation

  • A2D: Can Humans Fly? Action Understanding with Multiple Classes of Actors (CVPR 2015)
    [Paper][Homepage]
    3,782 videos, actors: adult, baby, bird, cat, dog, ball and car, actions: climbing, crawling, eating, flying, jumping, rolling, running, and walking

  • J-HMDB: Towards understanding action recognition (ICCV 2013)
    [Paper][Homepage]
    31,838 annotated frames, 21 categories involving a single person in action: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, wave

  • A2D Sentences & J-HMDB Sentences: Actor and Action Video Segmentation from a Sentence (CVPR 2018)
    [Paper][Homepage]
    A2D Sentences: 6,656 sentences, including 811 different nouns, 225 verbs and 189 adjectives, J-HMDB Sentences: 928 sentences, including 158 different nouns, 53 verbs and 23 adjectives

Audiovisual Learning

  • Audio Set: An ontology and human-labeled dataset for audio events (ICASSP 2017)
    [Paper][Homepage]
    632 audio event classes, 2,084,320 human-labeled 10-second sound clips
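
The Audio Set segment lists are plain CSV files of (YouTube ID, start, end, label set); header lines begin with '#'. A short parsing sketch, assuming the official balanced_train_segments.csv layout:

```python
# Sketch: parsing an Audio Set segments CSV into (id, start, end, labels).
import csv

with open("balanced_train_segments.csv") as f:
    rows = [r for r in csv.reader(f, skipinitialspace=True)
            if r and not r[0].startswith("#")]
for ytid, start, end, labels in rows[:3]:
    # labels is a comma-separated list of ontology machine IDs, e.g. /m/09x0r
    print(ytid, float(start), float(end), labels.split(","))
```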

  • MUSIC: The Sound of Pixels (ECCV 2018)
    [Paper][Homepage]
    685 untrimmed videos, 11 instrument categories

  • AudioSet ZSL: Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos (WACV 2020)
    [Paper][Homepage]
    33 classes, 156,416 videos

  • Kinetics-Sound: Look, Listen and Learn (ICCV 2017)
    [Paper]
    34 action classes from Kinetics

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • SoundNet: Learning Sound Representations from Unlabeled Video (NIPS 2016)
    [Paper][Homepage]
    2+ million videos

  • AVE: Audio-Visual Event Localization in Unconstrained Videos (ECCV 2018)
    [Paper][Homepage]
    4,143 10-second videos, 28 audio-visual events

  • LLP: Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing (ECCV 2020)
    [Paper][Homepage]
    11,849 YouTube video clips, 25 event categories

  • VGG-Sound: A large scale audio-visual dataset
    [Paper][Homepage]
    200k videos, 309 audio classes

  • YouTube-ASMR-300K: Telling Left from Right: Learning Spatial Correspondence of Sight and Sound (CVPR 2020)
    [Paper][Homepage]
    300K 10-second video clips with spatial audio

  • XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020)
    [Paper][Homepage]
    4,754 untrimmed videos

  • VGG-SS: Localizing Visual Sounds the Hard Way (CVPR 2021)
    [Paper][Homepage]
    5K videos, 200 categories

  • VoxCeleb: Large-scale speaker verification in the wild
    [Paper][Homepage]
    a million ‘real-world’ utterances, over 7000 speakers

  • EmoVoxCeleb: Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
    [Paper][Homepage]
    1,251 speakers

  • Speech2Gesture: Learning Individual Styles of Conversational Gesture (CVPR 2019)
    [Paper][Homepage]
    144 hours of person-specific videos, 10 speakers

  • AVSpeech: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    [Paper][Homepage]
    150,000 distinct speakers, 290k YouTube videos

  • LRW: Lip Reading in the Wild (ACCV 2016)
    [Paper][Homepage]
    1000 utterances of 500 different words

  • LRW-1000: LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild (FG 2019)
    [Paper][Homepage]
    718,018 video samples of 1,000 Mandarin words from 2,000+ individual speakers

  • LRS2: Deep Audio-Visual Speech Recognition (TPAMI 2018)
    [Paper][Homepage]
    Thousands of natural sentences from British television

  • LRS3-TED: a large-scale dataset for visual speech recognition
    [Paper][Homepage]
    thousands of spoken sentences from TED and TEDx videos

  • CMLR: A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading (ACM MM Asia 2019)
    [Paper][Homepage]
    102,072 spoken sentences from 11 speakers on China's national news program (CCTV)

  • Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021)
    [Paper][Homepage]
    1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV

Repetition Counting

  • QUVA Repetition: Real-World Repetition Estimation by Div, Grad and Curl (CVPR 2018)
    [Paper][Homepage]
    100 videos

  • YTSegments: Live Repetition Counting (ICCV 2015)
    [Paper][Homepage]
    100 videos

  • UCFRep: Context-Aware and Scale-Insensitive Temporal Repetition Counting (CVPR 2020)
    [Paper][Homepage]
    526 videos

  • Countix: Counting Out Time: Class Agnostic Video Repetition Counting in the Wild (CVPR 2020)
    [Paper][Homepage]
    8,757 videos

  • Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021)
    [Paper][Homepage]
    1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV
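
A common class-agnostic baseline behind these benchmarks is spectral analysis of a frame-level motion signal. A toy illustration on a synthetic 1-D signal (illustrative only, not any paper's method):

```python
import numpy as np

# Estimate the repetition count of a periodic signal from its FFT peak.
fps, seconds, reps = 30, 10, 8
t = np.arange(fps * seconds) / fps
signal = np.sin(2 * np.pi * (reps / seconds) * t)  # 8 repetitions in 10 s

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(signal.size, d=1 / fps)
dominant = freqs[spectrum.argmax()]               # dominant frequency in Hz
print(round(dominant * seconds))                  # estimated count: 8
```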

Video Indexing

  • MediaMill: The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia
    [Paper][Homepage]
    manually annotated concept lexicon

Skill Determination

  • EPIC-Skills: Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination (CVPR 2018)
    [Paper][Homepage]
    3 tasks, 113 videos, 1000 pairwise ranking annotations

  • BEST: The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos (CVPR 2019)
    [Paper][Homepage]
    5 tasks, 500 videos, 13000 pairwise ranking annotations

Video Retrieval

  • TRECVID Challenge: TREC Video Retrieval Evaluation
    [Homepage]
    sources: YFCC100M, Flickr, etc.

  • Video Browser Showdown – The Video Retrieval Competition
    [Homepage]

  • TRECVID-VTT: TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval
    [Paper][Homepage]
    9185 videos with captions

  • V3C - A Research Video Collection
    [Paper][Homepage]
    7475 Vimeo videos, 1,082,657 short video segments

  • IACC: Creating a web-scale video collection for research
    [Paper][Homepage]
    4600 Internet Archive videos

  • TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)
    [Paper][Homepage]
    108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment
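
Retrieval benchmarks in this section are commonly reported as Recall@K over a query-video similarity matrix. A generic sketch, assuming queries and videos are paired along the diagonal (illustrative only):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth video ranks in the top k."""
    ranks = (-sim).argsort(axis=1)          # best-matching videos first
    gt = np.arange(sim.shape[0])[:, None]   # ground truth on the diagonal
    return float((ranks[:, :k] == gt).any(axis=1).mean())

sim = np.random.default_rng(0).normal(size=(100, 100))
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```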

Single Object Tracking

  • Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017)
    [Paper][Homepage]
    natural language descriptions of the target object

  • OxUvA: Long-term Tracking in the Wild: A Benchmark (ECCV 2018)
    [Paper][Homepage]
    366 sequences spanning 14 hours of video

  • LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking
    [Paper][Homepage]
    1,400 sequences with more than 3.5M frames, each frame is annotated with a bounding box

  • TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild (ECCV 2018)
    [Paper][Homepage]
    30K videos with more than 14 million dense bounding box annotations, a new benchmark composed of 500 novel videos

  • ALOV300+: Visual Tracking: An Experimental Survey (TPAMI 2014)
    [Paper][Homepage][Dataset]
    315 videos

  • NUS-PRO: A New Visual Tracking Challenge (TPAMI 2015)
    [Paper][Homepage]
    365 image sequences

  • UAV123: A Benchmark and Simulator for UAV Tracking (ECCV 2016)
    [Paper][Homepage]
    123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective

  • OTB2013: Online Object Tracking: A Benchmark (CVPR 2013)
    [Paper][Homepage]
    50 video sequences

  • OTB2015: Object Tracking Benchmark (TPAMI 2015)
    [Paper][Homepage]
    100 video sequences
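
OTB-style benchmarks store one ground-truth box per frame as "x,y,w,h" lines in a groundtruth_rect.txt file (some sequences use tabs instead of commas), and success plots are built from box overlap. A hedged loading-and-scoring sketch:

```python
import re

def load_boxes(path):
    """Read one (x, y, w, h) box per line, tolerating comma/tab separators."""
    with open(path) as f:
        return [tuple(map(float, re.split(r"[,\t ]+", line.strip())))
                for line in f if line.strip()]

def box_iou(a, b):
    """Overlap between two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)
```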

  • VOT Challenge
    [Homepage]

Multiple Objects Tracking

  • MOT Challenge
    [Homepage]

  • VisDrone: Vision Meets Drones: A Challenge
    [Paper][Homepage]

  • TAO: A Large-Scale Benchmark for Tracking Any Object
    [Paper][Homepage]
    2,907 videos, 833 classes, 17,287 tracks

  • GMOT-40: A Benchmark for Generic Multiple Object Tracking
    [Paper][Homepage]
    40 carefully annotated sequences evenly distributed among 10 object categories

  • BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning (CVPR 2020)
    [Paper][Homepage]
    100K videos and 10 tasks

Video Relation Detection

  • KIEV: Interactivity Proposals for Surveillance Videos
    [Paper][Homepage]
    introduces a new task: spatio-temporal interactivity proposals

  • ImageNet-VidVRD: Video Visual Relation Detection
    [Paper][Homepage]
    1,000 videos, 35 common subject/object categories and 132 relationships

  • VidOR: Annotating Objects and Relations in User-Generated Videos
    [Paper][Homepage]
    10,000 videos selected from YFCC100M collection, 80 object categories and 50 predicate categories

  • Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks (CVPR 2020)
    [Paper][Homepage]
    annotations for 180,049 videos from the Something-Something dataset

  • Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020)
    [Paper][Homepage]
    10K videos, 0.4M objects, 1.7M visual relationships

  • VidSitu: Visual Semantic Role Labeling for Video Understanding (CVPR 2021)
    [Paper][Homepage]
    29K 10-second movie clips richly annotated with a verb and semantic-roles every 2 seconds

Anomaly Detection

  • XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020)
    [Paper][Homepage]
    4,754 untrimmed videos

  • UCF-Crime: Real-world Anomaly Detection in Surveillance Videos
    [Paper][Homepage]
    1,900 videos

Pose Estimation

  • YouTube Pose: Personalizing Human Video Pose Estimation (CVPR 2016)
    [Paper][Homepage]
    50 videos, 5,000 annotated frames

Physics

  • Real-world Flag & FlagSim: Cloth in the Wind: A Case Study of Physical Measurement through Simulation (CVPR 2020)
    [Paper][Homepage]
    Real-world Flag: 2.7K training and 1.3K test video clips; FlagSim: 1,000 mesh sequences, 14,000 training examples