Facestar dataset. High quality audio-visual recordings of human conversational speech.

Meta Research

Last update: Dec 21, 2022

Related tags

Deep Learning facestar

Overview

Facestar Dataset

Description

Existing audio-visual datasets for human speech are either captured in a clean, controlled environment but contain only a small amount of speech data without natural conversations, or are collected in-the-wild with unreliable audio quality, interfering sounds, low face resolution, and unreliable or occluded lip motion.

The Facestar dataset aims to enable research on audio-visual modeling in a large-scale and high-quality setting. Core dataset features:

10 hours of high-quality audio-visual speech data
audio recordings in a quiet environment at 16kHz
video of resolution 1300 x 1600 at 60 frames per second
one female and one male speaker
natural speech: all data is conversational speech in a video-conferencing setup
full face visibility: speakers are facing the camera while talking

See the paper for more details. If you use the dataset, please cite

@inproceedings{yang2022audiovisual,
  title={Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis},
  author={Yang, Karren and Markovic, Dejan and Krenn, Steven and Agrawal, Vasu and Richard, Alexander},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Download

The dataset is partitioned into a trainset and a testset for each speaker. Within each partition, there are several sessions, each of which is further subdivided into cuts of about 30 seconds. For each session, the videos are provided without sound as sessionXX_cutYY.mp4 and the corresponding audio is provided in the wave file sessionXX_cutYY.wav.

Automatic Download

The download.py script automatically downloads the complete dataset and unzips the sessions. Needs to be run with python3. The complete dataset is about 30GB in size.

Manual Download

If you do not need the full dataset but only selected sessions, you can download them manually:

female speaker trainset (55 sessions)
female speaker testset (7 sessions)
male speaker trainset (45 sessions)
male speaker testset (5 sessions)

License

The dataset is released under CC-NC 4.0 International license.

You might also like...

Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment (ICCV2021)

Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment This is a pytorch project for the paper Seeing Dynamic Scene i

21 Nov 28, 2022

Hand gesture recognition based whiteboard that allows you to write on live webcam. This is the first version and has features like 4 different colors, eraser and a recording option that records your session and saves it in a "recordings" folder. Use index finger to draw and two or more fingers to move around and select items. Future version will contain more functionalities like changeable thickness, color palette, integration with zoom and google meet etc.

hand-write Hand gesture recognition based whiteboard that allows you to write on live webcam. This is the first version and has features like 4 differ

27 Dec 16, 2022

Real-time analysis of intracranial neurophysiology recordings.

py_neuromodulation Click this button to run the "Tutorial ML with py_neuro" notebooks: The py_neuromodulation toolbox allows for real time capable pro

Interventional Cognitive Neuromodulation - Neumann Lab Berlin

15 Nov 3, 2022

Experimenting with computer vision techniques to generate annotated image datasets from gameplay recordings automatically.

Experimenting with computer vision techniques to generate annotated image datasets from gameplay recordings automatically. The collected data will then be used to train a deep neural network that can detect enemy player models in real time, during gameplay. Finally, a virtual input device will adjust the player's crosshair based on live detections for greater accuracy.

3 Apr 24, 2022

Implementation of Common Image Evaluation Metrics by Sayed Nadim (sayednadim.github.io). The repo is built based on full reference image quality metrics such as L1, L2, PSNR, SSIM, LPIPS. and feature-level quality metrics such as FID, IS. It can be used for evaluating image denoising, colorization, inpainting, deraining, dehazing etc. where we have access to ground truth.

Image Quality Evaluation Metrics Implementation of some common full reference image quality metrics. The repo is built based on full reference image q

10 Jan 1, 2023

Official Implementation and Dataset of "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency", CVPR 2021

Portrait Photo Retouching with PPR10K Paper | Supplementary Material PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask an

184 Dec 11, 2022

Dataset used in "PlantDoc: A Dataset for Visual Plant Disease Detection" accepted in CODS-COMAD 2020

PlantDoc: A Dataset for Visual Plant Disease Detection This repository contains the Cropped-PlantDoc dataset used for benchmarking classification mode

109 Dec 29, 2022

A Python framework for conversational search

Chatty Goose Multi-stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting Installation Ma

36 Oct 23, 2022

This is the repo for our work "Towards Persona-Based Empathetic Conversational Models" (EMNLP 2020)

Towards Persona-Based Empathetic Conversational Models (PEC) This is the repo for our work "Towards Persona-Based Empathetic Conversational Models" (E

35 Nov 17, 2022

Releases(paper_materials)

Owner

Meta Research

GitHub

Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN)

Flickr-Faces-HQ Dataset (FFHQ) Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative

2.9k Dec 28, 2022

Music source separation is a task to separate audio recordings into individual sources

Music Source Separation Music source separation is a task to separate audio recordings into individual sources. This repository is an PyTorch implmeme

958 Jan 3, 2023

a dnn ai project to classify which food people are eating on audio recordings

Deep Learning - EAT Challenge About This project is part of an AI challenge of the DeepLearning course 2021 at the University of Augsburg. The objecti

1 Oct 24, 2021

A two-stage U-Net for high-fidelity denoising of historical recordings

A two-stage U-Net for high-fidelity denoising of historical recordings Official repository of the paper (not submitted yet): E. Moliner and V. Välimäk

57 Jan 5, 2023

PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

Maria: A Visual Experience Powered Conversational Agent This repository is the Pytorch implementation of our paper "Maria: A Visual Experience Powered

22 Dec 12, 2022

A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

About This repository provides data and code for the paper: Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (subm

86 Dec 7, 2022

Facestar dataset. High quality audio-visual recordings of human conversational speech.

Related tags

Overview

Facestar Dataset

Description

Download

Automatic Download

Manual Download

License

You might also like...

Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment (ICCV2021)

Real-time analysis of intracranial neurophysiology recordings.

Experimenting with computer vision techniques to generate annotated image datasets from gameplay recordings automatically.

Official Implementation and Dataset of "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency", CVPR 2021

Dataset used in "PlantDoc: A Dataset for Visual Plant Disease Detection" accepted in CODS-COMAD 2020

A Python framework for conversational search

This is the repo for our work "Towards Persona-Based Empathetic Conversational Models" (EMNLP 2020)

Releases(paper_materials)

paper_materials(Mar 28, 2022)

male_speaker_trainset(Mar 25, 2022)

male_speaker_testset(Mar 25, 2022)

female_speaker_trainset(Mar 25, 2022)

female_speaker_testset(Mar 25, 2022)

Owner

Meta Research

Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN)

Music source separation is a task to separate audio recordings into individual sources

a dnn ai project to classify which food people are eating on audio recordings

A two-stage U-Net for high-fidelity denoising of historical recordings

PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Python codes for Lite Audio-Visual Speech Enhancement.

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation