Contains links to publicly available datasets for modeling health outcomes using speech and language.

Tuka Alhanai

Last update: Dec 7, 2022

speech-nlp-datasets

Contains links to publicly available datasets for modeling various health outcomes using speech and language.

Speech-based Corpora

[Corpus] Speech Database of Typical Children and Children with SLI
Contains 103 children that are native Czech speakers with specific language impairment. (Grill et al., 2016)
[Corpus] mPower Study, Parkinson's Disease Data
Contains audio recordings of 800+ subjects with Parkinson's disease (plus controls) performing a structured mobile-phone-based test composed of voice, walking, tapping, and memory tasks. The data collection study was performed by Bot et al. (2016).
[Corpus] Distress Analysis Interview Corpus
Contains 189 20-minute interviews of individuals speaking to a virtual agent. The corpus contains binary and multi-class labels for the severity of depression. The dataset contains audio recordings and features, text transcripts, and facial features. The corpus was developed by Gratch et al. (2014) and featured in the Audio Visual Emotion Challenge (AVEC) in 2016 and 2017.
[Corpus] Oxford LSVT Voice Rehabilitation Data Set
Contains 14 subjects with Parkinson's disease, used to evaluate whether voice rehabilitation improves phonation. (Tsanas et al., 2014)
Spanish Parkinson Corpus (contact authors for corpus?)
Contains 50 Spanish-speaking subjects with varying severity of Parkinson's disease. The corpus was first presented by Arroyave et al. (2014) and subsequently featured in the Interspeech 2015 Computational Paralinguistics Challenge.
[Corpus] Parkinson Speech Dataset with Multiple Types of Sound Recordings Data Set
Contains audio recordings from 40 subjects in Turkey (including 20 controls) producing sounds according to a transcript (sustained vowels, numbers, short sentences, words). (Sakar et al., 2013)
[Corpus] Mobile Device Voice Recordings at King's College London (MDVR-KCL) from both early and advanced Parkinson's disease patients and healthy controls
(Jaeger et al., 2019, doi:10.5281/zenodo.2867216)
[Corpus] Dem@Care
Contains audio, video, and physiological signals of Greek dementia patients recorded in the lab or at home. (Factsheet)
[Corpus] TORGO Database
Contains speech and articulatory data on 7 subjects with Cerebral Palsy or Amyotrophic Lateral Sclerosis. (Rudzicz et al., 2010)
Child Pathological Speech Database (CPSD) (contact authors for corpus?)
Contains speech recordings from 99 children on the autism spectrum or with a language impairment (specific or not).
The corpus was originally described by Ringeval et al. (2010) and was also made available for the Interspeech 2013 Computational Paralinguistics Challenge.
[Corpus] Oxford Parkinson's Telemonitoring Dataset
Contains monitoring data from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. (Tsanas et al., 2009)
[Corpus] Oxford Parkinson Dataset
Contains recordings from 31 individuals. (Little et al., 2007)
[Corpus] Saarbruecken Voice Database
A collection of speech recordings from more than 2,000 people following a transcript of pronouncing vowels and a sentence. Each recording has an associated EGG (electroglottography) signal. A subset of the speakers have a pathology (e.g. laryngitis, Parkinson's disease). Citation: Barry, W. J., & Pützer, M. (2007). Saarbrucken voice database. Institute of Phonetics, Universität des Saarlandes, http://www.stimmdatenbank.coli.uni-saarland.de.
Example work: Martínez et al., 2012
[Corpus] ALS Voice Data Set
Contains voice recordings of 54 speakers: 39 healthy speakers (23 males, 16 females) and 15 ALS patients with signs of bulbar dysfunction (6 males, 9 females). (Vashkevich et al., 2019)

TalkBank Project

[Corpus] CHILDES Database
Contains speech of children with different conditions (e.g. Autism, Down's syndrome, hearing impairment) and across different languages (e.g. English, Dutch, Greek, Mandarin).
MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.
[Corpus] DementiaBank (from TalkBank)
Contains recordings of individuals with dementia across different languages. Includes around 400 subjects. The most notable corpus, both in size and in containing control subjects, is:
- English Pitt: Longitudinal neuropsychological assessments of 319 subjects (dementia + control) performing Cookie Theft, Word Fluency, Story Recall, and Sentence Construction tasks. (Becker et al., 1994)
[Corpus] Clinical TalkBank
In addition to DementiaBank, TalkBank contains:
- RHDBank: individuals with Right-Hemisphere Disorder.
- TBIBank: individuals with Traumatic Brain Injury.
- AphasiaBank: individuals with aphasia, a communication disorder affecting the ability to speak, write, and understand language due to trauma to the language areas of the brain.
- FluencyBank: individuals with language disfluencies due to being a second language learner or due to stuttering.

Text-based Corpora

[Corpus] Reddit Self-reported Depression Diagnosis (RSDD) dataset
Contains Reddit posts for ~9,000 users with a claim to depression and ~107,000 control users. (Yates et al., 2017)
[Corpus] MIMIC III (Medical Information Mart for Intensive Care)
Contains medical details and outcomes of 40,000+ patients (e.g. demographics, vital signs, laboratory tests, medications) as well as 2M+ free-text medical notes written by medical personnel (e.g. physicians, nurses). (Johnson et al., 2016)
i2b2/UTHealth NLP Task (contact authors for corpus?)
Contains emergency medical records for 296 patients at Partners HealthCare, along with medical discharge and correspondence notes between medical personnel. Kumar et al. (2014) describe how the data was processed, and Stubbs et al. (2014) describe the 2014 task of identifying risk factors for heart disease over time.
Nun Study (contact authors for corpus?)
Diaries of 93 nuns used to evaluate cognitive impairment (Alzheimer's disease) in later life. Also contains neuropsychology tests and autopsy information. The study was authored by Snowdon et al. (1996).
