Reading List for topics in Sound Event Detection
Introduction
Sound event detection (SED) processes a continuous acoustic signal and converts it into symbolic descriptions of the sound events present in the auditory scene. It is used in a variety of applications, including context-based indexing and retrieval in multimedia databases, unobtrusive monitoring in health care, and surveillance. Since 2017, learning acoustic information from weak annotations has been formulated as a way to exploit the large amount of multimedia data available. This reading list collects papers that use weak annotations to learn symbolic descriptions of the sound events in audio.
Papers covering multiple sub-areas are listed in each relevant section. If I missed any areas, papers, or datasets, please let me know or feel free to make a pull request.
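To make the idea of weak annotations concrete, here is a minimal sketch of how frame-level predictions relate to clip-level (weak) tags through a pooling function, as studied in several of the pooling papers listed below. The event names, probabilities, threshold, and function names are all hypothetical and chosen only for illustration; they are not taken from any specific paper in this list.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical frame-level probabilities for one 5-frame clip (illustrative values).
frame_probs: Dict[str, List[float]] = {
    "dog_bark": [0.05, 0.90, 0.85, 0.10, 0.05],
    "siren": [0.10, 0.10, 0.20, 0.10, 0.10],
}


def max_pool(p: List[float]) -> float:
    """Clip-level probability is the most confident frame."""
    return max(p)


def mean_pool(p: List[float]) -> float:
    """Clip-level probability is the average over frames."""
    return sum(p) / len(p)


def linear_softmax_pool(p: List[float]) -> float:
    """Each frame is weighted by its own probability."""
    total = sum(p)
    return sum(x * x for x in p) / total if total > 0 else 0.0


def clip_tags(
    probs: Dict[str, List[float]],
    pool: Callable[[List[float]], float],
    threshold: float = 0.5,
) -> Dict[str, bool]:
    """Weak (clip-level) output: is each event present anywhere in the clip?"""
    return {event: pool(p) >= threshold for event, p in probs.items()}


def frame_segments(p: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Strong (frame-level) output: (start, end_exclusive) frame index pairs."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, x in enumerate(p):
        if x >= threshold and start is None:
            start = i
        elif x < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(p)))
    return segments


print(clip_tags(frame_probs, max_pool))
print(clip_tags(frame_probs, mean_pool))
print(frame_segments(frame_probs["dog_bark"]))
```

With weak labels, only the clip-level tags are available as training targets, so the choice of pooling function decides how the training signal reaches individual frames. In this toy clip, mean pooling misses the short bark that max pooling detects.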
Maintained by Soham Deshmukh
Recent Content
INTERSPEECH 2021 papers added
ICASSP 2021 papers added
Table of Contents
- Survey Papers
- Areas
- Learning formulation
- Network architecture
- Pooling functions
- Missing or noisy audio
- Data Augmentation
- Generative Learning
- Representation Learning
- Multi-Task Learning
- Adversarial Attacks
- Few-Shot Learning
- Knowledge-transfer
- Polyphonic SED
- Joint learning
- Loss function
- Audio and Visual
- Audio and Text [Audio Captioning]
- Strongly and Weakly labelled data
- Healthcare
- Robotics
- Dataset
- Workshops/Conferences/Journals
- Tutorials
- Courses
- More
Research papers
Survey papers
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Areas
Learning formulation
Weakly supervised scalable audio content analysis, ICME 2016
Audio Event Detection using Weakly Labeled Data, 24th ACM Multimedia Conference 2016
An approach for self-training audio event detectors using web data, 25th EUSIPCO 2017
A joint detection-classification model for audio tagging of weakly labelled data, ICASSP 2017
Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling, ICASSP 2019
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition, ICML 2020
Non-Negative Matrix Factorization-Convolutional Neural Network (NMF-CNN) For Sound Event Detection, ArXiv 2020
Duration robust weakly supervised sound event detection, ICASSP 2020
SeCoST: Sequential Co-Supervision for Large Scale Weakly Labeled Audio Event Detection, ICASSP 2020
Guided Learning for Weakly-Labeled Semi-Supervised Sound Event Detection, ICASSP 2020
Unsupervised Contrastive Learning of Sound Event Representations, ICASSP 2021
Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events, ICASSP 2021
Comparison of Deep Co-Training and Mean-Teacher Approaches for Semi-Supervised Audio Tagging, ICASSP 2021
Enhancing Audio Augmentation Methods with Consistency Learning, ICASSP 2021
Network Architecture
Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks, ICASSP 2017
Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data, NIPS Workshop on Machine Learning for Audio 2017
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network, ICASSP 2018
Orthogonality-Regularized Masked NMF for Learning on Weakly Labeled Audio Data, ICASSP 2018
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes, ICASSP 2019
Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization, TASLP 2020
DD-CNN: Depthwise Disout Convolutional Neural Network for Low-complexity Acoustic Scene Classification, ArXiv 2020
Effective Perturbation based Semi-Supervised Learning Method for Sound Event Detection, INTERSPEECH 2020
Weakly-Supervised Sound Event Detection with Self-Attention, ICASSP 2020
Improving Deep Learning Sound Events Classifiers using Gram Matrix Feature-wise Correlations, ICASSP 2021
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection, ICASSP 2021
AST: Audio Spectrogram Transformer, INTERSPEECH 2021
Event Specific Attention for Polyphonic Sound Event Detection, INTERSPEECH 2021
Pooling functions
Adaptive Pooling Operators for Weakly Labeled Sound Event Detection, TASLP 2018
Comparing the Max and Noisy-Or Pooling Functions in Multiple Instance Learning for Weakly Supervised Sequence Learning Tasks, INTERSPEECH 2018
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling, ICASSP 2019
Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection, INTERSPEECH 2019
Weakly labelled audioset tagging with attention neural networks, TASLP 2019
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
A Global-Local Attention Framework for Weakly Labelled Audio Tagging, ICASSP 2021
Missing or noisy audio
Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
Improving weakly supervised sound event detection with self-supervised auxiliary tasks, INTERSPEECH 2021
Data Augmentation
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification, INTERSPEECH 2021
Generative Learning
Acoustic Scene Generation with Conditional SampleRNN, ICASSP 2019
Representation Learning
Towards Learning a Universal Non-Semantic Representation of Speech, INTERSPEECH 2020
Contrastive Predictive Coding of Audio with an Adversary, INTERSPEECH 2020
ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection, ICASSP 2021
FRILL: A Non-Semantic Speech Embedding for Mobile Devices, INTERSPEECH 2021
Multi-Task Learning
Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020
Multi-Task Learning and post processing optimisation for sound event detection, DCASE 2019
Label-efficient audio classification through multitask learning and self-supervision, ICLR 2019
Few-Shot Learning
Few-Shot Audio Classification with Attentional Graph Neural Networks, INTERSPEECH 2019
Continual Learning of New Sound Classes Using Generative Replay, WASPAA 2019
Few-Shot Sound Event Detection, ICASSP 2020
Few-Shot Continual Learning for Audio Classification, ICASSP 2021
Unsupervised and Semi-Supervised Few-Shot Acoustic Event Classification, ICASSP 2021
Knowledge Transfer
Do sound event representations generalize to other audio tasks? A case study in audio transfer learning, INTERSPEECH 2021
Transfer learning of weakly labelled audio, WASPAA 2017
Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes, ICASSP 2018
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, TASLP 2020
Polyphonic SED
A first attempt at polyphonic sound event detection using connectionist temporal classification, ICASSP 2017
Polyphonic Sound Event Detection with Weak Labeling, Thesis 2018
Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy, DCASE 2019
Evaluation of Post-Processing Algorithms for Polyphonic Sound Event Detection, WASPAA 2019
Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection, TASLP 2020
Joint task
A Joint Separation-Classification Model for Sound Event Detection of Weakly Labelled Data, ICASSP 2018
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling, INTERSPEECH 2020
Loss function
Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance, ICASSP 2021
Audio and Visual
A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging, ICASSP 2018
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data, IJCAI 2020
Labelling unlabelled videos from scratch with multi-modal self-supervision, NeurIPS 2020
Audio-Visual Event Recognition Through the Lens of Adversary, ICASSP 2021
Audio and Text [Audio Captioning]
Automated audio captioning with recurrent neural networks, WASPAA 2017
Audio caption: Listen and tell, ICASSP 2018
AudioCaps: Generating captions for audios in the wild, NAACL 2019
Audio Captioning Based on Combined Audio and Semantic Embeddings, ISM 2020
Clotho: An Audio Captioning Dataset, ICASSP 2020
A Transformer-based Audio Captioning Model with Keyword Estimation, INTERSPEECH 2020
Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events, ICASSP 2021
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags, ICASSP 2021
Strongly and Weakly labelled data
Audio event and scene recognition: A unified approach using strongly and weakly labeled data, IJCNN 2017
Others
Sound Event Detection Using Point-Labeled Data, WASPAA 2019
Dataset
Task | Dataset | Source | Num. Files |
---|---|---|---|
Sound Event Classification | ESC-50 | freesound.org | 2k files |
Sound Event Classification | DCASE17 Task 4 | YT videos | 2k files |
Sound Event Classification | US8K | freesound.org | 8k files |
Sound Event Classification | FSD50K | freesound.org | 50k files |
Sound Event Classification | AudioSet | YT videos | 2M files |
COVID-19 Detection using Coughs | DiCOVA | Volunteers recording audio via a website | 1k files |
Few-shot Bioacoustic Event Detection | DCASE21 Task 5 | audio | 4k+ files |
Acoustic Scene Classification | DCASE18 Task 1 | Recorded by TUT | 1.5k files |
Various | VGG-Sound | Web videos | 200k files |
Audio Captioning | Clotho | freesound.org | 5k files |
Audio Captioning | AudioCaps | YT videos | 51k files |
Action Recognition | UCF101 | Web videos | 13k files |
Unlabeled | YFCC100M | Yahoo videos | 1M files |
Other audio-based datasets to consider
DCASE dataset list
Workshops/Conferences/Journals
List of past workshops (archived) and ongoing workshops, conferences, and journals:
Venue | Link |
---|---|
Machine Learning for Audio Signal Processing, NIPS 2017 workshop | https://nips.cc/Conferences/2017/Schedule?showEvent=8790 |
MLSP: Machine Learning for Signal Processing | https://ieeemlsp.cc/ |
WASPAA: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics | https://www.waspaa.com |
ICASSP: IEEE International Conference on Acoustics Speech and Signal Processing | https://2021.ieeeicassp.org/ |
INTERSPEECH | https://www.interspeech2021.org/ |
IEEE/ACM Transactions on Audio, Speech and Language Processing | https://dl.acm.org/journal/taslp |
DCASE | http://dcase.community/ |
Tutorials
Sound Event Detection: A Tutorial
Resources
Computational Analysis of Sound Scenes and Events
More
If you are interested in audio-captioning, K. Drossos maintains a detailed reading list here