Audio Classification, Tagging & Sound Event Detection in PyTorch
Progress:
- Fine-tune on audio classification
- Fine-tune on audio tagging
- Fine-tune on sound event detection
- Add tagging metrics
- Add Tutorial
- Add Augmentation Notebook
- Add more schedulers
- Add FSDKaggle2019 dataset
- Add MTT dataset
- Add DESED
Model Zoo
AudioSet Pretrained Models
Model | Task | mAP (%) | Sample Rate (kHz) | Window Length | Num Mels | Fmax | Weights
---|---|---|---|---|---|---|---
CNN14 | Tagging | 43.1 | 32 | 1024 | 64 | 14k | download
CNN14_16k | Tagging | 43.8 | 16 | 512 | 64 | 8k | download
CNN14_DecisionLevelMax | SED | 38.5 | 32 | 1024 | 64 | 14k | download
Note: These models are used as pretrained backbones for the fine-tuning tasks below. Check out audioset-tagging-cnn if you want to train on the AudioSet dataset yourself.
Fine-tuned Classification Models
Model | Dataset | Accuracy (%) | Sample Rate (kHz) | Weights
---|---|---|---|---
CNN14 | ESC50 (Fold-5) | 95.75 | 32 | download
CNN14 | FSDKaggle2018 (test) | 93.56 | 32 | download
CNN14 | SpeechCommandsv1 (val/test) | 96.60/96.77 | 32 | download
Fine-tuned Tagging Models
Model | Dataset | mAP (%) | AUC | d-prime | Sample Rate (kHz) | Config | Weights
---|---|---|---|---|---|---|---
CNN14 | FSDKaggle2019 | - | - | - | 32 | - | -
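For reference, mAP, AUC, and d-prime above are the standard AudioSet-style tagging metrics, and d-prime is a deterministic function of AUC. A minimal sketch of computing them with scikit-learn/scipy (illustrative; not necessarily this repository's implementation):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import average_precision_score, roc_auc_score

def tagging_metrics(targets, scores):
    """targets, scores: (num_clips, num_classes) arrays of 0/1 labels and predicted scores."""
    mAP = average_precision_score(targets, scores, average="macro")  # class-wise AP, averaged
    auc = roc_auc_score(targets, scores, average="macro")            # class-wise ROC-AUC, averaged
    d_prime = stats.norm.ppf(auc) * np.sqrt(2)                       # d-prime derived from AUC
    return mAP, auc, d_prime
```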
Fine-tuned SED Models
Model | Dataset | F1 | Sample Rate (kHz) | Config | Weights
---|---|---|---|---|---
CNN14_DecisionLevelMax | DESED | - | 32 | - | -
Supported Datasets
Dataset | Task | Classes | Train | Val | Test | Audio Length | Audio Spec | Size |
---|---|---|---|---|---|---|---|---|
ESC-50 | Classification | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB |
UrbanSound8k | Classification | 10 | 8,732 | 10 folds | - | <=4s | Varies | 5.6GB
FSDKaggle2018 | Classification | 41 | 9,473 | - | 1,600 | 300ms~30s | 44.1kHz, mono | 4.6GB |
SpeechCommandsv1 | Classification | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB |
SpeechCommandsv2 | Classification | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB |
FSDKaggle2019* | Tagging | 80 | 4,970+19,815 | - | 4,481 | 300ms~30s | 44.1kHz, mono | 24GB |
MTT* | Tagging | 50 | 19,000 | - | - | - | - | 3GB |
DESED* | SED | 10 | - | - | - | 10s | - | -
Notes: Datasets marked with * are not available yet. Classification datasets are treated as multi-class/single-label classification; tagging and SED datasets are treated as multi-label classification.
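This distinction maps directly onto the loss function: single-label classification uses cross-entropy over one target class per clip, while tagging and SED use independent per-class sigmoids with a binary target vector. A minimal PyTorch sketch (illustrative shapes; not this repository's training code):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 50)  # (batch, num_classes) raw model outputs

# Multi-class / single-label (classification datasets): one target class per clip.
single_label = torch.randint(0, 50, (4,))
ce_loss = nn.CrossEntropyLoss()(logits, single_label)

# Multi-label (tagging / SED datasets): 0/1 vector of active classes per clip.
multi_label = torch.randint(0, 2, (4, 50)).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_label)
```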
Dataset Structure (click to expand)
Download each dataset and arrange it in the following structure.
datasets
|__ ESC50
    |__ audio
|__ Urbansound8k
    |__ audio
|__ FSDKaggle2018
    |__ audio_train
    |__ audio_test
    |__ FSDKaggle2018.meta
        |__ train_post_competition.csv
        |__ test_post_competition_scoring_clips.csv
|__ SpeechCommandsv1/v2
    |__ bed
    |__ bird
    |__ ...
    |__ testing_list.txt
    |__ validation_list.txt
Augmentations (click to expand)
Currently, the following augmentations are supported; more will be added in the future. You can test the effects of each augmentation with this notebook. A short usage sketch follows the lists below.
Waveform Augmentations:
- MixUp
- Background Noise
- Gaussian Noise
- Fade In/Out
- Volume
- CutMix
Spectrogram Augmentations:
- Time Masking
- Frequency Masking
- Filter Augmentation
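For illustration, the spectrogram masking transforms above are available as SpecAugment-style modules in torchaudio, and waveform MixUp can be written in a few lines. A minimal sketch (assumed shapes and parameters; not this repository's exact implementation):

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking on a (channel, n_mels, time) mel spectrogram.
mel = torch.randn(1, 64, 512)
masked = T.FrequencyMasking(freq_mask_param=16)(T.TimeMasking(time_mask_param=80)(mel))

# Waveform MixUp: blend two clips and their label vectors with a Beta-sampled weight.
def mixup(wave_a, wave_b, label_a, label_b, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * wave_a + (1 - lam) * wave_b, lam * label_a + (1 - lam) * label_b
```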
Usage
Requirements (click to expand)
- python >= 3.6
- pytorch >= 1.8.1
- torchaudio >= 0.8.1
Other requirements can be installed with pip install -r requirements.txt.
Configuration (click to expand)
Training (click to expand)
To train with a single GPU:
$ python tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
To train with multiple GPUs, set the DDP field in the config file to true and run as follows:
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
Evaluation (click to expand)
Make sure to set MODEL_PATH in the configuration file to your trained model directory.
$ python tools/val.py --cfg configs/CONFIG_FILE.yaml
Audio Classification/Tagging Inference
- Set MODEL_PATH in the configuration file to your model's trained weights.
- Change the dataset name in DATASET >> NAME to the dataset your model was trained on.
- Set the test audio file path in TEST >> FILE.
- Run the following command:
$ python tools/infer.py --cfg configs/CONFIG_FILE.yaml
## for example
$ python tools/infer.py --cfg configs/audioset.yaml
You will get an output similar to this:
Class Confidence
---------------------- ------------
Speech 0.897762
Telephone bell ringing 0.752206
Telephone 0.219329
Inside, small room 0.20761
Music 0.0770325
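A table like the one above can be produced by applying a per-class sigmoid to the model outputs and sorting the confidences. An illustrative sketch (the logits shape and class_names list are assumptions, not the script's actual code):

```python
import torch

def top_k_classes(logits, class_names, k=5):
    # Multi-label tagging: independent sigmoid per class, sorted by confidence.
    probs = torch.sigmoid(logits.squeeze(0))   # (num_classes,)
    conf, idx = probs.topk(k)
    return [(class_names[int(i)], float(c)) for i, c in zip(idx, conf)]
```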
Sound Event Detection Inference
- Set MODEL_PATH in the configuration file to your model's trained weights.
- Change the dataset name in DATASET >> NAME to the dataset your model was trained on.
- Set the test audio file path in TEST >> FILE.
- Run the following command:
$ python tools/sed_infer.py --cfg configs/CONFIG_FILE.yaml
## for example
$ python tools/sed_infer.py --cfg configs/audioset_sed.yaml
You will get an output similar to this:
Class Start End
---------------------- ------- -----
Speech 2.2 7
Telephone bell ringing 0 2.5
The following plot will also be shown if you set PLOT to true:
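The (class, start, end) events above are derived from the model's frame-wise predictions: each class track is thresholded and consecutive active frames are merged into events. An illustrative sketch (threshold and hop size are assumptions, not the script's actual post-processing):

```python
import numpy as np

def framewise_to_events(probs, class_names, hop_sec, threshold=0.5):
    """probs: (num_frames, num_classes) frame-wise probabilities."""
    events = []
    for c, name in enumerate(class_names):
        active = probs[:, c] > threshold
        start = None
        for t, is_on in enumerate(active):
            if is_on and start is None:
                start = t                      # event begins
            elif not is_on and start is not None:
                events.append((name, start * hop_sec, t * hop_sec))
                start = None                   # event ends
        if start is not None:                  # event runs to the end of the clip
            events.append((name, start * hop_sec, len(active) * hop_sec))
    return events
```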
References (click to expand)
Citations (click to expand)
@misc{kong2020panns,
title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
year={2020},
eprint={1912.10211},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
@misc{gong2021ast,
title={AST: Audio Spectrogram Transformer},
author={Yuan Gong and Yu-An Chung and James Glass},
year={2021},
eprint={2104.01778},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
@misc{nam2021heavily,
title={Heavily Augmented Sound Event Detection utilizing Weak Predictions},
author={Hyeonuk Nam and Byeong-Yun Ko and Gyeong-Tae Lee and Seong-Hu Kim and Won-Ho Jung and Sang-Min Choi and Yong-Hwa Park},
year={2021},
eprint={2107.03649},
archivePrefix={arXiv},
primaryClass={eess.AS}
}