sound_event_detection

A repository for manually annotating audio files to create labeled datasets for machine learning.

How to get started

I'm assuming you are running this on a Mac computer (this is the only operating system tested).

First, make sure you have installed Python3, FFmpeg, and SoX via Homebrew:

brew install python3 sox ffmpeg

Now, clone the repository and install all required dependencies:

cd ~
git clone git@github.com:jim-schwoebel/sound_event_detection.git
cd sound_event_detection
pip3 install -r requirements.txt

How to label

Just put audio files in the ./data folder, run label_files.py, and you're ready to start labeling! See the video below for a quick look at the process (the script works through however many files are in the ./data directory).

organizing data

First, put all the audio files in the ./data folder. The script then goes through each file and splits it into fixed-length windows (0.20 seconds, i.e. 200 milliseconds, by default) for labeling. Note that all the audio files in this folder must be uniquely named (e.g. 1.wav, 2.wav, etc.).
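To make the windowing idea concrete, here is a minimal sketch of how window boundaries could be computed. It is not taken from label_files.py: the function name window_bounds, the half-window step for overlapping windows, and the choice to drop a trailing partial window are all assumptions made for illustration.

```python
def window_bounds(duration, window=0.20, overlapping=False):
    """Return (onset, offset) pairs, in seconds, for each full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Assumption: overlapping windows advance by half a window at a time.
    step = window / 2 if overlapping else window
    bounds = []
    index = 0
    # Assumption: a trailing partial window is dropped rather than padded.
    while index * step + window <= duration + 1e-9:
        onset = index * step
        bounds.append((round(onset, 3), round(onset + window, 3)))
        index += 1
    return bounds


print(len(window_bounds(9.8, window=0.5)))
print(window_bounds(1.0, window=0.5))
```

Under these assumptions, a file slightly shorter than 10 seconds yields 19 windows of 0.50 seconds, which matches the count in the example session further down; the actual script may reach that count differently.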

labeling data

Run the script with

cd ~
cd sound_event_detection
python3 label_files.py

The script then asks for a few things, such as the number of classes and their names. After that, all the files are segmented into windows and you can annotate each one. In the example below, 19 windows are created (0.50-second windows from a 10-second speech file). See an example terminal session below.

how many classes do you want? (leave blank for 2) 
2
what is class 1? 
silence
what is class 2? 
speech
making fast_0.wav
making fast_1.wav
making fast_2.wav
making fast_3.wav
making fast_4.wav
making fast_5.wav
making fast_6.wav
making fast_7.wav
making fast_8.wav
making fast_9.wav
making fast_10.wav
making fast_11.wav
making fast_12.wav
making fast_13.wav
making fast_14.wav
making fast_15.wav
making fast_16.wav
making fast_17.wav
making fast_18.wav

fast_0.wav:

 File Size: 16.0k     Bit Rate: 257k
  Encoding: Signed PCM    
  Channels: 1 @ 16-bit   
Samplerate: 16000Hz      
Replaygain: off         
  Duration: 00:00:00.50  

In:100%  00:00:00.50 [00:00:00.00] Out:22.0k [      |      ]        Clip:0    
Done.
silence (0) or speech (1)?  0

After you finish annotating the file, the windowed events are automatically sorted into the right folders (in the ./data/ directory). In this case, the 0.50-second snippets end up in the 'speech' and 'silence' directories, all from one file (fast.wav). If you had multiple audio files, the windows from every file would be sorted into these folders, which makes it easy to prepare the files for machine learning.
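The sorting step can be pictured with a short sketch like the one below. It is an illustration only, not the code in label_files.py: the helper name sort_window is hypothetical, and the demonstration uses empty placeholder files in a temporary directory instead of real audio.

```python
import shutil
import tempfile
from pathlib import Path


def sort_window(window_path, label, data_dir):
    """Move one labeled window file into data_dir/<label>/ and return its new path."""
    target_dir = Path(data_dir) / label
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(window_path).name
    shutil.move(str(window_path), str(target))
    return target


# Demonstration in a temporary directory, using empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    labels = {"fast_0.wav": "silence", "fast_1.wav": "speech"}
    for name, label in labels.items():
        (data_dir / name).touch()
        sort_window(data_dir / name, label, data_dir)
    sorted_files = sorted(
        p.relative_to(data_dir).as_posix() for p in data_dir.rglob("*.wav")
    )
    print(sorted_files)
```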

What results is a .CSV annotation file for the entire length of the session in the ./processed/ folder along with the base audio file (e.g. 'fast.wav'). See below for the example annotation. This annotation is necessary for visualizing the file later (the 0.80 probability here can be changed in the settings.json to other values).

filename	onset	offset	event_label	probability
fast.wav	0	0.5	silence	0.8
fast.wav	0.5	1	speech	0.8
fast.wav	1	1.5	speech	0.8
fast.wav	1.5	2	speech	0.8
fast.wav	2	2.5	speech	0.8
fast.wav	2.5	3	speech	0.8
fast.wav	3	3.5	speech	0.8
fast.wav	3.5	4	speech	0.8
fast.wav	4	4.5	speech	0.8
fast.wav	4.5	5	speech	0.8
fast.wav	5	5.5	speech	0.8
fast.wav	5.5	6	speech	0.8
fast.wav	6	6.5	speech	0.8
fast.wav	6.5	7	speech	0.8
fast.wav	7	7.5	speech	0.8
fast.wav	7.5	8	speech	0.8
fast.wav	8	8.5	speech	0.8
fast.wav	8.5	9	speech	0.8
fast.wav	9	9.5	speech	0.8
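If you want to inspect an annotation file outside the visualizer, a sketch along these lines may help. It assumes the file has the header row and tab-separated columns shown above; if your file is comma-separated, pass delimiter=",". The function name seconds_per_label is made up for this example.

```python
import csv
import io
from collections import defaultdict


def seconds_per_label(handle, delimiter="\t"):
    """Sum (offset - onset) per event_label from an open annotation file."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter=delimiter):
        totals[row["event_label"]] += float(row["offset"]) - float(row["onset"])
    return dict(totals)


# The first three rows of the example annotation above, inlined for the demo.
sample = (
    "filename\tonset\toffset\tevent_label\tprobability\n"
    "fast.wav\t0\t0.5\tsilence\t0.8\n"
    "fast.wav\t0.5\t1\tspeech\t0.8\n"
    "fast.wav\t1\t1.5\tspeech\t0.8\n"
)
totals = seconds_per_label(io.StringIO(sample))
print(totals)
```

For a real file, open it with open(path, newline="") and pass the handle in place of the io.StringIO object.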

changing default settings

You can change a few settings with the SETTINGS.JSON file. Note that for most speech recognition problems, a good window for humans to hear and annotate is 0.20 seconds (or 200 milliseconds), which is the default window used in this repository.

Setting (Variable)	Description	Possible values	Default value
overlapping	Determines whether or not to use overlapping windows for splicing.	True or False	False
model_feature	models data in the timesplit variable + plots onto .CSV file output (for the visualize_feature visualization)	True or False	True
plot_feature	Allows for the ability to plot spectrograms while labeling (8 visuals).	True or False	False
probability_default	Sets the default probability amount (only useful if probability_labeltype == True) for each labeled session.	0.0-1.0	0.80
probability_labeltype	Lets you label the probability of events occurring either automatically or manually. If True, the event probability is set automatically to the probability_default value; if False, the event probability is annotated manually by the user.	True or False	True
timesplit	The window length (in seconds) to splice audio by for object detection. If set to "random", each window length is chosen at random between 0.20 and 1 second (allows for data augmentation).	0.20-60 or "random"	0.20
visualize_feature	Allows for the ability to plot events after labeling each audio file.	True or False	False
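Before a long labeling session it can be useful to sanity-check your settings. The sketch below checks a settings dictionary against the ranges in the table above. The key names and defaults come from the table, but how the real settings.json encodes its values (for example JSON booleans versus strings) is an assumption, and check_settings is a hypothetical helper, not part of the repository.

```python
import json

# Defaults copied from the table above; the JSON value types are an assumption.
DEFAULTS = {
    "overlapping": False,
    "model_feature": True,
    "plot_feature": False,
    "probability_default": 0.80,
    "probability_labeltype": True,
    "timesplit": 0.20,
    "visualize_feature": False,
}


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_settings(settings):
    """Merge settings over DEFAULTS and return (merged, list of problems)."""
    merged = {**DEFAULTS, **settings}
    problems = []
    for key in sorted(set(settings) - set(DEFAULTS)):
        problems.append("unknown setting: " + key)
    prob = merged["probability_default"]
    if not (_is_number(prob) and 0.0 <= prob <= 1.0):
        problems.append("probability_default must be between 0.0 and 1.0")
    split = merged["timesplit"]
    if split != "random" and not (_is_number(split) and 0.20 <= split <= 60):
        problems.append('timesplit must be between 0.20 and 60, or "random"')
    return merged, problems


merged, problems = check_settings(
    json.loads('{"timesplit": 0.5, "probability_default": 1.5}')
)
print(merged["timesplit"], problems)
```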

Using machine learning models

training machine learning models from labels

You can train a machine learning model easily by running the train_audioTPOT.py script.

cd ~
cd sound_event_detection
python3 train_audioTPOT.py

You will then be prompted for a few things:

Is this a classification (c) or regression (r) problem? --> c
How many classes do you want to train? --> 2 
What is the name of class 1? --> silence
What is the name of class 2? --> speech

After this, all the audio files will be featurized with the librosa_featurizing embedding and modeled using TPOT, an AutoML package. Note that much of this code base is from the Voicebook repository: chapter_4_modeling. In this scenario, 25% of the data is left out for cross-validation.

A machine learning model is then trained on all the data provided in each folder in the ./data directory. Note that if you properly named the classes with label_files.py, then the classes should align (e.g. if you labeled two classes, speech and silence, you can train two classes, silence and speech).
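As a rough picture of the 25% hold-out mentioned above, the following sketch shuffles labeled samples and sets a quarter aside. It does not use TPOT and is not the split performed by train_audioTPOT.py; the function name holdout_split, the fixed seed, and the placeholder file names are assumptions for illustration.

```python
import random


def holdout_split(samples, test_fraction=0.25, seed=0):
    """Shuffle samples and return (train, test) with test_fraction held out."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder (filename, label) pairs standing in for featurized windows.
samples = [("silence_%d.wav" % i, "silence") for i in range(8)]
samples += [("speech_%d.wav" % i, "speech") for i in range(8)]

train, test = holdout_split(samples)
print(len(train), len(test))
```

Note that a plain shuffle like this does not guarantee both classes appear in the held-out set; with very few files in one class, a stratified split is usually the safer choice.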

making predictions on new files

You can then easily deploy this machine learning model on new audio files using the load_audioTPOT script.

applying pre-trained models

If instead you'd like to use some pre-trained models, you can use the ones included in the ./models directory. Here is an overview of all the current models and their accuracies.

Note many of these are overfitted on small datasets, so use these models at your own risk!! :)

Visualizing labels and predictions

We can use a third-party library called sed_vis (MIT licensed) to visualize annotated files. I've created a modification script that uses argv[] to pass through the .CSV file label and the audio file so that it works in this interface.

To visualize the files, place the audio file in the ./data folder. Assuming you already have a label file called test.csv and an audio file called test.wav (these are generated by label_files.py), you can run

cd ~
cd sound_event_detection
python3 ./sed_vis/visualize.py ./processed/test.wav ./processed/test.csv

The result is a visualization showing all the annotated sound events.

You can just change the command slightly to visualize all the machine learning models in the ./models directory as well. All you need to do is change the .CSV reference here (e.g. usually it's filename_2.csv):

cd ~
cd sound_event_detection
python3 ./sed_vis/visualize.py ./processed/test.wav ./processed/test_2.csv

With this machine learning visualization, you can better hear how machine learning models are under- or over-fitted and augment datasets, as necessary, for machine learning training.

Datasets generated with script

Datasets used: [AudioSet], the [Common Voice Project], [YouTube], and [train-emotions].

Future things to do

debug why accuracy is coming out as 1.2 across all models instead of some (make better experience).
make sure all files are mono for the visualization library.
add regression capabilities {train_audioTPOT should allow for regression modeling and outputs}.
add YouTube integration for data (e.g. download YouTube video or playlist via link + auto label).
clean up readme and transfer most of this info to wiki. Use landing page to generate interest to star/clone.

Other resources

If you're interested in learning more about voice computing, I highly encourage you to check out the Voicebook repository. This repo contains 200+ open source scripts to get started with voice computing.

Here are some other libraries that may be of interest to learn more about sound event detection:

Windows issues / did not test

label_files.py

Install dependencies. I created a new environment and pip installed each one, because requirements.txt didn't work for some reason. My guess is that it hung when it reached the two packages that need special installation instructions: PyAudio and PocketSphinx. PyAudio was installed by following this link: https://stackoverflow.com/questions/54998028/how-do-i-install-pyaudio-on-python-3-7. Jim provided the following link to get PocketSphinx going on Windows, but it appears to work fine without it: https://stackoverflow.com/questions/18889268/setting-up-pocketsphinx-for-python-in-windows.

Line 205 in label_files.py, os.system('play %s'%(filename)), doesn't appear to work on Windows; the command prompt returns a play error. I replaced it with the lines below for use on Windows. The file then plays in the command prompt and annotation can take place.

import winsound #(should be native in Windows Python install)
print(filename)
winsound.PlaySound(filename, winsound.SND_FILENAME)

train_audioTPOT.py

Had an issue with librosa_features.py: the error says librosa_feature.rmse(y)[0] (line 138) has no attribute rmse. Looking online, the attribute was renamed from 'rmse' to 'rms'. It works after changing to 'rms'.

When doing quick testing I ran into some JSON file issues. I'm not exactly sure what happened, but during this test there were only 1 or 2 files in class target 2. I reran with a more even number of files in each class and it works fine.
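The playback workaround reported for label_files.py could be folded into a single helper that picks a backend by platform. This is a sketch under assumptions, not a patch from the repository: the names sox_play_command and play_audio are hypothetical, and it assumes SoX's play command is on the PATH on non-Windows systems. Passing the command as a list also avoids shell quoting problems with file names that contain spaces.

```python
import subprocess
import sys


def sox_play_command(filename):
    """Build the argument list for SoX's `play` command."""
    return ["play", filename]


def play_audio(filename):
    """Play a WAV file and block until playback finishes."""
    if sys.platform.startswith("win"):
        import winsound  # standard library, available on Windows only

        winsound.PlaySound(filename, winsound.SND_FILENAME)
    else:
        # Assumes SoX is installed (e.g. via `brew install sox`).
        subprocess.run(sox_play_command(filename), check=True)


print(sox_play_command("fast 0.wav"))
```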

Opened by jim-schwoebel.

Owner

Jim Schwoebel
