
PR MIT License made-with-python


Logo

MevonAI - Speech Emotion Recognition

Identify the emotion of multiple speakers in an Audio Segment
Report Bug · Request Feature

Try the Demo Here

Open In Colab

Table of Contents

About The Project

Logo

The main aim of the project is to identify the emotions of multiple speakers in a call audio recording, as an application for customer-satisfaction feedback in call centres.

Built With

Getting Started

Follow the instructions below to set the project up on your local machine.

Installation

  1. Create a Python virtual environment
sudo apt install python3-venv
mkdir mevonAI
cd mevonAI
python3 -m venv mevon-env
source mevon-env/bin/activate
  2. Clone the repo
git clone https://github.com/SuyashMore/MevonAI-Speech-Emotion-Recognition.git
  3. Install dependencies
cd MevonAI-Speech-Emotion-Recognition/
cd src/
sudo chmod +x setup.sh
./setup.sh

Running the Application

  1. Add audio files in .wav format for analysis in the src/input/ folder

  2. Run speech emotion recognition using

python3 speechEmotionRecognition.py

  3. By default, the application uses the pretrained model available in "src/model/"

  4. Diarized files are stored in the "src/output/" folder

  5. Predicted emotions are stored in a separate .csv file in the src/ folder

Here's how it works:

Speaker Diarization

  • Speaker diarisation (or diarization) is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question "who spoke when?" Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics.

Logo
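The clustering half of diarization can be illustrated with a toy sketch. This is a hypothetical example, not the project's actual diarization code: it assumes each short audio segment has already been mapped to a fixed-size speaker embedding, and uses scikit-learn's AgglomerativeClustering to group segments by speaker.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Toy "speaker embeddings": one vector per short audio segment.
# Segments from the same speaker land near the same centre.
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(5, 16))
speaker_b = rng.normal(loc=1.0, scale=0.1, size=(5, 16))
segments = np.vstack([speaker_a, speaker_b])

# Group segments into two speakers; the label sequence answers
# "who spoke when?" for the segment order.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(segments)
print(labels)
```

In a real pipeline the number of speakers may be unknown, in which case a distance threshold is used instead of a fixed cluster count.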

Feature Extraction

  • For speech recognition tasks, MFCCs have been the state-of-the-art feature since their introduction in the 1980s. Speech is produced by the shape of the vocal tract, and this shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.

Logo

The upper image shows the audio waveform; the lower image shows the MFCCs computed from it, on which we run our CNN model.

CNN Model

  • A convolutional neural network recognizes the emotion from the MFCCs, with the following architecture:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Flatten)

model = Sequential()

# Input layer: 13 MFCCs x 216 frames x 1 channel
model.add(Conv2D(32, 5, strides=2, padding='same',
                 input_shape=(13, 216, 1)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# Hidden layer 1
model.add(Conv2D(64, 5, strides=2, padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())

# Hidden layer 2
model.add(Conv2D(64, 5, strides=2, padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())

# Flatten conv output
model.add(Flatten())

# Output layer: one unit per emotion class
model.add(Dense(7))
model.add(Activation('softmax'))
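The network expects a fixed input shape of (13, 216, 1), but clips vary in length, so MFCC matrices must be padded or truncated along the time axis before prediction. A minimal sketch of that shaping step (a hypothetical helper, not taken from the project's source):

```python
import numpy as np

def to_model_input(mfcc, n_frames=216):
    # Pad with zeros or truncate along the time axis so every clip
    # maps to the fixed (13, 216, 1) input the CNN expects.
    n_mfcc, t = mfcc.shape
    if t < n_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, n_frames - t)))
    else:
        mfcc = mfcc[:, :n_frames]
    return mfcc[..., np.newaxis]  # add the trailing channel dimension

short_clip = np.random.rand(13, 100)
long_clip = np.random.rand(13, 400)
print(to_model_input(short_clip).shape)  # (13, 216, 1)
print(to_model_input(long_clip).shape)   # (13, 216, 1)
```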

Training the Model

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgements

FAQ

  • How do I do specifically so and so?
    • Create an Issue in this repo and we will respond to the query.