Speech-Emotion-Analyzer - A neural network model capable of detecting five different emotions, for both male and female speakers, from speech audio. (Deep Learning, NLP, Python)

Overview

Speech Emotion Analyzer

  • The idea behind creating this project was to build a machine learning model that could detect emotions from the speech we exchange with each other all the time. Nowadays, personalization is expected in everything we experience every day.

  • So why not have an emotion detector that gauges your emotions and, in the future, recommends things based on your mood? Multiple industries could use this to offer different services: a marketing company could suggest products to buy based on your emotions, and the automotive industry could detect a person's emotions and adjust the speed of autonomous cars as required to avoid collisions.

Analyzing audio signals

Image credit: Fabien Ringeval, PhD thesis.

Datasets:

Made use of two different datasets:

  1. RAVDESS. This dataset includes around 1500 audio files from 24 different actors (12 male, 12 female), who recorded short clips expressing 8 different emotions: 1 = neutral, 2 = calm, 3 = happy, 4 = sad, 5 = angry, 6 = fearful, 7 = disgust, 8 = surprised.
     Each audio file is named so that its 7th character encodes the emotion it represents (a small parsing sketch follows this list).

  2. SAVEE. This dataset contains around 500 audio files recorded by 4 different male actors. The first one or two characters of each file name correspond to the emotion they portray.
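
Below is a small sketch, not code from this repo, of how emotion labels could be parsed from RAVDESS-style file names. It assumes the standard dash-separated naming scheme (e.g. 03-01-05-01-01-01-12.wav, where the third field is the emotion code); treat that scheme as an assumption and verify it against your copy of the dataset.

    # Hypothetical helper: maps a RAVDESS file name to its emotion label.
    # The dash-separated naming convention is an assumption; check it locally.
    ravdess_emotions = {'1': 'neutral', '2': 'calm', '3': 'happy', '4': 'sad',
                        '5': 'angry', '6': 'fearful', '7': 'disgust', '8': 'surprised'}

    def ravdess_emotion(filename):
        code = filename.split('-')[2]           # e.g. '05' -> angry
        return ravdess_emotions[code.lstrip('0')]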

Audio files:

Tested out the audio files by plotting the waveform and a spectrogram of a sample file (a plotting sketch follows the figures below).
Waveform

Spectrogram
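
The plots above can be reproduced with librosa's display utilities. A minimal sketch, with an assumed file path rather than the exact notebook code:

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    # Load a sample clip; the path is a placeholder.
    data, sampling_rate = librosa.load('RawData/sample.wav')

    # Waveform (librosa < 0.9 uses librosa.display.waveplot instead).
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(data, sr=sampling_rate)
    plt.title('Waveform')
    plt.show()

    # Spectrogram: STFT magnitude on a dB scale.
    D = librosa.amplitude_to_db(np.abs(librosa.stft(data)), ref=np.max)
    plt.figure(figsize=(12, 4))
    librosa.display.specshow(D, sr=sampling_rate, x_axis='time', y_axis='hz')
    plt.title('Spectrogram')
    plt.colorbar(format='%+2.0f dB')
    plt.show()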

Feature Extraction

The next step involves extracting features from the audio files that will help our model distinguish between them. For feature extraction we make use of the LibROSA library in Python, one of the standard libraries for audio analysis.

  • There are some things to note here. While extracting the features, all the audio files were fixed to the same duration (about 3 seconds) so that each yields an equal number of features.
  • Each file is loaded at double its default sampling rate (the underlying audio is unchanged), which yields more feature frames; this helps classify the audio files when the dataset is small (a sketch of this step follows).
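
A hedged sketch of this feature-extraction step. The window length, offset, and sampling rate mirror what the code in the issues below suggests, but treat them as assumptions:

    import librosa
    import numpy as np

    def extract_features(path):
        # Fixed-length window loaded at double the default sampling rate
        # (sr=22050*2), so each clip yields the same number of frames.
        X, sample_rate = librosa.load(path, res_type='kaiser_fast',
                                      duration=2.5, sr=22050 * 2, offset=0.5)
        # 13 MFCCs; averaging over axis=0 follows the original notebook
        # (note the open issue below arguing the mean should be over frames).
        return np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13),
                       axis=0)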

The extracted features look as follows.



These are arrays of values with labels appended to them.

Building Models

Since the project is a classification problem, a Convolutional Neural Network seemed the obvious choice. We also built Multilayer Perceptron and Long Short-Term Memory models, but they under-performed, with accuracies too low to reliably predict the right emotions.

Building and tuning a model is a very time-consuming process. The idea is to always start small, without adding too many layers just for the sake of making the model complex. After experimenting with layers, the best model achieved a validation accuracy of a little more than 70% on the test data (an illustrative sketch follows).
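
As an illustration only, a 1D CNN of the kind described above might look like the following Keras sketch; the layer sizes, dropout rates, and optimizer settings are assumptions, not the exact tuned architecture:

    import keras
    from keras.layers import Activation, Conv1D, Dense, Dropout, Flatten, MaxPooling1D
    from keras.models import Sequential

    model = Sequential()
    # 216 features per clip, one channel (shape assumed from the feature step).
    model.add(Conv1D(128, 5, padding='same', input_shape=(216, 1)))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    model.add(MaxPooling1D(pool_size=8))
    model.add(Conv1D(128, 5, padding='same'))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    model.add(Flatten())
    model.add(Dense(10))              # 10 gender/emotion classes (see list below)
    model.add(Activation('softmax'))

    opt = keras.optimizers.RMSprop(lr=0.00001, decay=1e-6)
    model.compile(loss='categorical_crossentropy', optimizer=opt,
                  metrics=['accuracy'])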


Predictions

After tuning the model, we tested it by predicting the emotions for the test data. For a model with the given accuracy, below is a sample of the actual vs. predicted values (a comparison sketch follows).
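
A small sketch of such a comparison, assuming X_test/y_test from the feature step and a LabelEncoder lb fitted on the training labels:

    import pandas as pd

    preds = model.predict(X_test, batch_size=32, verbose=1)
    pred_labels = lb.inverse_transform(preds.argmax(axis=1))
    actual_labels = lb.inverse_transform(y_test.argmax(axis=1))

    # Side-by-side actual vs. predicted emotions for the first few test clips.
    print(pd.DataFrame({'actual': actual_labels, 'predicted': pred_labels}).head(10))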


Testing out with live voices

In order to test our model on voices completely different from those in our training and test data, we recorded our own voices with different emotions and predicted the outcomes. You can see the results below: the audio contained a male voice which said "This coffee sucks" in an angry tone.



As you can see in the image above, the model predicted both the speaker's gender and the emotion very accurately. A hedged end-to-end sketch of such a live test follows.
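
For reference, a sketch assembled from the repo's file layout ('model.json', 'saved_models/Emotion_Voice_Detection_Model.h5') and the feature step above; the recording path is a placeholder:

    from keras.models import model_from_json
    import librosa
    import numpy as np
    import pandas as pd

    # Load the saved architecture and weights.
    with open('model.json', 'r') as f:
        loaded_model = model_from_json(f.read())
    loaded_model.load_weights('saved_models/Emotion_Voice_Detection_Model.h5')

    # Extract features from the live recording exactly as in training.
    X, sample_rate = librosa.load('my_recording.wav', res_type='kaiser_fast',
                                  duration=2.5, sr=22050 * 2, offset=0.5)
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13), axis=0)
    twodim = np.expand_dims(pd.DataFrame(mfccs).stack().to_frame().T, axis=2)

    preds = loaded_model.predict(twodim, batch_size=32, verbose=1)
    print(preds.argmax(axis=1))   # decode with the 0-9 mapping in the NOTE below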

NOTE: If you are using the model directly and want to decode its output (ranging from 0 to 9), the following list will help you; a small decoding helper follows the list.

0 - female_angry
1 - female_calm
2 - female_fearful
3 - female_happy
4 - female_sad
5 - male_angry
6 - male_calm
7 - male_fearful
8 - male_happy
9 - male_sad
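
A minimal decoding helper based on the list above (the mapping itself comes from this README; the helper is just a convenience):

    emotion_labels = {
        0: 'female_angry', 1: 'female_calm', 2: 'female_fearful',
        3: 'female_happy', 4: 'female_sad', 5: 'male_angry',
        6: 'male_calm', 7: 'male_fearful', 8: 'male_happy', 9: 'male_sad',
    }

    def decode(preds):
        # preds: the (1, 10) softmax output of model.predict(...)
        return emotion_labels[int(preds.argmax(axis=1)[0])]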

Conclusion

Building the model was a challenging task, as it involved a lot of trial and error and tuning. The model is very well trained to distinguish between male and female voices, which it does with 100% accuracy. It was tuned to detect emotions with more than 70% accuracy. Accuracy can be increased further by including more audio files for training.

Comments
  • TypeError: '<' not supported between instances of 'str' and 'int'

    from keras.utils import np_utils
    from sklearn.preprocessing import LabelEncoder

    X_train = np.array(trainfeatures)
    y_train = np.array(trainlabel)
    X_test = np.array(testfeatures)
    y_test = np.array(testlabel)

    lb = LabelEncoder()

    y_train = np_utils.to_categorical(lb.fit_transform(y_train))
    y_test = np_utils.to_categorical(lb.fit_transform(y_test))

    ERROR AT THIS LINE... please help

    opened by uneverknwwhoim 23
  • Dataset question

    Hi, thanks for the work. I have a question about the number of samples. After I filter out all the data (RAVDESS and SAVEE), I have just 1200 samples (960 training samples), and the final accuracy is around 0.5. I found that your training set has 1378 samples (X_train.shape is 1378*216), so I wonder what I did wrong.

    opened by hackiey 9
  • Training from scratch doesn't reach the same loss

    Hey, thanks a lot for the release. I've tried training the model from scratch using the datasets, but I can't reach the same validation loss. I noticed that the pre-trained network in the repo has two more convolutional layers compared to the code in the notebook, but adding them back doesn't help either.

    Did you use any additional tricks for training?

    For reference, above is what I see, below is what you have in the dataset:

    [screenshot: training-loss curves]

    opened by nicolov 8
  • getting RawData missing Error

    mylist= os.listdir('RawData/')

    I'm getting a FileNotFoundError. Please let me know if anyone knows how to solve this error, and also where I need to place the dataset.

    opened by saikumaradepu11 7
  • Accuracy problem

    Hi Mitesh, I'm trying to obtain the 70% accuracy you got, but I'm only getting 35%. Could you please tell me which database and send me the exact code you used to get the 70%? Thank you very much. My email is [email protected]

    opened by torrentillo 7
  • Loading and Testing

    The model was imported perfectly, but the LabelEncoder did not work:

        # added in cell 496
        from sklearn.preprocessing import LabelEncoder
        lb = LabelEncoder()
        livepredictions = (lb.inverse_transform((liveabc)))
        livepredictions

    throws the error:

    This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments....

    If you could help with this, it will be of great help.

    PS: I started by importing all files in cells 1, 2, and 3. Then I added opt = keras.optimizers.rmsprop(lr=0.00001, decay=1e-6) to cell 137, since opt was not defined, and then executed all the blocks in the Demo section.

    opened by HarshitSoni1903 7
  • Having doubt regarding the Rawdata in code :final_results_gender_test.ipynb

    I have a doubt regarding these lines: what data do they load, and in which format?

    mylist= os.listdir('RawData/')

    data, sampling_rate = librosa.load('RawData/f11 (2).wav')

    opened by vikash512 7
  • live demo

    Hello! I'm getting this error when trying to run the live demo. Can you please help me with this? Thank you.


    NameError                                 Traceback (most recent call last)
    in <module>
    ----> 1 livepreds = loaded_model.predict(twodim,
          2                                  batch_size=32,
          3                                  verbose=1)

    NameError: name 'loaded_model' is not defined

    opened by Ataya95 6
  • Loading and testing the model requires lb.fit_transform()?

    Hello,

    I am trying to use your model to test the live audio recording 'output10.wav'. I get the following error for livepredictions = (lb.inverse_transform((liveabc))):

    This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

    Do I need to do lb.fit_transform(y_train) and y_test before I run your model? Is there any other way to test the model without first extracting all the features from the dataset?

    Traceback (most recent call last):
      File "C:\Users\BhargaviiNadendla\Documents\GitHub\Speech-Emotion-Analyzer\load.py", line 73, in <module>
        livepredictions = (lb.inverse_transform((liveabc)))
      File "D:\Anaconda3\envs\Speech-Emotion-Analyzer\lib\site-packages\sklearn\preprocessing\label.py", line 272, in inverse_transform
        check_is_fitted(self, 'classes_')
      File "D:\Anaconda3\envs\Speech-Emotion-Analyzer\lib\site-packages\sklearn\utils\validation.py", line 951, in check_is_fitted
        raise NotFittedError(msg % {'name': type(estimator).__name__})
    sklearn.exceptions.NotFittedError: This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.


    opened by BhargaviNadendla 6
  • permission error 13

    I can't run the code; I get this error:

        C:\Speech-Emotion-Analyzer>python train.py
        Using TensorFlow backend.
        Traceback (most recent call last):
          File "train.py", line 102, in <module>
            X, sample_rate = librosa.load('data/'+y)
          File "C:\Python35\lib\site-packages\librosa\core\audio.py", line 112, in load
            with audioread.audio_open(os.path.realpath(path)) as input_file:
          File "C:\Python35\lib\site-packages\audioread\__init__.py", line 80, in audio_open
            return rawread.RawAudioFile(path)
          File "C:\Python35\lib\site-packages\audioread\rawread.py", line 61, in __init__
            self._fh = open(filename, 'rb')
        PermissionError: [Errno 13] Permission denied: 'C:\Speech-Emotion-Analyzer\data\Actor_01'

    opened by dangvansam98 6
  • Using the model for prediction

    Hello,

    I am trying to use the already trained model directly for predicting the emotions.

    I put this code in a Python file and ran it:

        def predict():
            lb = LabelEncoder()
            Model_filename = 'saved_models/Emotion_Voice_Detection_Model.h5'
            Model = load_model(Model_filename)
            X, sample_rate = librosa.load('filename.wav', res_type='kaiser_fast', duration=2.5, sr=22050*2, offset=0.5)
            sample_rate = np.array(sample_rate)
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13), axis=0)
            featurelive = mfccs
            livedf2 = featurelive
            livedf2 = pd.DataFrame(data=livedf2)
            livedf2 = livedf2.stack().to_frame().T
            twodim = np.expand_dims(livedf2, axis=2)
            livepreds = Model.predict(twodim, batch_size=32, verbose=1)
            livepreds1 = livepreds.argmax(axis=1)
            liveabc = livepreds1.astype(int).flatten()
            livepredictions = (lb.inverse_transform((liveabc)))
            livepredictions

    But it displays an error at (lb.inverse_transform); it says that (lb) needs to be fitted first. Is there a method that returns the emotion's name without needing to use the dataset and train the model again?

    Also, I have another question: is this a language-independent model? Thanks,

    opened by DinaAlBassam 3
  • wrong extraction of features

    This feature extraction doesn't make sense, because you are taking the average of the 13 MFCC features across coefficients, which is absurd; the mean should be taken over all the frames, so there should be a .T at the np.mean in the feature extraction. From there everything would change: your model, accuracy, everything, since the function is fundamentally wrong. I hope you change it, as this repo is the most starred one, so this could mislead many people.

    opened by chandrahaas02 0
  • AttributeError: 'list' object has no attribute 'items'

    When I tried loading the model with final_results_gender_test, I got AttributeError: 'list' object has no attribute 'items'. Please tell me how to resolve it.

    OS: macOS Big Sur. Environment: VS Code, Docker, Ubuntu 18.04

    librosa==0.8.0
    numpy==1.18.5
    matplotlib==3.1.0
    tensorflow==2.2.0
    Keras==2.4.3
    sklearn==0.0

    opened by Co-Graph-Okuda 0
  • Inference code?

    Hi, thanks for the nice work. I was trying to just use your model for inference. I looked at the notebook and copied the necessary parts, but I get the error This LabelEncoder instance is not fitted yet. Can you help me figure out what is missing in this code?

    import os
    
    from keras import regularizers
    import keras
    from keras.callbacks import ModelCheckpoint
    from keras.layers import Conv1D, MaxPooling1D, AveragePooling1D, Dense, Embedding, Input, Flatten, Dropout, Activation, LSTM
    from keras.models import Model, Sequential, model_from_json
    from keras.preprocessing import sequence
    from keras.preprocessing.sequence import pad_sequences
    from keras.preprocessing.text import Tokenizer
    from keras.utils import to_categorical
    import librosa
    import librosa.display
    from matplotlib.pyplot import specgram
    from sklearn.metrics import confusion_matrix
    from sklearn.preprocessing import LabelEncoder
    
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    
    
    opt = keras.optimizers.rmsprop(lr=0.00001, decay=1e-6)
    lb = LabelEncoder()
    
    
    json_file = open('model.json', 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    loaded_model = model_from_json(loaded_model_json)
    # load weights into new model
    loaded_model.load_weights("saved_models/Emotion_Voice_Detection_Model.h5")
    print("Loaded model from disk")
     
    X, sample_rate = librosa.load('h04.wav', res_type='kaiser_fast',duration=2.5,sr=22050*2,offset=0.5)
    sample_rate = np.array(sample_rate)
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13),axis=0)
    featurelive = mfccs
    livedf2 = featurelive
    livedf2= pd.DataFrame(data=livedf2)
    livedf2 = livedf2.stack().to_frame().T
    twodim= np.expand_dims(livedf2, axis=2)
    livepreds = loaded_model.predict(twodim, batch_size=32, verbose=1)
    
    livepreds1=livepreds.argmax(axis=1)
    liveabc = livepreds1.astype(int).flatten()
    livepredictions = (lb.inverse_transform((liveabc)))
    print(livepredictions)
    
    
    opened by arianaa30 2
Owner
Mitesh Puthran
Data Scientist trying to make sense.
SEOVER: Sentence-level Emotion Orientation Vector based Conversation Emotion Recognition Model

SEOVER-Master This code is the implementation of paper: SEOVER: Sentence-level Emotion Orientation Vector based Conversation Emotion Recognition Model

null 4 Feb 24, 2022
😇A pyTorch implementation of the DeepMoji model: state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc

------ Update September 2018 ------ It's been a year since TorchMoji and DeepMoji were released. We're trying to understand how it's being used such t

Hugging Face 865 Dec 24, 2022
Implementation of "StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis"

StrengthNet Implementation of "StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis" https://arxiv.org/abs/2110

RuiLiu 65 Dec 20, 2022
This is a model made out of Neural Network specifically a Convolutional Neural Network model

This is a model made out of Neural Network specifically a Convolutional Neural Network model. This was done with a pre-built dataset from the tensorflow and keras packages. There are other alternative libraries that can be used for this purpose, one of which is the PyTorch library.

null 9 Oct 18, 2022
This repo contains implementation of different architectures for emotion recognition in conversations.

Emotion Recognition in Conversations Updates Date Announcements 03/08/2021 We have released a new dataset M2H2: A Multimodal Multiparty

Deep Cognition and Language Research (DeCLaRe) Lab 1k Dec 30, 2022
An implementation of the AlphaZero algorithm for Gomoku (also called Gobang or Five in a Row)

AlphaZero-Gomoku This is an implementation of the AlphaZero algorithm for playing the simple board game Gomoku (also called Gobang or Five in a Row) f

Junxiao Song 2.8k Dec 26, 2022
Identify the emotion of multiple speakers in an Audio Segment

MevonAI - Speech Emotion Recognition Identify the emotion of multiple speakers in a Audio Segment Report Bug · Request Feature Try the Demo Here Table

Suyash More 110 Dec 3, 2022
A real-time speech emotion recognition application using Scikit-learn and gradio

Speech-Emotion-Recognition-App A real-time speech emotion recognition application using Scikit-learn and gradio. Requirements librosa==0.6.3 numpy sou

Son Tran 6 Oct 4, 2022
Speech Emotion Recognition with Fusion of Acoustic- and Linguistic-Feature-Based Decisions

APSIPA-SER-with-A-and-T This code is the implementation of Speech Emotion Recognition (SER) with acoustic and linguistic features. The network model i

kenro515 3 Jan 4, 2023
A object detecting neural network powered by the yolo architecture and leveraging the PyTorch framework and associated libraries.

Yolo-Powered-Detector A object detecting neural network powered by the yolo architecture and leveraging the PyTorch framework and associated libraries

Luke Wilson 1 Dec 3, 2021
🔮 Execution time predictions for deep neural network training iterations across different GPUs.

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training Habitat is a tool that predicts a deep neural network's

Geoffrey Yu 44 Dec 27, 2022
A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering.

DeepFilterNet A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering. libDF contains Rust code used for dat

Hendrik Schröter 292 Dec 25, 2022
Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution

Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution Figure: Example visualization of the method and baseline as a

Oliver Hahn 16 Dec 23, 2022
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee, Ky

Keon Lee 114 Dec 12, 2022
Deep learning (neural network) based remote photoplethysmography: how to extract pulse signal from video using deep learning tools

Deep-rPPG: Camera-based pulse estimation using deep learning tools Deep learning (neural network) based remote photoplethysmography: how to extract pu

Terbe Dániel 138 Dec 17, 2022
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation This is a demo implementation of BYOL for Audio (BYOL-A), a self-sup

NTT Communication Science Laboratories 160 Jan 4, 2023
MMdnn is a set of tools to help users inter-operate among different deep learning frameworks. E.g. model conversion and visualization. Convert models between Caffe, Keras, MXNet, Tensorflow, CNTK, PyTorch Onnx and CoreML.

MMdnn MMdnn is a comprehensive and cross-framework tool to convert, visualize and diagnose deep learning (DL) models. The "MM" stands for model manage

Microsoft 5.7k Jan 9, 2023
EmoTag helps you train emotion detection model for Chinese audios

emoTag emoTag helps you train emotion detection model for Chinese audios. Environment pip install -r requirement.txt Data We used Emotional Speech Dat

_zza 4 Sep 7, 2022
We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Multi-Modal Self-Supervision using GDT and StiCa This is an official pytorch implementation of papers: Multi-modal Self-Supervision from Generalized D

Facebook Research 42 Dec 9, 2022