Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition"

Overview

Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition"


Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
Hua Zhang, Ruoyun Gou, Jili Shang, Fangyao Shen, Yifan Wu and Guojun Dai

Abstract: Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performances of SER are extremely reliant on the extracted features from speech signals. To establish an effective features extracting and classification model is still a challenging task. In this paper, we propose a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and datasets balancing. Secondly, we extract three-channel of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on EMO-DB and IEMOCAP database obtain the unweighted average recall (UAR) of 87.86 and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our propose method.

link to paper

Requirements

The project has been tested on a python=3.7 on Ubuntu 20.04 with the following packages:

tensorflow=2.7.0
librosa=0.8.1
scikit-learn=1.0.1

Uses librosa to read files, which needs sndfile.
Use sudo apt-get install libsndfile1 to install sndfile library

Usage

This repository can be used in the following ways:

  1. Using train.py.
    i. Download the RAVDESS dataset (only this dataset is supported as of now) and extract it within the dataset directory. Then run the commands below to move all files into the .dataset/ directory from indivisual sub folders like .dataset/Actor-xx. Run these from within the dataset directory. Make sure to be in the .dataset/ directory before running these comannds from a linux terminal.
    find . -mindepth 2 -type f -print -exec mv {} . \;  
    rm -r Actor_*
    Then 
    
    ii. Run train.py with required options. Use python train.py -h to check all options available. A saved_model will be put in the saved_model directory.
    iii. Use infer.py to run inference on a set of files.
  2. Using SpeechModel.py to get a Keras Model into your code. This model follows specifications mentioned in the paper. You may write your own dataset code.
    Example:
    # Your own dataset architecture
    from SpeechModel import SpeechModel
    SP = SpeechModel
    model = SP.create_model()
    # Rest of model training code
    
  3. Using just the load_wav and get_framed_log_melspectrogram functions from utils.py, you can write your own dataset funcion, as well as your own model. This function returns a (num_frames, 64, 64, 3) shaped array that can be fed to a TimeDistributed network of your choice.

Model Description

The model uses a TimeDistributed layer to feed all segments of a audio file that have been converted into 3 channel images to a pretrained CNN network (in this case, resnet50_v2, trained on imagenet). Following this, we have bi-lstm layers and attention layers. Then, there are Fully Connected Layers with dropout and finally, classification with 8 nodes.


(Image credits: Paper cited below)

Example usage

Training

  • Use python train.py -h to see a list of arguments.
  • python train.py 30 to train the model for 30 epochs

Inference

(Still to be implemented)

Limitations

  1. Currently this repo only supports dataset preparation for the RAVDESS model since different datasets describe their labels differently. You can still use this repo with other datasets by defining your own function to load the datasets and using the get_framed_log_melspectrograms function from utils.py.
    Then you may use SpeechModel.py to create a model based on specifications form the paper and train.
  2. Since I couldn't find a pretrained AlexNet model and didn't have the resources myself to train one from scratch, I used a pretrained ResNet 50 model. This may affect the performance and not match the results given by the authors.

Credits and acknowledgements:

I did this work for a hackathon. This method did not produce the best results for my use case. I suspect this was due to the dataset being very noisy.

Citation

AUTHOR=Zhang Hua, Gou Ruoyun, Shang Jili, Shen Fangyao, Wu Yifan, Dai Guojun
    
TITLE=Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition  
    
JOURNAL=Frontiers in Physiology     
    
VOLUME=12      
    
YEAR=2021
    
PAGES=177   
        
URL=https://www.frontiersin.org/article/10.3389/fphys.2021.643202     
    
DOI=10.3389/fphys.2021.643202    
    
ISSN=1664-042X   

ABSTRACT=Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performances of SER are extremely reliant on the extracted features from speech signals. To establish an effective features extracting and classification model is still a challenging task. In this paper, we propose a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and datasets balancing. Secondly, we extract three-channel of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on EMO-DB and IEMOCAP database obtain the unweighted average recall (UAR) of 87.86 and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our propose method.
You might also like...
Unofficial Implementation of MLP-Mixer in TensorFlow
Unofficial Implementation of MLP-Mixer in TensorFlow

mlp-mixer-tf Unofficial Implementation of MLP-Mixer [abs, pdf] in TensorFlow. Note: This project may have some bugs in it. I'm still learning how to i

Tensorflow implementation for Self-supervised Graph Learning for Recommendation

If the compilation is successful, the evaluator of cpp implementation will be called automatically. Otherwise, the evaluator of python implementation will be called.

Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.
Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.

PAWS-TF 🐾 Implementation of Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples (PAWS)

Unofficial TensorFlow  implementation of the Keyword Spotting Transformer model
Unofficial TensorFlow implementation of the Keyword Spotting Transformer model

Keyword Spotting Transformer This is the unofficial TensorFlow implementation of the Keyword Spotting Transformer model. This model is used to train o

A tensorflow implementation of GCN-LPA

GCN-LPA This repository is the implementation of GCN-LPA (arXiv): Unifying Graph Convolutional Neural Networks and Label Propagation Hongwei Wang, Jur

Official Tensorflow implementation of
Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line Segment Detection"

M-LSD: Towards Light-weight and Real-time Line Segment Detection Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line

Unofficial TensorFlow implementation of Protein Interface Prediction using Graph Convolutional Networks.
Unofficial TensorFlow implementation of Protein Interface Prediction using Graph Convolutional Networks.

[TensorFlow] Protein Interface Prediction using Graph Convolutional Networks Unofficial TensorFlow implementation of Protein Interface Prediction usin

Tensorflow implementation of MIRNet for Low-light image enhancement
Tensorflow implementation of MIRNet for Low-light image enhancement

MIRNet Tensorflow implementation of the MIRNet architecture as proposed by Learning Enriched Features for Real Image Restoration and Enhancement. Lanu

Tensorflow python implementation of
Tensorflow python implementation of "Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos"

Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos This repository is the official tensorflow python implementation

Owner
Ankush Malaker
Result driven, deep learning engineer with a passion to solve problems using computers and deep learning.
Ankush Malaker
Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy. Now with tensorflow 1.0 support. Evaluation usa

Marcel R. 349 Aug 6, 2022
TensorFlow Ranking is a library for Learning-to-Rank (LTR) techniques on the TensorFlow platform

TensorFlow Ranking is a library for Learning-to-Rank (LTR) techniques on the TensorFlow platform

null 2.6k Jan 4, 2023
Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Peter Lin 6.5k Jan 4, 2023
Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Robust Video Matting (RVM) English | 中文 Official repository for the paper Robust High-Resolution Video Matting with Temporal Guidance. RVM is specific

flow-dev 2 Aug 21, 2022
Implementation of Restricted Boltzmann Machine (RBM) and its variants in Tensorflow

xRBM Library Implementation of Restricted Boltzmann Machine (RBM) and its variants in Tensorflow Installation Using pip: pip install xrbm Examples Tut

Omid Alemi 55 Dec 29, 2022
Functional TensorFlow Implementation of Singular Value Decomposition for paper Fast Graph Learning

tf-fsvd TensorFlow Implementation of Functional Singular Value Decomposition for paper Fast Graph Learning with Unique Optimal Solutions Cite If you f

Sami Abu-El-Haija 14 Nov 25, 2021
StyleGAN2 - Official TensorFlow Implementation

StyleGAN2 - Official TensorFlow Implementation

NVIDIA Research Projects 10.1k Dec 28, 2022
An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow implementation of SERank model. The code is developed based on TF-Ranking.

SERank An efficient and effective learning to rank algorithm by mining information across ranking candidates. This repository contains the tensorflow

Zhihu 44 Oct 20, 2022
Implementation of Perceiver, General Perception with Iterative Attention in TensorFlow

Perceiver This Python package implements Perceiver: General Perception with Iterative Attention by Andrew Jaegle in TensorFlow. This model builds on t

Rishit Dagli 84 Oct 15, 2022
Minimal implementation of Denoised Smoothing: A Provable Defense for Pretrained Classifiers in TensorFlow.

Denoised-Smoothing-TF Minimal implementation of Denoised Smoothing: A Provable Defense for Pretrained Classifiers in TensorFlow. Denoised Smoothing is

Sayak Paul 19 Dec 11, 2022