A CROSS-MODAL FUSION NETWORK BASED ON SELF-ATTENTION AND RESIDUAL STRUCTURE FOR MULTIMODAL EMOTION RECOGNITION

skeleton

Last update: Sep 26, 2022

Related tags

Deep Learning CFN-SR

Overview

CFN-SR

A CROSS-MODAL FUSION NETWORK BASED ON SELF-ATTENTION AND RESIDUAL STRUCTURE FOR MULTIMODAL EMOTION RECOGNITION

The audio-video based multimodal emotion recognition has attracted a lot of attention due to its robust performance. Most of the existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy in the features of different modalities without fully considering the complementary properties between modal information, and these approaches do not guarantee the non-loss of original semantic information during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. Firstly, we perform representation learning for audio and video modalities to obtain the semantic features of the two modalities by efficient ResNeXt and 1D CNN, respectively. Secondly, we feed the features of the two modalities into the cross-modal blocks separately to ensure efficient complementarity and completeness of information through the self-attention mechanism and residual structure. Finally, we obtain the output of emotions by splicing the obtained fused representation with the original representation. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters.

Setup

Install dependencies

pip install opencv-python moviepy librosa sklearn

Download the RAVDESS dataset using the bash script

bash scripts/download_ravdess.sh <path/to/RAVDESS>

Or download the files manually

and follow the folder structure below and have .csv files in landmarks/ (do not modify file names)

RAVDESS/
    landmarks/
        .csv landmark files
    Actor_01/
    ...
    Actor_24/

Preprocess the dataset using the following

python dataset_prep.py --datadir <path/to/RAVDESS>

Generated folder structure (do not modify file names)

RAVDESS/
    landmarks/
        .csv landmark files
    Actor_01/
    ...
    Actor_24/
    preprocessed/
        Actor_01/
        ...
        Actor_24/
            01-01-01-01-01-01-24.mp4/
                frames/
                    .jpg frames
                audios/
                    .wav raw audio
                    .npy MFCC features
            ...

Download checkpoints folder from Google Drive. The following script downloads all pretrained models (unimodal and MSAF) for all 6 folds.

bash scripts/download_checkpoints.sh

Train

python main_msaf.py --datadir <path/to/RAVDESS/preprocessed> --checkpointdir checkpoints --train

All parameters

usage: main_msaf.py [-h] [--datadir DATADIR] [--k_fold K_FOLD] [--lr LR]
                    [--batch_size BATCH_SIZE] [--num_workers NUM_WORKERS]
                    [--epochs EPOCHS] [--checkpointdir CHECKPOINTDIR] [--no_verbose]
                    [--log_interval LOG_INTERVAL] [--no_save] [--train]

Result

Model	Fusion Stage	Accuracy	#Params
Averaging	Late	68.82	25.92M
Multiplicative	Late	70.35	25.92M
Multiplication	Late	70.56	25.92M
Concat + FC	Early	71.04	26.87M
MCBP	Early	71.32	51.03M
MMTM	Model	73.12	31.97M
MSAF	Model	74.86	25.94M
ERANNs	Model	74.80
CFN-SR(Ours)	Model	75.76	26.30M

Reference

Note that some codes references MSAF

Comments

Missing ResNeXt50 and mfccNet

Hi,

I'm unable to test this network, as there's no checkpoints folder with ResNeXt50 and mfccNet within it. I would happily download the model separately, but I don't know to do so. The specific error is a FileNotFoundError when trying to load ResNeXt50. I expect the same will happen for mfccNet once ResNeXt50 exists:

Traceback (most recent call last):
  File "main_msaf.py", line 163, in <module>
    video_model_checkpoint = torch.load(video_model_path) if use_cuda else \
  File "/home/nickick/phd/CFN-SR/env/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/nickick/phd/CFN-SR/env/lib/python3.8/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/nickick/phd/CFN-SR/env/lib/python3.8/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/resnext50/fold_1_resnext50_best.pth'

Could you update the repository to include those two models, or include links on where to find them, or instructions on the steps taken to train them if that approach was taken?

Thanks, Nick

opened by Nickick-ICRS 1

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

Episodic Transformers (E.T.) Episodic Transformer for Vision-and-Language Navigation Alexander Pashevich, Cordelia Schmid, Chen Sun Episodic Transform

62 Dec 24, 2022

Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

A CROSS-MODAL FUSION NETWORK BASED ON SELF-ATTENTION AND RESIDUAL STRUCTURE FOR MULTIMODAL EMOTION RECOGNITION

Related tags

Overview

CFN-SR

Setup

Train

Result

Reference

You might also like...

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

[CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion"

Fusion-DHL: WiFi, IMU, and Floorplan Fusion for Dense History of Locations in Indoor Environments

CCAFNet: Crossflow and Cross-scale Adaptive Fusion Network for Detecting Salient Objects in RGB-D Images

PyTorch code for our ECCV 2018 paper "Image Super-Resolution Using Very Deep Residual Channel Attention Networks"

Image Super-Resolution Using Very Deep Residual Channel Attention Networks

[AAAI 2021] MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Comments

Missing ResNeXt50 and mfccNet

Owner

skeleton

PyTorch code for the paper "Complementarity is the King: Multi-modal and Multi-grained Hierarchical Semantic Enhancement Network for Cross-modal Retrieval".

This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.

SEOVER: Sentence-level Emotion Orientation Vector based Conversation Emotion Recognition Model

Self-supervised Multi-modal Hybrid Fusion Network for Brain Tumor Segmentation

Speech Emotion Recognition with Fusion of Acoustic- and Linguistic-Feature-Based Decisions

Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral)

Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition"

CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Official code of ICCV2021 paper "Residual Attention: A Simple but Effective Method for Multi-Label Recognition"

Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.