Emotion-conditioned music generation using a Transformer-based model.

Overview

This is the official repository of EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation. The paper has been accepted by the International Society for Music Information Retrieval Conference (ISMIR) 2021.

  • Note: We release the transcribed MIDI files. As for the audio, due to copyright issues, we only release the YouTube IDs of the tracks and their timestamps. You may use an open-source crawler to obtain the audio files.
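
For illustration only, here is a minimal sketch of such a crawling step using the open-source tools yt-dlp and ffmpeg (neither is part of this repo); the YouTube ID is the example from the label shown below, and the clip boundaries are placeholders standing in for the released timestamps:

import subprocess

# Placeholder values; the real IDs and clip boundaries come from the
# released metadata (e.g. label.csv on Zenodo).
youtube_id = "0vLPYiPN7qY"
start, end = "00:00:30", "00:01:00"

# Download the audio with yt-dlp, then cut the clip with ffmpeg.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "mp3",
     "-o", "full_%(id)s.%(ext)s",
     f"https://www.youtube.com/watch?v={youtube_id}"],
    check=True,
)
subprocess.run(
    ["ffmpeg", "-i", f"full_{youtube_id}.mp3",
     "-ss", start, "-to", end, f"{youtube_id}_seg0.mp3"],
    check=True,
)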

Use EMOPIA by MusPy

  1. Install MusPy:
pip install muspy
  2. Use it in your script:
import muspy

emopia = muspy.EMOPIADataset("data/emopia/", download_and_extract=True)
emopia.convert()
music = emopia[0]
print(music.annotations[0].annotation)

You can get the label of the piece of music:

{'emo_class': '1', 'YouTube_ID': '0vLPYiPN7qY', 'seg_id': '0'}
  • emo_class: ['1', '2', '3', '4']
  • YouTube_ID: the YouTube ID of this piece of music
  • seg_id: the index of this clip within its source song, i.e., this is the i-th segment taken from that song (zero-based).

For more usage please refer to MusPy.
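
As a quick example building on the snippet above (a sketch only, assuming the dataset has already been downloaded and converted as shown), you can tally how many clips fall into each emotion class:

from collections import Counter

import muspy

# Same construction as in the snippet above.
emopia = muspy.EMOPIADataset("data/emopia/", download_and_extract=True)
emopia.convert()

# Count clips per emotion class ('1'..'4', i.e. the paper's 4Q labels).
counts = Counter(
    emopia[i].annotations[0].annotation["emo_class"] for i in range(len(emopia))
)
print(counts)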

Emotion Classification

For the classification models and codes, please refer to this repo.

Conditional Generation

Environment

  1. Install PyTorch and fast-transformers:

    • torch==1.7.0 (please install the build matching your CUDA version)

    • pytorch-fast-transformers:

      pip install --user pytorch-fast-transformers 
      

      or refer to the original repository

  2. Other requirements:

    pip install -r requirements.txt
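
A quick sanity check of the environment (a sketch, not part of the repo) is to confirm that both packages import and that CUDA is visible:

import torch
# Importing a builder class is enough to verify the package is installed.
from fast_transformers.builders import TransformerEncoderBuilder

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())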

Usage

Inference

  1. Download the checkpoints and put them into exp/:

    • Manually:

    • By command (install gdown first: pip install gdown); a sketch for unpacking the downloaded archives appears after the inference command below:

      #baseline:
      gdown --id 1Q9vQYnNJ0hXBFwcxdWQgDNmzoW3MLl3h --output exp/baseline.zip
      
      # no-pretrained transformer
      gdown --id 1ZULJgBRu2Wb3jxFmGfAHP1v_tjoryFM7 --output exp/no-pretrained_transformer.zip
      
      # pretrained transformer
      gdown --id 19Seq18b2JNzOamEQMG1uarKjj27HJkHu --output exp/pretrained_transformer.zip
      
  2. Inference options:

  • num_songs: the number of MIDI files you want to generate.

  • out_dir: the folder where the generated MIDI files will be saved. If not specified, MIDI files are saved to exp/MODEL_YOU_USED/gen_midis/.

  • task_type: must be the same as the task type specified during training.

    • '4-cls' for 4-class conditioning
    • 'Arousal' for conditioning on arousal only
    • 'Valence' for conditioning on valence only
    • 'ignore' for no conditioning
  • emo_tag: the target emotion class you want to assign.

    • If the task_type is '4-cls', emo_tag can be 1, 2, 3, or 4, referring to Q1, Q2, Q3, Q4.
    • If the task_type is 'Arousal', emo_tag can be 1 or 2: 1 for high arousal, 2 for low arousal.
    • If the task_type is 'Valence', emo_tag can be 1 or 2: 1 for high valence, 2 for low valence.
  3. Inference:

    python main_cp.py --mode inference --task_type 4-cls --load_ckt CHECKPOINT_FOLDER --load_ckt_loss 25 --num_songs 10 --emo_tag 1
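
The following sketch ties the steps above together: it unpacks the downloaded archives into exp/ and generates 10 MIDI files for each of the four classes. The checkpoint folder name and loss value are placeholders and should be adapted to the checkpoint you actually use:

import subprocess
import zipfile
from pathlib import Path

# Unpack the checkpoint archives downloaded in step 1 into exp/.
for archive in ("baseline.zip", "no-pretrained_transformer.zip", "pretrained_transformer.zip"):
    path = Path("exp") / archive
    if path.exists():
        with zipfile.ZipFile(path) as zf:
            zf.extractall("exp/")

# Generate 10 clips for each of the four emotion classes (Q1-Q4) with the
# 4-class model. The folder name and loss value are placeholders.
CHECKPOINT_FOLDER = "pretrained_transformer"
for emo_tag in (1, 2, 3, 4):
    subprocess.run(
        ["python", "main_cp.py", "--mode", "inference", "--task_type", "4-cls",
         "--load_ckt", CHECKPOINT_FOLDER, "--load_ckt_loss", "25",
         "--num_songs", "10", "--emo_tag", str(emo_tag)],
        check=True,
    )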
    

Train the model by yourself

  1. Prepare the data by following the steps.

  2. Training options:

  • exp_name: the name of the folder where the checkpoints will be saved.

  • data_parallel: use DataParallel to speed up training (0: do not use, 1: use).

  • task_type: the conditioning task:

    • '4-cls' for 4-class conditioning
    • 'Arousal' for conditioning on arousal only
    • 'Valence' for conditioning on valence only
    • 'ignore' for no conditioning

    a. Only train on EMOPIA (the no-pretrained transformer in the paper):

      python main_cp.py --path_train_data emopia --exp_name YOUR_EXP_NAME --load_ckt none
    

    b. Pre-train the transformer on AILabs1k7:

      python main_cp.py --path_train_data ailabs --exp_name YOUR_EXP_NAME --load_ckt none --task_type ignore
    

    c. Fine-tune the transformer on EMOPIA. For example, to fine-tune from the pre-trained model stored in 0309-1857 with loss = 30:

      python main_cp.py --path_train_data emopia --exp_name YOUR_EXP_NAME --load_ckt 0309-1857 --load_ckt_loss 30
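
Steps b and c can be chained in a small driver script. The sketch below assumes that --load_ckt accepts the experiment folder name used during pre-training; the experiment names and the loss checkpoint are placeholders, and in practice the pre-training run has to finish before you know which checkpoint to fine-tune from:

import subprocess

# b. Pre-train on AILabs1k7 without emotion conditioning.
subprocess.run(
    ["python", "main_cp.py", "--path_train_data", "ailabs",
     "--exp_name", "pretrain_run", "--load_ckt", "none",
     "--task_type", "ignore"],
    check=True,
)

# c. Fine-tune on EMOPIA from a chosen pre-training checkpoint.
# "pretrain_run" and the loss value 30 are placeholders: inspect the
# pre-training run first to decide which loss checkpoint to load.
subprocess.run(
    ["python", "main_cp.py", "--path_train_data", "emopia",
     "--exp_name", "finetune_run",
     "--load_ckt", "pretrain_run", "--load_ckt_loss", "30"],
    check=True,
)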
    

Baseline

  1. The baseline code is based on the work of Learning to Generate Music with Sentiment.

  2. According to the author, the model works best when trained with an LSTM of 4096 units, but that takes 12 days of training. Due to limited computational resources, we therefore used 512 units instead of 4096.

  3. In order to use this as an evaluation baseline against our model, the target emotion classes are expanded to the four quadrants (4Q) instead of just positive/negative.

Authors

The paper is a joint project with Joann, SeungHeon, and Nabin. This repository is maintained by Joann and me.

License

The EMOPIA dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). It is provided primarily for research purposes and must not be used for commercial purposes. When sharing results based on EMOPIA, any act that defames the original music owners is strictly prohibited.

The hand-drawn piano in the logo comes from Adobe Stock. The author is Burak, and it was purchased under a standard license.

Cite the dataset

@inproceedings{EMOPIA,
         author = {Hung, Hsiao-Tzu and Ching, Joann and Doh, Seungheon and Kim, Nabin and Nam, Juhan and Yang, Yi-Hsuan},
         title = {{EMOPIA}: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation},
         booktitle = {Proc. Int. Society for Music Information Retrieval Conf.},
         year = {2021}
}
Comments
  • How to generate REMI data by models?

    Hi, in the paper I notice that the model used for emotion (4Q/Valence/Arousal) classification in the objective metrics is LSTM-Att + REMI. But in the repo, the CP transformer model generates a .mid file and a .npy file for each music clip. Could I use both of them to generate REMI data? And how can I generate output in the REMI data format?

    1. The evaluation method: (screenshot omitted)
    2. The metrics: (screenshot omitted)

    thanks.

    opened by yen52205 7
  • 4Q and annotators

    Thank you so much for your great dataset. I am a little confused about the 4Q and the annotators. I read your paper, and based on my understanding, the 4Q are Q1 = HVHA, Q2 = HVLA, Q3 = LVHA, Q4 = LVLA; is that correct? And in the 'label.csv' on Zenodo, the annotator entries are A, B, C, D. What does that mean?

    opened by piaoziyue 3
  • two bugs

    Hello, excuse me, I am very interested in this work, but I found two bugs when running the code.

    1. There is no initialization parameter args.in_attn in workspace/transformer/main_cp.py;
    2. The function self.transformer_encoder(pos_emb, attn_mask, emb_emotion=emo_embd)) in workspace/transformer/models.py has no emb_emotion parameter.
    opened by ExitPath 3
  • Surface-level objective metrics of emotion-conditioned generation and training reimplementation problems.

    Hi, I tried to train the "CP transformer w/ pre-training" from scratch using the processed data offered in the repo. I used 1e-4 as the pre-training learning rate, selected loss_30.ckpt, and then trained it on the EMOPIA dataset with a 1e-5 learning rate. But I couldn't find further details about the surface-level objective metrics, such as how many clips of each type were used for evaluation, which loss checkpoint you used for evaluation, etc. To reproduce the surface-level objective metrics, I used loss_25.ckpt w/ pre-training, generated 100 clips for each of the 4Q conditions, and used MusPy to get the PR/NPC/POLY results (46.585 / 8.51 / 4.040805902072384 respectively). The paper's results are shown in the attached screenshot (omitted). Was there anything I didn't notice in training or evaluation? Could you please provide the details of this surface-level objective metrics evaluation?

    opened by yen52205 2
  • Where to find the timestamp for each segmented audio clip?

    According to the dataset description on Zenodo, I suppose the timestamp for each audio clip should be included in the label.csv file. However, within this file only the clip index within a particular song can be found. Did you segment the whole song into clips of equal length? If so, what is the duration of each clip; otherwise, how can I determine the timestamps?

    opened by deepspike 2
  • Add Cog config and demo link

    Hi @annahung31!

    First of all, I'm really impressed with EMOPIA's generation capabilities, the generated samples are very coherent!

    This pull request makes it possible to run EMOPIA in an interactive web interface on Replicate: https://replicate.ai/annahung31/emopia

    Replicate runs a Docker image, built from the included cog.yaml file by an open source tool called Cog. That Docker image can be pulled by others as well, so people can run your model on the command line without having to install Python dependencies.

    One question: I'm not 100% sure I've got the emotion tags right. Is this correct?

    • Emotion tag 1 == High valence, high arousal
    • Emotion tag 2 == Low valence, high arousal
    • Emotion tag 3 == Low valence, low arousal
    • Emotion tag 4 == High valence, low arousal

    If you click the "Sign in with GitHub" button on Replicate you can edit the description and add more examples, and we'll feature your model on the Explore page so more people find out about it 😊

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. I did my PhD in source separation and often struggled to get baselines running, so we're trying to fix that by adding Docker images and demo pages to models we really like.

    opened by andreasjansson 1
  • dataset

    Hello, excuse me, I am very interested in this work, but I have two problems when running the code. 1. Can you provide the dataset used in this work? Or 2. Can you provide the code for data preprocessing? Thank you!

    opened by ExitPath 1
  • Songs chosen in subjective-metrics

    Hi, I read the paper related to this repo and was curious about the subjective-metrics.

    Were the 4 random songs for each model in the subjective metrics randomly chosen from the 400 generated songs used in the objective metrics?

    thanks for answering.

    opened by yen52205 0
  • Training problem with family token (y_type)

    Hi, when I used your training code, I found something I didn't understand during the model's forward pass. During training, the model first predicts the family token (y_type) and then predicts the other kinds of tokens. The code (screenshot omitted) shows that you directly use the ground-truth family token to predict the other kinds of tokens.

    But in the generation process, you use the family token predicted earlier to predict the other kinds of tokens. I'm wondering why you chose this way to train and run inference, and whether it could cause an inconsistency between training and inference?

    Thanks to anyone who can help me figure this out!!

    opened by yen52205 0
  • Hi, I would like to know how to run the baseline using your repo?

    The baseline folder has many .py files; how do I run the baseline using your repo? I found that the baseline implementation needs data/train and data/val. How can I get these and run the scripts?

    opened by DRJYYDS 0
  • Question about ailab dataset, pretrained model and MIDI files of ailabs dataset.

    Hi, I have some questions about the pre-training dataset (ailabs) and the pretrained models.

    1. On the main page of this repo, I found a link that provides pretrained models. There are three kinds of models: "baseline, non-pretrained and pretrained". Does the "pretrained" checkpoint mean that "this model was pretrained on ailabs and then fine-tuned on EMOPIA"? Is there any checkpoint trained only on the ailabs dataset?

    2. I tried to train a model with the same hyperparameters as the one in the paper, and used the first-stage model trained only on ailabs to generate some songs. The following is the script I used: python main_cp.py --mode inference --load_ckt ailabs --load_ckt_loss 20 --d_model 512 --n_head 8 --n_layer 12 --ffd_dim 2048 --num_song 50 --emo_tag 0 --task_type ignore --out_dir 'exp/ailabs/gen_midis/loss_20/test_1/' While generating the songs, I noticed that the model would generally generate over 200 bars in a song and didn't know when to stop. Did you generate songs with the model trained only on ailabs? Did you get the same results as me?

    3. I want to know what kind of music is in the ailabs dataset; are there MIDI files of the ailabs dataset? When I checked the compound word transformer repo here, I found the link for the ailabs dataset. But when I downloaded it, I only saw "midi_analyzed, midi_sychronized, midi_transcribed" folders related to MIDI. Are the files in the "midi_analyzed" folder the songs represented in MIDI format?

    If anyone knows the solution, could you please share it with me? Thanks!!

    opened by yen52205 2