Emotion-conditioned music generation using a Transformer-based model.

Overview

This is the official repository of EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation. The paper has been accepted by the International Society for Music Information Retrieval Conference (ISMIR) 2021.

  • Note: We release the transcribed MIDI files. As for the audio, due to copyright issues, we only release the YouTube IDs of the tracks and their timestamps. You may use an open-source crawler to obtain the audio files.
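
For illustration only, here is a minimal sketch of such a crawling step using the open-source tools yt-dlp and ffmpeg (neither is part of this repo); the YouTube ID is the example from the label shown below, and the clip boundaries are placeholders standing in for the released timestamps:

import subprocess

# Placeholder values; the real IDs and clip boundaries come from the
# released metadata (e.g. label.csv on Zenodo).
youtube_id = "0vLPYiPN7qY"
start, end = "00:00:30", "00:01:00"

# Download the audio with yt-dlp, then cut the clip with ffmpeg.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "mp3",
     "-o", "full_%(id)s.%(ext)s",
     f"https://www.youtube.com/watch?v={youtube_id}"],
    check=True,
)
subprocess.run(
    ["ffmpeg", "-i", f"full_{youtube_id}.mp3",
     "-ss", start, "-to", end, f"{youtube_id}_seg0.mp3"],
    check=True,
)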

Use EMOPIA by MusPy

  1. Install MusPy:
pip install muspy
  2. Use it in your script:
import muspy

emopia = muspy.EMOPIADataset("data/emopia/", download_and_extract=True)
emopia.convert()
music = emopia[0]
print(music.annotations[0].annotation)

You can get the label of the piece of music:

{'emo_class': '1', 'YouTube_ID': '0vLPYiPN7qY', 'seg_id': '0'}
  • emo_class: ['1', '2', '3', '4']
  • YouTube_ID: the YouTube ID of this piece of music
  • seg_id: the index of this clip within its source song, i.e., this is the i-th segment taken from that song (zero-based).

For more usage please refer to MusPy.
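
As a quick example building on the snippet above (a sketch only, assuming the dataset has already been downloaded and converted as shown), you can tally how many clips fall into each emotion class:

from collections import Counter

import muspy

# Same construction as in the snippet above.
emopia = muspy.EMOPIADataset("data/emopia/", download_and_extract=True)
emopia.convert()

# Count clips per emotion class ('1'..'4', i.e. the paper's 4Q labels).
counts = Counter(
    emopia[i].annotations[0].annotation["emo_class"] for i in range(len(emopia))
)
print(counts)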

Emotion Classification

For the classification models and codes, please refer to this repo.

Conditional Generation

Environment

  1. Install PyTorch and fast-transformers:

    • torch==1.7.0 (please install the build matching your CUDA version)

    • pytorch-fast-transformers:

      pip install --user pytorch-fast-transformers 
      

      or refer to the original repository

  2. Other requirements:

    pip install -r requirements.txt
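
A quick sanity check of the environment (a sketch, not part of the repo) is to confirm that both packages import and that CUDA is visible:

import torch
# Importing a builder class is enough to verify the package is installed.
from fast_transformers.builders import TransformerEncoderBuilder

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())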

Usage

Inference

  1. Download the checkpoints and put them into exp/:

    • Manually:

    • By command (install gdown first: pip install gdown); a sketch for unpacking the downloaded archives appears after the inference command below:

      #baseline:
      gdown --id 1Q9vQYnNJ0hXBFwcxdWQgDNmzoW3MLl3h --output exp/baseline.zip
      
      # no-pretrained transformer
      gdown --id 1ZULJgBRu2Wb3jxFmGfAHP1v_tjoryFM7 --output exp/no-pretrained_transformer.zip
      
      # pretrained transformer
      gdown --id 19Seq18b2JNzOamEQMG1uarKjj27HJkHu --output exp/pretrained_transformer.zip
      
  2. Inference options:

  • num_songs: the number of MIDI files you want to generate.

  • out_dir: the folder where the generated MIDI files will be saved. If not specified, MIDI files are saved to exp/MODEL_YOU_USED/gen_midis/.

  • task_type: must be the same as the task type specified during training.

    • '4-cls' for 4-class conditioning
    • 'Arousal' for conditioning on arousal only
    • 'Valence' for conditioning on valence only
    • 'ignore' for no conditioning
  • emo_tag: the target emotion class you want to assign.

    • If the task_type is '4-cls', emo_tag can be 1, 2, 3, or 4, referring to Q1, Q2, Q3, Q4.
    • If the task_type is 'Arousal', emo_tag can be 1 or 2: 1 for high arousal, 2 for low arousal.
    • If the task_type is 'Valence', emo_tag can be 1 or 2: 1 for high valence, 2 for low valence.
  3. Inference:

    python main_cp.py --mode inference --task_type 4-cls --load_ckt CHECKPOINT_FOLDER --load_ckt_loss 25 --num_songs 10 --emo_tag 1
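
The following sketch ties the steps above together: it unpacks the downloaded archives into exp/ and generates 10 MIDI files for each of the four classes. The checkpoint folder name and loss value are placeholders and should be adapted to the checkpoint you actually use:

import subprocess
import zipfile
from pathlib import Path

# Unpack the checkpoint archives downloaded in step 1 into exp/.
for archive in ("baseline.zip", "no-pretrained_transformer.zip", "pretrained_transformer.zip"):
    path = Path("exp") / archive
    if path.exists():
        with zipfile.ZipFile(path) as zf:
            zf.extractall("exp/")

# Generate 10 clips for each of the four emotion classes (Q1-Q4) with the
# 4-class model. The folder name and loss value are placeholders.
CHECKPOINT_FOLDER = "pretrained_transformer"
for emo_tag in (1, 2, 3, 4):
    subprocess.run(
        ["python", "main_cp.py", "--mode", "inference", "--task_type", "4-cls",
         "--load_ckt", CHECKPOINT_FOLDER, "--load_ckt_loss", "25",
         "--num_songs", "10", "--emo_tag", str(emo_tag)],
        check=True,
    )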
    

Train the model by yourself

  1. Prepare the data by following the steps.

  2. Training options:

  • exp_name: the name of the folder where the checkpoints will be saved.

  • data_parallel: use DataParallel to speed up training (0: do not use, 1: use).

  • task_type: the conditioning task:

    • '4-cls' for 4-class conditioning
    • 'Arousal' for conditioning on arousal only
    • 'Valence' for conditioning on valence only
    • 'ignore' for no conditioning

    a. Only train on EMOPIA (the no-pretrained transformer in the paper):

      python main_cp.py --path_train_data emopia --exp_name YOUR_EXP_NAME --load_ckt none
    

    b. Pre-train the transformer on AILabs1k7:

      python main_cp.py --path_train_data ailabs --exp_name YOUR_EXP_NAME --load_ckt none --task_type ignore
    

    c. Fine-tune the transformer on EMOPIA. For example, to fine-tune from the pre-trained model stored in 0309-1857 with loss = 30:

      python main_cp.py --path_train_data emopia --exp_name YOUR_EXP_NAME --load_ckt 0309-1857 --load_ckt_loss 30
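
Steps b and c can be chained in a small driver script. The sketch below assumes that --load_ckt accepts the experiment folder name used during pre-training; the experiment names and the loss checkpoint are placeholders, and in practice the pre-training run has to finish before you know which checkpoint to fine-tune from:

import subprocess

# b. Pre-train on AILabs1k7 without emotion conditioning.
subprocess.run(
    ["python", "main_cp.py", "--path_train_data", "ailabs",
     "--exp_name", "pretrain_run", "--load_ckt", "none",
     "--task_type", "ignore"],
    check=True,
)

# c. Fine-tune on EMOPIA from a chosen pre-training checkpoint.
# "pretrain_run" and the loss value 30 are placeholders: inspect the
# pre-training run first to decide which loss checkpoint to load.
subprocess.run(
    ["python", "main_cp.py", "--path_train_data", "emopia",
     "--exp_name", "finetune_run",
     "--load_ckt", "pretrain_run", "--load_ckt_loss", "30"],
    check=True,
)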
    

Baseline

  1. The baseline code is based on the work of Learning to Generate Music with Sentiment.

  2. According to the author, the model works best when trained with an LSTM of 4096 units, but that takes 12 days of training. Due to limited computational resources, we therefore used 512 units instead of 4096.

  3. In order to use this as an evaluation baseline against our model, the target emotion classes are expanded to the four quadrants (4Q) instead of just positive/negative.

Authors

The paper is a joint project with Joann, SeungHeon, and Nabin. This repository is maintained by Joann and me.

License

The EMOPIA dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). It is provided primarily for research purposes and must not be used for commercial purposes. When sharing results based on EMOPIA, any act that defames the original music owners is strictly prohibited.

The hand-drawn piano in the logo comes from Adobe Stock. The author is Burak, and it was purchased under a standard license.

Cite the dataset

@inproceedings{EMOPIA,
         author = {Hung, Hsiao-Tzu and Ching, Joann and Doh, Seungheon and Kim, Nabin and Nam, Juhan and Yang, Yi-Hsuan},
         title = {{EMOPIA}: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation},
         booktitle = {Proc. Int. Society for Music Information Retrieval Conf.},
         year = {2021}
}
Comments
  • How to generate REMI data by models?

    Hi, in the paper I notice that the model used for emotion (4Q/Valence/Arousal) classification in the objective metrics is LSTM-Att + REMI. But in the repo, the CP transformer model generates a .mid file and a .npy file for each music clip. Could I use both of them to generate REMI data? And how can I generate output in the REMI data format?

    1. The evaluation method: (screenshot omitted)
    2. The metrics: (screenshot omitted)

    thanks.

    opened by yen52205 7
  • 4Q and annotators

    Thank you so much for your great dataset. I am a little confused about the 4Q and the annotators. I read your paper, and based on my understanding, the 4Q are Q1 = HVHA, Q2 = HVLA, Q3 = LVHA, Q4 = LVLA; is that correct? And in the 'label.csv' on Zenodo, the annotator entries are A, B, C, D. What does that mean?

    opened by piaoziyue 3
  • two bugs

    Hello, excuse me, I am very interested in this work, but I found two bugs when running the code.

    1. There is no initialization parameter args.in_attn in workspace/transformer/main_cp.py;
    2. The function self.transformer_encoder(pos_emb, attn_mask, emb_emotion=emo_embd)) in workspace/transformer/models.py has no emb_emotion parameter.
    opened by ExitPath 3
  • Surface-level objective metrics of emotion-conditioned generation and training reimplementation problems.

    Hi, I tried to train the "CP transformer w/ pre-training" from scratch using the processed data offered in the repo. I used 1e-4 as the pre-training learning rate, selected loss_30.ckpt, and then trained it on the EMOPIA dataset with a 1e-5 learning rate. But I couldn't find further details about the surface-level objective metrics, such as how many clips of each type were used for evaluation, which loss checkpoint you used for evaluation, etc. To reproduce the surface-level objective metrics, I used loss_25.ckpt w/ pre-training, generated 100 clips for each of the 4Q conditions, and used MusPy to get the PR/NPC/POLY results (46.585 / 8.51 / 4.040805902072384 respectively). The paper's results are shown in the attached screenshot (omitted). Was there anything I didn't notice in training or evaluation? Could you please provide the details of this surface-level objective metrics evaluation?

    opened by yen52205 2
  • Where to find the timestamp for each segmented audio clip?

    According to the dataset description on Zenodo, I suppose the timestamp for each audio clip should be included in the label.csv file. However, within this file only the clip index within a particular song can be found. Did you segment the whole song into clips of equal length? If so, what is the duration of each clip; otherwise, how can I determine the timestamps?

    opened by deepspike 2
  • Add Cog config and demo link

    Hi @annahung31!

    First of all, I'm really impressed with EMOPIA's generation capabilities, the generated samples are very coherent!

    This pull request makes it possible to run EMOPIA in an interactive web interface on Replicate: https://replicate.ai/annahung31/emopia

    Replicate runs a Docker image, built from the included cog.yaml file by an open source tool called Cog. That Docker image can be pulled by others as well, so people can run your model on the command line without having to install Python dependencies.

    One question: I'm not 100% sure I've got the emotion tags right. Is this correct?

    • Emotion tag 1 == High valence, high arousal
    • Emotion tag 2 == Low valence, high arousal
    • Emotion tag 3 == Low valence, low arousal
    • Emotion tag 4 == High valence, low arousal

    If you click the "Sign in with GitHub" button on Replicate you can edit the description and add more examples, and we'll feature your model on the Explore page so more people find out about it 😊

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. I did my PhD in source separation and often struggled to get baselines running, so we're trying to fix that by adding Docker images and demo pages to models we really like.

    opened by andreasjansson 1
  • dataset

    Hello, excuse me, I am very interested in this work, but I have two problems when running the code. 1. Can you provide the dataset used in this work? Or 2. Can you provide the code for data preprocessing? Thank you!

    opened by ExitPath 1
  • Songs chosen in subjective-metrics

    Hi, I read the paper related to this repo and was curious about the subjective-metrics.

    Were the 4 random songs for each model in the subjective metrics randomly chosen from the 400 generated songs used in the objective metrics?

    thanks for answering.

    opened by yen52205 0
  • Training problem with family token (y_type)

    Hi, when I used your training code, I found something I didn't understand during the model's forward pass. During training, the model first predicts the family token (y_type) and then predicts the other kinds of tokens. The code (screenshot omitted) shows that you directly use the ground-truth family token to predict the other kinds of tokens.

    But in the generation process, you use the family token predicted earlier to predict the other kinds of tokens. I'm wondering why you chose this way to train and run inference, and whether it could cause an inconsistency between training and inference?

    Thanks to anyone who can help me figure this out!!

    opened by yen52205 0
  • Hi, I would like to know how to run the baseline using your repo?

    The baseline folder has many .py files; how do I run the baseline using your repo? I found that the baseline implementation needs data/train and data/val. How can I get these and run the scripts?

    opened by DRJYYDS 0
  • Question about ailab dataset, pretrained model and MIDI files of ailabs dataset.

    Hi, I have some questions about the pre-training dataset (ailabs) and the pretrained models.

    1. On the main page of this repo, I found a link that provides pretrained models. There are three kinds of models: "baseline, non-pretrained and pretrained". Does the "pretrained" checkpoint mean that "this model was pretrained on ailabs and then fine-tuned on EMOPIA"? Is there any checkpoint trained only on the ailabs dataset?

    2. I tried to train a model with the same hyperparameters as the one in the paper, and used the first-stage model trained only on ailabs to generate some songs. The following is the script I used: python main_cp.py --mode inference --load_ckt ailabs --load_ckt_loss 20 --d_model 512 --n_head 8 --n_layer 12 --ffd_dim 2048 --num_song 50 --emo_tag 0 --task_type ignore --out_dir 'exp/ailabs/gen_midis/loss_20/test_1/' While generating the songs, I noticed that the model would generally generate over 200 bars in a song and didn't know when to stop. Did you generate songs with the model trained only on ailabs? Did you get the same results as me?

    3. I want to know what kind of music is in the ailabs dataset; are there MIDI files of the ailabs dataset? When I checked the compound word transformer repo here, I found the link for the ailabs dataset. But when I downloaded it, I only saw "midi_analyzed, midi_sychronized, midi_transcribed" folders related to MIDI. Are the files in the "midi_analyzed" folder the songs represented in MIDI format?

    If anyone knows the solution, could you please share it with me? Thanks!!

    opened by yen52205 2