MusCaps: Generating Captions for Music Audio
Ilaria Manco1,2, Emmanouil Benetos1, Elio Quinton2, Gyorgy Fazekas1
1 Queen Mary University of London, 2 Universal Music Group
This repository is the official implementation of "MusCaps: Generating Captions for Music Audio" (IJCNN 2021). In this work, we propose an encoder-decoder model to generate natural language descriptions of music audio. We provide code to train our model on any dataset of (audio, caption) pairs, together with code to evaluate the generated descriptions on a set of automatic metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE, SPIDEr).
Setup
The code was developed in Python 3.7 on Linux CentOS 7 and training was carried out on an RTX 2080 Ti GPU. Other GPUs and platforms have not been fully tested.
Clone the repo
git clone https://github.com/ilaria-manco/muscaps
cd muscaps
You'll need to have the libsndfile library installed. All other requirements, including the code package, can be installed with
pip install -r requirements.txt
pip install -e .
Project structure
root
├─ configs                       # Config files
│  ├─ datasets
│  ├─ models
│  └─ default.yaml
├─ data                          # Folder to save data (input data, pretrained model weights, etc.)
│  ├─ audio_encoders
│  ├─ datasets
│  │  └─ dataset_name
│  └─ ...
├─ muscaps
│  ├─ caption_evaluation_tools   # Translation metrics eval on audio captioning
│  ├─ datasets                   # Dataset classes
│  ├─ models                     # Model code
│  ├─ modules                    # Model components
│  ├─ scripts                    # Python scripts for training, evaluation etc.
│  ├─ trainers                   # Trainer classes
│  └─ utils                      # Utils
└─ save                          # Saved model checkpoints, logs, configs, predictions
   └─ experiments
      ├─ experiment_id1
      └─ ...
Dataset
The dataset used in our experiments is private and cannot be shared, but details on how to prepare an equivalent music captioning dataset are provided in the data README.
Pre-trained audio feature extractors
For the audio feature extraction component, MusCaps uses CNN-based audio tagging models like musicnn. In our experiments, we use @minzwon's implementation and pre-trained models, which you can download from the official repo. For example, to obtain the weights for the HCNN model trained on the MagnaTagATune dataset, run the following commands
mkdir -p data/audio_encoders
cd data/audio_encoders/
wget https://github.com/minzwon/sota-music-tagging-models/raw/master/models/mtat/hcnn/best_model.pth
mv best_model.pth mtt_hcnn.pth
Training
Dataset, model and training configurations are set in the respective yaml files in configs. Some of the fields can be overridden by arguments in the CLI (for more details on this, refer to the training script).
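As a rough illustration of how CLI arguments can take precedence over config defaults, here is a minimal sketch. The flag names are taken from the training command in this README, but the config layout (`DEFAULT_CONFIG`), the default values, and the helper names `build_parser` and `merge_cli_overrides` are assumptions made for the example. They are not the repo's actual code, so refer to the training script for the real behaviour.

```python
import argparse
import copy

# Hypothetical defaults standing in for values loaded from the yaml files in configs.
DEFAULT_CONFIG = {
    "model": {"feature_extractor": "hcnn", "pretrained_model": "mtt"},
    "training": {"device_num": 0},
}


def build_parser():
    """Parser mirroring the flags shown in the training command."""
    parser = argparse.ArgumentParser(description="Config override sketch")
    parser.add_argument("model_type", choices=["baseline", "attention"])
    parser.add_argument("--feature_extractor", choices=["musicnn", "hcnn"], default=None)
    parser.add_argument("--pretrained_model", choices=["msd", "mtt"], default=None)
    parser.add_argument("--device_num", type=int, default=None)
    return parser


def merge_cli_overrides(config, args):
    """Return a copy of config where any flag that was passed replaces the default."""
    merged = copy.deepcopy(config)  # leave the defaults untouched
    merged["model"]["type"] = args.model_type
    for key in ("feature_extractor", "pretrained_model"):
        value = getattr(args, key)
        if value is not None:  # None means the flag was not given
            merged["model"][key] = value
    if args.device_num is not None:
        merged["training"]["device_num"] = args.device_num
    return merged


# Example: only --feature_extractor is overridden, the rest keeps its default.
example_args = build_parser().parse_args(["attention", "--feature_extractor", "musicnn"])
example_config = merge_cli_overrides(DEFAULT_CONFIG, example_args)
```

Using `None` as the argparse default makes it possible to tell "flag not passed" apart from "flag passed with the default value", so the yaml value is only replaced when the user explicitly asks for it.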
To train the model with the default configs, simply run
cd muscaps/scripts/
python train.py <baseline/attention> --feature_extractor <musicnn/hcnn> --pretrained_model <msd/mtt> --device_num <gpu_number>
This will generate an experiment_id and create a new folder in save/experiments where the output will be saved.
If you wish to resume training from a saved checkpoint, run
python train.py <baseline/attention> --experiment_id <experiment_id> --device_num <gpu_number>
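The experiment folder logic described above (a fresh experiment_id for a new run, an existing one when resuming) could look roughly like the sketch below. The timestamp-based ID format and the function name `create_experiment_dir` are assumptions for illustration only. The actual ID scheme is defined in the repo's training code.

```python
import os
import tempfile
from datetime import datetime


def create_experiment_dir(save_root, experiment_id=None):
    """Create (or reuse) <save_root>/experiments/<experiment_id>.

    If experiment_id is None, a new ID is generated from the current time
    (an assumed format). If an ID is given, the existing folder is reused,
    which corresponds to resuming from a saved checkpoint.
    """
    if experiment_id is None:
        experiment_id = datetime.now().strftime("%Y-%m-%d-%H_%M_%S")
    experiment_dir = os.path.join(save_root, "experiments", experiment_id)
    os.makedirs(experiment_dir, exist_ok=True)  # no error if it already exists
    return experiment_id, experiment_dir


# Example with a temporary directory standing in for the save folder.
demo_root = tempfile.mkdtemp()
new_id, new_dir = create_experiment_dir(demo_root)              # new run
resumed_id, resumed_dir = create_experiment_dir(demo_root, new_id)  # resumed run
```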
Evaluation
To evaluate a model saved under <experiment_id> on the captioning task, run
cd muscaps/scripts/
python caption.py <experiment_id> --metrics True
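To give an intuition for what the n-gram based metrics measure, the sketch below computes the clipped n-gram precision that BLEU is built on. It is a simplified illustration with whitespace tokenisation, no brevity penalty, and no averaging over n-gram orders. It is not the evaluation code used in this repo, which relies on caption-evaluation-tools. The function names and example captions are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate caption against reference captions.

    Each candidate n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # Maximum count of each n-gram over all references.
    max_ref_counts = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Example: three of the four candidate unigrams appear in the reference.
unigram_score = clipped_precision("a slow piano melody", ["a slow piano ballad"], n=1)
```

The clipping step is what stops a degenerate caption that repeats a single common word from scoring well.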
Cite
@misc{manco2021muscaps,
title={MusCaps: Generating Captions for Music Audio},
author={Ilaria Manco and Emmanouil Benetos and Elio Quinton and Gyorgy Fazekas},
year={2021},
eprint={2104.11984},
archivePrefix={arXiv}
}
Acknowledgements
This repo reuses some code from the following repos:
- sota-music-tagging-models by @minzwon
- caption-evaluation-tools by @audio-captioning
- mmf by @facebookresearch
- a-PyTorch-Tutorial-to-Image-Captioning by @sgrvinod
- allennlp by @allenai
Contact
If you have any questions, please get in touch: i.manco@qmul.ac.uk.