Audio Retrieval with Natural Language Queries: A Benchmark Study
Paper | Project page | Text-to-audio search demo
This repository is the implementation of Audio Retrieval with Natural Language Queries: A Benchmark Study. It builds on the Audio Retrieval with Natural Language Queries repository and provides code for downloading the SoundDescs dataset and for reproducing all results from Audio Retrieval with Natural Language Queries: A Benchmark Study. The code is based on the Use What You Have: Video retrieval using representations from collaborative experts and MMT: Multi-modal Transformer for Video Retrieval repositories.
The datasets used in this paper are SoundDescs, AudioCaps, CLOTHO, Activity-Net and QuerYD.
Requirements and datasets
The required libraries for running this code can be found in requirements.txt. CUDA 10.1 and Python 3.7 were used.
conda create --name audio-retrieval python=3.7
conda activate audio-retrieval
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
To be able to run the code below, features extracted from various datasets need to be downloaded. If there is not enough space in your working location to store these features (for SoundDescs and AudioCaps the files are larger than 6GB, while the others are under 1GB), you will need to create a folder called data inside this repository which is a symlink to a location with enough free space. As an example, run the following from the audio-retrieval-benchmark base directory (replacing the placeholder with your storage path):
ln -s <path-to-storage-folder> data
To download features for the AudioCaps, Clotho, Activity-Net, and QuerYD datasets, follow the steps here. The SoundDescs features can be downloaded analogously:
python3 misc/sync_experts.py --dataset SoundDescs
If you want to use the raw audio data for SoundDescs, we explain how to download the dataset below.
SoundDescs dataset download and pre-processing
This is a tool to allow for easy download of audio files and text information from the https://sound-effects.bbcrewind.co.uk/search page.
Downloading audios
First, download the download_links_renamed.txt file or, if needed, the download_links.txt file, and save it in the folder that will be used for downloading audios. To be able to download the files, the --download_folder flag must be set when running the commands below.
To only download a few audio files, use the --limit flag with non-zero values.
To download audio files in zip form for the SoundDescs dataset, simply run the line below. To download multiple files at the same time, use the --processes flag. We recommend using no more than two processes to avoid being blocked by the website.
python sounddescs_download_audios.py --download_folder {location where to save files} --processes 2
To unzip the audio files to a new folder, run the line below. Here a larger number of processes can be used:
python sounddescs_download_audios.py --action unzipping --processes 20 --download_folder {location where to save files}
To re-sample the audio files at 16 kHz and convert them to the format needed to run CE, MoEE, and MMT, run the following command:
python sounddescs_wavs_transforms.py --exp resample --initial_folder {location where files were saved before} --dest_folder {location where resampled files are stored} --processes 20
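For reference, the resampling step is conceptually just a 16 kHz mono conversion of each file. The sketch below is not the implementation of sounddescs_wavs_transforms.py; it is a minimal illustration assuming librosa and soundfile are installed, and the in_path/out_path names are hypothetical:

```python
# Minimal resampling sketch (not the actual sounddescs_wavs_transforms.py logic):
# load one audio file and write a 16 kHz mono copy.
import librosa
import soundfile as sf

in_path = "downloaded_audios/example.wav"   # hypothetical input path
out_path = "resampled_audios/example.wav"   # hypothetical output path

audio, sr = librosa.load(in_path, sr=16000, mono=True)  # resample to 16 kHz while loading
sf.write(out_path, audio, sr)                            # write the resampled waveform
```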
Other files available that might prove useful are found in the sounddescs_data folder. The files are:
- categories.pkl - this file contains tags for most audio files. These tags can be Nature, Clocks, Sport, etc. Some files have more than one tag and some have no tags.
- descriptions.pkl - this file contains the descriptions associated with the audio files. These are used as captions in our CE, MoEE, and MMT experiments.
- extra_info.pkl - this file contains information about the audio content such as file type (e.g. MP3) or sample rate (e.g. 44.1 kHz).
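As a minimal sketch for inspecting these files, assuming each pickle stores a Python dictionary keyed by an audio file identifier (an assumption, not verified here):

```python
import pickle

# Load the SoundDescs metadata shipped in the sounddescs_data folder.
with open("sounddescs_data/categories.pkl", "rb") as f:
    categories = pickle.load(f)       # assumed: {audio_id: tags}
with open("sounddescs_data/descriptions.pkl", "rb") as f:
    descriptions = pickle.load(f)     # assumed: {audio_id: caption text}
with open("sounddescs_data/extra_info.pkl", "rb") as f:
    extra_info = pickle.load(f)       # assumed: {audio_id: file type, sample rate, ...}

# Peek at one entry to check the key format and contents.
audio_id = next(iter(descriptions))
print(audio_id)
print("description:", descriptions[audio_id])
print("tags:", categories.get(audio_id, "no tags"))
print("extra info:", extra_info.get(audio_id))
```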
Terms and conditions for SoundDescs dataset
To download and use the SoundDescs dataset, you need to comply with the terms and conditions of the RemArc Licence.
This is from the official website that hosts the data:
By continuing, you agree to comply with the terms of the RemArc Licence for this and any future downloads.
Commercial use of this content is not allowed under the RemArc license.
For commercial use, buy the sound effect from Pro Sound Effects which can be found in the More Detail section for each sound effect.
Evaluating pretrained CE, MoEE, and MMT models on multiple seeds and reproducing results
To reproduce results for the CE, MoEE, and MMT models in the tables below, multiple models trained with different seeds need to be downloaded and evaluated on the test sets.
The steps needed to reproduce results are:
- Picking the experiment to be reproduced. The available experiment names are listed in misc/exps-names.md.
- Downloading the features and splits corresponding to the dataset for which the experiment is run. For example, for AudioCaps run:
# fetch the pretrained experts for AudioCaps
python3 misc/sync_experts.py --dataset AudioCaps
Additional examples for the datasets used in this paper can be found in misc/exps-names.md.
- Running the eval.py script.
For example, to reproduce the experiments for AudioCaps with complete visual and audio experts, run the following line:
python eval.py --experiment audiocaps-train-full-ce-r2p1d-inst-vggish-vggsound
If the --experiment flag is not provided, the eval.py script will download and evaluate all CE and MoEE models on the test set.
Training a new model
Training a new CE audio-text embedding requires:
- The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (see misc/exps-names.md for details).
- A config.json file. You can define your own, or use one of the provided configs in the configs directory (a short sketch for inspecting a provided config follows this list).
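As mentioned above, a provided config can be inspected before launching training. The snippet below is only a convenience sketch; it uses the CLOTHO config path from the example further down and makes no assumption about the keys inside the file:

```python
import json

# Load one of the provided training configs (path taken from the CLOTHO example below).
with open("configs/clotho/train-vggish-vggsound.json") as f:
    config = json.load(f)

# List the top-level sections to see what can be overridden in a custom config.
for key in config:
    print(key)
```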
Training is then performed with the following command:
python3 train.py --config <path-to-config.json> --device <gpu-id>
where <gpu-id> is the index of the GPU to train on. This option can be omitted for training on the CPU.
For example, to train a new embedding for the CLOTHO dataset, run the following sequence of commands:
# fetch the pretrained experts for CLOTHO
python3 misc/sync_experts.py --dataset CLOTHO
# Train the model
python3 train.py --config configs/clotho/train-vggish-vggsound.json --device 0
To train MMT, use the following command:
python -m mmt/train --config <path-to-config.json>
For example, to train MMT on the CLOTHO dataset, run the following sequence of commands:
# fetch the pretrained experts for CLOTHO
python3 misc/sync_experts.py --dataset CLOTHO
# Train MMT on CLOTHO
python -m mmt/train --config mmt/configs/clotho/Clotho_mmt.json
AudioCaps
These are the retrieval results obtained for the AudioCaps dataset when using only audio experts:
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish | t2v | 18.5(0.3) | 47.4(0.1) | 62.0(0.5) | 89.3(0.3) | 6.0(0.0) | 22.7(0.3) | 37.9(0.1) | 7.39M | config, model |
CE - VGGish | v2t | 20.7(1.8) | 48.6(0.7) | 62.9(0.4) | 86.9(0.2) | 6.0(0.0) | 25.4(1.3) | 39.8(1.3) | 7.39M | config, model |
CE - VGGSound | t2v | 22.4(0.3) | 53.9(1.2) | 69.2(0.9) | 91.4(1.6) | 5.0(0.0) | 19.9(3.4) | 43.7(0.5) | 12.12M | config, model |
CE - VGGSound | v2t | 27.0(0.9) | 57.8(0.3) | 72.5(0.7) | 92.6(0.3) | 4.0(0.0) | 17.5(1.8) | 48.3(0.7) | 12.12M | config, model |
CE - VGGish + VGGSound | t2v | 23.6(0.6) | 56.2(0.5) | 71.4(0.5) | 92.3(1.5) | 4.0(0.0) | 18.3(3.0) | 45.6(0.5) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 27.6(1.0) | 60.5(0.7) | 74.7(0.8) | 94.2(0.4) | 4.0(0.0) | 14.7(1.4) | 50.0(0.6) | 21.86M | config, model |
MoEE - VGGish + VGGSound | t2v | 23.0(0.7) | 55.7(0.3) | 71.0(1.2) | 93.0(0.3) | 4.0(0.0) | 16.3(0.5) | 45.0(0.8) | 8.90M | config, model |
MoEE - VGGish + VGGSound | v2t | 26.6(0.7) | 59.3(1.4) | 73.5(1.1) | 94.0(0.5) | 4.0(0.0) | 15.6(0.8) | 48.8(0.8) | 8.90M | config, model |
MMT - VGGish + VGGSound | t2v | 36.1(3.3) | 72.0(2.9) | 84.5(2.0) | 97.6(0.4) | 2.3(0.6) | 7.5(1.3) | 60.3(2.8) | 127.08M | config, model |
MMT - VGGish + VGGSound | v2t | 39.6(0.2) | 76.8(0.9) | 86.7(1.8) | 98.2(0.4) | 2.0(0.0) | 6.5(0.5) | 64.1(0.5) | 127.08M | config, model |
Using only visual experts for AudioCaps:
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - Scene | t2v | 6.0(0.0) | 22.9(0.5) | 35.6(0.8) | 70.4(0.6) | 19.0(0.0) | 69.1(4.6) | 16.9(0.3) | 7.51M | config, model |
CE - Scene | v2t | 6.8(0.6) | 22.1(0.9) | 31.9(1.3) | 62.9(0.3) | 26.3(1.4) | 121.3(6.8) | 16.9(0.8) | 7.51M | config, model |
CE - R2P1D | t2v | 8.1(0.4) | 30.0(0.4) | 45.8(0.2) | 77.2(0.9) | 12.5(0.5) | 56.6(4.6) | 22.3(0.5) | 6.21M | config, model |
CE - R2P1D | v2t | 10.7(0.1) | 30.4(1.5) | 43.4(1.9) | 75.0(1.0) | 14.3(1.2) | 78.2(1.6) | 24.2(0.7) | 6.21M | config, model |
CE - Inst | t2v | 8.2(0.3) | 29.7(0.5) | 46.2(0.5) | 79.2(1.3) | 12.0(0.0) | 50.4(7.3) | 22.4(0.4) | 7.38M | config, model |
CE - Inst | v2t | 10.1(0.8) | 28.0(1.4) | 41.3(0.6) | 75.8(0.7) | 15.0(1.0) | 85.8(2.4) | 22.7(0.9) | 7.38M | config, model |
CE - Scene + R2P1D | t2v | 8.6(0.1) | 30.9(0.0) | 47.4(0.2) | 79.1(0.8) | 11.3(0.6) | 51.2(3.4) | 23.3(0.0) | 16.07M | config, model |
CE - Scene + R2P1D | v2t | 11.6(0.4) | 31.5(0.9) | 43.5(0.8) | 75.8(0.4) | 14.8(0.8) | 69.9(2.6) | 25.1(0.3) | 16.07M | config, model |
CE - Scene + Inst | t2v | 8.2(0.3) | 30.4(0.3) | 47.1(0.2) | 78.9(1.8) | 12.0(0.0) | 51.7(8.8) | 22.7(0.3) | 17.25M | config, model |
CE - Scene + Inst | v2t | 10.2(1.2) | 29.0(1.5) | 41.5(1.3) | 74.5(0.2) | 15.7(0.6) | 83.8(2.9) | 23.0(0.6) | 17.25M | config, model |
CE - R2P1D + Inst | t2v | 9.5(0.6) | 33.0(1.0) | 50.0(0.5) | 81.1(0.9) | 10.3(0.6) | 45.9(3.8) | 25.0(0.8) | 15.95M | config, model |
CE - R2P1D + Inst | v2t | 11.2(0.1) | 31.3(1.5) | 45.2(1.9) | 77.4(0.7) | 13.0(1.0) | 68.5(0.7) | 25.1(0.8) | 15.95M | config, model |
Visual and audio experts for AudioCaps:
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - R2P1D + Inst + VGGish | t2v | 24.5(0.8) | 59.0(0.6) | 74.9(1.0) | 94.5(0.7) | 4.0(0.0) | 14.3(1.2) | 47.6(0.7) | 23.32M | config, model |
CE - R2P1D + Inst + VGGish | v2t | 31.0(2.2) | 64.5(1.0) | 78.8(1.2) | 95.5(0.1) | 3.0(0.0) | 11.4(0.9) | 54.0(1.8) | 23.32M | config, model |
CE - R2P1D + Inst + VGGSound | t2v | 27.6(0.2) | 63.8(0.6) | 78.0(0.8) | 94.7(0.1) | 3.0(0.0) | 13.4(0.8) | 51.6(0.2) | 28.05M | config, model |
CE - R2P1D + Inst + VGGSound | v2t | 32.7(0.9) | 69.2(1.0) | 82.4(0.4) | 96.8(0.3) | 2.8(0.3) | 9.3(0.2) | 57.1(0.7) | 28.05M | config, model |
CE - R2P1D + Inst + VGGish + VGGSound | t2v | 28.0(0.5) | 65.3(0.7) | 80.4(0.3) | 96.0(0.5) | 3.0(0.0) | 10.8(0.5) | 52.8(0.4) | 35.43M | config, model |
CE - R2P1D + Inst + VGGish + VGGSound | v2t | 35.8(0.6) | 70.2(1.6) | 83.3(0.6) | 98.3(0.4) | 2.0(0.0) | 7.8(0.5) | 59.4(0.4) | 35.43M | config, model |
CLOTHO
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish | t2v | 4.0(0.2) | 15.0(0.9) | 25.4(0.5) | 61.4(1.1) | 31.7(1.5) | 78.2(2.2) | 11.5(0.5) | 7.39M | config, model |
CE - VGGish | v2t | 4.8(0.4) | 15.9(1.8) | 25.8(1.7) | 57.5(2.5) | 35.7(2.5) | 106.6(5.7) | 12.5(1.0) | 7.39M | config, model |
CE - VGGish + VGGSound | t2v | 6.7(0.4) | 21.6(0.6) | 33.2(0.3) | 69.8(0.3) | 22.3(0.6) | 58.3(1.1) | 16.9(0.2) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 7.0(0.3) | 22.7(0.6) | 34.6(0.5) | 67.9(2.3) | 21.3(0.6) | 72.6(3.4) | 17.7(0.3) | 21.86M | config, model |
MoEE - VGGish + VGGSound | t2v | 6.0(0.1) | 20.8(0.7) | 32.3(0.3) | 68.5(0.5) | 23.0(0.0) | 60.2(0.8) | 16.0(0.3) | 8.90M | config, model |
MoEE - VGGish + VGGSound | v2t | 7.2(0.5) | 22.1(0.7) | 33.2(1.1) | 67.4(0.3) | 22.7(0.6) | 71.8(2.3) | 17.4(0.7) | 8.90M | config, model |
MMT - VGGish + VGGSound | t2v | 6.5(0.6) | 21.6(0.7) | 32.8(2.1) | 66.9(2.0) | 23.0(2.6) | 67.7(3.1) | 16.6(1.1) | 127.08M | config, model |
MMT - VGGish + VGGSound | v2t | 6.3(0.5) | 22.8(1.7) | 33.3(2.2) | 67.8(1.5) | 22.3(1.5) | 67.3(2.9) | 16.8(1.0) | 127.08M | config, model |
SoundDescs
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish | t2v | 25.4(0.6) | 53.3(0.3) | 64.1(0.3) | 81.7(0.4) | 4.7(0.6) | 83.7(1.9) | 44.3(0.3) | 7.39M | config, model |
CE - VGGish | v2t | 24.2(0.3) | 52.3(0.3) | 62.5(0.2) | 80.9(0.3) | 5.0(0.0) | 83.6(1.1) | 42.9(0.3) | 7.39M | config, model |
CE - VGGish + VGGSound | t2v | 31.1(0.2) | 60.6(0.7) | 70.8(0.5) | 86.0(0.2) | 3.0(0.0) | 63.6(2.2) | 51.1(0.4) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 30.8(0.8) | 60.3(0.3) | 69.5(0.1) | 85.4(0.2) | 3.0(0.0) | 63.2(0.6) | 50.5(0.4) | 21.86M | config, model |
MoEE - VGGish + VGGSound | t2v | 30.8(0.7) | 60.8(0.3) | 70.9(0.5) | 85.9(0.6) | 3.0(0.0) | 62.0(3.8) | 51.0(0.6) | 8.90M | config, model |
MoEE - VGGish + VGGSound | v2t | 30.9(0.3) | 60.3(0.4) | 70.1(0.3) | 85.3(0.6) | 3.0(0.0) | 61.5(3.2) | 50.7(0.3) | 8.90M | config, model |
MMT - VGGish + VGGSound | t2v | 30.7(0.4) | 61.8(1.0) | 72.2(0.8) | 88.8(0.4) | 3.0(0.0) | 34.0(0.6) | 51.5(0.5) | 127.08M | config, model |
MMT - VGGish + VGGSound | v2t | 31.4(0.8) | 63.2(0.7) | 73.4(0.5) | 89.0(0.3) | 3.0(0.0) | 32.5(0.4) | 52.6(0.7) | 127.08M | config, model |
Pretraining on SoundDescs, finetuning on AudioCaps
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish + VGGSound | t2v | 23.3(0.7) | 52.2(0.1) | 63.9(0.5) | 84.3(0.3) | 5.0(0.0) | 59.9(1.6) | 42.7(0.5) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 22.2(0.4) | 51.7(0.3) | 63.3(0.3) | 83.8(0.4) | 5.0(0.0) | 59.2(0.5) | 41.7(0.2) | 21.86M | config, model |
Pretraining on AudioCaps, finetuning on CLOTHO
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish + VGGSound | t2v | 9.1(0.3) | 27.4(0.1) | 39.7(0.4) | 75.0(0.4) | 17.0(0.0) | 48.6(0.7) | 21.5(0.1) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 11.1(1.1) | 26.9(0.7) | 39.6(1.1) | 73.7(0.6) | 16.3(0.6) | 57.4(1.8) | 22.8(1.2) | 21.86M | config, model |
Pretraining on SoundDescs, finetuning on CLOTHO
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish + VGGSound | t2v | 6.4(0.5) | 21.1(1.2) | 32.5(1.7) | 69.3(1.4) | 22.7(1.5) | 57.6(2.3) | 16.3(1.0) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 6.1(0.7) | 20.1(1.7) | 31.4(1.8) | 65.9(2.0) | 24.7(1.5) | 78.1(5.3) | 15.7(1.3) | 21.86M | config, model |
Pretraining on AudioCaps, finetuning on SoundDescs
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish + VGGSound | t2v | 23.3(0.7) | 52.2(0.1) | 63.9(0.5) | 84.3(0.3) | 5.0(0.0) | 59.9(1.6) | 42.7(0.5) | 21.86M | config, model |
CE - VGGish + VGGSound | v2t | 22.2(0.4) | 51.7(0.3) | 63.3(0.3) | 83.8(0.4) | 5.0(0.0) | 59.2(1.3) | 41.7(0.2) | 21.86M | config, model |
Visual centric datasets
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | params | Links |
---|---|---|---|---|---|---|---|---|---|---|
CE - VGGish QuerYD | t2v | 3.7(0.2) | 11.7(0.4) | 17.3(0.6) | 36.3(0.3) | 115.5(5.2) | 273.5(6.7) | 9.1(0.0) | 7.39M | config, model |
CE - VGGish QuerYD | v2t | 3.8(0.2) | 11.5(0.4) | 16.8(0.2) | 35.2(0.4) | 116.3(2.1) | 271.9(5.8) | 9.0(0.2) | 7.39M | config, model |
CE - VGGish Activity-Net | t2v | 1.4(0.1) | 5.0(0.1) | 8.5(0.2) | 22.1(0.9) | 312.0(25.6) | 765.6(35.8) | 3.9(0.1) | 7.39M | config, model |
CE - VGGish Activity-Net | v2t | 1.1(0.1) | 4.5(0.1) | 7.9(0.0) | 21.6(0.8) | 306.3(27.1) | 781.7(30.6) | 3.4(0.1) | 7.39M | config, model |
More information can be found at our project page: https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/
References
If you find this code useful, please consider citing [1,2,3,4].
[1]
@inproceedings{Koepke2021,
author = {Koepke, A.S. and Oncescu, A.-M. and Henriques, J. and Akata, Z. and Albanie, S.},
title = {Audio Retrieval with Natural Language Queries: A Benchmark Study},
booktitle = {arXiv preprint arXiv:2112.09418},
year = {2021}
}
[2]
@inproceedings{Oncescu21a,
author = {Oncescu, A.-M. and Koepke, A.S. and Henriques, J. and Akata, Z. and Albanie, S.},
title = {Audio Retrieval with Natural Language Queries},
booktitle = {INTERSPEECH},
year = {2021}
}
[3]
@inproceedings{Liu2019a,
author = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
title = {Use What You Have: Video retrieval using representations from collaborative experts},
booktitle = {British Machine Vision Conference (BMVC)},
year = {2019},
}
[4]
@inproceedings{gabeur2020mmt,
author = {Gabeur, V. and Sun, C. and Alahari, K. and Schmid, C.},
title = {Multi-modal Transformer for Video Retrieval},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2020}
}