Who calls the shots? Rethinking Few-Shot Learning for Audio (WASPAA 2021)



This repo contains the source code for the paper "Who calls the shots? Rethinking Few-Shot Learning for Audio." (WASPAA 2021)



Models in this work are trained on FSD-MIX-CLIPS, an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this (intermediate) dataset of 10-second soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1-second clips centered on each sound event in the soundscapes in FSD-MIX-SED to produce FSD-MIX-CLIPS.

Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material (a subset of FSD50K) and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED using Scaper. All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding start time (in seconds) of each 1-second clip.
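As a minimal sketch of how such an annotation could be used, the snippet below cuts a 1-second clip out of a 10-second soundscape given its start time. The function name `extract_clip` and the 44.1 kHz sample rate are assumptions made for illustration; they are not taken from the repository, and the actual annotation field names may differ.

```python
def extract_clip(samples, start_time, sample_rate=44100, clip_duration=1.0):
    """Return the samples of one clip starting at start_time (in seconds).

    `samples` is any sliceable sequence of audio samples (list, NumPy array).
    The 44.1 kHz default is an assumption; use the rate of the generated audio.
    """
    start = int(round(start_time * sample_rate))
    stop = start + int(round(clip_duration * sample_rate))
    if start < 0 or stop > len(samples):
        raise ValueError("clip falls outside the soundscape")
    return samples[start:stop]


# A silent 10-second "soundscape" stands in for a file from FSD_MIX_SED.audio.
soundscape = [0.0] * (10 * 44100)
clip = extract_clip(soundscape, start_time=3.5)
print(len(clip))  # 44100 samples, i.e. 1 second at 44.1 kHz
```

Slicing on the fly like this is what makes it unnecessary to store the 1-second clips as separate audio files.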

To reproduce FSD-MIX-SED:

  1. Download all files from Zenodo.
  2. Extract the .tar.gz files. You will get:
  • FSD_MIX_SED.annotations: 281,039 annotation files, 35GB
  • FSD_MIX_SED.source: 10,296 single-labeled audio clips, 1.9GB
  • FSD_MIX_CLIPS.annotations: 5 annotation files, one for each class/data split
  • vocab.json: 89 classes. In the following experiments, each class is labeled by its index in this list. 0-58: base, 59-73: novel-val, 74-88: novel-test.
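The index ranges above can be written down as a small lookup. This is only an illustrative sketch: the names `SPLITS` and `split_of` are hypothetical and do not come from the repository.

```python
# Class-index ranges as stated for vocab.json (upper bounds are exclusive here).
SPLITS = {
    "base": range(0, 59),         # indices 0-58
    "novel-val": range(59, 74),   # indices 59-73
    "novel-test": range(74, 89),  # indices 74-88
}


def split_of(class_index):
    """Return the name of the split that a class index belongs to."""
    for name, indices in SPLITS.items():
        if class_index in indices:
            return name
    raise ValueError(f"class index {class_index} is outside 0-88")


print(split_of(58))  # base
print(split_of(59))  # novel-val
print(split_of(88))  # novel-test
```

The three ranges contain 59, 15, and 15 classes, which matches the `--n_base 59` and `--n_novel 15` arguments used in the evaluation commands.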

We will use FSD_MIX_SED.annotations and FSD_MIX_SED.source to reproduce the audio data in FSD-MIX-SED, and use that audio with FSD_MIX_CLIPS.annotations for the following training and evaluation.

  3. Install Scaper.
  4. Generate soundscapes from the JAMS files by running the command below. Set annpath and audiopath to the extracted folders, and savepath to the path where the output audio files should be saved.
python ./data/generate_soundscapes.py \
--annpath PATH-TO-FSD_MIX_SED.annotations \
--audiopath PATH-TO-FSD_MIX_SED.source \
--savepath PATH_TO_SAVE_OUTPUT

Note that this will generate 281,039 audio files, with a total size of ~450GB, in the folder FSD_MIX_SED.audio under the given savepath.
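Given the output size, it may be worth checking free disk space before starting the generation. The helper below is a hypothetical convenience, not part of the repository; it only relies on the ~450GB figure quoted above.

```python
import shutil

REQUIRED_BYTES = 450 * 10**9  # ~450GB, the total size quoted above
NUM_SOUNDSCAPES = 281039


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Rough average size per generated soundscape, in megabytes (about 1.6 MB).
average_mb = REQUIRED_BYTES / NUM_SOUNDSCAPES / 10**6
print(f"average file size: {average_mb:.1f} MB")
print("enough space here:", has_enough_space("."))
```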

If you want to get the foreground material (FSD_MIX_SED.source) directly from FSD50K instead of downloading it, run

python ./data/preprocess_foreground_sounds.py \
--fsdpath PATH-TO-FSD50K \


We provide source code to train the best performing embedding model (pretrained OpenL3 + FC) and three different few-shot methods to predict both base and novel class data.


Once audio files are reproduced, we pre-compute OpenL3 embeddings of clips in FSD-MIX-CLIPS and save them.

  1. Install OpenL3
  2. Set paths of the downloaded FSD_MIX_CLIPS.annotations and generated FSD_MIX_SED.audio, and run
python get_openl3emb_and_filelist.py \
--annpath PATH-TO-FSD_MIX_CLIPS.annotations \
--audiopath PATH-TO-FSD_MIX_SED.audio \
--savepath PATH_TO_SAVE_OUTPUT

This generates 614,533 .pkl files, each containing one embedding. A set of filelists will also be saved under the current folder.
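A saved embedding can presumably be read back with the standard `pickle` module. The sketch below assumes each .pkl file holds a single picklable object; the exact layout of that object is not described here, so a dummy vector is written first to keep the example self-contained. `load_embedding` is a hypothetical helper name.

```python
import pickle
import tempfile
from pathlib import Path


def load_embedding(pkl_path):
    """Load the object stored in one .pkl file (assumed to be an embedding)."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


# Round trip with a dummy 3-dimensional vector standing in for an embedding.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pkl"
    with open(path, "wb") as f:
        pickle.dump([0.1, 0.2, 0.3], f)
    emb = load_embedding(path)

print(len(emb))
```

Only unpickle files you generated yourself, since `pickle.load` can execute arbitrary code from untrusted input.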


Create a conda environment from the environment.yml file and activate it.

Note that you only need the environment if you want to train/evaluate the models. To reproduce the dataset, follow the dataset steps above instead.

conda env create -f environment.yml
conda activate dfsl


  • Training configuration can be specified using config files in ./config
  • Model checkpoints will be saved in the folder ./experiments, and TensorBoard data will be saved in the folder ./run

1. Base classifier

First, to train the base classifier on base classes, run

python train.py --config openl3CosineClassifier --openl3
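The config name suggests a cosine-similarity classifier on top of the embeddings. As a rough sketch of that general technique (not the repository's implementation), class scores can be computed as scaled cosine similarities between an embedding and one weight vector per class. The function names and the `scale` value are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(embedding, class_weights, scale=10.0):
    """Return one scaled cosine-similarity score per class weight vector."""
    e = l2_normalize(embedding)
    scores = []
    for w in class_weights:
        w = l2_normalize(w)
        scores.append(scale * sum(a * b for a, b in zip(e, w)))
    return scores


# Toy 2-dimensional example with three classes.
print(cosine_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]))
# [10.0, 0.0, -10.0]
```

Because both sides are normalized, only the direction of a weight vector matters, which is what allows weights for new classes to be added later without rescaling the existing ones.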

2. Few-shot weight generator for DFSL

Once the base model is trained, we can train the few-shot weight generator for DFSL by running

python train.py --config openl3CosineClassifierGenWeightAttN5 --openl3

By default, DFSL is trained with n=5 support examples. To train DFSL with a different n, run

# n=10
python train.py --config openl3CosineClassifierGenWeightAttN10 --openl3

# n=20
python train.py --config openl3CosineClassifierGenWeightAttN20 --openl3

# n=30
python train.py --config openl3CosineClassifierGenWeightAttN30 --openl3
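The four DFSL commands differ only in the suffix of the config name, so they can be generated in a loop. This is a small convenience sketch; `dfsl_train_command` is a hypothetical helper, and only the four values of n shown above are assumed to have config files.

```python
SUPPORTED_N = (5, 10, 20, 30)  # values of n with a config shown above


def dfsl_train_command(n):
    """Build the training command for DFSL with n support examples."""
    if n not in SUPPORTED_N:
        raise ValueError(f"no config shown for n={n}")
    config = f"openl3CosineClassifierGenWeightAttN{n}"
    return ["python", "train.py", "--config", config, "--openl3"]


for n in SUPPORTED_N:
    print(" ".join(dfsl_train_command(n)))
```

The returned list can be passed directly to `subprocess.run` to launch the runs one after another.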


We evaluate the trained models on test data from both base and novel classes. For each novel class, we need to sample a support set. Run the command below to split the original filelist for test classes into test_support_filelist.pkl and test_query_filelist.pkl.

python get_test_support_and_query.py
  • Here we consider monophonic support examples with mixed (random) SNR. Code to run evaluation with polyphonic support examples at specific low/high SNRs will be released soon.
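The idea of a support/query split can be sketched as follows. This is a generic illustration and not the logic of get_test_support_and_query.py: it assumes each file has a single class label, ignores the monophonic and SNR criteria, and uses hypothetical names throughout.

```python
import random
from collections import defaultdict


def split_support_query(filelist, n_support, seed=0):
    """Split (filename, class_index) pairs into support and query lists.

    For every class, n_support files are drawn at random as the support
    pool; all remaining files of that class become query examples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for fname, label in filelist:
        by_class[label].append(fname)

    support, query = [], []
    for label, fnames in sorted(by_class.items()):
        if len(fnames) < n_support:
            raise ValueError(f"class {label} has fewer than {n_support} files")
        picked = set(rng.sample(range(len(fnames)), n_support))
        for i, fname in enumerate(fnames):
            (support if i in picked else query).append((fname, label))
    return support, query


# Two novel-test classes (indices 74 and 75) with 10 dummy files each.
files = [(f"clip_{c}_{i}.pkl", c) for c in (74, 75) for i in range(10)]
support, query = split_support_query(files, n_support=3)
print(len(support), len(query))  # 6 14
```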

For evaluation, we compute features for both base and novel test data, then make predictions and compute metrics in a joint label space. The computed features, model predictions, and metrics will be saved in the folder ./experiments. We consider 3 few-shot methods to predict novel classes. To test different numbers of support examples, set n_pos accordingly in the following commands.

1. Prototype

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifier --openl3

# Get and save model predictions. This is repeated multiple times (niter) to account for the random selection of novel examples.
python pred.py --config=openl3CosineClassifier --openl3 --niter 100 --n_base 59 --n_novel 15 --n_pos 5

# compute and save evaluation metrics based on model prediction
python metrics.py --config=openl3CosineClassifier --openl3 --n_base 59 --n_novel 15 --n_pos 5
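For the Prototype method, a common approach (assumed here, since the details are not spelled out above) is to average the n_pos support embeddings of each novel class and use that mean as the class's weight vector alongside the base class weights. The sketch below illustrates this with toy 2-dimensional vectors; the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def add_novel_prototypes(base_weights, support_sets):
    """Append one prototype per novel class to the base class weights.

    `support_sets` holds one list of support embeddings per novel class.
    The result covers the joint label space (base classes, then novel ones).
    """
    return list(base_weights) + [mean_vector(s) for s in support_sets]


base_weights = [[1.0, 0.0], [0.0, 1.0]]       # 2 toy base classes
support_sets = [[[1.0, 3.0], [3.0, 5.0]]]     # 1 novel class, n_pos = 2
joint = add_novel_prototypes(base_weights, support_sets)
print(joint[-1])   # [2.0, 4.0]
print(len(joint))  # 3
```

With `--n_base 59 --n_novel 15`, the joint label space would contain 74 classes.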


2. DFSL

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifierGenWeightAttN5 --openl3

# Get and save model predictions. This is repeated multiple times (niter) to account for the random selection of novel examples.
python pred.py --config=openl3CosineClassifierGenWeightAttN5 --openl3 --niter 100 --n_base 59 --n_novel 15 --n_pos 5

# compute and save evaluation metrics based on model prediction
python metrics.py --config=openl3CosineClassifierGenWeightAttN5 --openl3 --n_base 59 --n_novel 15 --n_pos 5

3. Logistic regression

Train a binary logistic regression model for each novel class. Note that we need to sample n_neg examples from the base training data as negative examples. The default n_neg is 100. We also ran a hyperparameter search on n_neg using the validation data, with n_pos varying from 5 to 30:

  • n_pos=5, n_neg=100
  • n_pos=10, n_neg=500
  • n_pos=20, n_neg=1000
  • n_pos=30, n_neg=5000
# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifier --openl3

# Train binary logistic regression models, predict test data, and compute metrics
python logistic_regression.py --config=openl3CosineClassifier --openl3 --niter 10 --n_base 59 --n_novel 15 --n_pos 5 --n_neg 100
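The pairing of n_pos and n_neg listed above can be sketched as a lookup used when assembling the training set for one novel class. This is an illustration only and not the code in logistic_regression.py; the names are hypothetical, and the fallback of 100 mirrors the stated default.

```python
import random

# n_neg values from the hyperparameter search listed above, keyed by n_pos.
N_NEG_FOR_N_POS = {5: 100, 10: 500, 20: 1000, 30: 5000}


def build_binary_training_set(positives, base_pool, seed=0):
    """Assemble features and 0/1 labels for one novel class.

    `positives` are the n_pos support embeddings of the novel class;
    negatives are sampled without replacement from base training data.
    """
    n_pos = len(positives)
    n_neg = N_NEG_FOR_N_POS.get(n_pos, 100)  # 100 is the stated default
    rng = random.Random(seed)
    negatives = rng.sample(base_pool, n_neg)
    features = list(positives) + negatives
    labels = [1] * n_pos + [0] * n_neg
    return features, labels


# Toy data: 5 positive vectors and a pool of 200 base-class vectors.
positives = [[1.0, float(i)] for i in range(5)]
base_pool = [[0.0, float(i)] for i in range(200)]
features, labels = build_binary_training_set(positives, base_pool)
print(len(features), sum(labels))  # 105 5
```

The resulting features and labels could then be passed to any binary logistic regression implementation.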


This code is built upon the implementation from FewShotWithoutForgetting.


Please cite our paper if you find the code or dataset useful for your research.

Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021
