A PyTorch Implementation of the paper - Choi, Woosung, et al. "Investigating U-Nets with Various Intermediate Blocks for Spectrogram-based Singing Voice Separation." 21st International Society for Music Information Retrieval Conference, ISMIR. 2020.

Overview

Investigating U-NETS With Various Intermediate Blocks For Spectrogram-based Singing Voice Separation


Installation

conda install pytorch=1.6 cudatoolkit=10.2 -c pytorch
conda install -c conda-forge ffmpeg librosa
conda install -c anaconda jupyter
pip install musdb museval pytorch_lightning effortless_config wandb pydub nltk spacy 
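
Before moving on, a quick sanity check (a minimal sketch, not part of the repo) to confirm that PyTorch sees your GPU:

import torch

print(torch.__version__)          # expect 1.6.x per the install command above
print(torch.cuda.is_available())  # should be True once cudatoolkit 10.2 is set up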

Dataset

  1. Download MUSDB18
  2. Unzip the files
  3. We recommend using wav file mode for faster data preparation:
    musdbconvert path/to/musdb-stems-root path/to/new/musdb-wav-root
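
To double-check the converted dataset from step 3, a minimal sketch using the musdb package (the root path is the one you chose above):

import musdb

mus = musdb.DB(root="path/to/new/musdb-wav-root", is_wav=True, subsets="train")
print(len(mus.tracks))                      # 100 training tracks in MUSDB18
track = mus.tracks[0]
print(track.name, track.audio.shape)        # stereo mixture: (n_samples, 2)
print(track.targets["vocals"].audio.shape)  # matching vocal stem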

Demonstration: A Pretrained Model (TFC_TDF_Net (large))

Colab Link

Tutorial

1. Activate your conda environment

conda activate yourcondaname

2. Train a default U-Net with TFC_TDF blocks

python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --target_name vocals --mode train --gpus 4 --distributed_backend ddp --sync_batchnorm True --pin_memory True --num_workers 32 --precision 16 --run_id debug --optimizer adam --lr 0.001 --save_top_k 3 --patience 100 --min_epochs 1000 --max_epochs 2000 --n_fft 2048 --hop_length 1024 --num_frame 128  --train_loss spec_mse --val_loss raw_l1 --model tfc_tdf_net  --spec_est_mode mapping --spec_type complex --n_blocks 7 --internal_channels 24  --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --min_bn_units 16 --tfc_tdf_activation relu  --first_conv_activation relu --last_activation identity --seed 2020

3. Evaluation

After training is done, checkpoints are saved in the following directory.

etc/modelname/run_id/*.ckpt

For evaluation, run:

python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --target_name vocals --mode eval --gpus 1 --pin_memory True --num_workers 64 --precision 32 --run_id debug --batch_size 4 --n_fft 2048 --hop_length 1024 --num_frame 128 --train_loss spec_mse --val_loss raw_l1 --model tfc_tdf_net --spec_est_mode mapping --spec_type complex --n_blocks 7 --internal_channels 24 --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --min_bn_units 16 --tfc_tdf_activation relu --first_conv_activation relu --last_activation identity --log wandb --ckpt vocals_epoch=891.ckpt

Below are the results.

wandb:          test_result/agg/vocals_SDR 6.954695
wandb:   test_result/agg/accompaniment_SAR 14.3738075
wandb:          test_result/agg/vocals_SIR 15.5527
wandb:   test_result/agg/accompaniment_SDR 13.561705
wandb:   test_result/agg/accompaniment_ISR 22.69328
wandb:   test_result/agg/accompaniment_SIR 18.68421
wandb:          test_result/agg/vocals_SAR 6.77698
wandb:          test_result/agg/vocals_ISR 12.45371

4. Interactive Report (wandb)

wandb report

Intermediate Blocks

Please see this document.

How to use

1. Training

1.1. Intermediate Block-independent Parameters

1.1.A. General Parameters
  • --musdb_root musdb path
  • --musdb_is_wav whether the path contains wav files or not
  • --filed_mode whether to use filed mode or not. We recommend it for faster data preparation.
  • --target_name one of vocals, drums, bass, other
1.1.B. Training Environment
  • --mode train or eval
  • --gpus number of gpus
    • (WARN) gpus > 1 might be problematic when evaluating models.
  • --distributed_backend use this option only when you are using multiple gpus. One of ddp, dp, ...; we recommend ddp.
  • --sync_batchnorm True only when you are using ddp
  • --pin_memory
  • --num_workers
  • --precision 16 or 32
  • --dev_mode whether you want development mode or not. Dev mode is much faster because it uses only a small subset of the dataset.
  • --run_id (optional) directory path where you want to store logs etc. If not set, a timestamp is used.
  • --log True for the default pytorch lightning log. wandb is also available.
  • --seed random seed for a deterministic result.
1.1.C. Training Hyperparameters
  • --batch_size trivial :)
  • --optimizer adam, rmsprop, etc
  • --lr learning rate
  • --save_top_k how many of the top-k checkpoints (by validation loss) to keep
  • --patience early stop control parameter. see pytorch lightning docs.
  • --min_epochs trivial :)
  • --max_epochs trivial :)
  • --model
    • tfc_tdf_net
    • tfc_net
    • tdc_net
1.1.D. Fourier Parameters
  • --n_fft
  • --hop_length
  • --num_frame number of frames (time slices); see the sketch below
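
For intuition, the standard STFT bookkeeping implied by these flags (a sketch assuming 44.1 kHz MUSDB audio, not code from this repo):

n_fft, hop_length, num_frame = 2048, 1024, 128   # tutorial defaults
sr = 44100

freq_bins = n_fft // 2 + 1              # 1025 frequency bins per frame
chunk_samples = hop_length * num_frame  # roughly 131072 samples per chunk
print(freq_bins, chunk_samples / sr)    # about 3 seconds of audio per chunk
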
1.1.F. Criterion
  • --train_loss: spec_mse, raw_l1, etc.
  • --val_loss: spec_mse, raw_l1, etc. (see the sketch below)
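
To make the two loss names concrete, a minimal sketch (an illustration, not the repo's implementation) of spectrogram-domain MSE versus waveform-domain L1:

import torch
import torch.nn.functional as F

# Hypothetical batch: estimated/target spectrograms and waveforms.
est_spec, tgt_spec = torch.randn(4, 4, 1025, 128), torch.randn(4, 4, 1025, 128)
est_wav, tgt_wav = torch.randn(4, 2, 131072), torch.randn(4, 2, 131072)

spec_mse = F.mse_loss(est_spec, tgt_spec)  # --train_loss spec_mse
raw_l1 = F.l1_loss(est_wav, tgt_wav)       # --val_loss raw_l1 (after the iSTFT)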

1.2. U-net Parameters

  • --n_blocks: number of intermediate blocks. must be an odd integer. (default=7)
  • --input_channels: (see the sketch after this list)
    • if you use a two-channeled complex-valued spectrogram, then 4
    • if you use a two-channeled magnitude spectrogram, then 2
  • --internal_channels: number of internal channels (default=24)
  • --first_conv_activation: (default='relu')
  • --last_activation: (default='sigmoid')
  • --t_down_layers: list of layers where you want to double/halve the time resolution. if None, downsampling/upsampling is applied to every single layer. (default=None)
  • --f_down_layers: list of layers where you want to double/halve the frequency resolution. if None, downsampling/upsampling is applied to every single layer. (default=None)
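
The channel counts above follow from stacking the real and imaginary parts of each stereo channel; a sketch (using torch.stft from a recent PyTorch purely for illustration):

import torch

wav = torch.randn(2, 131072)  # (stereo channels, samples)
spec = torch.stft(wav, n_fft=2048, hop_length=1024,
                  window=torch.hann_window(2048), return_complex=True)
cac = torch.view_as_real(spec)              # (2, freq, time, 2: real/imag)
cac = cac.permute(0, 3, 1, 2).reshape(4, spec.shape[1], spec.shape[2])
print(cac.shape)                            # 4 channels -> --input_channels 4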

1.3. SVS Framework

  • --spec_type: type of spectrogram. ['complex', 'magnitude']

  • --spec_est_mode: spectrogram estimation method. ['mapping', 'masking']

  • CaC Framework

    • you can use the CaC framework [1] by setting
      • --spec_type complex --spec_est_mode mapping --last_activation identity
  • Mag-only Framework

    • if you want to use the traditional magnitude-only estimation with sigmoid, then try
      • --spec_type magnitude --spec_est_mode masking --last_activation sigmoid
    • you can also change the last activation as follows
      • --spec_type magnitude --spec_est_mode masking --last_activation relu
  • Alternatives

    • you can build an svs framework with any combination of these parameters
    • e.g. --spec_type complex --spec_est_mode masking --last_activation tanh (see the sketch below)
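
A minimal sketch of the two estimation modes (a single conv layer stands in for the U-Net, purely for illustration):

import torch
import torch.nn as nn

net = nn.Conv2d(4, 4, 3, padding=1)         # stand-in for the U-Net
mixture_spec = torch.randn(1, 4, 1025, 128)

# --spec_est_mode mapping: the activated output IS the source estimate.
est_mapping = nn.Identity()(net(mixture_spec))                 # --last_activation identity

# --spec_est_mode masking: the activated output is a mask on the mixture.
est_masking = torch.sigmoid(net(mixture_spec)) * mixture_spec  # --last_activation sigmoid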

1.4. Block-dependent Parameters

1.4.A. TDF Net
  • --bn_factor: bottleneck factor $bn$ (default=16)
  • --min_bn_units: when the target frequency dimension is too small, this value is used instead of $\frac{f}{bn}$. (default=16; see the sketch after this list)
  • --bias: (default=False)
  • --tdf_activation: activation function of each block (default=relu)
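
A sketch of how the bottleneck width can be derived from these flags (my reading of the description above, not the repo's exact code):

import torch.nn as nn

f, bn_factor, min_bn_units = 1025, 16, 16
hidden = max(f // bn_factor, min_bn_units)  # fall back to min_bn_units for small f

# Time-distributed fully-connected layers act along the frequency axis.
tdf = nn.Sequential(nn.Linear(f, hidden, bias=False), nn.ReLU(),
                    nn.Linear(hidden, f, bias=False), nn.ReLU())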

1.4.B. TDC Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tdc_activation: activation function of each block (default=relu)

1.4.C. TFC Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_t: size of kernel of time-dimension (default=3)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tfc_activation: activation function of each block (default=relu); a simplified sketch follows
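
For intuition, a simplified TFC-style block: a plain stack of 2-D convolutions over the time-frequency grid (the paper's TFC block is densely connected; this sketch omits that):

import torch.nn as nn

def tfc_block(channels=24, n_internal_layers=5, kt=3, kf=3):
    layers = []
    for _ in range(n_internal_layers):
        layers += [nn.Conv2d(channels, channels, (kf, kt), padding=(kf // 2, kt // 2)),
                   nn.BatchNorm2d(channels), nn.ReLU()]
    return nn.Sequential(*layers)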

1.4.D. TFC_TDF Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_t: size of kernel of time-dimension (default=3)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tfc_tdf_activation: activation function of each block (default=relu)
  • --bn_factor: bottleneck factor $bn$ (default=16)
  • --min_bn_units: when the target frequency dimension is too small, this value is used instead of $\frac{f}{bn}$. (default=16)
  • --tfc_tdf_bias: (default=False)

1.4.E. TDC_RNN Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --bn_factor_rnn: (default=16)
  • --num_layers_rnn: (default=1)
  • --bias_rnn: bool, (default=False)
  • --min_bn_units_rnn: (default=16)
  • --bn_factor_tdf: (default=16)
  • --bias_tdf: bool, (default=False)
  • --tdc_rnn_activation: (default='relu')

Known bug: a CUDA error occurs when running the tdc_rnn net with precision 16.

Reproducible Experimental Results

  • TFC_TDF_large
    • parameters
    --musdb_root ../repos/musdb18_wav
    --musdb_is_wav True
    --filed_mode True
    
    --gpus 4
    --distributed_backend ddp
    --sync_batchnorm True
    
    --num_workers 72
    --train_loss spec_mse
    --val_loss raw_l1
    --batch_size 12
    --precision 16
    --pin_memory True
    --save_top_k 3
    --patience 200
    --run_id debug_large
    --log wandb
    --min_epochs 2000
    --max_epochs 3000
    
    --optimizer adam
    --lr 0.001
    
    --model tfc_tdf_net
    --n_fft 4096
    --hop_length 1024
    --num_frame 128
    --spec_type complex
    --spec_est_mode mapping
    --last_activation identity
    --n_blocks 9
    --internal_channels 24
    --n_internal_layers 5
    --kernel_size_t 3 
    --kernel_size_f 3 
    --tfc_tdf_bias True
    --seed 2020
    
    
    • training
    python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --gpus 4 --distributed_backend ddp --sync_batchnorm True --num_workers 72 --train_loss spec_mse --val_loss raw_l1 --batch_size 24 --precision 16 --pin_memory True --save_top_k 3 --patience 200 --run_id debug_large --log wandb --min_epochs 2000 --max_epochs 3000 --optimizer adam --lr 0.001 --model tfc_tdf_net --n_fft 4096 --hop_length 1024 --num_frame 128 --spec_type complex --spec_est_mode mapping --last_activation identity --n_blocks 9 --internal_channels 24 --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --tfc_tdf_bias True --seed 2020
    • evaluation result (epoch 2007)
      • SDR 8.029
      • ISR 13.708
      • SIR 16.409
      • SAR 7.533

Interactive Report (wandb)

wandb report

You can cite this paper as follows:

@inproceedings{choi_2020,
  author    = {Choi, Woosung and Kim, Minseok and Chung, Jaehwa and Lee, Daewon and Jung, Soonyoung},
  booktitle = {21st International Society for Music Information Retrieval Conference},
  editor    = {ISMIR},
  month     = {October},
  title     = {Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation},
  year      = {2020}
}

Reference

[1] Woosung Choi, Minseok Kim, Jaehwa Chung, Daewon Lee, and Soonyoung Jung, "Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation," in 21st International Society for Music Information Retrieval Conference, ISMIR, October 2020.
