Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution



Figure: Example visualization of the method and baseline as a spectrogram

This is the implementation of our project for the course "Deep Learning: Architectures and Methods" by Prof. Kristian Kersting from the Artificial Intelligence and Machine Learning Lab at the Technical University of Darmstadt in the summer semester 2021.

Super-resolution is an important topic in audio signal processing: the goal is to reconstruct high-quality audio from low-quality signals. From a practical perspective, the technique has applications in telephony and, more generally, wherever audio is transmitted and has to be compressed accordingly. Another application is the restoration of old recordings, for example old sound recordings of music, speech, or videos. Early approaches that combine machine learning with audio signal processing have shown promising results and outperform standard techniques. Accordingly, the scope of this project was to reimplement the paper Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulation by Birnbaum et al. in PyTorch, reproduce its results, and extend them to the music domain.

This repository contains everything needed to prepare the data sets, train the model and create final evaluation and visualization of the results. We also provide the weights of the models to reproduce our reported results.


This project was originally developed with Python 3.8, PyTorch 1.7, and CUDA 11.0. The training requires at least one NVIDIA GeForce GTX 980 (4GB memory).

  • Create conda environment:
conda create --name audiosr python=3.8
conda activate audiosr
conda install pytorch torchvision cudatoolkit=11.0 -c pytorch
  • Install the dependencies:
pip install -r requirements.txt

Dataset preparation

To reproduce the results shown below, the datasets have to be prepared first. This repo includes scripts to prepare the following datasets:

VCTK preparation

  • run from ./datasets to create a h5 container of a specified input.
  • to reproduce results prepare the following h5 files:
python \
  --file-list vctk/speaker1/speaker1-train-files.txt \
  --in-dir ./VCTK-Corpus/wav48/p225/ \
  --out vctk-speaker1-train.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate
python \
  --file-list vctk/speaker1/speaker1-val-files.txt \
  --in-dir ./VCTK-Corpus/wav48/p225/ \
  --out vctk-speaker1-val.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate

GTZAN preparation

  • run from ./datasets to create a h5 container of a specified input.
  • to reproduce results prepare the following h5 files:
python \
  --file-list gtzan/blues_wav_list_train.txt \
  --in-dir gtzan/data/genres/blues/ \
  --out blues-train.4.22000.8192.16384.h5 \
  --scale 4 \
  --sr 22000 \
  --dimension 8192 \
  --stride 16384 \
  --interpolate
python \
  --file-list gtzan/blues_wav_list_val.txt \
  --in-dir gtzan/data/genres/blues/ \
  --out blues-val.4.22000.8192.16384.h5 \
  --scale 4 \
  --sr 22000 \
  --dimension 8192 \
  --stride 16384 \
  --interpolate

Piano dataset preparation

python \
  --file-list data/music_train.npy \
  --out piano-train.4.16000.8192.131072.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 131072 \
  --interpolate
python \
  --file-list data/music_valid.npy \
  --out piano-val.4.16000.8192.131072.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 131072 \
  --interpolate


  • The --in-dir argument has to be adapted to the respective dataset location.
  • The --dimension parameter and the sampling rate (--sr) together define the length of a patch in seconds (patch length = dimension / sr).
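As a rough illustration of the patch-length relationship, the sketch below reads the parameters back from an output file name and computes the patch duration. It assumes that the file names follow the pattern name.scale.sr.dimension.stride.h5, which is inferred from the example commands above, and that the stride is the step in samples between the starts of consecutive patches. The names PatchConfig, parse_h5_name, patch_seconds, and hop_seconds are hypothetical helpers and are not part of the repository.

```python
from typing import NamedTuple


class PatchConfig(NamedTuple):
    """Parameters encoded in an h5 file name (assumed naming pattern)."""
    name: str
    scale: int
    sr: int
    dimension: int
    stride: int


def parse_h5_name(filename: str) -> PatchConfig:
    """Split '<name>.<scale>.<sr>.<dimension>.<stride>.h5' into its fields."""
    stem, ext = filename.rsplit(".", 1)
    if ext != "h5":
        raise ValueError(f"expected an .h5 file, got {filename!r}")
    # The last four dot-separated fields are the numeric parameters.
    name, scale, sr, dimension, stride = stem.rsplit(".", 4)
    return PatchConfig(name, int(scale), int(sr), int(dimension), int(stride))


def patch_seconds(cfg: PatchConfig) -> float:
    """Length of one patch in seconds: dimension / sr."""
    return cfg.dimension / cfg.sr


def hop_seconds(cfg: PatchConfig) -> float:
    """Assumed step between patch starts in seconds: stride / sr."""
    return cfg.stride / cfg.sr


if __name__ == "__main__":
    for fname in (
        "vctk-speaker1-train.4.16000.8192.4096.h5",
        "blues-train.4.22000.8192.16384.h5",
        "piano-train.4.16000.8192.131072.h5",
    ):
        cfg = parse_h5_name(fname)
        print(f"{cfg.name}: patch {patch_seconds(cfg):.3f} s, "
              f"hop {hop_seconds(cfg):.3f} s")
```

Under these assumptions, the VCTK settings give patches of 0.512 s taken every 0.256 s, while the GTZAN and piano settings use a stride larger than the patch, so consecutive patches would not overlap.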


Generally, there are three main models in this implementation.

The first is the b-spline interpolation, which serves as the baseline and is implemented in the data loader.

The other two are neural networks whose implementations can be found in the /models/ folder. As a first step, we implemented a model that uses a batch normalization layer in place of the TFiLM layer that was added later. The final model, which is also the one used in the paper, uses the TFiLM layer.

Train Model

To run the training, use the following command and change the dataset root to the corresponding domain.

python \
  --dataset-root path/to/dataset \
  --epochs 50 \
  --lr 3e-4 \
  --batch-size 16


Save examples from inference

Any given wav file can be evaluated with the script by passing the --save-example flag; the results are saved as wav files and spectrogram plots. The script performs the following steps:

  • prepares all files in a provided list (--wav-file-list) and creates a low-res version and the baseline reconstruction
  • runs inference on the prepared files to create a super-resolution output
  • saves all results to the "examples" folder with the respective file names
  • saves spectrogram plots of all versions as pdf-files


It is important to adapt the sampling rate parameter (--sr), which is set to 16000 by default, so that it matches the sampling rate of the original wav file. The scale (--scale) defines the downsampling factor and is set to 4 by default. Depending on which trained model is used for inference, the parameters --checkpoints-root and --checkpoint have to be set accordingly.
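To make the roles of --sr and --scale concrete, the sketch below computes the sampling rate and the highest representable frequency (Nyquist limit) of a signal that has been downsampled by the given factor. The function low_res_properties is a hypothetical helper written for illustration and is not part of the repository.

```python
def low_res_properties(sr: int, scale: int) -> dict:
    """Return the rate and Nyquist limits for a signal downsampled by `scale`."""
    if sr <= 0 or scale <= 0:
        raise ValueError("sr and scale must be positive")
    low_sr = sr / scale
    return {
        "low_res_sr_hz": low_sr,             # sampling rate after downsampling
        "low_res_nyquist_hz": low_sr / 2,    # highest frequency kept in the input
        "high_res_nyquist_hz": sr / 2,       # highest frequency in the original
    }


if __name__ == "__main__":
    # Default settings from the text: --sr 16000, --scale 4
    for key, value in low_res_properties(16000, 4).items():
        print(f"{key}: {value:.0f}")
```

With the default values, the low-resolution input only contains frequencies up to 2 kHz, so the content between 2 kHz and 8 kHz is what the model has to reconstruct.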

To reproduce an example from our plot run the following command from the repo root directory (modify --checkpoints-root if necessary):

python \
  --save-example \
  --wav-file-list assets/save_wav_list.txt \
  --scale 4 \
  --sr 16000 \
  --checkpoint pretrained/vctk_speaker1_pretrained.pth


| Training Dataset | Ratio | Baseline SNR (dB) | Baseline LSD (dB) | Method SNR (dB) | Method LSD (dB) | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- |
| VCTK SingleSpeaker | r = 4 | 15.6 | 5.4 | 16.6 | 3.2 | Checkpoint |
| Piano | r = 4 | 19.7 | 2.9 | 20.4 | 2.2 | Checkpoint |
| GTZAN (Genre: Blues) | r = 4 | 13.3 | 7.8 | 13.8 | 3.8 | Checkpoint |
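The two metrics are not defined in this README. Assuming they follow the definitions commonly used in the audio super-resolution literature, with y the high-resolution reference signal, x the reconstruction, and X and X̂ the log-power spectrograms of y and x over L frames and K frequency bins, they can be written as:

```latex
\mathrm{SNR}(x, y) = 10 \log_{10} \frac{\lVert y \rVert_2^2}{\lVert x - y \rVert_2^2}

\mathrm{LSD}(x, y) = \frac{1}{L} \sum_{\ell=1}^{L}
  \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( X(\ell, k) - \hat{X}(\ell, k) \right)^2},
\qquad X = \log \lvert S_y \rvert^2, \quad \hat{X} = \log \lvert S_x \rvert^2
```

Here S_y and S_x denote the short-time Fourier transforms of the two signals. Under these definitions a higher SNR and a lower LSD indicate a better reconstruction.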

Qualitative Examples

Here we provide a qualitative example per Dataset. These can be generated using

Examples (Low Resolution, Baseline, Method, High Resolution) are provided for each of VCTK SingleSpeaker, Piano, and GTZAN (Genre: Blues).