SelfRemaster: Self-Supervised Speech Restoration
Official implementation of SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling
Demo
- Audio samples
- Audio effect transfer with Gradio + HuggingFace Spaces
🤗
Setup
- Clone this repository:
git clone https://github.com/Takaaki-Saeki/ssl_speech_restoration.git
- CD into this repository:
cd ssl_speech_restoration
- Install python packages and download some pretrained models:
./setup.sh
Getting started
- If you use default Japanese corpora
- Download JSUT Basic5000 and JVS Corpus
- Downsample them to 22.05 kHz and Place them under
data/
asjsut_22k
andjvs_22k
- Place simulated low-quality data under
./data
asjsut_22k-low
andjvs_22k-low
- Or you can use arbitrary datasets by modifying config files
Training
You can choose MelSpec
or SourFilter
models with --config_path
option.
As shown in the paper, MelSpec
model is of higher-quality.
Firstly you need to split the data to train/val/test and dump them by the following command.
python preprocess.py --config_path configs/train/${feature}/ssl_jsut.yaml
To perform self-supervised learning with dual learning, run the following command.
python train.py \
--config_path configs/train/${feature}/ssl_jsut.yaml \
--stage ssl-dual \
--run_name ssl_melspec_dual
For other options, refer to train.py
.
Speech restoration
To perform speech restoration of the test data, run the following command.
python eval.py \
--config_path configs/test/${feature}/ssl_jsut.yaml \
--ckpt_path ${path to checkpoint} \
--stage ssl-dual \
--run_name ssl_melspec_dual
For other options, see eval.py
.
Audio effect transfer
You can run a simple audio effect transfer demo using a model pretrained with real data.
Run the following command.
python aet_demo.py
Or you can customize the dataset or model.
You need to edit audio_effect_transfer.yaml
and run the following command.
python aet.py \
--config_path configs/test/melspec/audio_effect_transfer.yaml \
--stage ssl-dual \
--run_name aet_melspec_dual
For other options, see aet.py
.
Pretrained models
See here.
Reproducing results
You can generate simulated low-quality data as in the paper with the following command.
python simulated_data.py \
--in_dir ${input_directory (e.g., path to jsut_22k)} \
--output_dir ${output_directory (e.g., path to jsut_22k-low)} \
--corpus_type ${single-speaker corpus or multi-speaker corpus} \
--deg_type lowpass
Then download the pretrained model correspond to the deg_type and run the following command.
python eval.py \
--config_path configs/train/${feature}/ssl_jsut.yaml \
--ckpt_path ${path to checkpoint} \
--stage ssl-dual \
--run_name ssl_melspec_dual
Citation
@article{saeki22selfremaster,
title={{SelfRemaster}: {S}elf-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling},
author={T. Saeki and S. Takamichi and T. Nakamura and N. Tanji and H. Saruwatari},
journal={arXiv preprint arXiv:2203.12937},
year={2022}
}