SelfRemaster: SSL Speech Restoration

Takaaki Saeki

Last update: Jan 7, 2023

Overview

SelfRemaster: Self-Supervised Speech Restoration

Official implementation of SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

Demo

Audio samples
Audio effect transfer with Gradio + HuggingFace Spaces 🤗

Setup

Clone this repository: git clone https://github.com/Takaaki-Saeki/ssl_speech_restoration.git
Change into the repository directory: cd ssl_speech_restoration
Install the Python packages and download some pretrained models: ./setup.sh

Getting started

If you use the default Japanese corpora:
- Download JSUT Basic5000 and the JVS Corpus.
- Downsample them to 22.05 kHz and place them under data/ as jsut_22k and jvs_22k.
- Place the simulated low-quality data under data/ as jsut_22k-low and jvs_22k-low.
Alternatively, you can use arbitrary datasets by modifying the config files.
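
Before preprocessing, it can help to confirm that the four directories exist and that the clean and low-quality copies line up. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes that every clean .wav file should have a low-quality counterpart at the same relative path.

```python
from pathlib import Path

# Directory names taken from the steps above.
EXPECTED_PAIRS = [("jsut_22k", "jsut_22k-low"), ("jvs_22k", "jvs_22k-low")]


def wav_names(directory):
    """Return the relative paths of all .wav files under `directory`."""
    return {p.relative_to(directory).as_posix() for p in directory.rglob("*.wav")}


def check_layout(data_root):
    """Collect readable problems; an empty list means the layout looks plausible."""
    root = Path(data_root)
    problems = []
    for clean, low in EXPECTED_PAIRS:
        clean_dir, low_dir = root / clean, root / low
        missing = [d for d in (clean_dir, low_dir) if not d.is_dir()]
        if missing:
            problems.extend(f"missing directory: {d}" for d in missing)
            continue
        clean_set, low_set = wav_names(clean_dir), wav_names(low_dir)
        if not clean_set:
            problems.append(f"no .wav files under {clean_dir}")
        # Assumption: clean and degraded files are paired by relative path.
        unpaired = clean_set ^ low_set
        if unpaired:
            problems.append(f"{len(unpaired)} unpaired .wav files between {clean} and {low}")
    return problems
```

Calling check_layout("data") from the repository root returns a list of messages such as "missing directory: data/jvs_22k-low", or an empty list if nothing looks wrong.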

Training

You can choose the MelSpec or SourceFilter model with the --config_path option.
As shown in the paper, the MelSpec model achieves higher quality.

First, split the data into train/val/test sets and dump them with the following command.

python preprocess.py --config_path configs/train/${feature}/ssl_jsut.yaml
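
To illustrate what a train/val/test split involves, here is a generic sketch of a deterministic split over a list of file names. It is not the logic of preprocess.py, which is not shown here; the function name, the seed, and the split sizes are all hypothetical.

```python
import random


def split_files(names, val_size, test_size, seed=0):
    """Shuffle deterministically, then carve off validation and test sets."""
    names = sorted(names)  # make the result independent of the input order
    if val_size + test_size >= len(names):
        raise ValueError("not enough files left for training")
    rng = random.Random(seed)  # a fixed seed gives the same split on every run
    rng.shuffle(names)
    return {
        "val": names[:val_size],
        "test": names[val_size:val_size + test_size],
        "train": names[val_size + test_size:],
    }
```

Sorting before shuffling means the same files always land in the same subset, regardless of the order in which the file system lists them.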

To perform self-supervised learning with dual learning, run the following command.

python train.py \
    --config_path configs/train/${feature}/ssl_jsut.yaml \
    --stage ssl-dual \
    --run_name ssl_melspec_dual

For other options, refer to train.py.
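
The two steps above can also be chained from Python. This is a small convenience sketch rather than anything shipped with the repository: the function names are hypothetical, and only the script names and the --config_path, --stage, and --run_name flags come from the commands above.

```python
import subprocess
import sys


def build_commands(feature, run_name, stage="ssl-dual", corpus_config="ssl_jsut.yaml"):
    """Return the preprocess and train commands as argument lists."""
    config = f"configs/train/{feature}/{corpus_config}"
    preprocess = [sys.executable, "preprocess.py", "--config_path", config]
    train = [
        sys.executable, "train.py",
        "--config_path", config,
        "--stage", stage,
        "--run_name", run_name,
    ]
    return [preprocess, train]


def run_pipeline(feature, run_name, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(feature, run_name):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

For example, run_pipeline("melspec", "ssl_melspec_dual") prints both commands without running them, which is a cheap way to check the config path before starting a long job.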

Speech restoration

To perform speech restoration of the test data, run the following command.

python eval.py \
    --config_path configs/test/${feature}/ssl_jsut.yaml \
    --ckpt_path ${path to checkpoint} \
    --stage ssl-dual \
    --run_name ssl_melspec_dual

For other options, see eval.py.

Audio effect transfer

You can run a simple audio effect transfer demo using a model pretrained with real data.
Run the following command.

python aet_demo.py

To customize the dataset or model, edit audio_effect_transfer.yaml and run the following command.

python aet.py \
    --config_path configs/test/melspec/audio_effect_transfer.yaml \
    --stage ssl-dual \
    --run_name aet_melspec_dual

For other options, see aet.py.

Pretrained models

See here.

Reproducing results

You can generate simulated low-quality data as in the paper with the following command.

python simulated_data.py \
    --in_dir ${input_directory (e.g., path to jsut_22k)} \
    --output_dir ${output_directory (e.g., path to jsut_22k-low)} \
    --corpus_type ${single-speaker corpus or multi-speaker corpus} \
    --deg_type lowpass
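
For intuition about what a low-pass degradation does to a signal, here is a generic windowed-sinc FIR filter in pure Python. It is only an illustration: the filter that simulated_data.py actually applies is not shown here, and the function names, tap count, and any cutoff you pass in are assumptions.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc FIR taps, normalised to unit gain at DC."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    fc = cutoff_hz / sample_rate  # normalised cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal low-pass impulse response, with the k == 0 limit handled separately.
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal, taps):
    """Causal FIR filter with zero initial state; output length equals input length."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * signal[i - j]
        out.append(acc)
    return out
```

Filtering a 22.05 kHz signal with taps from lowpass_taps(2000, 22050) leaves low-frequency content nearly untouched and strongly attenuates content well above 2 kHz. The nested loop is slow on full utterances, so this is meant for reading, not for generating a corpus.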

Then download the pretrained model corresponding to the deg_type and run the following command.

python eval.py \
    --config_path configs/train/${feature}/ssl_jsut.yaml \
    --ckpt_path ${path to checkpoint} \
    --stage ssl-dual \
    --run_name ssl_melspec_dual

Citation

@article{saeki22selfremaster,
  title={{SelfRemaster}: {S}elf-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling},
  author={T. Saeki and S. Takamichi and T. Nakamura and N. Tanji and H. Saruwatari},
  journal={arXiv preprint arXiv:2203.12937},
  year={2022}
}

Comments

Running train.py complains about lack of data

Thank you very much for the interesting paper and the code repo.

I downloaded jvs and jsut dataset, unpacked them, renamed and degraded them accordingly, e.g.:

#!/usr/bin/env bash

set -ev

dir=jsut_ver1.1

[ -e "$dir" ] || {
  >&2 echo "error: invalid directory '$dir'"
  exit 1
}

lowdir="jsut_22k"
degradedir="jsut_22k-low"

replace_once() {
  s=$1; shift
  from=$1; shift
  to=$1; shift
  env python3 -c "print('$s'.replace('$from', '$to', 1))"
}

# create subdirs
find "$dir" -type d | while IFS= read -r line; do
  mkdir -pv "$(replace_once "$line" "$dir" "$lowdir")"
done

# downsample to 22k
find "$dir" -type f | sort -n | while IFS= read -r line; do
  [ -e "$line" ] || {
    echo "no such file $line"
    exit 1
  }
  output=$(replace_once "$line" "$dir" "$lowdir")
  [ -e "$output" ] &&  {
    continue
  }
  if [ -z "$(echo "$line" | grep -E ".wav$")" ]; then
    #cp -v "$line" "$output"
    continue
  fi
  echo "downsample '$line' -> '$output'"
  ffmpeg -nostdin -hide_banner -loglevel error -i "$line" -ac 1 -ar 22050 -q:a 0 -y "$output"
done

# create subdirs
find "$dir" -type d | while IFS= read -r line; do
  mkdir -p "$(replace_once "$line" "$dir" "$degradedir")"
done

# degrade audio
find "$lowdir" -type f | sort -n | while IFS= read -r line; do
  [ -e "$line" ] || {
    echo "no such file $line"
    exit 1
  }
  output=$(replace_once "$line" "$lowdir" "$degradedir")
  [ -e "$output" ] &&  {
    continue
  }
  if [ -z "$(echo "$line" | grep -E ".wav$")" ]; then
    #cp -v "$line" "$output"
    continue
  fi
  echo "degrade '$line' -> '$output'"
  tmp="/tmp/jsut_$(basename "$output")"
  ./degrade_audio.py "$line" "$tmp"
  mv "$tmp" "$output"
done

Then I do the same with the JVS dataset, restructuring it so that the .wav files match the */*.wav pattern (about 15k files).

In configs/train/melspec/ssl_jsut.yaml I change:

  source_path: "./data/jsut_22k-low/basic5000/wav"
  aux_path: "./data/jsut_22k/basic5000/wav"

Running this seems to generate a lot of pickles for 5000+14997 files (changing jsut:

python3 preprocess.py --config_path configs/train/melspec/ssl_jsut.yaml

Then running

env python3 train.py \
    --config_path configs/train/melspec/ssl_jsut.yaml \
    --stage ssl-dual \
    --run_name ssl_melspec_dual

Produces "index 0 not found" errors in the dataset, e.g:

  File "./ssl_speech_restoration/dataset.py", line 205, in __getitem__
    d_batch["wavstask"] = torch.from_numpy(self.d_out["wavstask"][idx])
IndexError: index 0 is out of bounds for axis 0 with size 0

Changing ssl-dual into pretrain produces some "augment key not found" error.

What would be the correct pipeline? Is there something I could try to make it train?

Thanks

opened by theoden8 5

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gradio/routes.py", line 275, in predict
	output = await app.blocks.process_api(body, username, session_state)
  File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 274, in process_api
	predictions = await run_in_threadpool(block_fn.fn, *processed_input)
  File "/usr/local/lib/python3.8/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool
	return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 31, in run_sync
	return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
	return await future
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
	result = context.run(func, *args)
  File "/usr/local/lib/python3.8/dist-packages/gradio/interface.py", line 500, in <lambda>
	lambda *args: self.run_prediction(args)[0]
  File "/usr/local/lib/python3.8/dist-packages/gradio/interface.py", line 682, in run_prediction
	prediction = predict_fn(*processed_input)
  File "aet_demo.py", line 60, in transfer
	src_model = SSLDualLightningModule(config).load_from_checkpoint(
  File "/root/ssl_speech_restoration/lightning_module.py", line 623, in __init__
	super().__init__(config)
  File "/root/ssl_speech_restoration/lightning_module.py", line 307, in __init__
	self.vocoder = load_vocoder(config)
  File "/root/ssl_speech_restoration/utils.py", line 44, in load_vocoder
	vocoder.load_state_dict(torch.load(config["general"]["hifigan_path"])["generator"])
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 608, in load
	return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 787, in _legacy_load
	result = unpickler.load()
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 743, in persistent_load
	deserialized_objects[root_key] = restore_location(obj, location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 175, in default_restore_location
	result = fn(storage, location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 151, in _cuda_deserialize
	device = validate_cuda_device(location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 135, in validate_cuda_device
	raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

This happens when I upload a sample file in Spanish.

opened by johnfelipe 3
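
The error message in the traceback above points at its own remedy: pass map_location to torch.load when no GPU is available. A minimal sketch of that idea follows. The helper names are hypothetical, and where to apply it (for example around the torch.load call in load_vocoder that the traceback shows) is left to the reader.

```python
def choose_map_location(cuda_available):
    """Return a value for the map_location argument of torch.load."""
    # None keeps the device recorded in the checkpoint;
    # "cpu" remaps every storage to the CPU.
    return None if cuda_available else "cpu"


def load_checkpoint(path):
    """Load a checkpoint so that it also works on a CPU-only machine."""
    import torch  # imported lazily so choose_map_location works without PyTorch

    return torch.load(path, map_location=choose_map_location(torch.cuda.is_available()))
```

On a machine without CUDA this loads GPU-saved weights onto the CPU instead of raising the RuntimeError shown above.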

No versions in requirements.txt

Hello. Thanks for publishing your code and checkpoints 😃

I've come across the following error

dataset.py:145: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

Although this warning disappears when you add dtype=object, I came across another problem later on and was unable to get the system running.

My suggestion is to add version numbers for each dependency in requirements.txt. That way, we can know which versions of each library form a working solution, and the code will continue to work in the future after libraries have changed.
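
One way to produce such a pinned file is sketched below. It assumes the script runs in an environment where the code is known to work, and the function name is hypothetical. It looks up each unpinned requirement with the standard library's importlib.metadata and leaves everything else untouched.

```python
import re
from importlib import metadata


def pin_requirements(lines, get_version=metadata.version):
    """Append ==<installed version> to every requirement that has no constraint yet."""
    pinned = []
    for line in lines:
        stripped = line.strip()
        # Leave blanks, comments, pip options, and already constrained entries alone.
        if (not stripped or stripped.startswith(("#", "-"))
                or re.search(r"[=<>~!@]", stripped)):
            pinned.append(stripped)
            continue
        try:
            pinned.append(f"{stripped}=={get_version(stripped)}")
        except metadata.PackageNotFoundError:
            pinned.append(stripped)  # not installed here, so nothing to pin against
    return pinned
```

Feeding it the lines of requirements.txt and writing the result back gives a file with exact versions. Running pip freeze in the working environment is a coarser alternative that also lists transitive dependencies.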

opened by chrisbaume 1

quality of restored speech not good

Hi

I tried the Hugging Face demo on my wav file, but the quality is not good. Is it because the vocoder is trained on a Japanese corpus? Is there a general speech restoration model?

opened by sciai-ai 1