Evaluation and Benchmarking of Speech Super-resolution Methods

Overview

Speech Super-resolution Evaluation and Benchmarking

What this repo does:

A toolbox for evaluating speech super-resolution algorithms.
Unifies the evaluation pipeline of speech super-resolution algorithms to make comparison between different systems easier.
Benchmarks speech super-resolution methods (pull requests are welcome) and encourages reproducible research.

I built this repo while writing my INTERSPEECH 2022 paper, "Neural Vocoder is All You Need for Speech Super-resolution". The model described in that paper, NVSR, will also be open-sourced here.

Installation

Install via pip:

pip3 install ssr_eval

Please make sure you have already installed sox.
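
If you are unsure whether sox is visible to Python, a quick check like the one below can help. This is only a sketch: the helper name find_sox is made up for illustration and is not part of ssr_eval, and it only checks that an executable called "sox" is on your PATH.

```python
import shutil


def find_sox():
    """Return the full path of the sox executable, or None if it is not on PATH."""
    return shutil.which("sox")


if __name__ == "__main__":
    path = find_sox()
    if path is None:
        print("sox was not found on PATH; install it before running ssr_eval.")
    else:
        print(f"Found sox at {path}")
```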

Quick Example

A basic example: evaluate a system that does nothing:

from ssr_eval import test 
test()

The evaluation result JSON file will be stored in the ./results directory: Example file
The code automatically handles tasks such as downloading the test sets.
At the bottom of the JSON file you will find an "averaged" field that looks like the one below. This field summarizes the performance of the system.

"averaged": {
        "proc_fft_24000_44100": {
            "lsd": 5.152331300436993,
            "log_sispec": 5.8051057146229095,
            "sispec": 30.23394207533686,
            "ssim": 0.8484425044157442
        }
    }
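
To pull these numbers out of a result file programmatically, something like the following sketch may be useful. It assumes only the structure shown above (a top-level "averaged" object mapping a setting name to its metrics). The helper names are hypothetical, and the actual file name under ./results depends on your test_name, so pass the path of your own file.

```python
import json
from pathlib import Path


def load_averaged_metrics(result_path):
    """Return the "averaged" field of an evaluation result file as a dict."""
    with Path(result_path).open("r", encoding="utf-8") as f:
        result = json.load(f)
    return result["averaged"]


def format_metrics(averaged):
    """Turn {setting: {metric: value}} into tab-separated text lines."""
    lines = []
    for setting, metrics in averaged.items():
        for name, value in sorted(metrics.items()):
            lines.append(f"{setting}\t{name}\t{value:.3f}")
    return lines
```

For example, print("\n".join(format_metrics(load_averaged_metrics(path)))) lists one metric per line for each evaluated setting.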

Here we report four metrics:

Log-spectral distance (LSD).
Log scale-invariant spectral distance [1] (log-sispec).
Scale-invariant spectral distance [1] (sispec).
Structural similarity (SSIM).

⚠️ LSD is the most widely used metric for super-resolution. The other three metrics are included in case you need them.
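
For intuition, here is a minimal NumPy sketch of LSD using a commonly used definition: the root-mean-square difference of log power spectra over frequency, averaged over frames. It is an illustration under stated assumptions, not necessarily the exact computation inside ssr_eval. The STFT settings (Hann window, n_fft=2048, hop=512) and the function names are assumptions made for this example.

```python
import numpy as np


def stft_magnitude(x, n_fft=2048, hop=512):
    """Magnitude spectrogram of a 1-D signal, shape [frames, n_fft // 2 + 1]."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < n_fft:
        # Pad short signals so that at least one full frame exists.
        x = np.pad(x, (0, n_fft - len(x)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def log_spectral_distance(reference, estimate, eps=1e-8):
    """Frame-averaged RMS difference between log10 power spectra."""
    n = min(len(reference), len(estimate))
    ref_power = stft_magnitude(reference[:n]) ** 2
    est_power = stft_magnitude(estimate[:n]) ** 2
    log_diff = np.log10(ref_power + eps) - np.log10(est_power + eps)
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=1))))
```

Identical signals give an LSD of 0, and lower values mean the estimate's spectrum is closer to the reference.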

Below is the code of test():

from ssr_eval import SSR_Eval_Helper, BasicTestee

# You need to implement a class for the model to be evaluated.
class MyTestee(BasicTestee):
    def __init__(self) -> None:
        super().__init__()

    # You need to implement this function
    def infer(self, x):
        """A testee that does nothing.

        Args:
            x (np.array): [sample,], with model_input_sr sample rate

        Returns:
            np.array: [sample,]
        """
        return x

def test():
    testee = MyTestee()
    # Initialize an evaluation helper
    helper = SSR_Eval_Helper(
        testee,
        test_name="unprocessed",  # Test name for storing the result
        input_sr=44100,  # The sampling rate of the input x in the 'infer' function
        output_sr=44100,  # The sampling rate of the output x in the 'infer' function
        evaluation_sr=48000,  # The sampling rate to calculate evaluation metrics.
        setting_fft={
            "cutoff_freq": [
                12000
            ],  # The cutoff frequency of the input x in the 'infer' function
        },
        save_processed_result=True
    )
    # Perform evaluation
    ## Use all eight speakers in the test set for evaluation (limit_test_speaker=-1)
    ## Evaluate on 10 utterances for each speaker (limit_test_nums=10)
    helper.evaluate(limit_test_nums=10, limit_test_speaker=-1)

The code automatically handles tasks such as downloading the test sets. The evaluation result will be saved in the ./results directory.
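
Before launching a full evaluation, it can be handy to sanity-check your own infer function on synthetic input. The sketch below is a hypothetical helper, not part of ssr_eval. It follows the docstring above (a 1-D array in, a 1-D array out) and additionally assumes that the output should cover the same duration as the input at output_sr, which you should confirm for your own setup.

```python
import numpy as np


def check_infer(infer_fn, input_sr=44100, output_sr=44100, seconds=1.0):
    """Run infer_fn on random noise and validate the shape of its output.

    Hypothetical helper, not part of ssr_eval.
    Returns (actual_length, expected_length) of the output in samples.
    """
    rng = np.random.default_rng(0)
    x = rng.standard_normal(int(input_sr * seconds)).astype(np.float32)
    y = infer_fn(x)
    if not isinstance(y, np.ndarray):
        raise TypeError(f"infer must return np.ndarray, got {type(y).__name__}")
    if y.ndim != 1:
        raise ValueError(f"infer must return a 1-D array, got shape {y.shape}")
    if not np.all(np.isfinite(y)):
        raise ValueError("infer returned NaN or infinite samples")
    # Assumption: the output spans the same duration as the input.
    expected = int(round(len(x) * output_sr / input_sr))
    return len(y), expected


# The do-nothing testee corresponds to the identity function.
actual, expected = check_infer(lambda x: x)
print(actual, expected)
```

With a real model, pass a bound method such as MyTestee().infer instead of the lambda.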

Baselines

We provide several pretrained baselines. For example, to run the NVSR baseline, follow the link in the table below for more details.

Table 1. Log-spectral distance (LSD) for different input sampling rates (evaluated at 44.1 kHz).

Method	One for all	Params	2kHz	4kHz	8kHz	12kHz	16kHz	24kHz	32kHz	AVG
NVSR [Pretrained Model]	Yes	99.0M	1.04	0.98	0.91	0.85	0.79	0.70	0.60	0.84
WSRGlow(24kHz→48kHz)	No	229.9M	-	-	-	-	-	0.79	-	-
WSRGlow(12kHz→48kHz)	No	229.9M	-	-	-	0.87	-	-	-	-
WSRGlow(8kHz→48kHz)	No	229.9M	-	-	0.98	-	-	-	-	-
WSRGlow(4kHz→48kHz)	No	229.9M	-	1.12	-	-	-	-	-	-
Nu-wave(24kHz→48kHz)	No	3.0M	-	-	-	-	-	1.22	-	-
Nu-wave(12kHz→48kHz)	No	3.0M	-	-	-	1.40	-	-	-	-
Nu-wave(8kHz→48kHz)	No	3.0M	-	-	1.42	-	-	-	-	-
Nu-wave(4kHz→48kHz)	No	3.0M	-	1.42	-	-	-	-	-	-
Unprocessed	-	-	5.69	5.50	5.15	4.85	4.54	3.84	2.95	4.65

Click the link of a model for more details.

Here, "one for all" means the model can process inputs with different sampling rates.
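
The AVG column appears to be the arithmetic mean of the seven per-rate LSD values in a row. That is an inference from the numbers in the table rather than a documented rule, and the short check below reproduces it for the two rows that report every sampling rate.

```python
def average_lsd(values):
    """Arithmetic mean of the per-sampling-rate LSD values."""
    return sum(values) / len(values)


# Values copied from the NVSR and Unprocessed rows (2 kHz to 32 kHz).
nvsr = [1.04, 0.98, 0.91, 0.85, 0.79, 0.70, 0.60]
unprocessed = [5.69, 5.50, 5.15, 4.85, 4.54, 3.84, 2.95]

print(round(average_lsd(nvsr), 2))         # 0.84
print(round(average_lsd(unprocessed), 2))  # 4.65
```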

Features

The following code demonstrates all the options of SSR_Eval_Helper:

testee = MyTestee()
helper = SSR_Eval_Helper(testee, # Your testee object with the 'infer' function implemented
                        test_name="unprocess",  # The name of this test. Used for saving the log file in the ./results directory
                        test_data_root="./your_path/vctk_test", # The directory to store the test data, which will be automatically downloaded.
                        input_sr=44100, # The sampling rate of the input x in the 'infer' function
                        output_sr=44100, # The sampling rate of the output x in the 'infer' function
                        evaluation_sr=48000, # The sampling rate to calculate evaluation metrics. 
                        save_processed_result=False, # If True, save model output in the dataset directory.
                        # (Recommended/Default) Use the Fourier method to simulate the low-resolution effect
                        setting_fft = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], # The cutoff frequency of the input x in the 'infer' function
                        }, 
                        # Use lowpass filtering to simulate low-resolution effect. All possible combinations will be evaluated. 
                        setting_lowpass_filtering = {
                            "filter":["cheby","butter","bessel","ellip"], # The type of filter 
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], 
                            "filter_order": [3,6,9] # Filter orders
                        }, 
                        # Use subsampling method to simulate low-resolution effect
                        setting_subsampling = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000],
                        }, 
                        # Use mp3 compression method to simulate low-resolution effect
                        setting_mp3_compression = {
                            "low_kbps": [32, 48, 64, 96, 128],
                        },
)

helper.evaluate(limit_test_nums=10, # For each speaker, only evaluate on 10 utterances.
                limit_test_speaker=-1 # Evaluate on all the speakers. 
                )
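
To get a feel for what "all possible combinations" means for setting_lowpass_filtering, the sketch below enumerates the Cartesian product of the three lists from the example above. The function expand_settings is hypothetical and only illustrates the count. How ssr_eval iterates internally is not shown here.

```python
from itertools import product

setting_lowpass_filtering = {
    "filter": ["cheby", "butter", "bessel", "ellip"],
    "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000],
    "filter_order": [3, 6, 9],
}


def expand_settings(setting):
    """Return one dict per combination of the option lists in `setting`."""
    keys = list(setting)
    return [
        dict(zip(keys, values))
        for values in product(*(setting[key] for key in keys))
    ]


combos = expand_settings(setting_lowpass_filtering)
print(len(combos))  # 4 filters x 7 cutoffs x 3 orders = 84
print(combos[0])
```

With these lists, a full low-pass evaluation therefore covers 84 configurations, so it takes noticeably longer than a single setting_fft cutoff.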

⚠️ I recommend that all users use the Fourier method (setting_fft) to simulate the low-resolution effect, so that results from different systems are easy to compare.
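
As a rough illustration of the idea behind a Fourier-based low-resolution simulation, the sketch below zeroes every frequency bin above the cutoff and transforms back. It is a simplified whole-signal version written for this explanation. The function name fft_lowpass is an assumption, and the implementation inside ssr_eval may differ (for example, by working frame by frame).

```python
import numpy as np


def fft_lowpass(x, sample_rate, cutoff_freq):
    """Remove all content above cutoff_freq (Hz) by zeroing FFT bins."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_freq] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


# One second of a 1 kHz tone plus a 10 kHz tone, sampled at 44.1 kHz.
sr = 44100
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 1000 * t)
high_tone = 0.5 * np.sin(2 * np.pi * 10000 * t)

# With a 4 kHz cutoff, only the 1 kHz tone should remain.
filtered = fft_lowpass(low_tone + high_tone, sr, cutoff_freq=4000)
print(np.max(np.abs(filtered - low_tone)))
```

The printed residual should be close to zero, since both tones fall exactly on FFT bins for a one-second signal.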

Dataset Details

We build the test sets from VCTK (version 0.92), a multi-speaker English corpus that contains 110 speakers with different accents.

Speakers used for the test set: p360, p361, p362, p363, p364, p374, p376, s5
Of the remaining 100 speakers, p280 and p315 are omitted due to technical issues.
The other 98 speakers are used for training.
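
If you want to reproduce this split for your own training, a sketch along these lines may help. The speaker IDs are taken from the list above. The function name and the idea of passing in a list of speaker IDs (for example, directory names from your VCTK copy) are assumptions for illustration.

```python
TEST_SPEAKERS = {"p360", "p361", "p362", "p363", "p364", "p374", "p376", "s5"}
OMITTED_SPEAKERS = {"p280", "p315"}


def split_speakers(all_speakers):
    """Split speaker IDs into (train, test), dropping the omitted speakers."""
    test = sorted(s for s in all_speakers if s in TEST_SPEAKERS)
    train = sorted(
        s
        for s in all_speakers
        if s not in TEST_SPEAKERS and s not in OMITTED_SPEAKERS
    )
    return train, test


# Toy example with a handful of IDs.
train, test = split_speakers(["p225", "p226", "p280", "p315", "p360", "s5"])
print(train)  # ['p225', 'p226']
print(test)   # ['p360', 's5']
```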

Citation

If you find this repo useful for your research, please consider citing:

@misc{liu2022neural,
      title={Neural Vocoder is All You Need for Speech Super-resolution}, 
      author={Haohe Liu and Woosung Choi and Xubo Liu and Qiuqiang Kong and Qiao Tian and DeLiang Wang},
      year={2022},
      eprint={2203.14941},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Reference

[1] Liu, Haohe, et al. "VoiceFixer: Toward General Speech Restoration with Neural Vocoder." arXiv preprint arXiv:2109.13731 (2021).

Comments

Use the pretrained model to predict 16 kHz track

Hello,

Your work is awesome! Can I use the provided 48 kHz pretrained model to predict the 16 kHz track directly? e.g. 4 kHz to 16 kHz. Or do I need to retrain the model?

Best regards

opened by ruizhecao96 4
Incredible! How to run inference on a custom file?

Super impressed by your results! Curious to know how I could run a sample audio file through your model to upsample it. It seems the code provided here simply evaluates the model: https://github.com/haoheliu/ssr_eval/tree/main/examples/NVSR

I'll try to figure it out from that but would love any help whatsoever. No pressure whatsoever if busy though!

opened by youssefavx 3
Running pre-trained NVSR

Hello, I am trying to run the pre-trained NVSR. After successfully installing the requirements, running "python main.py" results in an EOFError. Here is the traceback:

Traceback (most recent call last):
  File "main.py", line 172, in <module>
    testee = eval(test_name)(device=device)
  File "main.py", line 114, in __init__
    super(NVSRPostProcTestee, self).__init__(device)
  File "main.py", line 56, in __init__
    self.model = Model(channels=1)
  File "\ssr_eval-main\examples\NVSR\nvsr_unet.py", line 84, in __init__
    self.vocoder = Vocoder(sample_rate=44100).to(device)
  File "E:\Anaconda\lib\site-packages\voicefixer\vocoder\base.py", line 14, in __init__
    self._load_pretrain(Config.ckpt)
  File "E:\Anaconda\lib\site-packages\voicefixer\vocoder\base.py", line 19, in _load_pretrain
    checkpoint = load_checkpoint(pth, torch.device("cpu"))
  File "E:\Anaconda\lib\site-packages\voicefixer\vocoder\model\util.py", line 92, in load_checkpoint
    checkpoint = torch.load(checkpoint_path, map_location=device)
  File "E:\Anaconda\lib\site-packages\torch\serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "E:\Anaconda\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

opened by kether82 1
Missing `pytorch-lightning` dependency in NVSR requirements

The NVSR/nvsr_unet.py module imports pytorch_lightning (line 10), but the package (pytorch-lightning) isn't listed in the NVSR/requirements.txt file. Just wanted to report this, which I noticed while installing your package!

opened by lmmx 1
Is it possible to run this model on multiple GPUs?

(Talking about NVSR) I ran this model on a cloud A100 80 GB GPU and managed to upsample audio up to 7 minutes long. I'm now curious how far it can go. If I had 8x A100 GPUs, would it be possible to upsample a 56-minute file? Or is this model not designed to run inference on multiple GPUs?

(I know my method so far has been to split long audio files and then upsample them, but I'd like to avoid splitting.)

opened by youssefavx 0

Owner

Haohe Liu (刘濠赫)
