Python implementation of the Short Term Objective Intelligibility measure

Pariente Manuel

Last update: Dec 21, 2022

Related tags

Audio pystoi

Overview

Python implementation of STOI

Implementation of the classical and extended Short Term Objective Intelligibility measures

Intelligibility measure which is highly correlated with the intelligibility of degraded speech signals, e.g., due to additive noise, single/multi-channel noise reduction, binary masking and vocoded speech as in CI simulations. The STOI-measure is intrusive, i.e., a function of the clean and degraded speech signals. STOI may be a good alternative to the speech intelligibility index (SII) or the speech transmission index (STI), when you are interested in the effect of nonlinear processing to noisy speech, e.g., noise reduction, binary masking algorithms, on speech intelligibility.
Description taken from Cees Taal's website

Install

pip install pystoi or pip3 install pystoi

Usage

import soundfile as sf
from pystoi import stoi

clean, fs = sf.read('path/to/clean/audio')
denoised, fs = sf.read('path/to/denoised/audio')

# clean and denoised should have the same length, and be 1D
d = stoi(clean, denoised, fs, extended=False)
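
The comment above requires both inputs to be 1-D and of equal length. A minimal sketch of one way to get there is shown below. `prepare_pair` is a hypothetical helper, not part of pystoi, and it assumes that the two recordings are already time-aligned at their start, so that truncating the longer one is harmless.

```python
import numpy as np


def prepare_pair(clean, degraded):
    """Return 1-D, equal-length float versions of two signals (hypothetical helper)."""

    def to_mono(x):
        x = np.asarray(x, dtype=float)
        # soundfile returns shape (n_samples, n_channels) for multi-channel files,
        # so averaging over axis 1 downmixes to a single channel.
        return x.mean(axis=1) if x.ndim == 2 else x

    clean, degraded = to_mono(clean), to_mono(degraded)
    # Truncate both signals to the shorter length (assumes aligned starts).
    n = min(len(clean), len(degraded))
    return clean[:n], degraded[:n]
```

The returned pair can then be passed to `stoi(clean, denoised, fs, extended=False)` as in the usage above.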

Matlab code & Testing

All the Matlab code in this repo is taken or adapted from the original implementation (STOI – Short-Time Objective Intelligibility Measure) written by Cees Taal.

Thanks to Cees Taal who open-sourced his Matlab implementation and enabled thorough testing of this python code.

If you want to run the tests, you will need Matlab, matlab.engine (install instructions here) and matlab_wrapper (install with pip install matlab_wrapper). The tests can only be run under Python 2.7, as matlab.engine and matlab_wrapper are only compatible with Python 2.7. Tests pass at a relative and absolute tolerance of 1e-3, which is enough for the considered application (all the variability comes from the resampling method when signals are not natively sampled at 10 kHz).
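
To make the tolerance statement concrete, here is a small sketch of what a combined relative and absolute tolerance of 1e-3 means, assuming the comparison follows the semantics of `np.allclose` (the README does not say how the tests implement it). `within_tolerance` is a hypothetical name.

```python
import numpy as np

RTOL = 1e-3  # relative tolerance quoted above
ATOL = 1e-3  # absolute tolerance quoted above


def within_tolerance(python_scores, reference_scores, rtol=RTOL, atol=ATOL):
    """True if every score satisfies |py - ref| <= atol + rtol * |ref|."""
    return bool(np.allclose(python_scores, reference_scores, rtol=rtol, atol=atol))
```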

Very big thanks to @gauss256 who translated all the matlab scripts to Octave, and wrote all the tests for it!

Contribute

Any contributions are welcome~~, especially to improve the execution speed of the code~~ (thank you Przemek Pobrotyn for a 4x speed-up!):

~~Improve the resampling method to match Matlab's resampling in tests/.~~ This can be considered a solved issue thanks to @gauss256 !
Write tests for Python 3 (with transplant for example)

References

[1] C.H.Taal, R.C.Hendriks, R.Heusdens, J.Jensen 'A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech', ICASSP 2010, Dallas, Texas.
[2] C.H.Taal, R.C.Hendriks, R.Heusdens, J.Jensen 'An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech', IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[3] J. Jensen and C. H. Taal, 'An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers', IEEE Transactions on Audio, Speech and Language Processing, 2016.

Comments

remove_silent_frames

I think the way the signal is being reconstructed in remove_silent_frames is not quite right. The evidence for this is that if you start with a signal that has no silence, the output of remove_silent_frames should be the same as the input, but it's not.

The problem is that the window does not satisfy the COLA constraint so the overlap-add technique needs to be modified to compensate.

I know how to fix this but I wanted to get your opinion before preparing a PR. The issue is that the problem lies in the original STOI MATLAB code, not in pystoi. So if it is fixed, tests that compare to the output of MATLAB will fail.

The magnitude of the error is probably not large, so it's a tradeoff. Is it better to have correct silence removal or consistency with MATLAB output?
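
The COLA (constant overlap-add) point can be checked numerically. The sketch below sums shifted copies of a window and measures how far the result is from a constant. It assumes a frame length of 256, a hop of 128, and a window built as `np.hanning(framelen + 2)[1:-1]`; these values are assumptions about the implementation, not something stated in this issue. A periodic Hann window is included for comparison.

```python
import numpy as np


def overlap_add_profile(window, hop, n_frames=8):
    """Sum shifted copies of `window` and return the steady-state part."""
    framelen = len(window)
    total = np.zeros(hop * (n_frames - 1) + framelen)
    for i in range(n_frames):
        total[i * hop:i * hop + framelen] += window
    # Drop one frame length at each end, where fewer copies overlap.
    return total[framelen:-framelen]


def cola_deviation(window, hop):
    """Peak-to-peak ripple of the summed windows (0 means COLA holds)."""
    profile = overlap_add_profile(window, hop)
    return float(profile.max() - profile.min())


framelen, hop = 256, 128  # assumed values
# Periodic Hann: satisfies COLA at 50 % overlap.
periodic_hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(framelen) / framelen)
# Symmetric Hann with the zero endpoints trimmed (assumed to match the code).
trimmed_hann = np.hanning(framelen + 2)[1:-1]

print(cola_deviation(periodic_hann, hop))
print(cola_deviation(trimmed_hann, hop))
```

One common way to compensate is to divide the overlap-added signal by the summed window profile. Whether that is the fix the author of this issue had in mind is not stated here.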

opened by gauss256 14

Vectorization

As per the contribution request, this pull request vectorises a number of computations, specifically:

Vectorization of nearly all of utils.remove_silent_frames

Vectorization of computations in both classical and extended STOI

There are still a few for loops left in the form of list comprehensions, though.

As a result, we obtain a nearly 4-fold speed up in computations of classical STOI and 1.5-fold speed up in computations of extended STOI.

We also demonstrate that the results of vectorized implementation agree with the previous one, within tolerance. See documentation for np.allclose for details about tolerance threshold.

The minuscule differences in results are most likely due to numerical inaccuracies when performing matrix operations vs doing computations in loops.
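
As an illustration of the kind of change described in this pull request (not the actual pystoi code), the sketch below computes per-frame norms once with a Python loop and once with fancy indexing, then compares them with `np.allclose`. The function names are hypothetical.

```python
import numpy as np


def frame_energies_loop(x, framelen, hop):
    """Per-frame L2 norm, one frame at a time."""
    energies = []
    for start in range(0, len(x) - framelen + 1, hop):
        frame = x[start:start + framelen]
        energies.append(np.sqrt(np.sum(frame ** 2)))
    return np.array(energies)


def frame_energies_vectorized(x, framelen, hop):
    """Same result, computed on a (n_frames, framelen) index matrix."""
    n_frames = (len(x) - framelen) // hop + 1
    # Row i holds the sample indices of frame i.
    idx = np.arange(framelen)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.linalg.norm(x[idx], axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
same = np.allclose(frame_energies_loop(x, 256, 128),
                   frame_energies_vectorized(x, 256, 128))
```

The two versions sum the squares in a different order, so they can differ in the last few bits, which is why a tolerance-based comparison is used rather than exact equality.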

See the images below for results and time comparisons.

Before merging, please update the README file accordingly. Thank you!

opened by PrzemekPobrotyn 7

TensorFlow

I am working on a version of STOI in TensorFlow. I am posting this here just in case @mpariente or someone else is also working on that. We could avoid duplicating effort.

opened by gauss256 7
Correct size of removed silence arrays

The code was not removing silent frames properly. This was not caught in the unit tests because there are no silent frames in the test data.

I have confirmed the commit in my own unit tests, but they are based on Octave rather than MATLAB and it would take some work (which I may yet do) to commit them.

The unit test data consists of random numbers. To reproduce the problem and verify the fix, zero out the middle third of the data. Then compare the results of pystoi to MATLAB or Octave.
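
The reproduction recipe above can be sketched as follows. `zero_middle_third` is a hypothetical helper, and the signal length and random seed are arbitrary choices for illustration.

```python
import numpy as np


def zero_middle_third(signal):
    """Return a copy of `signal` with its middle third set to zero."""
    out = np.array(signal, dtype=float, copy=True)
    n = len(out)
    out[n // 3: 2 * n // 3] = 0.0
    return out


rng = np.random.default_rng(0)
clean = rng.standard_normal(30000)          # random test data, as described above
clean_with_silence = zero_middle_third(clean)
```

The signal with the silent gap, together with a degraded version of it, can then be scored with pystoi and with the MATLAB or Octave reference to compare the results.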

opened by gauss256 6

AxisError when signal contains silence

The stoi function produces an error if a reference signal only contains a short piece of speech. This seems to be caused by the removal of silent frames.

This is a minimal example using WSJ0-2mix data. Replace wsj0_2mix_root with the root to the WSJ0-2mix data. You might have to remove the suffix _2 if you have a newer version of the WSJ0-2mix database:

from pathlib import Path
from pystoi.stoi import stoi
import soundfile as sf

wsj0_2mix_root = Path('<path to WSJ0-2mix root dir>')

observation = sf.read(str(wsj0_2mix_root / 'data/2speakers/wav8k/min/cv/mix/40ba0112_1.2757_01nc0218_-1.2757.wav'))[0]
target = sf.read(str(wsj0_2mix_root / 'data/2speakers/wav8k/min/cv/s2/40ba0112_1.2757_01nc0218_-1.2757_2.wav'))[0]

stoi(target, observation, 8000)

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-167-eb5a1701f57b> in <module>
      9 
     10 
---> 11 stoi(target, observation, 8000)

.../python3.7/site-packages/pystoi/stoi.py in stoi(x, y, fs_sig, extended)
     75         # Find normalization constants and normalize
     76         normalization_consts = (
---> 77             np.linalg.norm(x_segments, axis=2, keepdims=True) /
     78             (np.linalg.norm(y_segments, axis=2, keepdims=True) + utils.EPS))
     79         y_segments_normalized = y_segments * normalization_consts

.../python3.7/site-packages/numpy/linalg/linalg.py in norm(x, ord, axis, keepdims)
   2479             # special case for speedup
   2480             s = (x.conj() * x).real
-> 2481             return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))
   2482         else:
   2483             try:

AxisError: axis 2 is out of bounds for array of dimension 1

Is this a bug in the implementation or a general flaw of the STOI metric? Do you have a suggestion on how to handle this issue?
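
One way to anticipate this failure is to count how many frames of the reference would survive silence removal before calling stoi. The sketch below is an approximation under stated assumptions: a frame length of 256, a hop of 128, a 40 dB dynamic range, a trimmed Hann window, and input already at the internal sampling rate. None of these values is given in this issue, and `count_active_frames` is a hypothetical helper.

```python
import numpy as np


def count_active_frames(x, framelen=256, hop=128, dyn_range=40.0):
    """Count frames whose energy is within `dyn_range` dB of the loudest frame."""
    x = np.asarray(x, dtype=float)
    if len(x) < framelen:
        return 0
    window = np.hanning(framelen + 2)[1:-1]
    starts = range(0, len(x) - framelen + 1, hop)
    frames = np.array([window * x[s:s + framelen] for s in starts])
    # Frame energies in dB; eps avoids log10(0) on all-zero frames.
    energies = 20 * np.log10(np.linalg.norm(frames, axis=1) + np.finfo(float).eps)
    return int(np.sum(energies > energies.max() - dyn_range))


rng = np.random.default_rng(0)
x = np.zeros(10000)
x[:1000] = rng.standard_normal(1000)   # short burst followed by silence
n_active = count_active_frames(x)
```

If the count is below the number of frames the measure analyses at a time (30 in the STOI papers, to the best of my knowledge), the score is unlikely to be meaningful, and skipping or flagging the utterance may be safer than calling stoi.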

opened by thequilo 5

Is there any difference in Resample() between Matlab and Octave?

This code is really helpful for my study. Thank you for this awesome work! I am not raising an issue, just asking some questions about MATLAB and Octave.

The sample rate of my audio is 16000 Hz. According to the README file, the test will fail if I use Python to do the resampling. My question is: how does the resample() function compare between MATLAB and Octave? Are they equivalent? I would much appreciate an answer.

opened by NearLinHere 5

Numpy.dtype Size change

Is this the expected error message when I run the function with 16,000 Hz wav files, as opposed to 10kHz?

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

opened by christianhwa 5

Introduce a faster overlap_and_add method

This PR introduces a new overlap_and_add method, inspired by the one implemented in TensorFlow (https://github.com/tensorflow/tensorflow/blob/v2.7.0/tensorflow/python/ops/signal/reconstruction_ops.py#L30-L167), that is ~50 times faster than the previous one on a 30-second audio clip because it vectorises the code instead of using a for loop.
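
The general idea of replacing a per-frame loop can be sketched as below. This is not the code from the PR or from TensorFlow; it is a simplified illustration that assumes the frame length is an integer multiple of the hop, and both function names are hypothetical. Each frame is split into hop-sized subframes, so the remaining loop runs over the handful of subframes per frame rather than over every frame.

```python
import numpy as np


def overlap_add_loop(frames, hop):
    """Reference overlap-add: one iteration per frame."""
    n_frames, framelen = frames.shape
    out = np.zeros((n_frames - 1) * hop + framelen)
    for i in range(n_frames):
        out[i * hop:i * hop + framelen] += frames[i]
    return out


def overlap_add_vectorized(frames, hop):
    """Overlap-add with one iteration per subframe (framelen // hop in total)."""
    n_frames, framelen = frames.shape
    if framelen % hop != 0:
        raise ValueError("this sketch assumes framelen is a multiple of hop")
    k = framelen // hop
    sub = frames.reshape(n_frames, k, hop)
    out = np.zeros((n_frames + k - 1, hop))
    for s in range(k):
        # Subframe s of frame i lands in output segment i + s.
        out[s:s + n_frames] += sub[:, s]
    return out.reshape(-1)
```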

opened by giamic 3

Is resampling really required?

Hi, the original paper (http://cas.et.tudelft.nl/pubs/Taal2010.pdf) mentions at the start of Section 2 that the metric is supposed to be used on audio at a sampling rate of 10000 Hz. Is this really necessary? I get fairly similar results regardless of whether or not I resample my audio.

Thanks!

opened by anujstam 3

some int casts and prevented a log of 0

As mentioned, very nice that you did a port to Python. I tried it in Python3 and had to make some small tweaks. I did not run the tests because I don't have a Matlab license ( another very good reason we have a Python version :) ).

opened by chtaal 3

Ability for batched tensor

Thank you for the code! I have a question: does this code have the ability to compute STOI for a batched tensor of size [B, num_of_samples], or even [B, num_speaker, num_of_samples]? I checked and I think it does not, right? Could you maybe extend the implementation to this scenario, please?
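
For a function that only accepts 1-D signals, one possible workaround is to loop over the leading dimensions and collect the scores. The sketch below does this for any metric with the call shape `metric_fn(clean, degraded, fs, **kwargs)`. `batched_metric` is a hypothetical wrapper, not part of pystoi, and it gives no speed-up over calling the metric in a loop.

```python
import numpy as np


def batched_metric(metric_fn, clean, degraded, fs, **kwargs):
    """Apply a 1-D metric over arrays shaped [..., num_of_samples]."""
    clean = np.asarray(clean)
    degraded = np.asarray(degraded)
    if clean.shape != degraded.shape:
        raise ValueError("clean and degraded must have the same shape")
    lead_shape = clean.shape[:-1]
    # Flatten all leading dimensions so each row is one 1-D signal.
    flat_clean = clean.reshape(-1, clean.shape[-1])
    flat_degraded = degraded.reshape(-1, degraded.shape[-1])
    scores = [metric_fn(c, d, fs, **kwargs)
              for c, d in zip(flat_clean, flat_degraded)]
    return np.array(scores, dtype=float).reshape(lead_shape)
```

With pystoi installed, a call such as `batched_metric(stoi, clean, denoised, fs, extended=False)` would return an array of shape [B] or [B, num_speaker]. Framework tensors would need to be converted to NumPy arrays first.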

opened by MordehayM 2

Future warnings raised

Hi, I'm running stoi(signal1, signal2, sr, extended=True) where signal1 and signal2 are both numpy.ndarray

and I'm getting the following FutureWarning:

/usr/lib/python3/dist-packages/scipy/signal/signaltools.py:2383: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
  return y[keep]

Any idea how to avoid this from happening?

Thanks
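
The warning is emitted from inside SciPy, so upgrading SciPy is likely the proper fix. Where that is not possible, the warning can be silenced locally with the standard `warnings` module, as in the sketch below. `call_quietly` is a hypothetical helper, and it hides every FutureWarning raised during the call, not only this one.

```python
import warnings


def call_quietly(fn, *args, **kwargs):
    """Call `fn` while ignoring FutureWarning, then restore the warning filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return fn(*args, **kwargs)
```

With pystoi installed, the call from the question would become `call_quietly(stoi, signal1, signal2, sr, extended=True)`.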

opened by m-mandel 1

Batch vectorisation

Allow users to pass batches of audio waveforms, vectorising the relevant code.

This adds flexibility to the code and also makes it faster to run on batched data. There are still probably a few points in which we could optimise even more, but this should do for a first iteration.

I have some tests on my local machine that seem to show that the output hasn't changed. They are not in the PR because I didn't want to keep two versions of every function I changed, it felt confusing and ultimately useless.

Unfortunately I'm not able to run the octave / matlab tests locally to verify if they still pass.

opened by giamic 1

Exception For Small Inputs

Currently pystoi.stoi doesn't support small inputs, and raises an uninformative error:

In [28]:  pystoi.stoi(np.arange(100), np.arange(100), 32000, extended=False)
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-28-3f8d814254e5> in <module>
----> 1 pystoi.stoi(np.arange(100), np.arange(100), 32000, extended=False)

~/venv/py3/lib/python3.7/site-packages/pystoi/stoi.py in stoi(x, y, fs_sig, extended)
     56 
     57     # Remove silent frames
---> 58     x, y = utils.remove_silent_frames(x, y, DYN_RANGE, N_FRAME, int(N_FRAME/2))
     59 
     60     # Take STFT

~/venv/py3/lib/python3.7/site-packages/pystoi/utils.py in remove_silent_frames(x, y, dyn_range, framelen, hop)
    122 
    123     # Compute energies in dB
--> 124     x_energies = 20 * np.log10(np.linalg.norm(x_frames, axis=1) + EPS)
    125 
    126     # Find boolean mask of energies lower than dynamic_range dB

<__array_function__ internals> in norm(*args, **kwargs)

~/venv/py3/lib/python3.7/site-packages/numpy-1.19.2-py3.7-linux-x86_64.egg/numpy/linalg/linalg.py in norm(x, ord, axis, keepdims)
   2559             # special case for speedup
   2560             s = (x.conj() * x).real
-> 2561             return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))
   2562         # None of the str-type keywords for ord ('fro', 'nuc')
   2563         # are valid for vectors

AxisError: axis 1 is out of bounds for array of dimension 1
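
A clearer failure could come from a length check before any framing is done. The sketch below computes a rough lower bound on the input length, assuming an internal rate of 10 kHz, 256-sample frames with 50 % overlap, and 30-frame analysis segments. These constants are assumptions based on the STOI papers and are not shown in the traceback. The bound ignores silent-frame removal and resampler edge effects, and both function names are hypothetical.

```python
import math

FS_INTERNAL = 10000   # assumed internal sampling rate (Hz)
N_FRAME = 256         # assumed frame length at the internal rate
N_SEGMENT = 30        # assumed number of frames per analysis segment


def min_input_samples(fs_sig):
    """Rough minimum number of input samples needed for one analysis segment."""
    hop = N_FRAME // 2
    samples_at_internal_rate = N_FRAME + (N_SEGMENT - 1) * hop
    return math.ceil(samples_at_internal_rate * fs_sig / FS_INTERNAL)


def check_length(x, fs_sig):
    """Raise a descriptive error if `x` is too short to be scored."""
    needed = min_input_samples(fs_sig)
    if len(x) < needed:
        raise ValueError(
            f"signal has {len(x)} samples, but at least {needed} are needed "
            f"at {fs_sig} Hz"
        )
```

For the example above, `check_length(np.arange(100), 32000)` would raise a ValueError naming the required length instead of an AxisError.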

opened by hovavalon 5

Weird STOI Output

Hi,

Recently I was trying to evaluate some signals by calculating the STOI of each signal with this package. I used the pystoi.stoi.stoi function to calculate it. When I input two identical signals as ref_signal and processed_signal, it output 1, as expected. However, when I replaced the processed signal with microphone signals I recorded with and without background music playing, the STOI of the signal recorded with background music was always higher, which made no sense. I'm wondering whether I'm using the function the wrong way, or whether there is something wrong with my audio files or my understanding of STOI.

I've uploaded my audio files at the following website as well as my code to evaluate STOI. https://github.com/nanaChang/stoiCheckFile

Thank you!

opened by nanaChang 7

Owner

Pariente Manuel

Audio researcher

