This library provides common speech features for ASR including MFCCs and filterbank energies.

James Lyons

Last update: Jan 4, 2023

Overview

python_speech_features

This library provides common speech features for ASR, including MFCCs and filterbank energies. If you are not sure what MFCCs are and would like to know more, have a look at this MFCC tutorial.

Project Documentation

To cite, please use: James Lyons et al. (2020, January 14). jameslyons/python_speech_features: release v0.6.1 (Version 0.6.1). Zenodo. http://doi.org/10.5281/zenodo.3607820

Installation

This project is on pypi

To install from pypi:

pip install python_speech_features

From this repository:

git clone https://github.com/jameslyons/python_speech_features
cd python_speech_features
python setup.py develop

Usage

Supported features:

Mel Frequency Cepstral Coefficients
Filterbank Energies
Log Filterbank Energies
Spectral Subband Centroids

Example use
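
A minimal sketch of typical usage, assuming SciPy is installed alongside this package and that a mono WAV file is available. The helper name `extract_features` and the path `file.wav` are placeholders for illustration, not part of the library.

```python
def extract_features(wav_path):
    """Return (MFCCs, log filterbank energies) for one WAV file."""
    # Imported inside the function so the sketch can be loaded even when
    # the optional packages are not installed.
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc, logfbank

    rate, sig = wav.read(wav_path)      # sample rate in Hz, samples as an array
    mfcc_feat = mfcc(sig, rate)         # one row per frame, 13 columns by default
    fbank_feat = logfbank(sig, rate)    # one row per frame, 26 columns by default
    return mfcc_feat, fbank_feat
```

Calling `extract_features("file.wav")` returns two arrays with one row per analysis frame.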

From here you can write the features to a file etc.

MFCC Features

The default parameters should work fairly well for most cases. If you want to change the MFCC parameters, the following parameters are supported:

def mfcc(signal,samplerate=16000,winlen=0.025,winstep=0.01,numcep=13,
         nfilt=26,nfft=512,lowfreq=0,highfreq=None,preemph=0.97,
         ceplifter=22,appendEnergy=True)

Parameter	Description
signal	the audio signal from which to compute features. Should be an N*1 array
samplerate	the samplerate of the signal we are working with.
winlen	the length of the analysis window in seconds. Default is 0.025s (25 milliseconds)
winstep	the step between successive windows in seconds. Default is 0.01s (10 milliseconds)
numcep	the number of cepstral coefficients to return. Default is 13
nfilt	the number of filters in the filterbank, default 26.
nfft	the FFT size. Default is 512
lowfreq	lowest band edge of mel filters. In Hz, default is 0
highfreq	highest band edge of mel filters. In Hz, default is samplerate/2
preemph	apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97
ceplifter	apply a lifter to final cepstral coefficients. 0 is no lifter. Default is 22
appendEnergy	if this is true, the zeroth cepstral coefficient is replaced with the log of the total frame energy.
returns	A numpy array of size (NUMFRAMES by numcep) containing features. Each row holds 1 feature vector.
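
To make NUMFRAMES in the table above concrete, here is a small sketch of how the number of rows follows from the signal length, `winlen` and `winstep`. It assumes that frame length and step are rounded to whole samples and that a final partial frame is zero-padded rather than dropped; the helper names are illustrative, and other toolkits may count frames differently.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with .5 always rounding up."""
    return int(math.floor(x + 0.5))


def num_frames(num_samples, samplerate=16000, winlen=0.025, winstep=0.01):
    """Number of analysis frames, assuming the last partial frame is padded."""
    frame_len = round_half_up(winlen * samplerate)    # 400 samples at 16 kHz
    frame_step = round_half_up(winstep * samplerate)  # 160 samples at 16 kHz
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)


print(num_frames(16000))  # one second at 16 kHz -> 99
```

Under these assumptions, one second of 16 kHz audio with the default parameters would give a feature array of 99 rows by `numcep` columns.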

Filterbank Features

These features are raw filterbank energies. For most applications you will want the logarithm of these features. The default parameters should work fairly well for most cases. If you want to change the fbank parameters, the following parameters are supported:

def fbank(signal,samplerate=16000,winlen=0.025,winstep=0.01,
          nfilt=26,nfft=512,lowfreq=0,highfreq=None,preemph=0.97)

Parameter	Description
signal	the audio signal from which to compute features. Should be an N*1 array
samplerate	the samplerate of the signal we are working with
winlen	the length of the analysis window in seconds. Default is 0.025s (25 milliseconds)
winstep	the step between successive windows in seconds. Default is 0.01s (10 milliseconds)
nfilt	the number of filters in the filterbank, default 26.
nfft	the FFT size. Default is 512.
lowfreq	lowest band edge of mel filters. In Hz, default is 0
highfreq	highest band edge of mel filters. In Hz, default is samplerate/2
preemph	apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97
returns	A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector. The second return value is the energy in each frame (total energy, unwindowed)
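
The roles of `nfilt`, `lowfreq` and `highfreq` can be illustrated by computing where the mel filter band edges fall. This is a sketch only: it assumes the widely used mel formula 2595 * log10(1 + f/700), and `mel_band_edges` is a hypothetical helper, not a library function.

```python
import math


def hz2mel(hz):
    """Convert a frequency in Hz to mels (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel2hz(mel):
    """Inverse of hz2mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(nfilt=26, samplerate=16000, lowfreq=0, highfreq=None):
    """Edge frequencies (Hz) of nfilt filters spaced evenly on the mel scale."""
    highfreq = highfreq or samplerate / 2          # default: Nyquist frequency
    low_mel, high_mel = hz2mel(lowfreq), hz2mel(highfreq)
    step = (high_mel - low_mel) / (nfilt + 1)
    # nfilt overlapping triangular filters need nfilt + 2 edge points
    return [mel2hz(low_mel + i * step) for i in range(nfilt + 2)]


edges = mel_band_edges()
print(len(edges), round(edges[0]), round(edges[-1]))  # 28 0 8000
```

The edges are evenly spaced in mels, so they are dense at low frequencies and sparse near `highfreq`.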

Reference

sample english.wav obtained from:

wget http://voyager.jpl.nasa.gov/spacecraft/audio/english.au
sox english.au -e signed-integer english.wav

Comments

Question about hamming window length

When using mfcc, the window parameter can be numpy.hamming, but numpy.hamming is a function that takes an int, the number of points in the output window (see the numpy.hamming documentation). However, frame_len is used in sigproc.py. Could you please explain how np.hamming works in mfcc? What if I want to use a specific window length? Thank you!

opened by leeeeeeo 14
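
One way to picture the mechanism asked about above is sketched below. It assumes that the window function is called with the frame length in samples and that the returned weights multiply each frame; `apply_window` is a hypothetical helper, and `hamming` reimplements the formula documented for numpy.hamming so the sketch needs no third-party packages.

```python
import math


def hamming(n):
    """Hamming window of n points (formula documented for numpy.hamming)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame, winfunc):
    """Multiply a frame by the window that winfunc builds for its length."""
    # The callable receives the frame length and must return that many weights.
    weights = winfunc(len(frame))
    return [s * w for s, w in zip(frame, weights)]


frame = [1.0] * 400                      # 25 ms at 16 kHz
windowed = apply_window(frame, hamming)
print(len(windowed), round(windowed[0], 2))  # 400 0.08
```

Under this assumption the window length is not chosen separately: it follows from `winlen * samplerate`, so a specific window length is obtained by choosing `winlen` accordingly.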
obtain the noise data

Hi, my aim is to get the noise data from an audio file recorded in a classroom, meaning that I want to remove the teacher's voice. What should I do?

opened by YueWenWu 7

How to ignore the NFFT warning

I don't want to increase the NFFT because I think the distortion is acceptable. Is there any way to ignore the annoying warnings?

I have tried using the warnings module, but it doesn't work.

opened by igo312 5
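
A possible approach, sketched under the assumption that the message is emitted through Python's `logging` module on the root logger rather than through `warnings` (which would explain why the `warnings` module has no effect). The context manager name `quiet_root_logger` is hypothetical.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_root_logger(level=logging.ERROR):
    """Temporarily hide root-logger messages below `level`."""
    root = logging.getLogger()
    previous = root.level
    root.setLevel(level)
    try:
        yield
    finally:
        root.setLevel(previous)   # restore the earlier level afterwards


with quiet_root_logger():
    # Stand-in for a feature-extraction call that logs a warning.
    logging.warning("frame length is greater than FFT size")  # not shown
```

Wrapping only the feature-extraction call keeps other warnings in the program visible.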
Why there's big difference using 16k and 44.1k sample rate

Hi, I recorded some WAV files at a 44.1k sample rate and then converted them to 16k with sox. I then used this Python script to calculate the MFCC features of the 44.1k file and the 16k file, but found that the results were completely different. For the same recording, whether converted to 44.1k or 16k, I think the result should be the same. Isn't that so?

opened by robotnc 5

What if the frame length is greater than NFFT?

I'm not an expert in this kind of stuff, so I'm sorry if this will be a waste of time.

From the numpy.fft.rfft documentation [in our case: n=NFFT, input=frame]: "Number of points along transformation axis in the input to use. If n is smaller than the length of the input, the input is cropped. If it is larger, the input is padded with zeros. If n is not given, the length of the input along the axis specified by axis is used."

Isn't this cropping something we want to avoid? As far as I've seen, there isn't any check in the code on how the frame size compares to NFFT.

opened by janluke 4
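
The interaction between sample rate, `winlen` and `nfft` raised above can be checked with a few lines of arithmetic. This sketch assumes the frame length is `winlen * samplerate` rounded to whole samples; the helper names are illustrative.

```python
def frame_length(samplerate, winlen=0.025):
    """Frame length in samples for a window of winlen seconds."""
    return int(winlen * samplerate + 0.5)


def smallest_sufficient_nfft(samplerate, winlen=0.025):
    """Smallest power of two that is at least the frame length."""
    nfft = 1
    while nfft < frame_length(samplerate, winlen):
        nfft *= 2
    return nfft


for rate in (16000, 44100):
    print(rate, frame_length(rate), smallest_sufficient_nfft(rate))
# 16000 400 512
# 44100 1103 2048
```

With the documented default of `nfft=512`, a 25 ms frame fits at 16 kHz but not at 44.1 kHz, where roughly half of each frame would be cropped unless a larger `nfft` is passed.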
inconsistent result with HTK
Hi,

I tried to compare the MFCC features generated using HTK, and those generated by python_speech_features. Unfortunately, somehow they always mismatch.

Below is the configuration I used for HTK

SOURCEFORMAT = NIST
TARGETKIND = MFCC_0
TARGETRATE = 100000
SAVECOMPRESSED = F
SAVEWITHCRC = F
WINDOWSIZE = 250000
USEHAMMING = F
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F

The configuration for python_speech_features is default.

I also tried adding USEPOWER = F/T, and the features obtained are still very different (in fact, for file TIMITcorpus/TIMIT/TRAIN/DR8/FBCG1/SX442, I got 358 frames from HTK but only 354 frames from python_speech_features).

Any insight? I'm a newbie in speech recognition and may have made some silly mistakes.

opened by zym1010 4

Troubles when porting
Hi, I am trying to port this algorithm to JavaScript and I am running into the following:

feat = numpy.dot(pspec,fb.T)

(https://github.com/jameslyons/python_speech_features/blob/master/features/base.py#L56)

The issue I am running into is that pspec and fb here should have the same dimensions, but for some reason they don't. Is there something in the algorithm, some kind of balance between parameters for example, which should cause these two arrays to have the same dimensions?
opened by mauritslamers 4
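
For a matrix product such as `numpy.dot(pspec, fb.T)`, the two operands need matching inner dimensions rather than identical shapes. The sketch below assumes `pspec` holds one power spectrum of `nfft/2 + 1` bins per frame and `fb` holds one row of `nfft/2 + 1` weights per filter; `dot_shape` is a hypothetical helper for checking shapes when porting.

```python
def dot_shape(a_shape, b_shape):
    """Shape of a 2-D matrix product, or raise if the inner sizes differ."""
    (rows, inner_a), (inner_b, cols) = a_shape, b_shape
    if inner_a != inner_b:
        raise ValueError(f"inner dimensions differ: {inner_a} != {inner_b}")
    return (rows, cols)


nfft, nfilt, numframes = 512, 26, 99
pspec_shape = (numframes, nfft // 2 + 1)   # assumed: one power spectrum per frame
fb_shape = (nfilt, nfft // 2 + 1)          # assumed: one row of bin weights per filter
fb_T_shape = fb_shape[::-1]                # transposing swaps the two axes
print(dot_shape(pspec_shape, fb_T_shape))  # (99, 26)
```

If the shapes disagree in a port, the usual cause would be a different `nfft` used for the spectrum and for the filterbank.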
[Question:] How to capture intensity or perceived loudness of a given audio file at regular intervals

If you are playing a song on your laptop, as you increase the volume from 0 to 100, the audio becomes louder and louder.

Say I have an .mp3 or .wav file. How do I capture this perceived loudness/intensity at regular intervals (maybe every 0.1 seconds) in the audio using python_speech_features?

Any advice is appreciated.

Thanks Vivek

opened by StanSilas 3
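
As a rough illustration, signal energy per interval can serve as a simple proxy for intensity, although it is not the same as perceived loudness. The sketch below uses only the standard library and a synthetic signal; the helper names are hypothetical. The fbank table above notes that the second return value of `fbank` is the energy in each frame, so setting `winlen` and `winstep` to 0.1 should give a comparable series.

```python
import math


def interval_energy(samples, samplerate, interval=0.1):
    """Sum of squared samples in consecutive chunks of `interval` seconds."""
    hop = int(round(interval * samplerate))
    return [sum(s * s for s in samples[i:i + hop])
            for i in range(0, len(samples), hop)]


def to_db(energy, floor=1e-12):
    """Energy in decibels, with a floor to avoid log(0) on silence."""
    return 10.0 * math.log10(max(energy, floor))


# Synthetic test signal: 0.2 s of a quiet tone followed by 0.2 s of a louder one.
rate = 8000
quiet = [0.1 * math.sin(2 * math.pi * 440 * n / rate) for n in range(1600)]
loud = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(1600)]
levels = [to_db(e) for e in interval_energy(quiet + loud, rate)]
print([round(v, 1) for v in levels])
```

The louder half of the signal shows up as a higher level in the last two intervals.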
can't get same result as compute-mfcc-feats.

compute-mfcc-feats --window-type=hamming --dither=0.0 --use-energy=false --sample-frequency=8000 --num-mel-bins=40 --num-ceps=40 --low-freq=40 --raw-energy=false --remove-dc-offset=false --high-freq=3800 scp:wav.scp ark,scp:feats.ark,feats.scp

mfcc(signal=sig, samplerate=rate, winlen=0.025, winstep=0.01, numcep=40, nfilt=40, lowfreq=40, highfreq=3800, appendEnergy=False, winfunc = lambda x: np.hamming(x) )

Is there some difference?

opened by bjtommychen 3

inconsistency with librosa

I compared the MFCCs from librosa with those from the python_speech_features package and got totally different results.

Which one is correct? librosa list of first frame coefficients:

[-395.07433842032867, -7.1149347948192963e-14, 3.5772469223901538e-14, -1.7476140989485184e-14, 3.1665300829452658e-14, -4.4214136625668904e-14, 6.7157035631648599e-14, 1.5013974158050108e-14, 2.9512326634271699e-14, 7.2275398398734558e-14, -1.5043753316598812e-13, -2.2358383003147776e-14, 1.6209256159527285e-13]

python_speech_features list of first frame coefficients:

[-169.91598446684722, 1.3219891974654943, 0.22216979881740945, -0.7368248288464827, 0.26268194306407788, 1.8470757480486224, 3.2670900572694435, 2.3726120692753563, 1.4983949546889608, 0.67862219561000914, -0.44705590991616034, 0.39184067109778226, -0.48048214059101707]

import librosa
import python_speech_features
from scipy.signal.windows import hann

n_mfcc = 13
n_mels = 40
n_fft = 512 # in librosa, win_length is assumed to be equal to n_fft implicitly
hop_length = 160
fmin = 0
fmax = None
y, sr = librosa.load(librosa.util.example_audio_file())
sr = 16000  # fake sample rate just to make the point

# librosa
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax)

# python_speech_features
# no preemph nor ceplifter in librosa, so setting to zero
# librosa default stft window is hann
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0, ceplifter=0, appendEnergy=False, winfunc=hann)


print(list(mfcc_librosa[:, 0]))
print(list(mfcc_speech[0, :]))

opened by chananshgong 3

Filterbank=80

It works fine for filterbank=40. But when I try 80, the third filterbank output is a constant value, like this: -36.04365, -36.04365, -36.04365, -36.04365, -36.04365, -36.04365

I have attached the image showing speech,spectrogram, logfilterbank for 80 filters

opened by madhavsund 3

Use another augmented assignment statement

:eyes: Some source code analysis tools can help to find opportunities for improving software components. :thought_balloon: I propose to increase the usage of augmented assignment statements accordingly.

diff --git a/python_speech_features/sigproc.py b/python_speech_features/sigproc.py
index a786c4f..b8729ea 100644
--- a/python_speech_features/sigproc.py
+++ b/python_speech_features/sigproc.py
@@ -84,7 +84,7 @@ def deframesig(frames, siglen, frame_len, frame_step, winfunc=lambda x: numpy.on
                                                indices[i, :]] + win + 1e-15  # add a little bit so it is never zero
         rec_signal[indices[i, :]] = rec_signal[indices[i, :]] + frames[i, :]
 
-    rec_signal = rec_signal / window_correction
+    rec_signal /= window_correction
     return rec_signal[0:siglen]

opened by elfring 0

High CPU Utilization
I observe that extracting MFCC or MFB features uses almost all of the CPU, at 100% capacity. I am sure that extracting these features doesn't require so much computation.

I am processing only one file at a time and not using any parallelization. Here is the code I am using:

import glob
import numpy as np
import scipy.io as sio
import scipy.io.wavfile
from python_speech_features import *

filelist = glob.glob("/home/divraj/scribetech/dataset/voxceleb1/test/wav/*/*/*.wav")
for file in filelist:
    sr, audio = sio.wavfile.read(file)
    features, energies = fbank(audio, samplerate=16000, nfilt=40, winlen=0.025, winfunc=np.hamming)

What is the reason for high CPU utilization?
opened by divyeshrajpura4114 2

viseme generation

@jameslyons thank you so much for this repo. Hi everyone, I am trying to produce visemes from audio. Can you please guide me on how I can generate visemes using this repo, or point me to any other repo or relevant work that I can extend towards my main goal? I would really appreciate any help.

opened by AhmadManzoor 0

[Question:] inverse fbank back to wav

Hi, thanks for the library. I use it to compute the fbank features, do some processing on them, and then get new ones. Is there a way to convert the new fbank features back to a wav? I have the original starting file, so theoretically it should be possible.

opened by matdtr 0

Minor issue on round vs. floor

In this line:

https://github.com/jameslyons/python_speech_features/blob/9a2d76c6336d969d51ad3aa0d129b99297dcf55e/python_speech_features/base.py#L169

I think you are assuming that np.floor(t+1) = np.round(t), but that is not true. I think you want:

bin = numpy.round((nfft)*mel2hz(melpoints)/samplerate)
bin = numpy.floor((nfft+0.5)*mel2hz(melpoints)/samplerate)

This is a minor point because they often give the same result and it doesn't matter in practice. I just found this point a little confusing in your write-up.

Thanks for the blogpost and this code!

opened by keithchugg 0
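
A small numeric check of the point above. It assumes the linked line computes `floor((nfft+1)*f/samplerate)`; the function names and the example frequencies are chosen for illustration only.

```python
import math


def bin_floor(freq_hz, nfft=512, samplerate=16000):
    """FFT bin index using the floor((nfft+1)*f/samplerate) form (assumed)."""
    return math.floor((nfft + 1) * freq_hz / samplerate)


def bin_round(freq_hz, nfft=512, samplerate=16000):
    """FFT bin index using round(nfft*f/samplerate)."""
    return round(nfft * freq_hz / samplerate)


for f in (1000, 3020):
    print(f, bin_floor(f), bin_round(f))
# 1000 32 32
# 3020 96 97
```

The two forms agree for many frequencies and differ by one bin for others, which matches the remark that the difference rarely matters in practice.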

Releases (0.6.1)

0.6.1 (Jan 14, 2020)

v0.6.0 (Jan 14, 2020)

v0.6 (Jan 14, 2020): this is for creating a doi on zenodo

Owner

James Lyons

GitHub

C++ library for audio and music analysis, description and synthesis, including Python bindings

Essentia Essentia is an open-source C++ library for audio analysis and audio-based music information retrieval released under the Affero GPL license.

Music Technology Group - Universitat Pompeu Fabra

2.3k Jan 3, 2023

commonfate 📦 - Common Fate Model and Transform.

Common Fate Transform and Model for Python This package is a python implementation of the Common Fate Transform and Model to be used for audio source

18 Jan 8, 2022

This is an AI that runs in the terminal. It is a voice assistant that can do common activities and can also help in your coding doubts like

1 Nov 5, 2021

Using python to generate a bat script of repetitive lines of code that differ in some way but can sort out a group of audio files according to their common names

Batch Sorting Using python to generate a bat script of repetitive lines of code that differ in some way but can sort out a group of audio files accord

1 Oct 29, 2021

:speech_balloon: SpeechPy - A Library for Speech Processing and Recognition: http://speechpy.readthedocs.io/en/latest/

SpeechPy Official Project Documentation Table of Contents Documentation Which Python versions are supported Citation How to Install? Local Installatio

870 Dec 27, 2022

Speech recognition module for Python, supporting several engines and APIs, online and offline.

SpeechRecognition Library for performing speech recognition, with support for several engines and APIs, online and offline. Speech recognition engine/

6.7k Jan 8, 2023

praudio provides audio preprocessing framework for Deep Learning audio applications

praudio provides objects and a script for performing complex preprocessing operations on entire audio datasets with one command.

105 Dec 26, 2022

Audio features extraction

Yaafe Yet Another Audio Feature Extractor Build status Branch master : Branch dev : Anaconda : Install Conda Yaafe can be easily install with conda. T

231 Dec 26, 2022

spafe: Simplified Python Audio-Features Extraction

spafe aims to simplify features extractions from mono audio files. The library can extract of the following features: BFCC, LFCC, LPC, LPCC, MFCC, IMFCC, MSRCC, NGCC, PNCC, PSRCC, PLP, RPLP, Frequency-stats etc. It also provides various filterbank modules (Mel, Bark and Gammatone filterbanks) and other spectral statistics.

310 Jan 1, 2023

Voice package for Pycord adding extra features.

VoiceIO Voice package for Pycord adding extra features. Example Down bellow is an example of what you can currently do. import voiceio process = voic

1 Dec 24, 2021

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Project DeepSpeech DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Spee

20.8k Jan 3, 2023

Conferencing Speech Challenge

Speech Algorithms Collections

Simple, hackable offline speech to text - using the VOSK-API.

Voicefixer aims at the restoration of human speech regardless how serious its degraded.

Some utils for auto speech recognition

Music player and music library manager for Linux, Windows, and macOS

Python library for audio and music analysis

Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications