:speech_balloon: SpeechPy - A Library for Speech Processing and Recognition: http://speechpy.readthedocs.io/en/latest/

Amirsina Torfi

Last update: Dec 27, 2022


SpeechPy Official Project Documentation


Documentation
Which Python versions are supported
Citation
How to Install?
- Local Installation
- Pypi
What Features are supported?
Post Processing
- Global cepstral mean and variance normalization (CMVN)
- Local cepstral mean and variance normalization (CMVN) over a sliding window
Tests
Example
Dependencies
Acknowledgements
Contributing
Disclaimer

Documentation

This library provides the most frequently used speech features, including MFCCs and filterbank energies, along with the log-energy of filterbanks. If you would like to know what MFCCs are and how they are generated, please refer to this wiki page.

Please refer to the following links for further information:

SpeechPy Official Project Documentation

Paper

Which Python versions are supported

Currently, the package has been tested and verified using Python 2.7, 3.4 and 3.5.

Citation

If you use this package, please cite it as follows:

@article{torfi2018speechpy,
  title={SpeechPy-A Library for Speech Processing and Recognition},
  author={Torfi, Amirsina},
  journal={arXiv preprint arXiv:1803.01094},
  year={2018}
 }

How to Install?

There are two ways to install this package: local installation and PyPI.

Local Installation

For local installation, first clone the repository:

git clone https://github.com/astorfi/speech_feature_extraction.git

After cloning the repository, change to the repository root directory and execute:

python setup.py develop

Pypi

The package is available on PyPI. For direct installation, simply execute the following:

pip install speechpy

What Features are supported?

Mel Frequency Cepstral Coefficients (MFCCs)
Filterbank Energies
Log Filterbank Energies

Please refer to SpeechPy Official Project Documentation for details about the supported features.

MFCC Features

The parameters supported for generating MFCC features are listed in the signature and docstring of the related function:

def mfcc(signal, sampling_frequency, frame_length=0.020, frame_stride=0.01, num_cepstral=13,
       num_filters=40, fft_length=512, low_frequency=0, high_frequency=None, dc_elimination=True):
      """Compute MFCC features from an audio signal.
      :param signal: the audio signal from which to compute features. Should be an N x 1 array
      :param sampling_frequency: the sampling frequency of the signal we are working with.
      :param frame_length: the length of each frame in seconds. Default is 0.020s
      :param frame_stride: the step between successive frames in seconds. Default is 0.01s (50% overlap with the default frame_length)
      :param num_filters: the number of filters in the filterbank, default 40.
      :param fft_length: number of FFT points. Default is 512.
      :param low_frequency: lowest band edge of mel filters. In Hz, default is 0.
      :param high_frequency: highest band edge of mel filters. In Hz, default is samplerate/2
      :param num_cepstral: Number of cepstral coefficients.
      :param dc_elimination: If the first dc component should be eliminated or not.
      :returns: A numpy array of size (num_frames x num_cepstral) containing mfcc features.
      """

Filterbank Energy Features

def mfe(signal, sampling_frequency, frame_length=0.020, frame_stride=0.01,
          num_filters=40, fft_length=512, low_frequency=0, high_frequency=None):
    """Compute Mel-filterbank energy features from an audio signal.
    :param signal: the audio signal from which to compute features. Should be an N x 1 array
    :param sampling_frequency: the sampling frequency of the signal we are working with.
    :param frame_length: the length of each frame in seconds. Default is 0.020s
    :param frame_stride: the step between successive frames in seconds. Default is 0.01s (50% overlap with the default frame_length)
    :param num_filters: the number of filters in the filterbank, default 40.
    :param fft_length: number of FFT points. Default is 512.
    :param low_frequency: lowest band edge of mel filters. In Hz, default is 0.
    :param high_frequency: highest band edge of mel filters. In Hz, default is samplerate/2
    :returns:
              features: the energy of filterbank: num_frames x num_filters
              frame_energies: the energy of each frame: num_frames x 1
    """

log - Filterbank Energy Features

The attributes for log filterbank energies are the same as those for filterbank energies.

def lmfe(signal, sampling_frequency, frame_length=0.020, frame_stride=0.01,
     num_filters=40, fft_length=512, low_frequency=0, high_frequency=None):
    """Compute log Mel-filterbank energy features from an audio signal.
    :param signal: the audio signal from which to compute features. Should be an N x 1 array
    :param sampling_frequency: the sampling frequency of the signal we are working with.
    :param frame_length: the length of each frame in seconds. Default is 0.020s
    :param frame_stride: the step between successive frames in seconds. Default is 0.01s (50% overlap with the default frame_length)
    :param num_filters: the number of filters in the filterbank, default 40.
    :param fft_length: number of FFT points. Default is 512.
    :param low_frequency: lowest band edge of mel filters. In Hz, default is 0.
    :param high_frequency: highest band edge of mel filters. In Hz, default is samplerate/2
    :returns:
              features: the log energy of filterbank: num_frames x num_filters
              frame_log_energies: the log energy of each frame: num_frames x 1
    """

Stack Frames

The stack_frames function splits the signal into a stack of (possibly overlapping) frames.

def stack_frames(sig, sampling_frequency, frame_length=0.020, frame_stride=0.020, Filter=lambda x: numpy.ones((x,)),
         zero_padding=True):
    """Frame a signal into overlapping frames.
    :param sig: The audio signal to frame of size (N,).
    :param sampling_frequency: The sampling frequency of the signal.
    :param frame_length: The length of each frame in seconds.
    :param frame_stride: The stride between frames in seconds.
    :param Filter: The time-domain filter applied to each frame. By default it is all ones, so nothing is changed.
    :param zero_padding: If the number of samples is not a multiple of frame_length (in samples), the signal is
                         zero-padded to generate the last frame.
    :returns: Array of frames. size: number_of_frames x frame_len.
    """
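The framing described in the docstring above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not speechpy's implementation: the helper name frame_signal is hypothetical, no window filter is applied, and the padding rule is one reasonable choice among several.

```python
import math
import numpy as np

def frame_signal(sig, sampling_frequency, frame_length=0.020, frame_stride=0.010,
                 zero_padding=True):
    """Hypothetical framing helper; illustrative only, not speechpy's code."""
    sig = np.asarray(sig, dtype=float)
    frame_len = int(round(sampling_frequency * frame_length))  # samples per frame
    stride = int(round(sampling_frequency * frame_stride))     # samples between frame starts
    if zero_padding:
        # Use enough frames to cover every sample, padding the tail with zeros.
        num_frames = max(1, int(math.ceil((len(sig) - frame_len) / float(stride))) + 1)
        padded_len = (num_frames - 1) * stride + frame_len
        sig = np.concatenate([sig, np.zeros(padded_len - len(sig))])
    else:
        # Keep complete frames only; trailing samples are dropped.
        num_frames = max(0, (len(sig) - frame_len) // stride + 1)
    # Row i holds samples [i * stride, i * stride + frame_len).
    idx = np.arange(frame_len)[None, :] + stride * np.arange(num_frames)[:, None]
    return sig[idx]

# 55 samples at 1 kHz, 20 ms frames, 10 ms stride
frames = frame_signal(np.arange(55.0), 1000)
print(frames.shape)
```

With zero padding the last frame is completed with zeros; without it, the five trailing samples that do not fill a whole frame are discarded.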

Post Processing

Some post-processing operations are supported in speechpy.

Global cepstral mean and variance normalization (CMVN)

This function performs global cepstral mean and variance normalization (CMVN) to remove the channel effects. The code assumes that there is one observation per row.

def cmvn(vec, variance_normalization=False):
    """
    This function is aimed to perform global ``cepstral mean and variance normalization``
    (CMVN) on input feature vector "vec". The code assumes that there is one observation per row.

    :param:
          vec: input feature matrix (size:(num_observation,num_features))
          variance_normalization: If the variance normalization should be performed or not.
    :return:
          The mean(or mean+variance) normalized feature vector.
    """
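As a rough illustration of what global CMVN computes, here is a minimal NumPy sketch. The function name global_cmvn and the eps guard are assumptions made for this example; the library's own cmvn may differ in detail.

```python
import numpy as np

def global_cmvn(vec, variance_normalization=False, eps=1e-12):
    """Illustrative global CMVN: one observation per row (not speechpy's code)."""
    vec = np.asarray(vec, dtype=float)
    centered = vec - vec.mean(axis=0)          # remove the per-feature mean
    if not variance_normalization:
        return centered
    # eps is an assumed guard against zero-variance feature columns.
    return centered / (vec.std(axis=0) + eps)

features = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
mean_norm = global_cmvn(features)
mean_var_norm = global_cmvn(features, variance_normalization=True)
```

After mean normalization every feature column averages to zero; with variance normalization each column also has (approximately) unit standard deviation.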

Local cepstral mean and variance normalization (CMVN) over a sliding window

This function performs local cepstral mean and variance normalization (CMVN) over sliding windows. The code assumes that there is one observation per row.

def cmvnw(vec, win_size=301, variance_normalization=False):
    """
    This function is aimed to perform local cepstral mean and variance normalization (CMVN) over a sliding
    window on input feature vector "vec". The code assumes that there is one observation per row.
    :param
          vec: input feature matrix (size:(num_observation,num_features))
          win_size: The size of the sliding window for local normalization; it should be odd.
                    default=301, which is around 3s if a 100 Hz frame rate is considered (== 10ms frame stride)
          variance_normalization: If the variance normalization should be performed or not.

    :return: The mean(or mean+variance) normalized feature vector.
    """
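The sliding-window variant can be sketched in the same spirit. This is a simple, unoptimized illustration: the name sliding_cmvn, the edge-replication padding, and the eps guard are assumptions for the example rather than details taken from the library.

```python
import numpy as np

def sliding_cmvn(vec, win_size=301, variance_normalization=False, eps=1e-12):
    """Illustrative local CMVN over a centered sliding window (not speechpy's code)."""
    vec = np.asarray(vec, dtype=float)
    if win_size % 2 == 0:
        raise ValueError("win_size should be odd")
    half = win_size // 2
    # Assumed boundary handling: repeat the first/last frame so each window is full.
    padded = np.pad(vec, ((half, half), (0, 0)), mode="edge")
    out = np.empty_like(vec)
    for i in range(vec.shape[0]):
        window = padded[i:i + win_size]        # frames centered on frame i
        out[i] = vec[i] - window.mean(axis=0)
        if variance_normalization:
            out[i] /= window.std(axis=0) + eps
    return out

ramp = np.arange(5.0).reshape(5, 1)
local_norm = sliding_cmvn(ramp, win_size=3)
```

Each frame is normalized using only the statistics of its neighbours, so slow drifts in the features are removed while short-term variation is kept.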

Tests

SpeechPy includes some unit tests. To run the tests, cd into the speechpy/tests directory and run:

python -m pytest

The only additional requirement for running the tests is pytest.

Example

The test example can be seen in test/test.py as below:

import scipy.io.wavfile as wav
import numpy as np
import speechpy
import os

file_name = os.path.join(os.path.dirname(os.path.abspath(__file__)),'Alesis-Sanctuary-QCard-AcoustcBas-C2.wav')
fs, signal = wav.read(file_name)
signal = signal[:,0]

# Example of pre-emphasizing.
signal_preemphasized = speechpy.processing.preemphasis(signal, cof=0.98)

# Example of stacking frames
frames = speechpy.processing.stack_frames(signal, sampling_frequency=fs, frame_length=0.020, frame_stride=0.01, filter=lambda x: np.ones((x,)),
         zero_padding=True)

# Example of extracting power spectrum
power_spectrum = speechpy.processing.power_spectrum(frames, fft_points=512)
print('power spectrum shape=', power_spectrum.shape)

############# Extract MFCC features #############
mfcc = speechpy.feature.mfcc(signal, sampling_frequency=fs, frame_length=0.020, frame_stride=0.01,
             num_filters=40, fft_length=512, low_frequency=0, high_frequency=None)
mfcc_cmvn = speechpy.processing.cmvnw(mfcc,win_size=301,variance_normalization=True)
print('mfcc(mean + variance normalized) feature shape=', mfcc_cmvn.shape)

mfcc_feature_cube = speechpy.feature.extract_derivative_feature(mfcc)
print('mfcc feature cube shape=', mfcc_feature_cube.shape)

############# Extract logenergy features #############
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs, frame_length=0.020, frame_stride=0.01,
             num_filters=40, fft_length=512, low_frequency=0, high_frequency=None)
logenergy_feature_cube = speechpy.feature.extract_derivative_feature(logenergy)
print('logenergy features=', logenergy.shape)

To extract the features, the signal samples are first stacked into frames. The features are then computed for each frame in the stacked frames collection.

Dependencies

SciPy and NumPy are the required dependencies; they are installed automatically by running the setup.py file.

Acknowledgements

This work is based upon work supported by the Center for Identification Technology Research and the National Science Foundation under Grant #1650474.

Contributing

When contributing to this repository, you are more than welcome to discuss your feedback with any of the owners of this repository. For typos, please do not create a pull request. Instead, declare them in issues or email the repository owner. For technical and conceptual questions please feel free to directly contact the repository owner. Before asking general questions related to the concepts and techniques provided in this project, please make sure to read and understand its associated paper.

Pull Request Process

Please consider the following criteria to help us review your contribution more effectively:

The pull request is mainly expected to be a code script suggestion or improvement.
A pull request related to non-code-script sections is expected to make a significant difference in the documentation. Otherwise, it is expected to be announced in the issues section.
Ensure any install or build dependencies are removed before the end of the layer when doing a build and creating a pull request.
Add comments with details of changes to the interface, this includes new environment variables, exposed ports, useful file locations and container parameters.
You may merge the Pull Request in once you have the sign-off of at least one other developer, or if you do not have permission to do that, you may request the owner to merge it for you if you believe all checks are passed.

Declaring issues

For declaring issues, you can directly email the repository owner. However, preferably please create an issue as it might be the issue that other repository followers may encounter. That way, the question to other developers will be answered as well.

Final Note

We are looking forward to your kind feedback. Please help us to improve this open source project and make our work better. For contribution, please create a pull request and we will investigate it promptly. Once again, we appreciate your kind feedback and elaborate code inspections.

Disclaimer

Although it has changed dramatically, some portion of this library is inspired by the python speech features library.

We claim the following advantages for our library:

More accurate operations have been performed for the mel-frequency calculations.
The package supports different Python versions.
The features are generated in a more organized way, as cubic features.
The package is well-tested and integrated.
The package is up-to-date and actively developed.
The package has been used for research purposes.
Exceptions and extreme cases are handled in this library.

Comments

Handle small signal sizes better

Currently, passing a signal whose size is equal to or less than the frame length throws an exception. This change makes a signal of exactly the frame length produce a normal result, and a signal smaller than that produce an empty array of the correct dimensions.

With change:

import numpy as np
from speechpy.main import mfcc
mfcc(np.ones((999)), 1000, 1, 1, 2)  # array([], shape=(0, 2), dtype=float64)
mfcc(np.ones((1000)), 1000, 1, 1, 2)  # array([[ 6.23832463,  0.        ]])
mfcc(np.ones((1999)), 1000, 1, 1, 2)  # array([[ 6.23832463,  0.        ]])
mfcc(np.ones((2000)), 1000, 1, 1, 2)  # array([[ 6.23832463,  0.        ], [ 6.23832463,  0.        ]])

Before:

import numpy as np
from speechpy.main import mfcc
mfcc(np.ones((999)), 1000, 1, 1, 2)  # UnboundLocalError: local variable 'signal' referenced before assignment
mfcc(np.ones((1000)), 1000, 1, 1, 2)  # UnboundLocalError: local variable 'signal' referenced before assignment
mfcc(np.ones((1999)), 1000, 1, 1, 2)  # array([[ 6.23832463,  0.        ]])
mfcc(np.ones((2000)), 1000, 1, 1, 2)  # array([[ 6.23832463,  0.        ], [ 6.23832463,  0.        ]])

opened by MatthewScholefield 10

no module main

Hi, I am trying to use this on Windows 7 64-bit with Python 3.5, installed via pip or via git. There are no errors during the pip install.

When I import speechpy, I get this error:

import speechpy

ImportError                               Traceback (most recent call last)
in ()
----> 1 import speechpy

c:\anaconda3\lib\site-packages\speechpy\__init__.py in ()
----> 1 from main import *
      2 from processing import *

ImportError: No module named 'main'

opened by alain2208 10
cmvnw: Division by zero

I encountered the following warning during the variance normalization of the speech features:

RuntimeWarning: divide by zero encountered in true_divide

This is probably not the desired behavior, though I don't know what the best solution is in this case.

opened by thomasZen 6
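One common way to avoid the divide-by-zero warning reported above is to put a floor under the standard deviation before dividing. The sketch below is only an assumption about how such a guard could look (the helper name and the eps value are hypothetical); the 2.2 release notes mention zero-variance handling but do not describe the method used.

```python
import numpy as np

def safe_variance_normalize(features, eps=1e-8):
    """Hypothetical guard: floor the standard deviation before dividing."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    std = features.std(axis=0)
    # Columns with (near) zero variance are divided by eps instead of 0.
    return centered / np.maximum(std, eps)

silent = np.zeros((4, 3))   # e.g. features of digital silence: zero variance
normalized = safe_variance_normalize(silent)
print(np.isfinite(normalized).all())
```

Constant feature columns then come out as zeros rather than inf or NaN.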
Fixed a small bug rarely causing type mismatch

When you use the derivative_extraction function, you are likely to concatenate its output with your features, and your features' type might have been chosen carefully with memory usage in mind.

For example, in my project I have a large dataset (which is normal for the use case of this library) stored as float32, since its accuracy is sufficient and it halves the memory footprint compared to float64. But when I used derivative_extraction, it calculated its values in float64, and when I concatenated the derivatives with the original features, all the values were converted to float64 and the system's memory usage doubled, which wasn't immediately apparent.

This is not that critical, and it could be fixed by converting types outside the library call, but why bother with the inconvenience when there is no need for the derivatives array to differ in datatype from the incoming features?

opened by omaraltayyan 4
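The upcasting described in this pull request is standard NumPy type promotion and is easy to reproduce without the library. In the sketch below the arrays are stand-ins (the "derivatives" are just zeros), so only the dtype behavior is meaningful.

```python
import numpy as np

features = np.ones((100, 13), dtype=np.float32)

# Stand-ins for derivative arrays: one allocated with NumPy's default dtype,
# one allocated with the dtype of the incoming features.
deriv_default = np.zeros(features.shape)                        # float64
deriv_matched = np.zeros(features.shape, dtype=features.dtype)  # float32

mixed = np.concatenate([features, deriv_default], axis=1)
matched = np.concatenate([features, deriv_matched], axis=1)

print(mixed.dtype, mixed.nbytes)
print(matched.dtype, matched.nbytes)
```

Concatenating float32 with float64 promotes the whole result to float64, which doubles the memory used; allocating the derivative array with the input's dtype keeps everything in float32.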
Re-add speed improvements without error

This pulls in #11 again, but disabling the caching in Python 2 rather than trying to use a backport of it. The reason is that in some installations, despite being installed, Python 2 will fail on the backports.functools_lru_cache import. This fixes #15.

@astorfi Let me know if you still see the error with this. Thanks

opened by MatthewScholefield 4
Speed improvements
Add function caching to speed up computation of functions called with the same parameters

Remove use of np.round since it's slower for primitives

From testing in my use case, I get the following speed increases relative to the given function:

stack_frames()

Removing np.round: 20%

Adding create_indices lru: 60%

mfcc()

Adding lru to filterbank: 310%
opened by MatthewScholefield 4
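The caching idea from the two pull requests above can be sketched with functools.lru_cache, falling back to no caching where it is unavailable (as on Python 2). The helper mel_points and its arguments are hypothetical and stand in for any function that is called repeatedly with the same parameters; this is not the library's code.

```python
import functools
import math

try:
    lru_cache = functools.lru_cache          # available on Python 3.2+
except AttributeError:
    def lru_cache(maxsize=None):
        """Fallback for Python 2: a decorator that does no caching."""
        return lambda func: func

@lru_cache(maxsize=32)
def mel_points(num_filters, low_frequency, high_frequency):
    """Hypothetical helper: band edges in Hz, equally spaced on the mel scale."""
    low_mel = 1127.0 * math.log(1.0 + low_frequency / 700.0)
    high_mel = 1127.0 * math.log(1.0 + high_frequency / 700.0)
    step = (high_mel - low_mel) / (num_filters + 1)
    # Return an immutable tuple so callers cannot modify the cached value.
    return tuple(700.0 * (math.exp((low_mel + i * step) / 1127.0) - 1.0)
                 for i in range(num_filters + 2))

first = mel_points(40, 0.0, 8000.0)
second = mel_points(40, 0.0, 8000.0)   # served from the cache on Python 3
```

Arguments must be hashable for lru_cache to work, and returning an immutable object avoids one caller accidentally changing the value that later callers receive.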

Installing release 2.3 appears to install 2.2

I have installed SpeechPy as part of my review of the package. Installing the package, I found a minor issue with the version number: I explicitly checked out the '2.3' release for installation whereas the install script output refers to version 2.2:

$ python setup.py develop
running develop
running egg_info
creating speechpy.egg-info
writing speechpy.egg-info/PKG-INFO
writing dependency_links to speechpy.egg-info/dependency_links.txt
writing requirements to speechpy.egg-info/requires.txt
writing top-level names to speechpy.egg-info/top_level.txt
writing manifest file 'speechpy.egg-info/SOURCES.txt'
reading manifest file 'speechpy.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'speechpy.egg-info/SOURCES.txt'
running build_ext
Creating /home/tha/.conda/envs/joss-review-tmp/lib/python3.6/site-packages/speechpy.egg-link (link to .)
Adding speechpy 2.2 to easy-install.pth file

Installed /home/tha/other-repo/speechpy
Processing dependencies for speechpy==2.2
Searching for numpy==1.14.3
Best match: numpy 1.14.3
Adding numpy 1.14.3 to easy-install.pth file

Using /home/tha/.conda/envs/joss-review-tmp/lib/python3.6/site-packages
Searching for scipy==1.1.0
Best match: scipy 1.1.0
Adding scipy 1.1.0 to easy-install.pth file

Using /home/tha/.conda/envs/joss-review-tmp/lib/python3.6/site-packages
Finished processing dependencies for speechpy==2.2

I suspect this is just some configuration text string that was not properly updated for the 2.3 release?

opened by ThomasA 3

Correct wav format?

I'm trying to extract mfcc features from audio of a video file.

I tried FFMPEG:

def extractAudioFromVideo(video, audio_out="out.wav"):
	cmd="ffmpeg -i {} -acodec pcm_s16le -ac 1 -ar 16000 {}".format(video, audio_out)
	os.system(cmd)
	return audio_out

def extractAudioMFCC(file_name="out.wav"):
	fs, signal = wav.read(file_name)
	signal = signal[:,0]

	############# Extract MFCC features #############
	mfcc = speechpy.feature.mfcc(signal, sampling_frequency=fs, frame_length=0.020, frame_stride=0.01,num_filters=40, fft_length=512, low_frequency=0, high_frequency=None)
	mfcc_cmvn = speechpy.processing.cmvnw(mfcc,win_size=301,variance_normalization=True)
	print('mfcc(mean + variance normalized) feature shape=', mfcc_cmvn.shape)


extractAudioFromVideo("test.mp4", audio_out="out.wav")
extractAudioMFCC("out.wav")

The error I get:

Traceback (most recent call last):
  File "TWK.py", line 99, in <module>
    extractAudioMFCC()
  File "TWK.py", line 22, in extractAudioMFCC
    signal = signal[:,0]
IndexError: too many indices for array

Am I using the wrong wav format?

opened by taewookim 3
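A likely cause of the IndexError above: scipy.io.wavfile.read returns a 1-D array for mono files and a 2-D (samples x channels) array for multi-channel files, and the ffmpeg command in the question requests mono output (-ac 1). A small guard, sketched here with a hypothetical helper name, handles both cases.

```python
import numpy as np

def to_mono(signal):
    """Hypothetical helper: return one channel whether the input is mono or multi-channel."""
    signal = np.asarray(signal)
    if signal.ndim == 1:
        return signal          # already mono: nothing to index
    return signal[:, 0]        # multi-channel: keep the first channel

mono = np.zeros(16000, dtype=np.int16)          # shape a mono WAV would have
stereo = np.zeros((16000, 2), dtype=np.int16)   # shape a stereo WAV would have
print(to_mono(mono).shape, to_mono(stereo).shape)
```

Replacing the unconditional signal[:,0] with a check like this lets the same code accept both mono and stereo recordings.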

Update citation. Closes issue #2.

As per issue #2. The citation text content is from https://zenodo.org/record/840395/export/hx via https://zenodo.org/badge/latestdoi/87262342, which is the link of the button you mentioned in your comment on issue #2.

opened by yarwelp 3
MFCC Feature
Respected Sir, Greetings of the day !!!

Sir first of all thank you so much for such amazing library you shared with us.

Sir I am using SpeechPy library for extracting the MFCC of audio signal.

Sir I have an audio signal of 16kHz, 32bit float PCM, Mono channel. I am using framelength 100ms with 50% overlapping.

I used below code for extraction of MFCC,

fs, signal = wav.read("b0.wav")
signal = signal / abs(max(signal))  # Convert into double
mfcc = speechpy.feature.mfcc(signal, sampling_frequency=fs, frame_length=0.1, frame_stride=0.05, num_filters=40, fft_length=2048, low_frequency=0, high_frequency=None)

Respected Sir, I got confusion because I used python_speech_features library also to extract mfcc and for verification of my result. But both are giving different result.

mfcc1 = python_speech_features.base.mfcc(signal, samplerate=fs, winlen=0.1, winstep=0.05, numcep=13, nfilt=26, nfft=2048, lowfreq=0, highfreq=None, preemph=0.97, ceplifter=22, appendEnergy=True)

I wanted to know where I am doing mistake.

My Questions Are:

Is the above code sequence correct for extracting MFCCs using the speechpy library?

When using the speechpy.feature.mfcc function, is the preemphasis operation not performed? Is that the reason both libraries are giving different results?

Should we perform preemphasis separately using the code below, and then give the output of preemphasis to mfcc?

signal_preemphasized = speechpy.processing.preemphasis(signal, cof=0.98)

Why are both libraries giving different results?

Its my humble request respected Sir Please response to my query. I am not getting clarification. What to use and which is correct.

I am sorry for my poor English.
opened by ghost 2
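For reference, first-order pre-emphasis is usually written as y[n] = x[n] - cof * x[n-1]. The sketch below is a generic NumPy version under that assumption, with a hypothetical function name; it is not a copy of either library's routine and it assumes a non-empty signal. Note also that the two calls quoted in the question use different settings (40 versus 26 filters, plus ceplifter=22 and appendEnergy=True in the second), which by themselves would lead to different numbers.

```python
import numpy as np

def preemphasize(signal, cof=0.98):
    """Generic first-order pre-emphasis: y[0] = x[0], y[n] = x[n] - cof * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - cof * signal[:-1])

emphasized = preemphasize([1.0, 1.0, 1.0], cof=0.5)
print(emphasized)
```

A constant input is attenuated after the first sample, which is the intended effect: low-frequency content is reduced relative to high-frequency content.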
A feature request：How can I judge user intentions ？

Hello， I have a need for speech recognition now, and I have read many documents of this project, but I am still not sure whether this project can meet my need：

Now I have hundreds of thousands of wav audio files, which are only one to five seconds and divided into two categories, one is positive answer, the other is negative answer, but I do not have the text information corresponding to each wav file, now my demand is whether I can use this project to make intention judgment？

For example, if I input an audio data, then I can get the intention expressed by this audio, but there is no text corresponding to this audio

Any help will be greatly appreciated！

opened by gangyahaidao 2

Fixed some bugs in mel filterbanks.

I wrote some code to compare the mel filterbanks in librosa, python speech feature and speechpy, and found two problems.

1. The initialization of the band edge of the Mel filterbanks may be wrong.
2. The calculation to convert frequency to fft bin number is wrong.

import matplotlib.pyplot as plt
import numpy as np
import librosa
import python_speech_features as psf
import speechpy

n_fft = 256        # The number of FFT components
n_filter = 20      # The number of filters in the filterbank
samplerate = 16000 # The samplerate of the signal
low_freq = 0       # The lowest band edge of the filters
high_freq = 8000   # The highest band edge of the filters

librosa_fbanks = librosa.filters.mel(
    sr=samplerate, n_fft=n_fft, n_mels=n_filter, fmin=low_freq, fmax=high_freq, norm=None)
print("Librosa mel fbanks shape:{}".format(librosa_fbanks.shape))

psf_fbanks = psf.base.get_filterbanks(
    nfilt=n_filter, nfft=n_fft, samplerate=samplerate, lowfreq=low_freq, highfreq=high_freq)
print("PSF mel fbanks shape:{}".format(psf_fbanks.shape))

coefficients = int(n_fft/2 + 1)
speechpy_fbanks = speechpy.feature.filterbanks(
    n_filter, coefficients, sampling_freq=samplerate, low_freq=low_freq, high_freq=high_freq)
print("Speechpy mel fbanks shape:{}".format(speechpy_fbanks.shape))

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 10))

x = np.array(list(range(speechpy_fbanks.shape[1])))
x = x * (samplerate / (n_fft + 1))

for i in range(librosa_fbanks.shape[0]):
    axes[0].plot(x, librosa_fbanks[i])
axes[0].set_title("librosa mel fbanks")

for i in range(psf_fbanks.shape[0]):
    axes[1].plot(x, psf_fbanks[i])
axes[1].set_title("psf mel fbanks")

for i in range(speechpy_fbanks.shape[0]):
    axes[2].plot(x, speechpy_fbanks[i])
axes[2].set_title("speechpy mel fbanks")

plt.show()

As shown in the figure, the parameter setting of low_freq of filterbanks of speechpy is invalid, and the filterbanks only covers half of the frequency band.

The first problem is caused by

low_freq = low_freq or 300.

When low_freq is 0, low_freq or 300 will return 300 instead of 0.
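The pitfall is general Python behavior: `x or default` replaces every falsy value, including 0, not only None. A minimal sketch with hypothetical function names:

```python
def pick_low_freq_buggy(low_freq=None):
    # 0 is falsy, so an explicit 0 Hz is silently replaced by 300.
    return low_freq or 300

def pick_low_freq_fixed(low_freq=None):
    # Only fall back to the default when no value was given at all.
    return 300 if low_freq is None else low_freq

print(pick_low_freq_buggy(0), pick_low_freq_fixed(0))
```

Testing against None keeps 0 as a legitimate band edge while still supplying the default when the argument is omitted.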

The second problem is a calculation error.

freq_index = (
    np.floor(
        (coefficients +
         1) *
        hertz /
        sampling_freq)).astype(int)

coefficients is equal to fftpoints/2 +1, which cannot cover the complete frequency band. We should use fftpoints instead of coefficients for calculation.
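The effect can be checked with the numbers from the comparison script above (n_fft = 256, 16 kHz sampling rate, 8 kHz upper band edge). The helper below is a hypothetical restatement of the quoted formula with the size passed in explicitly.

```python
import math

def hz_to_bin(hertz, sampling_freq, size):
    """Restates the quoted formula: floor((size + 1) * hertz / sampling_freq)."""
    return int(math.floor((size + 1) * hertz / float(sampling_freq)))

n_fft = 256
coefficients = n_fft // 2 + 1                      # 129 spectrum bins (indices 0..128)

top_with_coefficients = hz_to_bin(8000, 16000, coefficients)
top_with_fft_points = hz_to_bin(8000, 16000, n_fft)
print(top_with_coefficients, top_with_fft_points)
```

Using coefficients maps the 8 kHz edge to bin 65, roughly the middle of the spectrum, whereas using the number of FFT points maps it to bin 128, the last bin. This matches the observation that the filterbank covered only half of the frequency band.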

As shown in my code，I have fixed the above two bugs and hope to get your review and merge. Thank you!

opened by yorange1 3

stack frames calculation
Hi Amirsina,

First of all, great project!

I noticed in the mfcc the last frame_length of the signal buffer is always missing. When the number of stack frames is calculated (in the function stack_frames), the sample_buffer is decreased with the frame_length before it is divided in a number of stack frames.

See snippet: https://github.com/astorfi/speechpy/blob/4ece793cc52e36decd60dd89aca25233d773afe6/speechpy/processing.py#L103-L104

On a 1 second sample buffer this is hardly noticeable, but if we run the mfcc on smaller buffers this becomes significant.

If the calculation is done in this way:

numframes = (int(math.ceil((length_signal - (frame_sample_length - frame_stride)) / frame_stride)))

The full sample buffer is used if frame_sample_length equals the frame_stride and adjusted correctly on differences between the frame_length and frame_stride.
opened by automatiek 0
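The difference between the two frame counts can be illustrated numerically. In this sketch, numframes_described is only a reading of the behavior described in the issue (frame length subtracted before dividing), not a copy of the linked lines, while numframes_proposed follows the formula quoted above; all lengths are in samples.

```python
import math

def numframes_described(length_signal, frame_sample_length, frame_stride):
    """Reading of the issue text: subtract the full frame length, then divide."""
    return int(math.ceil((length_signal - frame_sample_length) / float(frame_stride)))

def numframes_proposed(length_signal, frame_sample_length, frame_stride):
    """Formula proposed in the issue: subtract only the overlap, then divide."""
    return int(math.ceil(
        (length_signal - (frame_sample_length - frame_stride)) / float(frame_stride)))

# 2000 samples, frames of 1000 samples, no overlap
print(numframes_described(2000, 1000, 1000), numframes_proposed(2000, 1000, 1000))
# 50 samples, frames of 20 samples, stride of 10 samples
print(numframes_described(50, 20, 10), numframes_proposed(50, 20, 10))
```

In both cases the proposed formula yields exactly one more frame, which corresponds to the final frame_length of the buffer that the issue reports as missing.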

Releases(2.4)

2.4(Jul 23, 2018)

Latest release accepted by JOSS
2.3.1(Jun 30, 2018)

New release modifying the code considering PEP8 style guide and adding tests.
2.3(May 14, 2018)

2.2(Mar 3, 2018)

Handling zero variance for cepstral mean variance normalization.
2.1(Nov 26, 2017)

2.0(Nov 26, 2017)

This release supports Python version 2.7, 3.4 & 3.5!
1.3.5(Nov 26, 2017)

Now supporting Python versions 2.7, 3.4 and 3.5!
1.3.3(Nov 26, 2017)

Add power spectrum.
1.3.2(Nov 23, 2017)

Adding cepstral mean-variance normalization over the sliding window.
1.3.1(Nov 21, 2017)

Adding cepstral mean variance normalization.
1.2.1(Oct 27, 2017)

Latest release supported Python 3.5 and handling extreme cases.
1.2(Oct 23, 2017)

Latest release supported Python 3.5.
1.1.1(Aug 8, 2017)

This release supports Python 3.5 as well as Python 2.7 and Python 3.4.
1.1.0(Jul 30, 2017)

In this release, the Python 3.5 support has been added.
1.0.0(Jun 17, 2017)

This package contains the feature extraction software implementation for our paper Text-Independent Speaker Verification Using 3D Convolutional Neural Networks (Torfi, Nasrabadi, Dawson), arXiv:1705.09422. It includes important pre-processing operations such as frame-stacking, and feature-extraction functions for speech features such as MFCCs.

Owner

Amirsina Torfi

PhD & Developer working on Deep Learning, Computer Vision & NLP

