Data manipulation and transformation for audio signal processing, powered by PyTorch

Overview

torchaudio: an audio library for PyTorch

The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the same philosophy of providing strong GPU acceleration, focusing on trainable features through the autograd system, and keeping a consistent style (tensor names and dimension names). It is therefore primarily a machine learning library rather than a general signal processing library. Because all computations are expressed as PyTorch operations, torchaudio is easy to use and feels like a natural extension of PyTorch.

Dependencies

  • PyTorch (See below for the compatible versions)
  • [optional] vesis84/kaldi-io-for-python commit cb46cb1f44318a5d04d4941cf39084c5b021241e or above

The following are the corresponding torchaudio versions and supported Python versions.

torch             torchaudio        python
master / nightly  master / nightly  >=3.6, <=3.9
1.8.0             0.8.0             >=3.6, <=3.9
1.7.1             0.7.2             >=3.6, <=3.9
1.7.0             0.7.0             >=3.6, <=3.8
1.6.0             0.6.0             >=3.6, <=3.8
1.5.0             0.5.0             >=3.5, <=3.8
1.4.0             0.4.0             ==2.7, >=3.5, <=3.8
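
The pairing above can also be checked programmatically. The following is a minimal sketch, not part of torchaudio: the `COMPATIBILITY` mapping and the `matching_torchaudio` helper are hypothetical names, and the data is transcribed by hand from the released rows of the table (the master / nightly row is left out).

```python
# Hypothetical lookup table transcribed from the released rows above.
# torch version -> (torchaudio version, list of inclusive (min, max) Python ranges)
COMPATIBILITY = {
    "1.8.0": ("0.8.0", [((3, 6), (3, 9))]),
    "1.7.1": ("0.7.2", [((3, 6), (3, 9))]),
    "1.7.0": ("0.7.0", [((3, 6), (3, 8))]),
    "1.6.0": ("0.6.0", [((3, 6), (3, 8))]),
    "1.5.0": ("0.5.0", [((3, 5), (3, 8))]),
    "1.4.0": ("0.4.0", [((2, 7), (2, 7)), ((3, 5), (3, 8))]),
}


def matching_torchaudio(torch_version, python_version):
    """Return the torchaudio version paired with torch_version, or None.

    python_version is a (major, minor, ...) sequence such as sys.version_info.
    None means the combination does not appear in the table.
    """
    entry = COMPATIBILITY.get(torch_version)
    if entry is None:
        return None
    audio_version, ranges = entry
    major_minor = tuple(python_version[:2])
    for lowest, highest in ranges:
        if lowest <= major_minor <= highest:
            return audio_version
    return None
```

For example, `matching_torchaudio("1.8.0", sys.version_info)` would report which torchaudio release to pin for the running interpreter, if any.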

Installation

Binary Distributions

To install the latest version using anaconda, run:

conda install -c pytorch torchaudio

To install the latest pip wheels, run:

pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html

(If you do not have torch already installed, this will default to installing torch from PyPI. If you need a different torch configuration, preinstall torch before running this command.)

Nightly build

Note that the nightly build is built on PyTorch's nightly build. Therefore, you need to install the latest PyTorch when you use the nightly build of torchaudio.

pip

pip install numpy
pip install --pre torchaudio -f https://download.pytorch.org/whl/nightly/torch_nightly.html

conda

conda install -y -c pytorch-nightly torchaudio

From Source

The build process builds libsox and some codecs that torchaudio needs to link to. This is achieved by setting the environment variable BUILD_SOX=1. The build process will fetch and build libmad, lame, flac, vorbis, opus, and libsox before building the extension. This process requires cmake and pkg-config.

# Linux
BUILD_SOX=1 python setup.py install

# OSX
BUILD_SOX=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install

# Windows
# We need to use the MSVC x64 toolset for compilation, with Visual Studio's vcvarsall.bat or directly with vcvars64.bat.
# These batch files are under Visual Studio's installation folder, under 'VC\Auxiliary\Build\'.
# More information available at:
#   https://docs.microsoft.com/en-us/cpp/build/how-to-enable-a-64-bit-visual-cpp-toolset-on-the-command-line?view=msvc-160#use-vcvarsallbat-to-set-a-64-bit-hosted-build-architecture
call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64 && set BUILD_SOX=0 && python setup.py install
# or
call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" && set BUILD_SOX=0 && python setup.py install

This is known to work on Linux and Unix systems such as Ubuntu, CentOS 7, and macOS. If you try this on a new system and find a solution to make it work, feel free to share it by opening an issue.

Troubleshooting

checking build system type... ./config.guess: unable to guess system type

Since the configuration files for the codecs are old, they cannot correctly detect newer environments such as Jetson Aarch. You need to replace the config.guess file in ./third_party/tmp/lame-3.99.5/config.guess and/or ./third_party/tmp/libmad-0.15.1b/config.guess with the latest one.

See also: #658

Undefined reference to `tgetnum' when using `BUILD_SOX`

If while building from within an anaconda environment you come across errors similar to the following:

../bin/ld: console.c:(.text+0xc1): undefined reference to `tgetnum'

Install ncurses from conda-forge before running python setup.py install:

# Install ncurses from conda-forge
conda install -c conda-forge ncurses

Quick Usage

import torchaudio

waveform, sample_rate = torchaudio.load('foo.wav')  # load tensor from file
torchaudio.save('foo_save.wav', waveform, sample_rate)  # save tensor to file
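
The snippet above assumes that a file named foo.wav already exists. If you do not have one at hand, the following sketch writes a short test tone using only the Python standard library. The helper name `write_sine_wav` and its default values are illustrative assumptions, not part of torchaudio.

```python
import math
import struct
import wave


def write_sine_wav(path, sample_rate=16000, duration=1.0, frequency=440.0):
    """Write a mono, 16-bit signed integer PCM WAV file containing a sine tone.

    Returns the number of frames written.
    """
    num_frames = int(sample_rate * duration)
    amplitude = 0.5 * 32767  # half of full scale, to leave headroom
    samples = (
        int(amplitude * math.sin(2.0 * math.pi * frequency * n / sample_rate))
        for n in range(num_frames)
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return num_frames
```

Calling `write_sine_wav("foo.wav")` once should produce a one-second file that the Quick Usage example can load.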

Backend Dispatch

By default in OSX and Linux, torchaudio uses SoX as a backend to load and save files. The backend can be changed to SoundFile using the following. See SoundFile for installation instructions.

import torchaudio
torchaudio.set_audio_backend("soundfile")  # switch backend

waveform, sample_rate = torchaudio.load('foo.wav')  # load tensor from file, as usual
torchaudio.save('foo_save.wav', waveform, sample_rate)  # save tensor to file, as usual

Unlike SoX, SoundFile does not currently support mp3.

API Reference

API Reference is located here: http://pytorch.org/audio/

Conventions

Because torchaudio is a machine learning library built on top of PyTorch, it is standardized around the following naming conventions. Tensors are assumed to have "channel" as the first dimension and time as the last dimension (when applicable). This makes it consistent with PyTorch's dimensions. For size names, the prefix n_ is used (e.g. "a tensor of size (n_freq, n_mel)") whereas dimension names do not have this prefix (e.g. "a tensor of dimension (channel, time)").

  • waveform: a tensor of audio samples with dimensions (channel, time)
  • sample_rate: the sampling rate of the audio (samples per second)
  • specgram: a tensor of spectrogram with dimensions (channel, freq, time)
  • mel_specgram: a mel spectrogram with dimensions (channel, mel, time)
  • hop_length: the number of samples between the starts of consecutive frames
  • n_fft: the number of Fourier bins
  • n_mel, n_mfcc: the number of mel and MFCC bins
  • n_freq: the number of bins in a linear spectrogram
  • min_freq: the lowest frequency of the lowest band in a spectrogram
  • max_freq: the highest frequency of the highest band in a spectrogram
  • win_length: the length of the STFT window
  • window_fn: a function that creates windows, e.g. torch.hann_window
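
To make the relationship between these sizes concrete, here is a small arithmetic sketch. It assumes a one-sided, centered STFT (the signal is padded by n_fft // 2 on both sides), which is an assumption about the framing rather than something stated above. The helper name `spectrogram_sizes` and its defaults are hypothetical.

```python
def spectrogram_sizes(num_samples, n_fft=400, hop_length=200):
    """Return (n_freq, n_frames) for a one-sided, centered STFT.

    Assumes the input is padded by n_fft // 2 samples on each side, so the
    frame count depends only on num_samples and hop_length.
    """
    n_freq = n_fft // 2 + 1                   # bins from 0 Hz up to Nyquist, inclusive
    n_frames = num_samples // hop_length + 1  # one frame per hop, plus the first frame
    return n_freq, n_frames
```

Under these assumptions, one second of audio at 16 kHz with n_fft=400 and hop_length=200 gives 201 frequency bins and 81 frames.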

Transforms expect and return the following dimensions.

  • Spectrogram: (channel, time) -> (channel, freq, time)
  • AmplitudeToDB: (channel, freq, time) -> (channel, freq, time)
  • MelScale: (channel, freq, time) -> (channel, mel, time)
  • MelSpectrogram: (channel, time) -> (channel, mel, time)
  • MFCC: (channel, time) -> (channel, mfcc, time)
  • MuLawEncode: (channel, time) -> (channel, time)
  • MuLawDecode: (channel, time) -> (channel, time)
  • Resample: (channel, time) -> (channel, time)
  • Fade: (channel, time) -> (channel, time)
  • Vol: (channel, time) -> (channel, time)
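
As an illustration of a (channel, time) -> (channel, time) transform, the following pure-Python sketch applies mu-law companding to nested lists laid out as (channel, time). It follows the standard mu-law formula and is not the torchaudio implementation. The function names and the default of 256 quantization channels are assumptions made for the example.

```python
import math


def mu_law_encode(waveform, quantization_channels=256):
    """Map samples in [-1, 1] to integers in [0, quantization_channels - 1]."""
    mu = quantization_channels - 1
    encoded = []
    for channel in waveform:
        row = []
        for x in channel:
            # Logarithmic compression that keeps the sign of the sample.
            compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
            # Shift from [-1, 1] to [0, mu] and round to the nearest integer.
            row.append(int((compressed + 1) / 2 * mu + 0.5))
        encoded.append(row)
    return encoded


def mu_law_decode(encoded, quantization_channels=256):
    """Invert mu_law_encode, returning samples in [-1, 1]."""
    mu = quantization_channels - 1
    decoded = []
    for channel in encoded:
        row = []
        for q in channel:
            y = 2 * q / mu - 1  # back from [0, mu] to [-1, 1]
            row.append(math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y))
        decoded.append(row)
    return decoded
```

Both functions return a structure with the same number of channels and time steps as their input; only the values change.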

Complex numbers are supported via tensors of dimension (..., 2), and torchaudio provides complex_norm and angle to convert such a tensor into its magnitude and phase. Here, and in the documentation, we use an ellipsis "..." as a placeholder for the rest of the dimensions of a tensor, e.g. optional batching and channel dimensions.
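
The math behind the (..., 2) convention can be sketched in plain Python, where the trailing dimension holds the real and imaginary parts. The two functions below are stand-ins that operate on a single (real, imag) pair. They mirror the magnitude and phase computation but are not the torchaudio functions themselves, and the `power` parameter is included here as an illustrative assumption.

```python
import math


def complex_norm(pair, power=1.0):
    """Magnitude of a (real, imag) pair, raised to `power`."""
    real, imag = pair
    return math.hypot(real, imag) ** power


def angle(pair):
    """Phase of a (real, imag) pair, in radians."""
    real, imag = pair
    return math.atan2(imag, real)


# A tiny "tensor" of dimension (time, 2): one complex value per time step.
spectrum = [(3.0, 4.0), (0.0, 1.0)]
magnitudes = [complex_norm(p) for p in spectrum]
phases = [angle(p) for p in spectrum]
```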

Contributing Guidelines

Please refer to CONTRIBUTING.md

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

Comments
  • [Announcement] Improving I/O for correct and consistent experience

    tl;dr: how to migrate to new backend/interface in 0.7

    • If you are using torchaudio in Linux/macOS environments, please use torchaudio.set_audio_backend("sox_io") to adapt to the upcoming changes.

    • If you are in Windows environment, please set torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False and reload backend to use the new interface.

    • Note that this ships with some bug-fixes for formats other than 16bit signed integer WAV, so you might experience some BC-breaking changes as described in the section below.

    News

    [UPDATE] 2021/03/06

    • All migration work has been completed on the master branch.

    [UPDATE] 2021/02/12

    • Added bits_per_sample and encoding argument (replaced dtype) to save function.

    [UPDATE] 2021/01/29

    • Added encoding to AudioMetaData

    [UPDATE] 2021/01/22

    • Added format argument to load/info/save function.
    • Added bits_per_sample to AudioMetaData.

    [UPDATE] 2020/10/21

    • Added Description of "soundfile" backend legacy interface.

    [UPDATE] 2020/09/18

    • Added migration guide for "soundfile" backend.
    • Moved the phase when "soundfile" backend signatures change from 0.9.0 to 0.8.0 so that they match with "sox_io" backend, which becomes default in 0.8.0.

    [UPDATE] 2020/09/17

    • Added information on deprecation of native libsox structures such as signalinfo_t and encoding_t.

    Improving I/O for correct and consistent experience

    This is an announcement for users that we are making backward-incompatible changes to the I/O functions of torchaudio backends, starting with the 0.7.0 release and continuing through the 0.9.0 release.

    What is affected?

    • Public APIs

      • torchaudio.load
        • [Linux/macOS] By switching the default backend from "sox" backend to "sox_io" backend in 0.8.0, loading audio formats other than 16bit signed integer WAV returns the correct tensor.
        • [Linux/macOS/Windows] The signature of "soundfile" backend will be changed in 0.8.0 to match that of "sox_io" backend.
      • torchaudio.save
        • [Linux/macOS] By switching to "sox_io" backend, saving audio files will no longer degrade the data. The supported formats will be restricted to the tested formats only. (Please refer to the doc for the supported formats.)
        • [Linux/macOS/Windows] The signature of "soundfile" backend will be changed in 0.8.0 to match that of "sox_io" backend.
      • torchaudio.info
        • [Linux/macOS/Windows] The signature of "soundfile" backend will be changed in 0.8.0 to match that of "sox_io" backend.
      • torchaudio.load_wav
        • will be removed in 0.9.0. (load function with normalize=False will provide the same functionality)
    • Internal APIs

      The following functions/classes of "sox" backend were accidentally exposed and will be removed in 0.9.0. There is no replacement for them. Please use save/load/info functions.

      • torchaudio.save_encinfo
        • will be removed in 0.9.0
      • torchaudio.get_sox_signalinfo_t
        • will be removed in 0.9.0
      • torchaudio.get_sox_encodinginfo_t
        • will be removed in 0.9.0
      • torchaudio.get_sox_option_t
        • will be removed in 0.9.0
      • torchaudio.get_sox_bool
        • will be removed in 0.9.0

    The signatures of the other backends are not planned to be changed within this overhaul plan.

    • Classes
      • torchaudio.SignalInfo and torchaudio.EncodingInfo
        • will be replaced with AudioMetaData in 0.8.0 for "soundfile" backend
        • will be removed in 0.9.0

    Why

    There are currently three backends in torchaudio. (Please refer to the documentation for the detail.)

    "sox" backend is the original backend, which binds libsox with pybind11. The functionalities (load / save / info) of this backend are not well-tested and have a number of issues. (See https://github.com/pytorch/audio/pull/726.)

    Fixing these issues in a backward-compatible manner is not straightforward. Therefore, while we were adding TorchScript-compatible I/O functions, we decided to deprecate the original "sox" backend and replace it with a new backend ("sox_io" backend), which is confirmed not to have those issues.

    When switching the default backend for Linux/macOS from "sox" to "sox_io", we would like to align the interface of "soundfile" backend with it. We therefore introduced a new interface to "soundfile" backend, rather than a new backend, to keep the number of public APIs down.

    When / What Changes

    The following is the timeline for the planned changes:

    Phase 1: expected release 0.7.0 (Oct 2020)

    • "sox" backend issues deprecation warning. (#904)
    • "soundfile" backend issues warning of expected signature change. (#906)
    • Add the new interface to "soundfile" backend. (#922)
    • load_wav functions of all backends are marked as deprecated. (#905)

    Phase 2: expected release 0.8.0 (March 2021)

    • [BC-Breaking] "sox_io" backend becomes default backend. Function signatures of "soundfile" backend are aligned with "sox_io" backend. (#978)
    • get_sox_XXX functions issue deprecation warning. (#975)

    Phase 3: expected release 0.9.0

    • "sox" backend is removed. (#1311)
    • The legacy interface of "soundfile" backend is removed. (#1311)
    • [BC-Breaking] load_wav functions are removed from all backends. (#1362)

    Planned signature changes of "soundfile" backend in 0.8.0

    The following is the planned signature change of "soundfile" backend functions in 0.8.0 release.

    info function

    AudioMetaData implementation can be found here. The placement of the AudioMetaData might be changed.

    ~0.7.0:

    def info(
      filepath: str,
    ) -> Tuple[SignalInfo, EncodingInfo]

    0.8.0:

    def info(
      filepath: str,
      format: Optional[str],
    ) -> AudioMetaData
    

    Migration

    The values returned from info function will be changed. Please use the corresponding new attributes.

    ~0.7.0:

    si, ei = torchaudio.info(filepath)
    sample_rate = si.rate
    num_frames = si.length
    num_channels = si.channels
    precision = si.precision
    bits_per_sample = ei.bits_per_sample
    encoding = ei.encoding

    0.8.0:

    metadata = torchaudio.info(filepath)
    sample_rate = metadata.sample_rate
    num_frames = metadata.num_frames
    num_channels = metadata.num_channels
    bits_per_sample = metadata.bits_per_sample
    encoding = metadata.encoding
    

    Note: If the attribute you are using is missing, file a Feature Request issue.
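
The new attributes are enough to derive common quantities such as duration. The sketch below uses a `SimpleNamespace` stand-in in place of the real object returned by `torchaudio.info`, so that it runs without any audio file. The `describe` helper and the example values (including the encoding string) are hypothetical.

```python
from types import SimpleNamespace


def describe(metadata):
    """Summarize an AudioMetaData-like object using only the attributes listed above."""
    return {
        "duration_seconds": metadata.num_frames / metadata.sample_rate,
        "num_channels": metadata.num_channels,
        "bits_per_sample": metadata.bits_per_sample,
        "encoding": metadata.encoding,
    }


# Stand-in for the object returned by torchaudio.info(filepath); values are made up.
example = SimpleNamespace(
    sample_rate=16000,
    num_frames=32000,
    num_channels=1,
    bits_per_sample=16,
    encoding="PCM_S",
)
summary = describe(example)
```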

    load function

    ~0.7.0:

    def load(
      filepath: str,
      # out: Optional[Tensor] = None,
          # To be removed.
          # Currently not used
          # Raise AssertionError if given
      normalization: Optional[bool] = True,
          # To be renamed to normalize.
          # Currently only accepts True
          # Raise AssertionError if given
      channels_first: Optional[bool] = True,
      num_frames: int = 0,
      offset: int = 0,
          # To be renamed to frame_offset
      # signalinfo: SignalInfo = None,
          # To be removed
          # Currently not used
          # Raise AssertionError if given
      # encodinginfo: EncodingInfo = None,
          # To be removed
          # Currently not used
          # Raise AssertionError if given
      filetype: Optional[str] = None
          # To be removed
          # Currently not used
    ) -> Tuple[Tensor, int]

    0.8.0:

    def load(
      filepath: str,
      frame_offset: int = 0,
      num_frames: int = -1,
      normalize: bool = True,
      channels_first: bool = True,
      format: Optional[str] = None,  # only required for file-like object input
    ) -> Tuple[Tensor, int]
    
    Migration

    Please change the argument names:

    • normalization -> normalize
    • offset -> frame_offset

    ~0.7.0:

    waveform, sample_rate = torchaudio.load(
        filepath,
        normalization=normalization,
        channels_first=channels_first,
        num_frames=num_frames,
        offset=offset,
    )

    0.8.0:

    waveform, sample_rate = torchaudio.load(
        filepath,
        frame_offset=offset,
        num_frames=num_frames,
        normalize=normalization,
        channels_first=channels_first,
    )
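
If many call sites pass keyword arguments through a dictionary, the renames can be applied mechanically. The following is a minimal sketch: `migrate_load_kwargs` is a hypothetical helper, not something torchaudio provides. It also assumes that the old default num_frames=0 and the new default num_frames=-1 both mean "read to the end of the file", which is inferred from the two signatures rather than stated in this announcement.

```python
# Old keyword -> new keyword, taken from the rename list above.
_RENAMED = {"normalization": "normalize", "offset": "frame_offset"}

# Old keywords that do not appear in the 0.8.0 signature.
_REMOVED = {"out", "signalinfo", "encodinginfo", "filetype"}


def migrate_load_kwargs(old_kwargs):
    """Translate ~0.7.0-style load() keyword arguments to the 0.8.0 names."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        if name in _REMOVED:
            continue  # dropped from the new signature
        new_kwargs[_RENAMED.get(name, name)] = value
    # Assumption: 0 meant "all frames" in the old interface, -1 does in the new one.
    if new_kwargs.get("num_frames") == 0:
        new_kwargs["num_frames"] = -1
    return new_kwargs
```

The result can then be passed along as `torchaudio.load(filepath, **migrate_load_kwargs(old_kwargs))`.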
    

    save function

    ~0.7.0:

    def save(
      filepath: str,
      src: Tensor,
      sample_rate: int,
      precision: int = 16,
        # moved to `bits_per_sample` argument
      channels_first: bool = True
    )

    0.8.0:

    def save(
      filepath: str,
      src: Tensor,
      sample_rate: int,
      channels_first: bool = True,
      compression: Optional[float] = None,
        # Added only for compatibility.
        # soundfile does not support compression option
        # Raises Warning if not None
      format: Optional[str] = None,
      encoding: Optional[str] = None,
      bits_per_sample: Optional[int] = None,
    )
    
    Migration

    ~0.7.0:

    torchaudio.save(
        filepath,
        waveform,
        sample_rate,
        channels_first
    )

    0.8.0:

    torchaudio.save(
        filepath,
        waveform,
        sample_rate,
        channels_first,
        bits_per_sample=16,
    )
    # You can also designate the audio format with `format` and configure the encoding with `compression` and `encoding`. See https://pytorch.org/audio/master/backend.html#save for details.
    

    BC-breaking changes

    Read and write operations on the formats other than WAV 16-bit signed integer were affected by small bugs.

    opened by mthrok 41
  • Installation on apple silicon (M1)

    ❓ Questions and Help

    Hi, is there a possibility of installing this package on the Apple M1 machine? Since the installation fails for me from both pip and conda. Thanks!

    triaged 
    opened by anticdimi 36
  • Add backprop support to lfilter

    This merge resolves issue #704.

    It moves the original Python implementation of lfilter into the C++ backend and registers a custom autograd kernel to support TorchScript, as @vincentqb mentioned in #704.

    A simple test case is added to test whether the gradient is valid or not.

    Notes

    Some differences to the old lfilter:

    • The old implementation uses direct-form I; the new one uses direct-form II.
    • A mix of indexing and matmul operations at https://github.com/pytorch/audio/blob/e83d557a0d48249b20eba42952a2ed61a3d9644b/torchaudio/functional/filtering.py#L881 is replaced by a single conv1d function call: https://github.com/yoyololicon/audio/blob/4e2ff32b50d56ce168fcee872c95ffc6cde82eaa/torchaudio/csrc/lfilter.cpp#L123
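
For readers unfamiliar with the two filter structures, here is a plain-Python sketch of both. It is written for clarity on lists of floats, assumes the coefficients are normalized so that a[0] == 1, and is neither the old nor the new torchaudio implementation. The function names are made up for the example.

```python
def lfilter_direct_form_1(x, a, b):
    """Direct-form I: y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


def lfilter_direct_form_2(x, a, b):
    """Direct-form II: w[n] = x[n] - sum_{k>=1} a[k] w[n-k]; y[n] = sum_k b[k] w[n-k]."""
    w, y = [], []
    for n in range(len(x)):
        # Recursive (feedback) part first, producing the shared delay line w.
        state = x[n] - sum(a[k] * w[n - k] for k in range(1, len(a)) if n - k >= 0)
        w.append(state)
        # Feed-forward part reads from the same delay line.
        y.append(sum(b[k] * w[n - k] for k in range(len(b)) if n - k >= 0))
    return y
```

In exact arithmetic both structures compute the same output. Direct-form II shares one delay line between the feedback and feed-forward parts, and its feed-forward step is a sliding dot product over that delay line, the kind of operation a convolution expresses.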

    cla signed 
    opened by yoyololicon 29
  • Add VCTK_092 dataset

    • Updated dataset URLs
    • Updated the zip file checksum
    • Fixed URL path parsing for the updated link
    • Updated audio folder and audio extension names

    The last URL returned a 404 HTTP not found error as the dataset was moved to a new link.

    Verified the changes by loading the dataset into dataloader and looping over every file. Tested by training some epochs.

    opened by Abhi011999 27
  • SignalInfo in new AudioMetaData

    Hi, We're using sox_io backend for torchaudio. New version of the AudioMetaData class returned by the info method doesn't have all the data I need for analysis. info method in the deprecated sox backend used to return sox_signalinfo_t and sox_encodinginfo_t that had much more detailed information about encoding, bit depth etc. What should I use to get the info about audio file's encoding, bit depth and other details?

    opened by ghost 25
  • Cannot import torchaudio with torch 1.1

    I am trying to use torchaudio with torch 1.1. It compiled successfully with Python 3.6, but when I want to import the torchaudio package, I get this error:

    >>> import torchaudio
    Traceback (most recent call last):
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torch/jit/annotations.py", line 95, in parse_type_line
        arg_ann = eval(arg_ann_str, _eval_env)
      File "<string>", line 1, in <module>
    NameError: name 'Optional' is not defined
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torchaudio-0.2-py3.6-linux-x86_64.egg/torchaudio/__init__.py", line 7, in <module>
        from torchaudio import transforms, datasets, kaldi_io, sox_effects, legacy
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torchaudio-0.2-py3.6-linux-x86_64.egg/torchaudio/transforms.py", line 6, in <module>
        from . import functional as F
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torchaudio-0.2-py3.6-linux-x86_64.egg/torchaudio/functional.py", line 108, in <module>
        @torch.jit.script
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torch/jit/__init__.py", line 824, in script
        fn = torch._C._jit_script_compile(ast, _rcb, get_default_args(obj))
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torch/jit/annotations.py", line 55, in get_signature
        return parse_type_line(type_line)
      File "/home/daniel/envs/pytorch-py3/lib/python3.6/site-packages/torch/jit/annotations.py", line 97, in parse_type_line
        raise RuntimeError("Failed to parse the argument list of a type annotation: {}".format(str(e)))
    RuntimeError: Failed to parse the argument list of a type annotation: name 'Optional' is not defined
    

    It ran fine on torch 1.0 - maybe some problem with differing source code between torch 1.0 and 1.1?

    opened by f90 24
  • Building torchaudio 0.7 on `ppc64le`

    Hi there, I am trying to install torchaudio from source, and I run into the following error:

    CMake Error at [...]/anaconda3/envs/pytorch1.7real/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:58 (message):
      Your installed Caffe2 version uses protobuf but the protobuf library cannot
      be found.  Did you accidentally remove it, or have you set the right
      CMAKE_PREFIX_PATH? If you do not have protobuf, you will need to install
      protobuf and set the library path accordingly.
    

    I think I have protobuf installed because when I run pip install protobuf I get

    Requirement already satisfied: protobuf in [...]/anaconda3/envs/pytorch1.7real/lib/python3.7/site-packages (3.9.2)
    Requirement already satisfied: six>=1.9 in [...]/anaconda3/envs/pytorch1.7real/lib/python3.7/site-packages (from protobuf) (1.15.0)
    Requirement already satisfied: setuptools in [...]/anaconda3/envs/pytorch1.7real/lib/python3.7/site-packages (from protobuf) (58.0.4)
    

    I've tried modifying the CMAKE_PREFIX_PATH but I don't really know what I'm doing there, and I'm not sure if that's the right way to go.

    Help would be appreciated! I need to install from source since there's no other way I can install on the machine I'm using right now.

    opened by j4sonzhao 23
  • InverseMelScale Implementation

    InverseMelScale implementation for #351. Test code is coming!

    Edit: It would be great if someone could chime in on the test code guideline for torchaudio?

    opened by jaeyeun97 23
  • Are the nightly conda builds broken?

    13:35 $ conda install pytorch torchvision torchaudio -c pytorch-nightly
    Collecting package metadata (current_repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Collecting package metadata (repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    
    PackagesNotFoundError: The following packages are not available from current channels:
    
      - torchaudio
    
    Current channels:
    
      - https://conda.anaconda.org/pytorch-nightly/osx-arm64
      - https://conda.anaconda.org/pytorch-nightly/noarch
      - https://repo.anaconda.com/pkgs/main/osx-arm64
      - https://repo.anaconda.com/pkgs/main/noarch
      - https://repo.anaconda.com/pkgs/r/osx-arm64
      - https://repo.anaconda.com/pkgs/r/noarch
    
    To search for alternate channels that may provide the conda package you're
    looking for, navigate to
    
        https://anaconda.org
    
    and use the search bar at the top of the page.
    

    Tested just now on OSX

    triaged 
    opened by greaber 22
  • torchaudio.load() for file-like object fails for mp3 files

    🐛 Describe the bug

    Description

    This error occurs when trying to read a file-like object that contains an MP3 audio. This error does not occur for file-like objects that contain WAV audio.

    Stack Trace

    formats: can't determine type of file `'
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/rob/code/english_toolkit/.venv/lib/python3.8/site-packages/torchaudio/backend/sox_io_backend.py", line 149, in load
        return torchaudio._torchaudio.load_audio_fileobj(
    RuntimeError: Error loading audio file: failed to open file <in memory buffer>
    

    Reproducible Snippets

    To reproduce with MP3:

    with requests.get("https://filesamples.com/samples/audio/mp3/sample3.mp3", stream=True) as response:
         y,sr = torchaudio.load(response.raw)
    

    To verify this is not an issue for WAV:

    with requests.get("https://www2.cs.uic.edu/~i101/SoundFiles/gettysburg10.wav", stream=True) as response:
         y,sr = torchaudio.load(response.raw)
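
The "can't determine type of file" message suggests that format detection is what fails for the in-memory stream. One thing that might be worth trying is to identify the format yourself and pass it explicitly through the `format` argument of `load`. Whether that avoids this particular error is untested here. The sketch below guesses a format name from well-known leading bytes. `sniff_audio_format` is a hypothetical helper, and it needs a seekable object such as `io.BytesIO`, which a raw HTTP response is not.

```python
import io


def sniff_audio_format(fileobj):
    """Guess a format name from the first bytes of a seekable file-like object.

    Returns "wav", "flac", "ogg", "mp3", or None when nothing matches.
    """
    start = fileobj.tell()
    header = fileobj.read(12)
    fileobj.seek(start)  # leave the stream where we found it

    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:3] == b"ID3":
        return "mp3"  # MP3 file that starts with an ID3v2 tag
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG audio frame sync
    return None
```

With the response body buffered as `buffer = io.BytesIO(response.content)`, the guess could then be passed along as `torchaudio.load(buffer, format=sniff_audio_format(buffer))`.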
    

    Relevant Documentation

    These snippets exactly copy the torchaudio load file-like object documentation.

    Versions

    torchaudio version: 0.11.0

    triaged 
    opened by rbracco 21
  • pip install forces install of old torch 1.3.0

    I have torch 1.3.1 installed

    pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html forces collecting of torch 1.3.0 even though a newer version is installed

    This is an official installation way from README. If pip is mis-behaving, maybe a conda install should be recommended too

    opened by vadimkantorov 20
  • Add ctc beam search decoder using cuda implementation

    🚀 The feature

    Hi audio team, I would like to propose integrating a CUDA-based CTC prefix beam search decoder into torchaudio. It implements the CTC prefix beam search algorithm using CUDA kernels and exposes a Python API through pybind. It performs better than the current CPU-based ctc_decoder. If you are interested in adopting this implementation, I could prepare a PR for it. I could put the kernel implementations under third_party or elsewhere, and add the Python API in https://github.com/pytorch/audio/blob/main/torchaudio/models/decoder/_ctc_decoder.py.

    Motivation, pitch

    Motivation is to accelerate ctc beam search decoding process using GPU.

    Alternatives

    No response

    Additional context

    No response

    opened by yuekaizhang 0
  • Use PyBind for binding utilities

    Summary: Merge utility binding

    This commit updates the utility binding, so that we can use is_module_available() for checking the existence of extension modules.

    To ensure the existence of module, this commit migrates the binding of utility functions to PyBind11.

    Going forward, we should use TorchBind for ops that we want to support TorchScript, otherwise default to PyBind11. (PyBind has advantage of not copying strings.)
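
As a rough illustration of what an availability check like this can look like, the sketch below uses the standard library's `importlib.util.find_spec`. It is an assumption about the general approach, not a copy of the torchaudio helper, and it is only exercised here with top-level module names.

```python
import importlib.util


def is_module_available(*modules):
    """Return True only if every named top-level module can be found.

    find_spec locates a top-level module without executing it, so this
    check is cheap and has no import side effects.
    """
    return all(importlib.util.find_spec(name) is not None for name in modules)
```

A caller can then guard optional features with a plain boolean, for example `if is_module_available("soundfile"): ...`.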

    Differential Revision: D42355992

    fb-exported cla signed 
    opened by mthrok 2
  • Rename generator to vocoder in HiFiGAN model and factory functions

    The generator part of HiFiGAN model is a vocoder which converts mel spectrogram to waveform. It makes more sense to name it as vocoder for better understanding.

    cla signed 
    opened by nateanl 2
  • Fix filtering function fallback mechanism

    lfilter and overdrive have faster implementations written in C++. If they are not available, torchaudio is supposed to fall back on the Python-based implementations.

    The original fallback mechanism relied on error types and messages from PyTorch core, which have since changed.

    This commit replaces it with a more appropriate fallback mechanism.

    cla signed 
    opened by mthrok 4
  • TorchAudio FFmpeg migration

    Overview

    We propose the following end state for TorchAudio’s I/O functions info, load, save:

    • FFmpeg is the primary backend for TorchAudio’s I/O functions info, load, save.
    • FFmpeg-, SoX-, and soundfile-based backends are user-selectable from said I/O functions and are no longer determined by global state.
    • All of FFmpeg, SoX, and soundfile are optional dependencies.
    • I/O functions no longer support TorchScript.

    Context

    TorchAudio’s functions info, load, and save currently rely on two third-party libraries: SoX and soundfile. Whereas SoX is used in the Linux and Mac distributions, soundfile is used in the Windows distribution.

    Through the years, we’ve encountered several issues with SoX:

    • Its handling of in-memory decoding is buggy and requires a local patch to fix.
    • It accesses the internal structure of FILE* object. As a result, 14.4.42 does not compile on Windows with MSVC newer than 2013. This precludes us from using SoX across platforms.
    • It attempts to rewind stdin.
    • It has not been actively developed/maintained since 2015, which doesn’t lend confidence that the aforementioned issues will be addressed.
    • It has caused other user-facing problems:
      • https://github.com/pytorch/audio/issues/2870
      • https://github.com/pytorch/audio/issues/2356

    Separately, our work around streaming I/O introduced FFmpeg as a dependency. FFmpeg's advantages over SoX include the following:

    • It’s battle tested. It’s been developed for over 20 years now and is widely used in industry.
    • The library code is portable across Linux, Mac, and Windows.
    • It supports a wide variety of codecs, from basic to advanced, for both audio and video.
    • It supports GPU acceleration in decoding and encoding.
    • The C API offers a high degree of customizability.
      • It abstracts away many things like codecs, file formats, and devices.
      • It allows for implementing custom I/O such as in-memory decoding/encoding with file-like object protocol.
    • It’s being actively developed, with the latest version (5.1) having been released in July 2022.

    End state

    To address the issues above, we propose the following end state:

    • FFmpeg is the primary backend for TorchAudio’s I/O functions info, load, save.
    • FFmpeg-, SoX-, and soundfile-based backends are user-selectable from said I/O functions and are no longer determined by global state.
    • All of FFmpeg, SoX, and soundfile are optional dependencies.
    • I/O functions no longer support TorchScript.

    We anticipate this end state bringing greater cross-platform consistency, simplifying our codebase, and delivering an improved user experience.

    Plan

    Release 0.14

    • Introduce option to {info, load, save} that allows users to choose any of FFmpeg, SoX, and soundfile as the I/O backend for both file paths and file objects, while preserving the existing behavior, i.e. Linux and Mac distributions default to relying on SoX, Windows distributions default to relying on soundfile.
      • Doing so naturally removes TorchScript support in {info, load, save}.
    • Add deprecation warnings that convey that release 0.15 will make FFmpeg the default backend for files and file objects for {info, load, save} and encourage users to switch over to FFmpeg.

    Release 0.15

    • Make FFmpeg the default backend for files and file objects for {info, load, save} across all platforms.
    • Make SoX an optional dependency that is dynamically linked if available. Linking enables the SoX backend and torchaudio.sox_effects.
    • Remove file-like object handling for the SoX backend.
    • Remove dependence of backend selection on global state.
    RFC 
    opened by hwangjeff
Releases(v0.13.1)
  • v0.13.1(Dec 15, 2022)

    This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements and documentation updates. No new features are added.

    Bug Fix

    IO

    • Make buffer size configurable in ffmpeg file object operations and set size in backend (#2810)
    • Fix issue with the missing video frame in StreamWriter (#2789)
    • Fix decimal FPS handling in StreamWriter (#2831)
    • Fix wrong frame allocation in StreamWriter (#2905)
    • Fix duplicated memory allocation in StreamWriter (#2906)

    Model

    • Fix HuBERT model initialization (#2846, #2886)

    Recipe

    • Fix issues in HuBERT fine-tuning recipe (#2851)
    • Fix automatic mixed precision in HuBERT pre-training recipe (#2854)
  • v0.13.0(Oct 28, 2022)

    Highlights

    TorchAudio 0.13.0 release includes:

    • Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
    • New datasets and metadata mode for the SUPERB benchmark
    • Custom language model support for CTC beam search decoding
    • StreamWriter for audio and video encoding

    [Beta] Source Separation Models and Bundles

    Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)

    The TorchAudio v0.13 release includes the following features

    • MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
    • Hybrid Demucs model architecture (docs)
    • Three factory functions suitable for different sample rate ranges
    • Pre-trained pipelines (docs) and tutorial

    SDR results of pre-trained pipelines on the MUSDB-HQ test set

    | Pipeline | All | Drums | Bass | Other | Vocals |
    | ----- | ----- | ----- | ----- | ----- | ----- |
    | HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
    | HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |

    * Trained on the training data of MUSDB-HQ dataset.
    ** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.

    Special thanks to @adefossez for the guidance.

    ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds the pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6dB SDR improvement and 15.3dB Si-SNR improvement on the Libri2Mix test set.
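
    The SDR and Si-SNR figures quoted above can be read against the textbook definitions. The sketch below is a minimal plain-Python illustration, assuming the simplest energy-ratio form of SDR and the usual scale-invariant projection for Si-SNR. The function names are hypothetical, and this is not necessarily the evaluation code behind the reported numbers. An "improvement" is the metric of the separated estimate minus the metric of the unprocessed mixture.

```python
import math
from typing import Sequence


def _energy(x: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(v * v for v in x)


def sdr(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    distortion = [r - e for r, e in zip(reference, estimate)]
    return 10.0 * math.log10(_energy(reference) / _energy(distortion))


def si_snr(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Scale-invariant SNR in dB: project the estimate onto the reference first."""
    scale = sum(r * e for r, e in zip(reference, estimate)) / _energy(reference)
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    return 10.0 * math.log10(_energy(target) / _energy(residual))


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -1.1, 1.1, -0.9]
louder = [2.0 * v for v in estimate]

sdr_db = sdr(reference, estimate)
# Rescaling the estimate leaves Si-SNR essentially unchanged, unlike plain SDR.
si_snr_db = si_snr(reference, estimate)
si_snr_louder_db = si_snr(reference, louder)
```

    Si-SNR is often preferred for speech separation because a model is not penalized for producing the right signal at the wrong gain.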

    [Beta] Datasets and Metadata Mode for SUPERB Benchmarks

    With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
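
    The benefit of metadata mode can be sketched with a toy dataset. This is a minimal plain-Python illustration: the class, its fields, and the decode stand-in are hypothetical, and only the idea of a get_metadata accessor that skips waveform loading is taken from the description above.

```python
from typing import List, Sequence, Tuple

Entry = Tuple[str, int, str]  # (relative path, sample rate, transcript), assumed layout


class ToyDataset:
    """Hypothetical dataset that mimics waveform access versus metadata access."""

    def __init__(self, entries: Sequence[Entry]) -> None:
        self._entries = list(entries)
        self.decode_calls = 0  # counts how often audio is (pretend) decoded

    def __len__(self) -> int:
        return len(self._entries)

    def get_metadata(self, n: int) -> Entry:
        """Return the path and labels without touching any audio data."""
        return self._entries[n]

    def __getitem__(self, n: int) -> Tuple[List[float], int, str]:
        path, sample_rate, transcript = self.get_metadata(n)
        return self._decode(path), sample_rate, transcript

    def _decode(self, path: str) -> List[float]:
        # Stand-in for expensive file decoding.
        self.decode_calls += 1
        return [0.0, 0.0, 0.0, 0.0]


dataset = ToyDataset([("a.flac", 16000, "hello"), ("b.flac", 16000, "world")])

# Collect every transcript without decoding a single file.
transcripts = [dataset.get_metadata(i)[2] for i in range(len(dataset))]
```

    Tasks such as building a vocabulary or sorting utterances by label only need this metadata pass, so the waveform cost is paid later and only for the items actually used.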

    Datasets with metadata functionality:

    [Beta] Custom Language Model support in CTC Beam Search Decoding

    In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.

    [Beta] StreamWriter

    torchaudio.io.StreamWriter is a class for encoding media including audio and video. This can handle a wide variety of codecs, chunk-by-chunk encoding and GPU encoding.

    Backward-incompatible changes

    • [BC-breaking] Fix momentum in transforms.GriffinLim (#2568) The GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
    • Make torchaudio.info decode audio to compute num_frames if it is not found in metadata (#2740). In such cases, torchaudio.info may now return non-zero values for num_frames.

    Bug Fixes

    • Fix random Gaussian generation (#2639) torchaudio.compliance.kaldi.fbank with the dither option produced output different from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This release corrects it to use a Gaussian distribution.
    • Update download link for speech commands (#2777) The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link is updated to fetch the whole dataset.

    New Features

    IO

    • Add metadata to source stream info (#2461, #2464)
    • Add utility function to fetch FFmpeg library versions (#2467)
    • Add YUV444P support to StreamReader (#2516)
    • Add StreamWriter (#2628, #2648, #2505)
    • Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
    • Add StreamReader Tensor Binding to src (#2699)
    • Add StreamWriter media device/streaming tutorial (#2708)
    • Add StreamWriter tutorial (#2698)

    Ops

    • Add ITU-R BS.1770-4 loudness recommendation (#2472)
    • Add convolution operator (#2602)
    • Add additive noise function (#2608)

    Models

    • Hybrid Demucs model implementation (#2506)
    • Docstring change for Hybrid Demucs (#2542, #2570)
    • Add NNLM support to CTC Decoder (#2528, #2658)
    • Move hybrid demucs model out of prototype (#2668)
    • Move conv_tasnet_base doc out of prototype (#2675)
    • Add custom lm example to decoder tutorial (#2762)

    Pipelines

    • Add SourceSeparationBundle to prototype (#2440, #2559)
    • Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
    • Create tutorial for HDemucs (#2572)
    • Add HDEMUCS_HIGH_MUSDB (#2601)
    • Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
    • Move Hybrid Demucs pipeline to beta (#2673)
    • Update description of HDemucs pipelines

    Datasets

    • Add fluent speech commands (#2480, #2510)
    • Add musdb dataset and tests (#2484)
    • Add VoxCeleb1 dataset (#2349)
    • Add metadata function for LibriSpeech (#2653)
    • Add Speech Commands metadata function (#2687)
    • Add metadata mode for various datasets (#2697)
    • Add IEMOCAP dataset (#2732)
    • Add Snips Dataset (#2738)
    • Add metadata for Librimix (#2751)
    • Add file name to returned item in Snips dataset (#2775)
    • Update IEMOCAP variants and labels (#2778)

    Improvements

    IO

    • Replace runtime_error exception with TORCH_CHECK (#2550, #2551, #2592)
    • Refactor StreamReader (#2507, #2508, #2512, #2530, #2531, #2533, #2534)
    • Refactor sox C++ (#2636, #2663)
    • Delay the import of kaldi_io (#2573)

    Ops

    • Speed up resample with kernel generation modification (#2553, #2561) The kernel generation for resampling is optimized in this release. The following table illustrates the performance improvements from the previous release for the torchaudio.functional.resample function using the sinc resampling method, on float32 tensor with two channels and one second duration.

    CPU

    | torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
    | ----- | ----- | ----- | ----- | ----- |
    | 0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
    | 0.12 | 0.386 | 0.534 | 31.8 | 12.1 |

    CUDA

    | torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
    | ----- | ----- | ----- | ----- | ----- |
    | 0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
    | 0.12 | 0.524 | 0.334 | 64.4 | 22.8 |

    • Add normalization parameter on spectrogram and inverse spectrogram (#2554)
    • Replace assert with raise for ops (#2579, #2599)
    • Replace CHECK_ by TORCH_CHECK_ (#2582)
    • Fix argument validation in TorchAudio filtering (#2609)

    Models

    • Switch to flashlight decoder from upstream (#2557)
    • Add dimension and shape check (#2563)
    • Replace assert with raise in models (#2578, #2590)
    • Migrate CTC decoder code (#2580)
    • Enable CTC decoder in Windows (#2587)

    Datasets

    • Replace assert with raise in datasets (#2571)
    • Add unit test for LibriMix dataset (#2659)
    • Add gtzan download note (#2763)

    Tutorials

    • Tweak tutorials (#2630, #2733)
    • Update ASR inference tutorial (#2631)
    • Update and fix tutorials (#2661, #2701)
    • Introduce IO section to getting started tutorials (#2703)
    • Update HW video processing tutorial (#2739)
    • Update tutorial author information (#2764)
    • Fix typos in tacotron2 tutorial (#2761)
    • Fix fading in hybrid demucs tutorial (#2771)
    • Fix leaking matplotlib figure (#2769)
    • Update resampling tutorial (#2773)

    Recipes

    • Use lazy import for joblib (#2498)
    • Revise LibriSpeech Conformer RNN-T recipe (#2535)
    • Fix bug in Conformer RNN-T recipe (#2611)
    • Replace bg_iterator in examples (#2645)
    • Remove obsolete examples (#2655)
    • Fix LibriSpeech Conformer RNN-T eval script (#2666)
    • Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
    • Improve wav2vec2/hubert model for pre-training (#2716)
    • Improve hubert recipe for pre-training and fine-tuning (#2744)

    WER improvement on LibriSpeech dev and test sets

    | | Viterbi (v0.12) | Viterbi (v0.13) | KenLM (v0.12) | KenLM (v0.13) |
    | ----- | ----- | ----- | ----- | ----- |
    | dev-clean | 10.7 | 10.9 | 4.4 | 4.2 |
    | dev-other | 18.3 | 17.5 | 9.7 | 9.4 |
    | test-clean | 10.8 | 10.9 | 4.4 | 4.4 |
    | test-other | 18.5 | 17.8 | 10.1 | 9.5 |

    Documentation

    Examples

    • Add example for Vol transform (#2597)
    • Add example for Vad transform (#2598)
    • Add example for SlidingWindowCmn transform (#2600)
    • Add example for MelScale transform (#2616)
    • Add example for AmplitudeToDB transform (#2615)
    • Add example for InverseMelScale transform (#2635)
    • Add example for MFCC transform (#2637)
    • Add example for LFCC transform (#2640)
    • Add example for Loudness transform (#2641)

    Other

    • Remove CTC decoder prototype message (#2459)
    • Fix docstring (#2540)
    • Dataset docstring change (#2575)
    • Fix typo - "dimension" (#2596)
    • Add note for lexicon free decoder output (#2603)
    • Fix stylecheck (#2606)
    • Fix dataset docs parsing issue with extra spaces (#2607)
    • Remove outdated doc (#2617)
    • Use double quotes for string in functional and transforms (#2618)
    • Fix doc warning (#2627)
    • Update README.md (#2633)
    • Sphinx-gallery updates (#2629, #2638, #2736, #2678, #2679)
    • Tweak documentation (#2656)
    • Consolidate bibliography / reference (#2676)
    • Tweak badge link URL generation (#2677)
    • Adopt :autosummary: in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)
    • Update sox info docstring to account for mp3 frame count handling (#2742)
    • Fix HuBERT docstring (#2746)
    • Fix CTCDecoder doc (#2766)
    • Fix torchaudio.backend doc (#2781)

    Build/CI

    • Simplify the requirements to minimum runtime dependencies (#2313)
    • Bump version to 0.13 (#2460)
    • Add tagged builds to torchaudio (#2471)
    • Update config.guess to the latest (#2479)
    • Pin MKL to 2020.04 (#2486)
    • Integration test fix deleting temporary directory (#2569)
    • Refactor cmake (#2585)
    • Introducing pytorch-cuda metapackage (#2612)
    • Move xcode to 14 from 12.5 (#2622)
    • Update nightly wheels to ROCm5.2 (#2672)
    • Lint updates (#2389, #2487)
    • M1 build updates (#2473, #2474, #2496, #2674)
    • CUDA-related updates: versions, builds, and checks (#2501, #2623, #2670, #2707, #2710, #2721, #2724)
    • Release-related updates (#2489, #2492, #2495, #2759)
    • Fix Anaconda upload (#2581, #2621)
    • Fix windows python 3.8 loading path (#2735, #2747)
  • v0.12.1(Aug 5, 2022)

    This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements and documentation updates. No new features are added.

    Bug Fix

    • #2560 Fix fall back failure in sox_io backend
    • #2588 Fix hubert fine-tuning recipe bugs

    Improvement

    • #2552 Remove unused boost source code
    • #2527 Improve speech enhancement tutorial
    • #2544 Update forced alignment tutorial
    • #2595 Update data augmentation tutorial

    For the full feature list of v0.12, please refer to the v0.12.0 release note.

  • v0.12.0(Jun 28, 2022)

    TorchAudio 0.12.0 Release Notes

    Highlights

    TorchAudio 0.12.0 includes the following:

    • CTC beam search decoder
    • New beamforming modules and methods
    • Streaming API

    [Beta] CTC beam search decoder

    To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.

    For usage details, please check out the documentation and ASR inference tutorial.

    [Beta] New beamforming modules and methods

    To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:

    • Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks. The module can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
    • Add reference_channel as an input argument in the forward method to allow users to select the reference channel in model training or dynamically change the reference channel in inference.

    Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include

    For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.

    [Beta] Streaming API

    StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to

    • Decode various audio and video formats, including MP4 and AAC.
    • Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
    • Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
    • Apply various audio and video filters, such as low-pass filter and image scaling.
    • Decode video with Nvidia's hardware-based decoder (NVDEC).

    For usage details, please check out the documentation and tutorials:

    † To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.

    Backwards-incompatible changes

    I/O

    • MP3 decoding is now handled by FFmpeg in sox_io backend. (#2419, #2428)
      • FFmpeg is now used as fallback in sox_io backend, and now MP3 decoding is handled by FFmpeg. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution).
      • Note that, whereas the previous MP3 decoding scheme pads the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors.
      • torchaudio.info now returns num_frames=0 for MP3.

    Models

    • Change underlying implementation of RNN-T hypothesis to tuple (#2339)
      • In release 0.11, Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch as it accepts and returns a list of Hypothesis instances.
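
    The practical effect of this change on calling code can be sketched as follows. This is a minimal illustration with hypothetical, simplified two-field types; the real Hypothesis carries more fields, and the names below are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class HypothesisNamed(NamedTuple):
    """Stand-in for the 0.11 style: a namedtuple subclass with named fields."""

    tokens: List[int]
    score: float


# Stand-in for the 0.12 style: a plain alias of tuple, with no named fields.
HypothesisPlain = Tuple[List[int], float]


def get_tokens(hypothesis: Tuple[List[int], float]) -> List[int]:
    """Positional access works for both styles."""
    return hypothesis[0]


old_style = HypothesisNamed(tokens=[7, 3], score=-0.5)
new_style: HypothesisPlain = ([7, 3], -0.5)

same_content = old_style == new_style          # tuples compare by content
attribute_access_survives = hasattr(new_style, "tokens")  # named access is gone
```

    Code that read fields by attribute name under the namedtuple form needs to switch to positional access, or to helper functions like the hypothetical get_tokens above.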

    Bug Fixes

    Ops

    • Fix return dtype in MVDR module (#2376)
      • In release 0.11, the MVDR module converts the dtype of input spectrum to complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, is not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.

    Build

    • Fix Kaldi submodule integration (#2269)
    • Pin jinja2 version for build_docs (#2292)
    • Use sourceforge url to fetch zlib (#2297)

    New Features

    I/O

    • Add Streaming API (#2041, #2042, #2043, #2044, #2045, #2046, #2047, #2111, #2113, #2114, #2115, #2135, #2164, #2168, #2202, #2204, #2263, #2264, #2312, #2373, #2378, #2402, #2403, #2427, #2429)
    • Add YUV420P format support to Streaming API (#2334)
    • Support specifying decoder and its options (#2327)
    • Add NV12 format support in Streaming API (#2330)
    • Add HW acceleration support on Streaming API (#2331)
    • Add file-like object support to Streaming API (#2400)
    • Make FFmpeg log level configurable (#2439)
    • Set the default ffmpeg log level to FATAL (#2447)

    Ops

    • New beamforming methods (#2227, #2228, #2229, #2230, #2231, #2232, #2369, #2401)
    • New MVDR modules (#2367, #2368)
    • Add and refactor CTC lexicon beam search decoder (#2075, #2079, #2089, #2112, #2117, #2136, #2174, #2184, #2185, #2273, #2289)
    • Add lexicon free CTC decoder (#2342)
    • Add Pretrained LM Support for Decoder (#2275)
    • Move CTC beam search decoder to beta (#2410)

    Datasets

    • Add QUESST14 dataset (#2290, #2435, #2458)
    • Add LibriLightLimited dataset (#2302)

    Improvements

    I/O

    • Use FFmpeg-based I/O as fallback in sox_io backend. (#2416, #2418, #2423)

    Ops

    • Raise error for resampling int waveform (#2318)
    • Move multi-channel modules to a separate file (#2382)
    • Refactor MVDR module (#2383)

    Models

    • Add an option to use Tanh instead of ReLU in RNNT joiner (#2319)
    • Support GroupNorm and re-ordering Convolution/MHA in Conformer (#2320)
    • Add extra arguments to hubert pretrain factory functions (#2345)
    • Add feature_grad_mult argument to HuBERTPretrainModel (#2335)

    Datasets

    • Refactor LibriSpeech dataset (#2387)
    • Raising RuntimeErrors when datasets missing (#2430)

    Performance

    • Make PitchShift faster by caching resampling kernel (#2441) The following table illustrates the performance improvement over the previous release by comparing the time in milliseconds that torchaudio.transforms.PitchShift takes, after its first call, to process a float32 tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shifted steps.

    | TorchAudio Version | 2 | 3 | 4 | 5 |
    | ----- | ----- | ----- | ----- | ----- |
    | 0.12 | 2.76 | 5 | 1860 | 223 |
    | 0.11 | 6.71 | 161 | 8680 | 1450 |
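
    The idea of caching a resampling kernel so that only the first call pays for its construction can be sketched with the standard library. This is a minimal illustration, not TorchAudio's code: the toy Hann-windowed sinc kernel, the function name, and the half_width parameter are assumptions made for the example.

```python
import math
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def sinc_kernel(orig_freq: int, new_freq: int, half_width: int = 6) -> Tuple[float, ...]:
    """Build toy Hann-windowed sinc low-pass taps (illustrative only)."""
    divisor = math.gcd(orig_freq, new_freq)
    orig, new = orig_freq // divisor, new_freq // divisor
    cutoff = min(1.0, new / orig)  # only narrow the band when downsampling
    taps = []
    for n in range(-half_width, half_width + 1):
        x = n * cutoff
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 * (1.0 + math.cos(math.pi * n / (half_width + 1)))
        taps.append(cutoff * sinc * window)
    return tuple(taps)


first = sinc_kernel(8000, 44100)   # computed on the first call
second = sinc_kernel(8000, 44100)  # returned from the cache
info = sinc_kernel.cache_info()    # hit and miss counters kept by lru_cache
```

    The cache is keyed on the arguments, so repeated calls with the same pair of rates reuse the kernel, while a new pair of rates triggers one fresh computation.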

    Tests

    • Add complex dtype support in functional autograd test (#2244)
    • Refactor torchscript consistency test in functional (#2246)
    • Add unit tests for PyTorch Lightning modules of emformer_rnnt recipes (#2240)
    • Refactor batch consistency test in functional (#2245)
    • Run smoke tests on regular PRs (#2364)
    • Refactor smoke test executions (#2365)
    • Move seed to setup (#2425)
    • Remove possible manual seeds from test files (#2436)

    Build

    • Revise the parameterization of third party libraries (#2282)
    • Use zlib v1.2.12 with GitHub source (#2300)
    • Fix ffmpeg integration for ffmpeg 5.0 (#2326)
    • Use custom FFmpeg libraries for torchaudio binary distributions (#2355)
    • Adding m1 builds to torchaudio (#2421)

    Other

    • Add download utility specialized for torchaudio (#2283)
    • Use module-level __getattr__ to implement delayed initialization (#2377)
    • Update build_doc job to use Conda CUDA package (#2395)
    • Update I/O initialization (#2417)
    • Add Python 3.10 (build and test) (#2224)
    • Retrieve version from version.txt (#2434)
    • Disable OpenMP on mac (#2431)

    Examples

    Ops

    • Add CTC decoder example for librispeech (#2130, #2161)
    • Fix LM, arguments in CTC decoding script (#2235, #2315)
    • Use pretrained LM API for decoder example (#2317)

    Pipelines

    • Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles (#2203)
    • Refactor eval and pipeline_demo scripts in emformer_rnnt (#2238)
    • Refactor pipeline_demo script in emformer_rnnt recipes (#2239)
    • Add EMFORMER_RNNT_BASE_MUSTC into pipeline demo script (#2248)

    Tests

    • Add unit tests for Emformer RNN-T LibriSpeech recipe (#2216)
    • Add fixed random seed for Emformer RNN-T recipe test (#2220)

    Training recipes

    • Add recipe for HuBERT model pre-training (#2143, #2198, #2296, #2310, #2311, #2412)
    • Add HuBERT fine-tuning recipe (#2352)
    • Refactor Emformer RNNT recipes (#2212)
    • Fix bugs from Emformer RNN-T recipes merge (#2217)
    • Add SentencePiece model training script for LibriSpeech Emformer RNN-T (#2218)
    • Add training recipe for Emformer RNNT trained on MuST-C release v2.0 dataset (#2219)
    • Refactor ArgumentParser arguments in emformer_rnnt recipes (#2236)
    • Add shebang lines to scripts in emformer_rnnt recipes (#2237)
    • Introduce DistributedBatchSampler (#2299)
    • Add Conformer RNN-T LibriSpeech training recipe (#2329)
    • Refactor LibriSpeech Conformer RNN-T recipe (#2366)
    • Refactor LibriSpeech Lightning datamodule to accommodate different dataset implementations (#2437)

    Prototypes

    Models

    • Add Conformer RNN-T model prototype (#2322)
    • Add ConvEmformer module (streaming-capable Conformer) (#2324, #2358)
    • Add conv_tasnet_base factory function to prototype (#2411)

    Pipelines

    • Add EMFORMER_RNNT_BASE_MUSTC bundle to torchaudio.prototype (#2241)

    Documentation

    • Add ASR CTC decoding inference tutorial (#2106)
    • Update context building to not delay the inference (#2213)
    • Update online ASR tutorial (#2226)
    • Update CTC decoder docs and add citation (#2278)
    • [Doc] fix typo and backlink (#2281)
    • Fix calculation of SNR value in tutorial (#2285)
    • Add notes about prototype features in tutorials (#2288)
    • Update README around version compatibility matrix (#2293)
    • Update decoder pretrained lm docs (#2291)
    • Add devices/properties badges (#2321)
    • Fix LibriMix documentation (#2351)
    • Update wavernn.py (#2347)
    • Add citations for datasets (#2371)
    • Update audio I/O tutorials (#2385)
    • Update MVDR beamforming tutorial (#2398)
    • Update audio feature extraction tutorial (#2391)
    • Update audio resampling tutorial (#2386)
    • Update audio data augmentation tutorial (#2388)
    • Add tutorial to use NVDEC with Stream API (#2393)
    • Expand subsections in tutorials by default (#2397)
    • Fix documentation (#2407)
    • Fix documentation (#2409)
    • Dataset doc fixes (#2426)
    • Update CTC decoder docs (#2443)
    • Split Streaming API tutorials into two (#2446)
    • Update HW decoding tutorial and add notes about unseekable object (#2408)
  • v0.11.0(Mar 10, 2022)

    torchaudio 0.11.0 Release Note

    Highlights

    TorchAudio 0.11.0 release includes:

    • Emformer (paper) RNN-T components, training recipe, and pre-trained pipeline for streaming ASR
    • Voxpopuli pre-trained pipelines
    • HuBERTPretrainModel for training HuBERT from scratch
    • Conformer model for speech recognition
    • Drop Python 3.6 support

    [Beta] Emformer RNN-T

    To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which in whole allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.

    [Beta] HuBERT Pretrain Model

    The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.

    [Beta] Conformer (paper)

    The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.

    Backward-incompatible changes

    Ops

    • Removed deprecated F.magphase, F.angle, F.complex_norm, and T.ComplexNorm. (#1934, #1935, #1942)
      • Utility functions for pseudo complex types were deprecated in 0.10, and they are now removed in 0.11. For details of the migration plan, please refer to #1337.
    • Dropped pseudo complex support from F.spectrogram, T.Spectrogram, F.phase_vocoder, and T.TimeStretch (#1957, #1958)
      • Support for the pseudo complex type was deprecated in 0.10, and it is now removed in 0.11. For details of the migration plan, please refer to #1337.
    • Removed deprecated create_fb_matrix (#1998)
      • create_fb_matrix was replaced by melscale_fbanks in release 0.10. It is removed in 0.11. Please use melscale_fbanks.
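
    The pseudo complex removals listed above amount to replacing a trailing (real, imaginary) pair with a native complex value, after which magnitude and phase come from ordinary complex operations. The sketch below shows that idea in plain Python with hypothetical helper names. In PyTorch the analogous operations are torch.view_as_complex, torch.abs, and torch.angle, which are not exercised here.

```python
import cmath
from typing import List, Sequence, Tuple


def to_native(pseudo: Sequence[Sequence[float]]) -> List[complex]:
    """Convert [real, imag] pairs (the pseudo complex layout) into native complex numbers."""
    return [complex(real, imag) for real, imag in pseudo]


def magnitude_and_phase(values: Sequence[complex]) -> Tuple[List[float], List[float]]:
    """Magnitude and phase of each native complex value."""
    return [abs(v) for v in values], [cmath.phase(v) for v in values]


pseudo_spectrum = [[3.0, 4.0], [0.0, 1.0]]  # two bins stored as [real, imag]
native_spectrum = to_native(pseudo_spectrum)
magnitudes, phases = magnitude_and_phase(native_spectrum)
```

    Once data is held as native complex values, separate utilities for norm and angle are unnecessary, which is the motivation for removing the pseudo complex helpers.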

    Datasets

    • Removed deprecated VCTK (#1825)
      • The original VCTK archive file is no longer accessible. Please migrate to VCTK_092 class for the latest version of the dataset.
    • Removed deprecated dataset utils (#1826)
      • Undocumented methods diskcache_iterator and bg_iterator were deprecated in 0.10. They are removed in 0.11. Please stop using them.

    Models

    • Removed unused dimension from pretrained Wav2Vec2 ASR (#1914)
      • The final linear layer of Wav2Vec2 ASR models included dimensions (<s>, <pad>, </s>, <unk>) that were not related to ASR tasks and not used. These dimensions were removed.

    Build

    • Dropped support for Python 3.6 (#2119, #2139)
      • Following the end of the Python 3.6 lifecycle, torchaudio dropped support for Python 3.6.

    New Features

    RNN-T Emformer

    • Introduced Emformer (#1801)
    • Added Emformer RNN-T model (#2003)
    • Added RNN-T beam search decoder (#2028)
    • Cleaned up Emformer module (#2091)
    • Added pretrained Emformer RNN-T streaming ASR inference pipeline (#2093)
    • Reorganized RNN-T components in prototype module (#2110)
    • Added integration test for Emformer RNN-T LibriSpeech pipeline (#2172)
    • Registered RNN-T pipeline global stats constants as buffers (#2175)
    • Refactored RNN-T factory function to support num_symbols argument (#2178)
    • Fixed output shape description in RNN-T docstrings (#2179)
    • Removed invalid token blanking logic from RNN-T decoder (#2180)
    • Updated stale prototype references (#2189)
    • Revised RNN-T pipeline streaming decoding logic (#2192)
    • Cleaned up Emformer (#2207)
    • Applied minor fixes to Emformer implementation (#2252)

    Conformer

    • Introduced Conformer (#2068)
    • Removed subsampling and positional embedding logic from Conformer (#2171)
    • Moved ASR features out of prototype (#2187)
    • Passed bias and dropout args to Conformer convolution block (#2215)
    • Adjusted Conformer args (#2223)

    Datasets

    • Added DR-VCTK dataset (#1819)

    Models

    • Added HuBERT pretrain model to enable training from scratch (#2064)
    • Added feature mean square value to HuBERT Pretrain model output (#2128)

    Pipelines

    • Added wav2vec2 ASR French pretrained from voxpopuli (#1919)
    • Added wav2vec2 ASR Spanish pretrained model from voxpopuli (#1924)
    • Added wav2vec2 ASR German pretrained model from voxpopuli (#1953)
    • Added wav2vec2 ASR Italian pretrained model from voxpopuli (#1954)
    • Added wav2vec2 ASR English pretrained model from voxpopuli (#1956)

    Build

    • Added CUDA-11.5 builds to torchaudio (#2067)

    Improvements

    I/O

    • Fixed load behavior for 24-bit input (#2084)

    Ops

    • Added OpenMP support (#1761)
    • Improved MVDR stability (#2004)
    • Relaxed dtype for MVDR (#2024)
    • Added warnings in mu_law* for the wrong input type (#2034)
    • Added parameter p to TimeMasking (#2090)
    • Removed unused vars from RNN-T loss (#2142)
    • Removed complex32 dtype in F.griffinlim (#2233)

    Datasets

    • Deprecated data utils (#2073)
    • Updated URLs for libritts (#2074)
    • Added subset support for TEDLIUM release3 dataset (#2157)

    Models

    • Replaced dropout with Dropout (#1815)
    • Inplace initialization of RNN weights (#2010)
    • Updated to xavier_uniform and avoid legacy data.uniform_ initialization (#2018)
    • Allowed Tacotron2 decode batch_size 1 examples (#2156)

    Pipelines

    • Added tool to convert voxpopuli model (#1923)
    • Refactored wav2vec2 pipeline util (#1925)
    • Allowed the customization of axis exclusion for ASR head (#1932)
    • Tweaked wav2vec2 checkpoint conversion tool (#1938)
    • Added melkwargs setting for MFCC in HuBERT pipeline (#1949)

    Documentation

    • Added 0.10.0 to version compatibility matrix (#1862)
    • Removed MACOSX_DEPLOYMENT_TARGET (#1880)
    • Updated intersphinx inventory (#1893)
    • Updated compatibility matrix to include LTS version (#1896)
    • Updated CONTRIBUTING with doc conventions (#1898)
    • Added anaconda stats to README (#1910)
    • Updated README.md (#1916)
    • Added citation information (#1947)
    • Updated CONTRIBUTING.md (#1975)
    • Doc fixes (#1982)
    • Added tutorial to CONTRIBUTING (#1990)
    • Fixed docstring (#2002)
    • Fixed minor typo (#2012)
    • Updated audio augmentation tutorial (#2082)
    • Added Sphinx gallery automatically (#2101)
    • Disabled matplotlib warning in tutorial rendering (#2107)
    • Updated prototype documentations (#2108)
    • Added custom CSS to make signatures appear in multi-line (#2123)
    • Updated prototype pipeline documentation (#2148)
    • Tweaked documentation (#2152)

    Tests

    • Refactored integration test (#1922)
    • Enabled integration tests on CI (#1939)
    • Removed facebook folder in wav2vec unit tests (#2015)
    • Temporarily skipped threadpool test (#2025)
    • Revised Griffin-Lim transform test to reduce execution time (#2037)
    • Fixed CircleCI test failures (#2069)
    • Do not auto-skip tests on CI (#2127)
    • Relaxed absolute tolerance for Kaldi compat tests (#2165)
    • Added tacotron2 unit test with different batch_size (#2176)

    Build

    • Updated GPU resource class (#1791)
    • Updated the main version to 0.11.0 (#1793)
    • Updated windows cuda installer 11.1.0 to 11.1.1 (#1795)
    • Renamed build_tools to tools (#1812)
    • Limit Windows GPU testing to CUDA-11.3 only (#1842)
    • Used cu113 for unittest_windows_gpu (#1853)
    • USE_CUDA in windows and reduce one vcvarsall (#1854)
    • Check torch installation before building package (#1867)
    • Install tools from conda instead of brew (#1873)
    • Cleaned up setup.py (#1900)
    • Moved TorchAudio conda package to use pytorch-mutex (#1904)
    • Updated smoke test docker image (#1905)
    • Fixed formatting CIRCLECI_TAG when building docs (#1915)
    • Fetch third party sources automatically (#1966)
    • Disabled SPHINXOPT=-W for local env (#2013)
    • Improved installing nightly pytorch (#2026)
    • Improved cuda installation on windows (#2032)
    • Refactored the library loading mechanism (#2038)
    • Cleaned up libtorchaudio customization logic (#2039)
    • Refactored and functionize the library definition (#2040)
    • Introduced helper function to define extension (#2077)
    • Standardized the location of third-party source code (#2086)
    • Show lint diff with color (#2102)
    • Updated third party submodule setup (#2132)
    • Suppressed stderr from subprocess in setup.py (#2133)
    • Fixed header include (#2135)
    • Updated ROCM version 4.1 -> 4.3.1 and 4.5 (#2186)
    • Added "cu102" back (#2190)
    • Pinned flake8 version (#2191)

    Style

    • Removed trailing whitespace (#1803)
    • Fixed style checks (#1913)
    • Resolved lint warning (#1971)
    • Enabled CLANGFORMAT (#1999)
    • Fixed style checks in examples/tutorials (#2006)
    • OSS config for lint checks (#2066)
    • Excluded sphinx-gallery examples (#2071)
    • Reverted linting exemptions introduced in #2071 (#2087)
    • Applied arc lint to pytorch audio (#2096)
    • Enforced lint checks and fix/mute lint errors (#2116)

    Other

    • Replaced issue templates with new issue forms (#1802)
    • Notify merger if PR is incorrectly labeled (#1937)
    • Added script to collect PRs between commits (#1943)
    • Fixed PR labeling requirement (#1946)
    • Refactored collecting-PR script for release note (#1951)
    • Fixed bandit failure (#1960)
    • Renamed bug fix label (#1961)
    • Updated PR label notifier (#1964)
    • Reverted "Update PR label notifier (#1964)" (#1965)
    • Consolidated network utils (#1974)
    • Added PR collecting script (#2008)
    • Re-sync with internal repository (#2017)
    • Updated script for getting PR merger and labels (#2030)
    • Fixed third party archive fetch job (#2095)
    • Use python:3.X Docker image for build doc (#2151)
    • Updated PR labeling workflow (#2160)
    • Fixed librosa calls (#2208)

    Examples

    Ops

    • Removed the MVDR tutorial in examples (#2109)
    • Abstracted BucketizeSampler to be usable outside of HuBERT example (#2147)
    • Refactored BucketizeBatchSampler and HuBERTDataset (#2150)
    • Removed multiprocessing from audio dataset tutorial (#2163)

    Models

    • Added training recipe for RNN-T Emformer ASR model (#2052)
    • Added global stats script and new json for LibriSpeech RNN-T training recipe (#2183)

    Pipelines

    • Added preprocessing scripts for HuBERT model training (#1911)
    • Supported multi-node training for source separation pipeline (#1968)
    • Added bucketize sampler and dataset for HuBERT Base model training pipeline (#2000)
    • Added librispeech inference script (#2130)

    Other

    • Added unmaintained warnings (#1813)
    • torch.quantization -> torch.ao.quantization (#1823)
    • Use download.pytorch.org for asset URL (#2182)
    • Added deprecation path for renamed training type plugins (#11227)
    • Renamed DDPPlugin to DDPStrategy (#11142)
  • v0.10.2(Jan 27, 2022)

  • v0.10.1(Dec 16, 2021)

    This is a minor release, which is compatible with PyTorch 1.10.1 and includes a small bug fix, improvements and documentation updates. No new features are added.

    Bug Fix

    • #2050 Allow whitespace as TORCH_CUDA_ARCH_LIST delimiter

    Improvement

    • #2054 Fetch third party source code automatically. The build process now fetches third party source code (git submodule and cmake external projects).
    • #2059 Improve documentation

    For the full feature list of v0.10, please refer to the v0.10.0 release note.

  • v0.10.0(Oct 21, 2021)

    torchaudio 0.10.0 Release Note

    Highlights

    torchaudio 0.10.0 release includes:

    • New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
    • Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
    • New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
    • CUDA-enabled binaries

    [Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights

    HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning and HuBERT is added.

    These pretrained weights can be used for feature extractions and downstream task adaptation.

    >>> import torchaudio
    >>>
    >>> # Build the model and load pretrained weight.
    >>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
    >>> # Perform feature extraction.
    >>> features, lengths = model.extract_features(waveforms)
    >>> # Pass the features to downstream task
    >>> ...
    

    Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)

    >>> import torchaudio
    >>>
    >>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
    >>>
    >>> # Build the model and load pretrained weight.
    >>> model = bundle.get_model()
    Downloading:
    100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
    >>> # Check the corresponding labels of the output.
    >>> labels = bundle.get_labels()
    >>> print(labels)
    ('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
    >>>
    >>> # Infer the label probability distribution
    >>> waveform, sample_rate = torchaudio.load('hello-world.wav')
    >>>
    >>> emissions, _ = model(waveform)
    >>>
    >>> # Pass emission to (hypothetical) decoder
    >>> transcripts = ctc_decode(emissions, labels)
    >>> print(transcripts[0])
    HELLO WORLD
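
The `ctc_decode` call above is hypothetical. As a minimal sketch of what such a step could look like, the following pure-Python greedy decoder picks the best label per frame, merges repeated labels, and drops special tokens. The function name, the choice of ignored indices (0 to 3) and the use of `|` as a word separator are assumptions made for illustration, not part of torchaudio.

```python
from itertools import groupby


def greedy_ctc_decode(emission, labels, ignore=(0, 1, 2, 3), word_sep="|"):
    """Collapse frame-by-frame scores into a transcript (illustrative only).

    emission: list of frames, each a list of per-label scores.
    ignore: label indices dropped after merging repeats; assumed here to be
            the special tokens '<s>', '<pad>', '</s>' and '<unk>'.
    """
    # Pick the best-scoring label index for every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    # Merge runs of the same index, then drop the ignored tokens.
    collapsed = [index for index, _ in groupby(best)]
    chars = [labels[i] for i in collapsed if i not in ignore]
    return "".join(chars).replace(word_sep, " ").strip()


# A shortened label set in the same layout as the one printed above.
labels = ('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H')


def one_hot(index):
    """Build a fake emission frame that puts all its score on one label."""
    return [1.0 if i == index else 0.0 for i in range(len(labels))]


# Frames spelling: H H <s> I | | T O
frames = [one_hot(labels.index(s)) for s in ['H', 'H', '<s>', 'I', '|', '|', 'T', 'O']]
transcript = greedy_ctc_decode(frames, labels)
print(transcript)
```

With a real model, each frame would be one row of the emission tensor converted to a list. A beam search or a language-model-based decoder would normally give better transcripts than this greedy pass.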
    

    [Beta] Tacotron2 and TTS Pipeline

    A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module. The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.

    >>> import torchaudio
    >>>
    >>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
    >>>
    >>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
    >>> processor = bundle.get_text_processor()
    >>> tacotron2 = bundle.get_tacotron2()
    Downloading:
    100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
    >>> vocoder = bundle.get_vocoder()
    Downloading:
    100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
    >>>
    >>> text = "Hello World!"
    >>>
    >>> # Encode text
    >>> input, lengths = processor(text)
    >>>
    >>> # Generate (mel-scale) spectrogram
    >>> specgram, lengths, _ = tacotron2.infer(input, lengths)
    >>>
    >>> # Convert spectrogram to waveform
    >>> waveforms, lengths = vocoder(specgram, lengths)
    >>>
    >>> # Save audio
    >>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
    

    [Beta] RNN Transducer Loss

    The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and TorchScript support, and can be run on both CPU and GPU. The GPU path has a custom CUDA kernel implementation for improved performance.

    [Beta] MVDR Beamforming

    This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. There are three solutions (ref_channel, stv_evd, stv_power), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged within the method). It provides an online option that recursively updates the parameters for streaming audio. Please refer to the MVDR tutorial.

    GPU Build

    This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.

    Additional Features

    torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.

    Backward Incompatible Changes

    I/O

    • Default to PCM_16 for flac on soundfile backend (#1604)
      • When saving FLAC format with “soundfile” backend, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer this.

    Ops

    • Default to native complex type when returning raw spectrogram (#1549)
      • When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
    • Remove deprecated kaldi.resample_waveform (#1555)
      • Please use torchaudio.functional.resample.
    • Replace waveform with specgram in SlidingWindowCmn (#1859)
      • The argument name was corrected to specgram.
    • Ensure integer input frequencies for resample (#1857)
      • Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.
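
To illustrate the integer sampling rate requirement from #1857, here is a minimal sketch of a check that rejects fractional rates instead of truncating them, then reduces the pair by its greatest common divisor. The helper name `reduced_rates` is hypothetical, and the assumption that a resampler works from such a reduced integer ratio is stated for illustration only.

```python
import math


def reduced_rates(orig_freq, new_freq):
    """Return the sampling-rate pair reduced by its greatest common divisor.

    Hypothetical helper: fractional rates raise an error rather than being
    silently truncated to integers.
    """
    for name, value in (("orig_freq", orig_freq), ("new_freq", new_freq)):
        if isinstance(value, float) and not value.is_integer():
            raise ValueError(f"{name} must be an integer sampling rate, got {value}")
    orig, new = int(orig_freq), int(new_freq)
    gcd = math.gcd(orig, new)
    return orig // gcd, new // gcd


print(reduced_rates(44100, 16000))

# Silent truncation would change the effective ratio without any warning:
print(int(22050.5))
```

Raising early makes the mismatch visible to the caller, whereas truncation would resample at a slightly different ratio than the one requested.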

    Wav2Vec2

    • Update extract_features of Wav2Vec2Model (#1776)
      • The previous implementation returned outputs from convolutional feature extractors. To match the behavior with the original fairseq’s implementation, the method was changed to return the outputs of the intermediate layers of transformer layers. To achieve the original behavior, please use Wav2Vec2Model.feature_extractor().
    • Move fine-tune specific module out of wav2vec2 encoder (#1782)
      • The internal structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module was moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
    • Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
      • The signatures of wav2vec2 factory functions are changed. num_out parameter has been changed to aux_num_out and other parameters are added before it. Please update the code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
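
For the state dict change in #1782, the key replacement can be sketched as a prefix rewrite over a plain mapping. The helper name `migrate_state_dict` and the example keys are illustrative assumptions, and the string values stand in for tensors; the same loop applies to a dict loaded from a checkpoint.

```python
from collections import OrderedDict


def migrate_state_dict(state_dict, old_prefix="encoder.read_out.", new_prefix="aux."):
    """Return a copy of state_dict with old_prefix keys renamed to new_prefix."""
    migrated = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            # Keep the parameter name (weight, bias, ...) and swap the prefix.
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value
    return migrated


# Placeholder entries; real values would be tensors.
old = OrderedDict([
    ("encoder.transformer.layer_norm.weight", "w0"),
    ("encoder.read_out.weight", "w1"),
    ("encoder.read_out.bias", "b1"),
])
new = migrate_state_dict(old)
print(list(new))
```

Building a new mapping, rather than mutating the old one while iterating, keeps the original checkpoint intact and preserves key order.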

    Deprecations

    • Add melscale_fbanks and deprecate create_fb_matrix (#1653)
      • As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
    • Deprecate VCTK dataset (#1810)
      • This dataset has been taken down and is no longer available. Please use VCTK_092 dataset.
    • Deprecate data utils (#1809)
      • bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.

    New Features

    Models

    Tacotron2

    • Add Tacotron2 model (#1621, #1647, #1844)
    • Add Tacotron2 loss function (#1764)
    • Add Tacotron2 inference method (#1648, #1839, #1849)
    • Add phoneme text preprocessing for Tacotron2 (#1668)
    • Move Tacotron2 out of prototype (#1714)

    HuBERT

    • Add HuBERT model architectures (#1769, #1811)

    Pretrained Weights and Pipelines

    • Add pretrained weights for wavernn (#1612)

    • Add Tacotron2 pretrained models (#1693)

    • Add HUBERT pretrained weights (#1821, #1824)

    • Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)

    • Add customization support to wav2vec2 labels (#1834)

    • Default pretrained weights to eval mode (#1843)

    • Move wav2vec2 pretrained models to pipelines module (#1876)

    • Add TTS bundle/pipelines (#1872)

    • Fix vocoder interface (#1895)

    • Fix Phonemizer download (#1897)


    RNN Transducer Loss

    • Add reduction parameter for RNNT loss (#1590)

    • Rename RNNT loss C++ parameters (#1602)

    • Rename transducer to RNNT (#1603)

    • Remove gradient variable from RNNT loss Python code (#1616)

    • Remove reuse_logits_for_grads option for RNNT loss (#1610)

    • Remove fused_log_softmax option from RNNT loss (#1615)

    • RNNT loss resolve null gradient (#1707)

    • Move RNNT loss out of prototype (#1711)


    MVDR Beamforming

    • Add MVDR module to example (#1709)

    • Add normalization to steering vector solutions in MVDR Module (#1765)

    • Move MVDR and PSD modules to transforms (#1771)

    • Add MVDR beamforming tutorial to example directory (#1768)


    Ops

    • Add edit_distance (#1601)

    • Add PitchShift to functional and transform (#1629)

    • Add LFCC feature to transforms (#1611)

    • Add InverseSpectrogram to transforms and functional (#1652)


    Datasets

    • Add CMUDict dataset (#1627)

    • Move LibriMix dataset to datasets directory (#1833)


    Improvements

    I/O

    • Make buffer size for function info configurable (#1634)


    Ops

    • Replace deprecated AutoNonVariableTypeMode (#1583)

    • Remove lazy behavior from MelScale (#1636)

    • Simplify axis value checks (#1501)

    • Use at::parallel_for in lfilter core loop (#1557)

    • Add filterbanks support to lfilter (#1587)

    • Add batch support to lfilter (#1638)

    • Use integer rates in pitch shift resample (#1861)


    Models

    • Rename infer method to forward for WaveRNNInferenceWrapper (#1650)

    • Refactor WaveRNN infer and move it to the codebase (#1704)

    • Make the core wav2vec2 factory function public (#1829)

    • Refactor WaveRNNInferenceWrapper (#1845)

    • Store n_bits in WaveRNN (#1847)

    • Replace custom padding with torch’s native impl (#1846)

    • Avoid concatenation in loop (#1850)

    • Add lengths param to WaveRNN.infer (#1851)

    • Add sample rate to wav2vec2 bundle (#1878)

    • Remove factory functions of Tacotron2 and WaveRNN (#1874)


    Datasets

    • Fix encoding of CMUDict data reading (#1665)

    • Rename utterance to transcript in datasets (#1841)

    • Clean up constructor of CMUDict (#1852)


    Performance

    • Refactor transforms.Fade on GPU computation (#1871)

    CUDA Tensor shape  [1,4,8000]  [1,4,16000]  [1,4,32000]
    0.10               119         120          123
    0.9                160         184          240

    Unit: msec

    Examples

    • Add text preprocessing utilities for TTS pipeline (#1639)

    • Replace simple_ctc with Python greedy decoder (#1558)

    • Add an inference example for WaveRNN (#1637)

    • Refactor coding style for WaveRNN example (#1663)

    • Add style checks on example files on CI (#1667)

    • Add Tacotron2 training script (#1642)

    • Add an inference example for Tacotron2 (#1654)

    • Fix Tacotron2 inference example (#1716)

    • Fix WaveRNN training example (#1740)

    • Training recipe for ConvTasNet on Libri2Mix dataset (#1757)


    Build

    • Update skipIfNoCuda decorator and force GPU tests in GPU CIs (#1559)

    • Temporarily pin nightly version on Linux/macOS CPU unittest (#1598)

    • Temporarily pin nightly version on Linux GPU unittest (#1606)

    • Revert CI hot fix (#1614)

    • Expose USE_CUDA in build (#1609)

    • Pin MKL to 2021.2.0 (#1655)

    • Simplify extension initialization (#1649)

    • Synchronize extension initialization mechanism with fbcode (#1682)

    • Ensure we’re propagating BUILD_VERSION (#1697)

    • Guard Kaldi’s version generation (#1715)

    • Update sphinx to 3.5.4 (#1685)

    • Default to BUILD_SOX=1 in non-Windows systems (#1725)

    • Add CUDA install step to Win Packaging jobs (#1732)

    • setup.py should parse TORCH_CUDA_ARCH_LIST (#1733)

    • Simplify the extension initialization process (#1734)

    • Fix CUDA build logic for _torchaudio.so (#1737)

    • Enable Linux wheel/conda GPU package builds (#1730)

    • Increase no_output_timeout to 20m for WinConda (#1738)

    • Build torchaudio for 11.3 as well (#1747)

    • Upload wheels to respective folders (#1751)

    • Extract PyBind11 feature implementations (#1739)

    • Update the way to access libsox global config (#1755)

    • Fix ROCM build error (#1729)

    • Fix compile warnings (#1762)

    • Migrate CircleCI docker image (#1767)

    • Split extension into custom impl and Python wrapper libraries (#1752)

    • Put libtorchaudio in lib directory (#1773)

    • Update win gpu image from previous to stable (#1786)

    • Set libtorch audio suffix as pyd on Windows (#1788)

    • Fix build on Windows with CUDA (#1787)

    • Enable audio windows cuda tests (#1777)

    • Set release and base PyTorch version (#1816)

    • Exclude prototype if it is in release (#1870)

    • Log prototype exclusion (#1882)

    • Update prototype exclusion (#1885)

    • Remove alpha from version number (#1901)


    Testing

    • Migrate resample tests from kaldi to functional (#1520)

    • Add autograd gradcheck test for RNN transducer loss (#1532)

    • Fix HF wav2vec2 test (#1585)

    • Update unit test CUDA to 10.2 (#1605)

    • Fix CircleCI unittest environment

    • Remove skipIfRocm from test_fileobj_flac in soundfile.save_test (#1626)

    • MFCC test refactor (#1618)

    • Refactor RNNT Loss Unit Tests (#1630)

    • Reduce sample rate to avoid test time out (#1640)

    • Refactor text preprocessing tests in Tacotron2 example (#1635)

    • Move test initialization logic to dedicated directory (#1680)

    • Update pitch shift batch consistency test (#1700)

    • Refactor scripting in test (#1727)

    • Update the version of fairseq used for testing (#1745)

    • Put output tensor on proper device in get_whitenoise (#1744)

    • Refactor batch consistency test in transforms (#1772)

    • Tweak test name by appending factory function name (#1780)

    • Enable audio windows cuda tests (#1777)

    • Skip hubert_asr_xlarge TS test on Windows (#1800)

    • Skip hubert_xlarge TS test on Windows (#1807)


    Others

    • Remove unused files (#1588)

    • Remove residuals for removed modules (#1599)

    • Remove torchscript bc test references (#1623)

    • Remove torchaudio._internal.fft module (#1631)


    Misc

    • Rename master branch to main (#1649)

    • Fix Python spacing (#1670)

    • Lint fix (#1726)

    • Add .gitattributes (#1731)

    • Style fixes (#1766)

    • Update reference from master to main elsewhere (#1784)


    Bug Fixes

    • Fix models import (#1664)

    • Fix HF model integration (#1781)


    Documentation

    • README Updates

      • Update README (#1544)

      • Remove NumPy dependency from README (#1582)

      • Fix typos and sentence structure in README.md (#1633)

      • Update and move convention section to CONTRIBUTING.md (#1635)

      • Remove unnecessary README (#1728)

      • Add link to TTS colab example to README (#1748)

      • Fix typo in source separation README (#1774)

    • Docstring Changes

      • Set removal version of pseudo complex support (#1553)

      • Update docs (#1584)

      • Add return type in doc for RNNT loss (#1591)

      • Improve RNNT loss docstrings (#1642)

      • Add documentation for CMUDict’s property (#1683)

      • Refactor lfilter docs (#1698)

      • Standardize optional types in docstrings (#1746)

      • Fix return type of wav2vec2 model (#1790)

      • Add equations to MVDR docstring (#1789)

      • Standardize tensor shapes format in docs (#1838)

      • Add license to pre-trained model doc (#1836)

      • Update Tacotron2 docs (#1840)

      • Fix PitchShift docstring (#1866)

      • Update descriptions of lengths parameters (#1890)

      • Standardization and minor fixes (#1892)

      • Update models/pipelines doc (#1894)

    • Docs formatting

      • Remove override CSS (#1554)

      • Add prototype.tacotron2 page to docs (#1695)

      • Add doc for InverseSpectrogram (#1706)

      • Add sections to transforms docs (#1720)

      • Add edit_distance to documentation with a new category Metric (#1743)

      • Fix model subsections (#1775)

      • List all the pre-trained models on right bar (#1828)

      • Put pretrained weights to subsection (#1879)

    • Examples (see #1564)

      • Add example code for Resample (#1644)

      • Fix examples in transforms (#1646)

      • Add example for ComplexNorm (#1658)

      • Add example for MuLawEncoding (#1586)

      • Add example for Spectrogram (#1566)

      • Add example for GriffinLim (#1671)

      • Add example for MuLawDecoding (#1684)

      • Add example for Fade transform (#1719)

      • Update RNNT loss docs and add example (#1835)

      • Add SpecAugment figure/citation (#1887)

      • Add filter bank figures (#1891)

  • v0.9.1(Sep 27, 2021)

  • v0.9.0(Jun 15, 2021)

    torchaudio 0.9.0 Release Note

    Highlights

    torchaudio 0.9.0 release includes:

    • Lots of performance improvements (filtering, resampling, spectral operations).
    • Popular wav2vec2.0 model architecture.
    • Improved autograd support.

    [Beta] Wav2Vec2.0 Model

    This release includes model architectures from the wav2vec 2.0 paper with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android and iOS. Please check out our C++, Android and iOS examples. The following snippets illustrate how to create a deployable model.

    # Import fine-tuned model from Hugging Face Hub
    from transformers import Wav2Vec2ForCTC
    from torchaudio.models.wav2vec2.utils import import_huggingface_model
    
    original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
    imported = import_huggingface_model(original)
    
    # Import fine-tuned model from fairseq
    import fairseq
    from torchaudio.models.wav2vec2.utils import import_fairseq_model
    
    original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
        ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
    imported = import_fairseq_model(original[0].w2v_encoder)
    
    # Build uninitialized model and load state dict
    from torchaudio.models import wav2vec2_base
    
    model = wav2vec2_base(num_out=32)
    model.load_state_dict(imported.state_dict())
    
    # Quantize / script / optimize for mobile
    import torch
    from torch.utils.mobile_optimizer import optimize_for_mobile

    quantized_model = torch.quantization.quantize_dynamic(
        model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
    scripted_model = torch.jit.script(quantized_model)
    optimized_model = optimize_for_mobile(scripted_model)
    optimized_model.save("model_for_deployment.pt")
    

    Filtering Improvement

    The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.

    The following table illustrates the performance improvements compared against the previous releases. lfilter was applied to float32 tensors with one channel and different numbers of frames.

    torchaudio version  256    512    1024
    0.9                 0.282  0.381  0.564
    0.8                 0.493  0.780  1.37
    0.7                 5.42   10.8   22.3

    Unit: msec

    Complex Tensor Migration

    torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtype as native complex types.)

    As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For details of this migration plan, please refer to #1337.

    Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.

    The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to operate on a float32 tensor with two channels and 256 frames.

    CPU
    torchaudio version  Spectrogram  TimeStretch  GriffinLim
    0.9                 0.229        12.6         3320
    0.8                 0.283        126          5320

    Unit: msec

    CUDA
    torchaudio version  Spectrogram  TimeStretch  GriffinLim
    0.9                 0.195        0.599        36
    0.8                 0.219        0.687        60.2

    Unit: msec

    Improved Autograd Support

    Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure the autograd support. Now the following operations are guaranteed to support autograd up to second order.

    Functionals
    • lfilter
    • allpass_biquad
    • biquad
    • band_biquad
    • bandpass_biquad
    • bandreject_biquad
    • bass_biquad
    • equalizer_biquad
    • treble_biquad
    • highpass_biquad
    • lowpass_biquad
    Transforms
    • AmplitudeToDB
    • ComputeDeltas
    • Fade
    • GriffinLim
    • TimeMasking
    • FrequencyMasking
    • MFCC
    • MelScale
    • MelSpectrogram
    • Resample
    • SpectralCentroid
    • Spectrogram
    • SlidingWindowCmn
    • TimeStretch*
    • Vol

    NOTE:

    1. Autograd test for transforms also covers the following functionals.
      • amplitude_to_DB
      • spectrogram
      • griffinlim
      • resample
      • phase_vocoder*
      • mask_along_axis_iid
      • mask_along_axis
      • gain
      • spectral_centroid
    2. torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.
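
The limitation in note 2 can be seen from the analytic derivatives of atan2. For a complex bin z = x + iy, the partial derivatives of atan2(y, x) are x / (x² + y²) with respect to y and -y / (x² + y²) with respect to x, so the gradient norm equals 1 / |z| and grows without bound as the magnitude approaches zero. The following sketch evaluates this with the standard library only; the helper name `atan2_grad` is an assumption made for illustration.

```python
import math


def atan2_grad(y, x):
    """Analytic partial derivatives of atan2(y, x) with respect to (y, x)."""
    r2 = x * x + y * y
    return x / r2, -y / r2


for magnitude in (1.0, 1e-3, 1e-6):
    # A complex bin with the given magnitude at a 45-degree phase.
    x = y = magnitude / math.sqrt(2)
    dy, dx = atan2_grad(y, x)
    print(f"|z| = {magnitude:g}  ->  gradient norm = {math.hypot(dy, dx):.3g}")
```

At exactly zero magnitude the denominator vanishes, and this sketch raises ZeroDivisionError, which mirrors why the gradient is undefined there.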

    [Beta] Resampling Improvement

    In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.

    • Kaiser window has been added for a wider range of resampling quality.
    • rolloff parameter has been added for anti-aliasing control.
    • torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
    • New entry point, torchaudio.functional.resample has been added and the original entry point, torchaudio.compliance.kaldi.resample_waveform is deprecated.
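
As a rough illustration of the rolloff parameter, the sketch below treats it as a fraction of the Nyquist frequency of the lower of the two sampling rates, which gives the cutoff of the anti-aliasing low-pass filter. The helper name `lowpass_cutoff_hz` and the default value 0.99 are assumptions for this example and are not taken from these notes.

```python
def lowpass_cutoff_hz(orig_freq, new_freq, rolloff=0.99):
    """Cutoff of the anti-aliasing filter: rolloff times the lower Nyquist."""
    if not 0.0 < rolloff <= 1.0:
        raise ValueError("rolloff is expected to be in (0, 1]")
    # The lower of the two rates bounds the representable bandwidth.
    nyquist = min(orig_freq, new_freq) / 2
    return rolloff * nyquist


print(lowpass_cutoff_hz(44100, 16000))
print(lowpass_cutoff_hz(44100, 16000, rolloff=0.9))
```

A lower rolloff leaves more margin below Nyquist, which suppresses aliasing more strongly at the cost of attenuating some of the highest frequencies.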

    The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.

    CPU
    torchaudio version  8k → 16k [Hz]  16k → 8k  16k → 44.1k  44.1k → 16k
    0.9                 0.192          0.559     0.478        0.467
    0.8                 0.537          0.753     43.9         17.6

    Unit: msec

    CUDA
    torchaudio version  8k → 16k  16k → 8k  16k → 44.1k  44.1k → 16k
    0.9                 0.203     0.172     0.213        0.212
    0.8                 0.860     0.559     116          46.7

    Unit: msec

    Improved Windows Support

    torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was only available on Linux and macOS. In this release, Windows packages also come with the C++ module.

    The C++ module in the Windows package includes the efficient filtering implementation mentioned above. However, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.

    I/O Functions Migration

    Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.

    Backward Incompatible Changes

    I/O

    • Deprecated backends and functions were removed (#1311, #1329, #1362)
      • Please see #903 for the migration.
    • Added validation of the number of channels when saving GSM (#1384)
      • Please make sure that signal has only one channel when saving into GSM.

    Ops

    • Removed deprecated normalized argument from torchaudio.functional.griffinlim (#1369)
      • This argument was never used. Please remove the argument from your call.
    • Renamed torchaudio.functional.sliding_window_cmn arg for correctness (#1347)
      • The first argument is supposed to be a spectrogram. If you have used the keyword argument waveform=..., please change it to specgram=...
    • Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel. (#1499, #1514)
      • To use the transform in devices other than CPU, please move the instantiated object to the target device.
        resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
        resampler.to(torch.device("cuda"))
        

    Dataset

    • Removed deprecated arguments from CommonVoice (#1534)
      • torchaudio no longer supports programmatic download of Common Voice dataset. Please remove the arguments from your code.

    Deprecations

    • Deprecated the use of pseudo complex type (#1445, #1492)
      • torchaudio is adopting native complex type and the use of pseudo complex type and the related utility functions are now deprecated. Please refer to #1337 for the migration process.
    • Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533)
      • Please use torchaudio.functional.resample.
    • torchaudio.transforms.MelScale now expects valid n_stft value (#1515)
      • Please provide a valid value to n_stft.
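
    To make the pseudo complex deprecation above concrete: a pseudo complex value stores the real and imaginary parts as a trailing pair of numbers, whereas a native complex value is a single complex number. The following is a minimal pure-Python sketch of that layout change, not torchaudio code; the function names are hypothetical. In PyTorch, the analogous conversions are torch.view_as_complex and torch.view_as_real.

```python
from typing import List, Sequence


def pseudo_to_native(pairs: Sequence[Sequence[float]]) -> List[complex]:
    """Convert [real, imag] pairs (the "pseudo complex" layout) to complex numbers."""
    return [complex(re, im) for re, im in pairs]


def native_to_pseudo(values: Sequence[complex]) -> List[List[float]]:
    """Inverse conversion: split each complex number into a [real, imag] pair."""
    return [[z.real, z.imag] for z in values]


# Illustrative values only.
pseudo = [[1.0, 2.0], [0.0, -1.0]]
native = pseudo_to_native(pseudo)
round_trip = native_to_pseudo(native)
```

    The round trip returns the original pairs, which is why the migration described in #1337 is a change of representation rather than a change of values.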

    New Features

    [Beta] Wav2Vec2.0

    • Added wav2vec2.0 model (#1529)
    • Added wav2vec2.0 HuggingFace importer (#1530)
    • Added wav2vec2.0 fairseq importer (#1531)
    • Added speech recognition C++ example (#1538)

    Filtering

    • Added C++ implementation of torchaudio.functional.lfilter (#1319)
    • Added autograd support to torchaudio.functional.lfilter (#1310, #1441)

    [Beta] Resampling

    • Added torchaudio.functional.resample (#1402)
    • Added rolloff parameter (#1488)
    • Added kaiser window support to resampling (#1509)
    • Added kernel caching mechanism in torchaudio.transforms.Resample (#1499, #1514, #1556)
    • Skip resampling when sampling rate is not changed (#1537)

    Native Complex Tensor

    • Added complex tensor support to torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
    • Added return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)

    Improvements

    I/O

    • Added file path to I/O error messages (#1523)
    • Added __str__ override to AudioMetaData for easy print (#1339)
    • Fixed uninitialized variable in sox/utils.cpp (#1306)
    • Replaced UB sox conversion macros with tensor op (#1370)
    • Removed check_length from validate_input_file (#1312)

    Ops

    • Added warning for non-integer resampling frequencies (#1490)
    • Adopted native complex tensors in torchaudio.functional.griffinlim (#1368)
    • Prohibited scripting torchaudio.transforms.MelScale when n_stft is invalid (#1505)
    • Added input dimension check to VAD (#1513)
    • Added HTK-compatible option to Mel-scale conversion (#593)

    Models

    • Added vanilla DeepSpeech model (#1399)

    Datasets

    • Fixed checksum for the YESNO dataset (#1405)

    Misc

    • Added missing transforms to __all__ (#1458)
    • Removed reference_cast in make_boxed_from_unboxed_functor (#1300)
    • Removed unused normalized constant from torchaudio.transforms.GriffinLim (#1433)
    • Removed unused helper function (#1396)

    Examples

    • Added libtorchaudio C++ example (#1349)
    • Refactored libtorchaudio example (#1486)
    • Replaced librosa's Mel scale conversion with torchaudio’s in WaveRNN example (#1444)

    Build

    • Updated config.guess to support source build in recent architectures (#1484)
    • Explicitly disabled wavpack when building SoX (#1462)
    • Added ROCm support to source build (#1411)
    • Added Windows C++ binary build (#1345, #1371)
    • Made kaldi selective in build (#1342)
    • Made sox selective (#1338)

    Testing

    • Added autograd test for torchaudio.functional.lfilter and biquad variants (#1400, #1438)
    • Added autograd test for transforms (overview: #1414)
      • torchaudio.transforms.FrequencyMasking (#1498)
      • torchaudio.transforms.SlidingWindowCmn (#1482)
      • torchaudio.transforms.MelScale (#1467)
      • torchaudio.transforms.Vol (#1460)
      • torchaudio.transforms.TimeStretch (#1420)
      • torchaudio.transforms.AmplitudeToDB (#1447)
      • torchaudio.transforms.GriffinLim (#1421)
      • torchaudio.transforms.SpectralCentroid (#1425)
      • torchaudio.transforms.ComputeDeltas (#1422)
      • torchaudio.transforms.Fade (#1424)
      • torchaudio.transforms.Resample (#1416)
      • torchaudio.transforms.MFCC (#1415)
      • torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
    • Added test for a batch of different items in the functional batch consistency test. (#1315)
    • Added test for validating torchaudio.functional.lfilter shape (#1360)
    • Added TorchScript test for torchaudio.functional.resample (#1516)
    • Added TorchScript test for torchaudio.functional.phase_vocoder (#1379)
    • Added steps to save and load the scripted object in TorchScript (#1446)
    • Added GPU support to functional tests (#1475)
    • Added GPU support to transform librosa compatibility test (#1439)
    • Added GPU support to functional librosa compatibility test (#1436)
    • Improved HTTP fetch test reliability (#1512)
    • Refactored functional batch consistency test (#1341)
    • Refactored test classes for complex (#1491)
    • Refactored sox_io load test (#1394)
    • Refactored Kaldi compatibility tests (#1359)
    • Refactored functional test (#1435, #1463)
    • Refactored transform tests (#1356)
    • Refactored librosa compatibility test (#1350)
    • Refactored sox compatibility test (#1344)
    • Refactored librosa compatibility test (#1259)
    • Removed the use of I/O functions in batch consistency test (#1521)
    • Removed skipIfNoSoxBackend (#1390)
    • Removed VAD from batch consistency tests (#1451)
    • Replaced deprecated floor_divide with div (#1455)
    • Replaced torch.assert_allclose with assertEqual (#1387)
    • Shortened torchaudio.functional.lfilter autograd tests input size (#1443)
    • Updated torchaudio.transforms.InverseMelScale comparison test (#1437)

    Bug Fixes

    • Updated torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
    • Annotate power of torchaudio.transforms.MelSpectrogram as float only (#1572)

    Performance

    • Adopted torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
    • Added C++ implementation of torchaudio.functional.overdrive (#1299)

    Documentation

    • Update docs (#1550)
    • Reformat resample docs (#1548)
    • Updated resampling documentation (#1519)
    • Added the clarification that sox_effects.apply_effects_tensor is CPU-only (#1459)
    • Removed instructions on using external sox (#1365, #1281)
    • Added navigation with left/right arrow keys (#1336)
    • Fixed docstring of sliding_window_cmn (#1383)
    • Update contributing guide (#1372)
    • Fix broken links in contribution guide (#1361)
    • Added Windows build instructions (#1440)
    • Fixed typo (#1471, #1397, #1293)
    • Added WER to readme in wav2letter pipeline (#1470)
    • Fixed wav2letter usage example (#1060)
    • Added Google Analytics support (#1466)
  • v0.8.1(Mar 25, 2021)

  • v0.8.0(Mar 4, 2021)

    Highlights

    This release supports Python 3.9.

    I/O Improvements

    Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have five major updates.

    1. Backend migration. We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface of the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backends and interfaces have been marked as deprecated. They are still accessible, though their use is strongly discouraged. For details on the migration, please refer to #903.

    2. File-like object support. We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save and apply_effects_file operations on file-like objects.

      import io
      import tarfile

      import boto3
      import requests
      import torchaudio

      # Query audio metadata over HTTP
      # Will only fetch the first few kB
      with requests.get(URL, stream=True) as response:
        metadata = torchaudio.info(response.raw)
      
      # Load audio from tar file
      # No need to extract TAR file.
      with tarfile.open(TAR_PATH, mode='r') as tarfile_:
        fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
        waveform, sample_rate = torchaudio.load(fileobj)
      
      # Saving to Bytes buffer
      # Using BytesIO, you can perform in-memory encoding/decoding.
      buffer_ = io.BytesIO()
      torchaudio.save(buffer_, waveform, sample_rate, format="wav")
      
      # Apply effects (lowpass filter / resampling) while loading audio from S3
      client = boto3.client('s3')
      response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
      waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
        response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
      
    3. [Beta] Codec Application. Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data in memory by applying audio codecs supported by the “sox_io” backend.

      import torchaudio.functional as F

      # Apply MP3 codec
      degraded = F.apply_codec(
        waveform, sample_rate, format="mp3", compression=-9)
      # Apply GSM codec
      degraded = F.apply_codec(waveform, sample_rate, format="gsm")
      
    4. Encoding options. We have added encoding options to the save function of the new backends. You can now change the format and encoding with the format, encoding and bits_per_sample options.

      # Save without any encoding option.
      # The function will pick the encoding that fits the provided data.
      # For Tensor of float32 type, that is 32-bit floating-point PCM.
      torchaudio.save("data.wav", waveform, sample_rate)
      
      # Save as 16-bit signed integer Linear PCM
      # The resulting file occupies half the storage but loses precision
      torchaudio.save(
        "data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
      
    5. More format support to "sox_io"’s save function. We have added support for GSM, HTK, AMB, and AMR-NB formats to "sox_io"’s save function.
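
    The comment in item 4 that 16-bit PCM occupies half the storage of the default float32 encoding follows from simple arithmetic on the raw sample data. A minimal sketch, assuming an illustrative one-second mono clip at 16 kHz and ignoring the container header; the helper name is hypothetical:

```python
def pcm_payload_bytes(num_frames: int, num_channels: int, bits_per_sample: int) -> int:
    """Size of the raw sample data in bytes, excluding any container header."""
    return num_frames * num_channels * bits_per_sample // 8


# One second of mono audio at 16 kHz (illustrative values).
num_frames, num_channels = 16000, 1
float32_size = pcm_payload_bytes(num_frames, num_channels, 32)
int16_size = pcm_payload_bytes(num_frames, num_channels, 16)
print(float32_size, int16_size)  # 64000 32000
```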

    Switch to CMake-based build

    torchaudio was utilizing CMake to build third party dependencies. Now torchaudio uses CMake to build its C++ extension. This will open the door to integrate torchaudio in non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.

    Backwards Incompatible Changes

    • Removed deprecated transform and target_transform arguments from VCTK and YESNO datasets. (#1120) If you were relying on the previous behavior, we recommend that you apply the transforms in the collate function.
    • Removed torchaudio.datasets.utils.walk_files (#1111) and replaced it with Path and glob. (#1069, #1101) If you relied on the function, we recommend that you use glob instead.
    • Removed torchaudio.data.utils.unicode_csv_reader. (#1086) If you relied on the function, we recommend that you replace it with csv.reader.
    • Disabled CommonVoice download, as users are required to sign a user agreement. Please download and extract the dataset manually, and replace the root argument with the subfolder for the version and language of interest; see #1082 for more details. (#1018, #1079, #1080, #1082)
    • Removed legacy sox effects (#977, #1001). Please migrate to apply_effects_file or apply_effects_tensor.
    • Switched the default backend to the ones with new interfaces (#978). If you were relying on the previous behavior, you can return to the previous behavior by following instructions in #975 for one more release.
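
    The migration advice in the list above (glob instead of walk_files, csv.reader instead of unicode_csv_reader, and transforms applied in the collate function) can be sketched with the standard library alone. This is one possible shape, not torchaudio's own code: all function names are hypothetical, and the collate helper assumes each dataset item is a (sample, target) pair, which may not match the tuples your dataset actually returns.

```python
import csv
from pathlib import Path
from typing import Callable, List, Sequence, Tuple


def list_audio_files(root: str, suffix: str = ".wav") -> List[Path]:
    """Glob-based stand-in for walk_files: files under root with the given suffix, sorted."""
    return sorted(Path(root).rglob(f"*{suffix}"))


def read_rows(csv_path: str) -> List[List[str]]:
    """Read a UTF-8 CSV file with the standard csv.reader."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def make_collate_fn(transform: Callable, target_transform: Callable) -> Callable:
    """Build a collate function that applies the transforms to each (sample, target) pair."""

    def collate(batch: Sequence[Tuple[object, object]]):
        samples = [transform(sample) for sample, _ in batch]
        targets = [target_transform(target) for _, target in batch]
        return samples, targets

    return collate
```

    The function returned by make_collate_fn can be passed as the collate_fn argument of torch.utils.data.DataLoader, which replaces the removed transform and target_transform dataset arguments.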

    New Features

    • Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to “sox_io” backend. (#1276, #1291, #1277, #1275, #1066)
    • Added encoding options (format, bits_per_sample and encoding) to save function. (#1226, #1177, #1129, #1104)
    • Added new attributes (bits_per_sample and encoding) to the info function return type (AudioMetaData) (#1177, #1206, #1324)
    • Added format override to libsox-based file input. (load, info, sox_effects.apply_effects_file) (#1104)
    • Added file-like object support in “sox_io”, and “soundfile” backend and sox_effects.apply_effects_file. (#1115)
    • [Beta] Added the Kaldi Pitch feature. (#1243, #1260)
    • [Beta] Added the SpectralCentroid transform. (#1167, #1216, #1316)
    • [Beta] Added codec transformation apply_codec. (#1200)

    Improvements

    • Exposed normalization method to Mel transforms. (#1212)
    • Exposed additional STFT arguments to Spectrogram (#892) and to MelSpectrogram (#1211).
    • Added support for pathlib.Path to apply_effects_file (#1048) and to CMUARCTIC (#1025), YESNO (#1015), COMMONVOICE (#1027), VCTK and LJSPEECH (#1028), GTZAN (#1032), SPEECHCOMMANDS (#1039), TEDLIUM (#1045), LIBRITTS and LIBRISPEECH (#1046).
    • Added SpeechCommands train/valid/test split. (#966, #1012)

    Internals

    • Replaced if-elseif-else with switch in sox C++ code. (#1270)
    • Refactored C++ interface for sox_io's get_info_file (#1232) and get_encodinginfo (#1233).
    • Add explicit functional import in init. (#1228)
    • Refactored YESNO dataset (#1127), LJSPEECH dataset (#1143).
    • Removed Python 2.7 reference from setup.py. (#1182)
    • Merged flake8 configurations into single .flake8 file. (#1172, #1214)
    • Updated calls to torch.stft to use return_complex=True. (#1096, #1013)
    • Cleaned up handling of optional args in C++ with c10::optional. (#1043)
    • Removed unused imports in sox effects. (#1052)
    • Introduced functional submodule to organize functionals. (#1003)
    • [Testing] Refactored MelSpectrogram librosa compatibility test to decouple from other tests. (#1267)
    • [Testing] Moved batch tests for functionals. (#1254)
    • [Testing] Refactored tests for backend (#1239) and for functionals (#1237).
    • [Testing] Removed dependency on pytest from testing (#1157, #1188)
    • [Testing] Refactored unit tests for VCTK (#1134), SPEECHCOMMANDS (#1136), LIBRISPEECH (#1140), TEDLIUM (#1135), LJSPEECH (#1138), LIBRITTS (#1139), CMUARCTIC (#1147), GTZAN (#1148), COMMONVOICE and YESNO (#1133).
    • [Testing] Removed dependency on COMMONVOICE dataset from tests. (#1132)
    • [Build] Fixed Python 3.9 support (#1242)
    • [Build] Switched to cmake for build. (#1187, #1246, #1249)
    • [Build] Restructured C++ code to allow per file registration of custom ops. (#1221)
    • [Build] Added logging to sox/CMakeLists.txt. (#1190)
    • [Build] Disabled C++11 ABI when necessary for libtorch compatibility. (#880)
    • [Build] Reorganized libsox source and build directory to accommodate additional third party code. (#1161, #1176)
    • [Build] Refactored sox source files and moved into dedicated subfolder. (#1106)
    • [Build] Enabled custom clean function for python setup.py clean. (#1142)
    • [CI] Documented undocumented parameters. Added CI check. (#1248)
    • [CI] Fixed sphinx warnings in documentation. Turned warnings into errors. (#1247)
    • [CI] Print CPU info before running unit test. (#1218)
    • [CI] Fixed clang-format job and fixed newly detected formatting issues. (#981, #1198, #1222)
    • [CI] Updated unit test base Docker image. (#1193)
    • [CI] Disabled CCI cache which is now known to be flaky. (#1189)
    • [CI] Disabled torchscript BC test which is known to fail. (#1192)
    • [CI] Stripped version suffix for pytorch. (#1185)
    • [CI] Ran smoke test with CPU package for pytorch due to known issue with CUDA 11. (#1105)
    • [CI] Added missing empty line at the end of config.yml. (#1020)
    • [CI] Added automatic documentation build and push to branch in CI. (#1006, #1034, #1041, #1049, #1091, #1093, #1098, #1100, #1121)
    • [CI] Ran GPU test for all pull requests and fixed current setup. (#998, #1014, #1191)
    • [CI] Skipped tests that are known to fail on macOS Python 3.6/3.7. (#999)
    • [CI] Changed the order of installation and aligned with Windows. (#987)
    • [CI] Fixed documentation rendering by using Sphinx 2.4.4. (#974)
    • [Doc] Added subcategories to functional documentation. (#1325)
    • [Doc] Added a version selector in documentation. (#1273)
    • [Doc] Updated compilation recommendation in README. (#1263)
    • [Doc] Added CONTRIBUTING.md. (#1241)
    • [Doc] Added instructions to install parametrized package. (#1164)
    • [Doc] Fixed the return type for load functions. (#1122)
    • [Doc] Added missing modules and minor fixes. (#1022, #1056, #1117)
    • [Doc] Fixed spelling and links in README. (#1029, #1037, #1062, #1110, #1261)
    • [Doc] Grouped filtering functionals in documentation page. (#1005, #1004)
    • [Doc] Updated the compatibility matrix with torchaudio 0.7 (#979)
    • [Doc] Added description of prototype/beta/stable features. (#968)

    Bug Fixes

    • Fixed amplitude_to_DB clamping behaviour on batches. (#1113)
    • Disabled audio devices in sox builds which could interfere in the build process when detected. (#1153)
    • Fixed COMMONVOICE for French where the audio file extension was missing on load. (#1126)
    • Disabled OpenMP support for libsox which can produce errors when used in DataLoader. (#1026)
    • Fixed noise_down_time argument in VAD by properly propagating it. (#1017)
    • Removed print-freq option to compute validation loss at each epoch in wav2letter pipeline. (#997)
    • Migrated from torch.rfft to torch.fft.rfft and cfloat following change in pytorch. (#941)
    • Fixed interactive ASR demo to align with the latest version of FAIRSeq. (#996)

    Deprecations

    • The normalized argument is unused and will be removed from griffinlim. (#1036)
    • The previous sox and soundfile backends remain available for one release; see #903 for details. (#975)

    Performance

    • Added C++ lfilter core loop for faster iteration on CPU. (#1244)
    • Leveraged julius resampling implementation to make resampling faster. (#1087)
  • v0.7.2(Dec 10, 2020)

    Highlights

    This release introduces support for python 3.9. There is no 0.7.1 release, and the following changes are compared to 0.7.0.

    Improvements

    • Add python 3.9 support (#1061)

    Bug Fixes

    • Temporarily disable OpenMP support for libsox (#1054)

    Deprecations

    • Disallow download=True in CommonVoice (#1076)
  • v0.7.0(Oct 27, 2020)

    Highlights

    Example Pipelines

    torchaudio is expanding its support for models and end-to-end applications. Please file an issue on github to provide feedback on them.

    • Speech Recognition: Building on the addition of the Wav2Letter model for speech recognition in the last release, we added an example training pipeline for speech recognition that uses the LibriSpeech dataset.
    • Text-to-Speech: With the goal of supporting text-to-speech applications, we added a vocoder based on the WaveRNN model. WaveRNN model is based on the implementation from this repository. The original implementation was introduced in "Efficient Neural Audio Synthesis". We provide an example training pipeline in the example folder that uses the LibriTTS dataset added to torchaudio in this release.
    • Source Separation: We also support source separation with the addition of the ConvTasNet model, based on the paper "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation." An example training pipeline is provided with the wsj0-mix dataset.

    I/O Improvements

    As you are likely aware from the last release, we’re currently in the process of making sox_io the new default backend. It ships with new features such as TorchScript support and performance improvements. If you want to benefit from these features now, we encourage you to migrate. For more information, see issue #903.

    Backwards Incompatible Changes

    • Switched all %-based string formatting to str.format to adopt changes in PyTorch, leading to improved error messages for TorchScript (#850)
    • Split sox_utils.list_formats() for read and write (#811)
    • Made directory traversal order alphabetical and breadth-first, consistent across operating systems (#814)
    • Changed GTZAN so that it only traverses filenames belonging to the dataset (#791)
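
    As an illustration of the alphabetical, breadth-first traversal order mentioned above, here is a minimal standard-library sketch. It is not torchaudio's implementation and the function name is hypothetical. Sorting each directory listing explicitly is what makes the order independent of the operating system.

```python
import os
from collections import deque
from typing import Iterator


def walk_breadth_first(root: str) -> Iterator[str]:
    """Yield file paths under root, level by level, each directory in alphabetical order."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        # os.listdir order is OS-dependent, so sort explicitly for reproducibility.
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isdir(path):
                # Subdirectories are visited only after the current level is done.
                queue.append(path)
            else:
                yield path
```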

    New Features

    • Added ConvTasNet model (#920, #933) with pipeline (#894)
    • Added canonical pipeline with wav2letter (#632)
    • The WaveRNN model (#705, #797, #801, #810, #836) is available with a canonical pipeline (#749, #802, #831, #863)
    • Added all 3 releases of tedlium dataset (#882, #934, #945, #895)
    • Added VCTK_092 dataset (#812)
    • Added LibriTTS (#790, #820)
    • Added SPHERE support to sox_io backend (#871)
    • Added torchscript sox effects (#760)
    • Added a flag to change the interface of soundfile backend to the one identical to sox_io backend. (#922)

    Improvements

    • Added soundfile compatibility backend. (#922)
    • Improved the speed of torchaudio.compliance.kaldi.fbank (#947)
    • Improved the speed of phaser (#660)
    • Added warning when a Mel filter is all zero (#914)
    • Added pathlib.Path support to sox_io backend (#907)
    • Simplified C++ registration with TORCH_LIBRARY (#840)
    • Merged sox effect and sox_io C++ implementation (#779)

    Internal

    • CI: Added test to validate torchscript backward compatibility (#838)
    • CI: Used mocked datasets to test CMUArctic (#829), CommonVoice (#827), Speech Commands (#824), LJSpeech (#826), LibriSpeech (#825), YESNO (#792, #832)
    • CI: Made *nix unit test fail if C++ extension is not available (#847, #849)
    • CI: Separated I/O in testing. (#813, #773, #783)
    • CI: Added smoke tests to sox_io and sox_effects (#806)
    • CI: Tested utilities have been refactored (#805, #808, #809, #817, #822, #831)
    • Doc: Added how to run tests (#843)
    • Doc: Added 0.6.0 to version matrix in README (#833)

    Bug Fixes

    • Fixed device in interactive ASR example (#900)
    • Fixed incorrect extension parsing (#885)
    • Fixed dither with noise_shaping = True (#865)
    • Run unit test with non-editable installation (#845), and set zip_safe = False to disable egg installation (#842)
    • Sorted GTZAN dataset and use on-the-fly data in GTZAN test (#819)

    Deprecations

    • Removed istft wrapper in favor of torch.istft. (#841)
    • Deprecated SoxEffect and SoxEffectsChain (#787)
    • I/O: Deprecated sox backend. (#904)
    • I/O: Deprecated the current interface of soundfile. (#922)
    • I/O: Deprecated load_wav functions. (#905)
  • v0.6.0(Jul 28, 2020)

    Highlights

    torchaudio now includes a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for torchscript. torchaudio now also supports Windows, with the soundfile backend.

    torchaudio requires python 3.6 or more recent.

    Backwards Incompatible Changes

    • We reorganized the C++ resources (#630) and replaced C++ bindings for sox_effects init/list/shutdown with torch binding (#748).
    • We removed code specific to python 2 (#691), and we no longer test against python 2 (#575) and 3.5 (#577)

    New Features

    • We now support Windows. (#604, #637, #642, #655, #743)
    • We now have a model module which includes wav2letter. (#462, #722)
    • We added the GTZAN and CMU datasets. (#668, #710)
    • We now have the contrast functional (#551), cvm (#540), dcshift (#558), overdrive (#569), vad (#578, #599), phaser (#587, #607, #702), flanger (#651, #702), biquad (#661).
    • We added a new sox_io backend (#718, #728, #734, #727, #763, #752, #731, #732, #726, #780) that is compatible with torchscript with a new AudioMetaData class (#761).
    • MelSpectrogram now has power and normalized parameters (#633), and slaney normalization (#589, #641).
    • lfilter now has a clamp option. (#600)
    • Griffin-Lim can now have zero momentum. (#601)
    • sliding_window_cmn now supports batching. (#570)
    • Downloaded datasets now verify checksums. (#499)
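
    Checksum verification of a downloaded file, as mentioned in the last item above, generally looks like the sketch below. The choice of SHA-256 and the function names are assumptions made for illustration; the release notes do not say which hash torchaudio uses.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_hex: str) -> None:
    """Raise if the file's digest differs from the expected hexadecimal digest."""
    actual = file_sha256(path)
    if actual != expected_hex.lower():
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```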

    Improvements

    • We added ogg/vorbis/opus support to binary distribution (#750, #755).
    • We replaced the use of torch.norm in spectrogram to improve performance (#747).
    • We now use fused operations in lfilter for faster computation. (#517, #564)
    • STFT is now called directly from torchaudio. (#531)
    • We redesigned the backend mechanism to support torchscript, by restructuring the code (#695, #696, #700, #706, #707, #698), adding dynamic listing (#697)
    • torchaudio can be built along with sox, or can use external sox. (#625, #669, #739)
    • We redesigned the sox_effects module. (#708)
    • We added more details to compilation instructions. (#667)
    • We updated the README with instructions on changing the backend. (#553)
    • We now have a version compatibility matrix in README. (#685)
    • We now use cmake to build third party libraries (#753).
    • We now use CircleCI instead of travis (#576, #584, #598, #603, #636, #738) and we test on GPU (#586, #777).
    • We run the test suite against nightlies. (#538, #678)
    • We redesigned our test suite: with new helper functions (#514, #519, #521, #565, #616, #690, #692, #694), standard pytorch test utilities (#513, #640, #643, #645, #646, #652, #650, #712), separated CPU and GPU tests (#513, #528, #644), more descriptive names (#532), clearer organization (#539, #541, #542, #664, #672, #687, #703, #716, #732), standardized name (#559), and backend aware (#719). This is detailed in a new README for testing (#566, #759).
    • We now support typing, for datasets (#511, #522), for backends (#527), for init (#526), and inline (#530), with mypy configuration (#524, #544, #590).

    Bug Fixes

    • We removed in place operations so that Griffin-Lim can be backpropagated through. (#730)
    • We fixed kaldi MFCC on GPU. (#681)
    • We removed multiple definitions of SoxEffect in C++. (#635)
    • We fixed the docstring of masking. (#612)
    • We replaced views by reshape for batching. (#594)
    • We fixed missing conda environment when testing in python 3.8. (#582)
    • We ensure that sox is not exposed in windows. (#579)
    • We corrected the instructions to install nightlies. (#547, #552)
    • We fix the seed of mask_along_iid. (#529)
    • We correctly report GPU tests as skipped instead of passed. (#516)

    Deprecations

    • Since sox_effects is now automatically initialized and shutdown (#572, #693), we are deprecating these functions (#709).
    • ISTFT is migrating to torch. (#523)
  • v0.5.1(Jun 22, 2020)

  • v0.5.0(Apr 21, 2020)

    Highlights

    torchaudio includes new transforms (e.g. Griffin-Lim and inverse Mel scale), new filters (e.g. all pass, fade, band pass/reject, band, treble, deemph, riaa), and datasets (LJ Speech and SpeechCommands).

    Backwards Incompatible Changes

    • torchaudio no longer supports python 2. We removed future and six imports. We added inline typing. (#413, #478, #479, #482, #486)
    • We fixed CommonVoice dataset download, and updated to the latest version. (#498)
    • We now skip data points with missing data in VCTK dataset. (#484)

    New Features

    • We now have the Vol transform and DB_to_amplitude. (#468, #469)
    • We now have the InverseMelScale (#448)
    • We now have the Griffin-Lim functional. (#365)
    • We now support allpass, fade, bandpass, bandreject, band, treble, deemph, riaa. (#444, #449, #464, #470, #508)
    • We now offer LJSpeech and SpeechCommands datasets. (#439, #437)

    Improvements

    • We added inline typing to SoxEffects and Kaldi compliance. (#490, #497)
    • We refactored the tests. (#480, #485, #496, #491, #501, #502, #503, #506, #507, #509)
    • We now run tests with sox only when sox is available. (#419)
    • We extended batch support to MelScale, MelSpectrogram, MFCC, Resample. (#391, #435)
    • The speed of torchaudio.functional.istft was improved. (#471)
    • We now have transform and functional tests for AmplitudeToDB. (#463)
    • We now ignore pycharm and OSX files in git. (#461)
    • TimeStretch now has a batch test. (#459)
    • Docstrings in transforms were polished. (#442)
    • TimeStretch and AmplitudeToDB are now torch.nn.Module. (#456)
    • Resample is now jitable. (#441)
    • We support python 3.8. (#397)
    • Add cuda test for complex norm. (#421)
    • Dither is jitable with the latest version of pytorch. (#417)
    • Batching uses view instead of reshape. (#409)
    • We refactored the jitability test. (#395)
    • In .circleci, we removed a conditional block that wasn't doing anything. (#399)
    • We now have Windows CI for building. (#394 and #398)
    • We corrected the use of standard variable names in code. (#393)
    • We adopted native-Python code generation convention. (#378)
    • torchaudio.istft creates tensors directly on device. (#377)
    • torchaudio.compliance.kaldi.resample_waveform is now jitable. (#362)
    • The runtime of torchaudio.functional.lfilter was decreased. (#374)

    Bug Fixes

    • We fixed flake8 errors. (#504, #505)
    • We fixed Windows test by only testing with cpu-only binaries. (#489)
    • Spelling correction in docstrings for transforms.FrequencyMasking and transforms.TimeMasking. (#474)
    • In .circleci, we switched to use token for conda uploads. (#460)
    • The default value of dither parameter was changed. (#453)
    • TimeStretch moves device correctly. (#457)
    • Adding dev-other option in librispeech. (#433)
    • In build script, we install the correct version of pytorch for pip. (#412)
    • Upgrading dataset DeprecationWarning to UserWarning so that the user gets the warning. (#402)
    • Make power of spectrogram a float to work with complex norm. (#392)
    • Fix random seed for flaky test_griffinlim test. (#388)
    • Apply 'nightly' branch filter to binary uploads. (#385)
    • Fixed build errors: added explicitly utf8 decoration, added explicit utf_8_encoder definition if not available, explicitly cast to int. (#380)

    Deprecations

    • None
  • v0.4.0(Jan 15, 2020)

    torchaudio 0.4 improves on current transformations, datasets, and backend support.

    • We introduce an interactive speech recognition demo. (#266, #229, #248)
    • SoX is now optional, and a new extensible backend dispatch mechanism exposes SoundFile as an alternative to SoX.
    • The interface for datasets has been unified. This enables the addition of two large datasets: LibriSpeech and Common Voice.
    • New filters such as biquad, data augmentation such as time and frequency masking, and transforms such as gain and dither, and new feature computation such as deltas, are now available.
    • Transformations now support batches and are jitable.

    We would like to thank again our contributors and the wider community for their significant contributions to this release. In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around augmentations (#285) and batching (#327).

    Breaking Changes

    • torchaudio now requires PyTorch 1.3.0 or newer, see https://pytorch.org/ for installation instructions. (#312)
    • We make jit compilation optional for functions and use nn.Module where possible. (#314, #326, #342, #369)
    • By unifying the interface for datasets, we changed the interface for VCTK and YESNO (#303, #316). In particular, the construction parameters downsample, transform, target_transform, and return_dict are being deprecated.
    • SoxEffectsChain.EFFECTS_AVAILABLE replaced by SoxEffectsChain().EFFECTS_AVAILABLE (#355)
    • This is the last version to support Python 2.

    New Features

    • SoX is now optional, and a new extensible backend dispatch mechanism exposes SoundFile as an alternative to SoX. This makes it possible to use torchaudio even when SoX or SoundFile are not installed or available. (#355)
    • We now have a unified dataset interface that loads in memory only one item at a time enabling new large datasets: LibriSpeech and CommonVoice. (#303, #316, #330)
    • We introduce a pitch detection algorithm: torchaudio.functional.detect_pitch_frequency. (#313, #322)
    • We offer data augmentations in torchaudio.transforms: TimeStretch, FrequencyMasking, TimeMasking. (#285, #333, #348)
    • We introduce a complex norm transform: torchaudio.transforms.ComplexNorm. (#285, #333)
    • We now have a new audio feature generation for computing deltas: torchaudio.functional.compute_deltas. (#268, #326)
    • We introduce torchaudio.functional.gain and torchaudio.functional.dither (#319, #360). We welcome work to continue the effort to implement features available in SoX, see #260.
    • We now include equalizer_biquad (#315, #340), lowpass_biquad, highpass_biquad (#275), lfilter, and biquad (#275, #291, #326) in torchaudio.functional.
    • MFCC is available as torchaudio.functional.mfcc. (#228)
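
    To make the masking augmentations listed above more concrete, here is a minimal sketch of the idea in plain PyTorch. It is not torchaudio's implementation: random_band_mask is a hypothetical helper, and the (freq, time) layout, the uniform sampling of the band, and the zero fill value are assumptions chosen for illustration.

```python
import torch


def random_band_mask(spec, max_width, axis, mask_value=0.0):
    """Mask one random contiguous band of a (freq, time) spectrogram.

    Hypothetical helper, not a torchaudio function.
    axis=0 masks frequency bins, axis=1 masks time frames.
    """
    size = spec.size(axis)
    # Band width is drawn uniformly from [0, max_width].
    width = int(torch.randint(0, max_width + 1, (1,)))
    # The start index is chosen so the band stays inside the spectrogram.
    start = int(torch.randint(0, size - width + 1, (1,)))
    masked = spec.clone()  # leave the input untouched
    if axis == 0:
        masked[start:start + width, :] = mask_value
    else:
        masked[:, start:start + width] = mask_value
    return masked


torch.manual_seed(0)
spec = torch.rand(64, 100) + 1.0  # strictly positive stand-in spectrogram
freq_masked = random_band_mask(spec, max_width=8, axis=0)
time_masked = random_band_mask(spec, max_width=20, axis=1)
```

    Returning a masked copy rather than modifying the input keeps the original spectrogram available for applying several independent augmentations.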

    Improvements

    • We now support batching in transforms. (#327, #337, #404)
    • Functions are now jitable, and nn.Module is used where possible. (#314, #326, #342, #362, #369, #395)
    • Downloads of large files are now automatically resumed with a new download function. (#320)
    • New tests for ISTFT are added. (#279)
    • We introduce nightly builds. (#301)
    • We now have smoke tests for builds. (#346, #359)

    Bug Fixes

    • Fix mismatch between MelScale and librosa. (#294)
    • Fix torchaudio.compliance.kaldi.resample_waveform where internal variables were not moved to the GPU when used. (#277)
    • Fix a bug that occurred when importing torchaudio built outside of a git repository. (#276)
    • Fix istft where internal parameters were not created with the same dtype and device as the tensor provided by the user. (#264)
    • Fix size mismatch when saving and loading from state dictionary (load_state_dict). (#246)
    • Clarified internal naming convention within transforms and functionals. (#298)
    • Fix build script to be more tolerant to download drops. (#280, #284, #305)
    • Correct documentation for SoxEffectsChain. (#283)
    • Fix resample error with cuda tensors. (#277)
    • Fix error when importing version outside of git. (#276)
    • Fix missing asound in linux build. (#254)
    • Fix deprecated torch. (#254)
    • Fix link in README. (#253)
    • Fix window device in ISTFT. (#240)
    • Documentation: Fix range in documentation for torchaudio.load to [-1, 1]. (#283)
  • v0.3.2(Jan 14, 2020)

  • v0.3.1(Jan 8, 2020)

  • v0.3.0(Aug 8, 2019)

    Highlights

    torchaudio as an extension of PyTorch

    torchaudio has been redesigned to be an extension of PyTorch and part of the domain APIs (DAPI) ecosystem. Domain specific libraries such as this one are kept separated in order to maintain a coherent environment for each of them. As such, torchaudio is an ML library that provides relevant signal processing functionality, but it is not a general signal processing library. The full rationale of this new standardization can be found in the README.md.

    In light of these changes some transforms have been removed or have different argument names and conventions. See the section on backwards breaking changes for a migration guide.

    We provide binaries via pip and conda. They require PyTorch 1.2.0 or newer. See https://pytorch.org/ for installation instructions.

    Community

    We would like to thank our contributors and the wider community for their significant contributions to this release. We are happy to see an active community around torchaudio and are eager to further grow and support it.

    In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around standardization and the support of complex numbers (https://github.com/pytorch/audio/pull/131, https://github.com/pytorch/audio/issues/110, https://github.com/keunwoochoi/torchaudio-contrib/issues/61, https://github.com/keunwoochoi/torchaudio-contrib/issues/36).

    Kaldi Compliance Interface

    An implementation of basic transforms with a Kaldi-like interface.

    We added the functions spectrogram, fbank, and resample_waveform (https://github.com/pytorch/audio/pull/119, https://github.com/pytorch/audio/pull/127, and https://github.com/pytorch/audio/pull/134). For more details see the documentation on torchaudio.compliance.kaldi which mirrors the arguments and outputs of Kaldi features.

    As an example we can look at the sinc interpolation resampling similar to Kaldi’s implementation. In the figure below, the blue dots are the original signal and red dots are the downsampled signal with half the original frequency. The red dot elements are approximately every other original element.

    [Figure: resampling]

    specgram = torchaudio.compliance.kaldi.spectrogram(waveform, frame_length=...)
    fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=...)
    resampled_waveform = torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq=...)
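
    The downsampling described above can be illustrated with a small windowed-sinc sketch in plain PyTorch. This is not Kaldi's or torchaudio's implementation: downsample_by_two, the Hann window, and the half_width of 32 taps are assumptions made for the example, and torch.sinc requires a reasonably recent PyTorch.

```python
import math

import torch
import torch.nn.functional as F


def downsample_by_two(waveform, half_width=32):
    """Halve the sample rate of a 1-D waveform with a windowed-sinc low-pass filter."""
    n = torch.arange(-half_width, half_width + 1, dtype=torch.float32)
    # Ideal low-pass at the new Nyquist frequency (a quarter of the old sample rate).
    kernel = 0.5 * torch.sinc(0.5 * n)
    # A Hann window truncates the infinitely long sinc smoothly.
    kernel = kernel * torch.hann_window(n.numel(), periodic=False)
    kernel = kernel / kernel.sum()  # unity gain at DC
    x = waveform.view(1, 1, -1)
    # stride=2 keeps every other filtered sample.
    y = F.conv1d(x, kernel.view(1, 1, -1), stride=2, padding=half_width)
    return y.view(-1)


t = torch.arange(400, dtype=torch.float32)
original = torch.sin(2 * math.pi * 0.01 * t)  # slow sine, far below the cutoff
downsampled = downsample_by_two(original)
```

    Away from the edges, where zero padding distorts the result, downsampled stays close to original[::2], which matches the "every other element" observation for a signal well below the new Nyquist frequency.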
    

    Inverse short time Fourier transform

    Reconstructing a signal from a spectrogram is useful in applications such as source separation, or for generating audio signals to listen to. More specifically, torchaudio.functional.istft is the inverse of torch.stft. It takes the same parameters (plus an additional optional length parameter) and returns the least-squares estimate of the original signal.

    torch.manual_seed(0)
    n_fft = 5
    waveform = torch.rand(2, 5)
    stft = torch.stft(waveform, n_fft=n_fft)
    approx_waveform = torchaudio.functional.istft(stft, n_fft=n_fft, length=waveform.size(1))
    >>> waveform
    tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
            [0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
    >>> approx_waveform
    tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
            [0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
    

    Breaking Changes

    • Removed Compose: Please use core abstractions such as nn.Sequential() or a for-loop over a list of transforms.
    • SPECTROGRAM, F2M, and MEL have been removed. Please use Spectrogram, MelScale, and MelSpectrogram instead.
    • Removed formatting transforms (LC2CL and BLC2CBL): While the LC layout might be common in signal processing, support for it is out of scope of this library and transforms such as LC2CL only aid its proliferation. Please use transpose if you need this behavior.
    • Removed Scale, PadTrim, DownmixMono: Please use division in place of Scale, torch.nn.functional.pad (or slicing, to trim) in place of PadTrim, and torch.mean on the channel dimension in place of DownmixMono.
    • torchaudio.legacy has been removed. Please use torchaudio.load and torchaudio.save.
    • Spectrogram used to be of dimension (channel, time, freq) and is now (channel, freq, time). Similarly for MelScale, MelSpectrogram, and MFCC, time is the last dimension. Please see our README for an explanation of the rationale behind these changes. Please use transpose to get the previous behavior.
    • MuLawExpanding was renamed to MuLawDecoding as the inverse of MuLawEncoding (https://github.com/pytorch/audio/pull/159).
    • SpectrogramToDB was renamed to AmplitudeToDB (https://github.com/pytorch/audio/pull/170). The input does not necessarily have to be a spectrogram and as such can be used in many more cases as the name should reflect.
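
    The replacements suggested in the list above can be sketched with core PyTorch operations. This is one possible reading of the migration, not an official guide: the 2**15 scale factor, max_len, the tensor shapes, and the Divide and Downmix modules are hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F

# Stand-in for a loaded stereo waveform in (channel, time) layout.
waveform = torch.randint(-2**15, 2**15, (2, 1000)).float()

# Scale -> plain division (2**15 is an assumed 16-bit full-scale factor).
scaled = waveform / 2**15

# DownmixMono -> mean over the channel dimension.
mono = scaled.mean(dim=0, keepdim=True)

# PadTrim -> zero-pad on the right up to max_len, or slice to trim.
max_len = 1600
if mono.size(1) < max_len:
    fixed = F.pad(mono, (0, max_len - mono.size(1)))
else:
    fixed = mono[:, :max_len]

# LC2CL -> transpose between (channel, length) and (length, channel).
length_first = fixed.transpose(0, 1)

# Old (channel, time, freq) spectrogram layout -> swap the last two dims.
spec = torch.rand(1, 201, 9)  # stand-in for a (channel, freq, time) output
spec_old_layout = spec.transpose(1, 2)


class Divide(torch.nn.Module):
    """Hypothetical module that divides its input by a constant."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x / self.factor


class Downmix(torch.nn.Module):
    """Hypothetical module that averages over the channel dimension."""

    def forward(self, x):
        return x.mean(dim=0, keepdim=True)


# Compose -> nn.Sequential over nn.Module steps.
pipeline = torch.nn.Sequential(Divide(2**15), Downmix())
processed = pipeline(waveform)
```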

    New Features

    Performance

    JIT and CUDA

    • JIT support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (https://github.com/pytorch/audio/pull/118)
    • CUDA support added to Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding (https://github.com/pytorch/audio/pull/118)

    Bug Fixes

    • Fix test_transforms.py where double tensors were compared with floats (https://github.com/pytorch/audio/pull/132)
    • Fix vctk.read_audio (issue https://github.com/pytorch/audio/issues/143) as there were issues with downsampling using SoxEffectsChain (https://github.com/pytorch/audio/pull/145)
    • Fix segfault passing null to sox_close (https://github.com/pytorch/audio/pull/174)
  • v0.2.0(Aug 8, 2019)

    Background

    The goal of this release is to fix the current API, as future changes will break backward compatibility in order to improve the library as more thought is given to design, capabilities, and usability.

    While this release is compatible with all currently known PyTorch versions (<=1.2.0), the available binaries will only require PyTorch 1.1.0. Installation commands:

    # Wheels for Python 2 are NOT supported
    # Python 3.5
    $ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp35-cp35m-linux_x86_64.whl
    # Python 3.6
    $ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp36-cp36m-linux_x86_64.whl
    # Python 3.7
    $ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp37-cp37m-linux_x86_64.whl
    

    What's new?

    • Fixed broken tests and set up an automatic testing environment
    • Read in Kaldi files (“.ark”, “.scp”)
    • Separation of state and computation into transforms.py and functional.py
    • Loading and saving to file
    • Datasets VCTK and YESNO
    • SoxEffects and SoxEffectsChain in torchaudio.sox_effects

    CI and Testing

    Continuous integration (Travis CI) has been set up in https://github.com/pytorch/audio/pull/117. This means all the tests have been fixed and their status can be checked at https://travis-ci.org/pytorch/audio. The test files have to be run separately via build_tools/travis/test_script.sh because closing sox after a test file is completed prevents it from being reopened. The testing framework is pytest.

    # Run the whole test suite
    $ build_tools/travis/test_script.sh
    # Run an individual test
    $ python -m pytest test/test_transforms.py
    

    Kaldi IO

    Kaldi IO has been added as an optional dependency in https://github.com/pytorch/audio/pull/111. torchaudio provides a simple wrapper around this by converting the np.ndarray into torch.Tensor. Functions include: read_vec_int_ark, read_vec_flt_scp, read_vec_flt_ark, read_mat_scp, and read_mat_ark.

    >>> # read ark to a 'dictionary'
    >>> d = { u:d for u,d in torchaudio.kaldi_io.read_vec_int_ark(file) }
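
    The wrapping step can be pictured as follows. This is only a sketch of the idea, not the code in torchaudio.kaldi_io: as_tensor_pairs is a hypothetical helper, and fake_ark stands in for the (key, np.ndarray) pairs that a Kaldi reader would yield.

```python
import numpy as np
import torch


def as_tensor_pairs(pairs):
    """Lazily convert (key, np.ndarray) pairs into (key, torch.Tensor) pairs."""
    for key, array in pairs:
        # torch.from_numpy shares memory with the array instead of copying it.
        yield key, torch.from_numpy(array)


# Made-up keys and values standing in for the output of an ark reader.
fake_ark = [
    ("utt1", np.array([1, 2, 3], dtype=np.int32)),
    ("utt2", np.array([4, 5], dtype=np.int32)),
]
d = {key: tensor for key, tensor in as_tensor_pairs(fake_ark)}
```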
    

    Separation of State and Computation

    In https://github.com/pytorch/audio/pull/105, the computations have been moved into functional.py. The reasoning behind this is that tracking state is a separate problem by itself and should be separate from computing a function. It also allows us to annotate the functional as weak scriptable, which in turn allows us to utilize the JIT and create efficient code. The functional itself might then also be used by other functionals, which is much easier and more efficient than having another Module create an instance of the class. This also makes it easier to implement performance improvements and create a generic API. If someone implements a function that adheres to the contract of your functional, it can be an immediate drop-in. This is important if we want to support different backends (e.g. move a functional entirely into C++).

    >>> torchaudio.transforms.Spectrogram(n_fft=...)(waveform)
    >>> torchaudio.functional.spectrogram(waveform, …)
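
    As an illustration of this split, the sketch below pairs a stateless function with a thin module that only stores configuration. The names mu_law_encode and MuLawEncode are hypothetical and the code is a simplified example of the pattern, not the library's source.

```python
import math

import torch


def mu_law_encode(x, quantization_channels):
    # type: (Tensor, int) -> Tensor
    """Stateless computation: compand x in [-1, 1] and quantize to integers."""
    mu = quantization_channels - 1.0
    x_mu = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / math.log1p(mu)
    # Map [-1, 1] onto [0, mu] and round to the nearest integer level.
    return ((x_mu + 1) / 2 * mu + 0.5).to(torch.int64)


class MuLawEncode(torch.nn.Module):
    """Stateful wrapper: keeps the configuration and delegates the math."""

    def __init__(self, quantization_channels=256):
        super().__init__()
        self.quantization_channels = quantization_channels

    def forward(self, x):
        return mu_law_encode(x, self.quantization_channels)


x = torch.tensor([-1.0, 0.0, 1.0])
from_function = mu_law_encode(x, 256)
from_module = MuLawEncode(256)(x)
```

    Because the module holds no logic of its own, another function or a different backend can reuse or replace mu_law_encode without instantiating the class.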
    

    Loading and saving to file

    Tensors can be read and written to various file formats (e.g. “mp3”, “wav”, etc.) through torchaudio.

    sound, sample_rate = torchaudio.load('input.wav')
    torchaudio.save('output.wav', sound, sample_rate)
    

    Transforms and functionals

    Transforms

    class Compose(object):
        def __init__(self, transforms):
        def __call__(self, audio):
            
    class Scale(object):
        def __init__(self, factor=2**31):
        def __call__(self, tensor):
            
    class PadTrim(object):
        def __init__(self, max_len, fill_value=0, channels_first=True):
        def __call__(self, tensor):
           
    class DownmixMono(object):
        def __init__(self, channels_first=None):
        def __call__(self, tensor):
    
    class LC2CL(object):
        def __call__(self, tensor):
    
    def SPECTROGRAM(*args, **kwargs):
    
    class Spectrogram(object):
        def __init__(self, n_fft=400, ws=None, hop=None,
                     pad=0, window=torch.hann_window,
                     power=2, normalize=False, wkwargs=None):
        def __call__(self, sig):
            
    def F2M(*args, **kwargs):
    
    class MelScale(object):
        def __init__(self, n_mels=128, sr=16000, f_max=None, f_min=0., n_stft=None):
        def __call__(self, spec_f):
    
    class SpectrogramToDB(object):
        def __init__(self, stype="power", top_db=None):
        def __call__(self, spec):
           
    class MFCC(object):
        def __init__(self, sr=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False,
                     melkwargs=None):
        def __call__(self, sig):
    
    class MelSpectrogram(object):
        def __init__(self, sr=16000, n_fft=400, ws=None, hop=None, f_min=0., f_max=None,
                     pad=0, n_mels=128, window=torch.hann_window, wkwargs=None):
        def __call__(self, sig):
    
    def MEL(*args, **kwargs):
    
    class BLC2CBL(object):
        def __call__(self, tensor):
    
    class MuLawEncoding(object):
        def __init__(self, quantization_channels=256):
        def __call__(self, x):
    
    class MuLawExpanding(object):
        def __init__(self, quantization_channels=256):
        def __call__(self, x_mu):
    

    Functional

    def scale(tensor, factor):
        # type: (Tensor, int) -> Tensor
    
    def pad_trim(tensor, ch_dim, max_len, len_dim, fill_value):
        # type: (Tensor, int, int, int, float) -> Tensor
    
    def downmix_mono(tensor, ch_dim):
        # type: (Tensor, int) -> Tensor
    
    def LC2CL(tensor):
        # type: (Tensor) -> Tensor
    
    def spectrogram(sig, pad, window, n_fft, hop, ws, power, normalize):
        # type: (Tensor, int, Tensor, int, int, int, int, bool) -> Tensor
    
    def create_fb_matrix(n_stft, f_min, f_max, n_mels):
        # type: (int, float, float, int) -> Tensor
    
    def mel_scale(spec_f, f_min, f_max, n_mels, fb=None):
        # type: (Tensor, float, float, int, Optional[Tensor]) -> Tuple[Tensor, Tensor]
    
    def spectrogram_to_DB(spec, multiplier, amin, db_multiplier, top_db=None):
        # type: (Tensor, float, float, float, Optional[float]) -> Tensor
    
    def create_dct(n_mfcc, n_mels, norm):
        # type: (int, int, string) -> Tensor
    
    def MFCC(sig, mel_spect, log_mels, s2db, dct_mat):
        # type: (Tensor, MelSpectrogram, bool, SpectrogramToDB, Tensor) -> Tensor
    
    def BLC2CBL(tensor):
        # type: (Tensor) -> Tensor
    
    def mu_law_encoding(x, qc):
        # type: (Tensor, int) -> Tensor
    
    def mu_law_expanding(x_mu, qc):
        # type: (Tensor, int) -> Tensor
    

    Datasets VCTK and YESNO

    All datasets are subclasses of torch.utils.data.Dataset, i.e., they implement the __getitem__ and __len__ methods. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers. For example:

    yesno_data = torchaudio.datasets.YESNO('.', download=True)
    data_loader = torch.utils.data.DataLoader(yesno_data,
                                              batch_size=1,
                                              shuffle=True,
                                              num_workers=args.nThreads)
    
    

    The two datasets available are VCTK and YESNO. They download the datasets and preprocess them so that the loaded data is in a convenient format.
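
    For readers who want to see the Dataset contract without downloading anything, here is a minimal sketch. ToneDataset is a hypothetical toy dataset that synthesizes sine tones; it is not part of torchaudio, and its sizes and frequencies are arbitrary.

```python
import math

import torch
from torch.utils.data import DataLoader, Dataset


class ToneDataset(Dataset):
    """Hypothetical dataset: item i is a (1, num_samples) sine tone and the label i."""

    def __init__(self, num_items=8, num_samples=1600, sample_rate=16000):
        self.num_items = num_items
        self.num_samples = num_samples
        self.sample_rate = sample_rate

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        freq = 100.0 * (index + 1)  # each item gets its own frequency in Hz
        t = torch.arange(self.num_samples, dtype=torch.float32) / self.sample_rate
        waveform = torch.sin(2 * math.pi * freq * t).unsqueeze(0)
        return waveform, index


# num_workers=0 loads in the main process; raise it to load items in parallel.
loader = DataLoader(ToneDataset(), batch_size=4, shuffle=False, num_workers=0)
batch_waveforms, batch_labels = next(iter(loader))
```

    The default collate function stacks the equally sized waveforms into one (batch, channel, time) tensor; datasets with variable-length audio would need a custom collate_fn.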

    SoxEffects and SoxEffectsChain

    SoxEffects and SoxEffectsChain in torchaudio.sox_effects expose sox operations through a Python interface. Various useful effects like downmixing a multichannel signal or resampling a signal can be done here.

    torchaudio.initialize_sox()
    E = torchaudio.sox_effects.SoxEffectsChain()
    E.append_effect_to_chain("rate", [16000])  # resample to 16000hz
    E.append_effect_to_chain("channels", ["1"])  # mono signal
    E.set_input_file(fn)
    waveform, sample_rate = E.sox_build_flow_effects()
    torchaudio.shutdown_sox()
    