Textlesslib - Library for Textless Spoken Language Processing

Overview


Textless NLP is an active area of research that aims to extend NLP techniques to work directly on spoken language. By using discrete speech representations learned in a self-supervised manner, the area promises to unlock interesting NLP applications for languages without a written form, as well as for facets of spoken language, such as prosody, that are inaccessible to text-based approaches. To learn more, please check some of the papers.

textlesslib is a library aimed at facilitating research in Textless NLP. Its goal is to speed up the research cycle and lower the learning curve for those who want to get started. We provide highly configurable, off-the-shelf tools to encode speech as sequences of discrete values and to decode such streams back into the audio domain.

Installation

git clone git@github.com:facebookresearch/textlesslib.git
cd textlesslib
pip install -e .
pip install git+git://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8
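
A quick post-install sanity check (this one-liner simply imports the encoder class used in the examples below):

python -c "from textless.data.speech_encoder import SpeechEncoder; print('ok')"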

Usage examples

We include a set of examples in the examples folder.

There is also a Jupyter notebook and a Google Colab that combine discrete resynthesis and speech continuation examples into a step-by-step mini-tutorial.

We believe these examples can serve both as illustrations of the provided components and as a starting point for tinkering in interesting directions.

Encoding speech

Below is an example of loading an audio file and encoding it as a sequence of HuBERT-based discrete tokens (a.k.a. pseudo-units). Downloading of the required checkpoints is handled by textlesslib itself (by default, they are stored in ~/.textless):

import torchaudio
from textless.data.speech_encoder import SpeechEncoder

dense_model_name = "hubert-base-ls960"
quantizer_name, vocab_size = "kmeans", 100
input_file = "input.wav"

# now let's load an audio example
waveform, sample_rate = torchaudio.load(input_file)

# We can build a speech encoder module using names of pre-trained
# dense and quantizer models.  The call below will download
# appropriate checkpoints as needed behind the scenes. We can
# also construct an encoder by directly passing model instances
encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    deduplicate=True,
).cuda()


# now convert the audio into a stream of deduplicated units (as in GSLM)
encoded = encoder(waveform.cuda())
# encoded is a dict with keys ('dense', 'units', 'durations').
# It can also contain 'f0' if SpeechEncoder was initialized
# with need_f0=True flag.
units = encoded["units"]  # tensor([71, 12, 57, ...], ...)
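
Since deduplicate=True collapses runs of identical units, the parallel 'durations' entry records how many frames each unit originally spanned. A minimal sketch of expanding the units back to the frame level (assuming durations is an integer tensor aligned with units):

# each deduplicated unit is repeated for the number of frames it covered
durations = encoded["durations"]
frame_level_units = units.repeat_interleave(durations)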

The units can now be cast back into the audio domain:

# as with the encoder, we can set up the vocoder by passing checkpoints
# directly or by specifying the expected format via the names
# of the dense and quantizer models (these models themselves
# won't be loaded)
from textless.vocoders.tacotron2.vocoder import TacotronVocoder  # import path assumed

vocoder = TacotronVocoder.by_name(
    dense_model_name,
    quantizer_name,
    vocab_size,
).cuda()

# now we turn those units back into audio
audio = vocoder(units)

# save the audio
output_file = "output.wav"
torchaudio.save(output_file, audio.cpu().float().unsqueeze(0), vocoder.output_sample_rate)

Dataset helpers

Below is an example of using a textless view of the LibriSpeech dataset:

# QuantizedLibriSpeech wraps the LibriSpeech dataset and augments each
# example with its unit transcript (the import path below is an assumption)
from textless.data.quantized_datasets import QuantizedLibriSpeech

encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    deduplicate=True,
).cuda()

# existing_root points to a directory where LibriSpeech is already downloaded;
# url selects the split, e.g. "test-clean" (as in torchaudio.datasets.LIBRISPEECH)
quantized_dataset = QuantizedLibriSpeech(
    root=existing_root, speech_encoder=encoder, url=url)

datum = quantized_dataset[0]
sample_rate, utterance, speaker_id, chapter_id, utterance_id = datum['rest']
# datum['units'] = tensor([71, 12, 63, ...])

In the probing example we illustrate how such a dataset can be used with a standard PyTorch DataLoader in a scalable manner (see the sketch below).
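
A minimal sketch of such a setup, reusing quantized_dataset from above (the collate function is a hypothetical helper, not part of textlesslib, and num_workers is kept at 0 since the encoder runs on the GPU inside the dataset):

import torch
from torch.utils.data import DataLoader

def collate(batch):
    # pad variable-length unit sequences into a single batch tensor
    units = [item["units"] for item in batch]
    lengths = torch.tensor([len(u) for u in units])
    padded = torch.nn.utils.rnn.pad_sequence(units, batch_first=True)
    return padded, lengths

loader = DataLoader(
    quantized_dataset, batch_size=8, num_workers=0, collate_fn=collate)

for padded_units, lengths in loader:
    pass  # feed each batch into, e.g., a probing classifier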

Data preprocessing

We also provide a multi-GPU/multi-node preprocessing tool for the cases where on-the-fly processing of audio should be avoided.

Provided models

We provide implementations and pre-trained checkpoints for the following models:

  • Dense representations: HuBERT-base (trained on LibriSpeech 960h) and CPC (trained on a 6K-hour subset of LibriLight);
  • Quantizers: k-means quantizers with vocabulary sizes of 50, 100, and 200 for both dense models (trained on LibriSpeech 960h);
  • Decoders: Tacotron2 models for all (dense model x quantizer) combinations (trained on LJSpeech).

Finally, pitch extraction is done via YAAPT; a short example of requesting it follows.
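
Pitch can be requested at encoding time via the need_f0 flag mentioned earlier (a minimal sketch, reusing the names from the encoding example above):

encoder_f0 = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    deduplicate=True,
    need_f0=True,  # adds YAAPT-extracted pitch to the output dict
).cuda()

f0 = encoder_f0(waveform.cuda())["f0"]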

Testing

We use pytest (pip install pytest pytest-xdist). Our unit tests are located in the tests directory:

cd tests && pytest -n 8
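
To run only a subset of the tests, pytest's standard -k filter applies (the expression below is just an example):

cd tests && pytest -n 8 -k "encoder"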

License

textlesslib is licensed under MIT; the text of the license can be found here.

Comments
  • Code release

    Hi, thanks for your inspiring work. I recently read your paper "Generative Spoken Dialogue Language Modeling" and was wondering if you could release the corresponding code. Thanks in advance.

    opened by Rongjiehuang 4
  • There was an error when I installed fairseq

    When I run pip install git+git://github.com:pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8, I get a 400 error, possibly a version problem, which makes the install unusable.

    opened by yyz845935161 4
  • Any possibility to train/finetune HuBERT using custom data

    This is really great work! I am wondering if textlesslib could also support training or fine-tuning HuBERT with custom data, for example data in languages other than English.

    opened by KingStorm 2
  • Example "Speaker_Probing" not working

    Hi, after installing the library I was trying to run the examples. However, the speaker_probing example gives the error FileNotFoundError: [Errno 2] No such file or directory: 'datasets/tmpfn4ib6xd'. I guess it is failing to download some dataset.

    opened by fra1993 1
  • About the vocabulary size

    I read your papers carefully and have a question about the vocabulary size: what is the vocabulary here? Is it speech? And how is the vocabulary constructed?

    opened by yyz845935161 1
  • Calling Tacotron2 stalls with numpy==1.22

    Just sharing a problem I've had and solved: calling Tacotron2 with textlesslib stalls because of the linalg.pinv function. I downgraded NumPy to 1.21.5 and the problem was fixed.

    opened by RobinAlgayres 0
  • Fairseq installation

    It might just be an issue with the corporate network I'm using, but I found that in https://github.com/facebookresearch/textlesslib#installation I could only install fairseq if I replaced the colon after github.com with a forward slash, i.e. pip install git+git://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8 instead of pip install git+git://github.com:pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8

    Just noting it in case anyone else runs into the same issue.

    opened by eonglints 0
  • Model Release for "Generative Spoken Dialogue Language Modeling"?

    Hello!

    We are interested in using the HuBERT model trained/fine-tuned on the Fisher corpus, as well as the HiFi-GAN vocoder that generates audio directly from the units, for academic research. Is it possible that these models will be released soon? Thank you very much!

    opened by siyan-sylvia-li 5
  • Tacotron2 training code

    Hi, thank you very much for the great work!

    Could you please release the training code for converting discrete ids to mel-spectrograms?

    I was wondering how this Tacotron2 model is trained. Are the ground-truth mel-spectrograms synthesized by a TTS model?

    Thank you!

    opened by WillQuCD 1
  • Update HubertFeatureReader to skip small chunks

    Very rarely, when the input is longer than max_chunk and the trailing chunk is smaller than the kernel size (e.g. (x.size(1) % max_chunk) < 10), the feature reader will produce a runtime error:

    RuntimeError: Calculated padded input size per channel: (1). Kernel size: (10). Kernel size can't be greater than actual input size

    I ran into this error when transcribing libri-light large with hubert-base-ls960, where there are two files that would produce this error and kill the job: 6454/13348/vicksburgnational_03_everhart_64kb_0014.flac with duration 1600001 and 9221/9912/cloudstudies_08_clayden_64kb_0016.flac with duration 3200001.

    CLA Signed 
    opened by many-hats 3
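
    For context, a minimal sketch of the kind of guard this PR describes (the names follow fairseq's HubertFeatureReader, but the exact patch may differ):

    # inside the chunked feature-extraction loop: skip a trailing chunk
    # that is shorter than the first conv layer's kernel size (10)
    for start in range(0, x.size(1), max_chunk):
        chunk = x[:, start:start + max_chunk]
        if chunk.size(1) < 10:
            continue  # too short to convolve; would raise the RuntimeError above
        feat = model.extract_features(chunk)  # fairseq HuBERT call, simplified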