UniSpeech - Large Scale Self-Supervised Learning for Speech

Microsoft

Last update: Dec 15, 2022

Related tags

Text Data & NLP speech pytorch speech-recognition speaker-verification speech-processing speech-separation diarization speech-diarization

Overview

UniSpeech

The family of UniSpeech:

WavLM (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing

UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech-SAT (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training

Update

[HuggingFace Integration] October 26, 2021: UniSpeech-SAT models are on HuggingFace.
[Model Release] October 13, 2021: UniSpeech-SAT models are released.
[HuggingFace Integration] October 11, 2021: UniSpeech models are on HuggingFace.
[Model Release] June, 2021: UniSpeech v1 models are released.

Pre-trained models

We strongly suggest using our UniSpeech-SAT model for speaker-related tasks, since it performs strongly on a range of speaker-related benchmarks.

Model	Pretraining Dataset	Finetuning Dataset	Download
UniSpeech Large EN	Labeled: 1350 hrs en	-	download
UniSpeech Large Multilingual	Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 353 hrs fr	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 168 hrs es	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 90 hrs it	-	download
UniSpeech Large Multilingual	Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 353 hrs fr	1 hr fr	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 168 hrs es	1 hr es	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 90 hrs it	1 hr it	download
UniSpeech Large Multilingual	Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky	1 hr ky	download
UniSpeech-SAT Base	960 hrs LibriSpeech	-	download
UniSpeech-SAT Base+	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	download
UniSpeech-SAT Large	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	download
WavLM Base	960 hrs LibriSpeech	-	Azure Storage Google Drive
WavLM Base+	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	Azure Storage Google Drive
WavLM Large	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	Azure Storage Google Drive

Universal Representation Evaluation on SUPERB

Downstream Task Performance

We also evaluate our models on typical speaker-related benchmarks.

Speaker Verification

Model	Fix pre-train	Vox1-O	Vox1-E	Vox1-H
ECAPA-TDNN	-	0.87	1.12	2.12
HuBERT large	Yes	0.888	0.912	1.853
Wav2Vec2.0 (XLSR)	Yes	0.915	0.945	1.895
UniSpeech-SAT large	Yes	0.771	0.781	1.669
WavLM large	Yes	0.638	0.687	1.457
HuBERT large	No	0.585	0.654	1.342
Wav2Vec2.0 (XLSR)	No	0.564	0.605	1.23
UniSpeech-SAT large	No	0.564	0.561	1.23
WavLM large	No	0.431	0.538	1.154

Our paper for verification
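
The Vox1-O, Vox1-E and Vox1-H columns above appear to be equal error rates (EER, in percent); the EER values quoted in the comments further down match the fine-tuned WavLM large row. As a rough illustration of the metric only, and not the evaluation script behind these results, the sketch below computes EER from a list of trial scores and 0/1 labels. The function name `equal_error_rate` and the toy trials are assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Return the EER as a fraction, given trial scores and 0/1 labels.

    A label of 1 marks a same-speaker (target) trial, 0 a different-speaker
    (non-target) trial. A trial is accepted when score >= threshold.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    # Candidate thresholds: every observed score plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in targets) / len(targets)         # false rejections
        far = sum(s >= t for s in nontargets) / len(nontargets)  # false acceptances
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trials (hypothetical values, not taken from VoxCeleb).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # 0.25
```

Multiplying the result by 100 gives a percentage comparable in form to the table entries. Production toolkits usually interpolate between thresholds instead of picking the closest observed one, so values can differ slightly on small trial lists.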

Speech Separation

Evaluation on LibriCSS

Model	0S	0L	OV10	OV20	OV30	OV40
Conformer (SOTA)	4.5	4.4	6.2	8.5	11	12.6
UniSpeech-SAT base	4.4	4.4	5.4	7.2	9.2	10.5
UniSpeech-SAT large	4.3	4.2	5.0	6.3	8.2	8.8
WavLM base+	4.5	4.4	5.6	7.5	9.4	10.9
WavLM large	4.2	4.1	4.8	5.8	7.4	8.5

Speaker Diarization

Evaluation on CALLHOME

Model	spk_2	spk_3	spk_4	spk_5	spk_6	spk_all
EEND-vector clustering	7.96	11.93	16.38	21.21	23.1	12.49
EEND-EDA clustering (SOTA)	7.11	11.88	14.37	25.95	21.95	11.84
UniSpeech-SAT large	5.93	10.66	12.9	16.48	23.25	10.92
WavLM Base	6.99	11.12	15.20	16.48	21.61	11.75
WavLM large	6.46	10.69	11.84	12.89	20.70	10.35
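
The CALLHOME numbers above are presumably diarization error rates (DER, in percent). The sketch below shows a simplified, frame-level version of that metric: missed speech, false alarms and speaker confusion, divided by the total reference speech. It is an illustration under stated assumptions, not the scoring tool used for the table: it works on per-frame label sets, applies no forgiveness collar, and finds the speaker mapping by brute force. The name `frame_level_der` and the toy frames are hypothetical.

```python
from itertools import permutations


def frame_level_der(reference, hypothesis):
    """Simplified frame-level DER.

    reference, hypothesis: equal-length lists holding one set of speaker
    labels per frame (an empty set means silence).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")
    total_ref = sum(len(r) for r in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    ref_speakers = list(set().union(*reference))
    hyp_speakers = list(set().union(*hypothesis))

    # Missed speech and false alarms do not depend on the speaker mapping.
    missed = sum(max(0, len(r) - len(h)) for r, h in zip(reference, hypothesis))
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in zip(reference, hypothesis))

    # Try every one-to-one mapping from hypothesis to reference speakers
    # (None = left unmatched) and keep the one with the least confusion.
    # Brute force is only practical for a handful of speakers.
    candidates = ref_speakers + [None] * len(hyp_speakers)
    best_confusion = None
    for assignment in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        confusion = 0
        for r, h in zip(reference, hypothesis):
            correct = len(r & {mapping[s] for s in h})
            confusion += min(len(r), len(h)) - correct
        if best_confusion is None or confusion < best_confusion:
            best_confusion = confusion

    return (missed + false_alarm + best_confusion) / total_ref


# Five toy frames: one confused frame and one false alarm over four
# reference speaker-frames.
reference = [{"A"}, {"A"}, {"B"}, {"B"}, set()]
hypothesis = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}]
print(frame_level_der(reference, hypothesis))  # 0.5
```

Standard scoring tools work on time segments and solve the mapping with an assignment algorithm, so this sketch is only meant to make the three error terms concrete.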

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following papers:

@inproceedings{Wang2021UniSpeech,
  author    = {Chengyi Wang and Yu Wu and Yao Qian and Kenichi Kumatani and Shujie Liu and Furu Wei and Michael Zeng and Xuedong Huang},
  editor    = {Marina Meila and Tong Zhang},
  title     = {UniSpeech: Unified Speech Representation Learning with Labeled and
               Unlabeled Data},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning,
               {ICML} 2021, 18-24 July 2021, Virtual Event},
  series    = {Proceedings of Machine Learning Research},
  volume    = {139},
  pages     = {10937--10947},
  publisher = {{PMLR}},
  year      = {2021},
  url       = {http://proceedings.mlr.press/v139/wang21y.html},
  timestamp = {Thu, 21 Oct 2021 16:06:12 +0200},
  biburl    = {https://dblp.org/rec/conf/icml/0002WQK0WZ021.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{Chen2021WavLM,
  title   = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
  author  = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
  eprint={2110.13900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2021}
}

@article{Chen2021UniSpeechSAT,
  title   = {UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training},
  author  = {Sanyuan Chen and Yu Wu and Chengyi Wang and Zhengyang Chen and Zhuo Chen and Shujie Liu and Jian Wu and Yao Qian and Furu Wei and Jinyu Li and Xiangzhan Yu},
  eprint={2110.05752},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2021}
}

Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu ([email protected]).

Comments

Is "unispeech_sat.th" wrong ?

Hello,

I think the "unispeech_sat.th" is wrong. I have just cloned the repository and tried the speaker verification with UniSpeech-SAT. When I launch the example:

python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeech-SAT-Large.pt

I get an error (end of the traceback):

File "/data/coros1/ddallon/workspace/UniSpeech/UniSpeech-SAT/fairseq/models/__init__.py", line 88, in build_model
    assert model is not None, (
AssertionError: Could not infer model type from {'_name': 'bc_m_hubert', 'label_rate': 50, 'extractor_mode': 'layer_norm', 'structure_type': 'transformer', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'boundary_mask': False, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'relative_position_embedding': False, 'num_buckets': 320, 'max_distance': 1280, 'gru_rel_pos': False, 'expand_attention_head_size': -1, 'streaming': False, 'chunk_size': 0, 'left_chunk': 0, 'num_negatives': 0, 'negatives_from_everywhere': False, 'cross_sample_negatives': 100, 'codebook_negatives': 0, 'quantize_targets': True, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'spk_layer': 12, 'mixing_max_len': -1, 'mixing_prob': 0.5, 'mixing_num': 1, 'pretrained_path': ''}.
Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'wav2vec_transducer', 'hubert', 'hubert_ctc', 'transformer_lm', 'unispeech_sat'])
Requested model type: bc_m_hubert

And I notice that "bc_m_hubert" appears only in "unispeech_sat.th".

Could you check it or help me ? :-)

opened by Damien-Da 15
Formula 6 in paper

Hi there!

Great repo and paper. I have a question that may come from a mistake in my understanding of the paper/code. After reading through both, I understand the following:

You first do CTC + contrastive training on labeled data "L" and then optional pre-training on "M". However, from your paper I understand that they should be solved as a single task with joint multi-task training (formula 6 in the paper). This is not reflected in the code.

Would be glad if you could please help. Thank You!

opened by Sreyan88 7
Unispeech-SAT fairseq code

Hi!

From UniSpeech/downstreams/speaker_diarization/README.md: For UniSpeech-SAT large, we should install the Unispeech-SAT fairseq code.

Where can I find the Unispeech-SAT fairseq code?

Thanks in advance.

opened by RuslanSel 4

Huggingface sat model missing tokenizer

I tried to use the pretrained model from HuggingFace, but it seems no tokenizer has been uploaded there.

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
OSError: Can't load tokenizer for 'microsoft/unispeech-sat-base-plus'. Make sure that:

- 'microsoft/unispeech-sat-base-plus' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure 'microsoft/unispeech-sat-base-plus' is not a path to a local directory with something else, in that case)

- or 'microsoft/unispeech-sat-base-plus' is the correct path to a directory containing relevant tokenizer files

(1) Is there any workaround? (2) Also, since I don't need a tokenizer (the model is used for audio classification), is there an option to skip loading the tokenizer?

cc @patrickvonplaten

opened by bagustris 4

diarization - KeyError: 'embed.weight'

I got the following error when running:

python diarization.py --config_path config/infer_est_nspk1.yaml --wav_path 0.wav --model_init WavLM-Large.pt

Traceback (most recent call last):
  File "diarization.py", line 321, in <module>
    main(args)
  File "diarization.py", line 272, in main
    model_all_n_speakers = model_parameter_dict["embed.weight"].shape[0]
KeyError: 'embed.weight'

Thanks in advance.

opened by RuslanSel 3
Getting speaker embeddings

UniSpeech-SAT directory in this repo contains an example. The example takes a .wav file as an input and produces a tensor 'f' as an output. Can I get the speaker embeddings from 'f'?

opened by AH289 3
Release vocab.json for Common Voice

Hey UniSpeech team!

Thanks a lot for making the pre-trained checkpoints available for everyone. Would you mind also open-sourcing the dictionaries /datablob/users/v-chengw/data/commonvoice_20200622/common_voices_splits/nl/phonesMatches_reduced.json for UniSpeech base & large so that the model can be used out of the box for inference?

opened by patrickvonplaten 3
WavLM Inference Error

I loaded the WavLM Large model from the link provided.

When trying to follow the code for loading the pretrained model for inference I get the following error: cfg = WavLMConfig(checkpoint['cfg']) KeyError: 'cfg'

It looks like this model does not have any 'cfg' key or 'model' key for that matter.

opened by bryant0918 2
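
A `KeyError` like the one reported above usually means the loaded file has a different top-level layout than the loading code expects. The sketch below is one hedged way to check this before indexing; it does not explain why this particular download lacks the keys. The helper name `find_missing_keys` is hypothetical, and the toy dictionary only stands in for a real checkpoint.

```python
def find_missing_keys(checkpoint, expected=("cfg", "model")):
    """Return the expected top-level keys that are absent from `checkpoint`."""
    if not isinstance(checkpoint, dict):
        raise TypeError(
            f"expected a dict-like checkpoint, got {type(checkpoint).__name__}"
        )
    return [key for key in expected if key not in checkpoint]


# In practice the dict would come from something like
#   checkpoint = torch.load("WavLM-Large.pt", map_location="cpu")
# The toy dict below only imitates a file with an unexpected layout.
checkpoint = {"state_dict": {}, "epoch": 3}

missing = find_missing_keys(checkpoint)
if missing:
    print(f"missing keys {missing}; found {sorted(checkpoint)}")
```

Printing the keys that are present often shows whether the file is a differently packaged checkpoint or an incomplete download.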
Why are my reproduced WavLM results on Vox1-O 30% worse?

model	EER (mine)	EER (official)
wavlm_large_nofinetune.pth	0.965	0.75
wavlm_large_finetune.pth	0.631	0.431

The above results are the validation results of your shared wav_lm models on the original Vox1-o data without changing any code. What might be the reason for this gap? Wrong settings? Here is more background about my setting:

Create a conda env as:

conda create -n UniSpeech_py3p8 python=3.8

Following your guidance under https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification

pip install --require-hashes -r requirements.txt

The following error will appear:

Collecting numpy<1.23.0,>=1.16.5
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    numpy<1.23.0,>=1.16.5 from https://files.pythonhosted.org/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))

Then I installed the environment manually (installed around 30~40 tools) just as https://github.com/microsoft/UniSpeech/issues/26

Here are some related details:

pip list | grep fairseq
fairseq 0.12.1 /home/user1/tools/fairseq
pip list | grep s3prl
s3prl 0.3.1
torch.version: 1.9.0+cu102
python -V: 3.8.13

Thanks for your wonderful work. I am looking forward to your help.
opened by AIDman 2
More details about the output

When I try to run the example in UniSpeech-SAT directory in this repo, I get 'f' as a tensor of size torch.Size([1, 512, 31]). What exactly does the variable f represent?

opened by AhmedHashish123 2
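
One possible reading of the shape in the question above, inferred from the numbers alone and not confirmed by the authors: the configuration dump quoted in the first comment on this page lists `conv_feature_layers` as `[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2`. If those are unpadded 1-D convolutions given as (channels, kernel, stride), an input of 10,000 samples would produce 31 frames with 512 channels each, which matches `[1, 512, 31]`. That would make `f` the output of the convolutional feature extractor. The input length of 10,000 samples and the no-padding behaviour are assumptions of this sketch, and `conv_output_frames` is a hypothetical helper.

```python
def conv_output_frames(num_samples, conv_layers):
    """Number of output frames after a stack of unpadded 1-D convolutions."""
    frames = num_samples
    for _channels, kernel, stride in conv_layers:
        # Unpadded convolution: floor((n - kernel) / stride) + 1
        frames = (frames - kernel) // stride + 1
    return frames


# Copied from 'conv_feature_layers' in the traceback quoted earlier on this page.
CONV_FEATURE_LAYERS = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2

# Product of all strides = input samples consumed per output frame.
total_stride = 1
for _channels, _kernel, stride in CONV_FEATURE_LAYERS:
    total_stride *= stride

print(conv_output_frames(10_000, CONV_FEATURE_LAYERS))  # 31
print(CONV_FEATURE_LAYERS[-1][0])                       # 512 output channels
print(total_stride)                                     # 320
```

A total stride of 320 samples per frame is also consistent with the `label_rate` of 50 in the same configuration if the audio is sampled at 16 kHz, which is again an assumption here.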
Access Required for UniSpeech-SAT models

Would it be possible to give everyone access to the UniSpeech-SAT models? Currently the checkpoints for UniSpeech-SAT cannot be downloaded, because the link points to a Google Drive file that requires access, e.g.: https://drive.google.com/file/d/1l5etRW6W2aP_8I2Fs_8ailGZqEzdrAPz/view?usp=sharing

opened by patrickvonplaten 2
Change listed source of Flashlight bindings

Flashlight bindings have moved to https://github.com/flashlight/text and https://github.com/flashlight/sequence — point import failures to those repos.

opened by jacobkahn 0
UniSpeech Model Download

Hello, the download link for UniSpeech Large EN is invalid, can you update it? Also, can the UniSpeech Base EN model be shared as well? Thank you very much!

opened by hongfeixue 0
pre-training detail for Unispeech-SAT

Hi there!

Excellent paper on UniSpeech-SAT. I have one question regarding pre-training, as I see the pre-training code isn't available (I would be happy to know if it is available anywhere). I wanted to know whether any kind of normalization (such as L2 normalization or instance normalization) was applied to the model embeddings for the utterance-wise contrastive loss.

Would be very helpful if you could help me with that!

opened by Sreyan88 0
Hi, how to calculate the EER or DER

Sorry to bother you, but I am a newcomer. I would like to know how to calculate the EER or DER as the authors presented in the paper. I could not find the code for it.

opened by liyunlongaaa 0
How to continue pre-training WavLM for ASR?

I see this issue:

the recipe for ASR fine-tuning can be found in fairseq repo.

but there is no config file. Can you give a simple example?

Thanks a lot.

opened by cnlinxi 0

Owner

Microsoft

Open source projects and samples from Microsoft

GitHub
