Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

Overview

UniSpeech

The family of UniSpeech:

UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech-SAT (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training

Pre-trained models

We strongly recommend using our UniSpeech-SAT models for speaker-related tasks, since they achieve strong performance on various speaker-related benchmarks.

Model Dataset Download
UniSpeech Base 1500 hrs CommonVoice download
UniSpeech Large 1500 hrs CommonVoice download
UniSpeech-SAT Base 960 hrs LibriSpeech download
UniSpeech-SAT Base+ 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli download
UniSpeech-SAT Large 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli download
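
The UniSpeech-SAT checkpoints are also available on the Hugging Face Hub (see the issues below). Loading one with the transformers library looks roughly like the sketch that follows; a recent transformers release is assumed, and the dummy waveform should be replaced with real 16 kHz audio.

    import numpy as np
    import torch
    from transformers import UniSpeechSatModel, Wav2Vec2FeatureExtractor

    # Minimal sketch: load the UniSpeech-SAT Base+ checkpoint from the Hugging Face
    # Hub and extract frame-level representations from one second of dummy audio.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus')
    model = UniSpeechSatModel.from_pretrained('microsoft/unispeech-sat-base-plus')

    waveform = np.zeros(16000, dtype=np.float32)  # replace with real 16 kHz samples
    inputs = extractor(waveform, sampling_rate=16000, return_tensors='pt')
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)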

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu ([email protected]).

Comments
  • Is "unispeech_sat.th" wrong?

    Hello,

    I think the "unispeech_sat.th" is wrong. I have just cloned the repository and tried the speaker verification with Unispeech-SAT and when I launch the example :

    python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeech-SAT-Large.pt

    I have an error (end of the traceback):

    File "/data/coros1/ddallon/workspace/UniSpeech/UniSpeech-SAT/fairseq/models/__init__.py", line 88, in build_model
        assert model is not None, (
    AssertionError: Could not infer model type from {'_name': 'bc_m_hubert', 'label_rate': 50, 'extractor_mode': 'layer_norm', 'structure_type': 'transformer', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'boundary_mask': False, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'relative_position_embedding': False, 'num_buckets': 320, 'max_distance': 1280, 'gru_rel_pos': False, 'expand_attention_head_size': -1, 'streaming': False, 'chunk_size': 0, 'left_chunk': 0, 'num_negatives': 0, 'negatives_from_everywhere': False, 'cross_sample_negatives': 100, 'codebook_negatives': 0, 'quantize_targets': True, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'spk_layer': 12, 'mixing_max_len': -1, 'mixing_prob': 0.5, 'mixing_num': 1, 'pretrained_path': ''}.
    Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'wav2vec_transducer', 'hubert', 'hubert_ctc', 'transformer_lm', 'unispeech_sat'])
    Requested model type: bc_m_hubert

    And I notice that "bc_m_hubert" appears only in "unispeech_sat.th".

    Could you check it or help me? :-)

    opened by Damien-Da 15
  • Formula 6 in paper

    Hi there!

    Great repo and paper. I have a question that may just come from a mistake in my understanding of the paper/code. After reading through both, my understanding is:

    You first do CTC + contrastive training on the labeled data "L" and then optional pre-training on "M". However, from the paper I understand that these should be solved as a single task with joint multi-task training (Formula 6 in the paper). This is not reflected in the code.

    I would be glad if you could help. Thank you!

    opened by Sreyan88 7
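
    For reference, one reading of the joint objective discussed in this thread is a single weighted multi-task loss over the labeled set "L", sketched below in LaTeX. The weighting factor α is an assumption here; consult Formula 6 in the paper for the exact form.

    % Sketch of the joint multi-task objective discussed in this issue (assumption:
    % a weighting factor \alpha balances the two terms; see Formula 6 in the paper).
    \[
      \mathcal{L}_{\text{joint}}
        = \mathcal{L}_{\text{CTC}} + \alpha \, \mathcal{L}_{\text{contrastive}}
    \]
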
  • Unispeech-SAT fairseq code

    Hi!

    From UniSpeech/downstreams/speaker_diarization/README.md: "For UniSpeech-SAT large, we should install the Unispeech-SAT fairseq code."

    Where can I find the Unispeech-SAT fairseq code?

    Thanks in advance.

    opened by RuslanSel 4
  • Huggingface sat model missing tokenizer

    I tried to use the pretrained model from Hugging Face, but it seems no tokenizer was uploaded there.

    >>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
    OSError: Can't load tokenizer for 'microsoft/unispeech-sat-base-plus'. Make sure that:
    
    - 'microsoft/unispeech-sat-base-plus' is a correct model identifier listed on 'https://huggingface.co/models'
      (make sure 'microsoft/unispeech-sat-base-plus' is not a path to a local directory with something else, in that case)
    
    - or 'microsoft/unispeech-sat-base-plus' is the correct path to a directory containing relevant tokenizer files
    
    

    (1) Is there any workaround? (2) Also, since I don't need a tokenizer (the model is used for audio classification), is there an option to skip loading the tokenizer?

    cc @patrickvonplaten

    opened by bagustris 4
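
    A possible workaround (an assumption, not confirmed by the maintainers): since the SAT checkpoints are pre-trained without a CTC vocabulary, skip Wav2Vec2Processor (and its tokenizer) and load only the feature extractor, which is all an audio-classification setup needs:

    from transformers import UniSpeechSatModel, Wav2Vec2FeatureExtractor

    # Load only the feature extractor and the pre-trained model; no tokenizer is
    # required when the checkpoint is used for audio classification rather than CTC.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus')
    model = UniSpeechSatModel.from_pretrained('microsoft/unispeech-sat-base-plus')
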
  • diarization - KeyError: 'embed.weight'

    I got the error running:

    python diarization.py --config_path config/infer_est_nspk1.yaml --wav_path 0.wav --model_init WavLM-Large.pt

    Traceback (most recent call last):
      File "diarization.py", line 321, in <module>
        main(args)
      File "diarization.py", line 272, in main
        model_all_n_speakers = model_parameter_dict["embed.weight"].shape[0]
    KeyError: 'embed.weight'

    Thanks in advance.

    opened by RuslanSel 3
  • Getting speaker embeddings

    The UniSpeech-SAT directory in this repo contains an example that takes a .wav file as input and produces a tensor 'f' as output. Can I get speaker embeddings from 'f'?

    opened by AH289 3
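
    One simple way to turn 'f' into a fixed-size vector is to pool over the frame axis, as in the sketch below. This is only a naive baseline and an assumption about what is wanted; the repo's downstreams/speaker_verification recipe uses a dedicated speaker model on top of the pre-trained encoder rather than pooled raw features.

    import torch
    import torch.nn.functional as F

    # 'f' from the UniSpeech-SAT example has shape [1, 512, 31]: (batch, channels, frames).
    f = torch.randn(1, 512, 31)      # stand-in for the tensor produced by the example
    emb = f.mean(dim=-1)             # average over the frame axis -> [1, 512]
    emb = F.normalize(emb, dim=-1)   # optional L2 normalization before cosine scoring
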
  • Release vocab.json for Common Voice

    Hey UniSpeech team!

    Thanks a lot for making the pre-trained checkpoints available for everyone. Would you mind also open-sourcing the dictionaries /datablob/users/v-chengw/data/commonvoice_20200622/common_voices_splits/nl/phonesMatches_reduced.json for UniSpeech base & large so that the model can be used out of the box for inference?

    opened by patrickvonplaten 3
  • WavLM Inference Error

    I loaded the WavLM Large model from the link provided.

    When trying to follow the code for loading the pretrained model for inference, I get the following error:

    cfg = WavLMConfig(checkpoint['cfg'])
    KeyError: 'cfg'

    It looks like this model does not have any 'cfg' key or 'model' key for that matter.

    opened by bryant0918 2
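
    For context, the loading snippet quoted above expects a checkpoint dictionary with 'cfg' and 'model' entries. A quick diagnostic sketch (the file name is assumed to be the downloaded checkpoint) to see what the file actually contains:

    import torch

    # Inspect the downloaded checkpoint: the WavLM loading example expects a dict
    # with 'cfg' and 'model' keys. Anything else suggests a different checkpoint
    # format (e.g. a fine-tuned or fairseq-style file) or an incomplete download.
    ckpt = torch.load('WavLM-Large.pt', map_location='cpu')
    print(type(ckpt))
    if isinstance(ckpt, dict):
        print(list(ckpt.keys()))
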
  • Why are my reproduced WavLM results on Vox1-O 30% worse?

    Model                       EER (mine)   EER (official)
    wavlm_large_nofinetune.pth  0.965        0.75
    wavlm_large_finetune.pth    0.631        0.431

    The above are the validation results of your shared WavLM models on the original Vox1-O data, without changing any code. What might be the reason for this gap? Wrong settings? Here is more background on my setup:

    1. Create a conda env as:
    conda create -n UniSpeech_py3p8 python=3.8
    
    2. Following your guidance under https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification
    pip install --require-hashes -r requirements.txt 
    

    The following error will appear:

    Collecting numpy<1.23.0,>=1.16.5
    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
        numpy<1.23.0,>=1.16.5 from https://files.pythonhosted.org/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))
    

    Then I installed the environment manually (around 30-40 packages), as described in https://github.com/microsoft/UniSpeech/issues/26

    3. Here are some related details:

    pip list | grep fairseq -> fairseq 0.12.1 /home/user1/tools/fairseq
    pip list | grep s3prl -> s3prl 0.3.1
    torch.version: 1.9.0+cu102
    python -V: 3.8.13

    Thanks for your wonderful work; looking forward to your help.

    opened by AIDman 2
  • More details about the output

    When I run the example in the UniSpeech-SAT directory of this repo, I get 'f' as a tensor of size torch.Size([1, 512, 31]). What exactly does the variable 'f' represent?

    opened by AhmedHashish123 2
  • Access Required for UniSpeech-SAT models

    Would it be possible to give everyone access to the UniSpeech-SAT models? Currently one cannot download the UniSpeech-SAT checkpoints, because the link is a Google Drive link that requires access, e.g.: https://drive.google.com/file/d/1l5etRW6W2aP_8I2Fs_8ailGZqEzdrAPz/view?usp=sharing

    opened by patrickvonplaten 2
  • Change listed source of Flashlight bindings

    Flashlight bindings have moved to https://github.com/flashlight/text and https://github.com/flashlight/sequence — point import failures to those repos.

    opened by jacobkahn 0
  • UniSpeech Model Download

    Hello, the download link for UniSpeech Large EN is invalid; can you update it? Also, could the UniSpeech Base EN model be shared as well? Thank you very much!

    opened by hongfeixue 0
  • Pre-training details for UniSpeech-SAT

    Hi there!

    Excellent UniSpeech-SAT paper. I have one question regarding pre-training, as the pre-training code isn't available (I would be happy to know if it is available anywhere). Was any kind of normalization applied to the model embeddings for the utterance-wise contrastive loss (e.g., L2 normalization or instance normalization)?

    Would be very helpful if you could help me with that!

    opened by Sreyan88 0
  • How to calculate the EER or DER

    Sorry to bother you, but I am a newcomer. I want to know how to calculate the EER or DER as presented in the paper; I cannot seem to find the code for it.

    opened by liyunlongaaa 0
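
    For EER specifically, a common recipe (not necessarily the exact scoring script used in the paper) is to compute it from the verification scores with scikit-learn, as in the sketch below; DER for diarization is typically computed with a separate scoring tool.

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(labels, scores):
        # labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
        # scores: similarity scores, where higher means more likely the same speaker.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        # EER is the point where false-acceptance and false-rejection rates cross.
        return (fpr[idx] + fnr[idx]) / 2.0  # fraction; multiply by 100 for percent
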
  • How to continue pre-training WavLM for ASR?

    I see this issue:

    the recipe for ASR fine-tuning can be found in fairseq repo.

    but there is no config file. Can you give a simple example?

    Thanks a lot.

    opened by cnlinxi 0
Owner
Microsoft
Open source projects and samples from Microsoft