Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

Overview

UniSpeech

The family of UniSpeech:

UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech-SAT (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training

Pre-trained models

We strongly recommend using our UniSpeech-SAT models for speaker-related tasks, since they achieve strong performance on various speaker-related benchmarks.

Model Dataset Download
UniSpeech Base 1500 hrs CommonVoice download
UniSpeech Large 1500 hrs CommonVoice download
UniSpeech-SAT Base 960 hrs LibriSpeech download
UniSpeech-SAT Base+ 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli download
UniSpeech-SAT Large 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli download
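
The UniSpeech-SAT checkpoints are also available on the Hugging Face Hub (see the issues below). Loading one with the transformers library looks roughly like the sketch that follows; a recent transformers release is assumed, and the dummy waveform should be replaced with real 16 kHz audio.

    import numpy as np
    import torch
    from transformers import UniSpeechSatModel, Wav2Vec2FeatureExtractor

    # Minimal sketch: load the UniSpeech-SAT Base+ checkpoint from the Hugging Face
    # Hub and extract frame-level representations from one second of dummy audio.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus')
    model = UniSpeechSatModel.from_pretrained('microsoft/unispeech-sat-base-plus')

    waveform = np.zeros(16000, dtype=np.float32)  # replace with real 16 kHz samples
    inputs = extractor(waveform, sampling_rate=16000, return_tensors='pt')
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)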

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu ([email protected]).

Comments
  • Is "unispeech_sat.th" wrong?

    Hello,

    I think the "unispeech_sat.th" is wrong. I have just cloned the repository and tried the speaker verification with Unispeech-SAT and when I launch the example :

    python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeech-SAT-Large.pt

    I have an error (end of the traceback):

    File "/data/coros1/ddallon/workspace/UniSpeech/UniSpeech-SAT/fairseq/models/__init__.py", line 88, in build_model
        assert model is not None, (
    AssertionError: Could not infer model type from {'_name': 'bc_m_hubert', 'label_rate': 50, 'extractor_mode': 'layer_norm', 'structure_type': 'transformer', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'boundary_mask': False, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'relative_position_embedding': False, 'num_buckets': 320, 'max_distance': 1280, 'gru_rel_pos': False, 'expand_attention_head_size': -1, 'streaming': False, 'chunk_size': 0, 'left_chunk': 0, 'num_negatives': 0, 'negatives_from_everywhere': False, 'cross_sample_negatives': 100, 'codebook_negatives': 0, 'quantize_targets': True, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'spk_layer': 12, 'mixing_max_len': -1, 'mixing_prob': 0.5, 'mixing_num': 1, 'pretrained_path': ''}.
    Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'wav2vec_transducer', 'hubert', 'hubert_ctc', 'transformer_lm', 'unispeech_sat'])
    Requested model type: bc_m_hubert

    And I notice that "bc_m_hubert" appears only in "unispeech_sat.th".

    Could you check it or help me? :-)

    opened by Damien-Da 15
  • Formula 6 in paper

    Hi there!

    Great repo and paper. I have a question that may just come from a mistake in my understanding of the paper/code. After reading through both, my understanding is:

    You first do CTC + contrastive training on the labeled data "L" and then optional pre-training on "M". However, from the paper I understand that these should be solved as a single task with joint multi-task training (Formula 6 in the paper). This is not reflected in the code.

    I would be glad if you could help. Thank you!

    opened by Sreyan88 7
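
    For reference, one reading of the joint objective discussed in this thread is a single weighted multi-task loss over the labeled set "L", sketched below in LaTeX. The weighting factor α is an assumption here; consult Formula 6 in the paper for the exact form.

    % Sketch of the joint multi-task objective discussed in this issue (assumption:
    % a weighting factor \alpha balances the two terms; see Formula 6 in the paper).
    \[
      \mathcal{L}_{\text{joint}}
        = \mathcal{L}_{\text{CTC}} + \alpha \, \mathcal{L}_{\text{contrastive}}
    \]
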
  • Unispeech-SAT fairseq code

    Hi!

    From UniSpeech/downstreams/speaker_diarization/README.md: "For UniSpeech-SAT large, we should install the Unispeech-SAT fairseq code."

    Where can I find the Unispeech-SAT fairseq code?

    Thanks in advance.

    opened by RuslanSel 4
  • Huggingface sat model missing tokenizer

    I tried to use the pretrained model from Hugging Face, but it seems no tokenizer was uploaded there.

    >>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
    OSError: Can't load tokenizer for 'microsoft/unispeech-sat-base-plus'. Make sure that:
    
    - 'microsoft/unispeech-sat-base-plus' is a correct model identifier listed on 'https://huggingface.co/models'
      (make sure 'microsoft/unispeech-sat-base-plus' is not a path to a local directory with something else, in that case)
    
    - or 'microsoft/unispeech-sat-base-plus' is the correct path to a directory containing relevant tokenizer files
    
    

    (1) Is there any workaround? (2) Also, since I don't need a tokenizer (the model is used for audio classification), is there an option to skip loading the tokenizer?

    cc @patrickvonplaten

    opened by bagustris 4
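
    A possible workaround (an assumption, not confirmed by the maintainers): since the SAT checkpoints are pre-trained without a CTC vocabulary, skip Wav2Vec2Processor (and its tokenizer) and load only the feature extractor, which is all an audio-classification setup needs:

    from transformers import UniSpeechSatModel, Wav2Vec2FeatureExtractor

    # Load only the feature extractor and the pre-trained model; no tokenizer is
    # required when the checkpoint is used for audio classification rather than CTC.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus')
    model = UniSpeechSatModel.from_pretrained('microsoft/unispeech-sat-base-plus')
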
  • diarization - KeyError: 'embed.weight'

    I got the error running:

    python diarization.py --config_path config/infer_est_nspk1.yaml --wav_path 0.wav --model_init WavLM-Large.pt

    Traceback (most recent call last):
      File "diarization.py", line 321, in <module>
        main(args)
      File "diarization.py", line 272, in main
        model_all_n_speakers = model_parameter_dict["embed.weight"].shape[0]
    KeyError: 'embed.weight'

    Thanks in advance.

    opened by RuslanSel 3
  • Getting speaker embeddings

    The UniSpeech-SAT directory in this repo contains an example that takes a .wav file as input and produces a tensor 'f' as output. Can I get speaker embeddings from 'f'?

    opened by AH289 3
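
    One simple way to turn 'f' into a fixed-size vector is to pool over the frame axis, as in the sketch below. This is only a naive baseline and an assumption about what is wanted; the repo's downstreams/speaker_verification recipe uses a dedicated speaker model on top of the pre-trained encoder rather than pooled raw features.

    import torch
    import torch.nn.functional as F

    # 'f' from the UniSpeech-SAT example has shape [1, 512, 31]: (batch, channels, frames).
    f = torch.randn(1, 512, 31)      # stand-in for the tensor produced by the example
    emb = f.mean(dim=-1)             # average over the frame axis -> [1, 512]
    emb = F.normalize(emb, dim=-1)   # optional L2 normalization before cosine scoring
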
  • Release vocab.json for Common Voice

    Hey UniSpeech team!

    Thanks a lot for making the pre-trained checkpoints available for everyone. Would you mind also open-sourcing the dictionaries /datablob/users/v-chengw/data/commonvoice_20200622/common_voices_splits/nl/phonesMatches_reduced.json for UniSpeech base & large so that the model can be used out of the box for inference?

    opened by patrickvonplaten 3
  • WavLM Inference Error

    I loaded the WavLM Large model from the link provided.

    When trying to follow the code for loading the pretrained model for inference, I get the following error:

    cfg = WavLMConfig(checkpoint['cfg'])
    KeyError: 'cfg'

    It looks like this model does not have any 'cfg' key or 'model' key for that matter.

    opened by bryant0918 2
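
    For context, the loading snippet quoted above expects a checkpoint dictionary with 'cfg' and 'model' entries. A quick diagnostic sketch (the file name is assumed to be the downloaded checkpoint) to see what the file actually contains:

    import torch

    # Inspect the downloaded checkpoint: the WavLM loading example expects a dict
    # with 'cfg' and 'model' keys. Anything else suggests a different checkpoint
    # format (e.g. a fine-tuned or fairseq-style file) or an incomplete download.
    ckpt = torch.load('WavLM-Large.pt', map_location='cpu')
    print(type(ckpt))
    if isinstance(ckpt, dict):
        print(list(ckpt.keys()))
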
  • Why are my reproduced WavLM results on Vox1-O 30% worse?

    Model                       EER (mine)   EER (official)
    wavlm_large_nofinetune.pth  0.965        0.75
    wavlm_large_finetune.pth    0.631        0.431

    The above are the validation results of your shared WavLM models on the original Vox1-O data, without changing any code. What might be the reason for this gap? Wrong settings? Here is more background on my setup:

    1. Create a conda env as:
    conda create -n UniSpeech_py3p8 python=3.8
    
    2. Following your guidance under https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification
    pip install --require-hashes -r requirements.txt 
    

    The following error will appear:

    Collecting numpy<1.23.0,>=1.16.5
    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
        numpy<1.23.0,>=1.16.5 from https://files.pythonhosted.org/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))
    

    Then I installed the environment manually (around 30-40 packages), as described in https://github.com/microsoft/UniSpeech/issues/26

    3. Here are some related details:

    pip list | grep fairseq -> fairseq 0.12.1 /home/user1/tools/fairseq
    pip list | grep s3prl -> s3prl 0.3.1
    torch.version: 1.9.0+cu102
    python -V: 3.8.13

    Thanks for your wonderful work; looking forward to your help.

    opened by AIDman 2
  • More details about the output

    When I run the example in the UniSpeech-SAT directory of this repo, I get 'f' as a tensor of size torch.Size([1, 512, 31]). What exactly does the variable 'f' represent?

    opened by AhmedHashish123 2
  • Access Required for UniSpeech-SAT models

    Would it be possible to give everyone access to the UniSpeech-SAT models? Currently one cannot download the UniSpeech-SAT checkpoints, because the link is a Google Drive link that requires access, e.g.: https://drive.google.com/file/d/1l5etRW6W2aP_8I2Fs_8ailGZqEzdrAPz/view?usp=sharing

    opened by patrickvonplaten 2
  • Change listed source of Flashlight bindings

    Flashlight bindings have moved to https://github.com/flashlight/text and https://github.com/flashlight/sequence — point import failures to those repos.

    opened by jacobkahn 0
  • UniSpeech Model Download

    Hello, the download link for UniSpeech Large EN is invalid; can you update it? Also, could the UniSpeech Base EN model be shared as well? Thank you very much!

    opened by hongfeixue 0
  • Pre-training details for UniSpeech-SAT

    Hi there!

    Excellent UniSpeech-SAT paper. I have one question regarding pre-training, as the pre-training code isn't available (I would be happy to know if it is available anywhere). Was any kind of normalization applied to the model embeddings for the utterance-wise contrastive loss (e.g., L2 normalization or instance normalization)?

    Would be very helpful if you could help me with that!

    opened by Sreyan88 0
  • How to calculate the EER or DER

    Sorry to bother you, but I am a newcomer. I want to know how to calculate the EER or DER as presented in the paper; I cannot seem to find the code for it.

    opened by liyunlongaaa 0
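
    For EER specifically, a common recipe (not necessarily the exact scoring script used in the paper) is to compute it from the verification scores with scikit-learn, as in the sketch below; DER for diarization is typically computed with a separate scoring tool.

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(labels, scores):
        # labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
        # scores: similarity scores, where higher means more likely the same speaker.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        # EER is the point where false-acceptance and false-rejection rates cross.
        return (fpr[idx] + fnr[idx]) / 2.0  # fraction; multiply by 100 for percent
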
  • How to continue pre-training WavLM for ASR?

    I see this issue:

    the recipe for ASR fine-tuning can be found in fairseq repo.

    but there is no config file. Can you give a simple example?

    Thanks a lot.

    opened by cnlinxi 0
Owner
Microsoft
Open source projects and samples from Microsoft