Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR



The family of UniSpeech:

UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech-SAT (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training

Pre-trained models

We strongly suggest using our UniSpeech-SAT model for speaker-related tasks, since it performs strongly on a range of speaker-related benchmarks.

Model | Dataset | Checkpoint
-- | -- | --
UniSpeech Base | 1500 hrs CommonVoice | download
UniSpeech Large | 1500 hrs CommonVoice | download
UniSpeech-SAT Base | 960 hrs LibriSpeech | download
UniSpeech-SAT Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | download
UniSpeech-SAT Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | download


This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu ([email protected]).

  • Is "unispeech_sat.th" wrong?

    I think "unispeech_sat.th" is wrong. I have just cloned the repository and tried speaker verification with UniSpeech-SAT. When I launch the example:

    python verification.py --model_name unispeech_sat --wav1 vox1_data/David_Faustino/hn8GyCJIfLM_0000012.wav --wav2 vox1_data/Josh_Gad/HXUqYaOwrxA_0000015.wav --checkpoint UniSpeech-SAT-Large.pt

    I get an error (end of the traceback):

    File "/data/coros1/ddallon/workspace/UniSpeech/UniSpeech-SAT/fairseq/models/__init__.py", line 88, in build_model
      assert model is not None, (
    AssertionError: Could not infer model type from {'_name': 'bc_m_hubert', 'label_rate': 50, 'extractor_mode': 'layer_norm', 'structure_type': 'transformer', 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': 'gelu', 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'boundary_mask': False, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'relative_position_embedding': False, 'num_buckets': 320, 'max_distance': 1280, 'gru_rel_pos': False, 'expand_attention_head_size': -1, 'streaming': False, 'chunk_size': 0, 'left_chunk': 0, 'num_negatives': 0, 'negatives_from_everywhere': False, 'cross_sample_negatives': 100, 'codebook_negatives': 0, 'quantize_targets': True, 'latent_vars': 320, 'latent_groups': 2, 'latent_dim': 0, 'spk_layer': 12, 'mixing_max_len': -1, 'mixing_prob': 0.5, 'mixing_num': 1, 'pretrained_path': ''}.
    Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'wav2vec_transducer', 'hubert', 'hubert_ctc', 'transformer_lm', 'unispeech_sat'])
    Requested model type: bc_m_hubert

    And I notice that "bc_m_hubert" appears only in "unispeech_sat.th".

    Could you check it or help me? :-)

    opened by Damien-Da 15
  • Formula 6 in paper

    Hi there!

    Great repo and paper. I have a question that may stem from a mistake in my understanding of the paper or the code. After reading through both, I understand the following:

    You first train with CTC + contrastive loss on labeled data "L" and then optionally pre-train on "M". However, from the paper I understand that both should be solved as a single task with joint multi-task training (formula 6 in the paper). This is not reflected in the code.

    Would be glad if you could please help. Thank You!

    opened by Sreyan88 7
  • Unispeech-SAT fairseq code

    From UniSpeech/downstreams/speaker_diarization/README.md: For UniSpeech-SAT large, we should install the Unispeech-SAT fairseq code.

    Where can I find the Unispeech-SAT fairseq code?

    Thanks in advance.

    opened by RuslanSel 4
  • Huggingface sat model missing tokenizer

    I tried to use the pretrained model from Hugging Face, but it seems no tokenizer was uploaded there.

    >>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
    OSError: Can't load tokenizer for 'microsoft/unispeech-sat-base-plus'. Make sure that:
    - 'microsoft/unispeech-sat-base-plus' is a correct model identifier listed on 'https://huggingface.co/models'
      (make sure 'microsoft/unispeech-sat-base-plus' is not a path to a local directory with something else, in that case)
    - or 'microsoft/unispeech-sat-base-plus' is the correct path to a directory containing relevant tokenizer files

    (1) Is there a workaround? (2) Since I don't need a tokenizer (I am doing audio classification), is there an option to skip loading the tokenizer?

    cc @patrickvonplaten

    opened by bagustris 4
  • diarization - KeyError: 'embed.weight'

    I got the following error when running:

    python diarization.py --config_path config/infer_est_nspk1.yaml --wav_path 0.wav --model_init WavLM-Large.pt

    Traceback (most recent call last):
      File "diarization.py", line 321, in <module>
        main(args)
      File "diarization.py", line 272, in main
        model_all_n_speakers = model_parameter_dict["embed.weight"].shape[0]
    KeyError: 'embed.weight'

    Thanks in advance.

    opened by RuslanSel 3
  • Getting speaker embeddings

    The UniSpeech-SAT directory in this repo contains an example. The example takes a .wav file as input and produces a tensor 'f' as output. Can I get the speaker embeddings from 'f'?

    opened by AH289 3
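For the question above, one generic way to turn frame-level features into a single utterance-level vector is to average over the time axis and compare two utterances with cosine similarity. The sketch below assumes 'f' has been converted to a plain (channels, frames) nested list; the helper names `mean_pool` and `cosine_similarity` are hypothetical and are not part of this repo. Whether vectors pooled this way from 'f' separate speakers well is not established here; the downstreams/speaker_verification directory contains the trained speaker verification setup.

```python
import math
from typing import List, Sequence


def mean_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Average a (channels, frames) feature map over the frame axis."""
    return [sum(channel) / len(channel) for channel in features]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy (channels=2, frames=2) feature maps standing in for two utterances.
utt_a = mean_pool([[1.0, 3.0], [2.0, 4.0]])
utt_b = mean_pool([[2.0, 2.0], [3.0, 3.0]])
similarity = cosine_similarity(utt_a, utt_b)
```

With real tensors, the same pooling corresponds to taking the mean over the last dimension of a (1, channels, frames) tensor.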
  • Release vocab.json for Common Voice

    Hey UniSpeech team!

    Thanks a lot for making the pre-trained checkpoints available for everyone. Would you mind also open-sourcing the dictionaries /datablob/users/v-chengw/data/commonvoice_20200622/common_voices_splits/nl/phonesMatches_reduced.json for UniSpeech base & large so that the model can be used out of the box for inference?

    opened by patrickvonplaten 3
  • WavLM Inference Error

    I loaded the WavLM Large model from the link provided.

    When following the code for loading the pretrained model for inference, I get the following error:

    cfg = WavLMConfig(checkpoint['cfg'])
    KeyError: 'cfg'

    It looks like this model does not have any 'cfg' key or 'model' key for that matter.

    opened by bryant0918 2
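A quick way to diagnose a KeyError like the one above is to check which top-level keys the loaded checkpoint actually has before indexing into it. The sketch below is a minimal illustration on a plain dictionary; `missing_keys` is a hypothetical helper, and the expected keys 'cfg' and 'model' are taken from the issue text, not from a documented checkpoint format.

```python
from typing import Any, Iterable, List, Mapping


def missing_keys(
    checkpoint: Mapping[str, Any],
    required: Iterable[str] = ("cfg", "model"),
) -> List[str]:
    """Return the required top-level keys that the checkpoint lacks."""
    return [key for key in required if key not in checkpoint]


# Stand-in for a loaded checkpoint that stores the weights at the top level
# (an assumption for illustration only).
fake_checkpoint = {"encoder.weight": [0.0], "encoder.bias": [0.0]}
absent = missing_keys(fake_checkpoint)
if absent:
    print(f"checkpoint lacks {absent}; top-level keys: {sorted(fake_checkpoint)}")
```

If both keys are absent, the file may be a bare state dict or a checkpoint saved by a different toolkit, in which case the loading code has to match that layout.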
  • Why are my reproduced WavLM results on Vox1-O 30% worse?

    model | EER (mine) | EER (official)
    -- | -- | --
    wavlm_large_nofinetune.pth | 0.965 | 0.75
    wavlm_large_finetune.pth | 0.631 | 0.431

    The above results are the validation results of your shared wav_lm models on the original Vox1-o data without changing any code. What might be the reason for this gap? Wrong settings? Here is more background about my setting:

    1. Create a conda env:
    conda create -n UniSpeech_py3p8 python=3.8
    2. Following your guidance under https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification, install the requirements:
    pip install --require-hashes -r requirements.txt

    The following error appears:

    Collecting numpy<1.23.0,>=1.16.5
    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
        numpy<1.23.0,>=1.16.5 from https://files.pythonhosted.org/packages/2f/14/abc14a3f3663739e5d3c8fd980201d10788d75fea5b0685734227052c4f0/numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=64f56fc53a2d18b1924abd15745e30d82a5782b2cab3429aceecc6875bd5add0 (from scipy==1.7.1->-r requirements.txt (line 1))

    I then installed the environment manually (around 30~40 packages), as in https://github.com/microsoft/UniSpeech/issues/26

    3. Here are some related details:
    pip list | grep fairseq
    fairseq 0.12.1 /home/user1/tools/fairseq
    pip list | grep s3prl
    s3prl 0.3.1
    torch.__version__: 1.9.0+cu102
    python -V: 3.8.13

    Thanks for your wonderful work; I look forward to your help.

    opened by AIDman 2
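To make the size of the gap reported above concrete, the relative increase in EER can be computed directly from the four numbers in the issue's table. This is plain arithmetic on the reported values; the helper name `relative_increase` is hypothetical.

```python
def relative_increase(mine: float, official: float) -> float:
    """Relative increase of `mine` over `official`, in percent."""
    return (mine - official) / official * 100.0


# (EER reproduced by the issue author, EER reported officially), from the table above.
results = {
    "wavlm_large_nofinetune.pth": (0.965, 0.75),
    "wavlm_large_finetune.pth": (0.631, 0.431),
}

for name, (mine, official) in results.items():
    print(f"{name}: {relative_increase(mine, official):.1f}% higher EER")
```

By this measure the checkpoint without fine-tuning is roughly 29% worse and the fine-tuned one roughly 46% worse, so the "30%" in the title matches the first row only.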
  • More details about the output

    When I try to run the example in UniSpeech-SAT directory in this repo, I get 'f' as a tensor of size torch.Size([1, 512, 31]). What exactly does the variable f represent?

    opened by AhmedHashish123 2
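One way to reason about the shape [1, 512, 31] in the question above is to push an input length through the convolutional feature extractor spec. The sketch below uses the 'conv_feature_layers' value printed in the config of the first issue's traceback and assumes unpadded 1-D convolutions and a 10,000-sample input; both are assumptions, since the issue does not state the input length or confirm which layer produces 'f'. Under those assumptions the arithmetic yields 512 channels and 31 frames.

```python
from typing import List, Tuple

# (channels, kernel_size, stride) per conv layer, copied from the
# 'conv_feature_layers' entry of the config printed in the first issue.
CONV_FEATURE_LAYERS: List[Tuple[int, int, int]] = (
    [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2
)


def conv_output_shape(
    num_samples: int,
    layers: List[Tuple[int, int, int]] = CONV_FEATURE_LAYERS,
) -> Tuple[int, int]:
    """Return (channels, frames) after a stack of unpadded 1-D convolutions."""
    channels, length = 1, num_samples
    for channels, kernel, stride in layers:
        # Output length of an unpadded convolution with the given kernel and stride.
        length = (length - kernel) // stride + 1
    return channels, length


shape = conv_output_shape(10_000)  # assumed input length in samples
```

If this reading is right, 'f' would hold frame-level convolutional features (one 512-dimensional vector per frame, with a total stride of 320 samples) rather than a single utterance-level embedding.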
  • Access Required for UniSpeech-SAT models

    Would it be possible to give everyone access to the UniSpeech-SAT models? Currently one cannot download the UniSpeech-SAT checkpoints because the link points to Google Drive and requires access, e.g.: https://drive.google.com/file/d/1l5etRW6W2aP_8I2Fs_8ailGZqEzdrAPz/view?usp=sharing

    opened by patrickvonplaten 2
  • Change listed source of Flashlight bindings

    Flashlight bindings have moved to https://github.com/flashlight/text and https://github.com/flashlight/sequence; point import failures to those repos.

    opened by jacobkahn 0
  • UniSpeech Model Download

    Hello, the download link for UniSpeech Large EN is invalid, can you update it? Also, can the UniSpeech Base EN model be shared as well? Thank you very much!

    opened by hongfeixue 0
  • pre-training detail for Unispeech-SAT

    Hi there!

    Excellent paper on UniSpeech-SAT. I have one question about pre-training, since the pre-training code does not seem to be available (I would be happy to know if it is available anywhere). Was any normalization (such as L2 normalization or instance normalization) applied to the model embeddings for the utterance-wise contrastive loss?

    Would be very helpful if you could help me with that!

    opened by Sreyan88 0
  • Hi, how to calculate the EER or DER

    Sorry to bother you, but I am new to this. I would like to know how to calculate the EER or DER reported in the paper. I could not find the code for it.

    opened by liyunlongaaa 0
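For the EER part of the question above, a minimal sketch of the standard definition may help: the equal error rate is the operating point where the false acceptance rate and the false rejection rate are equal. The function below sweeps every observed score as a threshold and returns the point where the two rates are closest. It is an illustration of the metric, not the scoring script used for the paper, and `compute_eer` is a hypothetical name; DER is not covered here.

```python
from typing import List, Tuple


def compute_eer(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (eer, threshold). Label 1 = same speaker, 0 = different speaker."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer, eer_threshold = float("inf"), 1.0, 0.0
    # Try every observed score as a decision threshold (accept if score >= threshold).
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2
            eer_threshold = threshold
    return eer, eer_threshold


# Toy trial list: similarity scores with same/different speaker labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer, toy_threshold = compute_eer(toy_scores, toy_labels)
```

This brute-force sweep is quadratic in the number of trials; for large trial lists such as Vox1-O, a sorted cumulative count or an ROC routine from a numerical library would be the more practical choice.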
  • How to continue pre-training WavLM for ASR?

    I saw this reply in another issue:

    "the recipe for ASR fine-tuning can be found in fairseq repo."

    However, there is no config file. Can you give a simple example?

    Thanks a lot.

    opened by cnlinxi 0
