A self-supervised learning framework for audio-visual speech

Meta Research

Last update: Jan 7, 2023

Overview

AV-HuBERT (Audio-Visual Hidden Unit BERT)

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Robust Self-Supervised Audio-Visual Speech Recognition

Introduction

AV-HuBERT is a self-supervised representation learning framework for audio-visual speech. It achieves state-of-the-art results in lip reading, ASR and audio-visual speech recognition on the LRS3 audio-visual speech benchmark.

If you find AV-HuBERT useful in your research, please use the following BibTeX entry for citation.

@inproceedings{shi2022avhubert,
    author  = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
    title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
    year = {2022}
}

@article{shi2022avsr,
    author  = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
    title = {Robust Self-Supervised Audio-Visual Speech Recognition},
    journal = {arXiv preprint arXiv:2201.01763},
    year = {2022}
}

License

AV-HuBERT LICENSE AGREEMENT

This License Agreement (as may be amended in accordance with this License Agreement, “License”), between you (“Licensee” or “you”) and Meta Platforms, Inc. (“Meta” or “we”) applies to your use of any computer program, algorithm, source code, object code, or software that is made available by Meta under this License (“Software”) and any specifications, manuals, documentation, and other written information provided by Meta related to the Software (“Documentation”).

By using the Software, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the Software or Documentation (collectively, the “Software Products”), and you must immediately cease using the Software Products.

Pre-trained and fine-tuned models

Please find the checkpoints here

Installation

First, create a conda virtual environment and activate it:

conda create -n avhubert python=3.8 -y
conda activate avhubert

Then, clone this repository:

git clone https://github.com/facebookresearch/av_hubert.git
cd av_hubert
git submodule init
git submodule update

Lastly, install Fairseq and the other packages:

pip install -r requirements.txt
cd fairseq
pip install --editable ./

Load a pretrained model

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]

Train a new model

Data preparation

Follow the steps in preparation to pre-process:

LRS3 and VoxCeleb2 datasets

Follow the steps in clustering (pre-train only) to create:

{train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 100Hz for MFCC features and 25Hz for AV-HuBERT features by default.
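
As a rough illustration of what the label rate implies, the sketch below estimates how many frame-aligned labels a clip of a given duration should have. The function name `expected_num_labels` is hypothetical and not part of the repository, and the actual clustering scripts may differ by a frame because of windowing or rounding.

```python
def expected_num_labels(duration_sec: float, label_rate_hz: int) -> int:
    """Approximate number of frame-aligned pseudo labels for one clip."""
    return int(round(duration_sec * label_rate_hz))


# A 4-second clip under the two default label rates mentioned above.
mfcc_labels = expected_num_labels(4.0, 100)  # MFCC features, 100Hz
avh_labels = expected_num_labels(4.0, 25)    # AV-HuBERT features, 25Hz
print(mfcc_labels, avh_labels)
```

A check like this can help catch a label_rate setting that does not match the features used for clustering, since each line of a .km file should then have roughly this many labels.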

Pre-train an AV-HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name, and the label rate is 100Hz.

To train a model, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/labels \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`
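
When launching several runs, it can be convenient to assemble the same overrides programmatically. The sketch below only builds and prints the argument list; it does not start training. The helper name `build_pretrain_cmd` is hypothetical, and every flag is copied from the command above.

```python
import os
import shlex


def build_pretrain_cmd(conf_dir, conf_name, data_dir, label_dir,
                       label_rate, run_dir, user_dir=None):
    """Return the fairseq-hydra-train invocation as a list of arguments."""
    # Mirrors common.user_dir=`pwd`: default to the current directory.
    user_dir = user_dir or os.getcwd()
    return [
        "fairseq-hydra-train",
        "--config-dir", conf_dir,
        "--config-name", conf_name,
        f"task.data={data_dir}",
        f"task.label_dir={label_dir}",
        f"model.label_rate={label_rate}",
        f"hydra.run.dir={run_dir}",
        f"common.user_dir={user_dir}",
    ]


cmd = build_pretrain_cmd(
    "/path/to/conf/", "conf-name", "/path/to/data", "/path/to/labels",
    100, "/path/to/experiment/pretrain/",
)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

To actually launch the run, the list could be passed to `subprocess.run(cmd, check=True)` from inside the avhubert directory.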

Finetune an AV-HuBERT model with Seq2Seq

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.wrd are saved at /path/to/labels, and the configuration file is saved at /path/to/conf/conf-name.

To fine-tune a pre-trained AV-HuBERT model at /path/to/checkpoint, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/labels \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`

Decode an AV-HuBERT model

Suppose test.tsv and test.wrd are the video list and transcripts of the split to be decoded, both saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint.

Seq2Seq decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/decode/s2s/test.

$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`

The command above uses the default decoding hyperparameters, which can be found in conf/s2s_decode.yaml. override.modalities can be set to ['video'] (for lip reading), ['audio'] (for ASR), or ['audio','video'] (for audio-visual speech recognition). These parameters can be configured from the command line. For example, to search with a beam size of 20, append generation.beam=20 to the command above. Important parameters include:

generation.beam
generation.lenpen
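
To illustrate how a length penalty can change which hypothesis a beam search prefers, the sketch below ranks two finished hypotheses by a length-normalized score, dividing the summed log-probability by length ** lenpen. This is a common convention in beam search decoders and is assumed here for illustration; it is not copied from the decoder used by infer_s2s.py, and all names are hypothetical.

```python
def normalized_score(sum_logprob: float, length: int, lenpen: float = 1.0) -> float:
    """Length-normalized hypothesis score (assumed convention, see lead-in)."""
    return sum_logprob / (length ** lenpen)


# (name, summed log-probability, number of tokens) for two toy hypotheses.
hyps = [("short", -4.0, 4), ("long", -6.0, 8)]

for lenpen in (0.0, 1.0):
    best = max(hyps, key=lambda h: normalized_score(h[1], h[2], lenpen))
    print(f"lenpen={lenpen}: best hypothesis is {best[0]!r}")
```

Under this convention, lenpen=0 compares raw log-probabilities and favors the shorter hypothesis, while larger values shift the preference toward longer outputs.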

To test your model in a noisy environment, append the following to the decoding command:

+override.noise_wav=/path/to/noise override.noise_prob=1 override.noise_snr={snr}

{snr} is the signal-to-noise ratio (SNR) and /path/to/noise is a folder containing noise manifest files (/path/to/noise/{valid,test}.tsv). See preparation for setting up this folder.
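
For intuition about what a given SNR value means, the sketch below mixes a noise signal into a speech signal so that the ratio of their average powers matches a target SNR in decibels. It is a minimal pure-Python illustration under that definition of SNR, not the repository's noise-augmentation code; `mean_power` and `mix_at_snr` are hypothetical names, and the signals are synthetic.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise to the target SNR (in dB) and add it to the speech."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a speech waveform and a noise waveform.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)
added = [m - s for m, s in zip(mixed, speech)]
achieved = 10 * math.log10(mean_power(speech) / mean_power(added))
print(round(achieved, 3))
```

Lower SNR values mean proportionally louder noise; at 0 dB the speech and the scaled noise have equal average power.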

Comments

How to fine-tune with my own dataset
Hello,

Congratulations on this project, it's amazing!!!

I want to test the single-modal Visual HuBERT because my project focuses on lip reading without audio, and I want to use the checkpoints of the Finetuned Models for Visual Speech Recognition with my own dataset, which isn't in English. For this reason, I would like to ask some questions about the section Finetune an AV-HuBERT model with Seq2Seq.

What is the configuration file, and how do I generate it?

What is tokenizer_bpe_model? Could you give me an example of how to generate it myself from my dataset? I can't use step 4 of the LRS3 preprocessing to generate it, as my dataset is made up of frames and not videos.

Are there two .tsv files in /path/to/data, one for training and another for validation, or only one .tsv file?

Also, the .tsv file stores the paths that contain the video frames, and the .wrd file stores the paths that contain the sentences, right?

For example, if I have this organization for training data.

The .tsv file should look like this:

train_dataset/data_speaker/speaker0/speaker0_01
train_dataset/data_speaker/speaker0/speaker0_02
train_dataset/data_speaker/speaker0/....
train_dataset/data_speaker/speaker1/speaker1_01
train_dataset/data_speaker/speaker1/speaker1_02
train_dataset/data_speaker/speaker1/...
.
.
.
train_dataset/data_speaker/speakerN/...

and the .wrd file like this:

train_dataset/transcriptions_speaker/speaker0/transcription_speaker0_01.txt
train_dataset/transcriptions_speaker/speaker0/transcription_speaker0_02.txt
train_dataset/transcriptions_speaker/speaker0/....
train_dataset/transcriptions_speaker/speaker1/transcription_speaker1_01.txt
train_dataset/transcriptions_speaker/speaker1/transcription_speaker1_01.txt
train_dataset/transcriptions_speaker/speaker1/...

right?

Thanks a lot in advance!
opened by YadiraRoCa 21
Count number of frames per clip Problem

for rank in $(seq 0 $((nshard - 1)));do cat ${lrs3}/nframes.audio.${rank}; done > ${lrs3}/nframes.audio
for rank in $(seq 0 $((nshard - 1)));do cat ${lrs3}/nframes.video.${rank}; done > ${lrs3}/nframes.video

Hello, will these two lines of code actually work in the terminal?

opened by dinghongzhe1024 11
configuration files for the "no pre-training" setups

Hi authors,

Thank you for sharing such nice research work! Thanks to your previous help, I have successfully completed the LRS3 and MUSAN data preparation.

I am now interested in directly fine-tuning the AVSR system (without pre-training, because of limited computing resources), and hope to reproduce the following highlighted systems in the paper "Robust Self-Supervised Audio-Visual Speech Recognition":

I wonder if you can share the config files for these four systems?

If that is inconvenient, could you give me some guidance on how to modify the existing config files in the directory conf/av-finetune/? I am new to this field and don't know how to do this, so I hope you can help me :)

Thank you very much!!

opened by YUCHEN005 10
`trim_video_frame` generates only empty folders

Hi, when I run step 2 of lrs3_prepare.py, the progress bar advances normally (no errors or warnings), although slowly, but the outputs are nowhere to be found. Only empty folders are generated under short-pretrain.

So far the progress bar looks like this: 44%|████▍ | 52029/118516 [5:20:19<8:22:04, 2.21it/s], and all I have are 2778 empty folders.

Hope you can give me some help, thanks!

opened by 18445864529 6
Error in step 2 of preprocessing, what values to put in ${rank} and ${nshard}

Hi authors, great work! Could you please help with the values of {rank} and {nshard}? With {rank} set to 5 and {nshard} set to 6, I get the following error. I would be really grateful if you could help me out with it!

python lrs3_prepare.py --lrs3 data --ffmpeg /path/to/ffmpeg --rank 5 --nshard 6 --step 2
Trim original videos in pretrain
Step 2. Trim video and audio
Total videos in current shard: 6363/38198
  0%| | 0/6363 [00:00<?, ?it/s]The system cannot find the path specified.
  0%| | 0/6363 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "lrs3_prepare.py", line 235, in <module>
    trim_pretrain(args.lrs3, args.ffmpeg, args.rank, args.nshard, step=args.step)
  File "lrs3_prepare.py", line 179, in trim_pretrain
    trim_video_frame(csv_fn, pretrain_dir, output_video_dir, ffmpeg, rank, nshard)
  File "lrs3_prepare.py", line 127, in trim_video_frame
    pipe = subprocess.call(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT) # subprocess.PIPE
  File "D:\Users\jeet\Anaconda3\envs\avhubert\lib\subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "D:\Users\jeet\Anaconda3\envs\avhubert\lib\subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "D:\Users\jeet\Anaconda3\envs\avhubert\lib\subprocess.py", line 1311, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

opened by JeetShah25 5

fairseq-hydra-train with single-node multiple-gpu training

Hi, there,

When I started training on my machine with eight GPUs using the command provided in the README:

fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`

it did start, but it only used device 0, with the following message:

[2022-02-15 19:25:02,218][fairseq.utils][INFO] - ***********************CUDA enviroments for all 1 workers***********************
[2022-02-15 19:25:02,218][fairseq.utils][INFO] - rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB
[2022-02-15 19:25:02,218][fairseq.utils][INFO] - ***********************CUDA enviroments for all 1 workers***********************
[2022-02-15 19:25:02,219][fairseq_cli.train][INFO] - training on 1 devices (GPUs/TPUs)
[2022-02-15 19:25:02,219][fairseq_cli.train][INFO] - max tokens per device = 1000 and max sentences per device = None
[2022-02-15 19:25:02,220][fairseq.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2022-02-15 19:25:02,221][fairseq.trainer][INFO] - No existing checkpoint found checkpoints/checkpoint_last.pt
[2022-02-15 19:25:02,221][fairseq.trainer][INFO] - loading train data for epoch 1

Even when I added CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 before the command, the remaining GPUs stayed idle.

Any suggestions to solve this?

Thanks in advance.

opened by stoneyang 5

Questions about data loading

Hi, excellent work!

I'm trying to understand a couple of details regarding the data loading for the self-supervised pre-training stage. From my understanding (please correct me if I'm wrong), in the default setting, each mini-batch contains sequences of variable length, potentially ranging from min_sample_size=5 (corresponding to 0.2 secs at 25fps) to max_sample_size=500 (20 secs). Is any "uniform batching" performed, such that each batch has sequences of approximately equal length and the number of padded zeros is minimised, to speed up training? I couldn't spot it in the code.

Also, in lrs3_prepare.py, it seems that utterances longer than 15 secs are trimmed, whereas in vox_prepare, no utterances are trimmed. Does this mean that when both LRS3 and VoxCeleb2 are used to pre-train the model, the longest possible LRS3 sequence is 15 secs whereas the longest possible VoxCeleb2 sequence is 20 secs (determined by max_sample_size)?

opened by ahaliassos 5
non-deterministic results when decoding with noises

Dear authors,

Thanks a lot for your excellent work. I would like to ask whether your results are deterministic when decoding with noise, e.g. babble noise. In my experiments, the noise probability is set to 1 during decoding. I find that the decoding results are not deterministic across multiple runs. The difference can be larger than 1%.

I find the reason could be the random seed when selecting noises: https://github.com/facebookresearch/av_hubert/blob/5ab235b3d9dac548055670d534b283b5b70212cc/avhubert/hubert_dataset.py#L305

Do you have the same observation? If so, how did you solve the non-determinism?

I'm looking forward to your reply!

Merry Christmas and best regards, Zhengyang

opened by joyolee 4
Checkpoints of finetuned AVSR models without VoxCeleb2 data?

Hi, thanks for your awesome work! I'm wondering if you have any plan to release the checkpoints of base/large_noise_lrs3_pt_noise_433h/30h_ft, i.e. pretrained and finetuned only on LRS3? It will be useful since we're planning to take your work as a strong model for comparison in the AVSR task on LRS3.

opened by ALIVE321 4
model register problem with multiple pre-trained models

Hi,

Thanks for your wonderful work.

I want to load multiple pre-trained models in the same task. But when I try to load another pre-trained model, a model registration error is raised, as follows:

Traceback (most recent call last):
  File "hubert_asr.py", line 489, in load_vision_model
    model = task.build_model(cfg.model)
  File "/home3/chenchen/research/hubert/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
    model = models.build_model(cfg, self)
  File "/home3/chenchen/research/hubert/fairseq/fairseq/models/__init__.py", line 89, in build_model
    assert model is not None, (
AssertionError: Could not infer model type from {'_name': 'av_hubert_seq2seq',.......,Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'hubert', 'hubert_ctc', 'transformer_lm', 'av_hubert', 'av_hubert_ctc']) Requested model type: av_hubert_seq2seq.

I think this is because 'av_hubert_seq2seq' is not registered in the model list. How can I add it?

Thanks!

opened by chrisole 4
Problem in finetuning

Thanks for sharing this research work! I tried to fine-tune the LRS3 checkpoint, but I got this error:

File "~/av_hubert/avhubert/hubert_asr.py", line 474, in build_model
    del state['model']['mask_emb']
KeyError: 'mask_emb'

After commenting out this line, this error occurred:

File "~/av_hubert/avhubert/hubert_asr.py", line 386, in forward
AttributeError: 'AVHubertSeq2Seq' object has no attribute 'extract_finetune'

opened by javadpeyman 4
Fixing Colab

Hei there,

1.) In line 1, change % cd /content/ to %cd /content/ --> remove the space between % & cd

2.) <play_video(mouth_roi_path)> add MIME-Type 'video/mp4' to play it within browser/colab. It's easily fixable by supplying the 'magic' file via colab (~/) itself, since it won't automatically look it up at ~/mime.types.

3.) The inference part is broken, since it produces "Prediction - {ns} {ns} {ns} {ns} {ns} {ns} {ns} {ns} {ns} {ns} ". Not quite sure how to fix it. :(

Would it be possible to publish a colab, like the one that produced the guitar player with a computer voice? Or, alternatively, could you point me in the right direction how to achieve this effect? This would be spot-on for an art installation I'm currently working on.

Thank you very much, merry christmas and a happy new year! <3

opened by hideosnes 0
issue during 1st iteration of pretraining

Hi.

I am running the first iteration of pre-training, and I encountered the following warning during training:

UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 1, 5, 7, 7], strides() = [245, 1, 49, 7, 1]
bucket_view.sizes() = [64, 1, 5, 7, 7], strides() = [245, 245, 49, 7, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

What is the problem?

Also, how long did the first iteration of pre-training take for you? It seems to take too long on my system. I'm using 6 GPUs (32 GB, A100) for the first iteration of pre-training. Could you share the specs of the GPU system used for pre-training and its training time?

Thank you so much.

opened by sungheedong 1
preparation-cnn_face_detector

Hi.

I'm doing the data preparation using dlib and avhubert/preparation. First I installed dlib via "yum install" and then downloaded python_examples from the dlib GitHub page.

I ran:

python3 detect_landmark.py --root ../LRS3 --landmark ../LRS3/landmark --manifest ../LRS3/file.list.org --cnn_detector ../dlib/python_examples/cnn_face_detector.py --face_predictor ../dlib/python_examples/face_detector.py --ffmpeg /bin/ffmpeg --rank 1 --nshard 2

Then, I got the following error message:

Traceback (most recent call last):
  File "detect_landmark.py", line 72, in <module>
    detect_face_landmarks(args.face_predictor, args.cnn_detector, args.root, args.landmark, args.manifest, args.rank, args.nshard)
  File "detect_landmark.py", line 34, in detect_face_landmarks
    cnn_detector = dlib.cnn_face_detection_model_v1(cnn_detector_path)
RuntimeError: An error occurred while trying to read the first object from the file '/home/dsh/dlib/python_examples/cnn_face_detector.py'.
ERROR: Unexpected version found while deserializing dlib::add_loss_layer.

How can I solve the problem? Please help.

Thank you.

opened by sungheedong 1
infer_s2s.py: Load dataset (possibly sharded) ???

I realized that only part of the test dataset is evaluated when running infer_s2s.py. After inspecting the code, I found the comment "Load dataset (possibly sharded)" here. Specifically, the test set of my database has around 400 samples, but only 150 are decoded. Why? How can I solve this? I tried setting different parameters of the dataset/task, but with no success. I would like to obtain the %WER on the whole test set for comparison and benchmarking purposes.

opened by david-gimeno 2
Question on result of pretrain 433h and finetune 30h on LRS3

Hi, I recently finished pre-training and fine-tuning on the LRS3 dataset, pre-training on 433h and fine-tuning on 30h. The pre-training config files I used are avhubert/conf/pretrain/base_lrs3_iter[1-5].yaml. The fine-tuning config file I used is avhubert/conf/av-finetune/base_noise_pt_noise_ft_30h.yaml, with the noise parameters deleted. The decoding result for the 5 iterations is here

The result of your paper is here

So the audio-HuBERT result on the same dataset is 4.9 in your paper, while the result of the AV-HuBERT I trained is 6.73. I think there is something wrong with the training process; otherwise, the AV-HuBERT result should be better than the audio-HuBERT result of 4.9. Here are my questions about the pre-training config file:

1. modality_dropout is 0 in the config file (iter 1-4), but in the paper it should be 0.5.
2. mask_prob_image is 0.8, but in the paper it should be 0.08.

In short, what should I do to reproduce your experimental results? Looking forward to your reply.

opened by li563042811 5

