TalkNet: Audio-Visual Active Speaker Detection Model

Overview


This repository contains the code for our ACM MM 2021 paper on TalkNet, an active speaker detection model that detects whether the face on screen is speaking or not. [Paper] [Video_English] [Video_Chinese].

[Figure: overall.png, the overall architecture of TalkNet]

  • Awesome ASD: Papers about active speaker detection from recent years.

  • TalkNet in AVA-ActiveSpeaker dataset: The code to preprocess the AVA-ActiveSpeaker dataset, train TalkNet on the AVA train set, and evaluate it on the AVA val/test sets.

  • TalkNet in TalkSet and Columbia ASD dataset: The code to generate TalkSet, an ASD dataset in the wild based on VoxCeleb2 and LRS3, train TalkNet on TalkSet, and evaluate it on the Columbia ASD dataset.

  • An ASD demo with the pretrained TalkNet model: An end-to-end script to detect and mark the speaking face with the pretrained TalkNet model.


Dependencies

Start by building the environment:

conda create -n TalkNet python=3.7.9 anaconda
conda activate TalkNet
pip install -r requirement.txt

Or start from an existing environment:

pip install -r requirement.txt

TalkNet in AVA-ActiveSpeaker dataset

Data preparation

The following script can be used to download and prepare the AVA dataset for training.

python trainTalkNet.py --dataPathAVA AVADataPath --download 

AVADataPath is the folder where the AVA dataset and its preprocessing outputs will be saved; the details can be found here. Please read them carefully.

Training

Then you can train TalkNet on AVA end-to-end by using:

python trainTalkNet.py --dataPathAVA AVADataPath

  • exps/exps1/score.txt: output score file
  • exps/exps1/model/model_00xx.model: trained models
  • exps/exps1/val_res.csv: predictions for the val set

Pretrained model

Our pretrained model achieves 92.3 mAP on the validation set; you can check this by using:

python trainTalkNet.py --dataPathAVA AVADataPath --evaluation

The pretrained model will automatically be downloaded into TalkNet_ASD/pretrain_AVA.model. It achieves 90.8 mAP on the test set.


TalkNet in TalkSet and Columbia ASD dataset

Data preparation

We find it challenging to apply the model trained on AVA to videos outside AVA (the reason is explained here, Q1). So we built TalkSet, an active speaker detection dataset in the wild, based on VoxCeleb2 and LRS3.

We do not plan to upload this dataset, since we only modified existing data rather than collecting it. In the TalkSet folder we provide .txt files that describe which files we used to generate TalkSet, together with their ASD labels. You can generate TalkSet yourself if you are interested in training an ASD model in the wild.
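
If you do generate TalkSet, a quick sanity check on the list files can help. Below is a minimal Python sketch that counts entries per category tag; the whitespace-separated layout (a leading category tag such as TAudio, two clip identifiers, then numeric timing/label fields) is inferred from the label excerpts quoted in the comments below, so treat it as an assumption rather than an official specification.

# Minimal sketch: count TalkSet list entries per leading category tag.
# The field layout beyond the first tag is an assumption based on the
# label excerpts shown in the comments section, e.g.:
#   TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
from collections import Counter

def summarize_list_file(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:                  # skip blank lines
                counts[fields[0]] += 1  # first field is the category tag
    return counts

# Example (hypothetical path): print(summarize_list_file("TalkSet/lists_out/TAudio.txt"))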

We also provide our TalkNet model pretrained on TalkSet. You can evaluate it on the Columbia ASD dataset or on other raw videos in the wild.

Usage

A model pretrained on TalkSet will be downloaded into TalkNet_ASD/pretrain_TalkSet.model when you run the following script:

python demoTalkNet.py --evalCol --colSavePath colDataPath

The Columbia ASD dataset and its labels will also be downloaded into colDataPath. Finally, you will get the following F1 results:

Name  Bell   Boll   Lieb   Long   Sick   Avg.
F1    98.1   88.8   98.7   98.0   97.7   96.3

(This result differs from the one in our paper because we retrained the model; the average F1 is very similar.)
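
For reference, the F1 values in the table follow the standard per-frame binary F1 definition; here is a minimal sketch of that computation (the arrays are hypothetical placeholders, not the repository's evaluation script):

# Minimal sketch of a binary F1 computation over per-frame ASD decisions
# (1 = speaking, 0 = not speaking). Illustrative only; not the official
# evaluation code of this repository.
from sklearn.metrics import f1_score

labels      = [1, 1, 0, 0, 1, 0, 1]   # hypothetical ground truth
predictions = [1, 0, 0, 0, 1, 0, 1]   # hypothetical model decisions
print("F1: %.1f" % (100 * f1_score(labels, predictions)))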


An ASD Demo with pretrained TalkNet model

Data preparation

We provide an end-to-end script to detect and extract the active speaker from a raw video using our model pretrained on TalkSet.

You can put a raw video (.mp4 and .avi are both fine), such as 001.mp4, into the demo folder.

Usage

python demoTalkNet.py --videoName 001

A model pretrained on TalkSet will be downloaded into TalkNet_ASD/pretrain_TalkSet.model. The structure of the output results can be found here.

You will get the output video demo/001/pyavi/video_out.avi, in which the active speaker is marked with a green box and non-active speakers with red boxes.
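
The marking logic amounts to choosing a box color from the per-frame score; below is a minimal OpenCV sketch of the idea (the function name, box format, and the 0 threshold are assumptions based on the demo's described behavior, not the demo code itself):

# Minimal sketch: mark a face green when the ASD score says "speaking",
# red otherwise. The threshold of 0 and all names here are assumptions.
import cv2

def mark_face(frame, box, score):
    x1, y1, x2, y2 = box                                 # face bounding box
    color = (0, 255, 0) if score > 0 else (0, 0, 255)    # BGR: green / red
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, "%.2f" % score, (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return frame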


Citation

Please cite the following if our paper or code is helpful to your research.

@inproceedings{tao2021TalkNet,
  title={Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection},
  author={Tao, Ruijie and Pan, Zexu and Das, Rohan Kumar and Qian, Xinyuan and Shou, Mike Zheng and Li, Haizhou},
  booktitle={Proceedings of the ACM International Conference on Multimedia (MM)},
  year={2021}
}

I have summarized some potential FAQs. This is my first open-source work; please let me know how I can further improve this repository. Thanks for your support!

Comments
  • How to modify the face detection and tracking algorithm for cartoon faces?

    I want to apply this repo to a cartoon video, but it cannot detect cartoon faces. What could be a solution? Should I retrain with new data (in AVA format), or just modify the face detection algorithm?

    opened by saumyaborwankar 17
  • Audio input size

    Thanks for the contribution. From your paper, the input image size is 128x128 and the MFCC dimension is 13. What is the size of the MFCC in the temporal direction? I am asking because your code inputs a different MFCC size at each iteration. How does the network manage a different audio input at each iteration? I expected a uniform input size for the audio network.

    opened by Falmi 16
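
    Regarding the variable MFCC length above: with common MFCC settings (13 cepstra, 10 ms hop), the number of temporal frames scales with the clip duration, roughly 100 frames per second, so clips of different lengths naturally yield different audio input sizes. A minimal sketch, assuming 16 kHz audio and the python_speech_features package:

    # Minimal sketch: MFCC temporal length grows with clip duration.
    # Assumes 16 kHz audio and python_speech_features defaults
    # (25 ms window, 10 ms step), i.e. roughly 100 frames per second.
    import numpy as np
    from python_speech_features import mfcc

    sample_rate = 16000
    for seconds in (1.0, 2.5, 4.0):
        signal = np.random.randn(int(sample_rate * seconds))  # placeholder audio
        feats = mfcc(signal, samplerate=sample_rate, numcep=13)
        print(seconds, feats.shape)  # temporal frames scale with duration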
  • About downloading the pretrained model

    Hello, after running the command as you described, the code reports the following error:

    (pytorch) D:\project_code_home\V-A\TalkNet_ASD>python trainTalkNet.py --dataPathAVA AVA --evaluation
    Traceback (most recent call last):
      File "trainTalkNet.py", line 93, in <module>
        main()
      File "trainTalkNet.py", line 46, in main
        visualPath=os.path.join(args.visualPathAVA, 'train'), **vars(args))
      File "D:\project_code_home\V-A\TalkNet_ASD\dataLoader.py", line 95, in __init__
        mixLst = open(trialFileName).read().splitlines()
    FileNotFoundError: [Errno 2] No such file or directory: 'AVA\csv\train_loader.csv'

    I have tried both relative and absolute paths for dataPath, but it still fails and nothing downloads. I am running on Windows; downloading the pretrained model should not depend on whether my machine has a GPU, right? Also, my network can reach Google in the browser, and downloading files from GitHub works.

    Looking forward to your reply, thank you!

    opened by Ma0110 9
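
    For readers hitting the same FileNotFoundError: the message means the preprocessing outputs do not exist at the path the loader builds from --dataPathAVA. A small debugging sketch (standard library only) to check what the loader expects:

    # Debugging sketch: verify that the CSV the loader expects actually
    # exists. "AVA" mirrors the --dataPathAVA value from the issue above.
    import os

    data_path = "AVA"
    expected = os.path.join(data_path, "csv", "train_loader.csv")
    print(expected, "exists:", os.path.exists(expected))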
  • What causes the gap in the TalkSet labels?

    Thank you for your work and the very detailed explanation! I have downloaded VoxCeleb2 & LRS3, used your generate_TalkSet.py code, and successfully generated TalkSet. But there is a difference between the labels in the TAudio.txt I generated and the same file in lists_out. Does this gap have a big impact on training an 'in the wild' model?

    My labels:
    TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
    TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.63 0 5.63 0 0
    TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
    TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 5.95 0 5.95 0 0

    Your labels:
    TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.12 0 5.12 0 0
    TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.69 0 5.69 0 0
    TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
    TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 6.01 0 6.01 0 0

    opened by gd2016229035 7
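
    One way to quantify such a gap is to compare the numeric fields of the two list files line by line; here is a hedged sketch, assuming both files use the whitespace-separated format shown above and are aligned line-for-line:

    # Sketch: report the largest absolute difference between numeric fields
    # of two aligned TalkSet list files; non-numeric fields (tags, clip ids)
    # are skipped. Assumes the whitespace-separated format quoted above.
    def max_label_gap(path_a, path_b):
        gap = 0.0
        with open(path_a) as fa, open(path_b) as fb:
            for line_a, line_b in zip(fa, fb):
                for tok_a, tok_b in zip(line_a.split(), line_b.split()):
                    try:
                        gap = max(gap, abs(float(tok_a) - float(tok_b)))
                    except ValueError:
                        pass  # skip tags and clip identifiers
        return gap

    # Example (hypothetical paths): print(max_label_gap("my/TAudio.txt", "lists_out/TAudio.txt"))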
  • Getting NaN values for prediction

    Hello everyone,

    Thank you so much for sharing the code; it has been very helpful. I was using the pretrained model on a video, and I can tell that it smoothly detects all faces, but it highlights them all in red (not speaking) and does not show a number for the prediction; it simply shows NaN. Do you know what could be wrong?

    Image attached: https://i.imgur.com/ssgjjP0.png

    opened by AdhamKhalifa 6
  • Question

    File "trainTalkNet.py", line 86, in <module>
      main()
    File "trainTalkNet.py", line 35, in main
      loader = train_loader(trialFileName = args.trainTrialAVA,
    File "D:\python project\TalkNet_ASD-main\dataLoader.py", line 97, in __init__
      sortedMixLst = sorted(mixLst, key=lambda data: (int(float(data.split('\t')[1])), int(float(data.split('\t')[-1]))),reverse=True)
    File "D:\python project\TalkNet_ASD-main\dataLoader.py", line 97, in <lambda>
      sortedMixLst = sorted(mixLst, key=lambda data: (int(float(data.split('\t')[1])), int(float(data.split('\t')[-1]))),reverse=True)
    ValueError: could not convert string to float: '20467",

    What causes this? The data format is correct.

    opened by songhaozhen 6
  • How can I train the network with TalkSet?

    Hi, thanks a lot for sharing your code :) I must have missed something: how can I train the network using TalkSet? Is it the same procedure?

    Thanks!

    opened by bugdary 5
  • ASD confidence/score

    Hi there,

    Thank you for releasing your code and models, very impressive results! :)

    I am evaluating your TalkNet demo model on a new dataset. I now need to draw a precision-recall curve, so I was expecting something like a per-frame ASD confidence in the range [0, 1], so that I can gradually vary the confidence threshold from 0% to 100% and compute a pair of precision and recall rates at each step. Instead, TalkNet provides per-frame scores that range roughly from -2 or -3 to +2 or +3, without clear max or min boundaries, where positive values are active frames and negative values are silent frames.

    From my understanding, the last FC layer is followed by a softmax operation, so shouldn't the output be expressed as a [0, 1] confidence? Is there a way to convert the output score into a confidence? I was thinking of simply applying a sigmoid function to the output score, but perhaps I am missing something.

    Thank you again for your work, looking forward to hearing back from you! Davide

    opened by dberghi 4
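
    On the question above: since a sigmoid is monotone, squashing the raw score into [0, 1] preserves the score ordering, which is all a precision-recall sweep needs. A minimal sketch (whether this exactly recovers the model's softmax probability is not asserted here):

    # Minimal sketch: map an unbounded ASD score to a [0, 1] confidence with
    # a sigmoid for precision-recall sweeps. The mapping is monotone, so the
    # PR curve is the same as thresholding the raw score directly.
    import numpy as np

    def score_to_confidence(score):
        return 1.0 / (1.0 + np.exp(-score))

    for s in (-3.0, 0.0, 3.0):
        print(s, "->", round(score_to_confidence(s), 3))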
  • Request for clarification on evaluation code

    Hi Tao, could you explain why you added only "Speaking_Audible" to the prediction CSV file after prediction? https://github.com/TaoRuijie/TalkNet-ASD/blob/85b37afc32bedcadd8b5dfbc7f69298a76de279b/talkNet.py#L66 I would be glad if you could explain the evaluation logic you used in your code. Thanks!

    opened by Falmi 4
  • Could you please explain the detailed meaning of the csv file?

    In test_loader.csv I see many arrays that look like [1,1,1,1,1,1,1,1,1], each followed by a big number like 21231.

    And there are a lot of ids, such as video id, instance id, entity id, and label id.

    I guess the video id is the id that associates the face clips with each related video, and entity_box_x,y are the face crop coordinates, but what are the meanings of the other ids? Could you please give more details?

    Really, thanks!

    opened by xiejiachen 4
  • About the test set

    When I evaluate on the test set, the result is always mAP=100%, while evaluating on the val set works fine. Has the author encountered this problem? I also noticed that in the test_orig.csv file all the labels are SPEAKING_AUDIBLE; I wonder whether this is related.

    Below is the output produced during the run:

    WARNING:root:frame length (556) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
    (the warning above repeats many times throughout the run)
    46%|███████▏ | 9814/21361 [32:11<09:59, 19.25it/s]
    46%|███████▏ | 9819/21361 [32:12<17:17, 11.12it/s]
    46%|███████▎ | 9834/21361 [32:14<30:59, 6.20it/s]
    100%|████████| 21361/21361 [1:07:15<00:00, 5.29it/s]
    mAP 100.00%

    opened by Overcautious 4