TalkNet: Audio-Visual Active Speaker Detection Model

Overview


This repository contains the code for our ACM MM 2021 paper on TalkNet, an active speaker detection model that detects whether the face on screen is speaking or not. [Paper] [Video_English] [Video_Chinese].

[Figure: overall.png, the overall architecture of TalkNet]

  • Awesome ASD: Papers about active speaker detection from recent years.

  • TalkNet in AVA-ActiveSpeaker dataset: The code to preprocess the AVA-ActiveSpeaker dataset, train TalkNet on the AVA train set, and evaluate it on the AVA val/test sets.

  • TalkNet in TalkSet and Columbia ASD dataset: The code to generate TalkSet, an ASD dataset in the wild based on VoxCeleb2 and LRS3, train TalkNet on TalkSet, and evaluate it on the Columbia ASD dataset.

  • An ASD demo with the pretrained TalkNet model: An end-to-end script to detect and mark the speaking face with the pretrained TalkNet model.


Dependencies

Start by building the environment:

conda create -n TalkNet python=3.7.9 anaconda
conda activate TalkNet
pip install -r requirement.txt

Or start from an existing environment:

pip install -r requirement.txt

TalkNet in AVA-ActiveSpeaker dataset

Data preparation

The following script can be used to download and prepare the AVA dataset for training.

python trainTalkNet.py --dataPathAVA AVADataPath --download 

AVADataPath is the folder where the AVA dataset and its preprocessing outputs will be saved; the details can be found here. Please read them carefully.

Training

Then you can train TalkNet on AVA end-to-end by using:

python trainTalkNet.py --dataPathAVA AVADataPath

  • exps/exps1/score.txt: output score file
  • exps/exps1/model/model_00xx.model: trained models
  • exps/exps1/val_res.csv: predictions for the val set

Pretrained model

Our pretrained model achieves 92.3 mAP on the validation set; you can check this by using:

python trainTalkNet.py --dataPathAVA AVADataPath --evaluation

The pretrained model will automatically be downloaded into TalkNet_ASD/pretrain_AVA.model. It achieves 90.8 mAP on the test set.


TalkNet in TalkSet and Columbia ASD dataset

Data preparation

We find it challenging to apply the model trained on AVA to videos outside AVA (the reason is explained here, Q1). So we built TalkSet, an active speaker detection dataset in the wild, based on VoxCeleb2 and LRS3.

We do not plan to upload this dataset, since we only modified existing data rather than collecting it. In the TalkSet folder we provide .txt files that describe which files we used to generate TalkSet, together with their ASD labels. You can generate TalkSet yourself if you are interested in training an ASD model in the wild.
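
If you do generate TalkSet, a quick sanity check on the list files can help. Below is a minimal Python sketch that counts entries per category tag; the whitespace-separated layout (a leading category tag such as TAudio, two clip identifiers, then numeric timing/label fields) is inferred from the label excerpts quoted in the comments below, so treat it as an assumption rather than an official specification.

# Minimal sketch: count TalkSet list entries per leading category tag.
# The field layout beyond the first tag is an assumption based on the
# label excerpts shown in the comments section, e.g.:
#   TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
from collections import Counter

def summarize_list_file(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:                  # skip blank lines
                counts[fields[0]] += 1  # first field is the category tag
    return counts

# Example (hypothetical path): print(summarize_list_file("TalkSet/lists_out/TAudio.txt"))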

We also provide our TalkNet model pretrained on TalkSet. You can evaluate it on the Columbia ASD dataset or on other raw videos in the wild.

Usage

A model pretrained on TalkSet will be downloaded into TalkNet_ASD/pretrain_TalkSet.model when you run the following script:

python demoTalkNet.py --evalCol --colSavePath colDataPath

The Columbia ASD dataset and its labels will also be downloaded into colDataPath. Finally, you will get the following F1 results:

Name  Bell   Boll   Lieb   Long   Sick   Avg.
F1    98.1   88.8   98.7   98.0   97.7   96.3

(This result differs from the one in our paper because we retrained the model; the average F1 is very similar.)
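
For reference, the F1 values in the table follow the standard per-frame binary F1 definition; here is a minimal sketch of that computation (the arrays are hypothetical placeholders, not the repository's evaluation script):

# Minimal sketch of a binary F1 computation over per-frame ASD decisions
# (1 = speaking, 0 = not speaking). Illustrative only; not the official
# evaluation code of this repository.
from sklearn.metrics import f1_score

labels      = [1, 1, 0, 0, 1, 0, 1]   # hypothetical ground truth
predictions = [1, 0, 0, 0, 1, 0, 1]   # hypothetical model decisions
print("F1: %.1f" % (100 * f1_score(labels, predictions)))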


An ASD Demo with pretrained TalkNet model

Data preparation

We provide an end-to-end script to detect and extract the active speaker from a raw video using our model pretrained on TalkSet.

You can put a raw video (.mp4 and .avi are both fine), such as 001.mp4, into the demo folder.

Usage

python demoTalkNet.py --videoName 001

A model pretrained on TalkSet will be downloaded into TalkNet_ASD/pretrain_TalkSet.model. The structure of the output results can be found here.

You will get the output video demo/001/pyavi/video_out.avi, in which the active speaker is marked with a green box and non-active speakers with red boxes.
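
The marking logic amounts to choosing a box color from the per-frame score; below is a minimal OpenCV sketch of the idea (the function name, box format, and the 0 threshold are assumptions based on the demo's described behavior, not the demo code itself):

# Minimal sketch: mark a face green when the ASD score says "speaking",
# red otherwise. The threshold of 0 and all names here are assumptions.
import cv2

def mark_face(frame, box, score):
    x1, y1, x2, y2 = box                                 # face bounding box
    color = (0, 255, 0) if score > 0 else (0, 0, 255)    # BGR: green / red
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, "%.2f" % score, (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return frame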


Citation

Please cite the following if our paper or code is helpful to your research.

@inproceedings{tao2021TalkNet,
  title={Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection},
  author={Tao, Ruijie and Pan, Zexu and Das, Rohan Kumar and Qian, Xinyuan and Shou, Mike Zheng and Li, Haizhou},
  booktitle={Proceedings of the ACM International Conference on Multimedia (MM)},
  year={2021}
}

I have summarized some potential FAQs. This is my first open-source work; please let me know how I can further improve this repository. Thanks for your support!

Comments
  • How to modify the face detection and tracking algorithm for cartoon faces?

    I want to apply this repo to a cartoon video, but it cannot detect cartoon faces. What could be a solution? Should I retrain with new data (in AVA format), or just modify the face detection algorithm?

    opened by saumyaborwankar 17
  • Audio input size

    Thanks for the contribution. From your paper, the input image size is 128x128 and the MFCC dimension is 13. What is the size of the MFCC in the temporal direction? I am asking because your code inputs a different MFCC size at each iteration. How does the network manage a different audio input at each iteration? I expected a uniform input size for the audio network.

    opened by Falmi 16
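
    Regarding the variable MFCC length above: with common MFCC settings (13 cepstra, 10 ms hop), the number of temporal frames scales with the clip duration, roughly 100 frames per second, so clips of different lengths naturally yield different audio input sizes. A minimal sketch, assuming 16 kHz audio and the python_speech_features package:

    # Minimal sketch: MFCC temporal length grows with clip duration.
    # Assumes 16 kHz audio and python_speech_features defaults
    # (25 ms window, 10 ms step), i.e. roughly 100 frames per second.
    import numpy as np
    from python_speech_features import mfcc

    sample_rate = 16000
    for seconds in (1.0, 2.5, 4.0):
        signal = np.random.randn(int(sample_rate * seconds))  # placeholder audio
        feats = mfcc(signal, samplerate=sample_rate, numcep=13)
        print(seconds, feats.shape)  # temporal frames scale with duration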
  • About downloading the pretrained model

    Hello, after running the command as you described, the code reports the following error:

    (pytorch) D:\project_code_home\V-A\TalkNet_ASD>python trainTalkNet.py --dataPathAVA AVA --evaluation
    Traceback (most recent call last):
      File "trainTalkNet.py", line 93, in <module>
        main()
      File "trainTalkNet.py", line 46, in main
        visualPath=os.path.join(args.visualPathAVA, 'train'), **vars(args))
      File "D:\project_code_home\V-A\TalkNet_ASD\dataLoader.py", line 95, in __init__
        mixLst = open(trialFileName).read().splitlines()
    FileNotFoundError: [Errno 2] No such file or directory: 'AVA\csv\train_loader.csv'

    I have tried both relative and absolute paths for dataPath, but it still fails and nothing downloads. I am running on Windows; downloading the pretrained model should not depend on whether my machine has a GPU, right? Also, my network can reach Google in the browser, and downloading files from GitHub works.

    Looking forward to your reply, thank you!

    opened by Ma0110 9
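
    For readers hitting the same FileNotFoundError: the message means the preprocessing outputs do not exist at the path the loader builds from --dataPathAVA. A small debugging sketch (standard library only) to check what the loader expects:

    # Debugging sketch: verify that the CSV the loader expects actually
    # exists. "AVA" mirrors the --dataPathAVA value from the issue above.
    import os

    data_path = "AVA"
    expected = os.path.join(data_path, "csv", "train_loader.csv")
    print(expected, "exists:", os.path.exists(expected))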
  • What causes the gap in the TalkSet labels?

    Thank you for your work and the very detailed explanation! I have downloaded VoxCeleb2 & LRS3, used your generate_TalkSet.py code, and successfully generated TalkSet. But there is a difference between the labels in the TAudio.txt I generated and the same file in lists_out. Does this gap have a big impact on training an 'in the wild' model?

    My labels:
    TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.05 0 5.05 0 0
    TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.63 0 5.63 0 0
    TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
    TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 5.95 0 5.95 0 0

    Your labels:
    TAudio id06358/1Dy3Ro1Qqbo/00003 id06358/1Dy3Ro1Qqbo/00003 5.12 0 5.12 0 0
    TAudio id00903/k1UwIOqNxwc/00408 id00903/k1UwIOqNxwc/00408 5.69 0 5.69 0 0
    TAudio id02728/0cy01XwS_WA/00004 id02728/0cy01XwS_WA/00004 5.37 0 5.37 0 0
    TAudio id01071/m7V7Og5SU1Q/00020 id01071/m7V7Og5SU1Q/00020 6.01 0 6.01 0 0

    opened by gd2016229035 7
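
    One way to quantify such a gap is to compare the numeric fields of the two list files line by line; here is a hedged sketch, assuming both files use the whitespace-separated format shown above and are aligned line-for-line:

    # Sketch: report the largest absolute difference between numeric fields
    # of two aligned TalkSet list files; non-numeric fields (tags, clip ids)
    # are skipped. Assumes the whitespace-separated format quoted above.
    def max_label_gap(path_a, path_b):
        gap = 0.0
        with open(path_a) as fa, open(path_b) as fb:
            for line_a, line_b in zip(fa, fb):
                for tok_a, tok_b in zip(line_a.split(), line_b.split()):
                    try:
                        gap = max(gap, abs(float(tok_a) - float(tok_b)))
                    except ValueError:
                        pass  # skip tags and clip identifiers
        return gap

    # Example (hypothetical paths): print(max_label_gap("my/TAudio.txt", "lists_out/TAudio.txt"))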
  • Getting NaN values for prediction

    Hello everyone,

    Thank you so much for sharing the code; it has been very helpful. I was using the pretrained model on a video, and I can tell that it smoothly detects all faces, but it highlights them all in red (not speaking) and does not show a number for the prediction; it simply shows NaN. Do you know what could be wrong?

    Image attached: https://i.imgur.com/ssgjjP0.png

    opened by AdhamKhalifa 6
  • Question

    File "trainTalkNet.py", line 86, in <module>
      main()
    File "trainTalkNet.py", line 35, in main
      loader = train_loader(trialFileName = args.trainTrialAVA,
    File "D:\python project\TalkNet_ASD-main\dataLoader.py", line 97, in __init__
      sortedMixLst = sorted(mixLst, key=lambda data: (int(float(data.split('\t')[1])), int(float(data.split('\t')[-1]))),reverse=True)
    File "D:\python project\TalkNet_ASD-main\dataLoader.py", line 97, in <lambda>
      sortedMixLst = sorted(mixLst, key=lambda data: (int(float(data.split('\t')[1])), int(float(data.split('\t')[-1]))),reverse=True)
    ValueError: could not convert string to float: '20467",

    What causes this? The data format is correct.

    opened by songhaozhen 6
  • How can I train the network with TalkSet?

    Hi, thanks a lot for sharing your code :) I must have missed something: how can I train the network using TalkSet? Is it the same procedure?

    Thanks!

    opened by bugdary 5
  • ASD confidence/score

    Hi there,

    Thank you for releasing your code and models, very impressive results! :)

    I am evaluating your TalkNet demo model on a new dataset. I now need to draw a precision-recall curve, so I was expecting something like a per-frame ASD confidence in the range [0, 1], so that I can gradually vary the confidence threshold from 0% to 100% and compute a pair of precision and recall rates at each step. Instead, TalkNet provides per-frame scores that range roughly from -2 or -3 to +2 or +3, without clear max or min boundaries, where positive values are active frames and negative values are silent frames.

    From my understanding, the last FC layer is followed by a softmax operation, so shouldn't the output be expressed as a [0, 1] confidence? Is there a way to convert the output score into a confidence? I was thinking of simply applying a sigmoid function to the output score, but perhaps I am missing something.

    Thank you again for your work, looking forward to hearing back from you! Davide

    opened by dberghi 4
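
    On the question above: since a sigmoid is monotone, squashing the raw score into [0, 1] preserves the score ordering, which is all a precision-recall sweep needs. A minimal sketch (whether this exactly recovers the model's softmax probability is not asserted here):

    # Minimal sketch: map an unbounded ASD score to a [0, 1] confidence with
    # a sigmoid for precision-recall sweeps. The mapping is monotone, so the
    # PR curve is the same as thresholding the raw score directly.
    import numpy as np

    def score_to_confidence(score):
        return 1.0 / (1.0 + np.exp(-score))

    for s in (-3.0, 0.0, 3.0):
        print(s, "->", round(score_to_confidence(s), 3))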
  • Request for clarification on evaluation code

    Hi Tao, could you explain why you added only "Speaking_Audible" to the prediction CSV file after prediction? https://github.com/TaoRuijie/TalkNet-ASD/blob/85b37afc32bedcadd8b5dfbc7f69298a76de279b/talkNet.py#L66 I would be glad if you could explain the evaluation logic you used in your code. Thanks!

    opened by Falmi 4
  • Could you please explain the detailed meaning of the csv file?

    In test_loader.csv I see many arrays that look like [1,1,1,1,1,1,1,1,1], each followed by a big number like 21231.

    And there are a lot of ids, such as video id, instance id, entity id, and label id.

    I guess the video id is the id that associates the face clips with each related video, and entity_box_x,y are the face crop coordinates, but what are the meanings of the other ids? Could you please give more details?

    Really, thanks!

    opened by xiejiachen 4
  • About the test set

    When I evaluate on the test set, the result is always mAP=100%, while evaluating on the val set works fine. Has the author encountered this problem? I also noticed that in the test_orig.csv file all the labels are SPEAKING_AUDIBLE; I wonder whether this is related.

    Below is the output produced during the run:

    WARNING:root:frame length (556) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
    (the warning above repeats many times throughout the run)
    46%|███████▏ | 9814/21361 [32:11<09:59, 19.25it/s]
    46%|███████▏ | 9819/21361 [32:12<17:17, 11.12it/s]
    46%|███████▎ | 9834/21361 [32:14<30:59, 6.20it/s]
    100%|████████| 21361/21361 [1:07:15<00:00, 5.29it/s]
    mAP 100.00%

    opened by Overcautious 4