Code for the Active Speakers in Context Paper (CVPR2020)

Overview

Active Speakers in Context

This repo contains the official code and models for the "Active Speakers in Context" CVPR 2020 paper.

Before Training

The code relies on multiple external libraries; run ./scripts/dev_env.sh to recreate the suggested environment.

This code works on face crops and their corresponding audio tracks, so before you start training you need to preprocess the videos in the AVA dataset. We provide 3 utility files that contain the basic data to support this process; download them using ./scripts/dowloads.sh.

  1. Extract the audio tracks from every video in the dataset. Go to ./data/extract_audio_tracks.py and, in main, adapt ava_video_dir (the directory with the original AVA videos) and target_audios (an empty directory where the audio tracks will be stored) to your local file system. The code relies on 16 kHz .wav files and will fail with other formats and bit rates (a sketch of this step appears after this list).
  2. Slice the audio tracks by timestamp. Go to ./data/slice_audio_tracks.py and, in main, adapt ava_audio_dir (the directory with the audio tracks you extracted in step 1), output_dir (an empty directory where you will store the sliced audio files) and csv (the utility file you downloaded previously; use the train/val/test set accordingly) to your local file system.
  3. Extract the face crops by timestamp. Go to ./data/extract_face_crops_time.py and, in main, adapt ava_video_dir (the directory with the original AVA videos), csv_file (the utility file you downloaded previously; use the train/val/test set accordingly) and output_dir (an empty directory where you will store the face crops) to your local file system. This process will generate about 124 GB of additional data.

The full audio tracks obtained in step 1 will not be used after this preprocessing.
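
Step 1 boils down to transcoding every video's audio stream into a 16 kHz mono .wav file. A minimal sketch of that step using ffmpeg through subprocess (the paths and the loop are illustrative assumptions, not the repo's exact code):

    import os
    import subprocess

    # Hypothetical paths -- adapt them to your local file system.
    ava_video_dir = '/path/to/ava/videos'
    target_audios = '/path/to/empty/audio/dir'

    for video in os.listdir(ava_video_dir):
        name, _ = os.path.splitext(video)
        out_wav = os.path.join(target_audios, name + '.wav')
        # -vn drops the video stream, -ac 1 makes it mono,
        # -ar 16000 sets the 16 kHz sample rate, -y overwrites quietly.
        subprocess.check_call(['ffmpeg', '-y', '-i',
                               os.path.join(ava_video_dir, video),
                               '-vn', '-ac', '1', '-ar', '16000', out_wav])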

Training

Training the ASC is divided into two major stages: the optimization of the Short-Term Encoder (similar to the Google baseline) and the optimization of the Context Ensemble Network. The second stage includes the pairwise refinement and the temporal refinement, and relies on a full forward pass of the Short-Term Encoder over the training and validation sets.

Training the Short-Term Encoder

Go to ./core/config.py and modify the STE_inputs dictionary so that the keys audio_dir, video_dir and models_out point to the audio clips, the face crops (those extracted in ‘Before Training’) and an empty directory where the STE models will be saved.
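
For reference, the edited entries might look as follows (the key names come from the paragraph above; the paths are placeholders for your local setup, and any other keys in the dictionary stay as they are):

    # ./core/config.py -- illustrative values only.
    STE_inputs = {
        'audio_dir': '/path/to/sliced/audio/clips',
        'video_dir': '/path/to/face/crops',
        'models_out': '/path/to/empty/ste_models_dir',
    }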

Execute the script STE_train.py clip_length cuda_device_number. We used clip_length=11 in the paper, but it can be set to any odd value greater than 0 (performance will vary!).
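
The odd-length constraint lets each clip be centered on a target frame. A minimal sketch of that indexing (a hypothetical helper, not the repo's loader):

    def clip_indices(center_frame, clip_length):
        """Frame indices of a clip centered on center_frame.

        clip_length must be odd so the target frame sits exactly in the middle.
        """
        half = clip_length // 2
        return list(range(center_frame - half, center_frame + half + 1))

    # clip_indices(100, 11) -> [95, 96, ..., 105], with frame 100 in the middle.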

Forward Short Term Encoder

The Active Speaker Context relies on the features extracted from the STE for its optimization. Execute the script python STE_forward.py clip_length cuda_device_number, using the same clip_length as in training. Check lines 44 and 45 to switch between the lists of training and validation videos; you will need both subsets for the next step.

If you want to evaluate on the AVA Active Speaker dataset, use ./STE_postprocessing.py; check lines 44 to 50 and adjust the files to your local file system.

Training the ASC Module

Once all the STE features have been calculated, go to ./core/config.py and modify the ASC_inputs dictionary: the keys features_train_full, features_val_full and models_out must point to the local directories where the features extracted with the STE on the train and val sets are stored, and to an empty directory where the ASC models will be saved. Then execute ./ASC_train.py clip_length skip_frames speakers cuda_device_number. clip_length must be the same clip size used to train the STE; skip_frames determines the number of frames between sampled clips (we used 4 for the results presented in the paper); speakers is the number of candidate speakers in the context.
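
Again for reference, the edited entries and a training call might look like this (the paths are placeholders, and the speaker count of 3 is only an example, not necessarily the paper's setting):

    # ./core/config.py -- illustrative values only.
    ASC_inputs = {
        'features_train_full': '/path/to/ste_features/train',
        'features_val_full': '/path/to/ste_features/val',
        'models_out': '/path/to/empty/asc_models_dir',
    }

    # Invocation matching clip_length=11 and skip_frames=4 from the paper,
    # with 3 candidate speakers on GPU 0:
    #   python ASC_train.py 11 4 3 0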

Forward ASC

Use ./ASC_forward.py clips time_stride speakers cuda_device_number to forward the models produced in the last step, using the same clip and stride configuration as in training. You will get one csv file for every video; for evaluation purposes, use the script ASC_predcition_postprocessing.py to generate a single CSV file compatible with the evaluation tool (a sketch of this merge appears below). Check lines 54 to 59 and adapt the paths to your local configuration.
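
The merge is essentially a concatenation of the per-video prediction files. A minimal sketch with pandas (the file layout and column handling are assumptions, not the script's exact code):

    import glob
    import pandas as pd

    # Hypothetical locations -- adapt them to your setup.
    per_video_csvs = glob.glob('/path/to/asc_forward_output/*.csv')

    # Stack every per-video prediction file into one CSV for the
    # official evaluation tool.
    merged = pd.concat((pd.read_csv(f, header=None) for f in per_video_csvs),
                       ignore_index=True)
    merged.to_csv('/path/to/ASCPredictions.csv', header=False, index=False)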

If you want to evaluate on the AVA Active Speaker dataset, use ./ASC_predcition_postprocessing.py; check lines 54 to 59 and adjust the files to your local file system.

Pre-Trained Models

Short Term Encoder

Active Speaker Context

Prediction Postprocessing and Evaluation

The prediction format follows the very same format of the AVA Active Speaker dataset, but contains an extra value for the active speaker class in the final column. The script ./STE_postprocessing.py handles this step. Check lines 44, 45 and 46 and set the directory where you saved the output of the forward pass (44), the directory with the original AVA csv files (45) and an empty temporary directory (46). Additionally, set the outputs of the script on lines 48 and 49: one is the final prediction file formatted for the official evaluation tool, and the other is a utility file to use along with the same tool. Notice you can do some temporal smoothing in the function softmax_feats; it is a simple median filter, and you can choose the window size on lines 35 and 36.
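
The smoothing amounts to running a median filter over each face track's per-frame scores. A minimal sketch with scipy (the dummy data and window size are illustrative, not the script's exact code):

    import numpy as np
    from scipy.signal import medfilt

    # Per-frame speaking scores for one face track (dummy data).
    scores = np.array([0.1, 0.9, 0.2, 0.85, 0.9, 0.1, 0.95])

    # Median-filter with an odd window; larger windows smooth more
    # aggressively but can wash out short speech segments.
    smoothed = medfilt(scores, kernel_size=3)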

Comments
  • For mAp evaluation issue

    Thanks for your sharing!

    I searched the ActivityNet official code and found the file get_ava_active_speaker_performance.py at https://github.com/activitynet/ActivityNet/blob/master/Evaluation/get_ava_active_speaker_performance.py, which appears to do the evaluation for the AVA dataset. In your code you are using the sklearn metric to get the mAP results, right? So may I know whether you used this script for evaluation during the contest?

    If so, could you share the format of the result csv file? My format looks like this:

    video_id,frame_timestamp,entity_box_x1,entity_box_y1,entity_box_x2,entity_box_y2,label,entity_id,score
    HV0H6oc4Kvs,960.0,0.6042350000000001,0.150289,0.701954,0.349711,NOT_SPEAKING,HV0H6oc4Kvs_0960_1020:1,0.11589885

    and the ground truth csv file is "ava_activespeaker_val_augmented.csv" file, which you have shared.

    Thanks for your time !

    opened by TaoRuijie 7
  • ASC_forward.py is missing

    I am training a model from scratch and it is nearly done; however, I checked and ASC_forward.py appears to be a copy of the README. Do you mind updating it? I am very excited about using your work in an HRI research context, but I need some help with reproducing your results. Thanks for your great work!

    opened by chrismbirmingham 5
  • Model Inputs

    Hi, I am attempting to set up the model for real-world use and struggling to reverse-engineer your dataset loader. From what I understand, the model should take 11 RGB face crops (144x144) as video_data (which also need to be normalized), but how do you properly generate the audio input? I understand you are working with Mel-frequency cepstral coefficients, but can you give me some insight into why 11 frames' worth of audio would match an input shape of 13x40?

    opened by chrismbirmingham 3
  • Unable to slice audio files

    Hi, when I try to slice the audio files following step 2 of 'Before Training', I find that the output directory stays empty, i.e., the audio slices are not extracted at all. Can you please check this on your side?

    opened by godatta 2
  • What does instance_id mean?

    Thank you for sharing your code.

    What does instance_id in the csv file mean (e.g., vL7N_xRJKJU_1440_1500:118:0:0)? I need this information to train on my own data.

    Thanks, and looking forward to your reply.

    opened by ArialChan 2
  • test_data is empty

    I downloaded the files according to ./scripts/download.sh, but 'wget https://filedn.com/l0kNCNuXuEq70c3iUHsXxJ7/active-speakers-context/ava_activespeaker_test_augment' fails because the file does not exist. Can you share the test data?

    opened by HHHHWB 2
  • Training Time

    Hi, thanks for the open-source code. I was wondering how long it took to train the ASC model? Also, can you please upload the model weights if possible? Thanks!

    opened by rahulranjan29 2
  • STE_postprocessing.py files

    Hi, Thanks for this very clean and easy-to-use repo!

    I was wondering, in STE_postprocessing.py on lines 49 and 50, where do "STE.csv" and "gt.csv" come from?

    I have done the forward pass of the STE, and the predictions are stored in "forward_dir" (line 44).

    Thanks for your help!

    opened by Andrew-Brown1 1
  • Capturing audio segment from clip

    Hi, to capture the audio part of the clip, you subtract audio_offset when computing audio_start and audio_end at L103. Is this the correct way of capturing the audio segment? I believe we should not subtract audio_offset.

    opened by okankop 1
  • Postprocessing of the labels

    Hi, thanks for this fantastic work! Currently, I'm trying to replicate your results and build my own model. When I looked into the way you're dealing with the data, I found two functions in core/dataset.py called _postprocess_speech_label and _post_process, which seem to transform SPEAKING_NOT_AUDIBLE into NOT_SPEAKING. As far as I can understand, this changes the original 3-category classification task into a 2-category classification during training. Will that influence the results, and does it conform to the official guide? Maybe I'm misunderstanding something; please correct me if so. Thanks!

    opened by victorywys 1
  • When will the code be published?

    I found this link in the paper "Active Speakers in Context", but the project is empty. So I would like to know when the code will be published? Thanks!

    opened by Claireikx 1
  • Code for generating face tracking csv file

    Hi,

    Thanks a lot for sharing your model and your code. Your paper is great!

    I intend to perform active speaker detection on a bunch of videos. I just need to apply your pre-trained models, not to train the model.

    My understanding is that, in order to do active speaker detection on a video, I first have to perform face tracking on the video and generate a csv file with tracking data similar to what is provided with the AVA dataset. I did not find any face tracking script in the code. Could you provide this script (or tell me where it is in the code if I missed it)? More generally, could you provide all the missing scripts, or give guidance and pointers for performing active speaker detection on any video?

    Thanks in advance.

    opened by itanghiu 0
  • Noticed bug in STE data augmentation (core/dataset.py#L158-L162)

    In core/dataset.py#L158-L162

    # random crop
    width, height = video_data[0].size
    f = random.uniform(0.5, 1)
    i, j, h, w = RandomCrop.get_params(video_data[0], output_size=(int(height*f), int(width*f)))
    video_data = [s.crop(box=(j, i, w, h)) for s in video_data]
    

    [Source]

    You pass the arguments (left, upper, width, height) into Image.crop() when it should be (left, upper, right, lower). The result is that the training crops are smaller than intended.

    PyTorch's implementation is the following

    def crop(img: Image.Image, top: int, left: int, height: int, width: int) -> Image.Image:
        if not _is_pil_image(img):
            raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
    
        return img.crop((left, top, left + width, top + height))
    

    [Source]

    with (top, left, height, width) being the output of RandomCrop.get_params(). I would recommend using the following to avoid argument-conversion mistakes.

    # random crop
    width, height = video_data[0].size
    f = random.uniform(0.5, 1)
    crop_module = RandomCrop(size=(int(height*f), int(width*f)))
    video_data = [crop_module.forward(img) for img in video_data]
    

    Visualization (first existing code, then fixed code; images omitted):

    Note: the exact position of the crop should be ignored; only the image size is relevant. Also notice how the old pictures are generally not square crops.

    f == 0.98: negligible difference.

    f == 0.52: significant difference.

    I haven't run your code with this fix, so I don't know how much the results would improve (if at all).

    opened by btamm12 1
  • ASCPredictions.csv for 87.1 mAP

    Hey,

    Thanks again for great and helpful repo - would you be able share the ASCPredictions.csv file that you used to get the mAP scores of 87.1? I would like to analyse some of the results please.

    opened by Andrew-Brown1 0
  • torch dataloader issue

    I am getting the following error while running STE_train.py:

        File "STE_train.py", line 73, in <module>
            shuffle=True, num_workers=1)
        File "/home/zain/anaconda3/envs/asc_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
            sampler = RandomSampler(dataset)
        File "/home/zain/anaconda3/envs/asc_env/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 94, in __init__
            "value, but got num_samples={}".format(self.num_samples))
        ValueError: num_samples should be a positive integer value, but got num_samples=0

    Is it a PyTorch version issue? If so, which PyTorch version did you use in your experiments?

    opened by javaria-qadeer 1
  • Inference on my own data

    Hi, Thank you for sharing your code and models!

    I need to use your code and models on my own video data for other tasks. My videos are movie data; I want to know how to prepare the utility csv file ava_activespeaker_val_augmented.csv that serves as the model's input.

    Thanks, and looking forward to your reply!

    opened by JinmingZhao 1