Code for the Active Speakers in Context Paper (CVPR2020)

Overview

Active Speakers in Context

This repo contains the official code and models for the "Active Speakers in Context" CVPR 2020 paper.

Before Training

The code relies on multiple external libraries; run ./scripts/dev_env.sh to recreate the suggested environment.

This code works on face crops and their corresponding audio tracks, so before you start training you need to preprocess the videos in the AVA dataset. We provide 3 utility files that contain the basic data to support this process; download them using ./scripts/dowloads.sh.

  1. Extract the audio tracks from every video in the dataset. Go to ./data/extract_audio_tracks.py and, in main, adapt ava_video_dir (the directory with the original AVA videos) and target_audios (an empty directory where the audio tracks will be stored) to your local file system. The code relies on 16 kHz .wav files and will fail with other formats and bit rates (a sketch of this step appears after this list).
  2. Slice the audio tracks by timestamp. Go to ./data/slice_audio_tracks.py and, in main, adapt ava_audio_dir (the directory with the audio tracks you extracted in step 1), output_dir (an empty directory where you will store the sliced audio files) and csv (the utility file you downloaded previously; use the train/val/test set accordingly) to your local file system.
  3. Extract the face crops by timestamp. Go to ./data/extract_face_crops_time.py and, in main, adapt ava_video_dir (the directory with the original AVA videos), csv_file (the utility file you downloaded previously; use the train/val/test set accordingly) and output_dir (an empty directory where you will store the face crops) to your local file system. This process will generate about 124 GB of additional data.

The full audio tracks obtained in step 1 will not be used after this preprocessing.
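
Step 1 boils down to transcoding every video's audio stream into a 16 kHz mono .wav file. A minimal sketch of that step using ffmpeg through subprocess (the paths and the loop are illustrative assumptions, not the repo's exact code):

    import os
    import subprocess

    # Hypothetical paths -- adapt them to your local file system.
    ava_video_dir = '/path/to/ava/videos'
    target_audios = '/path/to/empty/audio/dir'

    for video in os.listdir(ava_video_dir):
        name, _ = os.path.splitext(video)
        out_wav = os.path.join(target_audios, name + '.wav')
        # -vn drops the video stream, -ac 1 makes it mono,
        # -ar 16000 sets the 16 kHz sample rate, -y overwrites quietly.
        subprocess.check_call(['ffmpeg', '-y', '-i',
                               os.path.join(ava_video_dir, video),
                               '-vn', '-ac', '1', '-ar', '16000', out_wav])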

Training

Training the ASC is divided into two major stages: the optimization of the Short-Term Encoder (similar to the Google baseline) and the optimization of the Context Ensemble Network. The second stage includes the pairwise refinement and the temporal refinement, and relies on a full forward pass of the Short-Term Encoder over the training and validation sets.

Training the Short-Term Encoder

Go to ./core/config.py and modify the STE_inputs dictionary so that the keys audio_dir, video_dir and models_out point to the audio clips, the face crops (those extracted in ‘Before Training’) and an empty directory where the STE models will be saved.
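
For reference, the edited entries might look as follows (the key names come from the paragraph above; the paths are placeholders for your local setup, and any other keys in the dictionary stay as they are):

    # ./core/config.py -- illustrative values only.
    STE_inputs = {
        'audio_dir': '/path/to/sliced/audio/clips',
        'video_dir': '/path/to/face/crops',
        'models_out': '/path/to/empty/ste_models_dir',
    }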

Execute the script STE_train.py clip_length cuda_device_number. We used clip_length=11 in the paper, but it can be set to any odd value greater than 0 (performance will vary!).
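
The odd-length constraint lets each clip be centered on a target frame. A minimal sketch of that indexing (a hypothetical helper, not the repo's loader):

    def clip_indices(center_frame, clip_length):
        """Frame indices of a clip centered on center_frame.

        clip_length must be odd so the target frame sits exactly in the middle.
        """
        half = clip_length // 2
        return list(range(center_frame - half, center_frame + half + 1))

    # clip_indices(100, 11) -> [95, 96, ..., 105], with frame 100 in the middle.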

Forward Short Term Encoder

The Active Speaker Context relies on the features extracted from the STE for its optimization. Execute the script python STE_forward.py clip_length cuda_device_number, using the same clip_length as in training. Check lines 44 and 45 to switch between the lists of training and validation videos; you will need both subsets for the next step.

If you want to evaluate on the AVA Active Speaker dataset, use ./STE_postprocessing.py; check lines 44 to 50 and adjust the files to your local file system.

Training the ASC Module

Once all the STE features have been calculated, go to ./core/config.py and modify the ASC_inputs dictionary: the keys features_train_full, features_val_full and models_out must point to the local directories where the features extracted with the STE on the train and val sets are stored, and to an empty directory where the ASC models will be saved. Then execute ./ASC_train.py clip_length skip_frames speakers cuda_device_number. clip_length must be the same clip size used to train the STE; skip_frames determines the number of frames between sampled clips (we used 4 for the results presented in the paper); speakers is the number of candidate speakers in the context.
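
Again for reference, the edited entries and a training call might look like this (the paths are placeholders, and the speaker count of 3 is only an example, not necessarily the paper's setting):

    # ./core/config.py -- illustrative values only.
    ASC_inputs = {
        'features_train_full': '/path/to/ste_features/train',
        'features_val_full': '/path/to/ste_features/val',
        'models_out': '/path/to/empty/asc_models_dir',
    }

    # Invocation matching clip_length=11 and skip_frames=4 from the paper,
    # with 3 candidate speakers on GPU 0:
    #   python ASC_train.py 11 4 3 0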

Forward ASC

Use ./ASC_forward.py clips time_stride speakers cuda_device_number to forward the models produced in the last step, using the same clip and stride configuration as in training. You will get one csv file for every video; for evaluation purposes, use the script ASC_predcition_postprocessing.py to generate a single CSV file compatible with the evaluation tool (a sketch of this merge appears below). Check lines 54 to 59 and adapt the paths to your local configuration.
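
The merge is essentially a concatenation of the per-video prediction files. A minimal sketch with pandas (the file layout and column handling are assumptions, not the script's exact code):

    import glob
    import pandas as pd

    # Hypothetical locations -- adapt them to your setup.
    per_video_csvs = glob.glob('/path/to/asc_forward_output/*.csv')

    # Stack every per-video prediction file into one CSV for the
    # official evaluation tool.
    merged = pd.concat((pd.read_csv(f, header=None) for f in per_video_csvs),
                       ignore_index=True)
    merged.to_csv('/path/to/ASCPredictions.csv', header=False, index=False)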

If you want to evaluate on the AVA Active Speaker dataset, use ./ASC_predcition_postprocessing.py; check lines 54 to 59 and adjust the files to your local file system.

Pre-Trained Models

Short Term Encoder

Active Speaker Context

Prediction Postprocessing and Evaluation

The prediction format follows the very same format of the AVA Active Speaker dataset, but contains an extra value for the active speaker class in the final column. The script ./STE_postprocessing.py handles this step. Check lines 44, 45 and 46 and set the directory where you saved the output of the forward pass (44), the directory with the original AVA csv files (45) and an empty temporary directory (46). Additionally, set the outputs of the script on lines 48 and 49: one is the final prediction file formatted for the official evaluation tool, and the other is a utility file to use along with the same tool. Notice you can do some temporal smoothing in the function softmax_feats; it is a simple median filter, and you can choose the window size on lines 35 and 36.
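
The smoothing amounts to running a median filter over each face track's per-frame scores. A minimal sketch with scipy (the dummy data and window size are illustrative, not the script's exact code):

    import numpy as np
    from scipy.signal import medfilt

    # Per-frame speaking scores for one face track (dummy data).
    scores = np.array([0.1, 0.9, 0.2, 0.85, 0.9, 0.1, 0.95])

    # Median-filter with an odd window; larger windows smooth more
    # aggressively but can wash out short speech segments.
    smoothed = medfilt(scores, kernel_size=3)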

Comments
  • For mAp evaluation issue

    Thanks for your sharing!

    I searched the ActivityNet official code and found the file get_ava_active_speaker_performance.py at https://github.com/activitynet/ActivityNet/blob/master/Evaluation/get_ava_active_speaker_performance.py, which appears to do the evaluation for the AVA dataset. In your code you are using the sklearn metric to get the mAP results, right? So may I know whether you used this script for evaluation during the contest?

    If so, could you share the format of the result csv file? My format looks like this:

    video_id,frame_timestamp,entity_box_x1,entity_box_y1,entity_box_x2,entity_box_y2,label,entity_id,score
    HV0H6oc4Kvs,960.0,0.6042350000000001,0.150289,0.701954,0.349711,NOT_SPEAKING,HV0H6oc4Kvs_0960_1020:1,0.11589885

    and the ground truth csv file is "ava_activespeaker_val_augmented.csv" file, which you have shared.

    Thanks for your time !

    opened by TaoRuijie 7
  • ASC_forward.py is missing

    I am training a model from scratch and it is nearly done; however, I checked and ASC_forward.py appears to be a copy of the README. Do you mind updating it? I am very excited about using your work in an HRI research context, but I need some help with reproducing your results. Thanks for your great work!

    opened by chrismbirmingham 5
  • Model Inputs

    Hi, I am attempting to set up the model for real-world use and struggling to reverse-engineer your dataset loader. From what I understand, the model should take 11 RGB face crops (144x144) as video_data (which also need to be normalized), but how do you properly generate the audio input? I understand you are working with Mel-frequency cepstral coefficients, but can you give me some insight into why 11 frames' worth of audio would match an input shape of 13x40?

    opened by chrismbirmingham 3
  • Unable to slice audio files

    Hi, when I try to slice the audio files following step 2 of 'Before Training', I find that the output directory stays empty, i.e., the audio slices are not extracted at all. Can you please check this on your side?

    opened by godatta 2
  • What does instance_id mean?

    Thank you for sharing your code.

    What does instance_id in the csv file mean (e.g., vL7N_xRJKJU_1440_1500:118:0:0)? I need this information to train on my own data.

    Thanks, and looking forward to your reply.

    opened by ArialChan 2
  • test_data is empty

    I downloaded the files according to ./scripts/download.sh, but 'wget https://filedn.com/l0kNCNuXuEq70c3iUHsXxJ7/active-speakers-context/ava_activespeaker_test_augment' fails because the file does not exist. Can you share the test data?

    opened by HHHHWB 2
  • Training Time

    Hi, thanks for the open-source code. I was wondering how long it took to train the ASC model? Also, can you please upload the model weights if possible? Thanks!

    opened by rahulranjan29 2
  • STE_postprocessing.py files

    Hi, Thanks for this very clean and easy-to-use repo!

    I was wondering, in STE_postprocessing.py on lines 49 and 50, where do "STE.csv" and "gt.csv" come from?

    I have done the forward pass of the STE, and the predictions are stored in "forward_dir" (line 44).

    Thanks for your help!

    opened by Andrew-Brown1 1
  • Capturing audio segment from clip

    Hi, to capture the audio part of the clip, you subtract audio_offset when computing audio_start and audio_end at L103. Is this the correct way of capturing the audio segment? I believe we should not subtract audio_offset.

    opened by okankop 1
  • Postprocessing of the labels

    Hi, thanks for this fantastic work! Currently, I'm trying to replicate your results and build my own model. When I looked into the way you're dealing with the data, I found two functions in core/dataset.py called _postprocess_speech_label and _post_process, which seem to transform SPEAKING_NOT_AUDIBLE into NOT_SPEAKING. As far as I can understand, this changes the original 3-category classification task into a 2-category classification during training. Will that influence the results, and does it conform to the official guide? Maybe I'm misunderstanding something; please correct me if so. Thanks!

    opened by victorywys 1
  • When will the code be published?

    I found this link in the paper "Active Speakers in Context", but the project is empty. So I would like to know when the code will be published? Thanks!

    opened by Claireikx 1
  • Code for generating face tracking csv file

    Hi,

    Thanks a lot for sharing your model and your code. Your paper is great!

    I intend to perform active speaker detection on a bunch of videos. I just need to apply your pre-trained models, not to train the model.

    My understanding is that, in order to do active speaker detection on a video, I first have to perform face tracking on the video and generate a csv file with tracking data similar to what is provided with the AVA dataset. I did not find any face tracking script in the code. Could you provide this script (or tell me where it is in the code if I missed it)? More generally, could you provide all the missing scripts, or give guidance and pointers for performing active speaker detection on any video?

    Thanks in advance.

    opened by itanghiu 0
  • Noticed bug in STE data augmentation (core/dataset.py#L158-L162)

    In core/dataset.py#L158-L162

    # random crop
    width, height = video_data[0].size
    f = random.uniform(0.5, 1)
    i, j, h, w = RandomCrop.get_params(video_data[0], output_size=(int(height*f), int(width*f)))
    video_data = [s.crop(box=(j, i, w, h)) for s in video_data]
    

    [Source]

    You pass the arguments (left, upper, width, height) into Image.crop() when it should be (left, upper, right, lower). The result is that the training crops are smaller than intended.

    PyTorch's implementation is the following

    def crop(img: Image.Image, top: int, left: int, height: int, width: int) -> Image.Image:
        if not _is_pil_image(img):
            raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
    
        return img.crop((left, top, left + width, top + height))
    

    [Source]

    with (top, left, height, width) being the output of RandomCrop.get_params(). I would recommend using the following to avoid argument-conversion mistakes.

    # random crop
    width, height = video_data[0].size
    f = random.uniform(0.5, 1)
    crop_module = RandomCrop(size=(int(height*f), int(width*f)))
    video_data = [crop_module.forward(img) for img in video_data]
    

    Visualization (first existing code, then fixed code; images omitted):

    Note: the exact position of the crop should be ignored; only the image size is relevant. Also notice how the old pictures are generally not square crops.

    f == 0.98: negligible difference.

    f == 0.52: significant difference.

    I haven't run your code with this fix, so I don't know how much the results would improve (if at all).

    opened by btamm12 1
  • ASCPredictions.csv for 87.1 mAP

    Hey,

    Thanks again for great and helpful repo - would you be able share the ASCPredictions.csv file that you used to get the mAP scores of 87.1? I would like to analyse some of the results please.

    opened by Andrew-Brown1 0
  • torch dataloader issue

    I am getting the following error while running STE_train.py:

        File "STE_train.py", line 73, in <module>
            shuffle=True, num_workers=1)
        File "/home/zain/anaconda3/envs/asc_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
            sampler = RandomSampler(dataset)
        File "/home/zain/anaconda3/envs/asc_env/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 94, in __init__
            "value, but got num_samples={}".format(self.num_samples))
        ValueError: num_samples should be a positive integer value, but got num_samples=0

    Is it a PyTorch version issue? If so, which PyTorch version did you use in your experiments?

    opened by javaria-qadeer 1
  • Inference on my own data

    Hi, Thank you for sharing your code and models!

    I need to use your code and models on my own video data for other tasks. My videos are movie data; I want to know how to prepare the utility csv file ava_activespeaker_val_augmented.csv that serves as the model's input.

    Thanks, and looking forward to your reply!

    opened by JinmingZhao 1