Localizing Visual Sounds the Hard Way

Honglie Chen

Last update: Dec 7, 2022

Overview

Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

Python 3.6.8
PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, download the test data and ground truth from "Learning to Localize Sound Source".

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note: some ground-truth bounding boxes were updated recently, which causes a 2~3% difference in IoU for all results on VGG-SS.)
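For readers unfamiliar with the metric mentioned in the note, here is a minimal sketch of a pixel-wise IoU between a predicted binary mask and a ground-truth box. It is an illustration only: the helper names `box_to_mask` and `mask_iou` are hypothetical, the box convention is an assumption, and the metric computed by test.py may differ in its details.

```python
def box_to_mask(box, height, width):
    """Rasterize a box into a 0/1 mask stored as a list of rows.

    Assumed convention (not taken from the repo): box is
    (x_min, y_min, x_max, y_max) with the max coordinates exclusive.
    """
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_iou(pred, gt):
    """Intersection over union of two equally sized 0/1 masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks have no union; report 0.0 instead of dividing by zero.
    return inter / union if union else 0.0


# Two 2x2 boxes on a 4x4 grid that overlap in exactly one pixel.
pred_mask = box_to_mask((0, 0, 2, 2), height=4, width=4)
gt_mask = box_to_mask((1, 1, 3, 3), height=4, width=4)
iou = mask_iou(pred_mask, gt_mask)  # 1 shared pixel / 7 covered pixels
```

Because the score depends directly on the ground-truth box, any change to the annotations shifts the IoU of every prediction evaluated against them.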

The test data for both datasets should be placed in the following structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav
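Before running test.py, it can help to confirm that the data path matches the layout above. The following is a minimal sketch, not part of the repo: the function name `summarize_test_dir` is hypothetical, and it assumes frames are `.jpg` files under `frames/` and audio clips are `.wav` files under `audio/`, as the example file names suggest.

```python
from pathlib import Path


def summarize_test_dir(data_path):
    """List the frame and audio files found under the expected layout.

    Assumption (from the example tree, not confirmed by the repo):
    frames are *.jpg in `frames/` and audio clips are *.wav in `audio/`.
    """
    root = Path(data_path)
    frames_dir = root / "frames"
    audio_dir = root / "audio"

    # Fail early with a clear message if either subdirectory is missing.
    missing = [d.name for d in (frames_dir, audio_dir) if not d.is_dir()]
    if missing:
        raise FileNotFoundError(
            f"missing subdirectories under {root}: {missing}"
        )

    frames = sorted(p.name for p in frames_dir.glob("*.jpg"))
    audio = sorted(p.name for p in audio_dir.glob("*.wav"))
    return {"frames": frames, "audio": audio}
```

Calling `summarize_test_dir("path to downloaded data")` returns the sorted file names in each subdirectory, which makes an empty or misplaced folder easy to spot before evaluation starts.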

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen and Weidi Xie and Triantafyllos Afouras and Arsha Nagrani and Andrea Vedaldi and Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}

Comments

Some questions about loss implementation
Hi, thank you for sharing your awesome code.

I'm facing some issues while understanding your code.

https://github.com/hche11/Localizing-Visual-Sounds-the-Hard-Way/blob/509acf8e673332e7db52ca1a7f29d65f19ef7c86/model.py#L54-L79

Referring to the code and your paper, I understand that sim1 in L#69 represents P_i, and that sim2 in L#72 implements the left term of N_i.

It seems that sim in L#71 refers to the right term of N_i. However, I cannot understand why the Pos_all variable is a thresholded version of A0. According to the paper, I thought it should be an all-ones matrix.

Could you clarify where sim belongs in the loss objective?

One more question, please.

Regarding L#74-77, should I scale the logits by the temperature value (0.07)? I am a little confused, as scaling the logits by a temperature is not directly stated in the paper.

It would be very helpful if you can release the code of your loss function. Thank you very much. Have a nice day!
opened by kyuyeonpooh 2
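As background to the temperature question above, here is a minimal sketch of what dividing similarity scores by a temperature does in a generic softmax-based contrastive loss. This is not the repo's loss and does not answer how model.py should be read; the function names are hypothetical, and 0.07 is simply the value quoted in the question.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(similarities, positive_index, temperature=0.07):
    """Cross-entropy of the positive entry after temperature scaling.

    Generic illustration only: each similarity is divided by the
    temperature before the softmax, so a small temperature sharpens
    the distribution around the largest similarity.
    """
    scaled = [s / temperature for s in similarities]
    probs = softmax(scaled)
    return -math.log(probs[positive_index])


# Hypothetical cosine similarities: index 0 is the positive pair.
sims = [0.8, 0.2, 0.1]
loss_sharp = contrastive_loss(sims, positive_index=0, temperature=0.07)
loss_flat = contrastive_loss(sims, positive_index=0, temperature=1.0)
```

With these example values the positive already has the highest similarity, so the low-temperature loss is much smaller than the unscaled one. Cosine similarities lie in [-1, 1], so without scaling the softmax stays close to uniform.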
The first test result doesn't match the one in your paper, is that OK?

Hey @hche11, I tested the pre-trained model you provided with test.py, but the results differ from the table in the paper. Specifically, I tested on the SoundNet-Flickr test set and completed all the steps. There were some bugs at runtime, and I could only run test.py successfully after modifying the code in two places. The results, as shown in the figure, are only close to those reported in the paper for the model trained on VGG-Sound Full. Is this gap normal?

By the way, the two changes I made are:
1. Line 56 in model.py: because the Tensor object does not have a T attribute, I replaced it with aud.t().
2. Line 110 in dataloader.py: an "axis doesn't match" error occurred in the call to the aid_spectrogram method, so I expanded the dimensions of the spectrogram object, that is, spectrogram = np.expand_dims(spectrogram,axis=2).

opened by WindBlowMyAss 0
Problem downloading the VGG-SS dataset

Thanks for sharing the code.

Did you annotate the raw videos from YouTube, or the processed videos of the VGG-Sound dataset? I am asking because the video names in the vggss.json file are not identical to those in the .csv files of the VGG-Sound dataset. If you processed the raw videos from YouTube, did you select the middle frame (as stated in the paper) independently of the VGG-Sound dataset?

I am confused about how to find the middle frame. I would appreciate it if you could share the code used to download the dataset.

opened by Ehsan-Yaghoubi 0
Random threshold

I think there is a problem with the random threshold in the model: it is supposed to be set under arg (I think), but it has no set value. Please clarify this for me.

opened by lowkey001 0
Missing data from VGG-SS

Hi, I am trying to download all the videos from the test portion of VGG-SS, but many samples are missing. Do you have all of the videos? If so, how can I get access to them? Thanks.

opened by denfed 1
