Localizing Visual Sounds the Hard Way

Overview

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

  • Python 3.6.8
  • PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, download the test data and ground truth from "Learning to Localize Sound Source".

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note: some ground-truth bounding boxes were updated recently, so results on VGG-SS may differ by 2~3% in IoU.)
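To make the IoU sensitivity mentioned in the note concrete, here is a minimal sketch of intersection-over-union for two axis-aligned boxes. It is an illustration only: the function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box format are assumptions, and this is not the evaluation code used by test.py, which may compute its metric differently (for example on pixel maps).

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

Because the union appears in the denominator, even a small shift in a ground-truth box changes the score, which is consistent with updated annotations moving reported numbers by a few percent.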

The test data for both datasets should be organized in the following structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav
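
Before running test.py, it can help to confirm that the data path matches the layout above. The following is a minimal sketch under stated assumptions: `check_layout` is a hypothetical helper, not part of the repository, and it assumes frames are stored as .jpg and audio as .wav, as in the example tree.

```python
from pathlib import Path


def check_layout(data_path):
    """Return (frame_names, audio_names) found under data_path.

    Hypothetical helper for illustration; not part of the repository.
    Raises FileNotFoundError if 'frames' or 'audio' is missing.
    """
    root = Path(data_path)
    frames_dir = root / "frames"
    audio_dir = root / "audio"

    # Both subdirectories from the tree above must exist.
    for directory in (frames_dir, audio_dir):
        if not directory.is_dir():
            raise FileNotFoundError(f"expected directory: {directory}")

    # Assumed extensions: .jpg for frames, .wav for audio.
    frames = sorted(p.name for p in frames_dir.glob("*.jpg"))
    audio = sorted(p.name for p in audio_dir.glob("*.wav"))
    return frames, audio
```

Calling `check_layout("path to downloaded data")` returns the sorted file names in each folder, so a quick comparison of the two list lengths shows whether any frame or audio clip is missing.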

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}
Comments
  • Some questions about loss implementation

    Hi, thank you for sharing your awesome code.

    I'm facing some issues while understanding your code.

    https://github.com/hche11/Localizing-Visual-Sounds-the-Hard-Way/blob/509acf8e673332e7db52ca1a7f29d65f19ef7c86/model.py#L54-L79

    Referring to the code and your paper, I understand that sim1 in L#69 represents P_i, and sim2 in L#72 implements the left term of N_i.

    It seems that sim in L#71 refers to the right term of N_i. However, I cannot understand why the Pos_all variable is a thresholded value of A0. According to the paper, I thought it should be an all-ones matrix.

    • Could you clarify where sim belongs in the loss objective?

    One more question, please.

    • Regarding L#74-77, should I scale the logits with the temperature value (0.07)? I am a little confused, as scaling the logits with a temperature value is not directly stated in the paper.

    It would be very helpful if you can release the code of your loss function. Thank you very much. Have a nice day!

    opened by kyuyeonpooh 2
  • The first test result doesn't match the one in your paper, is that OK?

    Hey @hche11, I tested the pre-trained model you provided with test.py, but the results differ somewhat from the table in the paper. Specifically, I tested on the SoundNet-Flickr test set. I hit some bugs at runtime and could run test.py successfully only after modifying the code in two places. The results, shown in the figure, are only close to those the paper reports for the model trained on VGG-Sound Full. I wonder if this gap is normal?

    By the way, the two changes I made are:
    1. Line 56 in model.py: the Tensor object does not have a T attribute, so I replaced it with aud.t().
    2. Line 110 in dataloader.py: an "axis doesn't match" error occurred in the call to the aid_spectrogram method, so I expanded the dimensions of the spectrogram object, that is, spectrogram = np.expand_dims(spectrogram,axis=2).

    opened by WindBlowMyAss 0
  • problem in downloading the VGGSS dataset

    Thanks for sharing the codes.

    Did you annotate the raw videos from YouTube, or the processed videos of the vggs dataset? I am asking because the video names in the vggss.json file are not identical to those in the .csv files of the vggs dataset. If you processed the raw videos from YouTube, did you select the middle frame (as stated in the paper) independently of the vggs dataset?

    I am confused about how to find the middle frame. I would appreciate it if you could share the code used to download the dataset.

    opened by Ehsan-Yaghoubi 0
  • Random threshold

    I think there is a problem with the random threshold in the model. It is supposed to be under arg 'i think', but it has no set value. Please clarify this for me.

    opened by lowkey001 0
  • Missing data from VGG-SS

    Hi, I am trying to download all videos from the test portion of VGG-SS, but many samples are missing. Do you have all of the videos? If so, how can I get access to them? Thanks.

    opened by denfed 1
Owner
Honglie Chen