Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis (2021)

Overview

Image2Reverb is an end-to-end neural network that generates plausible audio impulse responses from single images of acoustic environments. This repository contains the code for the paper Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis. The architecture is a conditional GAN whose image encoder is a ResNet50 pre-trained on Places365 and fine-tuned. The model generates monaural audio impulse responses (directly usable in convolution-reverb applications) as magnitude spectrograms.
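
Since the output is a standard monaural impulse response, applying it to a dry recording is a plain convolution. A minimal sketch (not part of this repository; the file names are placeholders and the dry input is assumed to be mono):

    import soundfile as sf
    from scipy.signal import fftconvolve

    # Load a dry (unprocessed) mono signal and a generated impulse response.
    dry, sr = sf.read("dry.wav")
    ir, sr_ir = sf.read("ir.wav")
    assert sr == sr_ir, "resample so the sample rates match"

    # Convolve to get the reverberant signal, then peak-normalize to avoid clipping.
    wet = fftconvolve(dry, ir)
    wet /= max(abs(wet).max(), 1e-8)
    sf.write("wet.wav", wet, sr)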

Dependencies

Model/Data:

  • PyTorch>=1.7.0
  • PyTorch Lightning
  • torchvision
  • torchaudio
  • librosa
  • PyRoomAcoustics
  • PIL

Eval/Preprocessing:

  • PySoundfile
  • SciPy
  • Scikit-Learn
  • python-acoustics
  • google-images-download
  • matplotlib
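
These can typically be installed with pip; a rough sketch assuming the usual PyPI package names (not an official requirements file):

    pip install "torch>=1.7.0" pytorch-lightning torchvision torchaudio librosa pyroomacoustics Pillow
    pip install soundfile scipy scikit-learn acoustics google_images_download matplotlib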

Usage

We will make a pre-trained model available soon!

Acknowledgments

We borrow and adapt code snippets from GANSynth (and this PyTorch re-implementation), additional snippets from this PGGAN implementation, and more.

Comments
  • about T60 metric

    Dear authors,

    Thanks for this awesome work!

    1. Could you please let me know how you compute the T60 mean and std in Table 3? Does it correspond to the test.py script, i.e., taking the 1957 test errors (in %, computed as (estimate - real)/real) and directly computing their mean and std?
    2. I guess the pretrained model corresponds to the checkpoint under `resources`; however, when I compute the T60 metric as in point 1, I get mean=60 and std=253, which differs from the paper.

    Thanks for your help!

    opened by catherine-qian 4
  • repeated IRs

    Dear authors,

    For the dataset description, you said "We curated a dataset of 265 different spaces totalling 1169 images and 738 IRs. From these, we produced a total of 11234 paired examples with a train-validation-test split of 9743-154-1957."

    So, for the 11234 paired examples, did you duplicate the IRs, i.e., are some IRs in the provided dataset exactly the same? And how did you choose which IRs to duplicate?

    opened by catherine-qian 3
  • the best model selection

    Dear authors,

    Could you please let me know how you select the best model? Is it based on the validation set, or do you report the best test results in the paper?

    BTW, have you tried removing the adversarial scheme to see the results with only a generator?

    opened by catherine-qian 2
  • microphone-sound distance

    Dear authors,

    Could you please let me know how the paper deals with the sound source location? Do you assume it is co-located with the microphone? The RIR depends on the microphone-source path, but I didn't find this information in the paper. Do you randomly place the sound source?

    opened by catherine-qian 1
  • New Spectrogram with new PyTorch (1.7.0)

    Newest PyTorch seems to have FFT routines that support complex tensors. Also added the finite-difference instantaneous angular frequency representation from GANSynth, etc.

    opened by nikhilsinghmus 0
  • Cannot reproduce the results from the project page

    Hi, I was trying to run your very interesting model on the input images from the project page, but the generated IRs are always kind of the same and do not resemble your output examples.

    This is the code I used:

    import torch
    import soundfile as sf
    from PIL import Image
    import torchvision.transforms as transforms

    from image2reverb.model import Image2Reverb
    from image2reverb.stft import STFT

    checkpoint_path = "./checkpoints/model.ckpt"
    encoder_path = "./checkpoints/resnet50_places365.pth.tar"
    depthmodel_path = "./checkpoints/mono_640x192"
    constant_depth = None
    latent_dimension = 512

    # Load the model and the trained weights.
    model = Image2Reverb(encoder_path, depthmodel_path, constant_depth=constant_depth, spec="stft")
    m = torch.load(checkpoint_path, map_location=model.device)
    model.load_state_dict(m["state_dict"])

    # Preprocess the input image (resize, convert to tensor, normalize).
    image_transforms = transforms.Compose([
        transforms.Resize([224, 224], transforms.functional.InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    image_path = "examples/input.5416407f.png"

    image = Image.open(image_path).convert("RGB")
    label = image_transforms(image).unsqueeze(0)

    with torch.no_grad():
        # Encode the image, pad the features with noise up to the latent size,
        # and run the generator to produce a spectrogram.
        f, img = model.enc.forward(label)

        shape = (
            f.shape[0],
            (latent_dimension - f.shape[1]) if f.shape[1] < latent_dimension else f.shape[1],
            f.shape[2],
            f.shape[3]
        )
        z = torch.cat((f, torch.randn(shape, device=model.device)), 1)

        spec = model.g(z)

    # Invert the generated spectrogram to a time-domain IR and write it to disk.
    stft = STFT()
    ir = stft.inverse(spec.squeeze())
    sf.write("ir.wav", ir, 22050)
    

    It's possible I'm doing something wrong or am missing a step. Any ideas? The generated IR always seems to be the same kind of exponentially decaying noise burst, regardless of what the input image looks like.

    opened by hollance 9
  • dataset details

    Dear authors,

    Thanks for the open sourced code.

    Do you select the anechoic signals and AIRs from the OpenAIR dataset, or were some samples recorded by yourselves?

    I am investigating other possible acoustic datasets with AIRs from different environments; do you have any in mind?

    Thanks in advance for your help! Best, Xinyuan

    opened by catherine-qian 2
  • Differentiability of proxy T_60 error

    Hi Nikhil,

    Thanks for open-sourcing the code!

    I have a question regarding the differentiability of the proxy T_60 error. Firstly, in compare_t60 in image2reverb.util, detach is called on both the predictions and the ground truths, which makes any further computation non-differentiable. Secondly, inside estimate_t60, which is called from compare_t60, there are several torch.where operations, which are again non-differentiable. So it looks like the proxy T_60 error isn't differentiable, and adding it to the other losses shouldn't really change the model training, since the model has no control over it.

    However, in Sec. 4.2 (Ablation Study) of the paper, removing the proxy T_60 error term from the objective does seem to make a difference. Am I missing something here?

    Best, Sagnik

    opened by SAGNIKMJR 3
  • Training data

    Hi Nikhil,

    Thank you for sharing the code! However, I can't find any information about downloading the data. Could you share more details of that?

    Thanks, Changan

    opened by ChanganVR 30
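
For reference, the T60 percentage-error statistic discussed in the first comment above (and reported in the paper's Table 3) can be computed along the following lines. This is only a rough sketch using the RT60 estimator from pyroomacoustics rather than the repository's own estimate_t60, so the numbers will not match test.py exactly; the file pairs are hypothetical placeholders:

    import numpy as np
    import soundfile as sf
    from pyroomacoustics.experimental.rt60 import measure_rt60

    def t60_percent_error(estimated_ir_path, reference_ir_path):
        # Percentage error of the estimated IR's T60 relative to the ground truth:
        # 100 * (estimate - real) / real.
        est, sr_est = sf.read(estimated_ir_path)
        ref, sr_ref = sf.read(reference_ir_path)
        t60_est = measure_rt60(est, fs=sr_est)
        t60_ref = measure_rt60(ref, fs=sr_ref)
        return 100.0 * (t60_est - t60_ref) / t60_ref

    # Hypothetical list of (estimated IR, ground-truth IR) file pairs,
    # e.g. the 1957 test examples mentioned above.
    pairs = [("estimated_000.wav", "true_000.wav")]
    errors = np.array([t60_percent_error(e, r) for e, r in pairs])
    print("T60 error: mean %.1f%%, std %.1f%%" % (errors.mean(), errors.std()))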