Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis (2021)

Overview

Image2Reverb is an end-to-end neural network that generates plausible audio impulse responses from single images of acoustic environments. This repository contains the code for the paper Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis. The architecture is a conditional GAN whose image encoder is a ResNet50 pre-trained on Places365 and fine-tuned. The model generates monaural audio impulse responses (directly usable in convolution-reverb applications) as magnitude spectrograms.
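
Since the output is a standard monaural impulse response, applying it to a dry recording is a plain convolution. A minimal sketch (not part of this repository; the file names are placeholders and the dry input is assumed to be mono):

    import soundfile as sf
    from scipy.signal import fftconvolve

    # Load a dry (unprocessed) mono signal and a generated impulse response.
    dry, sr = sf.read("dry.wav")
    ir, sr_ir = sf.read("ir.wav")
    assert sr == sr_ir, "resample so the sample rates match"

    # Convolve to get the reverberant signal, then peak-normalize to avoid clipping.
    wet = fftconvolve(dry, ir)
    wet /= max(abs(wet).max(), 1e-8)
    sf.write("wet.wav", wet, sr)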

Dependencies

Model/Data:

  • PyTorch>=1.7.0
  • PyTorch Lightning
  • torchvision
  • torchaudio
  • librosa
  • PyRoomAcoustics
  • PIL

Eval/Preprocessing:

  • PySoundfile
  • SciPy
  • Scikit-Learn
  • python-acoustics
  • google-images-download
  • matplotlib
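
These can typically be installed with pip; a rough sketch assuming the usual PyPI package names (not an official requirements file):

    pip install "torch>=1.7.0" pytorch-lightning torchvision torchaudio librosa pyroomacoustics Pillow
    pip install soundfile scipy scikit-learn acoustics google_images_download matplotlib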

Usage

We will make a pre-trained model available soon!

Acknowledgments

We borrow and adapt code snippets from GANSynth (and this PyTorch re-implementation), additional snippets from this PGGAN implementation, and more.

Comments
  • about T60 metric

    Dear authors,

    Thanks for this awesome work!

    1. Could you please let me know how you compute the T60 mean and std in Table 3? Does it correspond to the test.py script, i.e., taking the 1957 test errors (in %, computed as (estimate - real)/real) and directly computing their mean and std?
    2. I guess the pretrained model corresponds to the checkpoint under `resources`; however, when I compute the T60 metric as in point 1, I get mean=60 and std=253, which differs from the paper.

    Thanks for your help!

    opened by catherine-qian 4
  • repeated IRs

    Dear authors,

    For the dataset description, you said "We curated a dataset of 265 different spaces totalling 1169 images and 738 IRs. From these, we produced a total of 11234 paired examples with a train-validation-test split of 9743-154-1957."

    So, for the 11234 paired examples, did you duplicate the IRs, i.e., are some IRs in the provided dataset exactly the same? And how did you choose which IRs to duplicate?

    opened by catherine-qian 3
  • the best model selection

    Dear authors,

    Could you please let me know how you select the best model? Is it based on the validation set, or do you report the best test results in the paper?

    BTW, have you tried removing the adversarial scheme to see the results with only a generator?

    opened by catherine-qian 2
  • microphone-sound distance

    Dear authors,

    Could you please let me know how the paper deals with the sound source location? Do you assume it is co-located with the microphone? The RIR depends on the microphone-source path, but I didn't find this information in the paper. Do you randomly place the sound source?

    opened by catherine-qian 1
  • New Spectrogram with new PyTorch (1.7.0)

    Newest PyTorch seems to have FFT routines that support complex tensors. Also added the finite-difference instantaneous angular frequency representation from GANSynth, etc.

    opened by nikhilsinghmus 0
  • Cannot reproduce the results from the project page

    Hi, I was trying to run your very interesting model on the input images from the project page, but the generated IRs are always kind of the same and do not resemble your output examples.

    This is the code I used:

    import torch
    import soundfile as sf
    from PIL import Image
    import torchvision.transforms as transforms

    from image2reverb.model import Image2Reverb
    from image2reverb.stft import STFT

    checkpoint_path = "./checkpoints/model.ckpt"
    encoder_path = "./checkpoints/resnet50_places365.pth.tar"
    depthmodel_path = "./checkpoints/mono_640x192"
    constant_depth = None
    latent_dimension = 512

    # Load the model and the trained weights.
    model = Image2Reverb(encoder_path, depthmodel_path, constant_depth=constant_depth, spec="stft")
    m = torch.load(checkpoint_path, map_location=model.device)
    model.load_state_dict(m["state_dict"])

    # Preprocess the input image (resize, convert to tensor, normalize).
    image_transforms = transforms.Compose([
        transforms.Resize([224, 224], transforms.functional.InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    image_path = "examples/input.5416407f.png"

    image = Image.open(image_path).convert("RGB")
    label = image_transforms(image).unsqueeze(0)

    with torch.no_grad():
        # Encode the image, pad the features with noise up to the latent size,
        # and run the generator to produce a spectrogram.
        f, img = model.enc.forward(label)

        shape = (
            f.shape[0],
            (latent_dimension - f.shape[1]) if f.shape[1] < latent_dimension else f.shape[1],
            f.shape[2],
            f.shape[3]
        )
        z = torch.cat((f, torch.randn(shape, device=model.device)), 1)

        spec = model.g(z)

    # Invert the generated spectrogram to a time-domain IR and write it to disk.
    stft = STFT()
    ir = stft.inverse(spec.squeeze())
    sf.write("ir.wav", ir, 22050)
    

    It's possible I'm doing something wrong or am missing a step. Any ideas? The generated IR always seems to be the same kind of exponentially decaying noise burst, regardless of what the input image looks like.

    opened by hollance 9
  • dataset details

    Dear authors,

    Thanks for the open sourced code.

    Do you select the anechoic signals and AIRs from the OpenAIR dataset, or were some samples recorded by yourselves?

    I am investigating other possible acoustic datasets with AIRs from different environments; do you have any in mind?

    Thanks in advance for your help! Best, Xinyuan

    opened by catherine-qian 2
  • Differentiability of proxy T_60 error

    Hi Nikhil,

    Thanks for open-sourcing the code!

    I have a question regarding the differentiability of the proxy T_60 error. Firstly, in compare_t60 in image2reverb.util, detach is called on both the predictions and the ground truths, which makes any further computation non-differentiable. Secondly, inside estimate_t60, which is called from compare_t60, there are several torch.where operations, which are again non-differentiable. So it looks like the proxy T_60 error isn't differentiable, and adding it to the other losses shouldn't really change the model training, since the model has no control over it.

    However, in Sec. 4.2 (Ablation Study) of the paper, removing the proxy T_60 error term from the objective does seem to make a difference. Am I missing something here?

    Best, Sagnik

    opened by SAGNIKMJR 3
  • Training data

    Hi Nikhil,

    Thank you for sharing the code! However, I can't find any information about downloading the data. Could you share more details of that?

    Thanks, Changan

    opened by ChanganVR 30
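
For reference, the T60 percentage-error statistic discussed in the first comment above (and reported in the paper's Table 3) can be computed along the following lines. This is only a rough sketch using the RT60 estimator from pyroomacoustics rather than the repository's own estimate_t60, so the numbers will not match test.py exactly; the file pairs are hypothetical placeholders:

    import numpy as np
    import soundfile as sf
    from pyroomacoustics.experimental.rt60 import measure_rt60

    def t60_percent_error(estimated_ir_path, reference_ir_path):
        # Percentage error of the estimated IR's T60 relative to the ground truth:
        # 100 * (estimate - real) / real.
        est, sr_est = sf.read(estimated_ir_path)
        ref, sr_ref = sf.read(reference_ir_path)
        t60_est = measure_rt60(est, fs=sr_est)
        t60_ref = measure_rt60(ref, fs=sr_ref)
        return 100.0 * (t60_est - t60_ref) / t60_ref

    # Hypothetical list of (estimated IR, ground-truth IR) file pairs,
    # e.g. the 1957 test examples mentioned above.
    pairs = [("estimated_000.wav", "true_000.wav")]
    errors = np.array([t60_percent_error(e, r) for e, r in pairs])
    print("T60 error: mean %.1f%%, std %.1f%%" % (errors.mean(), errors.std()))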