Codebase for ECCV18 "The Sound of Pixels"

Hang Zhao

Last update: Dec 20, 2022

Sound-of-Pixels

This repository is under construction, but the core parts are already in place.

Environment

The code is developed under the following configurations.

Hardware: 1-4 GPUs (change [--num_gpus NUM_GPUS] accordingly)
Software: Ubuntu 16.04.3 LTS, CUDA>=8.0, Python>=3.5, PyTorch>=0.4.0

Training

Prepare video dataset.

a. Download MUSIC dataset from: https://github.com/roudimit/MUSIC_dataset

b. Download videos.

Preprocess videos. You can do it in your own way as long as the index files are similar.

a. Extract frames at 8fps and waveforms at 11025Hz from the videos. We use the following directory structure:

data
├── audio
│   ├── acoustic_guitar
│   │   ├── M3dekVSwNjY.mp3
│   │   ├── ...
│   ├── trumpet
│   │   ├── STKXyBGSGyE.mp3
│   │   ├── ...
│   ├── ...
│
└── frames
    ├── acoustic_guitar
    │   ├── M3dekVSwNjY.mp4
    │   │   ├── 000001.jpg
    │   │   ├── ...
    │   ├── ...
    ├── trumpet
    │   ├── STKXyBGSGyE.mp4
    │   │   ├── 000001.jpg
    │   │   ├── ...
    │   ├── ...
    ├── ...
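
The extraction step can be done in many ways. As a minimal sketch, the helper below builds ffmpeg command lines that would produce the layout shown above. The use of ffmpeg, the helper name `build_ffmpeg_commands`, and the input layout `<category>/<video_id>.mp4` are assumptions for illustration, not the authors' preprocessing script. The helper only builds the argument lists and does not run anything.

```python
from pathlib import Path


def build_ffmpeg_commands(video_path, data_root="data", fps=8, sample_rate=11025):
    """Return (audio_cmd, frames_cmd) argument lists for one video.

    Assumes the input is stored as <anything>/<category>/<video_id>.mp4;
    that input layout is an assumption made for this sketch.
    """
    video = Path(video_path)
    category = video.parent.name  # e.g. "acoustic_guitar"
    audio_out = Path(data_root) / "audio" / category / (video.stem + ".mp3")
    # The frames go into a directory that is itself named "<video_id>.mp4".
    frames_dir = Path(data_root) / "frames" / category / video.name

    # -vn drops the video stream, -ar sets the audio sample rate.
    audio_cmd = ["ffmpeg", "-i", video.as_posix(), "-vn",
                 "-ar", str(sample_rate), audio_out.as_posix()]
    # The fps filter resamples the video to a fixed frame rate;
    # %06d.jpg yields names such as 000001.jpg.
    frames_cmd = ["ffmpeg", "-i", video.as_posix(), "-vf", "fps={}".format(fps),
                  (frames_dir / "%06d.jpg").as_posix()]
    return audio_cmd, frames_cmd
```

To actually run the commands, create the output directories first (for example with `os.makedirs(..., exist_ok=True)`) and pass each list to `subprocess.run(cmd, check=True)`.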

b. Make training/validation index files by running:

python scripts/create_index_files.py

It will create the index files train.csv and val.csv in the following format:

./data/audio/acoustic_guitar/M3dekVSwNjY.mp3,./data/frames/acoustic_guitar/M3dekVSwNjY.mp4,1580
./data/audio/trumpet/STKXyBGSGyE.mp3,./data/frames/trumpet/STKXyBGSGyE.mp4,493

Each row stores the following information: AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES
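
If you build the index files yourself, it can help to check them before training. The sketch below parses rows in the AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES format and lists any paths that do not exist on disk. The function names `read_index` and `find_missing` are hypothetical and are not part of the repository.

```python
import csv
import io
import os


def read_index(lines):
    """Parse rows of AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES into tuples."""
    rows = []
    for record in csv.reader(lines):
        if not record:
            continue  # skip blank lines
        audio_path, frames_path, num_frames = record
        rows.append((audio_path, frames_path, int(num_frames)))
    return rows


def find_missing(rows):
    """Return every audio file or frame directory in rows that is not on disk."""
    missing = []
    for audio_path, frames_path, _ in rows:
        if not os.path.isfile(audio_path):
            missing.append(audio_path)
        if not os.path.isdir(frames_path):
            missing.append(frames_path)
    return missing


# The two example rows from the format description above.
sample = io.StringIO(
    "./data/audio/acoustic_guitar/M3dekVSwNjY.mp3,"
    "./data/frames/acoustic_guitar/M3dekVSwNjY.mp4,1580\n"
    "./data/audio/trumpet/STKXyBGSGyE.mp3,"
    "./data/frames/trumpet/STKXyBGSGyE.mp4,493\n"
)
rows = read_index(sample)
print(rows[0][2], rows[1][2])  # prints "1580 493"
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `read_index(open("train.csv", newline=""))`, and inspect the result of `find_missing(rows)`.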

Train the default model.

./scripts/train_MUSIC.sh

During training, visualizations are saved in HTML format under ckpt/MODEL_ID/visualization/.

Evaluation

(Optional) Download our trained model weights for evaluation.

./scripts/download_trained_model.sh

Evaluate the trained model performance.

./scripts/eval_MUSIC.sh

Reference

If you use the code or dataset from the project, please cite:

    @InProceedings{Zhao_2018_ECCV,
        author = {Zhao, Hang and Gan, Chuang and Rouditchenko, Andrew and Vondrick, Carl and McDermott, Josh and Torralba, Antonio},
        title = {The Sound of Pixels},
        booktitle = {The European Conference on Computer Vision (ECCV)},
        month = {September},
        year = {2018}
    }

Comments

Poor visualizations, getting zero SDR, SIR, etc. on evaluation

I tried to evaluate on 16 videos using the downloaded trained model, but I cannot see the results in the visualization. Video1 and Video2 have only 3 frames each with no audio, and the predicted audio is also silent.

I'm getting the following output after evaluation:

Loading weights for net_frame
Loading weights for net_synthesizer
samples: 6300
samples: 16
1 Epoch = 196 iters
Evaluating at 0 epochs...
[Eval] iter 0, loss: 0.0115
[Eval Summary] Epoch: 0, Loss: 0.0115, SDR_mixture: 0.0000, SDR: 0.0000, SIR: 0.0000, SAR: 0.0000
Plotting html for visualization...
Evaluation Done!

I hope to get some help. Thanks.

opened by deepakee13 10
A Question on Evaluation

Hello, I am a Chinese student. I downloaded two solo videos (2P83WJXifEs and 3d1b4UH43-E) from 'val.csv' to evaluate the performance of the model. The final loss is 0.5479, and the separation results are very unsatisfactory. Why is that? I hope to get your reply.

P.S. I downloaded the trained model weights by running ./scripts/download_trained_model.sh and evaluated the trained model by running ./scripts/eval_MUSIC.sh.

opened by GFENGG 3
Calculate the evaluation index as zero

When I first computed the evaluation metrics using an ideal binary mask, all of them were zero. Through debugging, I found that the predicted masks are all less than 0.5. I don't know how to solve this problem. Or is it that this first evaluation ran on a model that has not been trained yet, so the result is poor?

opened by JusperLee 0
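
One way to see why masks that are all below 0.5 can lead to zero metrics: if the evaluation binarizes the predicted mask at a threshold of 0.5 (an assumption here, based on the comment above), every mask entry becomes 0 and the separated spectrogram has no energy. The toy values and function names below are illustrative only and do not come from the repository.

```python
def binarize(mask, threshold=0.5):
    """Turn a soft mask into a 0/1 mask; entries must exceed the threshold to be kept."""
    return [[1.0 if value > threshold else 0.0 for value in row] for row in mask]


def apply_mask(mask, mixture_mag):
    """Element-wise product of a mask and a mixture magnitude spectrogram."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture_mag)]


# Toy 2x2 "spectrogram" and a soft mask whose entries are all below 0.5.
mixture_mag = [[0.8, 1.2], [0.5, 2.0]]
soft_mask = [[0.45, 0.30], [0.10, 0.49]]

separated = apply_mask(binarize(soft_mask), mixture_mag)
energy = sum(abs(value) for row in separated for value in row)
print(energy)  # prints 0.0
```

A silent prediction like this gives separation metrics nothing to measure, so checking the energy of the predicted waveform before computing SDR, SIR, and SAR is a quick diagnostic.
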
where is the pixelwise sound

Hi, I saw the function forward_pixelwise in the synthesizer code; it is the version of the forward function that produces pixel-wise masks. However, throughout the code I found that only the forward function is invoked, which is not the pixel-wise one. Is there any demo that can produce pixel-wise sound?

opened by TaoZheng9 0
About duet and mixtures video

I evaluated the performance using the trained model weights you provided. I found that the trained model uses the Mix-and-Separate process and finally reconstructs the two audio tracks from two input solo videos. This is the validation part. What about the test part on duet videos?
I am interested in research on sound source localization and separation in natural duet videos. Should I train the model from scratch, or can I still use the trained model you provided? Could you give me some suggestions, please? Thank you. I'm looking forward to your reply.

opened by fanglixuezi 0
Why the model does not go training?

Hello, I am a Chinese student. I have preprocessed the dataset and used train_MUSIC.sh to train the default model, but the result is not what I expected: the metrics are all 0. Even when I directly run eval_MUSIC.sh (I have downloaded the trained model), I also get 0 for the metrics (SDR, SIR, etc.). I have not changed the code that you published on GitHub. How can I find out what the problem is?

opened by avis-ma 4
Failed to loading frames/audio

Sir, I first created the .csv files, and they list the inputs and their paths. But during training it reports that it failed to load frames/audio.

opened by krishnareedy 9
Cannot download the trained model

Hello. I have tried to download the trained model, but I failed to download the model by running the file 'download_trained_model.sh'. And I have also tried to access the website of the model "http://sound-of-pixels.csail.mit.edu/release/", but I got the reply "You don't have permission to access /release/ on this server.". So, I cannot get the trained model. How can I solve that problem? Thanks a lot.

opened by liuxinzhu0353150307 0

Owner

Hang Zhao

Assistant Professor at Tsinghua University, MIT PhD in Computer Vision

GitHub http://sound-of-pixels.csail.mit.edu
