Codebase for ECCV18 "The Sound of Pixels"

Sound-of-Pixels

This repository is under construction, but the core parts are already there.

Environment

The code is developed under the following configurations.

  • Hardware: 1-4 GPUs (set --num_gpus NUM_GPUS accordingly)
  • Software: Ubuntu 16.04.3 LTS, CUDA>=8.0, Python>=3.5, PyTorch>=0.4.0

Training

  1. Prepare the video dataset.

    a. Download the MUSIC dataset from: https://github.com/roudimit/MUSIC_dataset

    b. Download the videos.

  2. Preprocess the videos. You can do this in your own way, as long as the resulting index files follow the same format.

    a. Extract frames at 8 fps and waveforms at 11025 Hz from the videos. We use the following directory structure:

    data
    ├── audio
    │   ├── acoustic_guitar
    │   │   ├── M3dekVSwNjY.mp3
    │   │   ├── ...
    │   ├── trumpet
    │   │   ├── STKXyBGSGyE.mp3
    │   │   ├── ...
    │   ├── ...
    │
    └── frames
        ├── acoustic_guitar
        │   ├── M3dekVSwNjY.mp4
        │   │   ├── 000001.jpg
        │   │   ├── ...
        │   ├── ...
        ├── trumpet
        │   ├── STKXyBGSGyE.mp4
        │   │   ├── 000001.jpg
        │   │   ├── ...
        │   ├── ...
        ├── ...
    

    b. Make training/validation index files by running:

    python scripts/create_index_files.py
    

    It creates the index files train.csv and val.csv in the following format:

    ./data/audio/acoustic_guitar/M3dekVSwNjY.mp3,./data/frames/acoustic_guitar/M3dekVSwNjY.mp4,1580
    ./data/audio/trumpet/STKXyBGSGyE.mp3,./data/frames/trumpet/STKXyBGSGyE.mp4,493
    

    Each row stores AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES.

  3. Train the default model.

    ./scripts/train_MUSIC.sh

  4. During training, visualizations are saved in HTML format under ckpt/MODEL_ID/visualization/.
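The preprocessing in step 2 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the repository's scripts/create_index_files.py: it assumes ffmpeg is installed, that each raw video file is named after its YouTube ID, and the helper names extraction_commands and write_index are hypothetical. It only builds the ffmpeg argument lists for 8 fps frames and 11025 Hz audio, and writes index rows in the AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES format.

```python
import csv
import os
from pathlib import Path

FPS = 8              # frame rate stated in the README
SAMPLE_RATE = 11025  # audio sample rate (Hz) stated in the README


def extraction_commands(video, instrument, root="./data"):
    """Build ffmpeg argument lists that extract audio and frames from one video.

    Assumption: the raw video is named <youtube_id>.<ext>. The frames directory
    is named <youtube_id>.mp4, matching the directory layout in step 2a.
    """
    vid = Path(video).stem  # e.g. "STKXyBGSGyE"
    audio_out = os.path.join(root, "audio", instrument, vid + ".mp3")
    frames_dir = os.path.join(root, "frames", instrument, vid + ".mp4")
    # -vn drops the video stream; -ar resamples the audio to 11025 Hz
    audio_cmd = ["ffmpeg", "-y", "-i", video, "-vn", "-ar", str(SAMPLE_RATE), audio_out]
    # the fps filter samples 8 frames per second; %06d names them 000001.jpg, ...
    frames_cmd = ["ffmpeg", "-y", "-i", video, "-vf", f"fps={FPS}",
                  os.path.join(frames_dir, "%06d.jpg")]
    return audio_cmd, frames_cmd


def write_index(root, csv_path):
    """Write AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES rows and return the row count.

    Audio files without a matching frames directory are skipped.
    """
    rows = []
    audio_root = os.path.join(root, "audio")
    for instrument in sorted(os.listdir(audio_root)):
        for name in sorted(os.listdir(os.path.join(audio_root, instrument))):
            if not name.endswith(".mp3"):
                continue
            vid = name[: -len(".mp3")]
            audio_path = os.path.join(audio_root, instrument, name)
            frames_path = os.path.join(root, "frames", instrument, vid + ".mp4")
            if not os.path.isdir(frames_path):
                continue  # frames were not extracted for this video
            n_frames = sum(1 for f in os.listdir(frames_path) if f.endswith(".jpg"))
            rows.append([audio_path, frames_path, n_frames])
    with open(csv_path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)
    return len(rows)
```

To use it, create the output directories first (for example with os.makedirs(path, exist_ok=True)), since ffmpeg does not create missing directories, then run each command list with subprocess.run(cmd, check=True). Splitting the rows into train.csv and val.csv is left out because the README does not say how the original script splits them.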

Evaluation

  1. (Optional) Download our trained model weights for evaluation.

    ./scripts/download_trained_model.sh

  2. Evaluate the trained model's performance.

    ./scripts/eval_MUSIC.sh
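Several of the issues below report frames/audio that fail to load or metrics that are all zero. Since the example index rows use relative paths such as ./data/audio/..., they resolve against the current working directory. The following is a hypothetical pre-flight check (check_index is not part of the repository) that verifies each row of an index file before training or evaluation; it assumes NUMBER_FRAMES should equal the number of .jpg files in the frames directory.

```python
import csv
import os


def check_index(csv_path):
    """Return a list of problems found in an AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES index.

    An empty list means every row points at existing audio and frames.
    Relative paths are resolved against the current working directory.
    """
    problems = []
    with open(csv_path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if len(row) != 3:
                problems.append(f"line {lineno}: expected 3 fields, got {len(row)}")
                continue
            audio_path, frames_path, n_str = row
            if not os.path.isfile(audio_path):
                problems.append(f"line {lineno}: missing audio {audio_path}")
            if not os.path.isdir(frames_path):
                problems.append(f"line {lineno}: missing frames dir {frames_path}")
                continue
            # Assumption: every extracted frame is a .jpg file in this directory
            n_jpg = sum(1 for name in os.listdir(frames_path) if name.endswith(".jpg"))
            if not n_str.isdigit() or int(n_str) != n_jpg:
                problems.append(
                    f"line {lineno}: NUMBER_FRAMES={n_str} but found {n_jpg} .jpg files")
    return problems
```

For example, running check_index("val.csv") from the directory where the scripts are launched and printing any returned problems shows which rows would fail to load.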

Reference

If you use the code or dataset from the project, please cite:

    @InProceedings{Zhao_2018_ECCV,
        author = {Zhao, Hang and Gan, Chuang and Rouditchenko, Andrew and Vondrick, Carl and McDermott, Josh and Torralba, Antonio},
        title = {The Sound of Pixels},
        booktitle = {The European Conference on Computer Vision (ECCV)},
        month = {September},
        year = {2018}
    }
Comments
  • Poor visualizations, getting zero SDR, SIR, etc. on evaluation

    I was trying to evaluate on 16 videos using the downloaded trained model, but I cannot see the results in the visualization. Video1 and video2 have only 3 frames each with no audio, and the predicted audio is also silent.

    I'm getting the following output after evaluation:

        Loading weights for net_frame
        Loading weights for net_synthesizer
        samples: 6300
        samples: 16
        1 Epoch = 196 iters
        Evaluating at 0 epochs...
        [Eval] iter 0, loss: 0.0115
        [Eval Summary] Epoch: 0, Loss: 0.0115, SDR_mixture: 0.0000, SDR: 0.0000, SIR: 0.0000, SAR: 0.0000
        Plotting html for visualization...
        Evaluation Done!

    I hope to get some help. Thanks!

    opened by deepakee13 10
  • A Question on Evaluation

    Hello, I am a Chinese student. I downloaded two solo videos (2P83WJXifEs and 3d1b4UH43-E) from 'val.csv' to evaluate the performance of the model. The final loss is 0.5479, and the separation result for each video is very unsatisfactory. Why is that? I hope to get your reply.

    P.S. I downloaded the trained model weights for evaluation with ./scripts/download_trained_model.sh and evaluated the trained model's performance with ./scripts/eval_MUSIC.sh.

    opened by GFENGG 3
  • Calculate the evaluation index as zero

    When I first calculated the evaluation metrics using an ideal binary mask, all of them were zero. While debugging, I found that the predicted masks are all less than 0.5. I don't know how to solve this problem. Or is it because the model had not been trained yet at the first evaluation, so the result is poor?

    opened by JusperLee 0
  • where is the pixelwise sound

    Hi, I saw the function forward_pixelwise in the synthesizer code; it is a version of the forward function that produces pixel-wise masks. However, throughout the code I found that only forward is invoked, and it does not produce pixel-wise sound. Is there any demo that can produce pixel-wise sound?

    opened by TaoZheng9 0
  • About duet and mixtures video

    I evaluated the trained model's performance using the trained weights you provided. I found that the model uses the Mix-and-Separate process and finally reconstructs the two audio tracks from two input solo videos. This is the validation part. What about the test part with duet videos?
    I am interested in research on sound source localization and separation in natural duet videos. Should I train the model from scratch, or could I still use the trained model you provided? Could you give me some suggestions, please? Thank you! I'm looking forward to your reply.

    opened by fanglixuezi 0
  • Why the model does not go training?

    Hello, I am a Chinese student. I have preprocessed the dataset and used train_MUSIC.sh to train the default model, but the result is not what I expected: the metrics are all 0. Even when I directly use eval_MUSIC.sh (with the downloaded trained model), I still get 0 for all metrics (SDR, SIR, etc.). I haven't changed the code you published on GitHub. How can I find out what the problem is?

    opened by avis-ma 4
  • Failed to load frames/audio

    Sir, I first created the .csv files, and they list the inputs and their paths. But during training, it reports that it failed to load frames/audio.

    opened by krishnareedy 9
  • Cannot download the trained model

    Hello. I tried to download the trained model by running 'download_trained_model.sh', but the download failed. I also tried to access the model's website "http://sound-of-pixels.csail.mit.edu/release/", but got the reply "You don't have permission to access /release/ on this server." So I cannot get the trained model. How can I solve this problem? Thanks a lot.

    opened by liuxinzhu0353150307 0
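One of the issues above reports that, with an ideal binary mask, every predicted mask value was below 0.5 and all metrics came out as zero. The NumPy sketch below illustrates why such a mask produces silence: thresholding at 0.5 zeroes every time-frequency bin, so the masked mixture spectrogram is all zeros. It is an illustration, not the repository's evaluation code; the 0.5 threshold, the (freq, time) shape, the random stand-in spectrogram, and the helper names are assumptions.

```python
import numpy as np


def binarize(mask, threshold=0.5):
    """Turn a soft mask in [0, 1] into a 0/1 mask (threshold is an assumption)."""
    return (mask > threshold).astype(mask.dtype)


def apply_mask(mixture_mag, mask):
    """Element-wise masking of a magnitude spectrogram of shape (freq, time)."""
    return mixture_mag * mask


def is_silent(x, eps=1e-5):
    """True if the total absolute magnitude is numerically zero."""
    return float(np.sum(np.abs(x))) < eps


rng = np.random.default_rng(0)
mixture = rng.random((256, 64)).astype(np.float32)           # stand-in spectrogram
soft_mask = 0.49 * rng.random((256, 64)).astype(np.float32)  # every value < 0.5

separated = apply_mask(mixture, binarize(soft_mask))
print(is_silent(separated))  # True: the threshold zeroed every bin
```

A check like is_silent on the predicted output is a quick way to tell whether zero metrics come from silent predictions rather than from the metric computation itself.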
Owner
Hang Zhao
Assistant Professor at Tsinghua University, MIT PhD in Computer Vision
Official codebase for Pretrained Transformers as Universal Computation Engines.

universal-computation Overview Official codebase for Pretrained Transformers as Universal Computation Engines. Contains demo notebook and scripts to r

Kevin Lu 210 Dec 28, 2022
AOT-GAN for High-Resolution Image Inpainting (codebase for image inpainting)

AOT-GAN for High-Resolution Image Inpainting Arxiv Paper | AOT-GAN: Aggregated Contextual Transformations for High-Resolution Image Inpainting Yanhong

Multimedia Research 214 Jan 3, 2023
This is the codebase for Diffusion Models Beat GANS on Image Synthesis.

OpenAI 3k Dec 26, 2022
Official codebase for Decision Transformer: Reinforcement Learning via Sequence Modeling.

Decision Transformer Lili Chen*, Kevin Lu*, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas†, and Igor M

Kevin Lu 1.4k Jan 7, 2023
Codebase for the Summary Loop paper at ACL2020

Summary Loop This repository contains the code for ACL2020 paper: The Summary Loop: Learning to Write Abstractive Summaries Without Examples. Training

Canny Lab @ The University of California, Berkeley 44 Nov 4, 2022
This is the codebase for the ICLR 2021 paper Trajectory Prediction using Equivariant Continuous Convolution

Trajectory Prediction using Equivariant Continuous Convolution (ECCO) This is the codebase for the ICLR 2021 paper Trajectory Prediction using Equivar

Spatiotemporal Machine Learning 45 Jul 22, 2022
A weakly-supervised scene graph generation codebase. The implementation of our CVPR2021 paper ``Linguistic Structures as Weak Supervision for Visual Scene Graph Generation''

README.md shall be finished soon. WSSGG 0 Overview 1 Installation 1.1 Faster-RCNN 1.2 Language Parser 1.3 GloVe Embeddings 2 Settings 2.1 VG-GT-Graph

Keren Ye 35 Nov 20, 2022
X-modaler is a versatile and high-performance codebase for cross-modal analytics.

X-modaler X-modaler is a versatile and high-performance codebase for cross-modal analytics. This codebase unifies comprehensive high-quality modules i

null 910 Dec 28, 2022
Codebase for Diffusion Models Beat GANS on Image Synthesis.

Katherine Crowson 128 Dec 2, 2022
Official codebase for Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World

Legged Robots that Keep on Learning Official codebase for Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World, whic

Laura Smith 70 Dec 7, 2022
An Image Captioning codebase

An Image Captioning codebase This is a codebase for image captioning research. It supports: Self critical training from Self-critical Sequence Trainin

Ruotian(RT) Luo 1.1k Oct 18, 2021
Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase

Ranger-Deep-Learning-Optimizer Ranger - a synergistic optimizer combining RAdam (Rectified Adam) and LookAhead, and now GC (gradient centralization) i

Less Wright 1.1k Dec 21, 2022
Codebase for the self-supervised goal reaching benchmark introduced in the LEXA paper

LEXA Benchmark Codebase for the self-supervised goal reaching benchmark introduced in the LEXA paper (Discovering and Achieving Goals via World Models

Oleg Rybkin 36 Dec 22, 2022
This codebase is the official implementation of Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization (NeurIPS2021, Spotlight)

Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization This codebase is the official implementation of Test-Time Classifier A

null 47 Dec 28, 2022
Official codebase for "B-Pref: Benchmarking Preference-BasedReinforcement Learning" contains scripts to reproduce experiments.

B-Pref Official codebase for B-Pref: Benchmarking Preference-BasedReinforcement Learning contains scripts to reproduce experiments. Install conda env

null 48 Dec 20, 2022
Codebase for "ProtoAttend: Attention-Based Prototypical Learning."

Codebase for "ProtoAttend: Attention-Based Prototypical Learning." Authors: Sercan O. Arik and Tomas Pfister Paper: Sercan O. Arik and Tomas Pfister,

47 2 May 17, 2022
Using this codebase as a tool for my own research. Making some modifications to the original repo for my own purposes.

For SwapNet Create a list.txt file containing all the images to process. This can be done with the GNU find command: find path/to/input/folder -name '

Andrew Jong 2 Nov 10, 2021
Codebase for Amodal Segmentation through Out-of-Task andOut-of-Distribution Generalization with a Bayesian Model

Yihong Sun 12 Nov 15, 2022
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.

PySlowFast PySlowFast is an open source video understanding codebase from FAIR that provides state-of-the-art video classification models with efficie

Meta Research 5.3k Jan 3, 2023