Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)


Taming Visually Guided Sound Generation

• [Project Page] • [ArXiv] • [Poster] • Open In Colab

Generated Samples Using our Model

Listen to the samples on our project page.

Overview

We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to form a novel sound, given a set of visual cues as a prime.

The codebook is trained on spectrograms similarly to VQGAN (an upgraded VQVAE). We refer to it as the Spectrogram VQGAN.

Spectrogram VQGAN

Once the spectrogram codebook is trained, we can train a transformer (a variant of GPT-2) to autoregressively sample the codebook entries as tokens conditioned on a set of visual features.

Vision-based Conditional Cross-modal Autoregressive Sampler

This approach allows training a spectrogram generation model which produces long, relevant, and high-fidelity sounds while supporting tens of data classes.
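
To make the two-stage idea above concrete, here is a minimal, self-contained PyTorch sketch. It is illustrative only and not the repository's code: the codebook size, feature dimensions, and the stand-in logit function are assumptions.

import torch

torch.manual_seed(0)
codebook = torch.randn(1024, 256)             # 1024 codes of dimension 256 (sizes are illustrative)

def quantize(z):
    # z: (num_patches, 256) encoder outputs for one spectrogram
    dists = torch.cdist(z, codebook)          # distance of every patch to every codebook vector
    idx = dists.argmin(dim=1)                 # nearest code per patch
    return codebook[idx], idx                 # quantized vectors and their integer token ids

z = torch.randn(265, 256)                     # e.g. a 5 x 53 grid of spectrogram patches
z_q, tokens = quantize(z)                     # stage 1: spectrogram -> discrete tokens

visual_prime = torch.randn(5, 2048)           # e.g. 5 frame-level visual features used as the prime

def toy_next_token_logits(prime, prefix):
    # stand-in for the GPT-2 variant: returns logits over the codebook entries
    return torch.randn(codebook.shape[0])

generated = []
for _ in range(265):                          # stage 2: sample one spectrogram's worth of tokens
    probs = toy_next_token_logits(visual_prime, generated).softmax(dim=0)
    generated.append(torch.multinomial(probs, 1).item())
# the sampled token ids would then be mapped back through the codebook and the
# VQGAN decoder to produce a novel spectrogram, which a vocoder turns into audio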

Environment Preparation

During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.8 and CUDA 11.

Start by cloning this repo

git clone https://github.com/v-iashin/SpecVQGAN.git

Next, install the environment. For your convenience, we provide both conda and docker environments.

Conda

conda env create -f conda_env.yml

Test your environment

conda activate specvqgan
python -c "import torch; print(torch.cuda.is_available())"
# True

Docker

Download the image from Docker Hub and test if CUDA is available:

docker run \
    --mount type=bind,source=/absolute/path/to/SpecVQGAN/,destination=/home/ubuntu/SpecVQGAN/ \
    --mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SpecVQGAN/logs/ \
    --mount type=bind,source=/absolute/path/to/vggsound/features/,destination=/home/ubuntu/SpecVQGAN/data/vggsound/ \
    --shm-size 8G \
    -it --gpus '"device=0"' \
    iashin/specvqgan:latest \
    python
>>> import torch; print(torch.cuda.is_available())
# True

or build it yourself

docker build - < Dockerfile --tag specvqgan

Data

In this project, we used VAS and VGGSound datasets. VAS can be downloaded directly using the link provided in the RegNet repository. For VGGSound, however, one might need to retrieve videos directly from YouTube.

Download

The scripts will download features, check the md5 sum, unpack, and do a clean-up for each part of the dataset:

cd ./data
# 24GB
bash ./download_vas_features.sh
# 420GB (+ 420GB if you also need ResNet50 Features)
bash ./download_vggsound_features.sh

The unpacked features are saved in ./data/downloaded_features/*. Move them to ./data/vas and ./data/vggsound such that the folder structure matches the structure of the demo files. By default, the scripts download BN Inception features; to download ResNet50 features, uncomment the corresponding lines in the ./download_*_features.sh scripts.

If you wish to download the parts manually, use the following URL templates:

  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vas/*.tar
  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vggsound/*.tar

Also, make sure to check the md5 sums provided in ./data/md5sum_vas.md5 and ./data/md5sum_vggsound.md5 along with file names.
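
If you prefer to verify the checksums in Python rather than with the md5sum utility, a small helper along these lines should work. It is not part of the repo and assumes the standard `<md5>  <filename>` format of the .md5 files; adjust the directory to wherever the downloaded .tar parts are stored.

import hashlib
from pathlib import Path

def md5(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            h.update(block)
    return h.hexdigest()

# assumption: the .tar parts were downloaded into ./data
for line in Path('./data/md5sum_vas.md5').read_text().splitlines():
    ref_md5, name = line.split()
    part = Path('./data') / name
    if part.exists():
        print(name, 'OK' if md5(part) == ref_md5 else 'MISMATCH')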

Note that we distribute the features for the VGGSound dataset in 64 parts. Each part holds ~3k clips and can be used independently as a subset of the whole dataset (the parts are not class-stratified, though).

Extract Features Manually

For BN Inception features, we employ the same procedure as RegNet.

For ResNet50 features, we rely on the video_features repository and used these commands:

# VAS (few hours on three 2080Ti)
strings=("dog" "fireworks" "drum" "baby" "gun" "sneeze" "cough" "hammer")
for class in "${strings[@]}"; do
    python main.py \
        --feature_type resnet50 \
        --device_ids 0 1 2 \
        --batch_size 86 \
        --extraction_fps 21.5 \
        --file_with_video_paths ./paths_to_mp4_${class}.txt \
        --output_path ./data/vas/features/${class}/feature_resnet50_dim2048_21.5fps \
        --on_extraction save_pickle
done

# VGGSound (6 days on three 2080Ti)
python main.py \
    --feature_type resnet50 \
    --device_ids 0 1 2 \
    --batch_size 86 \
    --extraction_fps 21.5 \
    --file_with_video_paths ./paths_to_mp4s.txt \
    --output_path ./data/vggsound/feature_resnet50_dim2048_21.5fps \
    --on_extraction save_pickle

Similar to BN Inception, we need to "tile" (cycle) a video if it is shorter than 10s. For ResNet50, we achieve this by tiling the resulting frame-level features up to 215 along the temporal dimension, e.g. as follows:

import pickle
import numpy as np

# `path`/`new_path`: input and output .pkl files; features are (num_frames, 2048)
feats = pickle.load(open(path, 'rb')).astype(np.float32)
# cycle the frame-level features along time and trim to exactly 215 frames
reps = 1 + (215 // feats.shape[0])
feats = np.tile(feats, (reps, 1))[:215, :]
with open(new_path, 'wb') as file:
    pickle.dump(feats, file)

Pretrained Models

Unpack the pre-trained models to ./logs/ directory.

Codebooks

| Trained on | Evaluated on | FID ↓ | Avg. MKL ↓ | Link / MD5SUM |
|---|---|---|---|---|
| VGGSound | VGGSound | 1.0 | 0.8 | 7ea229427297b5d220fb1c80db32dbc5 |
| VAS | VAS | 6.0 | 1.0 | 0024ad3705c5e58a11779d3d9e97cc8a |

Run Sampling Tool to see the reconstruction results for available data.

Transformers

The setting (a): the transformer is trained on VGGSound to sample from the VGGSound codebook:

| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|
| No Feats | – | 13.5 | 9.7 | 7.7 | b1f9bb63d831611479249031a1203371 |
| 1 Feat | BN Inception | 8.6 | 7.7 | 7.7 | f2fe41dab17e232bd94c6d119a807fee |
| 1 Feat | ResNet50 | 11.5* | 7.3* | 7.7 | 27a61d4b74a72578d13579333ed056f6 |
| 5 Feats | BN Inception | 9.4 | 7.0 | 7.9 | b082d894b741f0d7a1af9c2732bad70f |
| 5 Feats | ResNet50 | 11.3* | 7.0* | 7.9 | f4d7105811589d441b69f00d7d0b8dc8 |
| 212 Feats | BN Inception | 9.6 | 6.8 | 11.8 | 79895ac08303b1536809cad1ec9a7502 |
| 212 Feats | ResNet50 | 10.5* | 6.9* | 11.8 | b222cc0e7aeb419f533d5806a08669fe |

* – calculated on 1 sample per video of the test set instead of 10 samples per video as for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 10 samples per video, one might expect the values to improve a bit (~+0.1).

The setting (b): the transformer is trained on VAS to sample from the VGGSound codebook

| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|
| No Feats | – | 33.7 | 9.6 | 7.7 | e6b0b5be1f8ac551700f49d29cda50d7 |
| 1 Feat | BN Inception | 38.6 | 7.3 | 7.7 | a98a124d6b3613923f28adfacba3890c |
| 1 Feat | ResNet50 | 26.5* | 6.7* | 7.7 | 37cd48f06d74176fa8d0f27303841d94 |
| 5 Feats | BN Inception | 29.1 | 6.9 | 7.9 | 38da002f900fb81275b73e158e919e16 |
| 5 Feats | ResNet50 | 22.3* | 6.5* | 7.9 | 7b6951a33771ef527f1c1b1f99b7595e |
| 212 Feats | BN Inception | 20.5 | 6.0 | 11.8 | 1c4e56077d737677eac524383e6d98d3 |
| 212 Feats | ResNet50 | 20.8* | 6.2* | 11.8 | 6e553ea44c8bc7a3310961f74e7974ea |

* – calculated on 10 samples per video of the validation set instead of 100 samples per video as for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 100 samples per video, one might expect the values to improve a bit (~+0.1).

The setting (c): the transformer is trained on VAS to sample from the VAS codebook

| Condition | Features | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|
| No Feats | – | 28.7 | 9.2 | 7.6 | ea4945802094f826061483e7b9892839 |
| 1 Feat | BN Inception | 25.1 | 6.6 | 7.6 | 8a3adf60baa049a79ae62e2e95014ff7 |
| 1 Feat | ResNet50 | 25.1* | 6.3* | 7.6 | a7a1342030653945e97f68a8112ed54a |
| 5 Feats | BN Inception | 24.8 | 6.2 | 7.8 | 4e1b24207780eff26a387dd9317d054d |
| 5 Feats | ResNet50 | 20.9* | 6.1* | 7.8 | 78b8d42be19dd1b0a346b1f512967302 |
| 212 Feats | BN Inception | 25.4 | 5.9 | 11.6 | 4542632b3c5bfbf827ea7868cedd4634 |
| 212 Feats | ResNet50 | 22.6* | 5.8* | 11.6 | dc2b5cbd28ad98d2f9ca4329e8aa0f64 |

* – calculated on 10 samples per video of the validation set instead of 100 samples per video as for the rest. Evaluating a model on a larger number of samples per video is an expensive procedure. When evaluated on 100 samples per video, one might expect the values to improve a bit (~+0.1).

A transformer can also be trained to generate a spectrogram given a specific class. We also provide pre-trained models for all three settings:

| Setting | Codebook | Sampling for | FID ↓ | Avg. MKL ↓ | Sample Time ↓ | Link / MD5SUM |
|---|---|---|---|---|---|---|
| (a) | VGGSound | VGGSound | 7.8 | 5.0 | 7.7 | 98a3788ab973f1c3cc02e2e41ad253bc |
| (b) | VGGSound | VAS | 39.6 | 6.7 | 7.7 | 16a816a270f09a76bfd97fe0006c704b |
| (c) | VAS | VAS | 23.9 | 5.5 | 7.6 | 412b01be179c2b8b02dfa0c0b49b9a0f |

VGGish-ish, Melception, and MelGAN

These will be downloaded automatically during the first run. However, if you need them separately, here are the checkpoints:

  • VGGish-ish (1.54GB, 197040c524a07ccacf7715d7080a80bd) + Normalization Parameters (in /specvqgan/modules/losses/vggishish/data/)
  • Melception (0.27GB, a71a41041e945b457c7d3d814bbcf72d) + Normalization Parameters (in /specvqgan/modules/losses/vggishish/data/)
  • MelGAN

The reference performance of VGGish-ish and Melception:

| Model | Top-1 Acc | Top-5 Acc | mAP | mAUC |
|---|---|---|---|---|
| VGGish-ish | 34.70 | 63.71 | 36.63 | 95.70 |
| Melception | 44.49 | 73.79 | 47.58 | 96.66 |

Run Sampling Tool to see Melception and MelGAN in action.

Training

The training is done in two stages. First, a spectrogram codebook is trained. Second, a transformer is trained to sample from the codebook. The first and second stages can be trained on the same or separate datasets as long as the process of spectrogram extraction is the same.

Training a Spectrogram Codebook

To train a spectrogram codebook, we tried two datasets: VAS and VGGSound. We ran our experiments on a relatively expensive hardware setup with four 40GB NVidia A100s, but the models can also be trained on one 12GB NVidia 2080Ti with a smaller batch size. When training on four 40GB NVidia A100s, change the arguments to --gpus 0,1,2,3 and data.params.batch_size=8 for the codebook and =16 for the transformer. The training will hang a bit at steps 0, 2, 4, 8, ... because of the logging. If the folders with features and spectrograms are located elsewhere, the paths can be specified via the data.params.spec_dir_path, data.params.rgb_feats_dir_path, and data.params.flow_feats_dir_path arguments; use the same format as in the config file, e.g. notice the * in the path, which globs class folders.

# VAS Codebook
# mind the comma after `0,`
python train.py --base configs/vas_codebook.yaml -t True --gpus 0,
# or
# VGGSound codebook
python train.py --base configs/vggsound_codebook.yaml -t True --gpus 0,

Training a Transformer

A transformer (GPT-2) is trained to sample from the spectrogram codebook given a set of frame-level visual features.

VAS Transformer

# with the VAS codebook
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
# or with the VGGSound codebook which has 1024 codes
python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.transformer_config.params.GPT_config.vocab_size=1024 \
    model.params.first_stage_config.params.n_embed=1024 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-05-19T22-16-54_vggsound_codebook/checkpoints/epoch_39.ckpt

VGGSound Transformer

python train.py --base configs/vggsound_transformer.yaml -t True --gpus 0, \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-05-19T22-16-54_vggsound_codebook/checkpoints/epoch_39.ckpt

Controlling the Condition Size

The size of the visual condition is controlled by two arguments in the config file. feat_sample_size is the number of visual features resampled equidistantly from all available features (212), and block_size is the attention span. Make sure to use block_size = 53 * 5 + feat_sample_size; for instance, for feat_sample_size=212, block_size=477. However, the longer the condition, the more memory is required and the slower the sampling. By default, the configs use feat_sample_size=212 for VAS and 5 for VGGSound. Feel free to tweak it for your application, for example (see also the sanity check after the command):

python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    model.params.transformer_config.params.GPT_config.block_size=318 \
    data.params.feat_sampler_cfg.params.feat_sample_size=53 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt
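
As a quick sanity check of the block_size relation above (a sketch; here we read the 53 * 5 = 265 term as the fixed number of spectrogram tokens per clip, so the attention span grows only with the visual condition):

# block_size must cover the 265 spectrogram tokens plus the visual condition
for feat_sample_size in (1, 5, 53, 212):
    print(feat_sample_size, '->', 53 * 5 + feat_sample_size)
# 1 -> 266, 5 -> 270, 53 -> 318, 212 -> 477, e.g. 318 and 266 as used in this section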

The No Feats settings (without visual conditioning) are trained similarly to the settings with visual conditioning, except that the condition is replaced with random vectors. The optimal approach here is to use replace_feats_with_random=true along with feat_sample_size=1, for example (VAS):

python train.py --base configs/vas_transformer.yaml -t True --gpus 0, \
    data.params.replace_feats_with_random=true \
    model.params.transformer_config.params.GPT_config.block_size=266 \
    data.params.feat_sampler_cfg.params.feat_sample_size=1 \
    model.params.first_stage_config.params.ckpt_path=./logs/2021-06-06T19-42-53_vas_codebook/checkpoints/epoch_259.ckpt

Training VGGish-ish and Melception

We include all necessary files for training both vggishish and melception in ./specvqgan/modules/losses/vggishish. Run the training on a 12GB GPU as follows:

cd ./specvqgan/modules/losses/vggishish
# vggish-ish
python train_vggishish.py config=./configs/vggish.yaml device='cuda:0'
# melception
python train_melception.py config=./configs/melception.yaml device='cuda:1'

Training MelGAN

To train the vocoder, use this command:

cd ./vocoder
python scripts/train.py \
    --save_path ./logs/`date +"%Y-%m-%dT%H-%M-%S"` \
    --data_path /path/to/melspec_10s_22050hz \
    --batch_size 64

Evaluation

The evaluation is done in two steps. First, the samples are generated for each video. Second, the evaluation script is run. The sampling procedure supports multi-GPU, multi-node parallelization. We provide a multi-GPU command which can easily be adapted to a multi-node setup by setting --master_addr to your main machine's address and --node_rank to every worker's id (also see the sbatch script in ./evaluation/sbatch_sample.sh if you have a SLURM cluster at your disposal):

# Sample
python -m torch.distributed.launch \
    --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=62374 \
    --use_env \
        evaluation/generate_samples.py \
        sampler.config_sampler=evaluation/configs/sampler.yaml \
        sampler.model_logdir=$EXPERIMENT_PATH \
        sampler.splits=$SPLITS \
        sampler.samples_per_video=$SAMPLES_PER_VIDEO \
        sampler.batch_size=$SAMPLER_BATCHSIZE \
        sampler.top_k=$TOP_K \
        data.params.spec_dir_path=$SPEC_DIR_PATH \
        data.params.rgb_feats_dir_path=$RGB_FEATS_DIR_PATH \
        data.params.flow_feats_dir_path=$FLOW_FEATS_DIR_PATH \
        sampler.now=$NOW
# Evaluate
python -m torch.distributed.launch \
    --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=62374 \
    --use_env \
    evaluate.py \
        config=./evaluation/configs/eval_melception_${DATASET,,}.yaml \
        input2.path_to_exp=$EXPERIMENT_PATH \
        patch.specs_dir=$SPEC_DIR_PATH \
        patch.spec_dir_path=$SPEC_DIR_PATH \
        patch.rgb_feats_dir_path=$RGB_FEATS_DIR_PATH \
        patch.flow_feats_dir_path=$FLOW_FEATS_DIR_PATH \
        input1.params.root=$EXPERIMENT_PATH/samples_$NOW/$SAMPLES_FOLDER

The variables for the VAS dataset:

EXPERIMENT_PATH="./logs/<folder-name-of-vas-transformer-or-codebook>"
SPEC_DIR_PATH="./data/vas/features/*/melspec_10s_22050hz/"
RGB_FEATS_DIR_PATH="./data/vas/features/*/feature_rgb_bninception_dim1024_21.5fps/"
FLOW_FEATS_DIR_PATH="./data/vas/features/*/feature_flow_bninception_dim1024_21.5fps/"
SAMPLES_FOLDER="VAS_validation"
SPLITS="\"[validation, ]\""
SAMPLER_BATCHSIZE=4
SAMPLES_PER_VIDEO=10
TOP_K=64 # use TOP_K=512 when evaluating a VAS transformer trained with a VGGSound codebook
NOW=`date +"%Y-%m-%dT%H-%M-%S"`

The variables for the VGGSound dataset:

EXPERIMENT_PATH="./logs/<folder-name-of-vggsound-transformer-or-codebook>"
SPEC_DIR_PATH="./data/vggsound/melspec_10s_22050hz/"
RGB_FEATS_DIR_PATH="./data/vggsound/feature_rgb_bninception_dim1024_21.5fps/"
FLOW_FEATS_DIR_PATH="./data/vggsound/feature_flow_bninception_dim1024_21.5fps/"
SAMPLES_FOLDER="VGGSound_test"
SPLITS="\"[test, ]\""
SAMPLER_BATCHSIZE=32
SAMPLES_PER_VIDEO=1
TOP_K=512
NOW=`date +"%Y-%m-%dT%H-%M-%S"`

Sampling Tool

For interactive sampling, we rely on the Streamlit library. To start the streamlit server locally, run

# mind the trailing `--`
streamlit run --server.port 5555 ./sample_visualization.py --
# go to `localhost:5555` in your browser

or Open In Colab.

Alternatively, we provide a similar notebook in ./generation_demo.ipynb to play with the demo on a local machine.

The Neural Audio Codec Demo

Although the Spectrogram VQGAN was never designed to be a neural audio codec, it turned out to be highly effective for this task. We can employ our Spectrogram VQGAN, pre-trained on an open-domain dataset, as a neural audio codec without any changes.

If you wish to apply the SpecVQGAN for audio compression for arbitrary audio, please see our Google Colab demo: Open In Colab.

Integrated into Hugging Face Spaces using Gradio. See the demo: Hugging Face Spaces

Alternatively, we provide a similar notebook in ./neural_audio_codec_demo.ipynb to play with the demo on a local machine.
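
For reference, the codec use boils down to encoding a mel spectrogram into codebook indices and decoding it back. The sketch below is illustrative only and assumes a taming-transformers-style VQGAN interface (encode returning the code indices alongside the quantized latents, and decode); the notebooks above show the exact calls for this repo.

import torch

@torch.no_grad()
def spectrogram_roundtrip(codebook_model, mel):
    # mel: a (1, 1, 80, T) log-mel spectrogram tensor
    z_q, _, (_, _, indices) = codebook_model.encode(mel)  # quantized latents + integer code indices
    # a real codec would transmit only `indices` and look the vectors up
    # in the codebook on the receiver side before decoding
    print(f'{indices.numel()} codebook indices represent {mel.numel()} spectrogram values')
    return codebook_model.decode(z_q)  # reconstructed spectrogram; MelGAN then produces the waveform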

Citation

Our paper was accepted as an oral presentation at BMVC 2021. Please use this BibTeX entry if you would like to cite our work:

@InProceedings{SpecVQGAN_Iashin_2021,
  title={Taming Visually Guided Sound Generation},
  author={Iashin, Vladimir and Rahtu, Esa},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2021}
}

Acknowledgments

Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.

We also acknowledge the following codebases:

Comments
  • Issue running example with load_model()

    Issue running example with load_model()

    Hello!

    First of all thank you for this amazing work. I'll definitely use it.

    The thing is, I'm running into an issue when trying to initialize the algorithm. In the fourth cell of your notebook https://colab.research.google.com/drive/1pxTIMweAKApJZ3ZFqyBee3HtMqFpnwQ0?usp=sharing#scrollTo=FeJarWQuFQOT, when calling load_model() I get an error. I'll paste a bit of code here to illustrate:

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model_name = '2021-07-30T21-34-25_vggsound_transformer'
    log_dir = 'SpecVQGAN/logs'
    config, sampler, melgan, melception = load_model(model_name, log_dir, device)
    

    The model downloads and I get the warning:

    device  cuda                                                                                                                    
    Using: 2021-07-30T21-34-25_vggsound_transformer (5 ResNet50 Features)
    3.68GB [01:55, 31.7MB/s]                                                                                                        
    Unpacking SpecVQGAN/logs/2021-07-30T21-34-25_vggsound_transformer.tar.gz to SpecVQGAN/logs
    

    But seconds after the unpack I get this error:

    
    Traceback (most recent call last):
      File "/home/luis/Desktop/clip/scripts/test_both_libs.py", line 73, in <module>
        config, sampler, melgan, melception = load_model(model_name, log_dir, device)
      File "SpecVQGAN/feature_extraction/demo_utils.py", line 191, in load_model
        config = load_config(model_dir)
      File "SpecVQGAN/feature_extraction/demo_utils.py", line 178, in load_config
        if config.data.params[a] is not None:
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 373, in __getitem__
        self._format_and_raise(key=key, value=None, cause=e)
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/base.py", line 190, in _format_and_raise
        format_and_raise(
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
        _raise(ex, cause)
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/_utils.py", line 719, in _raise
        raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 367, in __getitem__
        return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
        node = self._get_node(key=key, throw_on_missing_key=True)
      File "/home/luis/anaconda3/envs/pytorch/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 470, in _get_node
        raise ConfigKeyError(f"Missing key {key}")
    omegaconf.errors.ConfigKeyError: Missing key flow_feats_dir_path
        full_key: data.params.flow_feats_dir_path
        object_type=dict
    
    

    I can post my conda list if needed; PyTorch is running with CUDA enabled and works fine with other projects (I guess). PyTorch 1.9.1.post3, NVIDIA RTX 3070, CUDA 11.2.

    Any idea what it might be? Thanks in advance

    opened by luisArandas 6
  • about training vocoder

    about training vocoder

    Hi, I have a question about training MelGAN. I find that when you train MelGAN, you normalize the audio data before converting it to a mel spectrogram, e.g. in the file vocoder/mel2wav/dataset.py:

    def load_wav_to_torch(self, full_path):
        data = np.load(full_path)
        data = 0.95 * normalize(data)

    I just want to know why you normalize it and then multiply by 0.95? After the normalization, is the extracted mel spectrogram the same as the original one? I mean, does this operation influence the results when we use the vocoder to convert the predicted spectrogram into a waveform?

    Furthermore, when I use your script vocoder/scripts/generate_from_folder.py to generate samples, I find it fails (the reconstructed audio is far from the original audio). After I modified it as follows, it works:

    def main():
        args = parse_args()
        vocoder = MelVocoder(args.load_path)

        args.save_path.mkdir(exist_ok=True, parents=True)

        for i, fname in tqdm(enumerate(args.folder.glob("*.wav"))):
            wavname = fname.name
            wav, sr = librosa.core.load(fname)
            data = 0.95 * normalize(wav)
            # wav = torch.from_numpy(wav).unsqueeze(0)
            # mel = vocoder(torch.from_numpy(wav)[None])
            mel = wav2mel(wav)
            # print('mel ', mel.shape)
            # assert 1==2
            recons = vocoder.inverse(mel).squeeze().cpu().numpy()

            librosa.output.write_wav(args.save_path / wavname, recons, sr=sr)
    
    opened by yangdongchao 3
  • Reconstruct mel spectrogram from librosa

    Reconstruct mel spectrogram from librosa

    Hello! First of all, thanks for this wonderful repo. I would just like to ask how to reconstruct the mel spectrogram I generated from librosa. I can do this via VQGAN using this code:

    def reconstruct_with_vqgan(x, model):
      z, _, [_, _, indices] = model.encode(x)
      xrec = model.decode(z)
      return xrec
    

    the xrec is the reconstructed image (from VQGAN)

    I also add a preprocessing step before reconstructing using this code (same one from DALL-E's VQVAE):

    def preprocess(img):
        s = min(img.size)

        if s < target_image_size:
            raise ValueError(f'min dim for image {s} < {target_image_size}')

        r = target_image_size / s
        s = (round(r * img.size[1]), round(r * img.size[0]))
        img = TF.resize(img, s, interpolation=PIL.Image.LANCZOS)
        # img = TF.center_crop(img, output_size=2 * [target_image_size])
        img = torch.unsqueeze(T.ToTensor()(img), 0)
        return img
    

    in the end i just call these 2 functions to reconstruct the image

    img = PIL.Image.open(image).convert("RGB") # input is the mel spectrogram in image form
    x_vqgan = preprocess(img)
    x_vqgan = x_vqgan.to(DEVICE)
      
    x2 = reconstruct_with_vqgan(x_vqgan, model32x32) # model32x32 is the VQGAN model
    x2 = custom_to_pil(x2[0]) # final reconstructed image 
    

    I was wondering how I could use your model instead to reconstruct in a similar way. I just checked the demo and saw that it extracts audio from the video. I'm wondering how I can directly reconstruct a mel spectrogram generated with librosa.

    Thank you very much in advance :D

    opened by clairerity 3
  • bending the re/de-constructed melspectrogram to create new sounds.

    bending the re/de-constructed melspectrogram to create new sounds.

    https://github.com/ciaua/unagan/issues/8

    image

    Is it possible? I want to take above visual and mash it around (change the shapes) to create new vocals....

    UPDATE: basically, I think I want to condition the SpecVQGAN on these images (not a video or a video frame per se).

    opened by johndpope 3
  • Loss becoming "nan" during codebook training?

    Loss becoming "nan" during codebook training?

    Hello! I was running codebook training on VAS, but for some reason I see the loss turning into nan after the first epoch. I was wondering if I may be doing something incorrectly? I used this command:

    python train.py --base configs/vas_codebook.yaml -t True --gpus 0,
    

    Here are the nans I see:

    Epoch 0: 51%|██████████████████████████████▉ | 78/154 [01:55<01:52, 1.48s/it, loss=nan, v_num=0, val/rec_loss_epoch=1.100, val/aeloss_epoch=1.140, train/aeloss_step=nan.0] Previous Epoch counts: [530, 0, 1, 0, 0, 0, 11, 45, 212, 1, 0, 49, 5, 0, 1, 0, 0, 0, 0, 4, 1, 48, 0, 17, 5, 201, 13, 5, 38, 0, 1, 287, 1370, 6, 3, 0, 0, 1, 0, 1, 58, 1, 3, 4, 228, 123, 0, 0, 15, 0, 0, 6 , 0, 0, 36, 39, 36, 1, 7, 0, 0, 4, 38, 3, 0, 1, 62, 147, 5, 0, 3, 9, 8, 0, 13, 80, 33, 40, 0, 20, 0, 104, 26, 0, 4, 14, 1, 0, 0, 129, 0, 0, 2, 4, 7, 0, 1, 1, 0, 0, 28, 33, 2, 83, 0, 0, 43, 4, 4, 0, 59, 11, 22, 17, 6, 0, 30, 219, 0, 6, 15, 4, 2, 0, 0, 2, 0, 8] Epoch 1: 51%|▌| 78/154 [01:08<01:06, 1.15it/s, loss=nan, v_num=0, val/rec_loss_epoch=nan.0, val/aeloss_epoch=nan.0, train/aeloss_step=nan.0, train/aeloss_epoch=nan.0, val/rec_loss_step=nan.0, val/aelo Previous Epoch counts: [41870, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0]

    Thank you very much!

    The loss goes to 'nan' when I load the correct ckpt, do you have this problem? I trained on the VAS dataset.

    Originally posted by @jwliu-cc in https://github.com/v-iashin/SpecVQGAN/issues/13#issuecomment-1049503111

    opened by jhyau 2
  • Cannot run evaluation

    Cannot run evaluation

    Hi, thanks for your code. I have generated audio according to your code, but I get an error in the last step, when I try to get the quality assessment. When I run evaluate.py, the following error occurs:

    Extracting features from input_1
    Traceback (most recent call last):
      File "/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/evaluate.py", line 231, in <module>
        main()
      File "/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/evaluate.py", line 194, in main
        featuresdict_1 = get_featuresdict(feat_extractor, device, cfg.input1, is_ddp, **cfg.extraction_cfg)
      File "/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/evaluate.py", line 60, in get_featuresdict
        input = get_dataset_class(dataset_cfg)
      File "/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/evaluate.py", line 48, in get_dataset_class
        dataset_class = instantiate_from_config(dataset_cfg)
      File "/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/train.py", line 130, in instantiate_from_config
        return get_obj_from_str(config['target'])(**config.get('params', dict()))
      File "/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/evaluation/datasets/fakes.py", line 41, in __init__
        super().__init__(root, loader, extensions=extensions, transform=transform,
      File "/root/anaconda3/envs/specvqgan/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 126, in __init__
        classes, class_to_idx = self._find_classes(self.root)
      File "/root/anaconda3/envs/specvqgan/lib/python3.8/site-packages/torchvision/datasets/folder.py", line 164, in _find_classes
        classes = [d.name for d in os.scandir(dir) if d.is_dir()]
    FileNotFoundError: [Errno 2] No such file or directory: '/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/logs/2022-01-30T16-28-24_vas_transformer/samples_2022-01-31T16-04-14/VAS_validation'

    I am very sure the path (/apdcephfs/share_1316500/donchaoyang/code3/SpecVQGAN/logs/2022-01-30T16-28-24_vas_transformer/samples_2022-01-31T16-04-14/VAS_validation) exists, and the directory holds the sampled audio mels from the previous stage. Because I want to evaluate the VAS dataset, this directory includes 8 different class subdirectories.

    opened by yangdongchao 2
  • Is the generated sound visually aligned?

    Is the generated sound visually aligned?

    Hi,

    First of all, congrats and really great work! While there are lots of audio examples, I haven't found any examples with videos so it is hard to tell. Since you have compared with RegNet which claimed to generate Visually Aligned Sound from Videos, I am just curious whether this work can also achieve that. Thank you.

    opened by sukun1045 2
  • Question about generating audio (longer than 10s)

    Question about generating audio (longer than 10s)

    Hi, thanks for sharing! I want to use the pre-trained model to generate audio longer than 10s, but I don't know how to achieve it. Could you please give me some suggestions? Thanks a lot!

    opened by albertwy 1
  • Training conditional transformer

    Training conditional transformer

    Hello,

    I am trying to understand these lines; could you elaborate further on the procedure of training the transformer here?

    # target includes all sequence elements (no need to handle first one
    # differently because we are conditioning)
    target = z_indices

    # in the case we do not want to encode condition anyhow (e.g. inputs are features)
    if isinstance(self.transformer, (GPTFeats, GPTClass, GPTFeatsClass)):
        # make the prediction
        logits, _, _ = self.transformer(z_indices[:, :-1], c)
        # cut off conditioning outputs - output i corresponds to p(z_i | z_{<i}, c)
        if isinstance(self.transformer, GPTFeatsClass):
            cond_size = c['feature'].size(-1) + c['target'].size(-1)
        else:
            cond_size = c.size(-1)
        logits = logits[:, cond_size-1:]
    

    Using the features and all of the indices what exactly are we trying to predict? Isn't the target all the z_indices that we are already giving to the transformer? Or are we just predicting the last z_index given the features and the previous z_indices?

    opened by radiradev 1
  • new environment.yml if it is possible?

    new environment.yml if it is possible?

    Dear author, the current yml has serious conflict problems when installing with conda. A number of packages seem unnecessary for running the project. Is it possible to upload a new, clean environment.yml? Thank you!

    opened by Allencheng97 1
  • Issues with the sampling script

    Issues with the sampling script

    Hi Vladimir, thanks for the great project / repo!

    I’m having issues with the sampling script. First, there seems to be an issue parsing the split, ie: SPLITS="\"[test, ]\"" The script crashes with the following error: https://gist.github.com/roudimit/5c76893a380b999508401820c3fb00e4 (put in a gist since it’s so long). I guess OmegaConf can't parse this environment variable: https://github.com/v-iashin/SpecVQGAN/blob/f209a5aa3a090552bca3d30a8f46dce01c40667a/evaluation/generate_samples.py#L32 I made a temporary workaround by setting SPLITS= and hardcoding the test set here https://github.com/v-iashin/SpecVQGAN/blob/f209a5aa3a090552bca3d30a8f46dce01c40667a/evaluation/generate_samples.py#L70 and here https://github.com/v-iashin/SpecVQGAN/blob/f209a5aa3a090552bca3d30a8f46dce01c40667a/evaluation/generate_samples.py#L90

    The next problem seems to be with the GPU assignment. Here is the error I get after hard coding the test set.

    Traceback (most recent call last):
      File "evaluation/generate_samples.py", line 303, in <module>
    Traceback (most recent call last):
      File "evaluation/generate_samples.py", line 303, in <module>
        main()
      File "evaluation/generate_samples.py", line 299, in main
        sample(local_rank, cfg, samples_split_dirs, is_ddp)
      File "evaluation/generate_samples.py", line 262, in sample
        main()
      File "evaluation/generate_samples.py", line 299, in main
        torch.cuda.set_device(device)
      File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
        sample(local_rank, cfg, samples_split_dirs, is_ddp)
      File "evaluation/generate_samples.py", line 262, in sample
        torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
        torch.cuda.set_device(device)
      File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process `9479` closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 9480) of binary: /home/gridsan/roudi/.conda/envs/specvqgan/bin/python
    

    I tried both the model that I trained and your model, ie EXPERIMENT_PATH="./logs/2021-07-30T21-34-25_vggsound_transformer". I also tried with 1 and 2 GPUs, and on different machines, and the same error comes up. I am using the specvqgan conda environment. Do you have any idea about this? Thanks!

    good first issue 
    opened by roudimit 3
  • About training Mel-GAN

    About training Mel-GAN

    Hi, I want to ask whether your code for training the Mel-GAN vocoder supports multiple GPUs? In your paper, you train on a single GPU for about 14 days, so I want to ask whether we can use multiple GPUs to decrease the training time.

    opened by yangdongchao 2
  • Issue with vggish checkpoint

    Issue with vggish checkpoint

    Hello.

    the vggishish_lpaps checkpoint is used here:

    • https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/specvqgan/modules/losses/lpaps.py#L35
    • https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/specvqgan/modules/losses/lpaps.py#L135

    Errors are ignored in the code, but neither lpaps, nor vggishish manage to load it.

    The checkpoint URL is here: https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/specvqgan/util.py#L8

    The vggish weights can be found under the 'model' key, but I cannot find the lpaps weights anywhere in here. Are they not required ?

    Best regards,

    opened by luc-leonard 9
  • report error when I use multiple GPUs

    report error when I use multiple GPUs

    python3 train.py --base vas_codebook.yaml -t True --gpus 0,1,

    When I try to run the code with two GPUs, it reports the error pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1] But your machine only has: []

    But if I only use GPU 0, no error happens. So I want to ask how to use multiple GPUs to train this code?

    opened by yangdongchao 4