Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Overview

[Paper] [Project page]

This repository contains code for the paper:

Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. arXiv, 2018

Contents

This release includes code and models for:

  • On/off-screen source separation: separating the speech of an on-screen speaker from background sounds.
  • Blind source separation: audio-only source separation using a u-net and permutation-invariant training (PIT).
  • Sound source localization: visualizing the parts of a video that correspond to sound-making actions.
  • Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).

Setup

  • Install TensorFlow:

pip install tensorflow     # for CPU evaluation only
pip install tensorflow-gpu # for GPU support

We used TensorFlow version 1.8, which can be installed with:

pip install tensorflow-gpu==1.8

  • Install other Python dependencies:

pip install numpy matplotlib pillow scipy

  • Download the pretrained models and sample data:

./download_models.sh
./download_sample_data.sh

Pretrained audio-visual features

We provide the weights of our fused audio-visual network, whose features were learned through self-supervised training. Please see shift_example.py for a simple example that uses these pretrained features.
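
As a rough illustration of what using the features involves, the TF 1.x pattern of building the network and restoring a checkpoint looks like the sketch below; build_fused_net(), the tensor shapes, and the checkpoint path are assumptions, not the repository's actual API, and shift_example.py is the authoritative reference.

import tensorflow as tf

# Illustrative sketch only: build_fused_net() stands in for the network
# constructor used in shift_example.py, and the shapes and checkpoint path
# are assumptions; consult shift_example.py for the actual calls.
def run_pretrained_features(frames, samples, ckpt='../results/nets/shift/net.tf-650000'):
    ims = tf.placeholder(tf.float32, [1, 63, 224, 224, 3])  # video clip (frames, H, W, RGB)
    snd = tf.placeholder(tf.float32, [1, None, 2])          # stereo waveform samples
    feats = build_fused_net(ims, snd)                       # placeholder network builder
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, ckpt)
        return sess.run(feats, {ims: frames, snd: samples})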

Audio-visual source separation

To try the on/off-screen source separation model, run:

python sep_video.py ../data/translator.mp4 --model full --duration_mult 4 --out ../results/

This will separate the on-screen speaker's voice from that of an off-screen speaker. It writes the separated video files to ../results/ and also displays them in a local webpage for easier viewing. This produces the following videos (click to watch):

Input | On-screen | Off-screen

We can visually mask out one of the two on-screen speakers, thereby removing their voice:

python sep_video.py ../data/crossfire.mp4 --model full --mask l --out ../results/
python sep_video.py ../data/crossfire.mp4 --model full --mask r --out ../results/

This produces the following videos (click to watch):

Source | Left | Right

Blind (audio-only) source separation

This baseline trains a u-net model to minimize a permutation-invariant loss (PIT).

python sep_video.py ../data/translator.mp4 --model unet_pit --duration_mult 4 --out ../results/

The model will write the two separated streams in an arbitrary order.
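
For intuition, permutation-invariant training scores each possible assignment of predicted sources to reference sources and keeps only the cheapest one, which is why the output order is arbitrary. A minimal NumPy sketch of the two-source case follows (illustrative only; the repository's actual loss may differ in both the distance used and the representation it is computed on):

import numpy as np

def pit_loss_two_sources(pred_a, pred_b, ref_1, ref_2):
    """Permutation-invariant L1 loss for two separated sources.

    Both assignments of predictions to references are scored, and the
    minimum is returned, so the network may emit sources in any order.
    (Illustrative sketch; not the repository's implementation.)
    """
    def l1(x, y):
        return np.mean(np.abs(x - y))

    loss_identity = l1(pred_a, ref_1) + l1(pred_b, ref_2)
    loss_swapped = l1(pred_a, ref_2) + l1(pred_b, ref_1)
    return min(loss_identity, loss_swapped)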

Visualizing the locations of sound sources

To view the self-supervised network's class activation map (CAM), use the --cam flag:

python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/

This produces a video in which the CAM is overlaid as a heat map:
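
If you want to build a similar visualization yourself, overlaying a CAM amounts to upsampling the low-resolution activation map to the frame size and alpha-blending a colormapped version over each frame. Below is a generic sketch using the numpy/pillow/matplotlib dependencies installed above (this is not the repository's plotting code):

import numpy as np
from PIL import Image
import matplotlib.cm as cm

def overlay_cam(frame, cam, alpha=0.5):
    """Blend a class activation map over an RGB frame.

    frame: HxWx3 uint8 image; cam: 2D float array of activations (any size).
    Returns a uint8 image with a jet-colormapped heat map blended on top.
    """
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    cam_img = Image.fromarray(np.uint8(255 * cam)).resize(
        (frame.shape[1], frame.shape[0]), Image.BILINEAR)      # upsample to frame size
    heat = cm.jet(np.asarray(cam_img) / 255.0)[..., :3]        # colormapped heat map (RGB)
    blended = (1 - alpha) * frame / 255.0 + alpha * heat
    return np.uint8(255 * np.clip(blended, 0, 1))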

Action recognition and fine-tuning

We have provided example code for training an action recognition model (e.g. on the UCF-101 dataset) in videocls.py. This involves fine-tuning our pretrained audio-visual network. It is also possible to train this network with only visual data (no audio).
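
Conceptually, fine-tuning restores the self-supervised weights into the video trunk, attaches a fresh classification head, and trains with a small learning rate; videocls.py is the reference implementation. Here is a rough TF 1.x sketch under those assumptions (build_video_net() and the 5-D feature layout are placeholders, not the repository's API):

import tensorflow as tf

# Rough fine-tuning sketch; build_video_net() stands in for the pretrained
# 3D CNN trunk. See videocls.py for the actual training code.
def build_finetune_graph(ims, labels, num_classes):
    feats = build_video_net(ims, train=True)          # assumed output: [batch, time, h, w, channels]
    restorer = tf.train.Saver(tf.global_variables())  # captures only trunk variables (head not built yet)

    pooled = tf.reduce_mean(feats, axis=[1, 2, 3])    # global average pool over time and space
    logits = tf.layers.dense(pooled, num_classes, name='action_logits')
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)  # small learning rate for fine-tuning
    return loss, train_op, restorer

Before the training loop, call restorer.restore(sess, checkpoint_path) so the trunk starts from the self-supervised weights while only the new classification head is trained from scratch.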

Citation

If you use this code in your research, please consider citing our paper:

@article{multisensory2018,
  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Owens, Andrew and Efros, Alexei A},
  journal={arXiv preprint arXiv:1804.03641},
  year={2018}
}

Updates

  • 11/08/18: Fixed a bug in the class activation map example code. Added TensorFlow 1.9 compatibility.

Acknowledgements

Our u-net code draws from this implementation of pix2pix.

Comments
  • TypeError: convolution() got multiple values for argument 'weights_regularizer'

    I got an error like this; what happened? Please help me fix it.

    Traceback (most recent call last):
      File "D:/Workspace/PythonProjects/studyProjects/multisensory/src/sep_video.py", line 398, in
        ret = run(arg.vid_file, t, arg.clip_dur, pr, gpus[0], mask = arg.mask, arg = arg, net = net)
      File "D:/Workspace/PythonProjects/studyProjects/multisensory/src/sep_video.py", line 294, in run
        net.init()
      File "D:/Workspace/PythonProjects/studyProjects/multisensory/src/sep_video.py", line 42, in init
        pr, reuse = False, train = False)
      File "D:\Workspace\PythonProjects\studyProjects\multisensory\src\sourcesep.py", line 953, in make_net
        vid_net_full = shift_net.make_net(ims, sfs, pr, None, reuse, train)
      File "D:\Workspace\PythonProjects\studyProjects\multisensory\src\shift_net.py", line 419, in make_net
        sf_net = conv2d(sf_net,num_outputs= 64, kernel_size= [65, 1], scope = 'sf/conv1_1', stride = [4, 1], padding='SAME', reuse = reuse) # by lg 8.20
      File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 183, in func_with_args
        return func(*args, **current_args)
      File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1154, in convolution2d
        conv_dims=2)
      File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 183, in func_with_args
        return func(*args, **current_args)
    TypeError: convolution() got multiple values for argument 'weights_regularizer'

    opened by chouqin3 5
  • RuntimeError: Command failed! ffmpeg -i "/tmp/ao_wmjz0ezg.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_i2pwi0b8.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "results/fg_translator.mp4"

    python sep_video.py data/translator.mp4 --model unet_pit --duration_mult 4 --out results/ Start time: 0.0 GPU = 0 Spectrogram samples: 512 (8.298, 8.288) 100.0% complete, total time: 0:00:00. 0:00:00 per iteration. (11:29 AM Tue) Struct(alg=sourcesep, augment_audio=False, augment_ims=True, augment_rms=False, base_lr=0.0001, batch_size=24, bn_last=True, bn_scale=True, both_videos_in_batch=False, cam=False, check_iters=1000, crop_im_dim=224, dilate=False, do_shift=False, dset_seed=None, fix_frame=False, fps=29.97, frame_length_ms=64, frame_sample_delta=74.5, frame_step_ms=16, freq_len=1024, full_im_dim=256, full_model=False, full_samples_len=105000, gamma=0.1, gan_weight=0.0, grad_clip=10.0, im_split=False, im_type=jpeg, init_path=None, init_type=shift, input_rms=0.14142135623730953, l1_weight=1.0, log_spec=True, loss_types=['pit'], model_path=results/nets/sep/unet-pit/net.tf-160000, mono=False, multi_shift=False, net_style=no-im, normalize_rms=True, num_dbs=None, num_samples=173774, opt_method=adam, pad_stft=False, phase_type=pred, phase_weight=0.01, pit_weight=1.0, predict_bg=True, print_iters=10, profile_iters=None, resdir=/home/study/PycharmProjects/results/nets/sep/unet-pit, samp_sr=21000.0, sample_len=None, sampled_frames=248, samples_per_frame=700.7007007007007, show_iters=None, show_videos=False, slow_check_iters=10000, spec_len=512, spec_max=80.0, spec_min=-100.0, step_size=120000, subsample_frames=None, summary_iters=10, test_batch=10, test_list=../data/celeb-tf-v6-full/test/tf, total_frames=149, train_iters=160000, train_list=../data/celeb-tf-v6-full/train/tf, use_3d=True, use_sound=True, use_wav_gan=False, val_list=../data/celeb-tf-v6-full/val/tf, variable_frame_count=False, vid_dur=8.288, weight_decay=1e-05) ffmpeg -loglevel error -ss 0.0 -i "data/translator.mp4" -safe 0 -t 8.338000000000001 -r 29.97 -vf scale=256:256 "/tmp/tmpw4889ppn/small_%04d.png" ffmpeg -loglevel error -ss 0.0 -i "data/translator.mp4" -safe 0 -t 8.338000000000001 -r 29.97 -vf "scale=-2:'min(600,ih)'" "/tmp/tmpw4889ppn/full_%04d.png" ffmpeg -loglevel error -ss 0.0 -i "data/translator.mp4" -safe 0 -t 8.338000000000001 -ar 21000.0 -ac 2 "/tmp/tmpw4889ppn/sound.wav" Running on: 2019-05-14 11:29:30.212532: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2019-05-14 11:29:30.329825: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-05-14 11:29:30.330229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62 pciBusID: 0000:01:00.0 totalMemory: 7.77GiB freeMemory: 7.19GiB 2019-05-14 11:29:30.330244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0 2019-05-14 11:29:30.547596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-05-14 11:29:30.547627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 2019-05-14 11:29:30.547632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N 2019-05-14 11:29:30.547797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6920 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, 
compute capability: 7.5) Raw spec length: [1, 514, 1025] Truncated spec length: [1, 512, 1025] ('gen/conv1', [1, 512, 1024, 2], '->', [1, 512, 512, 64]) ('gen/conv2', [1, 512, 512, 64], '->', [1, 512, 256, 128]) ('gen/conv3', [1, 512, 256, 128], '->', [1, 256, 128, 256]) ('gen/conv4', [1, 256, 128, 256], '->', [1, 128, 64, 512]) ('gen/conv5', [1, 128, 64, 512], '->', [1, 64, 32, 512]) ('gen/conv6', [1, 64, 32, 512], '->', [1, 32, 16, 512]) ('gen/conv7', [1, 32, 16, 512], '->', [1, 16, 8, 512]) ('gen/conv8', [1, 16, 8, 512], '->', [1, 8, 4, 512]) ('gen/conv9', [1, 8, 4, 512], '->', [1, 4, 2, 512]) ('gen/deconv1', [1, 4, 2, 512], '->', [1, 8, 4, 512]) ('gen/deconv2', [1, 8, 4, 1024], '->', [1, 16, 8, 512]) ('gen/deconv3', [1, 16, 8, 1024], '->', [1, 32, 16, 512]) ('gen/deconv4', [1, 32, 16, 1024], '->', [1, 64, 32, 512]) ('gen/deconv5', [1, 64, 32, 1024], '->', [1, 128, 64, 512]) ('gen/deconv6', [1, 128, 64, 1024], '->', [1, 256, 128, 256]) ('gen/deconv7', [1, 256, 128, 512], '->', [1, 512, 256, 128]) ('gen/deconv8', [1, 512, 256, 256], '->', [1, 512, 512, 64]) ('gen/fg', [1, 512, 512, 128], '->', [1, 512, 1024, 2]) ('gen/bg', [1, 512, 512, 128], '->', [1, 512, 1024, 2]) Restoring from: results/nets/sep/unet-pit/net.tf-160000 predict samples shape: (1, 173774, 2) samples pred shape: (1, 173774, 2) (512, 1025) Writing to: results/ ffmpeg -i "/tmp/ao_wmjz0ezg.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_i2pwi0b8.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "results/fg_translator.mp4" [wav @ 0x558b3f868b40] Estimating duration from bitrate, this may be inaccurate [wav @ 0x558b3f868b40] Could not find codec parameters for stream 0 (Audio: none, 1065353216 Hz, 16256 channels, 9481256 kb/s): unknown codec Consider increasing the value for the 'analyzeduration' and 'probesize' options Unknown encoder 'h264' Traceback (most recent call last): File "sep_video.py", line 455, in ut.make_video(full_ims, pr.fps, pj(arg.out, 'fg%s.mp4' % name), snd(full_samples_fg)) File "/home/study/PycharmProjects/untitled/util.py", line 3176, in make_video % (sound_flags_in, fps, input_file, sound_flags_out, flags, out_fname)) File "/home/study/PycharmProjects/untitled/util.py", line 917, in sys_check fail('Command failed! %s' % cmd) File "/home/study/PycharmProjects/untitled/util.py", line 14, in fail def fail(s = ''): raise RuntimeError(s) RuntimeError: Command failed! ffmpeg -i "/tmp/ao_wmjz0ezg.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_i2pwi0b8.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "results/fg_translator.mp4"

    I ran into this problem while running the code as described above. What is happening here? I am running the code with Python 3. Thank you for your prompt reply!

    opened by ghost 4
  • RuntimeError: Command failed! ffmpeg -i "/tmp/ao_M0QAze.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_cnpblR.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "../results/fg_cam_translator.mp4"

    Hello, thanks for the script. When I run the following command to visualize the locations of sound sources: python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/, I get an error:

    Start time: 0.0 GPU = 0 Spectrogram samples: 128 2.145 2.135 100.0% complete, total time: 0:00:00. 0:00:00 per iteration. (01:57 PM Fri) Struct(alg=sourcesep, augment_audio=False, augment_ims=True, augment_rms=False, base_lr=0.0001, batch_size=6, bn_last=True, bn_scale=True, both_videos_in_batch=True, cam=False, check_iters=1000, crop_im_dim=224, dilate=False, do_shift=False, dset_seed=None, fix_frame=False, fps=29.97, frame_length_ms=64, frame_sample_delta=74, frame_step_ms=16, freq_len=1024, full_im_dim=256, full_model=False, full_samples_len=105000, gamma=0.1, gan_weight=0.0, grad_clip=10.0, im_split=False, im_type=jpeg, init_path=../results/nets/shift/net.tf-650000, init_type=shift, input_rms=0.141421356237, l1_weight=1.0, log_spec=True, loss_types=['fg-bg'], model_path=../results/nets/sep/full/net.tf-160000, mono=False, multi_shift=False, net_style=full, normalize_rms=True, num_dbs=None, num_samples=44144, opt_method=adam, pad_stft=False, phase_type=pred, phase_weight=0.01, pit_weight=0.0, predict_bg=True, print_iters=10, profile_iters=None, resdir=/multisensory-master/results/nets/sep/full, samp_sr=21000.0, sample_len=None, sampled_frames=63, samples_per_frame=700.700700701, show_iters=None, show_videos=False, slow_check_iters=10000, spec_len=128, spec_max=80.0, spec_min=-100.0, step_size=120000, subsample_frames=None, summary_iters=10, test_batch=10, test_list=../data/celeb-tf-v6-full/test/tf, total_frames=149, train_iters=160000, train_list=../data/celeb-tf-v6-full/train/tf, use_3d=True, use_sound=True, use_wav_gan=False, val_list=../data/celeb-tf-v6-full/val/tf, variable_frame_count=False, vid_dur=2.135, weight_decay=1e-05) ffmpeg -loglevel error -ss 0.0 -i "../data/translator.mp4" -safe 0 -t 2.185 -r 29.97 -vf scale=256:256 "/tmp/tmpVEitNC/small_%04d.png" ffmpeg -loglevel error -ss 0.0 -i "../data/translator.mp4" -safe 0 -t 2.185 -r 29.97 -vf "scale=-2:'min(600,ih)'" "/tmp/tmpVEitNC/full_%04d.png" ffmpeg -loglevel error -ss 0.0 -i "../data/translator.mp4" -safe 0 -t 2.185 -ar 21000.0 -ac 2 "/tmp/tmpVEitNC/sound.wav" Running on: /gpu:0 2018-06-15 13:57:11.657961: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2018-06-15 13:57:12.523259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: Tesla K40m major: 3 minor: 5 memoryClockRate(GHz): 0.745 pciBusID: 0000:02:00.0 totalMemory: 11.92GiB freeMemory: 11.84GiB 2018-06-15 13:57:12.523316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:02:00.0, compute capability: 3.5) Raw spec length: [1, 128, 1025] Truncated spec length: [1, 128, 1025] bn scale: True arg_scope train = False sf/conv1_1 -> [1, 11036, 1, 64] sf/conv2_1_short -> [1, 690, 1, 128] sf/conv2_1_1 -> [1, 690, 1, 128] sf/conv2_1_2 -> [1, 690, 1, 128] sf/conv3_1_1 -> [1, 173, 1, 128] sf/conv3_1_2 -> [1, 173, 1, 128] sf/conv4_1_short -> [1, 44, 1, 256] sf/conv4_1_1 -> [1, 44, 1, 256] sf/conv4_1_2 -> [1, 44, 1, 256] im/conv1 -> [1, 32, 112, 112, 64] before: [1, 63, 224, 224, 3] pool -> [1, 32, 56, 56, 64] im/conv2_1_1 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64] im/conv2_1_2 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64] pool -> [1, 16, 28, 28, 64] im/conv2_2_1 -> [1, 16, 28, 28, 64] before: [1, 32, 56, 56, 64] im/conv2_2_2 -> [1, 16, 28, 28, 64] before: [1, 16, 28, 28, 64] frac: 2.6875 sf/conv5_1 -> [1, 16, 
1, 128] sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256] im/merge1 -> [1, 16, 28, 28, 512] before: [1, 16, 28, 28, 192] im/merge2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 512] im/conv3_1_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_1_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_2_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_2_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv4_1_short -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128] im/conv4_1_1 -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128] im/conv4_1_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] im/conv4_2_1 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] im/conv4_2_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] time_stride = 1 im/conv5_1_short -> [1, 8, 7, 7, 512] before: [1, 8, 14, 14, 256] im/conv5_1_1 -> [1, 8, 7, 7, 512] before: [1, 8, 14, 14, 256] im/conv5_1_2 -> [1, 8, 7, 7, 512] before: [1, 8, 7, 7, 512] im/conv5_2_1 -> [1, 8, 7, 7, 512] before: [1, 8, 7, 7, 512] im/conv5_2_2 -> [1, 8, 7, 7, 512] before: [1, 8, 7, 7, 512] joint/logits -> [1, 1, 1, 1, 1] before: [1, 1, 1, 1, 512] joint/logits -> [1, 8, 7, 7, 1] before: [1, 8, 7, 7, 512] gen/conv1 [1, 128, 1024, 2] -> [1, 128, 512, 64] gen/conv2 [1, 128, 512, 64] -> [1, 128, 256, 128] gen/conv3 [1, 128, 256, 128] -> [1, 64, 128, 256] Video net before merge: [1, 16, 1, 64] After: [1, 64, 1, 64] gen/conv4 [1, 64, 128, 320] -> [1, 32, 64, 512] Video net before merge: [1, 16, 1, 128] After: [1, 32, 1, 128] gen/conv5 [1, 32, 64, 640] -> [1, 16, 32, 512] Video net before merge: [1, 8, 1, 512] After: [1, 16, 1, 512] gen/conv6 [1, 16, 32, 1024] -> [1, 8, 16, 512] gen/conv7 [1, 8, 16, 512] -> [1, 4, 8, 512] gen/conv8 [1, 4, 8, 512] -> [1, 2, 4, 512] gen/conv9 [1, 2, 4, 512] -> [1, 1, 2, 512] gen/deconv1 [1, 1, 2, 512] -> [1, 2, 4, 512] gen/deconv2 [1, 2, 4, 1024] -> [1, 4, 8, 512] gen/deconv3 [1, 4, 8, 1024] -> [1, 8, 16, 512] gen/deconv4 [1, 8, 16, 1024] -> [1, 16, 32, 512] gen/deconv5 [1, 16, 32, 1536] -> [1, 32, 64, 512] gen/deconv6 [1, 32, 64, 1152] -> [1, 64, 128, 256] gen/deconv7 [1, 64, 128, 576] -> [1, 128, 256, 128] gen/deconv8 [1, 128, 256, 256] -> [1, 128, 512, 64] gen/fg [1, 128, 512, 128] -> [1, 128, 1024, 2] gen/bg [1, 128, 512, 128] -> [1, 128, 1024, 2] Restoring from: ../results/nets/sep/full/net.tf-160000 predict samples shape: (1, 44144, 2) samples pred shape: (1, 44144, 2) (128, 1025) Running on: 0 2018-06-15 13:57:18.753499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:02:00.0, compute capability: 3.5) bn scale: False arg_scope train = True sf/conv1_1 -> [1, 11036, 1, 64] sf/conv2_1_short -> [1, 690, 1, 128] sf/conv2_1_1 -> [1, 690, 1, 128] sf/conv2_1_2 -> [1, 690, 1, 128] sf/conv3_1_1 -> [1, 173, 1, 128] sf/conv3_1_2 -> [1, 173, 1, 128] sf/conv4_1_short -> [1, 44, 1, 256] sf/conv4_1_1 -> [1, 44, 1, 256] sf/conv4_1_2 -> [1, 44, 1, 256] im/conv1 -> [1, 32, 112, 112, 64] before: [1, 63, 224, 224, 3] pool -> [1, 32, 56, 56, 64] im/conv2_1_1 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64] im/conv2_1_2 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64] pool -> [1, 16, 28, 28, 64] im/conv2_2_1 -> [1, 16, 28, 28, 64] before: [1, 32, 56, 56, 64] im/conv2_2_2 -> [1, 16, 28, 28, 64] before: [1, 16, 28, 28, 64] frac: 2.6875 sf/conv5_1 -> [1, 16, 1, 128] sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256] im/merge1 -> [1, 16, 
28, 28, 512] before: [1, 16, 28, 28, 192] im/merge2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 512] im/conv3_1_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_1_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_2_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_2_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv4_1_short -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128] im/conv4_1_1 -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128] im/conv4_1_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] im/conv4_2_1 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] im/conv4_2_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] time_stride = 1 im/conv5_1_short -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256] im/conv5_1_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256] im/conv5_1_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512] im/conv5_2_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512] im/conv5_2_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512] joint/logits -> [1, 1, 1, 1, 1] before: [1, 1, 1, 1, 512] joint/logits -> [1, 8, 14, 14, 1] before: [1, 8, 14, 14, 512] bn scale: False arg_scope train = True sf/conv1_1 -> [1, 11036, 1, 64] sf/conv2_1_short -> [1, 690, 1, 128] sf/conv2_1_1 -> [1, 690, 1, 128] sf/conv2_1_2 -> [1, 690, 1, 128] sf/conv3_1_1 -> [1, 173, 1, 128] sf/conv3_1_2 -> [1, 173, 1, 128] sf/conv4_1_short -> [1, 44, 1, 256] sf/conv4_1_1 -> [1, 44, 1, 256] sf/conv4_1_2 -> [1, 44, 1, 256] im/conv1 -> [1, 32, 112, 112, 64] before: [1, 63, 224, 224, 3] pool -> [1, 32, 56, 56, 64] im/conv2_1_1 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64] im/conv2_1_2 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64] pool -> [1, 16, 28, 28, 64] im/conv2_2_1 -> [1, 16, 28, 28, 64] before: [1, 32, 56, 56, 64] im/conv2_2_2 -> [1, 16, 28, 28, 64] before: [1, 16, 28, 28, 64] frac: 2.6875 sf/conv5_1 -> [1, 16, 1, 128] sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256] im/merge1 -> [1, 16, 28, 28, 512] before: [1, 16, 28, 28, 192] im/merge2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 512] im/conv3_1_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_1_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_2_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv3_2_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128] im/conv4_1_short -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128] im/conv4_1_1 -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128] im/conv4_1_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] im/conv4_2_1 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] im/conv4_2_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256] time_stride = 1 im/conv5_1_short -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256] im/conv5_1_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256] im/conv5_1_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512] im/conv5_2_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512] im/conv5_2_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512] joint/logits -> [1, 1, 1, 1, 1] before: [1, 1, 1, 1, 512] joint/logits -> [1, 8, 14, 14, 1] before: [1, 8, 14, 14, 512] Writing to: ../results/ ffmpeg -i "/tmp/ao_M0QAze.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_cnpblR.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "../results/fg_cam_translator.mp4" Guessed Channel Layout for Input Stream #0.0 : mono [concat @ 0x382d700] DTS -230584300921369 < 0 out of order [h264_v4l2m2m @ 0x385f500] Could not find a valid device 
[h264_v4l2m2m @ 0x385f500] can't configure encoder Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height Traceback (most recent call last): File "sep_video.py", line 442, in ut.make_video(full_ims, pr.fps, pj(arg.out, 'fg%s.mp4' % name), snd(full_samples_fg)) File "/multisensory-master/src/aolib/util.py", line 3169, in make_video % (sound_flags_in, fps, input_file, sound_flags_out, flags, out_fname)) File "/multisensory-master/src/aolib/util.py", line 915, in sys_check fail('Command failed! %s' % cmd) File "/multisensory-master/src/aolib/util.py", line 12, in fail def fail(s = ''): raise RuntimeError(s) RuntimeError: Command failed! ffmpeg -i "/tmp/ao_M0QAze.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_cnpblR.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "../results/fg_cam_translator.mp4"

    I want to know what went wrong and what I should do. Any suggestions would be appreciated! Thanks.

    opened by xsingit 4
  • Issue on Large Videos

    Thanks for the great paper. When working with large videos, the maximum separation duration is 4 minutes. Can separation be applied to the whole video at once?

    opened by ChaitanyaBoggavarapu 2
  • What are feats['im_0'] and feats['im_1'] of example for shift model?

    Hello, in read_example() of shift_dset.py, I saw:

    feats['im_0'] = tf.FixedLenFeature([], dtype=tf.string)
    feats['im_1'] = tf.FixedLenFeature([], dtype=tf.string)

    What are im_0 and im_1?

    Thank you.

    opened by ruizewang 2
  • How do I run source separation on a different video?

    I get this when I run it on my video:

    Writing to: ../results/
    Traceback (most recent call last):
      File "sep_video.py", line 442, in <module>
        ut.make_video(full_ims, pr.fps, pj(arg.out, 'fg%s.mp4' % name), snd(full_samples_fg))
      File "~/multisensory/src/aolib/util.py", line 3156, in make_video
        write_ims = (type(im_fnames[0]) != type(''))
    IndexError: list index out of range
    

    Do I have to run something else before sep_video.py?

    opened by jayavanth 2
  • Question about the original audio waveform input

    Hi Owens, thanks for your contributions! In your paper, you said you applied a series of strided 1D convolutions to the input waveform. So the input waveform you refer to here (before fusion) is the original audio signal waveform without an STFT, right? Why and how do you process the 1D signal? Could you kindly explain this point for me?

    opened by luhuijun666 0
  • Could you provide the dataset?

    Hello, thanks for your great work! I want to reproduce your work, but I don't see where the dataset is provided. Could you please share your dataset? Thanks again.

    opened by ruizewang 0
  • What is the format of the tensor in the code?

    Hello, is it (batch_size, channel, depth, height, width) or (b, d, h, w, c) or something else? I'm new to TensorFlow and it confuses me a lot. Thanks.

    opened by tuffr5 0
  • difference between "large" and "full" sep models

    Hi Andrew,

    Thanks for publicly releasing your code and models.

    • Could you please tell me the difference between "large" and "full" models for separation?

    • Have you released a model corresponding to "Large-scale training" (Sec. 6.3 in the paper)? Does the large model refer to this?

    Thanks, Sanjeel

    opened by sanjeelparekh 0
  • Download sample-data.zip NOT FOUND

    Hi,

    I am having trouble downloading sample-data.zip since it seems that the link is broken; I get a Not Found error when running the .sh file. Any chance you could provide the correct link?

    Thank you!

    opened by elinaoikonomaki 1
  • Download pretrained models

    Hello Andrew,

    When I run 'download_models.sh', it fails and returns a Not Found error. Has the link to the pretrained models changed? If so, could you give me the correct link? Thanks a lot!

    opened by WikiChao 3
  • Question about the test in Table 2 GRID transfer

    Hello Andrew, I have one small question about how to run your model on the GRID dataset. The audio clips in GRID are shorter than 2 s, and I find that the model in "/results/nets/sep/full/" can't run on videos shorter than 2.135 s. How did you conduct the GRID transfer experiments?

    opened by ruizewang 0
  • Improvement on using pretrained model

    Thanks for the great paper. I am trying to use the pretrained model, but my results are not great. Could you please suggest prerequisites (e.g., video quality, audio quality, sampling rate)? I am working with recorded videos that contain only two speakers.

    opened by ChaitanyaBoggavarapu 0