Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Descript

Last update: Dec 13, 2022

Related tags

Deep Learning lyrebird-wav2clip

Overview

Wav2CLIP

🚧 WIP 🚧

Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP 📄 🔗

Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.

Installation

pip install wav2clip

Usage

Clip-Level Embeddings

import wav2clip

model = wav2clip.get_model()
embeddings = wav2clip.embed_audio(audio, model)

Frame-Level Embeddings

import wav2clip

model = wav2clip.get_model(frame_length=16000, hop_length=16000)
embeddings = wav2clip.embed_audio(audio, model)

Comments

request of projection layer weight

Hi @hohsiangwu , Thanks for great work! Request pre-trained weights of image_transform (MLP layer) for audio-image-language joint embedding space.

Currently, only audio encoders seem to exist in the get_model function. Is there any big problem if I use CLIP embedding (text or image) without projection layer?

opened by SeungHeonDoh 2
Initial checkin for accessing pre-trained model via pip install

I am considering using the release feature of GitHub to host model weights, once the url is added to MODEL_WEIGHTS_URL, and the repository is made public, we should be able to model = torch.hub.load('descriptinc/lyrebird-wav2clip', 'wav2clip', pretrained=True)

opened by hohsiangwu 1
Adding VQGAN-CLIP with modification to generate audio
Adding a working snapshot of original generate.py from https://github.com/nerdyrodent/VQGAN-CLIP/

Modify to add audio related params and functions

Add scripts to generate image and video with options for conditioning and interpolation
opened by hohsiangwu 0
Supervised scenario no transform

In the supervise scenario in the __init__.py the transform flag is not set to True, so the model doesn't contain the MLP layer after training. I'm wondering how you train the MLP layer when using as pretrained.

opened by alirezadir 0
Integrated into VQGAN+CLIP 3D Zooming notebook

Dear researchers,

I integrated Wav2CLIP into a VQGAN+CLIP animation notebook.

It is available on colab here: https://colab.research.google.com/github/pollinations/hive/blob/main/notebooks/2%20Text-To-Video/1%20CLIP-Guided%20VQGAN%203D%20Turbo%20Zoom.ipynb

I'm part of a team creating an open-source generative art platform called Pollinations.AI. It's also possible to use through our frontend if you are interested. https://pollinations.ai/p/QmT7yt67DF3GF4wd2vyw6bAgN3QZx7Xpnoyx98YWEsEuV7/create

Here is an example output: https://user-images.githubusercontent.com/5099901/168467451-f633468d-e596-48f5-8c2c-2dc54648ead3.mp4

opened by voodoohop 0
The details concerning loading raw audio files

Hi !

I haved imported the wave2clip as a package, however when testing, the inputs for the model to extract features are not original audio files. Thus can you provided the details to load the audio files to processed data for the model?

opened by jinx2018 0
torch version

Hi, thanks for sharing the wonderful work! I encountered some issues during pip installing it, so may I ask what is the torch version you used? I cannot find the requirement of this project. Thanks!

opened by annahung31 0
Error when importing after fresh installation on colab

What CUDA and Python versions have you tested the pip package in? After installation on a fresh collab I receive the following error:

OSError Traceback (most recent call last) in () ----> 1 import wav2clip

7 frames /usr/local/lib/python3.7/dist-packages/wav2clip/init.py in () 2 import torch 3 ----> 4 from .model.encoder import ResNetExtractor 5 6

/usr/local/lib/python3.7/dist-packages/wav2clip/model/encoder.py in () 4 from torch import nn 5 ----> 6 from .resnet import BasicBlock 7 from .resnet import ResNet 8

/usr/local/lib/python3.7/dist-packages/wav2clip/model/resnet.py in () 3 import torch.nn as nn 4 import torch.nn.functional as F ----> 5 import torchaudio 6 7

/usr/local/lib/python3.7/dist-packages/torchaudio/init.py in () ----> 1 from torchaudio import _extension # noqa: F401 2 from torchaudio import ( 3 compliance, 4 datasets, 5 functional,

/usr/local/lib/python3.7/dist-packages/torchaudio/_extension.py in () 25 26 ---> 27 _init_extension()

/usr/local/lib/python3.7/dist-packages/torchaudio/_extension.py in _init_extension() 19 # which depends on libtorchaudio and dynamic loader will handle it for us. 20 if path.exists(): ---> 21 torch.ops.load_library(path) 22 torch.classes.load_library(path) 23 # This import is for initializing the methods registered via PyBind11

/usr/local/lib/python3.7/dist-packages/torch/_ops.py in load_library(self, path) 108 # static (global) initialization code in order to register custom 109 # operators with the JIT. --> 110 ctypes.CDLL(path) 111 self.loaded_libraries.add(path) 112

/usr/lib/python3.7/ctypes/init.py in init(self, name, mode, handle, use_errno, use_last_error) 362 363 if handle is None: --> 364 self._handle = _dlopen(self._name, mode) 365 else: 366 self._handle = handle

OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

opened by janzuiderveld 0

Releases(v0.1.0-alpha)

v0.1.0-alpha(Oct 5, 2021)

pre-release v0.1.0-alpha
Source code(tar.gz)
Source code(zip)
Wav2CLIP.pt(46.69 MB)

Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Related tags

Overview

Wav2CLIP

Installation

Usage

Clip-Level Embeddings

Frame-Level Embeddings

Comments

request of projection layer weight

Initial checkin for accessing pre-trained model via pip install

Adding VQGAN-CLIP with modification to generate audio

Supervised scenario no transform

Integrated into VQGAN+CLIP 3D Zooming notebook

The details concerning loading raw audio files

torch version

Error when importing after fresh installation on colab

Releases(v0.1.0-alpha)

v0.1.0-alpha(Oct 5, 2021)

Owner

Descript

FuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space OptimizationFuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space Optimization

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Code for the paper: Learning Adversarially Robust Representations via Worst-Case Mutual Information Maximization (https://arxiv.org/abs/2002.11798)

Official Pytorch implementation of the paper "MotionCLIP: Exposing Human Motion Generation to CLIP Space"

[ICCV'21] Official implementation for the paper Social NCE: Contrastive Learning of Socially-aware Motion Representations

Official PyTorch implemention of our paper "Learning to Rectify for Robust Learning with Noisy Labels".

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Adversarial-Information-Bottleneck - Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck (NeurIPS21)

[PyTorch] Official implementation of CVPR2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency". https://arxiv.org/abs/2103.05465

The official implementation of our CVPR 2021 paper - Hybrid Rotation Averaging: A Fast and Robust Rotation Averaging Approach

The official implementation of the IEEE S&P`22 paper "SoK: How Robust is Deep Neural Network Image Classification Watermarking".

The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution.

Official Pytorch implementation of "Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes", CVPR 2022

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Tensorflow 2 implementation of the paper: Learning and Evaluating Representations for Deep One-class Classification published at ICLR 2021

PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

Official PyTorch implementation of Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations

Official PyTorch implementation of BlobGAN: Spatially Disentangled Scene Representations