VIsually-Pivoted Audio and(N) Text

Yän.PnG

Last update: Nov 4, 2022

Related tags

Deep Learning vipant

Overview

VIP-ANT: VIsually-Pivoted Audio and(N) Text

Code for the paper Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer.

Data

AudioSet can be downloaded and preprocessed via this tool.

Vision-Audio (VA) Pre-training

Check out the running script bash/run_bimodal_va.sh.

Audio-Text (AT) Fine-tuning

Check out the running script bash/run_bimodal_at.sh.

Dependencies

Dockerfile defines the minimum dependencies of the repo.

Citing VIP-ANT

@misc{vip-ant,
      title={Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer},
      author={Yanpeng Zhao and Jack Hessel and Youngjae Yu and Ximing Lu and Rowan Zellers and Yejin Choi},
      url={https://arxiv.org/abs/2112.08995},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      eprint={2112.08995},
      year={2021},
}

License

MIT

You might also like...

[CVPR 2021] Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach This is the repo to host the dataset TextSeg and code for TexRNe

174 Dec 19, 2022

a reccurrent neural netowrk that when trained on a peice of text and fed a starting prompt will write its on 250 character text using LSTM layers

RNN-Playwrite a reccurrent neural netowrk that when trained on a peice of text and fed a starting prompt will write its on 250 character text using LS

1 Oct 29, 2021

Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Dec 28, 2022

We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Multi-Modal Self-Supervision using GDT and StiCa This is an official pytorch implementation of papers: Multi-modal Self-Supervision from Generalized D

42 Dec 9, 2022

Classifying audio using Wavelet transform and deep learning

Audio Classification using Wavelet Transform and Deep Learning A step-by-step tutorial to classify audio signals using continuous wavelet transform (C

17 Nov 29, 2022

The description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts.

FMFCC-A This project is the description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts. The FMFCC-A dataset is shared through BaiduCl

2 Oct 20, 2021

Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation Introduction Getting Started FSD50K Recipe AudioSet Recipe Label E

84 Dec 27, 2022

OpenL3: Open-source deep audio and image embeddings

OpenL3 OpenL3 is an open-source Python library for computing deep audio and image embeddings. Please refer to the documentation for detailed instructi

Music and Audio Research Laboratory - NYU

326 Jan 2, 2023

Visualizer using audio and semantic analysis to explore BigGAN (Brock et al., 2018) latent space.

BigGAN Audio Visualizer Description This visualizer explores BigGAN (Brock et al., 2018) latent space by using pitch/tempo of an audio file to generat

2 Nov 21, 2022

Comments

Getting text/audio embeddings (and their gradients) from the pretrained models.

Hi!

First of all, awesome work! I'd like to be able to load a pretrained model so I can obtain text and audio embeddings (and their gradients). Could you help me figure out how to do that?

Thanks! :)

opened by hugofloresgarcia 3

Owner

Yän.PnG

GitHub https://arxiv.org/abs/2112.08995

Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

ResDAVEnet-VQ Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech What is in this repo? M

21 Aug 23, 2022

GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

Guidedog Authors: Kyuhee Jo, Steven Gunarso, Jacky Wang, Raghav Sharma GuideDog is an AI/ML-based mobile app designed to assist the lives of the visua

5 Nov 24, 2021

PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner [Li et al., 2020].

VGPL-Visual-Prior PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner (VGPL). Give

8 Dec 29, 2022

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation This is a demo implementation of BYOL for Audio (BYOL-A), a self-sup

160 Jan 4, 2023

Deep Text Search is an AI-powered multilingual text search and recommendation engine with state-of-the-art transformer-based multilingual text embedding (50+ languages).

Deep Text Search - AI Based Text Search & Recommendation System Deep Text Search is an AI-powered multilingual text search and recommendation engine w

19 Sep 29, 2022

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)

TAP: Text-Aware Pre-training TAP: Text-Aware Pre-training for Text-VQA and Text-Caption by Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Flo

61 Nov 14, 2022

Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

SwinTextSpotter This is the pytorch implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text R

183 Jan 3, 2023

AugLy is a data augmentations library that currently supports four modalities (audio, image, text & video) and over 100 augmentations

AugLy is a data augmentations library that currently supports four modalities (audio, image, text & video) and over 100 augmentations. Each modality’s augmentations are contained within its own sub-library. These sub-libraries include both function-based and class-based transforms, composition operators, and have the option to provide metadata about the transform applied, including its intensity.

4.6k Jan 9, 2023

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 2, 2023

A repository for the updated version of CoinRun used to collect MUGEN, a multimodal video-audio-text dataset.

A repository for the updated version of CoinRun used to collect MUGEN, a multimodal video-audio-text dataset. This repo contains scripts to train RL agents to navigate the closed world and collect video data.

11 Oct 22, 2022

VIsually-Pivoted Audio and(N) Text

Related tags

Overview

VIP-ANT: VIsually-Pivoted Audio and(N) Text

Data

Vision-Audio (VA) Pre-training

Audio-Text (AT) Fine-tuning

Dependencies

Citing VIP-ANT

License

You might also like...

[CVPR 2021] Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

a reccurrent neural netowrk that when trained on a peice of text and fed a starting prompt will write its on 250 character text using LSTM layers

Data manipulation and transformation for audio signal processing, powered by PyTorch

We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Classifying audio using Wavelet transform and deep learning

The description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts.

Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

OpenL3: Open-source deep audio and image embeddings

Visualizer using audio and semantic analysis to explore BigGAN (Brock et al., 2018) latent space.

Comments

Getting text/audio embeddings (and their gradients) from the pretrained models.

Owner

Yän.PnG

Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner [Li et al., 2020].

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Deep Text Search is an AI-powered multilingual text search and recommendation engine with state-of-the-art transformer-based multilingual text embedding (50+ languages).

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)

Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

AugLy is a data augmentations library that currently supports four modalities (audio, image, text & video) and over 100 augmentations

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

A repository for the updated version of CoinRun used to collect MUGEN, a multimodal video-audio-text dataset.