MUSIC-AVQA, CVPR2022 (ORAL)

Related tags

Audio MUSIC-AVQA
Overview

Audio-Visual Question Answering (AVQA)

PyTorch code accompanies our CVPR 2022 paper:

Learning to Answer Questions in Dynamic Audio-Visual Scenarios (Oral Presentation)

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu

Resources: [Paper], [Supplementary], [Poster], [Video]

Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/


What's Audio-Visual Question Answering Task?

We focus on audio-visual question answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

MUSIC-AVQA Dataset

The large-scale MUSIC-AVQA dataset of musical performance, which contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All QA pairs types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.

  • QA examples

Model Overview

To solve the AVQA problem, we propose a spatio-temporal grounding model to achieve scene understanding and reasoning over audio and visual modalities. An overview of the proposed framework is illustrated in below figure.

Requirements

python3.6 +
pytorch1.6.0
tensorboardX
ffmpeg
numpy

Usage

  1. Clone this repo

    https://github.com/GeWu-Lab/MUSIC-AVQA_CVPR2022.git
  2. Download data

    Annotations (QA pairs, etc.)

    • Available for download at here
    • The annotation files are stored in JSON format. Each annotation file contains seven different keyword. And more detail see in Project Homepage

    Features

    • We use VGGish, ResNet18, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively.

    • The audio and visual features of videos in the MUSIC-AVQA dataset can be download from Baidu Drive (password: cvpr):

      • VGGish feature shape: [T, 128]  Download (112.7M)
      • ResNet18 feature shape: [T, 512]  Download (972.6M)
      • R(2+1)D feature shape: [T, 512]  Download (973.9M)
    • The features are in the ./data/feats folder.

    • 14x14 features, too large to share ... but we can extract from raw video frames.

    Download videos frames

    • Raw videos: Availabel at Baidu Drive (password: cvpr):.

      Note: Please move all downloaded videos to a folder, for example, create a new folder named MUSIC-AVQA-Videos, which contains 9,288 real videos and synthetic videos.

    • Raw video frames (1fps): Available at Baidu Drive (14.84GB) (password: cvpr).

    • Download raw videos in the MUSIC-AVQA dataset. The downloaded videos will be in the /data/video folder.

    • Pandas and ffmpeg libraries are required.

  3. Data pre-processing

    Extract audio waveforms from videos. The extracted audios will be in the ./data/audio folder. moviepy library is used to read videos and extract audios.

    python feat_script/extract_audio_cues/extract_audio.py	

    Extract video frames from videos. The extracted frames will be in the data/frames folder.

    python feat_script/extract_visual_frames/extract_frames_adaptive_script.py
  4. Feature extraction

    Audio feature. TensorFlow1.4 and VGGish pretrained on AudioSet is required. Feature file also can be found from here (password: cvpr).

    python feat_script/extract_audio_feat/audio_feature_extractor.py

    2D visual feature. Pretrained models library is required.

    python feat_script/eatract_visual_feat/extract_rgb_feat.py

    3D visual feature.

    python feat_script/eatract_visual_feat/extract_3d_feat.py

    14x14 visual feature.

    python feat_script/extract_visual_feat_14x14/extract_14x14_feat.py
  5. Baseline Model

    Training

    python net_grd_baseline/main_qa_grd_baseline.py --mode train

    Testing

    python net_grd_baseline/main_qa_grd_baseline.py --mode test
  6. Our Audio-Visual Spatial-Temporal Model

    We provide trained models and you can quickly test the results. Test results may vary slightly on different machines.

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Audio-Visual grounding generation

    python grounding_gen/main_grd_gen.py

    Training

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Testing

    python net_grd_avst/main_avst.py --mode test \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

Results

  1. Audio-visual video question answering results of different methods on the test set of MUSIC-AVQA. The top-2 results are highlighted. Please see the citations in the [Paper] for comparison methods.

  2. Visualized spatio-temporal grounding results

    We provide several visualized spatial grounding results. The heatmap indicates the location of sounding source. Through the spatial grounding results, the sounding objects are visually captured, which can facilitate the spatial reasoning.

    Firstly, ./grounding_gen/models_grd_vis/ should be created.

    python grounding_gen/main_grd_gen_vis.py

Citation

If you find this work useful, please consider citing it.


@ARTICLE{Li2022Learning,
  title	= {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author	= {Guangyao li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu},
  journal	= {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year	= {2022},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

License

This project is released under the GNU General Public License v3.0.

You might also like...
This is an OverPowered Vc Music Player! Will work for you and play music in Voice Chatz
This is an OverPowered Vc Music Player! Will work for you and play music in Voice Chatz

VcPlayer This is an OverPowered Vc Music Player! Will work for you and play music in Voice Chatz Telegram Voice-Chat Bot [PyTGCalls] ⇝ Requirements ⇜

This Is Telegram Music UserBot To Play Music Without Being Admin

This Is Telegram Music UserBot To Play Music Without Being Admin

Just-Music - Spotify API Driven Music Web app, that allows to listen and control and share songs

Just Music... Just Music Is A Web APP That Allows Users To Play Song Using Spoti

Xbot-Music - Bot Play Music and Video in Voice Chat Group Telegram
Xbot-Music - Bot Play Music and Video in Voice Chat Group Telegram

XBOT-MUSIC A Telegram Music+video Bot written in Python using Pyrogram and Py-Tg

Noinoi music is smoothly playing music on voice chat of telegram.
Noinoi music is smoothly playing music on voice chat of telegram.

NOINOI MUSIC BOT ✨ Features Music & Video stream support MultiChat support Playlist & Queue support Skip, Pause, Resume, Stop feature Music & Video do

XA Music Player - Telegram Music Bot
XA Music Player - Telegram Music Bot

XA Music Player Requirements 📝 FFmpeg (Latest) NodeJS nodesource.com (NodeJS 17+) Python (3.10+) PyTgCalls (Lastest) MongoDB (3.12.1) 2nd Telegram Ac

Python library for audio and music analysis

librosa A python package for music and audio analysis. Documentation See https://librosa.org/doc/ for a complete reference manual and introductory tut

music library manager and MusicBrainz tagger

beets Beets is the media library management system for obsessive music geeks. The purpose of beets is to get your music collection right once and for

Read music meta data and length of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA and Wave files with python 2 or 3

tinytag tinytag is a library for reading music meta data of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA and Wave files with python Install pip install tinytag

Comments
  • What's the function of the

    What's the function of the "pure" variable?

    Thanks for your great work! I'm reading the code, but have some issues about it. In net_avst.py, line 125, you assigned the audio_feat to audio_feat_pure, but the audio_feat haven't been changed before line 207. It seems that the "pure" variable doesn't work in fact. So it's just for indicating that the audio_feat used in Temporal Grounding Module is "pure"?

    In a word, will the rename operation influence the gradient flow? I noticed that the shape of audio_feat_pure is [B, T, C], and the shape of audio_feat is [B*T, C]. But the pointers of them are the same (in line 206 of net_avst.py). Maybe I can use audio_feat in line 206 directly.

    opened by Rainlt 0
  • Misalignment between audio and video frames

    Misalignment between audio and video frames

    Thank you so much for your fantastic work! I found a few videos shorter than 60s in your dataset. When using your frame extraction script to extract frames from a video in the 1fps manner, I could not get 60 frames, however, the shape of the corresponding audio feature was [60, 128] in vggish folder. It would be so grateful if you let me know how to align the audio and frames from the same video.

    opened by zailongchen 1
  • The size of extracted 14*14 features is not

    The size of extracted 14*14 features is not "[320, 512, 14, 14]"

    Hello, thank you for your great work, but I met some problems when I ran the code. I used your "extract_14x14_feat.py" to extract 14x14 visual feature, but the size of extracted "0000xxxx.npy" is [4, 512, 14, 14]. Therefore, when I ran the "main_avst.py" file, we met the problem "IndexError: index 5 is out of bounds for dimension 0 with size 4". I found the size of 'selected_image' in 'extract_14x14_feat.py' is [4, 3, 224, 224], how can I solve the problem and run the code successfully?

    opened by Huntersxsx 1
Owner
null
Okaeri-Music is a telegram music bot project, allow you to play music on voice chat group telegram.

Okaeri-Music is a telegram bot project that's allow you to play music on telegram voice chat group

Wahyusaputra 1 Dec 22, 2021
Okaeri-Music is a telegram music bot project, allow you to play music on voice chat group telegram.

??️ PROJECT MUSIC,THIS IS MAINTAINED Okaeri-Music is a telegram bot project that's allow you to play music on telegram voice chat group Features ?? Th

Okaeri-Project 2 Dec 23, 2021
NovaMusic is a music sharing robot. Users can get music and music lyrics using inline queries.

A music sharing telegram robot using Redis database and Telebot python library using Redis database.

Hesam Norin 7 Oct 21, 2022
Music player - endlessly plays your music

Music player First, if you wonder about what is supposed to be a music player or what makes a music player different from a simple media player, read

Albert Zeyer 482 Dec 19, 2022
Music player and music library manager for Linux, Windows, and macOS

Ex Falso / Quod Libet - A Music Library / Editor / Player Quod Libet is a music management program. It provides several different ways to view your au

Quod Libet 1.2k Jan 7, 2023
SU Music Player — The first open-source PyTgCalls based Pyrogram bot to play music in voice chats

SU Music Player — The first open-source PyTgCalls based Pyrogram bot to play music in voice chats Note Neither this, or PyTgCalls are fully

SU Projects 58 Jan 2, 2023
Oliva music bot help to play vc music

OLIVA V2 ?? Requirements ?? FFmpeg NodeJS nodesource.com Python 3.7+ PyTgCalls Commands ?? For all in group /play - reply to youtube url or song file

SOUL々H҉A҉C҉K҉E҉R҉ 2 Oct 22, 2021
A small project where I identify notes and key harmonies in a piece of music and use them further to recreate and generate the same piece of music through Python

A small project where I identify notes and key harmonies in a piece of music and use them further to recreate and generate the same piece of music through Python

null 5 Oct 7, 2022
We built this fully functioning Music player in Python. The music player allows you to play/pause and switch to different songs easily.

We built this fully functioning Music player in Python. The music player allows you to play/pause and switch to different songs easily.

null 1 Nov 19, 2021
Royal Music You can play music and video at a time in vc

Royals-Music Royal Music You can play music and video at a time in vc Commands SOON String STRING_SESSION Deployment ?? Credits • ??ᴏᴍʏᴀ⃝??ᴇᴇᴛ • ??ғғɪ

null 2 Nov 23, 2021