Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

Overview

AST: Audio Spectrogram Transformer

Introduction

Illustration of AST.

This repository contains the official implementation (in PyTorch) of the Audio Spectrogram Transformer (AST) proposed in the Interspeech 2021 paper AST: Audio Spectrogram Transformer (Yuan Gong, Yu-An Chung, James Glass).

AST is the first convolution-free, purely attention-based model for audio classification which supports variable length input and can be applied to various tasks. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2. For details, please refer to the paper and the ISCA SIGML talk.

Please have a try! AST can be used with a few lines of code, and we also provide recipes to reproduce the SOTA results on AudioSet, ESC-50, and Speechcommands with almost one click.

The AST model file is in src/models/ast_models.py and the recipes are in egs/[audioset,esc50,speechcommands]/run.sh. When you run run.sh, it calls src/run.py, which in turn calls src/dataloader.py and src/traintest.py, which then call src/models/ast_models.py.

Citing

Please cite our paper(s) if you find this repository useful. The first paper proposes the Audio Spectrogram Transformer while the second paper describes the training pipeline that we applied on AST to achieve the new state-of-the-art on AudioSet.

@article{gong2021ast,  
 title={Ast: Audio spectrogram transformer}, 
 author={Gong, Yuan and Chung, Yu-An and Glass, James}, 
 journal={arXiv preprint arXiv:2104.01778}, 
 year={2021}}  
@article{gong2021psla,  
 title={PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation}, 
 author={Gong, Yuan and Chung, Yu-An and Glass, James}, 
 journal={arXiv preprint arXiv:2102.01243}, 
 year={2021}}  

Getting Started

Step 1. Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.

cd ast/ 
python3 -m venv venvast
source venvast/bin/activate
pip install -r requirements.txt 

Step 2. Test the AST model.

ASTModel(label_dim=527, \
         fstride=10, tstride=10, \
         input_fdim=128, input_tdim=1024, \
         imagenet_pretrain=True, audioset_pretrain=False, \
         model_size='base384')

Parameters:
label_dim : The number of classes (default:527).
fstride: The stride of patch splitting on the frequency dimension. For 16*16 patches, fstride=16 means no overlap and fstride=10 means an overlap of 6 (used in the paper). (default: 10)
tstride: The stride of patch splitting on the time dimension. For 16*16 patches, tstride=16 means no overlap and tstride=10 means an overlap of 6 (used in the paper). (default: 10)
input_fdim: The number of frequency bins of the input spectrogram. (default:128)
input_tdim: The number of time frames of the input spectrogram. (default:1024, i.e., 10.24s)
imagenet_pretrain: If True, use an ImageNet-pretrained model. (default: True; we recommend setting it to True for all tasks.)
audioset_pretrain: If True, use a model pretrained on full AudioSet and ImageNet. Currently only the base384 model with fstride=tstride=10 is supported. (default: False; we recommend setting it to True for all tasks except AudioSet.)
model_size: The model size of AST, should be in [tiny224, small224, base224, base384] (default: base384).

cd ast/src
python
import os 
import torch
from models import ASTModel 
# download pretrained model in this directory
os.environ['TORCH_HOME'] = '../pretrained_models'  
# assume each input spectrogram has 100 time frames
input_tdim = 100
# assume the task has 527 classes
label_dim = 527
# create a pseudo input: a batch of 10 spectrograms, each with 100 time frames and 128 frequency bins
test_input = torch.rand([10, input_tdim, 128]) 
# create an AST model
ast_mdl = ASTModel(label_dim=label_dim, input_tdim=input_tdim, imagenet_pretrain=True)
test_output = ast_mdl(test_input) 
# output should be in shape [10, 527], i.e., 10 samples, each with prediction of 527 classes. 
print(test_output.shape)  

ESC-50 Recipe

The ESC-50 recipe is in ast/egs/esc50/run_esc.sh. The script will automatically download the ESC-50 dataset and resample it to 16kHz, then run standard 5-fold cross-validation and report the result. The recipe was tested on 4 GTX TITAN GPUs with 12GB memory. The result is saved in ast/egs/esc50/exp/yourexpname/acc_fold.csv (the accuracy of folds 1-5 and the averaged accuracy); you can also check details in result.csv and best_result.csv (accuracy, AUC, loss, etc. of each epoch / the best epoch). We attached our log file in ast/egs/esc50/test-esc50-f10-t10-p-b48-lr1e-5; the model achieves 95.75% accuracy.

To run the recipe, simply comment out . /data/sls/scratch/share-201907/slstoolchainrc in ast/egs/esc50/run_esc.sh, adjust the path if needed, and run:

cd ast/egs/esc50
(slurm user) sbatch run_esc.sh
(local user) ./run_esc.sh

Speechcommands V2 Recipe

The Speechcommands recipe is in ast/egs/speechcommands/run_sc.sh. The script will automatically download the Speechcommands V2 dataset, train an AST model on the training set, validate it on the validation set, and evaluate it on the test set. The recipe was tested on 4 GTX TITAN GPUs with 12GB memory. The result is saved in ast/egs/speechcommands/exp/yourexpname/eval_result.csv in the format [val_acc, val_AUC, eval_acc, eval_AUC]; you can also check details in result.csv (accuracy, AUC, loss, etc. of each epoch). We attached our log file in ast/egs/speechcommands/test-speechcommands-f10-t10-p-b128-lr2.5e-4-0.5-false; the model achieves 98.12% accuracy.

To run the recipe, simply comment out . /data/sls/scratch/share-201907/slstoolchainrc in ast/egs/speechcommands/run_sc.sh, adjust the path if needed, and run:

cd ast/egs/speechcommands
(slurm user) sbatch run_sc.sh
(local user) ./run_sc.sh

Audioset Recipe

AudioSet is a little more complex: you will need to prepare your data json files (i.e., train_data.json and eval_data.json) yourself, because the raw waveforms of AudioSet are not released and you need to download them on your own. We have put a sample json file in ast/egs/audioset/data/datafiles; please generate files in the same format (you can also refer to ast/egs/esc50/prep_esc50.py and ast/egs/speechcommands/prep_sc.py). Please keep the label codes consistent with ast/egs/audioset/data/class_labels_indices.csv.
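
For illustration, the snippet below writes a datafile of this kind in Python; the field names and label codes shown are assumptions, so please double-check them against the sample json in ast/egs/audioset/data/datafiles before using it.

import json

# hypothetical example: each entry points to a wav file and a comma-separated
# string of label codes ("mid"s) that must match class_labels_indices.csv
datafile = {
    "data": [
        {"wav": "/your_dataset/audio_00001.wav", "labels": "/m/09x0r,/m/05zppz"},
        {"wav": "/your_dataset/audio_00002.wav", "labels": "/m/03m9d0z"}
    ]
}
with open("./data/datafiles/train_data.json", "w") as f:
    json.dump(datafile, f, indent=1)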

Once you have the json files, you will need to generate the sampling weight file of your training data (please check our PSLA paper to see why it is needed).

cd ast/egs/audioset
python gen_weight_file.py ./data/datafiles/train_data.json
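
For intuition only (the recipe's gen_weight_file.py is the authoritative version): a common way to build such weights for a long-tailed, multi-label set is inverse label frequency, so that clips containing rare labels are sampled more often. The formula and output filename below are illustrative assumptions, not the repo's.

import json
import numpy as np

with open('./data/datafiles/train_data.json') as f:
    data = json.load(f)['data']

# count how often each label code occurs in the training set
label_count = {}
for sample in data:
    for lab in sample['labels'].split(','):
        label_count[lab] = label_count.get(lab, 0) + 1

# weight each clip by the summed inverse frequency of its labels (illustrative formula)
weights = np.array([sum(1.0 / label_count[lab] for lab in s['labels'].split(','))
                    for s in data])
np.savetxt('./data/datafiles/train_data_weight.csv', weights, delimiter=',')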

Then you just need to change the tr_data and te_data in /ast/egs/audioset/run.sh and run:

cd ast/egs/audioset
(slurm user) sbatch run.sh
(local user) ./run.sh

You should get a model that achieves 0.448 mAP (without weight averaging) and 0.459 mAP (with weight averaging). This is the best single model reported in the paper. The result of each epoch is saved in ast/egs/audioset/exp/yourexpname/result.csv in the format [mAP, mAUC, precision, recall, d_prime, train_loss, valid_loss, cum_mAP, cum_mAUC, lr], where the cum_ results are the checkpoint-ensemble results (i.e., averaging the predictions of the checkpoint models of each epoch; please check our PSLA paper for details). The result of the weight-averaged model is saved in wa_result.csv in the format [mAP, AUC, precision, recall, d-prime]. We attached our log file in ast/egs/audioset/test-full-f10-t10-pTrue-b12-lr1e-5/; the model achieves 0.459 mAP.
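
Weight averaging here means averaging the parameters of the checkpoints saved at different epochs; the recipe already produces the weight-averaged result (wa_result.csv) for you. The sketch below is only an illustration of the idea, and the checkpoint paths and epoch range are hypothetical placeholders.

import torch

def average_checkpoints(ckpt_paths):
    # parameter-wise average of several saved state_dicts
    avg = None
    for path in ckpt_paths:
        sd = torch.load(path, map_location='cpu')
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k in avg:
                avg[k] += sd[k].float()
    return {k: v / len(ckpt_paths) for k, v in avg.items()}

# e.g., average hypothetical checkpoints saved at epochs 6-25, then load them into the model
# avg_sd = average_checkpoints(['exp/yourexpname/models/audio_model.%d.pth' % e for e in range(6, 26)])
# audio_model.load_state_dict(avg_sd)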

In order to reproduce the ensemble results of 0.475 mAP and 0.485 mAP, please train 3 models with the same setting (i.e., repeat the above three times) and 6 models with different tstride and fstride settings, respectively, and average the outputs of the models. Please refer to ast/egs/audioset/ensemble.py. We attached our ensemble logs in /ast/egs/audioset/exp/ensemble-s.log and ensemble-m.log. You can use our pretrained models (see below) to test the ensemble results.
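
Output-level ensembling itself is just averaging the models' predictions. Below is a minimal sketch of the idea, assuming you already have each model's per-clip sigmoid predictions and the binary targets as NumPy arrays; it is an illustration, not a drop-in replacement for ensemble.py.

import numpy as np
from sklearn.metrics import average_precision_score

def ensemble_map(per_model_preds, targets):
    # per_model_preds: list of [num_samples, num_classes] prediction arrays (one per model)
    # targets: [num_samples, num_classes] binary label array
    avg_pred = np.mean(np.stack(per_model_preds, axis=0), axis=0)
    ap_per_class = [average_precision_score(targets[:, c], avg_pred[:, c])
                    for c in range(targets.shape[1])]
    return float(np.mean(ap_per_class))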

Pretrained Models

We provide full AudioSet pretrained models.

  1. Full AudioSet, 10 tstride, 10 fstride, with Weight Averaging (0.459 mAP)
  2. Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 1 (0.450 mAP)
  3. Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 2 (0.448 mAP)
  4. Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 3 (0.448 mAP)
  5. Full AudioSet, 12 tstride, 12 fstride, without Weight Averaging, Model (0.447 mAP)
  6. Full AudioSet, 14 tstride, 14 fstride, without Weight Averaging, Model (0.443 mAP)
  7. Full AudioSet, 16 tstride, 16 fstride, without Weight Averaging, Model (0.442 mAP)

Ensembling models 2-4 achieves 0.475 mAP and ensembling models 2-7 achieves 0.485 mAP. You can download these models with one click using ast/egs/audioset/download_models.sh. Once you have downloaded the models, you can try ast/egs/audioset/ensemble.py; you need to change the eval_data_path and mdl_list to run it. We attached our ensemble logs in /ast/egs/audioset/exp/ensemble-s.log and ensemble-m.log.

If you want to finetune the AudioSet-pretrained AST model on your task, you can simply set audioset_pretrain=True when you create the AST model; it will automatically download model 1 (0.459 mAP). In our ESC-50 recipe, AudioSet pretraining is used.
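
For example, a minimal sketch of creating an AudioSet-pretrained AST for a downstream task (run from ast/src as in the Getting Started example; the class count and input length below are placeholders for your own task):

import torch
from models import ASTModel

input_tdim = 512   # number of time frames of your spectrograms
ast_mdl = ASTModel(label_dim=10,            # your number of classes
                   input_tdim=input_tdim,
                   imagenet_pretrain=True,
                   audioset_pretrain=True)  # downloads the 0.459 mAP model on first use
dummy = torch.rand([2, input_tdim, 128])    # [batch, time frames, frequency bins]
print(ast_mdl(dummy).shape)                 # expected: torch.Size([2, 10])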

Contact

If you have a question, please bring up an issue (preferred) or send me an email [email protected].

Comments
  • Error reshaping positional embedding for AudioSet pretrained model

    This error only occurs when using the AudioSet pretrained model - does not occur using only ImageNet pretrained. Audio is resampled to 16k Hz. Error occurs in src/models/ast_models.py - since t_dim > 101, else block on line 139 is triggered.

    Traceback (most recent call last):
      File "train.py", line 73, in <module>
        model = VTN(**vars(cfg))
    [REDACTED - model call internally]
      File "/[REDACTED]/ast_models.py", line 141, in __init__
        new_pos_embed = new_pos_embed.reshape(1, 768, num_patches).transpose(1, 2)
    RuntimeError: shape '[1, 768, 120]' is invalid for input of size 221184
    

    Parameters to "AstModel" instantiation:

    label_dim: 400
    input_tdim: 251
    input_fdim: 64
    audioset_pretrain: True
    
    bug 
    opened by devksingh4 12
  • About random noise which author put on the speech command.

    Hello Mr. Gong, thank you for your wonderful work; it really helps me a lot. I have a few questions about the random noise you inject into the Speech Commands dataset. We can see that the random noise is useful for Speech Commands; may I know the deeper reason? Why is it so useful for short audio?

    There is also the code fbank = torch.roll(fbank, np.random.randint(-10, 10), 0) at line 209 of dataloader.py. Why is this code used to add the random noise, and where did the inspiration for it come from?

    Looking forward to your reply.

    question 
    opened by poult-lab 11
  • How to use the model for a downstream task ?

    Hi Yuan, thanks so much for open-sourcing the code and sharing the recipes. I am trying to use the model in my own training pipeline and, per your suggestions in the "read-me", I am using the following:

    • imagenet = True
    • audionet = False
    • label_dim=1<----binary classification
    • fstride=10
    • tstride=10
    • input_fdim=128

    For input_tdim, I am using max() * 100 of the audio duration. My input tensor is of the form [1, 901, 128], but I am running into errors; the model fails during the forward pass with the following error:

    RuntimeError: The size of tensor a (1070) must match the size of tensor b (7070) at non-singleton dimension 1. The full stack trace is below:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_318/1697194335.py in <module>
          1 model.cuda()
    ----> 2 y = model(spec)

    /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102             return forward_call(*input, **kwargs)
       1103         # Do not call functions when jit is used
       1104         full_backward_hooks, non_full_backward_hooks = [], []

    /opt/conda/lib/python3.8/site-packages/torch/autocast_mode.py in decorate_autocast(*args, **kwargs)
        196     def decorate_autocast(*args, **kwargs):
        197         with self:
    --> 198             return func(*args, **kwargs)
        199     return decorate_autocast

    /tmp/ipykernel_318/3758517193.py in forward(self, x)
        213         dist_token = self.v.dist_token.expand(B, -1, -1)
        214         x = torch.cat((cls_tokens, dist_token, x), dim=1)
    --> 215         x = x + self.v.pos_embed
        216         x = self.v.pos_drop(x)
        217         for blk in self.v.blocks:

    RuntimeError: The size of tensor a (1070) must match the size of tensor b (7070) at non-singleton dimension 1

    Do you have any suggestions for me? I'd appreciate your inputs. Thanks, Devesh

    bug 
    opened by devesh-k 9
  • Inference time mismatch errors ?

    Hello, I trained the base384-sized AST model on my own dataset. During training there were no errors, but when I tried to run inference and load from a checkpoint, an error arose.

    RuntimeError: Error(s) in loading state_dict for DataParallel: size mismatch for module.v.pos_embed: copying a param with shape torch.Size([1, 602, 768]) from checkpoint, the shape in current model is torch.Size([1, 1214, 768]).

    What could be wrong with this error?

    bug 
    opened by Enescigdem 9
  • Training ESC50 on constrained GPU resources

    Hi Yuan, thanks for the amazing research and well documented work.

    I am trying to train the ESC50 AST model on a 4GB GPU. Unfortunately, I run out of memory with the default setup from ./run_esc.sh. I saw previous comments on other issues and I learned that you recommend reducing batch_size, fstride and tstride. I have applied only the following changes to run_esc.sh:

    batch_size=8 
    fstride=16
    tstride=16
    

    However, when executing the script, I see the following error, which seems to be related to the change to the strides, not to batch_size:

    ---------------AST Model Summary---------------
    ImageNet pretraining: True, AudioSet pretraining: True
    frequncey stride=16, time stride=16
    number of patches=256
    Traceback (most recent call last):
      File "/user/i/iran/asroman/ast/egs/esc50/../../src/run.py", line 89, in <module>
        audio_model = models.ASTModel(label_dim=args.n_class, fstride=args.fstride, tstride=args.tstride, input_fdim=128,
      File "/user/i/iran/asroman/ast/src/models/ast_models.py", line 148, in __init__
        new_pos_embed = new_pos_embed.reshape(1, 768, num_patches).transpose(1, 2)
    RuntimeError: shape '[1, 768, 256]' is invalid for input of size 294912
    

    I would greatly appreciate your help (:

    bug 
    opened by adrianSRoman 8
  • Question about json file and label index

    Good day!

    It is my pleasure to read your paper AST:Audio Spectrogram Transformer. You have done such an excellent job.

    While trying to use your model on my own dataset, I ran into a problem: how do I create the json file and the label index? I don't quite understand the specific format and parameter configuration. My dataset consists of five categories with 1300 audio clips in each category. From your example in /egs/audioset/data/class_labels_indices.csv, I don't understand what 'mid' means. I'm just a newbie, so forgive the basic question.

    I would appreciate it if you could answer my questions patiently.

    Yours sincerely.

    question 
    opened by TungyuYoung 8
  • OSError: ./exp/test-esc50-f10-t10-impTrue-aspTrue-b48-lr1e-5/fold1/result.csv not found. (ESC-50 Recipe)

    Hello Mr. Gong, first of all thank you for your work. I ran into a small problem when running bash run_esc.sh and would like to ask for your advice. The last line of run_esc.sh is python ./get_esc_result.py --exp_path ${base_exp_dir}. When the program reaches that line, line 21 of the script tells me that fold1/result.csv cannot be found. So may I ask whether I need to create this file myself, or whether I should use the file you uploaded (ast-master/egs/esc50/exp/test-esc50-f10-t10-pTrue-b48-lr1e-5/result.csv)? The image I uploaded shows the error description. (attachment: wrong blog)

    bug 
    opened by poult-lab 8
  • Real-time microphone testing

    Hi, I've been using your model for classification and audio analysis and it works great. I have trained my own model and was wondering if there's a way to test it in real time with a microphone rather than an audio file. If you could provide a way forward, it would be great.

    enhancement 
    opened by ridasaleem0 8
  • Wonderful work! questions about feature size

    Hi, there: Thank you for open sourcing this piece of implementation! It is very inspiring to see timm works in the audio settings.

    Q: I tried the pipeline with a smaller feature size, e.g. 64x400, ended up with 39x5 patches, and AST was stuck at 0.01 mAP. I tried upsampling to your feature size 128x1024, which brought it up to 0.10 mAP. I guess your intuition is to "take advantage of" the 384x384 positions (originally 576 n_patches), so 1212 patches would be roughly 2x the 576 patches. I am still curious whether there is a way to do this with a smaller feature dimension.

    question 
    opened by lijuncheng16 7
  • Running AST on a downstream task.

    Dear Yuan,

    Thank you for creating this SOTA model for audio processing.

    I want to run AST on an audio dataset. I have prepared the data in a similar manner as the data prepared for the ESC50 dataset. I wanted to run the model, but then I noticed that you use a dataset-specific mean and std to normalize the dataset. Could you please share the method you used to compute these two statistics?

    Regards Saif

    enhancement 
    opened by saifkhan-m 7
  • How to change the interpolation method?

    Hi Yuan, In the AST, for the part of the ablation experiment comparing different interpolation methods, one of the items is called "Reinitialize", how is this reflected in the code? Best Regards.

    question 
    opened by ooobsidian 6
  • Can AST be used for audio representation towards solving the frame-level classification tasks?

    Hi Yuan,

    I am currently reading your wonderful papers about the AST and SSAST. I wonder if the AST can be used to extract frame-level representation of audio (like music) to solve the frame-level classification tasks? Thanks.

    question 
    opened by SylviaZiyaZhou 1
  • About code "100-106" from dataloader.py

    Dear Mr. Gong, thanks a lot for your pioneering work in the field of audio processing and your warmhearted comments every time. I have a question about using the MixUp method in AST. I saw the code at line 102 of dataloader.py: waveform = waveform - waveform.mean(). My question is why the waveform needs to have its mean subtracted. Does that subtraction come from the original MixUp, or is there another reason behind it?

    question 
    opened by poult-lab 4
  • The problem of reproducing the AST result in full dataset

    Hi Yuan Gong, nice job! My research is also based on your pipeline, but I found that I can't reproduce the paper's AST results on the full dataset. My dataset is downloaded from qiuqiangkong/audioset_tagging_cnn (github.com) (the unbalanced, balanced, and eval sets contain 1912024, 20550, and 18886 samples respectively), and I pre-processed the dataset following your instructions, including resampling to 16kHz, generating the weight file, and re-calculating the mean and std (our mean and std are -3.539583 and 3.4221482, which differ from the -4.2677393 and 4.5689974 in your code).

    The results of the 5 epochs are: 0.405, 0.421, 0.434, 0.433, 0.433. Compare with the results given in your source code: 4.153, 0.439, 0.448, 0.449, 0.449.

    The final ensemble result is 0.445, which is quite different from the 0.459 reported in your paper. We checked that the hyperparameters of the experiment are the same as in your code; can you give some advice on what might cause such a problem? Looking forward to your reply.

    Thanks


    reproduction 
    opened by MichaelLynn1996 4
  • Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

    Dear Yuan,

    I met this issue when running demo.py. It occurred at line 29 of ast_models.py, self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size), with the following error message: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor. Would you like to have a look at it? I use timm==0.4.5, torch==1.10.1+cu102, torchaudio==0.10.1+cu102, torchvision==0.11.2+cu102.

    Thank you Best Regards, Nanjun

    bug 
    opened by michelle-chou25 15
  • Some questions about the details of AST.

    I would like to know how to explain that audio classification based on spectrograms can be achieved with ImageNet-pretrained models. As we all know, most of the pictures in ImageNet are common photos of daily life, such as cats, dogs, cars, etc. Are the features of these pictures/objects correlated with audio spectrograms? Why can the knowledge learned from ordinary pictures be transferred to the classification of spectrograms?

    I would appreciate it if you could answer my questions.

    question 
    opened by TungyuYoung 1
Owner
Yuan Gong
Ph.D in CS
An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

Facebook Research 253 Jan 6, 2023
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling @ INTERSPEECH 2021 Accepted

NU-Wave — Official PyTorch Implementation NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling Junhyeok Lee, Seungu Han @ MINDsLab Inc

MINDs Lab 242 Dec 23, 2022
PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)

PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition. Transformer models are good at capturing content-based

Soohwan Kim 565 Jan 4, 2023
PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

ContextNet ContextNet has CNN-RNN-transducer architecture and features a fully convolutional encoder that incorporates global context information into

Sangchun Ha 24 Nov 24, 2022
The LaTeX and Python code for generating the paper, experiments' results and visualizations reported in each paper is available (whenever possible) in the paper's directory

This repository contains the software implementation of most algorithms used or developed in my research. The LaTeX and Python code for generating the

João Fonseca 3 Jan 3, 2023
Inference code for "StylePeople: A Generative Model of Fullbody Human Avatars" paper. This code is for the part of the paper describing video-based avatars.

NeuralTextures This is repository with inference code for paper "StylePeople: A Generative Model of Fullbody Human Avatars" (CVPR21). This code is for

Visual Understanding Lab @ Samsung AI Center Moscow 18 Oct 6, 2022
This is the official source code for SLATE. We provide the code for the model, the training code, and a dataset loader for the 3D Shapes dataset. This code is implemented in Pytorch.

SLATE This is the official source code for SLATE. We provide the code for the model, the training code and a dataset loader for the 3D Shapes dataset.

Gautam Singh 66 Dec 26, 2022
Code of the lileonardo team for the 2021 Emotion and Theme Recognition in Music task of MediaEval 2021

Emotion and Theme Recognition in Music The repository contains code for the submission of the lileonardo team to the 2021 Emotion and Theme Recognitio

Vincent Bour 8 Aug 2, 2022
This is the code for the paper "Contrastive Clustering" (AAAI 2021)

Contrastive Clustering (CC) This is the code for the paper "Contrastive Clustering" (AAAI 2021) Dependency python>=3.7 pytorch>=1.6.0 torchvision>=0.8

Yunfan Li 210 Dec 30, 2022
Code for our ICASSP 2021 paper: SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

SA-Net: Shuffle Attention for Deep Convolutional Neural Networks (paper) By Qing-Long Zhang and Yu-Bin Yang [State Key Laboratory for Novel Software T

Qing-Long Zhang 199 Jan 8, 2023
PyTorch code for ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection

Unbiased Teacher for Semi-Supervised Object Detection This is the PyTorch implementation of our paper: Unbiased Teacher for Semi-Supervised Object Detection

Facebook Research 366 Dec 28, 2022
Code for our CVPR 2021 paper "MetaCam+DSCE"

Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification (CVPR'21) Introduction Code for our CVPR 2021

FlyingRoastDuck 59 Oct 31, 2022
Code for ICLR 2021 Paper, "Anytime Sampling for Autoregressive Models via Ordered Autoencoding"

Anytime Autoregressive Model Anytime Sampling for Autoregressive Models via Ordered Autoencoding , ICLR 21 Yilun Xu, Yang Song, Sahaj Gara, Linyuan Go

Yilun Xu 22 Sep 8, 2022
Official code of the paper "ReDet: A Rotation-equivariant Detector for Aerial Object Detection" (CVPR 2021)

ReDet: A Rotation-equivariant Detector for Aerial Object Detection ReDet: A Rotation-equivariant Detector for Aerial Object Detection (CVPR2021), Jiam

csuhan 334 Dec 23, 2022
Code for the paper "Training GANs with Stronger Augmentations via Contrastive Discriminator" (ICLR 2021)

Training GANs with Stronger Augmentations via Contrastive Discriminator (ICLR 2021) This repository contains the code for reproducing the paper: Train

Jongheon Jeong 174 Dec 29, 2022
Official code for the paper: Deep Graph Matching under Quadratic Constraint (CVPR 2021)

QC-DGM This is the official PyTorch implementation and models for our CVPR 2021 paper: Deep Graph Matching under Quadratic Constraint. It also contain

Quankai Gao 55 Nov 14, 2022
Official code for the ICLR 2021 paper Neural ODE Processes

Neural ODE Processes Official code for the paper Neural ODE Processes (ICLR 2021). Abstract Neural Ordinary Differential Equations (NODEs) use a neura

Cristian Bodnar 50 Oct 28, 2022
Code for CVPR 2021 paper: Anchor-Free Person Search

Introduction This is the implementationn for Anchor-Free Person Search in CVPR2021 License This project is released under the Apache 2.0 license. Inst

null 158 Jan 4, 2023
Code of paper "CDFI: Compression-Driven Network Design for Frame Interpolation", CVPR 2021

CDFI (Compression-Driven-Frame-Interpolation) [Paper] (Coming soon...) | [arXiv] Tianyu Ding*, Luming Liang*, Zhihui Zhu, Ilya Zharkov IEEE Conference

Tianyu Ding 95 Dec 4, 2022