PyTorch code for training MM-DistillNet for multimodal knowledge distillation


There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge

MM-DistillNet is a novel framework that performs multi-object detection and tracking using only ambient sound at inference time. The framework relies on our new MTA loss function, which facilitates the distillation of information from multimodal teachers (RGB, thermal and depth) into an audio-only student network.

Illustration of MM-DistillNet

This repository contains the PyTorch implementation of our CVPR'2021 paper There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge. The repository builds on the PyTorch-YOLOv3, Metrics, and Yet-Another-EfficientDet-Pytorch codebases.

If you find the code useful for your research, please consider citing our paper:

  title={There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge},
  author={Rivera Valverde, Francisco and Valeria Hurtado, Juana and Valada, Abhinav},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},


System Requirements

  • Linux
  • Python 3.7
  • PyTorch 1.3
  • CUDA 10.1

IMPORTANT NOTE: These requirements are not necessarily mandatory. However, we have only tested the code under the above settings and cannot provide support for other setups.
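
Since only the versions listed above were tested, it can help to compare the local environment against them before training. The following is a minimal sketch, not part of the repository: the helper names (`major_minor`, `find_mismatches`, `installed_versions`) are hypothetical, and the check only compares major and minor version numbers.

```python
import platform

# Versions listed under "System Requirements" (the only tested setup).
TESTED = {"python": "3.7", "torch": "1.3", "cuda": "10.1"}


def major_minor(version):
    """Reduce a version string such as '1.3.1+cu101' to '1.3'."""
    parts = version.split("+")[0].split(".")
    return ".".join(parts[:2])


def find_mismatches(installed, tested=TESTED):
    """Return {name: (installed_version_or_None, tested_version)} for every
    component that is missing or differs from the tested major.minor version."""
    mismatches = {}
    for name, wanted in tested.items():
        found = installed.get(name)
        if found is None or major_minor(found) != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions():
    """Collect the versions available in the current environment."""
    versions = {"python": platform.python_version()}
    try:
        import torch  # optional: reported as missing if not installed

        versions["torch"] = torch.__version__
        if torch.version.cuda is not None:
            versions["cuda"] = torch.version.cuda
    except ImportError:
        pass
    return versions


if __name__ == "__main__":
    for name, (found, wanted) in find_mismatches(installed_versions()).items():
        print(f"{name}: found {found}, tested with {wanted}")
```

A mismatch does not mean the code will fail; it only marks a setup that was not tested.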


a. Create a conda virtual environment.

git clone
cd MM-DistillNet
conda create -n mmdistillnet_env
conda activate mmdistillnet_env

b. Install dependencies

pip install -r requirements.txt

Prepare datasets and configure run

We also supply our large-scale multimodal dataset with over 113,000 time-synchronized frames of RGB, depth, thermal, and audio modalities, available at

Please make sure the data is available in the directory under the name data.

The binary download contains the expected folder format for our scripts to work. The path where the binary was extracted must be updated in the configuration files, in this case configs/mm-distillnet.cfg.

You will also need to download our trained teacher models, available here. Download these files and place them in the current directory under the name trained_models. The directory structure should look something like this:

configs/  images/  LICENSE  logs/  requirements.txt  setup.cfg  src/ trained_models/

>ls trained_models
LICENSE.txt              README.txt                             yet-another-efficientdet-d2-embedding.pth  yet-another-efficientdet-d2-rgb.pth
mm-distillnet.0.pth.tar  yet-another-efficientdet-d2-depth.pth  yet-another-efficientdet-d2.pth            yet-another-efficientdet-d2-thermal.pth
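
The layout above can be verified before starting a run. This is a small sketch rather than a script shipped with the repository: the function name `missing_files` is hypothetical, and the file names are taken from the listing above.

```python
from pathlib import Path

# File names copied from the trained_models listing above.
EXPECTED_MODELS = [
    "mm-distillnet.0.pth.tar",
    "yet-another-efficientdet-d2.pth",
    "yet-another-efficientdet-d2-rgb.pth",
    "yet-another-efficientdet-d2-depth.pth",
    "yet-another-efficientdet-d2-thermal.pth",
    "yet-another-efficientdet-d2-embedding.pth",
]


def missing_files(root="."):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    missing = []
    # The dataset is expected in a directory named "data".
    if not (root / "data").is_dir():
        missing.append("data/")
    for name in EXPECTED_MODELS:
        if not (root / "trained_models" / name).is_file():
            missing.append(f"trained_models/{name}")
    return missing


if __name__ == "__main__":
    for entry in missing_files():
        print("missing:", entry)
```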

Additionally, the file configs/mm-distillnet.cfg supports different parallelization strategies and GPU/CPU execution (using PyTorch's DataParallel and DistributedDataParallel).
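
For readers unfamiliar with the two PyTorch wrappers, the sketch below shows how a model is typically wrapped for each strategy. It is an illustration only: the function `wrap_for_parallel` and the strategy names "none", "dp" and "ddp" are hypothetical and are not the option names used in configs/mm-distillnet.cfg.

```python
import torch
import torch.nn as nn


def wrap_for_parallel(model, strategy="none", local_rank=0):
    """Move `model` to a device and wrap it according to `strategy`.

    strategy (hypothetical names):
      "none": single device (GPU if available, else CPU)
      "dp":   nn.DataParallel across all visible GPUs in one process
      "ddp":  DistributedDataParallel, one process per GPU
    """
    if strategy == "ddp":
        # Assumes torch.distributed.init_process_group(...) was already
        # called by the launcher for this process.
        device = torch.device("cuda", local_rank)
        model = model.to(device)
        wrapped = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
        return wrapped, device

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    if strategy == "dp" and torch.cuda.device_count() > 1:
        # Splits each batch across GPUs and gathers outputs on the first one.
        model = nn.DataParallel(model)
    return model, device


if __name__ == "__main__":
    net, dev = wrap_for_parallel(nn.Linear(4, 2), strategy="dp")
    print(net(torch.zeros(3, 4, device=dev)).shape)
```

DataParallel needs no launcher but runs in a single process, whereas DistributedDataParallel requires one process per GPU and an initialized process group.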

Due to disk space constraints, we provide an MP3 version of the audio files. Librosa is known to be slow with MP3 files, so we also provide an mp3->pickle conversion utility. The idea is that, before training, we convert the audio files to spectrograms and store them in pickle files.

--dir <path to the dataset>
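
The precompute-and-cache idea can be sketched as follows. This is not the repository's conversion utility: it assumes the waveform has already been decoded into a NumPy array (for example with Librosa), computes a plain magnitude STFT with NumPy instead of whatever spectrogram settings the authors use, and the function names and the `n_fft`/`hop_length` defaults are illustrative assumptions.

```python
import pickle

import numpy as np


def magnitude_spectrogram(waveform, n_fft=512, hop_length=256):
    """Return the STFT magnitude with shape (n_fft // 2 + 1, n_frames)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if len(waveform) < n_fft:
        raise ValueError("waveform is shorter than one analysis window")
    window = np.hanning(n_fft).astype(np.float32)
    n_frames = 1 + (len(waveform) - n_fft) // hop_length
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # Real FFT per frame, then transpose to (frequency, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T.astype(np.float32)


def save_spectrogram(waveform, pickle_path):
    """Compute the spectrogram once and store it in a pickle file."""
    spec = magnitude_spectrogram(waveform)
    with open(pickle_path, "wb") as handle:
        pickle.dump(spec, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return spec


def load_spectrogram(pickle_path):
    """Load a cached spectrogram; no audio decoding happens here."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)
```

During training only `load_spectrogram` would be called, so the slow MP3 decoding is paid once per file rather than once per epoch.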

Training and Evaluation

Training Procedure

Edit the config file appropriately in the configs folder. Our best recipe is found under configs/mm-distillnet.cfg.

python --config 


To run on the full dataset, we trained our method using 4 GPUs with 2.4 Gb memory each (the expected runtime is 7 days). After training, the best model is stored under /best.pth.tar. This file can be used to evaluate the performance of the model.

Evaluation Procedure

Evaluate the performance of the model (Our best model can be found under trained_models/mm-distillnet.0.pth.tar):

python --config 



The evaluation results of our method, after Bayesian optimization, are (more details can be found in the paper):

Method              KD    mAP@Avg   mAP@0.5   mAP@0.75   CDx    CDy
-----------------   ---   -------   -------   --------   ----   ----
StereoSoundNet[4]   RGB   44.05     62.38     41.46      3.00   2.24
MM-DistillNet       RGB   61.62     84.29     59.66      1.27   0.69
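
To put the two rows in proportion, the short calculation below computes the relative change of MM-DistillNet over StereoSoundNet for three of the reported columns. The numbers are copied from the table; the function names are hypothetical, and reading a negative change in CDx and CDy as an improvement assumes these are error measures where lower is better, which the table itself does not state.

```python
# Values copied from the results table above.
ROWS = {
    "StereoSoundNet": {"mAP@Avg": 44.05, "CDx": 3.00, "CDy": 2.24},
    "MM-DistillNet": {"mAP@Avg": 61.62, "CDx": 1.27, "CDy": 0.69},
}


def relative_change(baseline, ours):
    """Percentage change of `ours` with respect to `baseline`."""
    return 100.0 * (ours - baseline) / baseline


def summarize(rows=ROWS):
    """Relative change per metric, rounded to one decimal place."""
    base, ours = rows["StereoSoundNet"], rows["MM-DistillNet"]
    return {m: round(relative_change(base[m], ours[m]), 1) for m in base}


if __name__ == "__main__":
    for metric, change in summarize().items():
        print(f"{metric}: {change:+.1f}%")
```

This gives roughly +39.9% for mAP@Avg (an absolute gain of 17.57 points), -57.7% for CDx and -69.2% for CDy.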

Pre-Trained Models

Our best pre-trained model can be found on the dataset installation path.


We have used utility functions from other open-source projects. We especially thank the authors of:



For academic usage, the code is released under the GPLv3 license. For any commercial purpose, please contact the authors.

  • Question in Evaluate

    Hello there! First of all, thank you for your outstanding work! I have a problem when reproducing your work.

    I use the following command to evaluate: python --config configs/mm-distillnet.cfg --checkpoint trained_models/mm-distillnet.0.pth.tar

    But I get bad performance. Can you help me improve it?



    opened by muzhaohui 7
  • Question in Train

    Hello there! First of all, thank you for your outstanding work! I have a problem when reproducing your work.

    I use the model and the config you provided for training, but the results are very poor.


    It stopped at epoch 21.

    The mAP is only 48. Is there a difference between the dataset you used and the one provided? When I evaluate the model you provided (distillnet.0.pth.tar), the result is even worse than this!

    So is there something wrong with the best model you provided?

    opened by muzhaohui 6
  • Question about dataset structure

    Thank you so much for this dataset, it is very large and well thought out!

    I have a question about the structure of the dataset. When I untar the audio directories, the files are mostly of the form audio/audio_<mic_number_from_0_to_7>_<timestamp>.mp3, but sometimes they are of the form audio/audio_<mic_number_from_0_to_7>_<timestamp>_<extra_number>.mp3, where there is another number after the timestamp.

    For example in /drive_day_2020_04_14_15_56_26/audio there is audio_0_1586873154_433877998_1.mp3 and audio_0_1586873154_433877998_4.mp3 and when I diff them, they seem to be the same file.

    Why is this the case? Can I just ignore all but one when processing the audio?


    opened by drydenwiebe 3
  • question in Evaluate

    Hello there!

    First of all, thank you for your outstanding work! I have a problem when reproducing your work.

    Your GT is generated through the teacher network, so when the teacher network performance changes, then the GT will change accordingly. Do you have a more accurate GT? Or can you teach me how to measure the performance of the student model more accurately?


    opened by muzhaohui 1
  • Something wrong with the dataset download path

    I want to download the dataset, but an error occurred: HTTP request sent, awaiting response... 502 Bad Gateway 2021-05-20 00:49:02 ERROR 502: Bad Gateway.

    The URL of the dataset also cannot be opened.

    Could you fix this problem?

    opened by zhouweii234 1
  • Source codes and datasets are missing

    I have tried to run the following code: python --config ./configs

    Then, got the following error: ModuleNotFoundError: No module named 'src.fullcnn_net'.

    After checking, I think the following source files are missing:

    src.fullcnn_net src.loss.ABLoss src.loss.MTALoss src.loss.KLLoss src.loss.GroupAttentionLoss src.loss.MultiTeacherPairWiseSimilarityLoss src.loss.MultiTeacherPairWiseSimilarityLoss src.loss.MultiTeacherContrastiveAttentionLoss src.loss. MultiTeacherContrastiveAttentionLoss src.loss.MultiTeacherTrippletAttentionLoss src.loss. MultiTeacherTrippletAttentionLoss src.loss.CRDLoss import CRDLoss src.loss.NSTLoss src.loss.PKTLoss src.loss.SimilarityLoss src.loss.RankingLoss

    Also, I am not sure how to download/access the datasets. I do not see any binary files downloaded inside the folder.

    Thanks for your help.

    opened by as4mz 1
  • Microphone array configurations

    Thanks for the impressive results. I have some questions about the microphone array.

    1. What is the separation between the different microphones?
    2. Are they colocated with all the other sensors? Will the position affect the model a lot?
    opened by sunwell1994 0
  • Could not unzip dataset

    Thank you very much for sharing the dataset. I find this project really interesting to try. However, I have an issue with the shared dataset.

    After downloading 84 files and joining them into mavd_dataset.tar.gz with the command cat mavd_dataset.tar.gz.part-* > mavd_dataset.tar.gz, I could not unzip it even though I tried several options:

    1. gzip -d mavd_dataset.tar.gz
       gzip: mavd_dataset.tar.gz: not in gzip format

    2. tar -xzvf mavd_dataset.tar.gz
       gzip: stdin: not in gzip format
       tar: Child returned status 1
       tar: Error is not recoverable: exiting now

    opened by tuantdang 0
  • About train

    Thank you very much for your data set and code. I encountered this problem when training the model:

    Traceback (most recent call last):
      File "F:/py_pro/MM-DistillNet-main/sec/optimization/", line 318, in
        logits_s, features_s = self.student_model(audio)
      File "D:\ProgramData\Anaconda3\envs\MM-DistillNet-main\lib\site-packages\torch\nn\modules\", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "F:\py_pro\MM-DistillNet-main\src\", line 670, in forward
        _, p3, p4, p5 = self.backbone_net(inputs)
      File "D:\ProgramData\Anaconda3\envs\MM-DistillNet-main\lib\site-packages\torch\nn\modules\", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "F:\py_pro\MM-DistillNet-main\src\", line 556, in forward
        x = self.model._conv_stem(x)
      File "D:\ProgramData\Anaconda3\envs\MM-DistillNet-main\lib\site-packages\torch\nn\modules\", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "F:\py_pro\MM-DistillNet-main\src\", line 54, in forward
        x = F.pad(x, [left, right, top, bottom])
      File "D:\ProgramData\Anaconda3\envs\MM-DistillNet-main\lib\site-packages\torch\nn\", line 3998, in _pad
        assert len(pad) // 2 <= input.dim(), "Padding length too large"
    RuntimeError: Input type (torch.cuda.DoubleTensor) and weight type (torch.cuda.FloatTensor) should be the same.

    I can't solve this problem. Did I make an error in processing the audio files?

    opened by liushibei 0
  • About loading the data

    Hello, I have downloaded the data you provided and put it under the "data" folder as "mavd_dataset.tar.gz". However, when running the I encountered an issue, which is detailed as follows. I wonder what the "train.txt" is. Need I uncompress or do anything else?

    Traceback (most recent call last):
      File "", line 316, in <module>
        train_multimodal_detection(config)
      File "", line 149, in train_multimodal_detection
        mode="train",
      File "/home/jcli/MM-DistillNet/src/datasets/", line 92, in __init__
        super().__init__(config=config, mode=mode,classes=self.classes)
      File "/home/jcli/MM-DistillNet/src/datasets/", line 106, in __init__
        self.ids = self.get_id_list()
      File "/home/jcli/MM-DistillNet/src/datasets/", line 111, in get_id_list
        self.ids = [id.strip() for id in open(id_list_path)]
    FileNotFoundError: [Errno 2] No such file or directory: 'data/train_all.txt'

    opened by Frankie123421 0
  • Issue with

    Hi, First I must thank you for the great work and making the data set available. I was trying to convert the dataset to pkl and at about 24%, I got the following error. Not sure how to fix this.


    Appreciate any help in this.

    opened by eyeris 0
  • Is the code for evaluating the tracking performance available?

    I think the is for evaluating the object detection performance. And I could not find the code for evaluating the tracking performance such as ID switch, MOTA and MOTP?

    opened by KawhiZhao 0
  • The thermal teacher

    Thanks for sharing this great work!

    I have run into a problem with the thermal teacher. When I load the thermal teacher model, I find that most of its weights are not loaded into the initialized EfficientDet. The evaluate.log output is:

    using path=trained_models/yet-another-efficientdet-d2-thermal.pth
    ModelDict Update:174/1076

    I also found that the thermal teacher's batch_labels is always empty.

    So can you teach me how to load the thermal teacher model correctly?

    Thank you very much!

    opened by LE0J-Song 0
