MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks

Official PyTorch implementation of the MixMo framework | paper | docs

Citation

If you find this code useful for your research, please cite:

@article{rame2021ixmo,
    title={MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks},
    author={Alexandre Rame and Remy Sun and Matthieu Cord},
    year={2021},
    journal={arXiv preprint arXiv:2103.06132}
}

Abstract

Recent strategies achieved ensembling “for free” by fitting concurrently diverse subnetworks inside a single base network. The main idea during training is that each subnetwork learns to classify only one of the multiple inputs simultaneously provided. However, the question of how to best mix these multiple inputs has not been studied so far.

In this paper, we introduce MixMo, a new generalized framework for learning multi-input multi-output deep subnetworks. Our key motivation is to replace the suboptimal summing operation hidden in previous approaches by a more appropriate mixing mechanism. For that purpose, we draw inspiration from successful mixed sample data augmentations. We show that binary mixing in features - particularly with rectangular patches from CutMix - enhances results by making subnetworks stronger and more diverse.

We improve state of the art for image classification on CIFAR-100 and Tiny ImageNet datasets. Our easy to implement models notably outperform data augmented deep ensembles, without the inference and memory overheads. As we operate in features and simply better leverage the expressiveness of large networks, we open a new line of research complementary to previous works.

Overview

Most important code sections

This repository provides a general wrapper over PyTorch to reproduce the main results from the paper. The code sections specific to MixMo can be found in:

mixmo.loaders.dataset_wrapper.py and specifically MixMoDataset to create batches with multiple inputs and multiple outputs.
mixmo.augmentations.mixing_blocks.py where we create the mixing masks, e.g. via linear summing (_mixup_mask) or via patch mixing (_cutmix_mask).
mixmo.networks.resnet.py and mixmo.networks.wrn.py where we adapt the network structures to handle:
- multiple inputs via multiple conv1s encoders (one for each input). The function mixmo.augmentations.mixing_blocks.mix_manifold is used to mix the extracted representations according to the masks provided in metadata from MixMoDataset.
- multiple outputs via multiple predictions.

This translates to additional tensor management in mixmo.learners.learner.py.
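
To make the two kinds of masks concrete, here is a minimal pure-Python sketch, not the code in mixmo.augmentations.mixing_blocks. The function names cutmix_mask, mixup_mask and mix_features are hypothetical, feature maps are reduced to 2D lists, and the rescaling by 2 for M=2 subnetworks is an assumption based on our reading of the paper rather than something stated in this README.

```python
import random


def cutmix_mask(height, width, kappa, rng=random):
    """Binary mask with ones everywhere except a rectangular patch of zeros.

    The zero patch covers roughly a fraction (1 - kappa) of the area, so
    about a fraction kappa of the entries select the first input.
    """
    cut_h = int(round(height * (1.0 - kappa) ** 0.5))
    cut_w = int(round(width * (1.0 - kappa) ** 0.5))
    top = rng.randint(0, height - cut_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - cut_w)
    return [
        [0 if (top <= i < top + cut_h and left <= j < left + cut_w) else 1
         for j in range(width)]
        for i in range(height)
    ]


def mixup_mask(height, width, kappa):
    """Constant mask: linear summing with weight kappa at every position."""
    return [[kappa] * width for _ in range(height)]


def mix_features(feats_0, feats_1, mask, scale=2.0):
    """Elementwise scale * (mask * feats_0 + (1 - mask) * feats_1).

    scale=2.0 is an assumed rescaling for two subnetworks.
    """
    return [
        [scale * (m * a + (1 - m) * b) for a, b, m in zip(row_0, row_1, row_m)]
        for row_0, row_1, row_m in zip(feats_0, feats_1, mask)
    ]


# Draw a mixing ratio from Beta(2, 2) and mix two constant 4x4 feature maps.
rng = random.Random(0)
kappa = rng.betavariate(2, 2)
mask = cutmix_mask(4, 4, kappa, rng)
mixed = mix_features([[1.0] * 4 for _ in range(4)],
                     [[-1.0] * 4 for _ in range(4)], mask)
```

With a binary mask, every spatial position of the mixed map comes from exactly one of the two inputs, whereas the constant mask blends both inputs everywhere.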

Pseudo code

Our MixMoDataset wraps a PyTorch Dataset. The batch_repetition_sampler repeats the same index b times in each batch. Moreover, we provide SoftCrossEntropyLoss which handles soft-labels required by mixed sample data augmentations such as CutMix.

import os

from torchvision.datasets import CIFAR100

from mixmo.loaders import (dataset_wrapper, batch_repetition_sampler)
from mixmo.networks.wrn import WideResNetMixMo
from mixmo.core.loss import SoftCrossEntropyLoss as criterion

...

# cf mixmo.loaders.loader
train_dataset = dataset_wrapper.MixMoDataset(
        dataset=CIFAR100(os.path.join(dataplace, "cifar100-data")),
        num_members=2,  # we use M=2 subnetworks
        mixmo_mix_method="cutmix",  # patch mixing, linked to mixmo.augmentations.mixing_blocks._cutmix_mask
        mixmo_alpha=2,  # mixing ratio sampled from a Beta distribution with concentration 2
        mixmo_weight_root=3  # root r=3 for the reweighting of the loss components
        )
network = WideResNetMixMo(depth=28, widen_factor=10, num_classes=100)

...

# cf mixmo.learners.learner and mixmo.learners.model_wrapper
for _ in range(num_epochs):
    for indexes_0, indexes_1 in batch_repetition_sampler(batch_size=64, b=4, max_index=len(train_dataset)):
        for (inputs_0, inputs_1, targets_0, targets_1, metadata_mixmo_masks) in train_dataset(indexes_0, indexes_1):
            outputs_0, outputs_1 = network([inputs_0, inputs_1], metadata_mixmo_masks)
            loss = criterion(outputs_0, targets_0) + criterion(outputs_1, targets_1)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
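
The two helpers mentioned above can be illustrated with a small standalone sketch. It is not the repository's batch_repetition_sampler or SoftCrossEntropyLoss: the names below are hypothetical, the code is pure Python rather than PyTorch, and it assumes that batch_size counts the repeated copies (so each batch holds batch_size / b distinct indices).

```python
import math
import random


def batch_repetition_batches(num_samples, batch_size, b, rng=random):
    """Yield index batches in which every drawn index appears b times.

    Assumption: batch_size includes the repetitions.
    """
    if batch_size % b != 0:
        raise ValueError("batch_size must be a multiple of b")
    unique_per_batch = batch_size // b
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, unique_per_batch):
        unique = order[start:start + unique_per_batch]
        batch = unique * b      # repeat every index b times
        rng.shuffle(batch)      # spread the copies across the batch
        yield batch


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy between one row of logits and a soft label distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_norm) for z, t in zip(logits, soft_targets))


# A CutMix-style label: 70% class 0, 30% class 2.
loss = soft_cross_entropy([2.0, 0.5, 1.0], [0.7, 0.0, 0.3])
```

With a one-hot target, soft_cross_entropy reduces to the usual cross-entropy, which is why a single criterion can serve both mixed and unmixed samples.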

Configuration files

Our code heavily relies on yaml config files. In the mixmo-pytorch/config folder, we provide the configs to reproduce the main paper results.

For example, the name of the state-of-the-art config exp_cifar100_wrn2810-2_cutmixmo-p5_msdacutmix_bar4 means that:

cifar100: dataset is CIFAR-100.
wrn2810-2: WideResNet-28-10 network architecture with M=2 subnetworks.
cutmixmo-p5: mixing block is patch mixing with probability p=0.5 else linear mixing.
msdacutmix: use CutMix mixed sample data augmentation.
bar4: batch repetition to b=4.
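
The naming scheme can also be decoded programmatically. The helper below is a hypothetical sketch that is not part of the repository. It assumes every config name has the six underscore-separated fields seen in the provided configs, and that a suffix such as p5 encodes a probability in tenths.

```python
def parse_experiment_name(name):
    """Split an experiment name into the fields described above."""
    stem = name[:-len(".yaml")] if name.endswith(".yaml") else name
    parts = stem.split("_")
    if len(parts) != 6 or parts[0] != "exp":
        raise ValueError(f"unexpected experiment name: {name!r}")
    _, dataset, network, mixing, msda, repetition = parts
    # "wrn2810-2" -> architecture "wrn2810" with 2 subnetworks; no suffix -> 1.
    architecture, _, members = network.partition("-")
    # "cutmixmo-p5" -> mixing block "cutmixmo" with patch probability 0.5.
    method, _, prob = mixing.partition("-")
    return {
        "dataset": dataset,
        "architecture": architecture,
        "num_members": int(members) if members else 1,
        "mixing_block": method,
        "patch_mixing_prob": int(prob[1:]) / 10 if prob else None,
        "msda": msda[len("msda"):] if msda.startswith("msda") else None,
        "batch_repetitions": int(repetition[len("bar"):]),
    }


fields = parse_experiment_name(
    "exp_cifar100_wrn2810-2_cutmixmo-p5_msdacutmix_bar4.yaml")
```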

Results and available checkpoints

CIFAR-100 with WideResNet-28-10

Subnetwork method	MSDA	Top-1 Accuracy	config file in mixmo-pytorch/config/cifar100
--	Vanilla	81.79	exp_cifar100_wrn2810_1net_standard_bar1.yaml
--	Mixup	83.43	exp_cifar100_wrn2810_1net_msdamixup_bar1.yaml
--	CutMix	83.95	exp_cifar100_wrn2810_1net_msdacutmix_bar1.yaml
MIMO	--	82.92	exp_cifar100_wrn2810-2_mimo_standard_bar4.yaml
Linear-MixMo	--	82.96	exp_cifar100_wrn2810-2_linearmixmo_standard_bar4.yaml
Cut-MixMo	--	85.52 - 85.59	exp_cifar100_wrn2810-2_cutmixmo-p5_standard_bar4.yaml
Linear-MixMo	CutMix	85.36 - 85.57	exp_cifar100_wrn2810-2_linearmixmo_msdacutmix_bar4.yaml
Cut-MixMo	CutMix	85.77 - 85.92	exp_cifar100_wrn2810-2_cutmixmo-p5_msdacutmix_bar4.yaml

CIFAR-10 with WideResNet-28-10

Subnetwork method	MSDA	Top-1 Accuracy	config file in mixmo-pytorch/config/cifar10
--	Vanilla	96.37	exp_cifar10_wrn2810_1net_standard_bar1.yaml
--	Mixup	97.07	exp_cifar10_wrn2810_1net_msdamixup_bar1.yaml
--	CutMix	97.28	exp_cifar10_wrn2810_1net_msdacutmix_bar1.yaml
MIMO	--	96.71	exp_cifar10_wrn2810-2_mimo_standard_bar4.yaml
Linear-MixMo	--	96.88	exp_cifar10_wrn2810-2_linearmixmo_standard_bar4.yaml
Cut-MixMo	--	97.52	exp_cifar10_wrn2810-2_cutmixmo-p5_standard_bar4.yaml
Linear-MixMo	CutMix	97.73	exp_cifar10_wrn2810-2_linearmixmo_msdacutmix_bar4.yaml
Cut-MixMo	CutMix	97.83	exp_cifar10_wrn2810-2_cutmixmo-p5_msdacutmix_bar4.yaml

Tiny ImageNet-200 with PreActResNet-18-width

Method	Width	Top-1 Accuracy	config file in mixmo-pytorch/config/tiny
Vanilla	1	62.75	exp_tinyimagenet_res18_1net_standard_bar1.yaml
Linear-MixMo	1	62.91	exp_tinyimagenet_res18-2_linearmixmo_standard_bar4.yaml
Cut-MixMo	1	64.32	exp_tinyimagenet_res18-2_cutmixmo-p5_standard_bar4.yaml
Vanilla	2	64.91	exp_tinyimagenet_res182_1net_standard_bar1.yaml
Linear-MixMo	2	67.03	exp_tinyimagenet_res182-2_linearmixmo_standard_bar4.yaml
Cut-MixMo	2	69.12	exp_tinyimagenet_res182-2_cutmixmo-p5_standard_bar4.yaml
Vanilla	3	65.84	exp_tinyimagenet_res183_1net_standard_bar1.yaml
Linear-MixMo	3	68.36	exp_tinyimagenet_res183-2_linearmixmo_standard_bar4.yaml
Cut-MixMo	3	70.23	exp_tinyimagenet_res183-2_cutmixmo-p5_standard_bar4.yaml

Installation

Requirements overview

python >= 3.6
torch >= 1.4.0
torchsummary >= 1.5.1
torchvision >= 0.5.0
tensorboard >= 1.14.0

Procedure

Clone the repo:

$ git clone https://github.com/alexrame/mixmo-pytorch.git

Install this repository and the dependencies using pip:

$ conda create --name mixmo python=3.6.10
$ conda activate mixmo
$ cd mixmo-pytorch
$ pip install -r requirements.txt

With this, you can edit the MixMo code on the fly.

Datasets

We advise first creating a dedicated data folder dataplace, which will be passed as an argument to the subsequent scripts.

CIFAR

The CIFAR-10 and CIFAR-100 datasets are managed by the PyTorch dataloader. The first time you run a script, the dataloader downloads the dataset into the dataplace you provided.

Tiny-ImageNet

The Tiny-ImageNet dataset needs to be downloaded beforehand. The following process is forked from manifold mixup.

Download the zipped data from https://tiny-imagenet.herokuapp.com/.
Extract the zipped data in folder dataplace.
Run the following script (this will arrange the validation data in the format required by the PyTorch loader).

$ python scripts/script_load_tiny_data.py --dataplace $dataplace
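
For readers curious what such a rearrangement involves, here is a hedged sketch. It is not the content of scripts/script_load_tiny_data.py. It assumes the standard Tiny-ImageNet layout (val/images/ plus a tab-separated val/val_annotations.txt whose first two columns are the file name and the class id) and targets one folder per class, which is the layout folder-based image loaders typically expect. The function name arrange_validation_folder is hypothetical.

```python
import os
import shutil


def arrange_validation_folder(val_dir):
    """Move val/images/<file> into val/<class id>/<file>.

    Returns the number of files moved. Files that were already moved
    (or are missing) are skipped, so the function can be re-run safely.
    """
    annotations = os.path.join(val_dir, "val_annotations.txt")
    images_dir = os.path.join(val_dir, "images")
    moved = 0
    with open(annotations) as handle:
        for line in handle:
            fields = line.split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            filename, class_id = fields[0], fields[1].strip()
            source = os.path.join(images_dir, filename)
            if not os.path.exists(source):
                continue
            class_dir = os.path.join(val_dir, class_id)
            os.makedirs(class_dir, exist_ok=True)
            shutil.move(source, os.path.join(class_dir, filename))
            moved += 1
    return moved
```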

Running the code

Training

Baseline

First, to train a baseline model, simply execute the following command:

$ python3 scripts/train.py --config_path config/cifar100/exp_cifar100_wrn2810_1net_standard_bar1.yaml --dataplace $dataplace --saveplace $saveplace

This creates an output folder exp_cifar100_wrn2810_1net_standard_bar1 inside the parent folder saveplace. The folder contains model checkpoints, a copy of your config file, logs and TensorBoard logs. By default, if the output folder already exists, training loads the weights from the last saved epoch and resumes from there. To force a restart from scratch, add --from_scratch as an argument.

MixMo

When training MixMo, you just need to select the appropriate config file. For example, to obtain state of the art results on CIFAR-100 by combining Cut-MixMo and CutMix, just execute:

$ python3 scripts/train.py --config_path config/cifar100/exp_cifar100_wrn2810-2_cutmixmo-p5_msdacutmix_bar4.yaml --dataplace $dataplace --saveplace $saveplace

Evaluation

To evaluate the accuracy of a given strategy, you can train your own model, or just download our pretrained checkpoints:

$ python3 scripts/evaluate.py --config_path config/cifar100/exp_cifar100_wrn2810-2_cutmixmo-p5_msdacutmix_bar4.yaml --dataplace $dataplace --checkpoint $checkpoint --tempscal

checkpoint can be either:
- a path towards a checkpoint.
- an int matching the training epoch you wish to evaluate. In that case, you need to provide --saveplace $saveplace.
- the string best: we then automatically select the best training epoch. In that case, you need to provide --saveplace $saveplace.
--tempscal: indicates that you will apply temperature scaling
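
The three accepted forms of checkpoint can be told apart as in the sketch below. This is an illustration only: the function names are hypothetical, and how scripts/evaluate.py actually resolves the argument to a file is not shown in this README.

```python
def classify_checkpoint_argument(value):
    """Return ('best', None), ('epoch', int) or ('path', str)."""
    if value == "best":
        return ("best", None)
    if value.isdigit():
        return ("epoch", int(value))
    return ("path", value)


def needs_saveplace(value):
    """An epoch number or 'best' can only be resolved inside --saveplace."""
    kind, _ = classify_checkpoint_argument(value)
    return kind in ("best", "epoch")
```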

Results will be printed at the end of the script.
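
As a reminder of what temperature scaling does, the sketch below divides the logits by a scalar T before the softmax and picks T by minimizing the negative log-likelihood on held-out data. It is a pure-Python illustration with hypothetical function names. The simple grid search is swapped in for brevity and is not necessarily how the --tempscal option or the referenced temperature scaling repository fits T.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(all_logits, labels, temperature):
    """Mean NLL of the true labels under the temperature-scaled softmax."""
    total = 0.0
    for logits, label in zip(all_logits, labels):
        total -= math.log(softmax_with_temperature(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(all_logits, labels, candidates):
    """Grid search: return the candidate temperature with the lowest NLL."""
    return min(candidates,
               key=lambda t: negative_log_likelihood(all_logits, labels, t))


# Overconfident toy predictions: same logits, but only one label is right.
held_out_logits = [[5.0, 0.0], [5.0, 0.0]]
held_out_labels = [0, 1]
best_t = fit_temperature(held_out_logits, held_out_labels, [0.5, 1.0, 2.0, 4.0])
```

Dividing by T does not change the predicted class, so accuracy is unaffected and only the confidence of the predictions is recalibrated.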

If you wish to test the models against common corruptions and perturbations, download the CIFAR-100-c dataset in your dataplace. Then use --robustness at evaluation.

Create your own configuration files and learning strategies

You can create new configs automatically via:

$ python3 scripts/templateutils_mixmo.py --template_path scripts/exp_mixmo_template.yaml --config_dir config/$your_config_dir --dataset $dataset

Acknowledgements and references

Our implementation is based on the repository: https://github.com/valeoai/ConfidNet. We thus thank Charles Corbière for his work Addressing Failure Prediction by Learning Model Confidence.
MIMO: https://github.com/google/edward2/
CutMix: https://github.com/ildoonet/cutmix/
Mixup: https://github.com/facebookresearch/mixup-cifar10
AugMix: https://github.com/google-research/augmix/
Temperature Scaling: https://github.com/gpleiss/temperature_scaling/
Metrics:


Training speed of mixmo is very slow
I followed your instructions to train mixmo on cifar-100. However, the training is very slow compared to the baseline. Can you tell me how much time it takes to train mixmo on CIFAR-100 in your experiments?

$ python3 scripts/train.py --config_path config/cifar100/exp_cifar100_wrn2810-2_cutmixmo-p5_msdacutmix_bar4.yaml --dataplace $dataplace --saveplace $saveplace
opened by chenshen03 3
Loss weighting

Hi! I have a question about the loss re-weighting. As I understand it, an image with higher k should be assigned a lower loss weight, since it is easier to predict with more information disclosed. As in Fig. 3, the cat image has area k and the dog image 1-k. The computed loss weight is w(1-k) for cat and w(k) for dog. The higher k, the higher w(k). But in the code, it seems the computed weight w(k) is assigned to the image with ratio k, rather than the opposite. Which one is correct? How should I understand the difference here?

opened by milliema 2
Data augmentation causes difference in input batches

Thanks for the great work! I'm currently trying to adopt MixMo for my own projects, and I found some modifications that differ from Google's work, MIMO. For input batch generation, suppose we use input repetition=1.0; technically the indexes and images for the 2 inputs of the 2 experts should be exactly the same. In Google's implementation, they read the images first, and then construct the 2 inputs for the experts based on the input repetition value (shuffling part of the indexes and keeping the rest unchanged). In your implementation, you compute the indexes first (based on input repetition), and then read the images accordingly. The problem is, if we use default data augmentation (e.g. random cropping or flipping, which is also used in your code), identical indexes do not imply identical images, because DA is applied to these images randomly! In Google's implementation, since the images are read first, this issue does not exist. I hope I've made my point clear. I'm wondering how this would affect performance, and which implementation is correct or reasonable? I'd like to hear your opinions. I'd appreciate a prompt reply!

opened by milliema 2
Some problems about baseline
While training the baseline network (using exp_cifar10_wrn2810_1net_standard_bar1.yaml), I would like to check whether I understand the following correctly:

mixmo uses random sampling (learning every sample once per epoch)

mixmo uses DADataset

mixmo uses WideResNet-28-10

mixmo with warmup lr in first epoch and reduce to 0.1 * lr in [101,201,226] epoch

mixmo with L2 regular for net params

mixmo uses the specified initialization

I look forward to your comment if I have missed anything !!! Thanks!!!
opened by ZzBros 1
Typo in the argument to evaluate models against the CIFAR-100-c dataset

to evaluate models against corrupted CIFAR dataset, the right argument should be "--corruptions" not "--robustness" as mentioned in README.md/Evaluation

opened by cagdemir 1
cuDNN error

Hello, can u tell me your GPU type and CUDA version? I tried to use CUDA V11.0.221 + RTX3090 to train the baseline of the model. Then I got an error RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR. Actually, I am training the baseline on my CPU and it is working fine but slowly. I want to use the GPU to help me get higher training speed. Thanks for reading.

opened by GiorgioPeng 1
