Diverse Branch Block: Building a Convolution as an Inception-like Unit

Last update: Dec 24, 2022

Overview

Diverse Branch Block: Building a Convolution as an Inception-like Unit (PyTorch) (CVPR-2021)

DBB is a powerful ConvNet building block that replaces a regular conv layer. It improves performance without any extra inference-time cost. This repo contains the code for building a DBB and converting it into a single conv. You can also obtain the equivalent kernel and bias in a differentiable way at any time (get_equivalent_kernel_bias in diversebranchblock.py), which may help training-based pruning or quantization.

This is the PyTorch implementation. The MegEngine version is at https://github.com/megvii-model/DiverseBranchBlock

Paper: https://arxiv.org/abs/2103.13425

Update: released the code for building the block, transformations and verification.

Update: a more efficient implementation of BNAndPadLayer

Sometimes I call it ACNet v2 because 'DBB' is two bits larger than 'ACB' in ASCII. (lol)

We provide the trained models and a super simple PyTorch-official-example-style training script to reproduce the results.

Abstract

We propose a universal building block of Convolutional Neural Network (ConvNet) to improve the performance without any inference-time costs. The block is named Diverse Branch Block (DBB), which enhances the representational capacity of a single convolution by combining diverse branches of different scales and complexities to enrich the feature space, including sequences of convolutions, multi-scale convolutions, and average pooling. After training, a DBB can be equivalently converted into a single conv layer for deployment. Unlike the advancements of novel ConvNet architectures, DBB complicates the training-time microstructure while maintaining the macro architecture, so that it can be used as a drop-in replacement for regular conv layers of any architecture. In this way, the model can be trained to reach a higher level of performance and then transformed into the original inference-time structure for inference. DBB improves ConvNets on image classification (up to 1.9% higher top-1 accuracy on ImageNet), object detection and semantic segmentation.
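The conversion into a single conv relies on convolution and inference-mode batch normalization both being linear. As a rough illustration of the simplest case, the sketch below folds a BatchNorm2d into the preceding Conv2d. The helper name fuse_conv_bn and the layer sizes are assumptions made for this example; this is not the repo's own transform code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return (kernel, bias) of a single conv equivalent to bn(conv(x)) in eval mode."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                      # one factor per output channel
    kernel = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    bias = bn.bias + (conv_bias - bn.running_mean) * scale
    return kernel, bias


torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False).double()
bn = nn.BatchNorm2d(8).double()

# Give BN non-trivial statistics so the check is meaningful.
with torch.no_grad():
    bn.running_mean.normal_()
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.normal_()
    bn.bias.normal_()
conv.eval()
bn.eval()

x = torch.randn(2, 3, 16, 16, dtype=torch.float64)
with torch.no_grad():
    kernel, bias = fuse_conv_bn(conv, bn)
    reference = bn(conv(x))                      # training-time structure (eval mode)
    fused = F.conv2d(x, kernel, bias, padding=1) # single equivalent conv
fusion_error = (reference - fused).abs().max().item()
print(fusion_error)
```

The equivalence only holds with the BN running statistics, which is why both modules are put into eval mode before comparing.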

Use our pretrained models

You may download the models reported in the paper from Google Drive (https://drive.google.com/drive/folders/1BPuqY_ktKz8LvHjFK5abD0qy3ESp8v6H?usp=sharing) or Baidu Cloud (https://pan.baidu.com/s/1wPaQnLKyNjF_bEMNRo4z6Q, the access code is "dbbk"). Currently only the ResNet-18 models are available; the others will be released very soon. To ease transfer learning on other tasks, we provide both training-time and inference-time models. Taking ResNet-18 as an example, and assuming IMGNET_PATH is the directory that contains the "train" and "val" directories of ImageNet, you may test the accuracy by running

python test.py IMGNET_PATH train ResNet-18_DBB_7101.pth -a ResNet-18 -t DBB

Here "train" indicates the training-time structure.

Convert the training-time models into inference-time

You may convert a trained model into the inference-time structure with

python convert.py [weights file of the training-time model to load] [path to save] -a [architecture name]

For example,

python convert.py ResNet-18_DBB_7101.pth ResNet-18_DBB_7101_deploy.pth -a ResNet-18

Then you may test the inference-time model by

python test.py IMGNET_PATH deploy ResNet-18_DBB_7101_deploy.pth -a ResNet-18 -t DBB

Note that the argument "deploy" builds an inference-time model.

ImageNet training

The multi-processing training script in this repo is based on the official PyTorch example for simplicity and readability. The modifications are the model-building part and the cosine learning rate scheduler. You may train and test like this:

python train.py -a ResNet-18 -t DBB --dist-url tcp://127.0.0.1:23333 --dist-backend nccl --multiprocessing-distributed --world-size 1 --rank 0 --workers 64 IMGNET_PATH
python test.py IMGNET_PATH train model_best.pth.tar -a ResNet-18
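For readers unfamiliar with cosine scheduling, the sketch below shows the common form of cosine annealing, where the learning rate follows half a cosine period from its base value down to a minimum. It is a generic formulation and not the code in train.py; the function name cosine_lr, the base rate of 0.1 and the 100-step horizon are assumptions chosen for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: 100 steps, base learning rate 0.1.
schedule = [cosine_lr(s, 100, 0.1) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts at base_lr, passes through the midpoint between base_lr and min_lr halfway through training, and ends at min_lr.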

Use like this in your own code

Assume your model is like

class SomeModel(nn.Module):
    def __init__(self, ...):
        ...
        self.some_conv = nn.Conv2d(...)
        self.some_bn = nn.BatchNorm2d(...)
        ...
        
    def forward(self, inputs):
        out = ...
        out = self.some_bn(self.some_conv(out))
        ...

For training, just use DiverseBranchBlock to replace the conv-BN. Then SomeModel will be like

class SomeModel(nn.Module):
    def __init__(self, ...):
        ...
        self.some_dbb = DiverseBranchBlock(..., deploy=False)
        ...
        
    def forward(self, inputs):
        out = ...
        out = self.some_dbb(out)
        ...

Train the model just like you train the other regular models. Then call switch_to_deploy of every DiverseBranchBlock, test, and save.

model = SomeModel(...)
train(model)
for m in model.modules():
    if hasattr(m, 'switch_to_deploy'):
        m.switch_to_deploy()
test(model)
save(model)
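To make the train, switch, test flow concrete without depending on this repo, here is a small sketch with a toy block. ToyTwoBranchBlock is a hypothetical stand-in (a 3x3 conv plus a parallel 1x1 conv, no BN), not the repo's DiverseBranchBlock; it only mimics the switch_to_deploy interface so that the loop above can be exercised end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTwoBranchBlock(nn.Module):
    """Hypothetical re-parameterizable block: conv3x3(x) + conv1x1(x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1x1 = nn.Conv2d(channels, channels, 1)
        self.reparam = None                      # set by switch_to_deploy

    def forward(self, x):
        if self.reparam is not None:
            return self.reparam(x)
        return self.conv3x3(x) + self.conv1x1(x)

    def switch_to_deploy(self):
        if self.reparam is not None:
            return
        # Zero-pad the 1x1 kernel to 3x3 so the two kernels can be added.
        kernel = self.conv3x3.weight + F.pad(self.conv1x1.weight, [1, 1, 1, 1])
        bias = self.conv3x3.bias + self.conv1x1.bias
        fused = nn.Conv2d(self.channels, self.channels, 3, padding=1)
        fused = fused.to(device=kernel.device, dtype=kernel.dtype)
        with torch.no_grad():
            fused.weight.copy_(kernel)
            fused.bias.copy_(bias)
        self.reparam = fused
        del self.conv3x3, self.conv1x1           # training-time branches are no longer needed


torch.manual_seed(0)
model = nn.Sequential(ToyTwoBranchBlock(4), nn.ReLU(), ToyTwoBranchBlock(4)).double().eval()
x = torch.randn(1, 4, 10, 10, dtype=torch.float64)

with torch.no_grad():
    before = model(x)
for m in list(model.modules()):                  # snapshot, since blocks mutate themselves
    if hasattr(m, 'switch_to_deploy'):
        m.switch_to_deploy()
with torch.no_grad():
    after = model(x)

deploy_error = (before - after).abs().max().item()
num_convs = sum(isinstance(m, nn.Conv2d) for m in model.modules())
print(deploy_error, num_convs)
```

After the switch each toy block holds exactly one conv, and the outputs agree up to floating-point error.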

FAQs

Q: Is the inference-time model's output the same as the training-time model?

A: Yes. You can verify that by

python dbb_verify.py

Q: What is the relationship between DBB and RepVGG?

A: RepVGG is a plain architecture, and the RepVGG-style structural re-param is designed for the plain architecture. On a non-plain architecture, a RepVGG block shows no superiority compared to a single 3x3 conv (it improves Res-50 by only 0.03%, as reported in the RepVGG paper). DBB is a universal building block that can be used on numerous architectures.

Q: How to quantize a model with DBB?

A1: Post-training quantization. After training and conversion, you may quantize the converted model with any post-training quantization method. Then you may insert a BN after the conv converted from a DBB and finetune to recover the accuracy just like you quantize and finetune the other models. This is the recommended solution.

A2: Quantization-aware training. During quantization-aware training, instead of constraining the params of a single kernel as you would for an ordinary conv (e.g., forcing every param into {-127, -126, ..., 126, 127} for int8), you should constrain the equivalent kernel of a DBB (get_equivalent_kernel_bias()).
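One common way to constrain a kernel during quantization-aware training is fake quantization with a straight-through estimator. The sketch below applies it to a stand-in equivalent kernel built as the sum of two branch kernels; with a real DBB, the kernel would instead come from get_equivalent_kernel_bias(). The function name fake_quantize_int8, the per-tensor symmetric scale and the tensor shapes are assumptions for illustration, not the author's recipe.

```python
import torch


def fake_quantize_int8(kernel: torch.Tensor):
    """Symmetric per-tensor int8 fake quantization with a straight-through estimator."""
    qmax = 127
    scale = kernel.detach().abs().max().clamp(min=1e-12) / qmax
    quantized = torch.clamp(torch.round(kernel / scale), -qmax, qmax) * scale
    # Forward pass uses the quantized values; backward pass treats rounding as identity.
    return kernel + (quantized - kernel).detach(), scale


torch.manual_seed(0)
# Stand-in for the differentiable equivalent kernel of a multi-branch block.
branch_a = torch.randn(8, 4, 3, 3, requires_grad=True)
branch_b = torch.randn(8, 4, 3, 3, requires_grad=True)
equivalent_kernel = branch_a + branch_b

q_kernel, scale = fake_quantize_int8(equivalent_kernel)
levels = (q_kernel / scale).detach()             # should sit on the integer grid

q_kernel.sum().backward()                        # gradients reach the branch params
print(levels.abs().max().item(), branch_a.grad.abs().sum().item())
```

Because the constraint is placed on the equivalent kernel rather than on each branch, the converted single conv is the one that ends up on the int8 grid.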

Q: I tried to finetune your model with multiple GPUs but got an error. Why are the names of params like "xxxx.weight" in the downloaded weight file but sometimes like "module.xxxx.weight" (shown by nn.Module.named_parameters()) in my model?

A: DistributedDataParallel may prefix "module." to the name of params and cause a mismatch when loading weights by name. The simplest solution is to load the weights (model.load_state_dict(...)) before DistributedDataParallel(model). Otherwise, you may insert "module." before the names like this

checkpoint = torch.load(...)    # This is just a name-value dict
ckpt = {('module.' + k) : v for k, v in checkpoint.items()}
model.load_state_dict(ckpt)

Likewise, if the param names in the checkpoint file start with "module." but those in your model do not, you may strip the names like

ckpt = {(k[len('module.'):] if k.startswith('module.') else k): v for k, v in checkpoint.items()}   # strip only the leading "module."
model.load_state_dict(ckpt)
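Both renamings can be wrapped in small helpers that only touch a leading prefix and leave already-correct names alone. The sketch below works on plain dictionaries, so it runs without a checkpoint; the helper names add_prefix and strip_prefix and the example keys are made up for illustration.

```python
def add_prefix(state_dict, prefix='module.'):
    """Prepend `prefix` to every key that does not already start with it."""
    return {(k if k.startswith(prefix) else prefix + k): v for k, v in state_dict.items()}


def strip_prefix(state_dict, prefix='module.'):
    """Remove a leading `prefix` from every key that has it."""
    return {(k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state_dict.items()}


# Hypothetical checkpoint contents; real values would be tensors.
checkpoint = {'stage1.0.conv1.weight': 1, 'linear.bias': 2}

wrapped = add_prefix(checkpoint)
restored = strip_prefix(wrapped)
print(sorted(wrapped))
print(restored == checkpoint)
```

Matching on the prefix only avoids accidentally rewriting a parameter whose name happens to contain "module." somewhere in the middle.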

Q: So a DBB derives the equivalent KxK kernels before each forwarding to save computations?

A: No! More precisely, we do the conversion only once right after training. Then the training-time model can be discarded, and every resultant block is just a KxK conv. We only save and use the resultant model.

Contact

[email protected]

Google Scholar Profile: https://scholar.google.com/citations?user=CIjw0KoAAAAJ&hl=en

My open-sourced papers and repos:

Simple and powerful VGG-style ConvNet architecture (preprint, 2021): RepVGG: Making VGG-style ConvNets Great Again (https://github.com/DingXiaoH/RepVGG)

State-of-the-art channel pruning (preprint, 2020): Lossless CNN Channel Pruning via Decoupling Remembering and Forgetting (https://github.com/DingXiaoH/ResRep)

CNN component (ICCV 2019): ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks (https://github.com/DingXiaoH/ACNet)

Channel pruning (CVPR 2019): Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure (https://github.com/DingXiaoH/Centripetal-SGD)

Channel pruning (ICML 2019): Approximated Oracle Filter Pruning for Destructive CNN Width Optimization (https://github.com/DingXiaoH/AOFP)

Unstructured pruning (NeurIPS 2019): Global Sparse Momentum SGD for Pruning Very Deep Neural Networks (https://github.com/DingXiaoH/GSM-SGD)

Comments

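A question about the caveat in TRANS III

Hello, and sorry to bother you! I have been reading your paper and came across TRANS III. The transformation is certainly novel, but I do not quite understand the caveat you mention: if the second (K*K) layer zero-pads its input, Equation 8 no longer holds, and the fix is to pad with REP(b1), the bias of the conv obtained from the first equivalent conversion. Could you explain in detail why the equation fails and why this fix works? Thank you!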

opened by Orangerccc 10
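
Converting the models

Hello, when I downloaded the models from Google Drive / Baidu Cloud, I found that resnet18 is a folder with no model inside, while resnet50 does have a corresponding model. However, when I convert it with convert.py, line 27, train_model.load_state_dict(ckpt), raises an error about mismatched keys. The error message is as follows (partially omitted): RuntimeError: Error(s) in loading state_dict for ResNet: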
Missing key(s) in state_dict: "stage1.0.conv2.dbb_avg.bn.bn.weight", "stage1.0.conv2.dbb_avg.bn.bn.bias", "stage1.0.conv2.dbb_avg.bn.bn.running_mean", "stage1.0.conv2.dbb_avg.bn.bn.running_var", " stage1.0.conv2.dbb_1x1_kxk.bn1.bn.weight", "stage1.0.conv2.dbb_1x1_kxk.bn1.bn.bias", "stage1.0.conv2.dbb_1x1_kxk.bn1.bn.running_mean", "stage1.0.conv2.dbb_1x1_kxk.bn1.bn.running_var", "stage1.1.conv2.dbb_avg.bn.bn.weight", "stage1.1.conv2.dbb_avg.bn.bn.bias", "stage1.1.conv2.dbb_avg.bn.bn.running_mean", "stage1.1.conv2.dbb_avg.bn.bn.running_var", "stage1.1.conv2.dbb_1x1_kxk.bn1.bn.weight", "stage1.1.conv2.dbb_1x1_kxk.bn1.bn.bias", "stage1.1.conv2.dbb_1x1_kxk.bn1.bn.running_mean", "stage1.1.conv2.dbb_1x1_kxk.bn1.bn.running_var", "stage1.2.conv2.dbb_avg.bn.bn.weight", "stage1.2.conv2.dbb_avg.bn.bn.bias", "stage1.2.conv2.dbb_avg.bn.bn.running_mean", "stage1.2.conv2.dbb_avg.bn.bn.running_var", "stage1.2.conv2.dbb_1x1_kxk.bn1.bn.weight", "stage1.2.conv2.dbb_1x1_kxk.bn1.bn.bias", "stage1.2.conv2.dbb_1x1_kxk.bn1.bn.running_mean", "stage1.2.conv2.dbb_1x1_kxk.bn1.bn.running_var"........

Unexpected key(s) in state_dict: "stage1.0.conv2.dbb_avg.bn.weight", "stage1.0.conv2.dbb_avg.bn.bias", "stage1.0.conv2.dbb_avg.bn.running_mean", "stage1.0.conv2.dbb_avg.bn.running_var", "stage1.0conv2.dbb_avg.bn.num_batches_tracked", "stage1.0.conv2.dbb_1x1_kxk.bn1.weight", "stage1.0.conv2.dbb_1x1_kxk.bn1.bias", "stage1.0.conv2.dbb_1x1_kxk.bn1.running_mean", "stage1.0.conv2.dbb_1x1_kxk.bn1.running_var", "stage1.0.conv2.dbb_1x1_kxk.bn1.num_batches_tracked", "stage1.1.conv2.dbb_avg.bn.weight", "stage1.1.conv2.dbb_avg.bn.bias", "stage1.1.conv2.dbb_avg.bn.running_mean", "stage1.1.conv2.dbb_avg.bn.runing_var", "stage1.1.conv2.dbb_avg.bn.num_batches_tracked", "stage1.1.conv2.dbb_1x1_kxk.bn1.weight", "stage1.1.conv2.dbb_1x1_kxk.bn1.bias", "stage1.1.conv2.dbb_1x1_kxk.bn1.running_mean", "stage1.1.conv2.dbb_1x1_kxk.bn1.running_var", "stage1.1.conv2.dbb_1x1_kxk.bn1.num_batches_tracked", "stage1.2.conv2.dbb_avg.bn.weight", "stage1.2.conv2.dbb_avg.bn.bias", "stage1.2.conv2.dbb_avg.bn.running_mean", "stage1.conv2.dbb_avg.bn.running_var", "stage1.2.conv2.dbb_avg.bn.num_batches_tracked", "stage1.2.conv2.dbb_1x1_kxk.bn1.weight", "stage1.2.conv2.dbb_1x1_kxk.bn1.bias", "stage1.2.conv2.dbb_1x1_kxk.bn1.running_mean", "stage1.2.conv2.dbb_1x1_kxk.bn1.running_var", "stage1.2.conv2.dbb_1x1_kxk.bn1.num_batches_tracked", "stage2.0.conv2.dbb_avg.bn.weight", "stage2.0.conv2.dbb_avg.bn.bias", "stage2.0.conv2.dbb_avg.bn.runing_mean", "stage2.0.conv2.dbb_avg.bn.running_var".......

opened by dada-thu 3

transIII_1x1_kxk does not behave as expected

Hi,

I verified like this:

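import torch
import torch.nn as nn
from dbb_transforms import transIII_1x1_kxk  # transform helpers from this repo

conv1 = nn.Conv2d(32, 64, 1, 1, 0, bias=True)
conv2 = nn.Conv2d(64, 128, 3, 1, 1, bias=True)
conv = nn.Conv2d(32, 128, 3, 1, 1, bias=True)

with torch.no_grad():  # in-place copies into parameters must not be tracked by autograd
    k, b = transIII_1x1_kxk(conv1.weight, conv1.bias, conv2.weight, conv2.bias, 1)
    conv.weight.copy_(k)
    conv.bias.copy_(b)
    inten = torch.randn(2, 32, 224, 224)
    out1 = conv2(conv1(inten))
    out2 = conv(inten)
print((out1 - out2).abs().max())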

And the output is 0.11, which is far too large. Have you noticed this?

opened by CoinCheung 2
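A plausible reading of this report, consistent with the padding caveat raised in the TRANS III question above, is that the mismatch comes from zero-padding the intermediate feature map: the merged conv pads the input, and a zero input pixel passed through the 1x1 conv yields its bias b1, not zero. The sketch below re-derives the merge in plain PyTorch to check this. The helper merge_1x1_kxk is written for this example and is not the repo's transIII_1x1_kxk; the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F


def merge_1x1_kxk(k1, b1, k2, b2):
    """Merge a 1x1 conv (k1: D,C,1,1) followed by a KxK conv (k2: E,D,K,K)."""
    kernel = torch.einsum('edhw,dc->echw', k2, k1[:, :, 0, 0])
    bias = (k2 * b1.view(1, -1, 1, 1)).sum(dim=(1, 2, 3)) + b2
    return kernel, bias


torch.manual_seed(0)
dt = torch.float64
x = torch.randn(2, 4, 8, 8, dtype=dt)
k1, b1 = torch.randn(6, 4, 1, 1, dtype=dt), torch.randn(6, dtype=dt)
k2, b2 = torch.randn(5, 6, 3, 3, dtype=dt), torch.randn(5, dtype=dt)

k, b = merge_1x1_kxk(k1, b1, k2, b2)
merged = F.conv2d(x, k, b, padding=1)

# (a) Zero-pad the intermediate map, as in the snippet above.
zero_padded = F.conv2d(F.conv2d(x, k1, b1), k2, b2, padding=1)

# (b) Pad the input before the 1x1 conv, so the intermediate border equals b1.
bias_padded = F.conv2d(F.conv2d(F.pad(x, [1, 1, 1, 1]), k1, b1), k2, b2)

err_zero = (merged - zero_padded).abs().max().item()
err_bias = (merged - bias_padded).abs().max().item()
interior_err = (merged - zero_padded)[:, :, 1:-1, 1:-1].abs().max().item()
print(err_zero, err_bias, interior_err)
```

Under these assumptions the merged conv matches variant (b) everywhere and variant (a) only away from the border, which suggests that the discrepancy lives in the outermost ring of output pixels.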

Worse performance with DiverseBlock for Cifar10, ResNet18

Hello.

Thank you for your interesting work, and code.

I tried using your Diverseblock in ResNet18 (according to your instructions, replacing conv+bn with diverse blocks). My code is based on https://github.com/kuangliu/pytorch-cifar. The accuracy drops from 95.4% to 95.1%. Do you have any ideas for why this is?

Thank you.

opened by WilhelmT 1

Multi-class classification performance of Res18 with DBB

Hello, after reading your paper I tried using a Res18 network built with DBB modules for my own multi-class classification task, as follows:

import torch
import torch.nn as nn
from DiverseBranchBlock.convnet_utils import switch_deploy_flag, switch_conv_bn_impl, build_model

def Dbb_Res(num_classes, pretrained=True):
    switch_deploy_flag(False)
    switch_conv_bn_impl('DBB')
    model = build_model('ResNet-18')
    if pretrained == True:
        model.load_state_dict(torch.load('DiverseBranchBlock\ResNet-18_DBB_7099.pth'))
    in_features = model.linear.in_features
    model.linear = nn.Linear(in_features, num_classes)
    return model

In practice, however, the results are terrible: a pretrained res18 reaches 80% accuracy, while the network built as above only reaches 6% (-.-). Am I calling it incorrectly, and how should I adjust it? Thank you for your help!

opened by HOLYlmx 0
why padding == kernel_size // 2 is asserted?

https://github.com/DingXiaoH/DiverseBranchBlock/blob/be15be76a5556e04b2b44411a69994abcd1f25eb/diversebranchblock.py#L105 Why should padding be equal to kernel_size // 2? What about Conv2d(kernel_size=4, stride=2, padding=1)?

opened by PennyPeng369 1
Maybe need to reverse `H_pixels_to_pad` & `W_pixels_to_pad`?

Hi, I wonder whether this should be F.pad(kernel, [W_pixels_to_pad, W_pixels_to_pad, H_pixels_to_pad, H_pixels_to_pad]), since F.pad expects the padding in the order [padding_left, padding_right, padding_top, padding_bottom]: https://github.com/DingXiaoH/DiverseBranchBlock/blob/cd627d5089eaa25dedaa258b189fde508586a2f7/dbb_transforms.py#L44

Best

opened by CiaoHe 0
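The ordering in question can be checked directly: F.pad consumes its padding list starting from the last dimension, so the first pair applies to the width and the second pair to the height. The short sketch below uses an arbitrary tensor shape chosen so that the two axes are easy to tell apart.

```python
import torch
import torch.nn.functional as F

# A kernel-like tensor with height 2 and width 3.
kernel = torch.ones(1, 1, 2, 3)

# Pad the width by 1 on each side and the height by 2 on each side.
W_pixels_to_pad, H_pixels_to_pad = 1, 2
padded = F.pad(kernel, [W_pixels_to_pad, W_pixels_to_pad, H_pixels_to_pad, H_pixels_to_pad])

padded_shape = tuple(padded.shape)
print(padded_shape)
```

For square kernels padded equally in both directions the two orderings give the same result, so the difference only shows up for non-square kernels or asymmetric padding.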
ValueError: some parameters appear in more than one parameter group

Hi, when I used DiverseBranchBlock to replace Conv-BN in my network, I got this error: ValueError: some parameters appear in more than one parameter group. Have you met it before?

opened by lidehuihxjz 2
