RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition (PyTorch)



Citation: Ding, Xiaohan, Zhang, Xiangyu, Han, Jungong, and Ding, Guiguang. "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition." arXiv preprint arXiv:2105.01883.

How to use the code

If you want to use RepMLP as a building block in your model, just check the RepMLP definition in this repo. It also shows an example of checking the equivalence between a training-time and an inference-time RepMLP. You can see that by running it.


Just use it like this

from import *
your_model = YourModel(...)   # It has RepMLPs somewhere
deploy_model = repmlp_model_convert(your_model)

From repmlp_model_convert, you will see that the conversion is as simple as calling switch_to_deploy of every RepMLP.
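
The conversion described above can be sketched as a walk over all submodules. This is a minimal sketch, not the repo's actual repmlp_model_convert: the name convert_sketch and the stand-in classes are hypothetical, and the only assumptions are that the model exposes modules() (as torch.nn.Module does) and that each RepMLP exposes switch_to_deploy().

```python
import copy


def convert_sketch(model, do_copy=True):
    """Call switch_to_deploy() on every submodule that defines it.

    Hypothetical helper for illustration; `model` only needs a
    modules() iterator, which torch.nn.Module provides.
    """
    if do_copy:
        # Keep the training-time model untouched.
        model = copy.deepcopy(model)
    for module in model.modules():
        if hasattr(module, "switch_to_deploy"):
            module.switch_to_deploy()
    return model


class FakeRepMLP:
    """Stand-in for a RepMLP block (hypothetical, no real weights)."""

    def __init__(self):
        self.deploy = False

    def switch_to_deploy(self):
        self.deploy = True

    def modules(self):
        yield self


class FakeModel:
    """Stand-in for a network that has RepMLPs somewhere."""

    def __init__(self):
        self.blocks = [FakeRepMLP(), FakeRepMLP()]

    def modules(self):
        yield self
        for block in self.blocks:
            yield from block.modules()


train_model = FakeModel()
deploy_model = convert_sketch(train_model)
print(all(b.deploy for b in deploy_model.blocks))   # True
print(any(b.deploy for b in train_model.blocks))    # False
```

Copying first keeps the training-time model available for further finetuning, while the returned copy is the one to save and deploy.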

The two block structures (RepMLP Bottleneck and RepMLP Light) and the RepMLP-ResNet are defined in this repo.

Use our pre-trained models

You may download our pre-trained models from Google Drive or Baidu Cloud (the access key of Baidu is "rmlp").

python [imagenet-folder] train RepMLP-Res50-light-224_train.pth -a RepMLP-Res50-light-224

Here imagenet-folder should contain the "train" and "val" folders. The default input resolution is 224x224. Here "train" indicates the training-time architecture.

You may convert them into the inference-time structure and test again to check the equivalence. For example

python RepMLP-Res50-light-224_train.pth RepMLP-Res50-light-224_deploy.pth -a RepMLP-Res50-light-224
python [imagenet-folder] deploy RepMLP-Res50-light-224_deploy.pth -a RepMLP-Res50-light-224

Now "deploy" indicates the inference-time structure (without Local Perceptron).


We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition).
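
To make the merging step concrete, the following sketch shows that a convolution on a fixed-size input is a linear map and can therefore be written as an FC weight matrix: convolving each one-hot basis image yields one column of that matrix. It is a toy, pure-Python illustration for a single-channel 3x3 image with a 3x3 kernel and zero padding. The helper names (conv2d_same, conv_to_fc, fc_forward) are hypothetical, and the sketch does not reproduce the repo's actual implementation, which works on multi-channel, grouped layers.

```python
def conv2d_same(img, kernel):
    """Cross-correlate an HxW image with a KxK kernel (K odd), zero padded."""
    h, w, k = len(img), len(img[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * img[y][x]
            out[i][j] = acc
    return out


def conv_to_fc(kernel, h, w):
    """Build the (h*w) x (h*w) FC matrix equivalent to conv2d_same."""
    n = h * w
    fc = [[0.0] * n for _ in range(n)]
    for p in range(n):
        # One-hot basis image: a single row of the identity matrix.
        basis = [[1.0 if r * w + c == p else 0.0 for c in range(w)]
                 for r in range(h)]
        response = conv2d_same(basis, kernel)
        for q in range(n):
            # The response to basis image p is column p of the FC matrix.
            fc[q][p] = response[q // w][q % w]
    return fc


def fc_forward(fc, img):
    """Apply the FC matrix to the flattened image."""
    flat = [v for row in img for v in row]
    return [sum(wt * v for wt, v in zip(row, flat)) for row in fc]


kernel = [[1.0, 2.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -2.0, -1.0]]
image = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]

conv_out = [v for row in conv2d_same(image, kernel) for v in row]
fc_out = fc_forward(conv_to_fc(kernel, 3, 3), image)
max_diff = max(abs(a - b) for a, b in zip(conv_out, fc_out))
print(max_diff)
```

The zeros in each basis image do not discard information: each basis image isolates the contribution of one input position, and the columns taken together describe the whole convolution.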


Q: Is the inference-time model's output the same as the training-time model?

A: Yes. You can verify that by feeding the same input to both models and comparing the outputs.


Q: How to use RepMLP for other tasks?

A: It is better to finetune the training-time model on your datasets. Then you should do the conversion after finetuning and before you deploy the models. For example, say you want to use RepMLP-Res50 and PSPNet for semantic segmentation, you should build a PSPNet with a training-time RepMLP-Res50 as the backbone, load pre-trained weights into the backbone, and finetune the PSPNet on your segmentation dataset. Then you should convert the backbone following the code provided in this repo and keep the other task-specific structures (the PSPNet parts, in this case). The pseudo code will be like

#   train_backbone = create_xxx(deploy=False)
#   train_backbone.load_state_dict(torch.load(...))
#   train_pspnet = build_pspnet(backbone=train_backbone)
#   segmentation_train(train_pspnet)
#   deploy_pspnet = repmlp_model_convert(train_pspnet)
#   segmentation_test(deploy_pspnet)

Finetuning with a converted model also makes sense if you insert a BN after fc3, but the performance may be slightly lower.

Q: How to quantize a model with RepMLP?

A1: Post-training quantization. After training and conversion, you may quantize the converted model with any post-training quantization method. Then you may insert a BN after fc3 and finetune to recover the accuracy just like you quantize and finetune the other models. This is the recommended solution.

A2: Quantization-aware training. During the quantization-aware training, instead of constraining the params in a single kernel (e.g., making every param in {-127, -126, .., 126, 127} for int8) for ordinary models, you should constrain the equivalent kernel (get_equivalent_fc1_fc3_params() in

Q: I tried to finetune your model with multiple GPUs but got an error. Why are the names of params like "stage1.0..." in the downloaded weight file but sometimes like "module.stage1.0..." (shown by nn.Module.named_parameters()) in my model?

A: DistributedDataParallel may prefix "module." to the name of params and cause a mismatch when loading weights by name. The simplest solution is to load the weights (model.load_state_dict(...)) before DistributedDataParallel(model). Otherwise, you may insert "module." before the names like this

checkpoint = torch.load(...)    # This is just a name-value dict
ckpt = {('module.' + k) : v for k, v in checkpoint.items()}
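
A slightly more defensive variant of the renaming above is sketched below. It is an illustration under assumptions: the helper name match_module_prefix is hypothetical, and it relies only on the checkpoint being a plain name-to-value dict and on the model's parameter names being available (for example from model.state_dict().keys()).

```python
def match_module_prefix(checkpoint, model_keys, prefix="module."):
    """Add or strip `prefix` so checkpoint names match the model's names.

    checkpoint: name -> value dict, e.g. the result of torch.load(...)
    model_keys: parameter names, e.g. model.state_dict().keys()
    """
    model_keys = list(model_keys)
    model_wrapped = all(k.startswith(prefix) for k in model_keys)
    ckpt_wrapped = all(k.startswith(prefix) for k in checkpoint)
    if model_wrapped and not ckpt_wrapped:
        # Model is wrapped, checkpoint is not: add the prefix.
        return {prefix + k: v for k, v in checkpoint.items()}
    if ckpt_wrapped and not model_wrapped:
        # Checkpoint was saved from a wrapped model: strip the prefix.
        return {k[len(prefix):]: v for k, v in checkpoint.items()}
    return dict(checkpoint)


checkpoint = {"stage1.0.weight": 1, "stage1.0.bias": 2}
wrapped_names = ["module.stage1.0.weight", "module.stage1.0.bias"]
fixed = match_module_prefix(checkpoint, wrapped_names)
print(sorted(fixed))   # ['module.stage1.0.bias', 'module.stage1.0.weight']
```

The returned dict can then be passed to model.load_state_dict(...).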

Q: So a RepMLP derives the equivalent big fc kernel before each forwarding to save computations?

A: No! More precisely, we do the conversion only once right after training. Then the training-time model can be discarded, and the resultant model has no conv branches. We only save and use the resultant model.


  • Light Block is only 10% faster than Bottleneck?

    Light Block is not as fast as the paper says.

    import time
    import torch
    # The create_RepMLPRes50_* constructors are assumed to be imported from this repo.

    def test(network, p=True):
        x = torch.ones(128, 3, 224, 224).cuda()
        model = network.cuda()
        if p:
            print(model)
        with torch.no_grad():
            # warm-up iterations
            for _ in range(20):
                y = model(x)
            # CUDA kernels run asynchronously, so synchronize before reading the clock
            torch.cuda.synchronize()
            # inference test
            iters = 50
            start = time.time()
            for _ in range(iters):
                y = model(x)
            torch.cuda.synchronize()
            end = time.time()
            print((end - start) / iters, 's')

    if __name__ == "__main__":
        test(create_RepMLPRes50_Base_224(deploy=True), False)
        test(create_RepMLPRes50_Light_224(deploy=True), False)
        test(create_RepMLPRes50_Bottleneck_224(deploy=True), False)

    with Titan XP

    Base: 17.1 ms
    Light Block: 16.9 ms
    Bottleneck: 18.6 ms 
    opened by LightToYang 2
  • Why not keep repmlp-resnet?

    This design of repmlp-resnet is different from the latest repmlpnet, and it shows great face recognition accuracy.

    Why not keep repmlp-resnet in this repo?

    opened by twmht 1
  • A question about the code

    About convolving the identity matrix: the identity matrix contains many zeros, so won't local information be lost (or have I misunderstood)? For example, in this code, suppose the input is (1, 1, 3, 3), groups=1, and c_in=c_out=1, which is simply a 3x3 convolution on a (3, 3) image.

    I = torch.eye(9).repeat(1,1).reshape(9,1,3,3)

    I = tensor([[[[1., 0., 0.], [0., 0., 0.], [0., 0., 0.]]],
                [[[0., 1., 0.], [0., 0., 0.], [0., 0., 0.]]],
                [[[0., 0., 1.], [0., 0., 0.], [0., 0., 0.]]],
                [[[0., 0., 0.], [1., 0., 0.], [0., 0., 0.]]],
                [[[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]]],
                [[[0., 0., 0.], [0., 0., 1.], [0., 0., 0.]]],
                [[[0., 0., 0.], [0., 0., 0.], [1., 0., 0.]]],
                [[[0., 0., 0.], [0., 0., 0.], [0., 1., 0.]]],
                [[[0., 0., 0.], [0., 0., 0.], [0., 0., 1.]]]])


    opened by hsm1997 0
  • How to convert the 1D model of RepMLP [B, C, H]

    Thank you very much for proposing an excellent model and sharing it publicly, and congratulations on the publication of your results in CVPR. I would like to apply RepMLP to one-dimensional data, where the input has shape [B, C, H]. Would it be possible to provide a RepMLP model for such data?

    opened by kuaileyuandi 0
  • Why is the output size of the Global Perceptron's average pooling (1, 1)?

    def forward(self, inputs):
        x = F.adaptive_avg_pool2d(inputs, output_size=(1, 1))
        x = self.fc1(x)

    According to the paper, shouldn't it be (h, w)?

    opened by Lloyd-Pottiger 0
