ConvNet training using PyTorch


Convolutional networks using PyTorch

This is a complete training example for deep convolutional networks on various datasets (ImageNet, CIFAR-10, CIFAR-100, MNIST).

Available models include:

'alexnet', 'amoebanet', 'darts', 'densenet', 'googlenet', 'inception_resnet_v2', 'inception_v2', 'mnist', 'mobilenet', 'mobilenet_v2', 'nasnet', 'resnet', 'resnet_se', 'resnet_zi', 'resnet_zi_se', 'resnext', 'resnext_se'

It is based on the ImageNet example in PyTorch, with helpful additions such as:

  • Training on several datasets other than ImageNet
  • Complete logging of each training experiment
  • Graph visualization of the training/validation loss and accuracy
  • Definition of preprocessing and optimization regime for each model
  • Distributed training

To clone:

git clone --recursive

Example of efficient multi-GPU training of ResNet-50 (4 GPUs, label smoothing):

python -m torch.distributed.launch --nproc_per_node=4 --model resnet --model-config "{'depth': 50}" --eval-batch-size 512 --save resnet50_ls --label-smoothing 0.1
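
The --label-smoothing 0.1 option in the command above refers to label smoothing. The sketch below shows one common formulation, in which the one-hot target is mixed with a uniform distribution over classes. It is an illustration only, not the repository's implementation; `smooth_labels` is a hypothetical helper, and other variants spread the smoothing mass over the non-target classes only.

```python
def smooth_labels(target, num_classes, smoothing=0.1):
    """Return a smoothed target distribution for a single example.

    Assumed formulation: (1 - smoothing) * one_hot + smoothing / num_classes.
    """
    off_value = smoothing / num_classes
    dist = [off_value] * num_classes
    dist[target] += 1.0 - smoothing
    return dist


# Example: class 2 out of 4 classes, smoothing 0.1
example = smooth_labels(2, 4, smoothing=0.1)
```

The resulting distribution still sums to 1, so it can be used directly as the target of a cross-entropy loss.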

This code can be used to implement several recent papers:



  • Configure your dataset path with the datasets-dir argument
  • To get the ILSVRC data, you should register on their site for access:

Model configuration

A network model is defined by writing a .py file in the models folder and selecting it with the model flag. The model function must be registered in models/ and must return a trainable network. It can also specify additional training options, such as an optimization regime (either a dictionary or a function) and input transform modifications.

For example, a model definition:

import torch.nn as nn


class Model(nn.Module):

    def __init__(self, num_classes=1000):
        super(Model, self).__init__()
        self.model = nn.Sequential(...)

        # Optimization regime, keyed by the epoch at which each entry starts.
        self.regime = [
            {'epoch': 0, 'optimizer': 'SGD', 'lr': 1e-2,
                'weight_decay': 5e-4, 'momentum': 0.9},
            {'epoch': 15, 'lr': 1e-3, 'weight_decay': 0}
        ]

        # Data regime: input size and batch size, also keyed by epoch.
        self.data_regime = [
            {'epoch': 0, 'input_size': 128, 'batch_size': 256},
            {'epoch': 15, 'input_size': 224, 'batch_size': 64}
        ]

    def forward(self, inputs):
        return self.model(inputs)


def model(**kwargs):
    return Model()
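
To make the epoch-keyed regime lists above more concrete, here is a minimal sketch of how such a list could be resolved into the settings active at a given epoch. It assumes that entries accumulate, so each entry overrides earlier values from its epoch onward. That behaviour and the helper name `settings_at` are assumptions for illustration, not the repository's code.

```python
def settings_at(regime, epoch):
    """Merge all regime entries whose 'epoch' is <= the given epoch.

    Assumption: later entries override earlier ones key by key, and keys
    that are not repeated (e.g. 'momentum') carry over unchanged.
    """
    current = {}
    for entry in sorted(regime, key=lambda e: e['epoch']):
        if entry['epoch'] > epoch:
            break
        current.update({k: v for k, v in entry.items() if k != 'epoch'})
    return current


regime = [
    {'epoch': 0, 'optimizer': 'SGD', 'lr': 1e-2,
     'weight_decay': 5e-4, 'momentum': 0.9},
    {'epoch': 15, 'lr': 1e-3, 'weight_decay': 0},
]

early = settings_at(regime, 3)    # settings from the first entry only
late = settings_at(regime, 20)    # lr and weight_decay overridden at epoch 15
```

Under this reading, the optimizer and momentum from epoch 0 stay in effect after epoch 15, while the learning rate and weight decay change.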


If you use the code in your paper, consider citing one of the implemented works.

  Elad Hoffer, Itay Hubara, Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. International Conference on Learning Representations.

  Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. Advances in Neural Information Processing Systems.

  Ron Banner, Itay Hubara, Elad Hoffer, Daniel Soudry. Scalable Methods for 8-bit Training of Neural Networks. Advances in Neural Information Processing Systems.

  Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry. Augment Your Batch: Improving Generalization Through Instance Repetition. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

  Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry. Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency. arXiv preprint arXiv:1908.08986.

  • TypeError: __init__() got an unexpected keyword argument 'reduction'

    Hi, when I try to run the script, I get this error. What's wrong? TypeError: __init__() got an unexpected keyword argument 'reduction'. I'm calling the script like this:

    python --dataset $IMAGENET_DIR --model $MODEL_NAME 

    I'm using Python 3.6 and PyTorch 0.4. Thanks a lot in advance.

    opened by Coderx7 6
  • RuntimeError: view size is not compatible with input tensor's size and stride

    I have the following config file:

        "adapt_grad_norm": null,
        "autoaugment": false,
        "batch_size": 256,
        "chunk_batch": 1,
        "config_file": null,
        "cutmix": null,
        "cutout": false,
        "dataset": "imagenet",
        "datasets_dir": "~/data/",
        "device": "cuda",
        "device_ids": [
        "dist_backend": "nccl",
        "dist_init": "env://",
        "distributed": false,
        "drop_optim_state": false,
        "dtype": "float",
        "duplicates": 1,
        "epochs": 90,
        "eval_batch_size": -1,
        "evaluate": null,
        "grad_clip": -1,
        "input_size": null,
        "label_smoothing": 0,
        "local_rank": -1,
        "loss_scale": 1,
        "lr": 0.1,
        "mixup": null,
        "model": "alexnet",
        "model_config": "",
        "momentum": 0.9,
        "optimizer": "SGD",
        "print_freq": 10,
        "results_dir": "./results",
        "resume": "",
        "save": "alexnet_unquant",
        "save_all": false,
        "seed": 123,
        "start_epoch": -1,
        "sync_bn": false,
        "tensorwatch": false,
        "tensorwatch_port": 0,
        "weight_decay": 0,
        "workers": 8,
        "world_size": -1

    I get the following error:

    Starting Epoch: 1
    Traceback (most recent call last):
      File "", line 364, in <module>
      File "", line 130, in main
      File "", line 306, in main_worker
        train_results = trainer.train(train_data.get_loader(),
      File "/MyPath/convNet.pytorch/", line 269, in train
        return self.forward(data_loader, training=True, average_output=average_output, chunk_batch=chunk_batch)
      File "/MyPath/convNet.pytorch/", line 224, in forward
        prec1, prec5 = accuracy(output, target, topk=(1, 5))
      File "/MyPath/convNet.pytorch/utils/", line 70, in accuracy
        correct_k = correct[:k].view(-1).float().sum(0)
    RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

    I get the same error when I try the following command from README.md:

    python --model resnet --model-config "{'depth': 18, 'quantize':True}" --save resnet18_8bit -b 64

    How can I fix this?


    opened by gakadam 1
  • Is there any explanation of the scale_fix in RangeBN?

    A fix is added in RangeBN for the scale:

    scale_fix = (0.5 * 0.35) * (1 + (math.pi * math.log(4)) ** 0.5) / ((2 * math.log(y.size(-1))) ** 0.5)

    The (2 * ln(n)) ** 0.5 term is explained in the paper. However, where do the 0.5, 0.35, and pi * ln(4) terms come from?

    opened by blueardour 1
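
For readers following the scale_fix question above, the quoted expression can be evaluated on its own to see how it behaves as the number of samples n grows. The sketch below only restates the formula from the issue, with n standing in for y.size(-1); the function name `scale_fix` is chosen for illustration, and the sketch does not explain the origin of the constants.

```python
import math


def scale_fix(n):
    """Evaluate the scale correction quoted in the issue for n samples."""
    numerator = (0.5 * 0.35) * (1 + (math.pi * math.log(4)) ** 0.5)
    denominator = (2 * math.log(n)) ** 0.5
    return numerator / denominator


# The correction shrinks slowly as n grows, following 1 / sqrt(2 * ln(n)).
values = {n: scale_fix(n) for n in (16, 64, 256, 1024)}
```

Only the denominator depends on n; the numerator is a fixed constant, which is the part the question asks about.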
  • AttributeError: 'Image' object has no attribute 'new'

    Hi, why am I getting this error? It's complaining about this line: alpha =, self.alphastd)

    What's wrong here? I'm using PyTorch 0.4.

    opened by Coderx7 1
  • May I know the software versions

    Hi, @eladhoffer

    Thanks for the excellent project. I'm trying the code on my local machine. However, some Python package dependencies do not seem to be met. Could you share the versions of the key software you use, such as PyTorch, Python, and the OS?

    opened by blueardour 0
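
When asking or answering a version question like the one above, it helps to collect the environment details in one place. The following is a minimal sketch using only the standard library; `environment_report` is a hypothetical helper, and the default package names are assumptions about what is relevant here.

```python
import platform
from importlib import metadata


def environment_report(packages=("torch", "torchvision")):
    """Collect the Python, OS, and installed package versions as a dict."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted directly into an issue report.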
  • ResNext CIFAR10 crashing during layer construction

    When running the command: python3.6 -b 32 --gpus 1 --model resnext --dataset cifar10

    the following error appears:

    Traceback (most recent call last):
      File "", line 306, in <module>
      File "", line 194, in main
        train_loader, model, criterion, epoch, optimizer)
      File "", line 295, in train
        training=True, optimizer=optimizer)
      File "", line 254, in forward
      File "/usr/lib64/python3.6/site-packages/torch/nn/modules/", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/", line 158, in forward
        x = self.layer1(x)
      File "/usr/lib64/python3.6/site-packages/torch/nn/modules/", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/lib64/python3.6/site-packages/torch/nn/modules/", line 72, in forward
        input = module(input)
      File "/usr/lib64/python3.6/site-packages/torch/nn/modules/", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/", line 56, in forward
        residual = self.downsample(x)
      File "/usr/lib64/python3.6/site-packages/torch/nn/modules/", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/", line 119, in forward
        return[ds,*zeros_size)], 1)
    RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
    opened by Lancer555 0
  • How to run the code in Colaboratory?

    I'm trying to run the code in Colab, and I executed the following command:

    !python /content/convNet.pytorch/ --dataset cifar10 --model resnet --model-config "{'depth': 44}" --duplicates 40 --cutout -b 64 --epochs 100 --save resnet44_cutout_m-40

    and I'm getting the following error:

    Traceback (most recent call last):
      File "/content/convNet.pytorch/", line 14, in <module>
        from data import DataRegime, SampledDataRegime
      File "/content/convNet.pytorch/", line 9, in <module>
        from utils.dataset import IndexedFileDataset
      File "/content/convNet.pytorch/utils/", line 6, in <module>
        from import Sampler, RandomSampler, BatchSampler, _int_classes
    ImportError: cannot import name '_int_classes' from '' (/usr/local/lib/python3.7/dist-packages/torch/utils/data/

    How can I fix this issue? What steps do I need to follow to be able to run the code in colab? What changes do I need to make?

    opened by abdulsam 2
  • line 287 defaults = {**train_data_defaults} - SyntaxError: invalid syntax


    This is the output on command line:

    python --dataset cifar10 --model resnet --model-config "{'depth': 44}" --duplicates 32 --cutout -b 64 --epochs 100 --save resnet_44_cutout_m-32-new
      File "", line 287
        defaults = {**train_data_defaults}
                    ^
    SyntaxError: invalid syntax

    opened by BrandonLiang 1
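
One likely cause of the SyntaxError above is the interpreter version: the `{**mapping}` dictionary-unpacking syntax was introduced in Python 3.5, so an older `python` on the PATH rejects it at parse time. The sketch below shows an equivalent spelling that also parses on older interpreters; the dictionary contents are hypothetical values for illustration.

```python
import sys

# Hypothetical defaults, standing in for the real train_data_defaults.
train_data_defaults = {'batch_size': 64, 'cutout': True}

# Equivalent to {**train_data_defaults}: a shallow copy of the mapping.
defaults = dict(train_data_defaults)

# Equivalent to {**train_data_defaults, 'batch_size': 128}.
overridden = dict(train_data_defaults, batch_size=128)

# Dictionary unpacking in literals needs Python 3.5 or newer.
supports_dict_unpacking = sys.version_info >= (3, 5)
```

Running the script with a newer interpreter (for example by invoking python3 explicitly) avoids the need to change the code at all.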
  • NaN loss for quantization

    I'm running the code for 8-bit quantization, but the training loss always becomes NaN, even though I have not modified the original code at all. I'm wondering why this happens and would appreciate any clarification.

    opened by syorami 4
  • Stochastic quantization: difference between code and paper

    In the paper, stochastic quantization is done by rounding up with probability p=clip(0.5x, 0, 1) and rounding down with probability 1-p. However, in the code it is done by adding random uniform noise before quantization: noise =, 0.5) output.add_(noise). This noise does not depend on the magnitude of x. What is the reasoning behind this discrepancy?

    opened by michaelklachko 0
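
As background for the question above, adding uniform noise and then rounding to the nearest integer is itself a form of stochastic rounding: a value x is rounded up with probability equal to its fractional part, even though the noise is independent of x. The sketch below checks this empirically. It assumes the noise is uniform on [-0.5, 0.5), which the truncated snippet suggests but does not show, and it is not a statement of the author's intent or a reproduction of the repository's quantizer.

```python
import math
import random


def noisy_round(x, rng):
    """Add uniform noise in [-0.5, 0.5), then round to the nearest integer."""
    noise = rng.uniform(-0.5, 0.5)
    return math.floor(x + noise + 0.5)


def round_up_fraction(x, trials=100_000, seed=0):
    """Estimate how often a non-integer x is rounded up rather than down."""
    rng = random.Random(seed)
    ups = sum(noisy_round(x, rng) > math.floor(x) for _ in range(trials))
    return ups / trials


# The estimated round-up probability tracks the fractional part of x.
estimate_030 = round_up_fraction(2.3)
estimate_075 = round_up_fraction(2.75)
```

Under this assumption the rounding is unbiased in expectation, so the noise-based form and a probability-based form can agree on the expected output while differing in how they are written.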
  • bug in vgg?

    Hi, cool framework! Note that you add an AvgPool2D layer with kernel=1 in the VGG class, which has no effect. Perhaps you meant AdaptiveAveragePool? In addition, the input to the classification layer is usually 7*7*512, given an input of 224x224.

    opened by rosenfeldamir 0
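
The 7*7*512 figure in the issue above follows from simple arithmetic. The sketch below assumes the standard VGG layout, with five 2x2 max-pooling stages of stride 2 and 512 channels in the last convolutional block; `vgg_classifier_inputs` is a hypothetical helper for illustration.

```python
def vgg_classifier_inputs(input_size=224, num_pools=5, channels=512):
    """Number of features entering the first fully connected layer.

    Assumes each pooling stage halves the spatial size and that the
    convolutions themselves preserve it (3x3 kernels with padding 1).
    """
    side = input_size // (2 ** num_pools)
    return side * side * channels


features_224 = vgg_classifier_inputs(224)   # 7 * 7 * 512
features_32 = vgg_classifier_inputs(32)     # 1 * 1 * 512, e.g. CIFAR-sized input
```

An adaptive pooling layer fixes the spatial size before the classifier regardless of input resolution, which is why the issue suggests it in place of a pooling layer with kernel size 1.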
Elad Hoffer
