A memory-efficient implementation of DenseNets

Overview

efficient_densenet_pytorch

A PyTorch >=1.0 implementation of DenseNets, optimized to save GPU memory.

Recent updates

  1. Now works on PyTorch 1.0! It uses the checkpointing feature, which makes this code WAY more efficient!!!

Motivation

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by batch normalization and concatenation operations grows quadratically with network depth: each of the L layers concatenates and normalizes all preceding feature maps, so roughly 1 + 2 + ... + L ≈ L²/2 such intermediate maps are kept alive during the forward pass. It is worth emphasizing that this is not a property inherent to DenseNets, but rather to the implementation.

This implementation uses a new strategy to reduce the memory consumption of DenseNets. We use checkpointing to compute the Batch Norm and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds 15-20% time overhead to training, but reduces feature map memory consumption from quadratic to linear.
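
A rough sketch of the idea using torch.utils.checkpoint directly (the layer shapes below are arbitrary illustrations, not the values used in this repo):

import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

# The "bottleneck" computation whose intermediates grow quadratically with depth:
# concatenation -> batch norm -> ReLU -> 1x1 convolution.
norm = nn.BatchNorm2d(64)
relu = nn.ReLU(inplace=True)
conv = nn.Conv2d(64, 48, kernel_size=1, bias=False)

def bn_function(*prev_features):
    concatenated = torch.cat(prev_features, 1)
    return conv(relu(norm(concatenated)))

prev_features = [torch.randn(4, 32, 8, 8, requires_grad=True) for _ in range(2)]

# The concatenated/normalized tensors are not stored during the forward pass;
# they are recomputed from prev_features when backward() needs them.
bottleneck = cp.checkpoint(bn_function, *prev_features)
bottleneck.sum().backward()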

This implementation is inspired by this technical report, which outlines a strategy for efficient DenseNets via memory sharing.

Requirements

  • PyTorch >=1.0.0
  • CUDA

Usage

In your existing project: There is one file in the models folder.

If you care about speed and memory is not a concern, pass the efficient=False argument into the DenseNet constructor. Otherwise, pass in efficient=True (a short usage sketch follows the options list below).

Options:

  • All options are described in the docstrings of the model files
  • The depth is controlled by block_config option
  • efficient=True uses the memory-efficient version
  • If you want to use the model for ImageNet, set small_inputs=False. For CIFAR or SVHN, set small_inputs=True.
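
A minimal usage sketch, assuming you have copied the models folder into your project and that the DenseNet constructor exposes the arguments documented in its docstring:

import torch
from models import DenseNet  # the single model file in the models folder

# DenseNet-BC for CIFAR-sized (32x32) inputs, with checkpointed (memory-efficient) layers.
model = DenseNet(
    growth_rate=12,
    block_config=(16, 16, 16),  # three dense blocks; this controls the depth
    num_classes=10,
    small_inputs=True,          # True for CIFAR/SVHN, False for ImageNet
    efficient=True,             # use the memory-efficient implementation
)

images = torch.randn(2, 3, 32, 32)
logits = model(images)          # shape: (2, 10)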

Running the demo:

The only extra package you need to install is python-fire:

pip install fire
  • Single GPU:
CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>
  • Multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1,2 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>

Options:

  • --depth (int) - depth of the network (number of convolution layers) (default 40)
  • --growth_rate (int) - number of features added per DenseNet layer (default 12)
  • --n_epochs (int) - number of epochs for training (default 300)
  • --batch_size (int) - size of minibatch (default 256)
  • --seed (int) - manually set the random seed (default None)
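
For example, a single-GPU run matching the DenseNet-BC (100 layers, growth rate 12, batch size 64) configuration benchmarked below might look like this (paths are placeholders):

CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --depth 100 --growth_rate 12 --batch_size 64 --data <path_to_folder_with_cifar10> --save <path_to_save_dir>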

Performance

A comparison of the two implementations (each is a DenseNet-BC with 100 layers, batch size 64, tested on an NVIDIA Pascal Titan-X):

Implementation          Memory consumption (GB/GPU)   Speed (sec/minibatch)
Naive                   2.863                         0.165
Efficient               1.605                         0.207
Efficient (multi-GPU)   0.985                         -

Other efficient implementations

Reference

@article{pleiss2017memory,
  title={Memory-Efficient Implementation of DenseNets},
  author={Pleiss, Geoff and Chen, Danlu and Huang, Gao and Li, Tongcheng and van der Maaten, Laurens and Weinberger, Kilian Q},
  journal={arXiv preprint arXiv:1707.06990},
  year={2017}
}
Comments
  • test failed on v0.2

    the efficient_densenet_bottleneck_test.py failed in test_backward_computes_backward_pass

    >       assert(almost_equal(layer.conv.weight.grad.data, layer_efficient.conv_weight.grad.data))
    E       assert False
    E        +  where False = almost_equal(\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n    5.2501\n\n(0 ,4 ,.,...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n, \n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n  2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n)
    E        +    where \n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n    5.2501\n\n(0 ,4 ,.,...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
    E        +      where Variable containing:\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
    E        +        where Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False).weight
    E        +          where Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False) = Sequential (\n  (norm): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True)\n  (relu): ReLU (inplace)\n  (conv): Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)\n).conv
    E        +    and   \n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n  2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
    E        +      where Variable containing:\n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
    E        +        where Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = _EfficientDensenetBottleneck (\n).conv_weight
    
    

    I uncommented the code in densenet_efficient.py

    self.efficient_batch_norm.training = False,
    

    but the issue persists.

    call-for-contribution 
    opened by yifita 18
  • Compatibility with PyTorch 0.4

    Thanks very much for sharing this implementation. I forked the code, and it works great on PyTorch 0.3.1. But when I ran it with 0.4.0 (the master version), I got the following error (I made some minor changes, so the line numbers won't match): File "../networks/densenet_efficient.py", line 330, in forward

    bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)
    

    TypeError: Variable data has to be a tensor, but got torch.cuda.FloatStorage

    It turned out that for this line: bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)

    The inputs in version 0.3.1 are FloatTensors, but in 0.4.0 they are Variables.

    I am wondering what's the best way to update the code for 0.4.0?

    Many thanks!
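
    A hedged side note (not from the original thread): in PyTorch 0.4 the Variable and Tensor classes were merged and the volatile flag was removed, so the usual migration is to drop the Variable wrapper and use torch.no_grad() for inference-only work. A minimal sketch of the general pattern, leaving the storage-specific call aside:

        import torch

        some_tensor = torch.empty(16)

        # PyTorch 0.3:   x_var = Variable(some_tensor, volatile=True)
        # PyTorch >= 0.4: tensors are variables, and volatile is replaced by no_grad()
        with torch.no_grad():
            x = some_tensor.view(4, 4)   # ops here are excluded from autograd tracking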

    compatibility 
    opened by seyiqi 11
  • I met this problem when running demo.py. How can I solve it?

    0.1 Training
    Traceback (most recent call last):
      File "demo.py", line 272, in fire.Fire(demo)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 127, in Fire component_trace = _Fire(component, args, context, name)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 366, in _Fire component, remaining_args)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 542, in _CallCallable result = fn(*varargs, **kwargs)
      File "demo.py", line 250, in demo n_epochs=n_epochs, batch_size=batch_size, seed=seed)
      File "demo.py", line 166, in train train=True,
      File "demo.py", line 101, in run_epoch output_var = model(input_var)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 218, in forward features = self.features(x)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward input = module(input)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 152, in forward outputs.append(module.forward(outputs))
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 107, in forward new_features = super(_DenseLayer, self).forward(prev_features)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward input = module(input)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 85, in forward return fn(self.norm_weight, self.norm_bias, self.conv_weight, *inputs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 265, in forward conv_output = self.efficient_conv.forward(conv_weight, None, relu_output)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 457, in forward self.groups, cudnn.benchmark
    TypeError: _cudnn_convolution_full_forward received an invalid combination of arguments - got (torch.cuda.FloatTensor, torch.cuda.FloatTensor, NoneType, torch.cuda.FloatTensor, tuple, tuple, tuple, int, bool), but expected (torch.cuda.RealTensor input, torch.cuda.RealTensor weight, torch.cuda.RealTensor bias, torch.cuda.RealTensor output, std::vector pad, std::vector stride, std::vector dilation, int groups, bool benchmark, bool deterministic)

    opened by yyjFish 9
  • Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version

    Just tried the new implementation in PyTorch 0.3, but it consumes much more memory than the old implementation. Some issues:

    1. When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) loop in _SharedAllocation() requires some modification and optimization.

    2. When the model runs on multiple GPUs, the batch size it can handle is much smaller than the single-GPU batch size times the number of GPUs. From my test, it can only handle the same batch size as the single-GPU version.

    bug 
    opened by ZhengRui 8
  • PyTorch 0.3 compatibility

    The efficient model now works on PyTorch 0.3

    Some other changes:

    • Multi-GPU support is now baked directly into DenseNetEfficient, so I removed the multi-GPU specific model
    • Changed the name of the cifar option to small_inputs (more generic).

    I will merge it in tomorrow, after I confirm that the demo reaches the same error rate on CIFAR-10.

    opened by gpleiss 7
  • The final test accuracy

    Hi, thanks for this implementation! I'm wondering how to obtain the quite strong test set results on CIFAR-10 reported in the original DenseNet paper (e.g., error rate <= 3.5 on C10+ with depth = 190, growth_rate = 40). When I run the script as:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python demo.py --depth 190 --efficient False --data ./data --save ./ckpts

    The final test error is reported as 0.0535. I'm wondering whether the high error is due to no data augmentation being applied in the default setting. May I know whether this uses the C10+ dataset or C10?

    Best

    opened by ustctf-zz 6
  • FP becomes slower after upgrading to 0.4

    Hi, thanks for your work! I recently upgraded my network to 0.4 with your implementation of DenseNet, and I found that the new version is slower than before. I thought the shared memory would noticeably speed up the forward pass. In my application, predicting one subject took 9 s with the 0.3.x version, but it now needs 11 s.

    The Dice metric also got worse than before. I found that the new code uses Kaiming normal initialization, whereas before it used the default initialization (uniform?). I have tried to set all parameters as before, but it had no effect. Do you have any advice for me?

    Thanks.

    opened by DesertsP 5
  • Error in trying to use for the first time

    Hello,

    I am a beginner in Python and PyTorch and am trying to use your efficient DenseNet implementation on a different dataset than CIFAR (images are 80 pixels wide, instead of 32). I use a Windows 10 laptop with the experimental PyTorch port for Windows by peterjc123 (see https://github.com/pytorch/pytorch/issues/494).

    I have incorporated your DenseNetEfficient model into a training script adapted from andreasveit's densenet implementation for PyTorch, and replaced the CIFAR dataset loader (datasets.CIFAR10('../data', train=True, download=True, transform=transform_train)) with datasets.ImageFolder as follows:

        train_loader = torch.utils.data.DataLoader(
            datasets.ImageFolder(root=args.dataroot + '/train', transform=transform_train),
            batch_size=args.batch_size, shuffle=True, **kwargs)
    

    When launching the training script, I get an error that is cryptic to me:

    Traceback (most recent call last):
      File "train.py", line 312, in main()
      File "train.py", line 153, in main train(train_loader, model, criterion, optimizer, epoch)
      File "train.py", line 185, in train output = model(input_var)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call result = self.forward(*input, **kwargs)
      File "D:\deepLearning\densenet\densenetEfficient.py", line 213, in forward out = self.classifier(out)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call result = self.forward(*input, **kwargs)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\linear.py", line 54, in forward return self.backend.Linear.apply(input, self.weight, self.bias)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn_functions\linear.py", line 12, in forward output.addmm(0, 1, input, weight.t())
    RuntimeError: size mismatch at d:\downloads\pytorch-master-1\torch\lib\thc\generic/THCTensorMathBlas.cu:243

    I am surely doing something wrong, but I have searched a lot and did not find anything.

    Any recommendation would be welcome,

    Thanks a lot,

    Christophe

    opened by chmaz 5
  • storage resize_ function

    I am trying to use torch.Storage in my network. I use PyTorch 0.3.

    if self.storage.size() < size:
        is_cuda = self.storage.is_cuda
        if is_cuda:
            gpu_ID = self.storage.get_device()
            print('gpu_ID1:', gpu_ID)
        self.storage.resize_(size)

        gpu_ID = self.storage.get_device()
        print('gpu_ID2:', gpu_ID)

        if is_cuda:
            self.storage = self.storage.cuda(gpu_ID)
        gpu_ID = self.storage.get_device()
        print('gpu_ID3:', gpu_ID)
    

    The output is

    gpu_ID1: 1
    gpu_ID2: 0
    gpu_ID3: 0
    
    

    self.storage comes from self.storage = torch.Storage(1024). It seems the resize_ function changes the GPU where the storage is saved. I want the storage to stay on GPU 1 rather than GPU 0. How can I do that?
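
    A hedged sketch of one possible workaround (not from this repo): resize_ allocates on the current CUDA device, so pinning the device around the call should keep the buffer where it started. This assumes an older PyTorch with the torch.Storage semantics shown above and at least two GPUs:

        import torch

        storage = torch.Storage(1024).cuda(1)    # shared buffer intended to live on GPU 1
        size = 4096
        if storage.size() < size:
            # make GPU 1 the current device before resize_ triggers a reallocation
            with torch.cuda.device(storage.get_device()):
                storage.resize_(size)
        print(storage.get_device())               # expected: 1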

    opened by mingminzhen 4
  • Could it be MORE memory efficient?

    Hi,

    Thanks for your code. I read both the single-GPU and multi-GPU versions. For the single-GPU version, you create the shared memory inside each dense block. Could all the dense blocks share the same memory, so that only one block of space is allocated? I think that would further reduce the space usage.

    For the multi-GPU version, you create the shared memory in the initialization method of the whole network, i.e., one level above the dense block initialization. However, you register a buffer inside each dense block for the shared memory, via self.register_buffer('CatBN_output_buffer', self.storage). Does this mean each dense block has its own independent shared memory? If so, why not let them all share the same area?

    Thanks

    opened by zhiqiangdon 4
  • error rate compute in demo.py

    Hi, thanks for this efficient densenet code.

    But I found a probable mistake in demo.py at line 112:

        error = 1 - torch.eq(predictions_var, target_var).float().mean()

    It might have to be corrected to:

        error = 1 - torch.eq(predictions_var.view(-1), target_var).float().mean()

    Because the sizes of predictions_var and target_var are (train_size, 1) and (train_size,), torch.eq(...) will return a train_size x train_size matrix whose entries are almost all 0 (1 only on the diagonal). The error rate will then not be able to decrease.
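
    A quick illustration of the broadcasting behavior described above (written against a recent PyTorch for brevity):

        import torch

        predictions = torch.tensor([[0], [1], [2]])   # shape (3, 1), like predictions_var
        targets = torch.tensor([0, 1, 2])             # shape (3,),  like target_var

        print(torch.eq(predictions, targets).shape)            # torch.Size([3, 3]) -- broadcast
        print(torch.eq(predictions.view(-1), targets).shape)   # torch.Size([3])    -- intended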

    opened by ghost 4
  • How can I apply this to my own model?

    Thank you for your nice work. My model uses DenseNet-style connections like:

    tensorFeat = torch.cat([self.moduleOne(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleTwo(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleThr(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleFou(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleFiv(tensorFeat), tensorFeat], 1)

    What do I need to do to apply the efficient technique to save memory in this part? The DenseNet connections are just one part of my full model.
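
    A hedged sketch of how the checkpointing trick from this repo could be wrapped around one such connection; DenseStep is a hypothetical helper, not something the repo provides:

        import torch
        import torch.nn as nn
        import torch.utils.checkpoint as cp

        class DenseStep(nn.Module):
            """Computes cat([module(x), x], 1), recomputing it during backward
            instead of storing the concatenated intermediate."""
            def __init__(self, module):
                super().__init__()
                self.module = module

            def forward(self, feat):
                def closure(f):
                    return torch.cat([self.module(f), f], 1)
                if self.training and feat.requires_grad:
                    return cp.checkpoint(closure, feat)
                return closure(feat)

        step = DenseStep(nn.Conv2d(8, 4, kernel_size=3, padding=1))
        x = torch.randn(2, 8, 16, 16, requires_grad=True)
        y = step(x)   # 12 channels: 4 new + 8 passed through

    How much this saves depends on how large the concatenated tensors are relative to the rest of the model; the repo's big savings come from applying the trick to every concatenation + batch norm inside the dense blocks.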

    opened by CXMANDTXW 1
  • Is this really memory efficient?

    I see the memory consumption chart in the readme, but after looking at the code, I have doubts that this implementation is fully memory efficient. I see the call to cp.checkpoint in _DenseLayer.forward(), but I don't see some of the other modifications that were called for in the paper, specifically post-activation normalization and contiguous concatenation. Am I missing something?

    If I understand your approach, you are using a method that still requires quadratic memory and computation, but tossing the memory-hogging intermediate values and recomputing them later?

    opened by leonardishere 1
  • dropout not in 3x3 convolutional layer

    Hi, thanks for your work. I have a question about dropout. In densenet.py, the layer's forward function is defined as follows:

        def forward(self, *prev_features):
            bn_function = _bn_function_factory(self.norm1, self.relu1, self.conv1)
            if self.efficient and any(prev_feature.requires_grad for prev_feature in prev_features):
                bottleneck_output = cp.checkpoint(bn_function, *prev_features)
            else:
                bottleneck_output = bn_function(*prev_features)
            new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
            if self.drop_rate > 0:
                new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
            return new_features
    

    while in the original Torch version of DenseNet, dropout is added after both the 1x1 conv and the 3x3 conv:

    function DenseConnectLayerStandard(nChannels, opt)
       local net = nn.Sequential()
    
       net:add(ShareGradInput(cudnn.SpatialBatchNormalization(nChannels), 'first'))
       net:add(cudnn.ReLU(true))   
       if opt.bottleneck then
          net:add(cudnn.SpatialConvolution(nChannels, 4 * opt.growthRate, 1, 1, 1, 1, 0, 0))
          nChannels = 4 * opt.growthRate
          if opt.dropRate > 0 then net:add(nn.Dropout(opt.dropRate)) end
          net:add(cudnn.SpatialBatchNormalization(nChannels))
          net:add(cudnn.ReLU(true))      
       end
       net:add(cudnn.SpatialConvolution(nChannels, opt.growthRate, 3, 3, 1, 1, 1, 1))
       if opt.dropRate > 0 then net:add(nn.Dropout(opt.dropRate)) end
    
       return nn.Sequential()
          :add(nn.Concat(2)
             :add(nn.Identity())
             :add(net))  
    end
    

    Is there a particular reason for not adding a dropout layer after the 3x3 convolutional layer? Thanks in advance.

    opened by lizhenstat 0
  • What is the minimum GPU memory required? Still breaks for me in a single GPU

    Amazon p3.2xlarge: 1 GPU - Tesla V100 -- GPU memory: 16 GB -- batch size = 64

    If efficient = False:
    Error: RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 15.75 GiB total capacity; 14.71 GiB already allocated; 4.88 MiB free; 4.02 MiB cached)

    If efficient = True:
    Error: RuntimeError: CUDA out of memory. Tried to allocate 61.25 MiB (GPU 0; 15.75 GiB total capacity; 14.65 GiB already allocated; 50.88 MiB free; 5.33 MiB cached)


    Amazon g3.4xlarge: 1 GPU - Tesla M60 -- GPU memory: 8 GB -- batch size = 64

    If efficient = False:
    RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)

    If efficient = True:
    RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)

    opened by PabloRR100 1