A memory-efficient implementation of DenseNets

Overview

efficient_densenet_pytorch

A PyTorch >=1.0 implementation of DenseNets, optimized to save GPU memory.

Recent updates

  1. Now works on PyTorch 1.0! It uses the checkpointing feature, which makes this code WAY more efficient!!!

Motivation

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by batch normalization and concatenation operations grows quadratically with network depth: each of the L layers concatenates and normalizes all preceding feature maps, so roughly 1 + 2 + ... + L ≈ L²/2 such intermediate maps are kept alive during the forward pass. It is worth emphasizing that this is not a property inherent to DenseNets, but rather to the implementation.

This implementation uses a new strategy to reduce the memory consumption of DenseNets. We use checkpointing to compute the Batch Norm and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds 15-20% time overhead to training, but reduces feature map memory consumption from quadratic to linear.
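
A rough sketch of the idea using torch.utils.checkpoint directly (the layer shapes below are arbitrary illustrations, not the values used in this repo):

import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

# The "bottleneck" computation whose intermediates grow quadratically with depth:
# concatenation -> batch norm -> ReLU -> 1x1 convolution.
norm = nn.BatchNorm2d(64)
relu = nn.ReLU(inplace=True)
conv = nn.Conv2d(64, 48, kernel_size=1, bias=False)

def bn_function(*prev_features):
    concatenated = torch.cat(prev_features, 1)
    return conv(relu(norm(concatenated)))

prev_features = [torch.randn(4, 32, 8, 8, requires_grad=True) for _ in range(2)]

# The concatenated/normalized tensors are not stored during the forward pass;
# they are recomputed from prev_features when backward() needs them.
bottleneck = cp.checkpoint(bn_function, *prev_features)
bottleneck.sum().backward()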

This implementation is inspired by this technical report, which outlines a strategy for efficient DenseNets via memory sharing.

Requirements

  • PyTorch >=1.0.0
  • CUDA

Usage

In your existing project: There is one file in the models folder.

If you care about speed and memory is not a concern, pass the efficient=False argument into the DenseNet constructor. Otherwise, pass in efficient=True (a short usage sketch follows the options list below).

Options:

  • All options are described in the docstrings of the model files
  • The depth is controlled by block_config option
  • efficient=True uses the memory-efficient version
  • If you want to use the model for ImageNet, set small_inputs=False. For CIFAR or SVHN, set small_inputs=True.
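
A minimal usage sketch, assuming you have copied the models folder into your project and that the DenseNet constructor exposes the arguments documented in its docstring:

import torch
from models import DenseNet  # the single model file in the models folder

# DenseNet-BC for CIFAR-sized (32x32) inputs, with checkpointed (memory-efficient) layers.
model = DenseNet(
    growth_rate=12,
    block_config=(16, 16, 16),  # three dense blocks; this controls the depth
    num_classes=10,
    small_inputs=True,          # True for CIFAR/SVHN, False for ImageNet
    efficient=True,             # use the memory-efficient implementation
)

images = torch.randn(2, 3, 32, 32)
logits = model(images)          # shape: (2, 10)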

Running the demo:

The only extra package you need to install is python-fire:

pip install fire
  • Single GPU:
CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>
  • Multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1,2 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>

Options:

  • --depth (int) - depth of the network (number of convolution layers) (default 40)
  • --growth_rate (int) - number of features added per DenseNet layer (default 12)
  • --n_epochs (int) - number of epochs for training (default 300)
  • --batch_size (int) - size of minibatch (default 256)
  • --seed (int) - manually set the random seed (default None)
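
For example, a single-GPU run matching the DenseNet-BC (100 layers, growth rate 12, batch size 64) configuration benchmarked below might look like this (paths are placeholders):

CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --depth 100 --growth_rate 12 --batch_size 64 --data <path_to_folder_with_cifar10> --save <path_to_save_dir>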

Performance

A comparison of the two implementations (each is a DenseNet-BC with 100 layers, batch size 64, tested on an NVIDIA Pascal Titan-X):

Implementation          Memory consumption (GB/GPU)   Speed (sec/minibatch)
Naive                   2.863                         0.165
Efficient               1.605                         0.207
Efficient (multi-GPU)   0.985                         -

Other efficient implementations

Reference

@article{pleiss2017memory,
  title={Memory-Efficient Implementation of DenseNets},
  author={Pleiss, Geoff and Chen, Danlu and Huang, Gao and Li, Tongcheng and van der Maaten, Laurens and Weinberger, Kilian Q},
  journal={arXiv preprint arXiv:1707.06990},
  year={2017}
}
Comments
  • test failed on v0.2

    the efficient_densenet_bottleneck_test.py failed in test_backward_computes_backward_pass

    >       assert(almost_equal(layer.conv.weight.grad.data, layer_efficient.conv_weight.grad.data))
    E       assert False
    E        +  where False = almost_equal(\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n    5.2501\n\n(0 ,4 ,.,...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n, \n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n  2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n)
    E        +    where \n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n    5.2501\n\n(0 ,4 ,.,...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
    E        +      where Variable containing:\n(0 ,0 ,.,.) = \n    0.3746\n\n(0 ,1 ,.,.) = \n   70.7402\n\n(0 ,2 ,.,.) = \n   68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n  101.7459\n\n(3 ,6 ,.,.) = \n   10.9038\n\n(3 ,7 ,.,.) = \n    0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
    E        +        where Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False).weight
    E        +          where Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False) = Sequential (\n  (norm): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True)\n  (relu): ReLU (inplace)\n  (conv): Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)\n).conv
    E        +    and   \n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n  2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
    E        +      where Variable containing:\n(0 ,0 ,.,.) = \n  0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n  0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
    E        +        where Parameter containing:\n(0 ,0 ,.,.) = \n  0.0978\n\n(0 ,1 ,.,.) = \n  1.9624\n\n(0 ,2 ,.,.) = \n  2.4802\n\n(0 ,3 ,.,.) = \n  1.06...5 ,.,.) = \n  0.4832\n\n(3 ,6 ,.,.) = \n  1.0052\n\n(3 ,7 ,.,.) = \n  1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = _EfficientDensenetBottleneck (\n).conv_weight
    
    

    I uncommented the code in densenet_efficient.py

    self.efficient_batch_norm.training = False,
    

    but the issue persists.

    call-for-contribution 
    opened by yifita 18
  • Compatibility with PyTorch 0.4

    Thanks very much for sharing this implementation. I forked the code, and it works great on PyTorch 0.3.1. But when I ran it with 0.4.0 (the master version), I got the following error (I made some minor changes, so the line numbers won't match): File "../networks/densenet_efficient.py", line 330, in forward

    bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)
    

    TypeError: Variable data has to be a tensor, but got torch.cuda.FloatStorage

    It turned out that for this line: bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)

    The inputs in version 0.3.1 are FloatTensors, but in 0.4.0 they are Variables.

    I am wondering what's the best way to update the code for 0.4.0?

    Many thanks!
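
    A hedged side note (not from the original thread): in PyTorch 0.4 the Variable and Tensor classes were merged and the volatile flag was removed, so the usual migration is to drop the Variable wrapper and use torch.no_grad() for inference-only work. A minimal sketch of the general pattern, leaving the storage-specific call aside:

        import torch

        some_tensor = torch.empty(16)

        # PyTorch 0.3:   x_var = Variable(some_tensor, volatile=True)
        # PyTorch >= 0.4: tensors are variables, and volatile is replaced by no_grad()
        with torch.no_grad():
            x = some_tensor.view(4, 4)   # ops here are excluded from autograd tracking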

    compatibility 
    opened by seyiqi 11
  • I met this problem when running demo.py. How can I solve it?

    0.1 Training
    Traceback (most recent call last):
      File "demo.py", line 272, in fire.Fire(demo)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 127, in Fire component_trace = _Fire(component, args, context, name)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 366, in _Fire component, remaining_args)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 542, in _CallCallable result = fn(*varargs, **kwargs)
      File "demo.py", line 250, in demo n_epochs=n_epochs, batch_size=batch_size, seed=seed)
      File "demo.py", line 166, in train train=True,
      File "demo.py", line 101, in run_epoch output_var = model(input_var)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 218, in forward features = self.features(x)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward input = module(input)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 152, in forward outputs.append(module.forward(outputs))
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 107, in forward new_features = super(_DenseLayer, self).forward(prev_features)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward input = module(input)
      File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call result = self.forward(*input, **kwargs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 85, in forward return fn(self.norm_weight, self.norm_bias, self.conv_weight, *inputs)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 265, in forward conv_output = self.efficient_conv.forward(conv_weight, None, relu_output)
      File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 457, in forward self.groups, cudnn.benchmark
    TypeError: _cudnn_convolution_full_forward received an invalid combination of arguments - got (torch.cuda.FloatTensor, torch.cuda.FloatTensor, NoneType, torch.cuda.FloatTensor, tuple, tuple, tuple, int, bool), but expected (torch.cuda.RealTensor input, torch.cuda.RealTensor weight, torch.cuda.RealTensor bias, torch.cuda.RealTensor output, std::vector pad, std::vector stride, std::vector dilation, int groups, bool benchmark, bool deterministic)

    opened by yyjFish 9
  • Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version

    Just tried the new implementation in PyTorch 0.3, but it consumes much more memory than the old implementation. Some issues:

    1. When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) loop in _SharedAllocation() requires some modification and optimization.

    2. When the model runs on multiple GPUs, the batch size it can handle is much smaller than the single-GPU batch size times the number of GPUs. From my test, it can only handle the same batch size as the single-GPU version.

    bug 
    opened by ZhengRui 8
  • PyTorch 0.3 compatibility

    The efficient model now works on PyTorch 0.3

    Some other changes:

    • Multi-GPU support is now baked directly into DenseNetEfficient, so I removed the multi-GPU specific model
    • Changed the name of the cifar option to small_inputs (more generic).

    I will merge it in tomorrow, after I confirm that the demo reaches the same error rate on CIFAR-10.

    opened by gpleiss 7
  • The final test accuracy

    Hi, thanks for this implementation! I'm wondering how to obtain the quite strong test set results on CIFAR-10 reported in the original DenseNet paper (e.g., error rate <= 3.5 on C10+ with depth = 190, growth_rate = 40). When I run the script as:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python demo.py --depth 190 --efficient False --data ./data --save ./ckpts

    The final test error is reported as 0.0535. I'm wondering whether the high error is due to no data augmentation being applied in the default setting. May I know whether this uses the C10+ dataset or C10?

    Best

    opened by ustctf-zz 6
  • FP becomes slower after upgrading to 0.4

    Hi, thanks for your work! I recently upgraded my network to 0.4 with your implementation of DenseNet, and I found that the new version is slower than before. I thought the shared memory would noticeably speed up the forward pass. In my application, predicting one subject took 9 s with the 0.3.x version, but it now needs 11 s.

    The Dice metric also got worse than before. I found that the new code uses Kaiming normal initialization, whereas before it used the default initialization (uniform?). I have tried to set all parameters as before, but it had no effect. Do you have any advice for me?

    Thanks.

    opened by DesertsP 5
  • Error in trying to use for the first time

    Hello,

    I am a beginner in Python and PyTorch and am trying to use your efficient DenseNet implementation on a different dataset than CIFAR (images are 80 pixels wide, instead of 32). I use a Windows 10 laptop with the experimental PyTorch port for Windows by peterjc123 (see https://github.com/pytorch/pytorch/issues/494).

    I have incorporated your DenseNetEfficient model into a training script adapted from andreasveit's densenet implementation for PyTorch, and replaced the CIFAR dataset loader (datasets.CIFAR10('../data', train=True, download=True, transform=transform_train)) with datasets.ImageFolder as follows:

        train_loader = torch.utils.data.DataLoader(
            datasets.ImageFolder(root=args.dataroot + '/train', transform=transform_train),
            batch_size=args.batch_size, shuffle=True, **kwargs)
    

    When launching the training script, I get an error that is cryptic to me:

    Traceback (most recent call last):
      File "train.py", line 312, in main()
      File "train.py", line 153, in main train(train_loader, model, criterion, optimizer, epoch)
      File "train.py", line 185, in train output = model(input_var)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call result = self.forward(*input, **kwargs)
      File "D:\deepLearning\densenet\densenetEfficient.py", line 213, in forward out = self.classifier(out)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call result = self.forward(*input, **kwargs)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\linear.py", line 54, in forward return self.backend.Linear.apply(input, self.weight, self.bias)
      File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn_functions\linear.py", line 12, in forward output.addmm(0, 1, input, weight.t())
    RuntimeError: size mismatch at d:\downloads\pytorch-master-1\torch\lib\thc\generic/THCTensorMathBlas.cu:243

    I am surely doing something wrong, but I have searched a lot and did not find anything.

    Any recommendation would be welcome,

    Thanks a lot,

    Christophe

    opened by chmaz 5
  • storage resize_ function

    I am trying to use torch.Storage in my network. I use PyTorch 0.3.

    if self.storage.size() < size:
        is_cuda = self.storage.is_cuda
        if is_cuda:
            gpu_ID = self.storage.get_device()
            print('gpu_ID1:', gpu_ID)
        self.storage.resize_(size)

        gpu_ID = self.storage.get_device()
        print('gpu_ID2:', gpu_ID)

        if is_cuda:
            self.storage = self.storage.cuda(gpu_ID)
        gpu_ID = self.storage.get_device()
        print('gpu_ID3:', gpu_ID)
    

    The output is

    gpu_ID1: 1
    gpu_ID2: 0
    gpu_ID3: 0
    
    

    self.storage comes from self.storage = torch.Storage(1024). It seems the resize_ function changes the GPU where the storage is saved. I want the storage to stay on GPU 1 rather than GPU 0. How can I do that?
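
    A hedged sketch of one possible workaround (not from this repo): resize_ allocates on the current CUDA device, so pinning the device around the call should keep the buffer where it started. This assumes an older PyTorch with the torch.Storage semantics shown above and at least two GPUs:

        import torch

        storage = torch.Storage(1024).cuda(1)    # shared buffer intended to live on GPU 1
        size = 4096
        if storage.size() < size:
            # make GPU 1 the current device before resize_ triggers a reallocation
            with torch.cuda.device(storage.get_device()):
                storage.resize_(size)
        print(storage.get_device())               # expected: 1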

    opened by mingminzhen 4
  • Could it be MORE memory efficient?

    Hi,

    Thanks for your code. I read both the single-GPU and multi-GPU versions. For the single-GPU version, you create the shared memory inside each dense block. Could all the dense blocks share the same memory, so that only one block of space is allocated? I think that would further reduce the space usage.

    For the multi-GPU version, you create the shared memory in the initialization method of the whole network, i.e., one level above the dense block initialization. However, you register a buffer inside each dense block for the shared memory, via self.register_buffer('CatBN_output_buffer', self.storage). Does this mean each dense block has its own independent shared memory? If so, why not let them all share the same area?

    Thanks

    opened by zhiqiangdon 4
  • error rate compute in demo.py

    Hi, thanks for this efficient densenet code.

    But I found a probable mistake in demo.py at line 112:

        error = 1 - torch.eq(predictions_var, target_var).float().mean()

    It might have to be corrected to:

        error = 1 - torch.eq(predictions_var.view(-1), target_var).float().mean()

    Because the sizes of predictions_var and target_var are (train_size, 1) and (train_size,), torch.eq(...) will return a train_size x train_size matrix whose entries are almost all 0 (1 only on the diagonal). The error rate will then not be able to decrease.
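
    A quick illustration of the broadcasting behavior described above (written against a recent PyTorch for brevity):

        import torch

        predictions = torch.tensor([[0], [1], [2]])   # shape (3, 1), like predictions_var
        targets = torch.tensor([0, 1, 2])             # shape (3,),  like target_var

        print(torch.eq(predictions, targets).shape)            # torch.Size([3, 3]) -- broadcast
        print(torch.eq(predictions.view(-1), targets).shape)   # torch.Size([3])    -- intended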

    opened by ghost 4
  • How can I apply this to my own model?

    Thank you for your nice work. My model uses DenseNet-style connections like:

    tensorFeat = torch.cat([self.moduleOne(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleTwo(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleThr(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleFou(tensorFeat), tensorFeat], 1)
    tensorFeat = torch.cat([self.moduleFiv(tensorFeat), tensorFeat], 1)

    What do I need to do to apply the efficient technique to save memory in this part? The DenseNet connections are just one part of my full model.
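
    A hedged sketch of how the checkpointing trick from this repo could be wrapped around one such connection; DenseStep is a hypothetical helper, not something the repo provides:

        import torch
        import torch.nn as nn
        import torch.utils.checkpoint as cp

        class DenseStep(nn.Module):
            """Computes cat([module(x), x], 1), recomputing it during backward
            instead of storing the concatenated intermediate."""
            def __init__(self, module):
                super().__init__()
                self.module = module

            def forward(self, feat):
                def closure(f):
                    return torch.cat([self.module(f), f], 1)
                if self.training and feat.requires_grad:
                    return cp.checkpoint(closure, feat)
                return closure(feat)

        step = DenseStep(nn.Conv2d(8, 4, kernel_size=3, padding=1))
        x = torch.randn(2, 8, 16, 16, requires_grad=True)
        y = step(x)   # 12 channels: 4 new + 8 passed through

    How much this saves depends on how large the concatenated tensors are relative to the rest of the model; the repo's big savings come from applying the trick to every concatenation + batch norm inside the dense blocks.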

    opened by CXMANDTXW 1
  • Is this really memory efficient?

    I see the memory consumption chart in the readme, but after looking at the code, I have doubts that this implementation is fully memory efficient. I see the call to cp.checkpoint in _DenseLayer.forward(), but I don't see some of the other modifications that were called for in the paper, specifically post-activation normalization and contiguous concatenation. Am I missing something?

    If I understand your approach, you are using a method that still requires quadratic memory and computation, but tossing the memory-hogging intermediate values and recomputing them later?

    opened by leonardishere 1
  • dropout not in 3x3 convolutional layer

    Hi, thanks for your work. I have a question about dropout. In densenet.py, the layer's forward function is defined as follows:

        def forward(self, *prev_features):
            bn_function = _bn_function_factory(self.norm1, self.relu1, self.conv1)
            if self.efficient and any(prev_feature.requires_grad for prev_feature in prev_features):
                bottleneck_output = cp.checkpoint(bn_function, *prev_features)
            else:
                bottleneck_output = bn_function(*prev_features)
            new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
            if self.drop_rate > 0:
                new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
            return new_features
    

    while in the original Torch version of DenseNet, dropout is added after both the 1x1 conv and the 3x3 conv:

    function DenseConnectLayerStandard(nChannels, opt)
       local net = nn.Sequential()
    
       net:add(ShareGradInput(cudnn.SpatialBatchNormalization(nChannels), 'first'))
       net:add(cudnn.ReLU(true))   
       if opt.bottleneck then
          net:add(cudnn.SpatialConvolution(nChannels, 4 * opt.growthRate, 1, 1, 1, 1, 0, 0))
          nChannels = 4 * opt.growthRate
          if opt.dropRate > 0 then net:add(nn.Dropout(opt.dropRate)) end
          net:add(cudnn.SpatialBatchNormalization(nChannels))
          net:add(cudnn.ReLU(true))      
       end
       net:add(cudnn.SpatialConvolution(nChannels, opt.growthRate, 3, 3, 1, 1, 1, 1))
       if opt.dropRate > 0 then net:add(nn.Dropout(opt.dropRate)) end
    
       return nn.Sequential()
          :add(nn.Concat(2)
             :add(nn.Identity())
             :add(net))  
    end
    

    Is there a particular reason for not adding a dropout layer after the 3x3 convolutional layer? Thanks in advance.

    opened by lizhenstat 0
  • What is the minimum GPU memory required? Still breaks for me in a single GPU

    Amazon p3.2xlarge: 1 GPU - Tesla V100 -- GPU memory: 16 GB -- batch size = 64

    If efficient = False:
    Error: RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 15.75 GiB total capacity; 14.71 GiB already allocated; 4.88 MiB free; 4.02 MiB cached)

    If efficient = True:
    Error: RuntimeError: CUDA out of memory. Tried to allocate 61.25 MiB (GPU 0; 15.75 GiB total capacity; 14.65 GiB already allocated; 50.88 MiB free; 5.33 MiB cached)


    Amazon g3.4xlarge: 1 GPU - Tesla M60 -- GPU memory: 8 GB -- batch size = 64

    If efficient = False:
    RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)

    If efficient = True:
    RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)

    opened by PabloRR100 1