An AdderNet CUDA version

Overview

Training AdderNet, accelerated with CUDA.

Usage

cd adder_cuda
python setup.py install
cd ..
python main.py
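
After installation, you can sanity-check the extension from Python. Below is a minimal sketch: the module name adder_cuda matches the build step above, while the adder2d layer and its signature are inferred from adder.py (seen in the tracebacks in the comments below) and may differ in your checkout.

import torch
import adder_cuda            # the extension built by setup.py install above
from adder import adder2d    # Python wrapper layer defined in adder.py

# Hypothetical smoke test: one adder layer on a random CIFAR-sized batch.
layer = adder2d(16, 32, 3, stride=1, padding=1).cuda()
x = torch.randn(8, 16, 32, 32, device="cuda")
y = layer(x)
print(y.shape)  # expected: torch.Size([8, 32, 32, 32])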

Environment

PyTorch 1.10.0, CUDA 11.3
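
To confirm your toolchain matches, you can print the versions PyTorch was built against:

python -c "import torch; print(torch.__version__, torch.version.cuda)"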

Benchmark

version           training time per batch (s)
raw               1.61
torch.cdist       1.49
cuda_unoptimized  0.4508
this work         0.3158

The CUDA version of AdderNet achieves roughly a 5x speedup over the original implementation (1.61 s down to 0.3158 s per batch). The cuda_unoptimized version appears to have bugs that prevent the model from converging; its speed is listed only for comparison. The experiment was run on an RTX 2080 Ti, training ResNet-20 on CIFAR-10.
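
For reference, per-batch times like those above can be measured along the following lines (a sketch, not the repository's exact benchmarking code; the explicit synchronization matters because CUDA kernels launch asynchronously):

import time
import torch

def timed_batch(net, criterion, optimizer, images, labels):
    # Synchronize before and after so the measurement covers the whole
    # forward/backward/step rather than just the asynchronous launches.
    torch.cuda.synchronize()
    start = time.time()
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    return time.time() - start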

Time(%)  Time      Calls  Avg       Min       Max       Name
 48.57   30.4752s   3920  7.7743ms  162.70us  12.271ms  CONV_BACKWARD
 34.85   21.8686s  19680  1.1112ms  5.3770us  11.827ms  _ZN2at6native27unrolled_elementwise_kernel...
  7.46   4.67901s   5920  790.37us  26.529us  1.5841ms  CONV
  2.24   1.40372s   3920  358.09us  31.298us  845.80us  col2im_kernel
  2.10   1.31882s  36862  35.777us  1.4720us  276.24us  vectorized_elementwise_kernel
  1.43   900.03ms   5920  152.03us  7.9040us  372.40us  im2col_kernel

Above is the per-kernel time distribution for training one epoch. If you are interested, you can continue optimizing the CUDA kernels.
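
The table above looks like nvprof output; a similar per-kernel breakdown can also be collected with torch.profiler. In this sketch, train_one_epoch is a hypothetical stand-in for the epoch loop in main.py:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_one_epoch()  # hypothetical: your training loop
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))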

Comments
  • illegal memory access was encountered

    This error appears after every 8 batches. Changing the batch size doesn't help; it always fails after 8 batches. The machine has 3 GPUs installed, and GPU_ID = 1 is set.

    Files already downloaded and verified
    Files already downloaded and verified
    Train - Epoch 1, Batch: 0, Loss: 2.296886, Time 5.307902
    Train - Epoch 1, Batch: 1, Loss: 2.301040, Time 0.105161
    Train - Epoch 1, Batch: 2, Loss: 2.300776, Time 0.110913
    Train - Epoch 1, Batch: 3, Loss: 2.303986, Time 0.104652
    Train - Epoch 1, Batch: 4, Loss: 2.289750, Time 0.100140
    Train - Epoch 1, Batch: 5, Loss: 2.315252, Time 0.099318
    Train - Epoch 1, Batch: 6, Loss: 2.298506, Time 0.106323
    Train - Epoch 1, Batch: 7, Loss: 2.310294, Time 0.106855
    Traceback (most recent call last):
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 146, in <module>
        main()
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 142, in main
        train_and_test(e)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 135, in train_and_test
        train(epoch)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 90, in train
        output = net(images)
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/densenet.py", line 83, in forward
        x = self.trans3(self.dense3(x))
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
        input = module(input)
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/densenet.py", line 17, in forward
        y = self.conv1(func.relu(self.bn1(x)))
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/adder.py", line 104, in forward
        output = adder2d_function(x, self.adder, self.stride, self.padding)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/adder.py", line 39, in adder2d_function
        out = out.permute(3, 0, 1, 2).contiguous()
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

    opened by tju-sun-lab 2
  • ResNet-20 based on adder_cuda seems to have difficulty converging

    I tried to train ResNet-20 on the CIFAR-10 classification task, but when using adder_cuda the network seems to have difficulty converging. So I am curious about the author's experimental results on the CIFAR-10 dataset.

    opened by 154115081020 1
  • CUDA ERROR

    Hello, I ran your code and got a CUDA error: an illegal memory access was encountered. The detailed information is:

    Traceback (most recent call last):
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 145, in <module>
        main()
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 141, in main
        train_and_test(e)
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 134, in train_and_test
        train(epoch)
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 101, in train
        loss.backward()
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
        Variable._execution_engine.run_backward(
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
        return self._forward_cls.backward(self, *args)  # type: ignore
      File "/home/new/classification-CNN/AdderNetCUDA-main/adder.py", line 78, in backward
        grad_W_col = grad_W_col/grad_W_col.norm(p=2).clamp(min=1e-12)*math.sqrt(W_col.size(1)*W_col.size(0))/5
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/tensor.py", line 401, in norm
        return torch.norm(self, p, dim, keepdim, dtype=dtype)
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/functional.py", line 1376, in norm
        return _VF.norm(input, p, dim=_dim, keepdim=keepdim)  # type: ignore
    RuntimeError: CUDA error: an illegal memory access was encountered

    Can you provide me with some solutions to this problem?

    opened by wangchangyi1160 1