An AdderNet CUDA version

Overview

Training AdderNet, accelerated with CUDA.

Usage

cd adder_cuda
python setup.py install
cd ..
python main.py
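
After installation, you can sanity-check the extension from Python. Below is a minimal sketch: the module name adder_cuda matches the build step above, while the adder2d layer and its signature are inferred from adder.py (seen in the tracebacks in the comments below) and may differ in your checkout.

import torch
import adder_cuda            # the extension built by setup.py install above
from adder import adder2d    # Python wrapper layer defined in adder.py

# Hypothetical smoke test: one adder layer on a random CIFAR-sized batch.
layer = adder2d(16, 32, 3, stride=1, padding=1).cuda()
x = torch.randn(8, 16, 32, 32, device="cuda")
y = layer(x)
print(y.shape)  # expected: torch.Size([8, 32, 32, 32])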

Environment

PyTorch 1.10.0, CUDA 11.3
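
To confirm your toolchain matches, you can print the versions PyTorch was built against:

python -c "import torch; print(torch.__version__, torch.version.cuda)"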

Benchmark

version           training time per batch (s)
raw               1.61
torch.cdist       1.49
cuda_unoptimized  0.4508
this work         0.3158

The CUDA version of AdderNet achieves roughly a 5x speedup over the original implementation (1.61 s down to 0.3158 s per batch). The cuda_unoptimized version appears to have bugs that prevent the model from converging; its speed is listed only for comparison. The experiment was run on an RTX 2080 Ti, training ResNet-20 on CIFAR-10.
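
For reference, per-batch times like those above can be measured along the following lines (a sketch, not the repository's exact benchmarking code; the explicit synchronization matters because CUDA kernels launch asynchronously):

import time
import torch

def timed_batch(net, criterion, optimizer, images, labels):
    # Synchronize before and after so the measurement covers the whole
    # forward/backward/step rather than just the asynchronous launches.
    torch.cuda.synchronize()
    start = time.time()
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    return time.time() - start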

Time(%)  Time      Calls  Avg       Min       Max       Name
 48.57   30.4752s   3920  7.7743ms  162.70us  12.271ms  CONV_BACKWARD
 34.85   21.8686s  19680  1.1112ms  5.3770us  11.827ms  _ZN2at6native27unrolled_elementwise_kernel...
  7.46   4.67901s   5920  790.37us  26.529us  1.5841ms  CONV
  2.24   1.40372s   3920  358.09us  31.298us  845.80us  col2im_kernel
  2.10   1.31882s  36862  35.777us  1.4720us  276.24us  vectorized_elementwise_kernel
  1.43   900.03ms   5920  152.03us  7.9040us  372.40us  im2col_kernel

Above is the per-kernel time distribution for training one epoch. If you are interested, you can continue optimizing the CUDA kernels.
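
The table above looks like nvprof output; a similar per-kernel breakdown can also be collected with torch.profiler. In this sketch, train_one_epoch is a hypothetical stand-in for the epoch loop in main.py:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_one_epoch()  # hypothetical: your training loop
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))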

Comments
  • illegal memory access was encountered

    This error appears after every 8 batches. Changing the batch size doesn't help; it always fails after 8 batches. The machine has 3 GPUs installed, and GPU_ID = 1 is set.

    Files already downloaded and verified
    Files already downloaded and verified
    Train - Epoch 1, Batch: 0, Loss: 2.296886, Time 5.307902
    Train - Epoch 1, Batch: 1, Loss: 2.301040, Time 0.105161
    Train - Epoch 1, Batch: 2, Loss: 2.300776, Time 0.110913
    Train - Epoch 1, Batch: 3, Loss: 2.303986, Time 0.104652
    Train - Epoch 1, Batch: 4, Loss: 2.289750, Time 0.100140
    Train - Epoch 1, Batch: 5, Loss: 2.315252, Time 0.099318
    Train - Epoch 1, Batch: 6, Loss: 2.298506, Time 0.106323
    Train - Epoch 1, Batch: 7, Loss: 2.310294, Time 0.106855
    Traceback (most recent call last):
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 146, in <module>
        main()
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 142, in main
        train_and_test(e)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 135, in train_and_test
        train(epoch)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/main.py", line 90, in train
        output = net(images)
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/densenet.py", line 83, in forward
        x = self.trans3(self.dense3(x))
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
        input = module(input)
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/densenet.py", line 17, in forward
        y = self.conv1(func.relu(self.bn1(x)))
      File "/home/nature/anaconda3/envs/addernet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/adder.py", line 104, in forward
        output = adder2d_function(x, self.adder, self.stride, self.padding)
      File "/work/sunbiao/AdderNetCUDA-LingYeAI/adder.py", line 39, in adder2d_function
        out = out.permute(3, 0, 1, 2).contiguous()
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

    opened by tju-sun-lab 2
  • ResNet-20 based on adder_cuda seems to have difficulty converging

    I tried to train ResNet-20 on the CIFAR-10 classification task, but when using adder_cuda the network seems to have difficulty converging. So I am curious about the author's experimental results on the CIFAR-10 dataset.

    opened by 154115081020 1
  • CUDA ERROR

    Hello, I ran your code and got a CUDA error: an illegal memory access was encountered. The detailed information is:

    Traceback (most recent call last):
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 145, in <module>
        main()
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 141, in main
        train_and_test(e)
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 134, in train_and_test
        train(epoch)
      File "/home/new/classification-CNN/AdderNetCUDA-main/main.py", line 101, in train
        loss.backward()
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
        Variable._execution_engine.run_backward(
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
        return self._forward_cls.backward(self, *args)  # type: ignore
      File "/home/new/classification-CNN/AdderNetCUDA-main/adder.py", line 78, in backward
        grad_W_col = grad_W_col/grad_W_col.norm(p=2).clamp(min=1e-12)*math.sqrt(W_col.size(1)*W_col.size(0))/5
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/tensor.py", line 401, in norm
        return torch.norm(self, p, dim, keepdim, dtype=dtype)
      File "/home/new/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/functional.py", line 1376, in norm
        return _VF.norm(input, p, dim=_dim, keepdim=keepdim)  # type: ignore
    RuntimeError: CUDA error: an illegal memory access was encountered

    Can you provide me with some solutions to this problem?

    opened by wangchangyi1160 1