Library for faster pinned CPU <-> GPU transfer in Pytorch

Overview

SpeedTorch

Join the chat at https://gitter.im/SpeedTorch/community

Faster pinned CPU tensor <-> GPU Pytorch variable transfer, and GPU tensor <-> GPU Pytorch variable transfer, in certain cases.

Update 9-29-19

Since for some systems using pinned Pytorch CPU tensors is faster than using Cupy tensors (see the 'How It Works' section for more detail), I created the general Pytorch tensor classes PytorchModelFactory and PytorchOptimizerFactory, which let you set the tensors to either cuda or cpu, and, if using cpu, whether their memory should be pinned. The original GPUPytorchModelFactory and GPUPytorchOptimizerFactory classes are still in the library, so no existing code using SpeedTorch should be affected. The documentation has been updated to include these new classes.
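For reference, here is a minimal sketch of how these settings might be used, with a small illustrative model; the constructor signatures follow the PytorchModelFactory and PytorchOptimizerFactory documentation further below.

```python
import torch
import torch.nn as nn
import cupy        # Cupy must be imported before SpeedTorch
import SpeedTorch

# Illustrative batch-sized model; only one batch of embeddings lives on the GPU.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.u_embeddings = nn.Embedding(512, 128, sparse=False)

model = TinyModel().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Full collections held as (optionally pinned) Pytorch CPU tensors instead of Cupy tensors.
u_switcher = SpeedTorch.PytorchModelFactory(
    model.u_embeddings, total_classes=50000, embed_dimension=128,
    deviceType='cpu', pinType=True)
uOpt_switcher = SpeedTorch.PytorchOptimizerFactory(
    optimizer, 50000, 128, model, 'u_embeddings',
    deviceType='cpu', pinType=True)
```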

What is it?

This library revolves around Cupy tensors pinned to CPU, which can achieve 3.1x faster CPU -> GPU transfer than regular pinned Pytorch CPU tensors, and 410x faster GPU -> CPU transfer. Speed depends on the amount of data and the number of CPU cores on your system (see the 'How It Works' section for more details).

The library includes functions for embeddings training; it can host embeddings on CPU RAM while they are idle, sparing GPU RAM.

Inspiration

I initially created this library to help train large numbers of embeddings, which the GPU may have trouble holding in RAM. I found that hosting some of the embeddings on the CPU can help achieve this. Embedding systems use sparse training; only a fraction of the total parameters participate in the forward/update steps, while the rest are idle. So I figured, why not keep the idle parameters off the GPU during the training step? For this, I needed fast CPU -> GPU transfer.

For the full backstory, please see the Devpost page

https://devpost.com/software/speedtorch-6w5unb

What can fast CPU->GPU transfer do for me? (more than you might initially think)

With fast CPU->GPU transfer, a lot of fun methods can be developed for functionality that previously might not have seemed possible.

🏎️ Incorporate SpeedTorch into your data pipelines for fast data transfer between CPU and GPU

🏎️ Augment training parameters via CPU storage. As long as you have enough CPU RAM, you can host any number of embeddings without having to worry about the GPU RAM.

🏎️ Use Adadelta, Adamax, RMSprop, Rprop, ASGD, AdamW, and Adam optimizers for sparse embeddings training. Previously, only SparseAdam, Adagrad, and SGD were suitable since only these directly support sparse gradients.

Benchmarks

Speed

(Edit 9-20-19: one of the Pytorch developers pointed out some minor bugs in the original benchmarking code; the values and code have been updated.)

Here is a notebook comparing transfer via SpeedTorch vs. Pytorch tensors, with both pinned CPU and cuda tensors. All tests were done on a Colab instance with a Tesla K80 GPU and a 2-core CPU.

UPDATE 10-17-19: Google Colab now comes standard with 4-core CPUs, so this notebook will give different results than what is reported below, since Pytorch's indexing kernels get more efficient as the number of CPU cores increases.

https://colab.research.google.com/drive/1PXhbmBZqtiq_NlfgUIaNpf_MfpiQSKKs

This notebook times the data transfer of 131,072 float32 embeddings of dimension 128, to and from the Cupy/Pytorch tensors and Pytorch variables, with n=100. The number of CPU cores in the Colab instance has an impact on transfer speed; CPUs with a higher number of cores will see less of an advantage to using SpeedTorch.

The tables below summarize the results. Transferring data from Pytorch cuda tensors to the cuda Pytorch embedding variable is faster than the SpeedTorch equivalent, but for all other transfer types, SpeedTorch is faster. And for the sum of both steps of transferring to/from the cuda Pytorch embedding, SpeedTorch is faster than the Pytorch equivalent for both the regular GPU and pinned CPU tensors.

I have noticed that different Colab instances give different speed results, so keep this in mind while reviewing these results. A personal run of the Colab notebook may produce different values, though the order of magnitude of the results is generally the same.

The transfer times in the following tables are given in seconds. This benchmarking was performed on a Colab instance whose CPU has 2 cores. Colab Pro's paid instances have 4-core CPUs, so the following benchmarks do not reflect performance on those instances.

| Tensor Type | To Cuda Pytorch Variable (s) | Comparison |
|---|---|---|
| SpeedTorch (cuda) | 0.0087 | 6.2x slower than Pytorch equivalent |
| SpeedTorch (PinnedCPU) | 0.0154 | 3.1x faster than Pytorch equivalent |
| Pytorch (cuda) | 0.0014 | 6.2x faster than SpeedTorch equivalent |
| Pytorch (PinnedCPU) | 0.0478 | 3.1x slower than SpeedTorch equivalent |

| Tensor Type | From Cuda Pytorch Variable (s) | Comparison |
|---|---|---|
| SpeedTorch (cuda) | 0.0035 | 9.7x faster than Pytorch equivalent |
| SpeedTorch (PinnedCPU) | 0.0065 | 410x faster than Pytorch equivalent |
| Pytorch (cuda) | 0.0341 | 9.7x slower than SpeedTorch equivalent |
| Pytorch (PinnedCPU) | 2.6641 | 410x slower than SpeedTorch equivalent |

| Tensor Type | Sum of to/from Cuda Pytorch Variable (s) | Comparison |
|---|---|---|
| SpeedTorch (cuda) | 0.0122 | 2.9x faster than Pytorch equivalent |
| SpeedTorch (PinnedCPU) | 0.0219 | 124x faster than Pytorch equivalent |
| Pytorch (cuda) | 0.0355 | 2.9x slower than SpeedTorch equivalent |
| Pytorch (PinnedCPU) | 2.7119 | 124x slower than SpeedTorch equivalent |

Similar benchmarks were calculated for transferring to/from Pytorch cuda optimizers. The results are basically the same; here is the notebook used for the optimizer benchmarking:

https://colab.research.google.com/drive/1Y2nehd8Xj-ixfjkj2QWuA_UjQjBBHhJ5
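For a rough idea of what these notebooks measure, here is a minimal timing sketch (not the notebook itself). It assumes the DataGadget behaves as documented further below, i.e. that getData returns a tensor a cuda variable can accept and that insertData copies a cuda tensor back into the pinned Cupy tensor; the shapes follow the text above, and the Pytorch path is plain pinned-tensor indexing.

```python
import time
import numpy as np
import torch
import cupy        # import Cupy before SpeedTorch
import SpeedTorch

N_ROWS, N_IDX, DIM, RUNS = 1_000_000, 131_072, 128, 100

np.save('data.npy', np.random.uniform(-1.0, 1.0, (N_ROWS, DIM)).astype('float32'))
gadget = SpeedTorch.DataGadget('data.npy', CPUPinn=True)
gadget.gadgetInit()                                # initialization call used in the library's examples

cuda_var = torch.zeros(N_IDX, DIM, device='cuda')  # stands in for a batch-sized cuda variable
pinned = torch.from_numpy(np.load('data.npy')).pin_memory()
idx = np.random.randint(0, N_ROWS, N_IDX)
idx_t = torch.from_numpy(idx)

def timed(fn):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(RUNS):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / RUNS

def st_to_gpu(): cuda_var.copy_(gadget.getData(idx))               # pinned Cupy -> cuda variable
def pt_to_gpu(): cuda_var.copy_(pinned[idx_t], non_blocking=True)  # pinned Pytorch -> cuda variable
def st_to_cpu(): gadget.insertData(cuda_var, idx)                  # cuda variable -> pinned Cupy
def pt_to_cpu(): pinned[idx_t] = cuda_var.cpu()                    # cuda variable -> pinned Pytorch

for name, fn in [('SpeedTorch to GPU', st_to_gpu), ('Pytorch to GPU', pt_to_gpu),
                 ('SpeedTorch to CPU', st_to_cpu), ('Pytorch to CPU', pt_to_cpu)]:
    print(name, timed(fn))
```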

Memory

Although SpeedTorch's tensors are generally faster than Pytorch's, the drawback is that SpeedTorch's tensors use more memory. However, because data transfer is faster, you can use SpeedTorch to increase the number of embeddings trained in your architecture by holding parameters on both the GPU and CPU.

This table is a summary of benchmarking done in Google Colab. From my experience, there is some variation in the memory values Colab reports, about ±0.30 GB, so keep this in mind while reviewing these numbers. The values are for holding a 10,000,000x128 float32 tensor.

| Tensor Type | CPU (GB) | GPU (GB) |
|---|---|---|
| Cupy PinnedCPU | 9.93 | 0.06 |
| Pytorch PinnedCPU | 6.59 | 0.32 |
| Cupy Cuda | 0.39 | 9.61 |
| Pytorch Cuda | 1.82 | 5.09 |

Although Pytorch's transfer time for Pytorch GPU tensor <-> Pytorch cuda variable is not as fast as the Cupy equivalent, the speed is still workable. So if memory is a concern, a best-of-both-worlds approach would be to use SpeedTorch's Cupy pinned CPU tensors to store parameters on the CPU, and SpeedTorch's Pytorch GPU tensors to store parameters on the GPU.
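A minimal sketch of that split, assuming a switcher can be built directly around a bare nn.Embedding (as in the documentation examples below); the 10,000,000 x 128 shape matches the table above.

```python
import torch.nn as nn
import cupy
import SpeedTorch

u_embeddings = nn.Embedding(512, 128, sparse=False).cuda()   # batch-sized model variables (illustrative)
v_embeddings = nn.Embedding(512, 128, sparse=False).cuda()

# Full u collection on pinned CPU memory via Cupy, full v collection on the GPU via Pytorch tensors.
u_switcher = SpeedTorch.ModelFactory(u_embeddings, total_classes=10_000_000,
                                     embed_dimension=128, CPUPinn=True)
v_switcher = SpeedTorch.PytorchModelFactory(v_embeddings, total_classes=10_000_000,
                                            embed_dimension=128, deviceType='cuda')
u_switcher.zerosInit()
v_switcher.zerosInit()
```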

This is the notebook I used for measuring how much memory each tensor type takes: https://colab.research.google.com/drive/1ZKY7PyuPAIDrnx2HdtbujWo8JuY0XkuE If using this in Colab, you will need to restart the environment after each tensor creation to get a measurement for the next tensor.

What systems get a speed advantage?

For the CPU <-> GPU transfer, it depends on the amount of data being transferred and the number of cores you have. Generally, for 1-2 CPU cores SpeedTorch will be much faster, but as the number of CPU cores goes up, Pytorch's CPU <-> GPU indexing operations get more efficient. For more details, please see the 'How it works' section. For an easy way to see if you get a speed advantage on your system, run the benchmarking code on your system, but change the amount of data to reflect the amount you will be working with in your application.

For the GPU <-> GPU transfer, if you use ordinary indexing notation in vanilla Pytorch, all systems will see a speed increase, because SpeedTorch bypasses a bug in Pytorch's indexing operations. This bug can also be avoided by using the nightly builds, or by using different indexing notation; please see the 'How it works' section for more details.

How it works?

Update 9-20-19: I initially had no idea why this was faster than using Pytorch tensors; I stumbled upon the speed advantage by accident. But one of the Pytorch developers on the Pytorch forum pointed it out.

As for the faster CPU <-> GPU transfer, it's because SpeedTorch avoids a CPU indexing operation by masquerading CPU tensors as GPU tensors. The CPU indexing operation may be slow if working with very few CPU cores, such as the 2 in Google Colab, but may be faster if you have many cores. It depends on how much data you're transferring and how many cores you have.

As for the faster GPU <-> GPU transfer, it's because SpeedTorch avoids a bug in Pytorch's indexing operation. This bug can also be avoided by using the nightly builds, or by using index_select / index_copy_ instead of tensor[idx] notation in Pytorch 1.1/1.2.
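For example, in plain Pytorch the two notations below compute the same gather/scatter; on the affected 1.1/1.2 releases the advanced-indexing form hit the slow kernel, while index_select / index_copy_ did not.

```python
import torch

src = torch.randn(100_000, 128, device='cuda')
dst = torch.zeros(100_000, 128, device='cuda')
idx = torch.randint(0, 100_000, (4096,), device='cuda')

gathered_a = src[idx]                          # advanced indexing
gathered_b = torch.index_select(src, 0, idx)   # equivalent gather

dst[idx] = gathered_b                          # advanced-index assignment
dst.index_copy_(0, idx, gathered_b)            # equivalent scatter
```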

For more details, please see this Pytorch forum post

https://discuss.pytorch.org/t/introducing-speedtorch-4x-speed-cpu-gpu-transfer-110x-gpu-cpu-transfer/56147/2

where a Pytorch engineer gives a detailed analysis of how the Cupy indexing kernels result in speedups in certain cases. It's not the transfer itself that is faster, but the indexing kernels that are being used.

As for how the memory management in Cupy works, I direct you to these two Stack Overflow questions I asked, where the brilliant user Robert Crovella not only gave detailed explanations, but also figured out how to allocate pinned memory to Cupy arrays by developing his own memory allocator for Cupy. This is basically the core technology behind SpeedTorch.

https://stackoverflow.com/questions/57750125/cupy-outofmemoryerror-when-trying-to-cupy-load-larger-dimension-npy-files-in-me

https://stackoverflow.com/questions/57752516/how-to-use-cuda-pinned-zero-copy-memory-for-a-memory-mapped-file
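The core idea from those answers is a custom Cupy allocator that hands out pinned (page-locked) host memory, so that arrays created afterwards live in pinned CPU memory while still behaving like Cupy arrays. Here is a sketch adapted from the linked answers; SpeedTorch's internal class names may differ, and the Cupy memory base class has changed names across Cupy versions.

```python
import cupy

class PinnedHostMemory(cupy.cuda.memory.BaseMemory):
    """Backs a Cupy allocation with pinned host memory via cudaHostAlloc."""
    def __init__(self, size):
        self.size = size
        self.device_id = cupy.cuda.device.get_device_id()
        self.ptr = 0
        if size > 0:
            self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)   # 0 = cudaHostAllocDefault
    def __del__(self):
        if self.ptr:
            cupy.cuda.runtime.freeHost(self.ptr)

def pinned_allocator(nbytes):
    return cupy.cuda.memory.MemoryPointer(PinnedHostMemory(nbytes), 0)

cupy.cuda.set_allocator(pinned_allocator)      # subsequent Cupy arrays use pinned host memory
pinned_array = cupy.zeros((1000, 128), dtype=cupy.float32)
cupy.cuda.set_allocator(None)                  # restore the default device-memory allocator
```

When CPUPinn=True is passed to the SpeedTorch classes, the library applies this kind of allocation internally, so you normally do not call the allocator yourself.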

Guide

Getting Started

SpeedTorch is pip installable. You need to have Cupy installed and imported before you import SpeedTorch.

!pip install SpeedTorch
import cupy
import SpeedTorch

Using SpeedTorch to increase speed transfer of data from CPU to GPU

This Colab notebook shows how to load data into SpeedTorch using its DataGadget, and how to transfer this data to/from a Pytorch cuda variable.

https://colab.research.google.com/drive/185Z5Gi62AZxh-EeMfrTtjqxEifHOBXxF
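A minimal sketch of that workflow, using the same getData/insertData calls as the timing sketch in the Benchmarks section; gadgetInit is the initialization call used in the library's examples, and the file name and shapes are illustrative.

```python
import numpy as np
import torch
import cupy
import SpeedTorch

np.save('data.npy', np.random.randn(1000, 128).astype('float32'))

gadget = SpeedTorch.DataGadget('data.npy', CPUPinn=True)
gadget.gadgetInit()

cuda_batch = torch.zeros(64, 128, device='cuda')
idx = np.random.randint(0, 1000, 64)

cuda_batch.copy_(gadget.getData(idx))   # pinned Cupy tensor -> Pytorch cuda variable
gadget.insertData(cuda_batch, idx)      # Pytorch cuda variable -> pinned Cupy tensor
```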

Please see the speed benchmarking notebook to see the speed advantage of using SpeedTorch.

Using SpeedTorch to use non-sparse optimizers (in this case, Adamax) for sparse training

For people first trying to figure out how to use SpeedTorch, I recommend following this example, since word2vec is one of the more commonly known algorithms in machine learning.

https://colab.research.google.com/drive/1ApJR3onbgQWM3FBcBKMvwaGXIDXlDXOt

The notebook shows how to train word2vec the regular way, then shows how to use SpeedTorch to train on the same data using one of the optimizers normally not supported for sparse training. This is possible because every embedding contained in the model's embedding variable is updated at each step, so you can set sparse=False during initialization.
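Here is a condensed sketch of that training pattern under illustrative names and shapes (the model, loss, and index sampling are placeholders, not the notebook's); the switcher calls follow the ModelFactory and OptimizerFactory documentation below.

```python
import numpy as np
import torch
import torch.nn as nn
import cupy
import SpeedTorch

TOTAL, DIM, BATCH, POS = 50000, 128, 512, 1

class SkipGramBatch(nn.Module):
    # The embedding variable only holds one batch; the full collection lives in the switchers.
    def __init__(self):
        super().__init__()
        self.u_embeddings = nn.Embedding(BATCH * POS, DIM, sparse=False)
    def forward(self, dummy_idx):
        return self.u_embeddings(dummy_idx).sum()    # placeholder loss

model = SkipGramBatch().cuda()
optimizer = torch.optim.Adamax(model.parameters())   # normally unusable for sparse training

u_switcher = SpeedTorch.ModelFactory(model.u_embeddings, total_classes=TOTAL, embed_dimension=DIM)
u_switcher.uniformDistributionInit(-0.5 / DIM, 0.5 / DIM)
uOpt_switcher = SpeedTorch.OptimizerFactory(optimizer, TOTAL, DIM, model, 'u_embeddings')
uOpt_switcher.optInit()

dummy_input = u_switcher.variableTransformer(BATCH, POS)      # dummy indexes for the forward pass
dummy_input = torch.as_tensor(dummy_input).long().cuda()      # defensive cast; exact type follows the notebook

for step in range(100):
    pos_idx = np.random.randint(0, TOTAL, size=(BATCH, POS))  # embeddings used this step
    u_switcher.beforeForwardPass(pos_idx)                     # pull this batch from the full collection
    uOpt_switcher.beforeForwardPass(pos_idx)                  # and its optimizer weights
    optimizer.zero_grad()
    loss = model(dummy_input)
    loss.backward()
    optimizer.step()
    u_switcher.afterOptimizerStep(pos_idx)                    # push updates back to the full collections
    uOpt_switcher.afterOptimizerStep(pos_idx)
```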

Augment training parameters via CPU storage

tl;dr:

Normal training: Pytorch embedding variables contain all embeddings. The Pytorch optimizer contains all the corresponding parameter weights for each embedding.

SpeedTorch training: Pytorch embedding variables only contain a batch of embeddings. The Pytorch optimizer only contains the corresponding parameter weights for that batch. SpeedTorch tensors contain the rest, and exchange the embeddings/weights with the Pytorch variables at each step.

In sparse training algorithms like word2vec, GloVe, or Neural Collaborative Filtering, only a fraction of the total parameters (embeddings) are trained at every step. If your GPU cannot hold all of your embeddings at the desired embedding size, an option is to host some of your parameters on pinned CPU Cupy arrays and transfer those parameters to your model tensors as needed. Doing this primarily in Pytorch would be very slow, especially because transferring parameters between a cuda-mounted Pytorch variable and a pinned CPU Pytorch tensor can take 2.5-3 seconds (on Google Colab). Fortunately, this step only takes 0.02-0.03 seconds with SpeedTorch!

Use cases:

--2,829,853 book embeddings--

SpeedTorch was used to train embeddings for 2,829,853 books for a rare-book recommender.

https://github.com/Santosh-Gupta/Lit2Vec2

https://devpost.com/software/lit2vec2

Each book had an embedding of size 400, but an embedding size of 496 could have been used; the 400 embedding size was due to limits on the Google Drive space for storing the trained embeddings :( But GPU RAM is no longer the limiting factor :) Here is a direct link to a demo training notebook, which trains with a 496 embedding size using SpeedTorch:

NOTE: You need the version of the Colab instance that has 25 GB of RAM, instead of the usual 12 GB. To get this type of instance, you need to crash your current instance by overwhelming its RAM; a prompt will then appear in the bottom left corner asking if you would like to upgrade. You can do this by making a loop that keeps doubling the size of a numpy float matrix.

https://colab.research.google.com/drive/1AqhT-HetihXMET1wJQROrC3Q9tFJqJ19

Here is a direct link to a notebook with the same model and data, but which doesn't use SpeedTorch:

https://colab.research.google.com/drive/1idV1jBOUZVPCfdsy40wIrRPHeDOanti_

Using the orthodox training method, the largest embedding size that Colab is able to handle is 255-260; any higher than that and a CUDA error will occur:

RuntimeError: CUDA out of memory. Tried to allocate 2.74 GiB (GPU 0; 11.17 GiB total capacity; 8.22 GiB already allocated; 2.62 GiB free; 5.05 MiB cached)

--14,886,544 research paper embeddings--

https://github.com/Santosh-Gupta/Research2Vec2

SpeedTorch allowed me to train 14,886,544 research paper embeddings at an embedding size of 188, by letting me store the target embeddings on the CPU while keeping the context embeddings on the GPU (the SGD optimizer was used, so there are no optimizer weights).

Here is a direct link to the notebook.

https://colab.research.google.com/drive/1saKzsaHoy6O_U1DF_z15_Qkr5YLNI_GR

NOTE: You need the version of the Colab instance that has 25 GB of RAM, instead of the usual 12 GB. To get this type of instance, you need to crash your current instance by overwhelming its RAM; a prompt will then appear in the bottom left corner asking if you would like to upgrade. You can do this by making a loop that keeps doubling the size of a numpy float matrix.

Without SpeedTorch, only an embedding size of 94-96 can be used on Google Colab's Tesla K80 GPU before a RuntimeError: CUDA out of memory error occurs. Here is a version of the training that does not use SpeedTorch:

https://colab.research.google.com/drive/1jh7RUgeajhdWdGNfWG3Twm1ZjyTQU0KR

Best Practices

  1. Whenever using the Cupy GPU tensors, initialize them before any pinned CPU tensors. This is because the initialization of the Cupy GPU tensors seems to use a solid amount of CPU RAM. So if you're limited on CPU RAM and you already have your pinned CPU tensors in memory, initializing the Cupy GPU tensors may cause a crash.

  2. If you're able to fit all of your parameters in GPU memory, use pure Pytorch, since this is the fastest option for training. If you can't fit all your parameters in GPU memory, split your parameters (keep in mind that your optimizers also have weights) between SpeedTorch's Cupy cuda tensors and SpeedTorch's Cupy pinned CPU tensors; this is the second-fastest option. If you're still not able to fit all your parameters into memory that way, split your parameters between SpeedTorch's Cupy pinned CPU tensors and SpeedTorch's Pytorch cuda tensors; this is slower than both previous options, but uses less GPU memory. For the third option, here are two notebooks which show an example of this: https://colab.research.google.com/drive/1AqhT-HetihXMET1wJQROrC3Q9tFJqJ19 , https://colab.research.google.com/drive/1saKzsaHoy6O_U1DF_z15_Qkr5YLNI_GR

  3. After training, saving any cuda variables will cause an increase in memory usage, and may cause a crash if you're at the limits of your RAM, especially with Cupy. In this case, use the getNumpyVersion method to get a numpy version of your tensor, and then use numpy.save or h5py/PyTables to save the numpy array; numpy.save is more lightweight (see the sketch below the list).
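A short sketch of item 3, assuming u_switcher is a ModelFactory switcher created as in the Guide above; getNumpyVersion is documented in each class's method list.

```python
import numpy as np

embeddings_np = u_switcher.getNumpyVersion()   # numpy copy of the full embedding collection
np.save('u_embeddings.npy', embeddings_np)     # lighter than saving cuda variables directly
```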

Need Help?

Either open an issue, or chat with me directly on Gitter here: https://gitter.im/SpeedTorch

Future Work

I am looking to incorporate more functionality around the fast CPU -> GPU transfer. If you have an idea, please post a GitHub issue.

Experimental

In addition to the Cupy GPU/pinned CPU and Pytorch GPU tensors, SpeedTorch also has Pytorch pinned CPU tensors, and Cupy memmap GPU/pinned CPU tensors. I have not found a solid use for these sorts of tensors, but they're fully coded and available for use.

https://github.com/Santosh-Gupta/SpeedTorch/tree/master/SpeedTorch

One area I would like to look at is whether RAM usage can be reduced by using Cupy memmaps. So far they use just as much memory as the live versions.

Documentation

Class ModelFactory

ModelFactory(model_variable,  total_classes,  embed_dimension, datatype = 'float32', CPUPinn = False)

Creates switchers for model variables using Cupy. A switcher exchanges variables between your full embedding collection and your model's batch collection. Each variable needs its own switcher.

Example:

uEmbed_switcher = SpeedTorch.ModelFactory( skip_gram_modelSparse.u_embeddings, total_classes=50000, embed_dimension=128)

Arguments:

model_variable: Specific variable from your model you would like to create a switcher for.

total_classes: The total amount of embeddings to be trained.

embed_dimension: Dimension of the embeddings.

datatype (optional): Datatype for the variable. Default is 'float32'.

CPUPinn (optional): Pin your full embedding collection to CPU. Spares GPU memory, but data transfer will be slower. Default is False.

Methods:

zerosInit(): Initializes the variable switcher's full collection with zeros.

uniformDistributionInit(low, high): Initializes the variable switcher's full collection with a uniform distribution from low to high.

normalDistributionInit(mean, stdDev): Initializes the variable switcher's full collection with a normal distribution with a mean of mean and a standard deviation of stdDev.

variableTransformer( batchSize, posPerBatch, negPerBatch = None ): Sets up a dummy input to be used for the forward step of your model. batchSize is the size of your batch, and posPerBatch is the number of positive examples per batch. If a second dummy input is needed for the negative examples, negPerBatch (optional) can be set to the number of negative examples, and two dummy inputs will be returned instead of one.

beforeForwardPass(retrievedPosIndexes , retrievedNegIndexes = None): Switches embeddings from the full embeddings collection to your model embeddings. retrievedPosIndexes is the indexes of the positive samples to be retrieved. If negative samples are to be retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

afterOptimizerStep( retrievedPosIndexes , retrievedNegIndexes = None): Switches updated embeddings from your model to the full embeddings collection. retrievedPosIndexes is the indexes of the positive samples that were retrieved. If negative samples were retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

saveCupy(saveFileName): Save tensor to .npy file.

loadCupy(loadFileName): Load tensor from .npy file.

getNumpyVersion: Get numpy version of tensor.

Class OptimizerFactory

OptimizerFactory( given_optimizer,  total_classes,  embed_dimension, model, variable_name, dtype='float32' , CPUPinn = False)

Creates switchers for optimizer variables using Cupy. A switcher exchanges variable weights between your full weight collection and your optimizer's batch collection. Each variable needs its own switcher.

Example:

uAdagrad_switcher = SpeedTorch.OptimizerFactory(given_optimizer,  total_classes,  embed_dimension, model, variable_name, dtype='float32', CPUPinn = False)

Arguments:

given_optimizer: The optimizer initialized with your model weights. If using this for embeddings training, remember to set the sparse parameter to False. Currently supported optimizers are SparseAdam, Adadelta, Adamax, Adam, AdamW, ASGD, and RMSprop. Rprop is also included, but it needs the first forward pass and loss.backward() step to be completed before the OptimizerFactory instance is initialized, because the Rprop optimizer needs gradients of its parameters for initialization (see the ordering sketch after the method list below).

total_classes: The total amount of embeddings to be trained.

embed_dimension: Dimension of the embeddings.

model: The instance of your model.

variable_name: Exact name of the variable defined in your model.

dtype (optional): Data type of your variable. Default is 'float32'

CPUPinn (optional): Pin your full optimizer variable weight collection to CPU. Spares GPU memory, but data transfer will be slower. Default is False.

Methods:

optInit: Initializes the optimizer variable switcher.

beforeForwardPass(retrievedPosIndexes , retrievedNegIndexes = None): Switches optimizer variable weights from the full weights collection to optimizer weight tensor. retrievedPosIndexes is the indexes of the positive samples to be retrieved. If negative samples are to be retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

afterOptimizerStep( retrievedPosIndexes , retrievedNegIndexes = None): Switches optimizer variable weights from your optimizer to the full weights collection. retrievedPosIndexes is the indexes of the positive samples that were retrieved. If negative samples were retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.
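A sketch of the Rprop ordering note above: run one forward/backward pass so gradients exist before constructing the switcher (the model and shapes are illustrative).

```python
import torch
import torch.nn as nn
import cupy
import SpeedTorch

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.u_embeddings = nn.Embedding(512, 128, sparse=False)
    def forward(self, idx):
        return self.u_embeddings(idx).sum()

model = TinyModel().cuda()
optimizer = torch.optim.Rprop(model.parameters())

loss = model(torch.arange(512, device='cuda'))   # first forward pass...
loss.backward()                                  # ...and backward pass, so gradients exist

uOpt_switcher = SpeedTorch.OptimizerFactory(optimizer, 50000, 128, model, 'u_embeddings')
uOpt_switcher.optInit()
```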

Class DataGadget

Creates a tensor whose main function is to transfer its contents to a Pytorch cuda variable.

DataGadget( fileName, CPUPinn=False)

Arguments:

fileName: Location of data .npy file to be opened

CPUPinn (optional): Pin your data to CPU memory. Default is False.

Methods:

getData(indexes): Retrieves data in a format that is ready to be accepted by a Pytorch cuda variable. indexes specifies which rows of the tensor to retrieve.

insertData(dataObject, indexes): Inserts data from a Pytorch cuda variable. dataObject is the Pytorch cuda variable tensor from which the data is retrieved, and indexes specifies which rows of the DataGadget tensor to insert it into.

saveCupy(saveFileName): Save tensor to .npy file.

loadCupy(loadFileName): Load new tensor from .npy file.

getNumpyVersion: Get numpy version of tensor.

Please see this notebook on how to use the DataGadget:

https://colab.research.google.com/drive/185Z5Gi62AZxh-EeMfrTtjqxEifHOBXxF

Class PytorchModelFactory

PytorchModelFactory(model_variable,  total_classes,  embed_dimension, datatype = 'float32', deviceType = 'cuda', pinType = False)

Creates switchers for model variables using Pytorch tensors. A switcher exchanges variables between your full embedding collection and your model's batch collection. Each variable needs its own switcher.

Example:

uEmbed_switcher = SpeedTorch.PytorchModelFactory( skip_gram_modelSparse.u_embeddings, total_classes=50000, embed_dimension=128)

Arguments:

model_variable: Specific variable from your model you would like to create a switcher for.

total_classes: The total amount of embeddings to be trained.

embed_dimension: Dimension of the embeddings.

datatype (optional): Datatype for the variable. Default is 'float32'.

deviceType (optional): Set the device to either 'cuda' or 'cpu'. Default is 'cuda'.

pinType (optional): If device is set to 'cpu', you can specify using pinned memory. Default is False.

Methods:

zerosInit(): Initializes the variable switcher's full collection with zeros.

uniformDistributionInit(low, high): Initializes the variable switcher's full collection with a uniform distribution from low to high.

normalDistributionInit(mean, stdDev): Initializes the variable switcher's full collection with a normal distribution with a mean of mean and a standard deviation of stdDev.

customInit(initFunction, *args): Use any Pytorch initializer for the variable switcher's full collection. Pass the initializer using initFunction and its corresponding arguments using *args (see the sketch after this method list).

variableTransformer(batchSize, posPerBatch, negPerBatch = None ): Sets up a dummy input to be used for the forward step of your model. batchSize is the size of your batch, and posPerBatch is the number of positive examples per batch. If a second dummy input is needed for the negative examples, negPerBatch (optional) can be set to the number of negative examples, and two dummy inputs will be returned instead of one.

beforeForwardPass(retrievedPosIndexes , retrievedNegIndexes = None): Switches embeddings from the full embeddings collection to your model embeddings. retrievedPosIndexes is the indexes of the positive samples to be retrieved. If negative samples are to be retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

afterOptimizerStep(retrievedPosIndexes , retrievedNegIndexes = None): Switches updated embeddings from your model to the full embeddings collection. retrievedPosIndexes is the indexes of the positive samples that were retrieved. If negative samples were retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

saveTorch(saveFileName): Save tensor to file using torch.save

loadTorch(loadFileName): Load tensor using torch.load

getNumpyVersion: Get numpy version of tensor.
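A short sketch of customInit, assuming the extra arguments are forwarded to the initializer after the tensor itself; the bare nn.Embedding used as the model variable here is illustrative.

```python
import torch.nn as nn
import cupy
import SpeedTorch

u_embeddings = nn.Embedding(512, 128, sparse=False).cuda()
u_switcher = SpeedTorch.PytorchModelFactory(u_embeddings, total_classes=50000, embed_dimension=128)
u_switcher.customInit(nn.init.normal_, 0.0, 0.01)   # full collection initialized ~ N(0, 0.01)
```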

Class PytorchOptimizerFactory

PytorchOptimizerFactory( given_optimizer,  total_classes,  embed_dimension, model, variable_name, dtype='float32', deviceType = 'cuda', pinType = False)

Creates switchers for optimizer variables using Pytorch tensors. A switcher exchanges variable weights between your full weight collection and your optimizer's batch collection. Each variable needs its own switcher.

Example:

uAdagrad_switcher = SpeedTorch.PytorchOptimizerFactory(given_optimizer,  total_classes,  embed_dimension, model, variable_name, dtype='float32')

Arguments:

given_optimizer: The optimizer initialized with your model weights. If using this for embeddings training, remember to set the sparse parameter to False. Currently supported optimizers are SparseAdam, Adadelta, Adamax, Adam, AdamW, ASGD, and RMSprop. Rprop is also included, but it needs the first forward pass and loss.backward() step to be completed before the optimizer factory instance is initialized, because the Rprop optimizer needs gradients of its parameters for initialization.

total_classes: The total amount of embeddings to be trained.

embed_dimension: Dimension of the embeddings.

model: The instance of your model.

variable_name: Exact name of the variable defined in your model.

dtype (optional): Data type of your variable. Default is 'float32'

deviceType (optional): Set the device to either 'cuda' or 'cpu'. Default is 'cuda'.

pinType (optional): If device is set to 'cpu', you can specify using pinned memory. Default is False.

Methods:

optInit: Initializes the optimizer variable switcher.

beforeForwardPass(retrievedPosIndexes , retrievedNegIndexes = None): Switches optimizer variable weights from the full weights collection to optimizer weight tensor. retrievedPosIndexes is the indexes of the positive samples to be retrieved. If negative samples are to be retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

afterOptimizerStep( retrievedPosIndexes , retrievedNegIndexes = None): Switches optimizer variable weights from your optimizer to the full weights collection. retrievedPosIndexes is the indexes of the positive samples that were retrieved. If negative samples were retrieved as well, a value for retrievedNegIndexes (optional) can be passed as well.

Citing SpeedTorch:

If you use SpeedTorch in your research or wish to cite, please cite with:

    @misc{
      title={SpeedTorch},
      author={Santosh Gupta},
      howpublished={\url{github.com/Santosh-Gupta/SpeedTorch}},
      year={2019}
    }
