Deep Learning GPU Training System

Overview

DIGITS


DIGITS (the Deep Learning GPU Training System) is a webapp for training deep learning models. The currently supported frameworks are Caffe, Torch, and TensorFlow.

Feedback

In addition to submitting pull requests, feel free to submit and vote on feature requests via our ideas portal.

Documentation

The most current documentation is available at NVIDIA Accelerated Computing, Deep Learning Documentation, NVIDIA DIGITS.

Installation

| Installation method | Supported platform[s] | Available versions | Instructions |
| --- | --- | --- | --- |
| Source | Ubuntu 14.04, 16.04 | GitHub tags | docs/BuildDigits.md |

The official DIGITS container is available from nvcr.io via the docker pull command.

Usage

Once you have installed DIGITS, visit docs/GettingStarted.md for an introductory walkthrough.

Then, take a look at some of the other documentation at docs/ and examples/.

Get help

Installation issues

  • First, check out the instructions above
  • Then, ask questions on our user group

Usage questions

Bugs and feature requests

Notice on security

Users shall understand that DIGITS is not designed to be run as an exposed external web service.

Comments
  • Torch Data Augmentation

    Torch Data Augmentation

    Data augmentation needs little introduction, I reckon. It counters overfitting and makes your model generalize better, yielding better validation accuracies; alternatively, it allows you to use smaller datasets with similar performance.

    In the Zoo that's the internet, I see many implementations of different augmentations, of which few are proper and nicely portable. Apart from DIGITS providing a great UI, ease of use, and a turn-key deep learning solution, I strongly feel we can expand on the functional side as well to make this a deep learning killer app.

    For torch, I have made an implementation during lua preprocessing from frontend to backend to enable Digits to do so. In #330 there was already an attempt for augmentation, which happened on the dataset-creation side; something I am strongly against. Resizing and cropping I would consider a transformation, while I consider augmenting the data in its container an augmentation. I think therefore it's fine to resize during dataset loading (and squashing/filling/etc), but I would probably leave it at that.

    Anyway, I set up a more dynamic structure to pass these options around on the Torch side; instead of adding a dozen arguments to each function, I am just adding a table.

    Implements the following (screenshot): image

    I have iterated through many augmentation types but these were the most useful. Almost done, now running elaborate tests.
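To illustrate the idea of passing a single options table instead of many arguments, here is a minimal Python sketch. It is not the Lua code from this pull request: the option names `flip_prob` and `quad_rot` and all helper names are hypothetical, and an image is represented as a plain list of rows.

```python
import random

def flip_lr(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]

def rot90_cw(image):
    """Rotate an image (list of rows) 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image, opts, rng=random):
    """Apply the augmentations enabled in the opts dict."""
    out = [list(row) for row in image]
    # Mirror with probability opts["flip_prob"] (0 disables flips).
    if rng.random() < opts.get("flip_prob", 0.0):
        out = flip_lr(out)
    # Quadrilateral rotation: 0, 90, 180 or 270 degrees, chosen uniformly.
    if opts.get("quad_rot", False):
        for _ in range(rng.randrange(4)):
            out = rot90_cw(out)
    return out

opts = {"flip_prob": 0.5, "quad_rot": True}
augmented = augment([[1, 2], [3, 4]], opts, random.Random(0))
```

Adding a new augmentation in this style only means reading one more key from `opts`, so function signatures stay unchanged.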

    Progress

    The code is already functional, though see progress below. See code, shoot!

    Features

    • [x] Make UI data transforms only visible for the Torch framework (invisible for Caffe)
    • [x] ~~Implement UI option for normalization (scales the [0 255] to [0 1])~~
    • [x] Data Augmentation UI
    • [x] Flips (mirrors)
    • [x] Quadrilateral rotations
    • [x] Arbitrary rotations
    • [x] Arbitrary scales
    • [x] Augmenting in HSV space
    • [x] Augmenting with noise (Thoughts?)
    • [x] [Travis] Tests
    • [x] Use Data Augmentation Template: data_augmentation.html

    Testing

    • [x] No augmentation
    • [x] Flips (mirrors)
    • [x] Quadrilateral rotations
    • [x] Arbitrary rotations
    • [x] Arbitrary scales
    • [x] Arbitrary rotations & arbitrary scales
    • [x] Augmenting in HSV space
    • [x] Augmenting with noise
    • [x] All Augmentations & benchmark speed; identify bottlenecks
    • [x] Verify models reporting a slower learning/less overfitting trade-off : more generalization.
    enhancement torch 
    opened by TimZaman 46
  • running on multiple GPU is very slow

    running on multiple GPU is very slow

    I am trying to run a 50-layer residual network on 4 K40m GPUs and it's very slow (same batch_size of 16 as on a single GPU), taking 6 hours for 1 epoch. However, if I run it on 1 GPU the speed is normal.

    System: CentOS, digits v3, nvcaffe-0.14

    BTW, I tried GoogLeNet and it was OK on 4 GPUs.

    Any suggestion or potential issue?

    duplicate 
    opened by 201power 37
  • ERROR: Expected caffe suffix

    ERROR: Expected caffe suffix "-nv". libcaffe.so does not match. Are you building from the NVIDIA/caffe fork?

    Hi,

    I'm running on Ubuntu 14.04 LTS.

    ERROR: Expected caffe suffix "-nv". libcaffe.so does not match. Are you building from the NVIDIA/caffe fork?

    ubuntu@ip-10-0-1-51:~/digits$ pip install -r requirements.txt
    You are using pip version 7.0.3, however version 7.1.0 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.
    Requirement already satisfied (use --upgrade to upgrade): Pillow>=2.3.0 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 1))
    Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 2))
    Requirement already satisfied (use --upgrade to upgrade): scipy>=0.13.3 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 3))
    Collecting protobuf>=2.5.0 (from -r requirements.txt (line 4))
      Downloading protobuf-2.6.1.tar.gz (188kB)
        100% |████████████████████████████████| 188kB 2.3MB/s 
    Collecting pydot>=1.0.2 (from -r requirements.txt (line 5))
      Downloading pydot-1.0.2.tar.gz
    Requirement already satisfied (use --upgrade to upgrade): six>=1.5.2 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 6))
    Requirement already satisfied (use --upgrade to upgrade): requests>=2.2.1 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 7))
    Requirement already satisfied (use --upgrade to upgrade): gevent>=1.0 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 8))
    Requirement already satisfied (use --upgrade to upgrade): Flask>=0.10.1 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 9))
    Collecting Flask-WTF>=0.11 (from -r requirements.txt (line 10))
      Downloading Flask_WTF-0.12-py2-none-any.whl
    Collecting Flask-SocketIO (from -r requirements.txt (line 11))
      Downloading Flask-SocketIO-0.6.0.tar.gz
    Collecting lmdb (from -r requirements.txt (line 12))
      Downloading lmdb-0.86.tar.gz (144kB)
        100% |████████████████████████████████| 147kB 2.9MB/s 
    Requirement already satisfied (use --upgrade to upgrade): nose>=1.3.1 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 13))
    Requirement already satisfied (use --upgrade to upgrade): mock>=1.0.1 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 14))
    Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4>=4.2.1 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 15))
    Requirement already satisfied (use --upgrade to upgrade): selenium>=2.25.0 in /home/ubuntu/anaconda/lib/python2.7/site-packages (from -r requirements.txt (line 16))
    Collecting gunicorn (from -r requirements.txt (line 17))
      Downloading gunicorn-19.3.0-py2.py3-none-any.whl (110kB)
        100% |████████████████████████████████| 110kB 3.8MB/s 
    Requirement already satisfied (use --upgrade to upgrade): setuptools in /home/ubuntu/anaconda/lib/python2.7/site-packages/setuptools-17.1.1-py2.7.egg (from protobuf>=2.5.0->-r requirements.txt (line 4))
    Requirement already satisfied (use --upgrade to upgrade): pyparsing in /home/ubuntu/anaconda/lib/python2.7/site-packages (from pydot>=1.0.2->-r requirements.txt (line 5))
    Requirement already satisfied (use --upgrade to upgrade): Werkzeug in /home/ubuntu/anaconda/lib/python2.7/site-packages (from Flask-WTF>=0.11->-r requirements.txt (line 10))
    Collecting WTForms (from Flask-WTF>=0.11->-r requirements.txt (line 10))
      Downloading WTForms-2.0.2-py27-none-any.whl (128kB)
        100% |████████████████████████████████| 131kB 3.3MB/s 
    Collecting gevent-socketio>=0.3.6 (from Flask-SocketIO->-r requirements.txt (line 11))
      Downloading gevent_socketio-0.3.6-py27-none-any.whl
    Requirement already satisfied (use --upgrade to upgrade): gevent-websocket in /home/ubuntu/anaconda/lib/python2.7/site-packages (from gevent-socketio>=0.3.6->Flask-SocketIO->-r requirements.txt (line 11))
    Installing collected packages: protobuf, pydot, WTForms, Flask-WTF, gevent-socketio, Flask-SocketIO, lmdb, gunicorn
      Running setup.py install for protobuf
      Running setup.py install for pydot
      Running setup.py install for Flask-SocketIO
      Running setup.py install for lmdb
    Successfully installed Flask-SocketIO-0.6.0 Flask-WTF-0.12 WTForms-2.0.2 gevent-socketio-0.3.6 gunicorn-19.3.0 lmdb-0.86 protobuf-2.6.1 pydot-1.0.2
    ubuntu@ip-10-0-1-51:~/digits$ sudo apt-get install graphviz
    Reading package lists... Done
    Building dependency tree       
    Reading state information... Done
    graphviz is already the newest version.
    The following packages were automatically installed and are no longer required:
      linux-headers-3.13.0-49 linux-headers-3.13.0-49-generic
      linux-image-3.13.0-49-generic linux-image-extra-3.13.0-49-generic
    Use 'apt-get autoremove' to remove them.
    0 upgraded, 0 newly installed, 0 to remove and 267 not upgraded.
    ubuntu@ip-10-0-1-51:~/digits$ ./digits-devserver
      ___ ___ ___ ___ _____ ___
     |   \_ _/ __|_ _|_   _/ __|
     | |) | | (_ || |  | | \__ \
     |___/___\___|___| |_| |___/
    
    Welcome to the DIGITS config module.
    
    Where is caffe installed?
        (enter "SYS" if installed system-wide)
        [default is SYS]
    (q to quit) >>> SYS
    ERROR: Expected caffe suffix "-nv". libcaffe.so does not match. Are you building from the NVIDIA/caffe fork?
    
    (q to quit) >>> 
    
    caffe 
    opened by dbl001 35
  • Accuracy & confusion matrix

    Accuracy & confusion matrix

    See #17

    Adds a new kind of job for performance evaluation of trained classifiers. It is now possible to visualize :

    • accuracy / recall curve
    • confusion matrix

    Accuracy and the confusion matrix are computed against a chosen snapshot of a training task, and against both the validation set and testing set (if it exists). An "evaluate performance" button has been added on the training view. This is currently the only way to run an evaluation job. The results are stored in the job directory in the form of two pickle files.

    button

    Accuracy / recall curve

    accuracy recall curve

    Confusion matrix

    I chose a very simple representation of the confusion matrix (not in the form of a matrix!), because it is better suited to datasets with lots of classes. For each class, the top 10 most represented classes are displayed, with their respective percentages.

    confusion matrix
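The per-class "top 10" view described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs (two parallel lists of true and predicted labels); the function name `top_confusions` is hypothetical and is not taken from the pull request.

```python
from collections import Counter, defaultdict

def top_confusions(true_labels, predicted_labels, k=10):
    """For each true class, list the k most frequent predictions with their share in %."""
    counts = defaultdict(Counter)
    for truth, pred in zip(true_labels, predicted_labels):
        counts[truth][pred] += 1
    summary = {}
    for truth, preds in counts.items():
        total = sum(preds.values())
        summary[truth] = [(pred, 100.0 * n / total) for pred, n in preds.most_common(k)]
    return summary

truth = ["cat", "cat", "cat", "cat", "dog", "dog"]
preds = ["cat", "cat", "cat", "dog", "dog", "dog"]
summary = top_confusions(truth, preds)
```

Each entry of `summary` is a short ranked list rather than a full matrix row, which stays readable even with hundreds of classes.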

    Related jobs

    I added a "Related jobs" section on each job show view. It displays the jobs that depend on the current job, for example models trained on a specific dataset or evaluations run on a specific model.

    Related jobs

    Let me know what you think, critiques and comments are more than welcome.

    opened by groar 29
  • Windows Compatibility

    Windows Compatibility

    On my machine, image serving (e.g. of mean.jpg) does not work. The browser (tested with IE and Chrome) cannot interpret the image, probably due to the missing content type. The send_file function takes care of all of that.
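As a rough illustration of what "taking care of the content type" means, the sketch below guesses a Content-Type value from a file name using only the Python standard library. It is not the change made in this pull request; the helper name `content_type_for` and the fallback value are assumptions.

```python
import mimetypes

def content_type_for(filename, default="application/octet-stream"):
    """Guess the Content-Type header value for a file from its name."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default

print(content_type_for("mean.jpg"))
```

A response that carries such a header lets the browser decide how to render the file instead of guessing from the raw bytes.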

    windows 
    opened by crohkohl 27
  • Add support for HDF5 datasets

    Add support for HDF5 datasets

    Closes #224

    TODO before merge

    • [x] Create models from HDF5 datasets using HDF5Data layers
    • [x] Expose backend and compression information in REST API
    • [x] Shard HDF5 files into acceptable dataset sizes - https://github.com/BVLC/caffe/issues/2953#issuecomment-137274066

    TODO after merge

    • Allow non-image data (see #197)
    • Analyze prebuilt HDF5 datasets in "generic" path
    enhancement 
    opened by lukeyeager 26
  • Set map_size for LMDB

    Set map_size for LMDB

    @crohkohl, @danst18, I'm breaking the discussion in #203 out into a new issue.

    Here's the situation as I understand it. Please correct me if any of this is wrong.

    | map_size | Linux | OSX & Windows |
    | --- | --- | --- |
    | lower than size of dataset | LMDB runs out of memory | ? |
    | higher than system memory | No problem | LMDB can't allocate enough memory |

    On Linux, you can just set it as high as you like and never see a problem. But that strategy blows up on other platforms.

    Should [map_size] be made configurable? https://github.com/NVIDIA/DIGITS/pull/203#issuecomment-128859465

    This is a sufficient but lazy solution. I would like to understand whether this can be avoided programmatically somehow before making a decision. My googling skills are failing me.
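One programmatic option, offered here only as a sketch and not as something the thread settled on, is to derive map_size from an upper bound on the dataset size rather than using one fixed, very large value. The helper `estimate_map_size` and the headroom factor of 2 are assumptions.

```python
def estimate_map_size(num_images, height, width, channels, headroom=2.0):
    """Upper-bound the LMDB map_size (bytes) for uncompressed 8-bit images."""
    raw_bytes = num_images * height * width * channels
    # Headroom covers keys, labels and B-tree overhead; the factor is a guess.
    return int(raw_bytes * headroom)

# Example: 60,000 grayscale 28x28 images (MNIST-sized).
map_size = estimate_map_size(60000, 28, 28, 1)
# With the py-lmdb package this would be used as:
#   env = lmdb.open("train_db", map_size=map_size)
```

An estimate like this stays above the dataset size (the first row of the table) while avoiding a value far beyond system memory (the second row).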

    question 
    opened by lukeyeager 26
  • can't find hdf5.h when build caffe

    can't find hdf5.h when build caffe

    I want to install digits on my debian jessie.
    When I build Caffe (NVIDIA's fork), I get errors complaining that hdf5.h cannot be found.

    I'm sure I have installed libhdf5-serial-dev and libhdf5-dev, and I found the header file in /usr/include/hdf5/serial and its libs in /usr/lib/x86_64-linux-gnu.

    So, what's wrong? Can someone help me?

    The build error message is shown below:

    (venv)➜  caffe  make all --jobs=4
    CXX src/caffe/layer_factory.cpp
    CXX src/caffe/util/insert_splits.cpp
    CXX src/caffe/util/db.cpp
    CXX src/caffe/util/upgrade_proto.cpp
    In file included from src/caffe/util/upgrade_proto.cpp:10:0:
    ./include/caffe/util/io.hpp:8:18: fatal error: hdf5.h: no such file or directory
     #include "hdf5.h"
                      ^
    compilation terminated.
    Makefile:512: recipe for target '.build_release/src/caffe/util/upgrade_proto.o' failed
    make: *** [.build_release/src/caffe/util/upgrade_proto.o] Error 1
    make: *** Waiting for unfinished jobs....
    In file included from ./include/caffe/common_layers.hpp:10:0,
                     from ./include/caffe/vision_layers.hpp:10,
                     from src/caffe/layer_factory.cpp:6:
    ./include/caffe/data_layers.hpp:9:18: fatal error: hdf5.h: no such file or directory
     #include "hdf5.h"
                      ^
    compilation terminated.
    Makefile:512: recipe for target '.build_release/src/caffe/layer_factory.o' failed
    make: *** [.build_release/src/caffe/layer_factory.o] Error 1
    
    question caffe platform 
    opened by tangshi 26
  • mAP always zero

    mAP always zero

    I can't figure out why my model training mAP (val) doesn't get above zero. I'm trying to use the same approach and the SpaceNet_DetectNet_Train_Val.prototxt from this article.

    My label files 000n.txt look like this: p 0.0 0 0.0 0 0 24 118 0 0 0 0 0 0 0 0

    My images are 1280x1280, and I'm using these custom classes: dontcare,p
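The label line above can be decoded with a short sketch, assuming it follows the KITTI object-detection layout (type, truncated, occluded, alpha, then the 2D box as left, top, right, bottom, followed by 3D fields and an optional score). The helper `parse_kitti_line` and the `KittiLabel` tuple are hypothetical names used only for illustration.

```python
from collections import namedtuple

KittiLabel = namedtuple("KittiLabel", "type truncated occluded alpha left top right bottom")

def parse_kitti_line(line):
    """Decode the class name and 2D bounding box from one KITTI-style label line."""
    fields = line.split()
    if len(fields) < 15:
        raise ValueError("expected at least 15 fields, got %d" % len(fields))
    left, top, right, bottom = (float(v) for v in fields[4:8])
    return KittiLabel(fields[0], float(fields[1]), int(fields[2]), float(fields[3]),
                      left, top, right, bottom)

label = parse_kitti_line("p 0.0 0 0.0 0 0 24 118 0 0 0 0 0 0 0 0")
box_width = label.right - label.left
box_height = label.bottom - label.top
```

Under this reading, the example line describes a 24x118 pixel box anchored at the top-left corner of the image, which may be worth comparing against where the objects actually are.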


    Where am I going wrong?

    object-detection 
    opened by DarylWM 25
  • CUDNN_STATUS_BAD_PARAM

    CUDNN_STATUS_BAD_PARAM

    Ubuntu 14.04 LTS, clean install, NVIDIA dpkg install

    $ sudo apt-get install cuda
    $ sudo apt-get install digits
    
    $ gedit .bashrc
    Add the following lines at the end of the file:
    
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    
    $ sudo reboot
    
    $ nvidia-smi
    Tue May 31 13:32:37 2016       
    +------------------------------------------------------+                       
    | NVIDIA-SMI 352.93     Driver Version: 352.93         |                       
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 960     Off  | 0000:01:00.0      On |                  N/A |
    | 20%   37C    P8    10W / 160W |    289MiB /  4095MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 960     Off  | 0000:02:00.0     Off |                  N/A |
    | 20%   43C    P8     9W / 160W |     13MiB /  4095MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    $ nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2015 NVIDIA Corporation
    Built on Tue_Aug_11_14:27:32_CDT_2015
    Cuda compilation tools, release 7.5, V7.5.17
    

    ----Run DIGITS and create a dataset----

    MNIST, Image Size 28x28, Image Type GRAYSCALE

    Run an Image Classification Model

    Select Caffe and LeNet

    Run; the following error is raised:

    ERROR: Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM

    bug 
    opened by shinfo001 25
  • Error: status == CUDNN_STATUS_SUCCESS (8 vs. 0)  CUDNN_STATUS_EXECUTION_FAILED

    Error: status == CUDNN_STATUS_SUCCESS (8 vs. 0) CUDNN_STATUS_EXECUTION_FAILED

    I am getting this error when trying to run training with my custom network.

    status == CUDNN_STATUS_SUCCESS (8 vs. 0) CUDNN_STATUS_EXECUTION_FAILED

    I found this post that refers to this error: https://github.com/BVLC/caffe/issues/1700#issuecomment-133476490

    But it doesn't specify where or how to fix it. Also I am not sure if the issues are related or something completely different. Let me mention that this custom framework works perfectly fine when I run it in my local caffe install, and I can also see all the nodes if I hit the visualize button. It starts training and fails after the first epoch.


    bug 
    opened by alfredox10 24
  • Fix TypeError

    Fix TypeError

    File "/opt/digits/digits/extensions/data/imageSegmentation/data.py", line 225, in split_image_list
      random.shuffle(self.random_indices)
    File "/usr/lib/python3.8/random.py", line 307, in shuffle
      x[i], x[j] = x[j], x[i]
    TypeError: 'range' object does not support item assignment
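In Python 3, range() returns an immutable range object, so random.shuffle cannot reorder it in place. The sketch below reproduces the error and shows the usual remedy of building a list first. The name `random_indices` mirrors the attribute in the traceback; `num_images` and the rest of the surrounding code are assumed.

```python
import random

num_images = 10  # hypothetical dataset size

# Python 3: range() is an immutable sequence, so shuffling it in place fails.
try:
    random.shuffle(range(num_images))
    shuffle_failed = False
except TypeError:
    shuffle_failed = True

# Converting to a list first gives shuffle a mutable sequence to work on.
random_indices = list(range(num_images))
random.shuffle(random_indices)
```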

    opened by vertexodessa 0
  • DIGITS DOCKER CONTAINER INSTALLING SUNNY PLUGIN

    DIGITS DOCKER CONTAINER INSTALLING SUNNY PLUGIN

    I'm sorry, I'm trying to install the Sunnybrook plugin for the segmentation example in the Docker container, as I want to run it on the TensorFlow backend (not Caffe). I tried to repeat the install procedure from inside the container by running docker exec -it XXXXX bash, where XXXXX is the container ID, then downloading the plugin from https://github.com/NVIDIA/DIGITS/tree/master/plugins/data and following the install procedure, but it does not work. Is there any official way to do this? I ran pip install --ignore-installed setuptools (no error appears):

    Installing collected packages: setuptools Successfully installed setuptools-44.1.1

    Then I ran git clone https://github.com/NVIDIA/DIGITS.git, went to /DIGITS/plugins/data/sunnybrook via "cd", and finally ran pip install . No error appears, but after restarting Docker, creating a Sunny dataset fails (the error is in the following post, which I've posted separately for clarity).

    Can you help please? Kind regards

    opened by crmuinos 1
  • I'm confused between which version of DIGITS to install

    I'm confused between which version of DIGITS to install

    Apologies in advance since I'm new to all this, but I'm confused about which version of DIGITS to install. I'm beginning a fresh install of the latest Ubuntu version and as of now, after hours of scouring the internet, I have found DIGITS versions that work standalone, versions that work in Docker, then there's the official DIGITS GitHub page, which has DIGITS up to version 6, and on NGC there's DIGITS 20.03???

    What is going on? I'm so confused. I was excited to get DIGITS up and running on my local machine as soon as I had completed the NVIDIA DLI course, and now I'm just stumped as to where to start. I would also like to know how different DIGITS running on TensorFlow is from DIGITS running on Caffe.

    Please help.

    opened by RazaZaidi2802 0
  • cannot see detectnet bounding boxes using Caffe model on Nano

    cannot see detectnet bounding boxes using Caffe model on Nano

    We have trained and deployed a custom model on the nano using a caffe detectnet model. We trained in digits, and it works well when conducting inference in DIGITS, but it will not show bounding boxes when running on the nano. Is there a patch for this issue?

    opened by eanmikale 0
  • Module Creation errors

    Module Creation errors

    So I am about to train with DIGITS as specified in Hello AI World, and then 4cd6b3f6e3058db2dfd91edaef62c9058f65ab8d

    This is the run output:

    inception_5b/relu_pool_proj ← inception_5b/pool_proj
    inception_5b/relu_pool_proj → inception_5b/pool_proj (in-place)
    Setting up inception_5b/relu_pool_proj
    TRAIN Top shape for layer 158 ‘inception_5b/relu_pool_proj’ 5 128 40 40 (1024000)
    Creating layer ‘inception_5b/output’ of type ‘Concat’
    Layer’s types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
    Created Layer inception_5b/output (159)
    inception_5b/output ← inception_5b/1x1
    inception_5b/output ← inception_5b/3x3
    inception_5b/output ← inception_5b/5x5
    inception_5b/output ← inception_5b/pool_proj
    inception_5b/output → inception_5b/output
    Setting up inception_5b/output
    TRAIN Top shape for layer 159 ‘inception_5b/output’ 5 1024 40 40 (8192000)
    Creating layer ‘pool5/drop_s1’ of type ‘Dropout’
    Layer’s types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
    Created Layer pool5/drop_s1 (160)
    pool5/drop_s1 ← inception_5b/output
    pool5/drop_s1 → pool5/drop_s1
    Check failed: status == CUDNN_STATUS_SUCCESS (8 vs. 0) CUDNN_STATUS_EXECUTION_FAILED, device 0

    I am using a 2070 super

    Server: 9dca63a42e15
    DIGITS version: 6.1.1
    Caffe version: 0.17.0
    Caffe flavor: NVIDIA

    My brain is soup at this point, please help me out. caffe_output.log

    I have not been able to create a single model yet.

    3f542d1f6aa28d3568d8dcf4a11558753180c8ff

    I am also unable to install DIGITS from source without crashing Ubuntu. Today is May 11 and I have been trying to get it to work since the 7th. Please could you help me out? I am really excited about this tool.

    opened by cespedesk 0
Releases(v6.1.1)
  • v6.1.1(Apr 10, 2018)

    Since 6.1.0

    Bugfixes

    • Update for new TF API (#2014)
    • Update CI scripts to add some new deps to Caffe build (#1993)
    • Update import and API for pydicom 1.0
    • Fix label distribution and its view page (#1916)
    Source code(tar.gz)
    Source code(zip)
  • v6.1.0(Dec 12, 2017)

    Since 6.0

    New Features

    • Added functionality to integrate DIGITS with S3 Endpoints (#1868)
    • Added publish to inference server on classification workflow (#1906)

    Bugfixes

    • Fix frozen graph issue (#1907)
    • Fix 404 error for /datasets/inference-form/... from #1888 (#1889)
    • Remove timeout assertion (#1859)

    Changes

    • Various documentation updates

    Known Issues

    • Out of memory error in the semantic-segmentation example when training the FCN AlexNet model on Tesla P100.
  • v6.0.0(Aug 30, 2017)

    See release notes for the 6.0 release candidate.

    Since 6.0 RC1

    New Features

    • Added support for URL prefix (#1803)

    Bugfixes

    • Fixed loading/saving tensorflow models (#1794)

    Changes

    • Various documentation updates

    Known Issues

    • Visualization for Caffe models does not currently work. (#1738)
  • v6.0.0-rc.1(Jul 25, 2017)

    New Features

    • Added TensorFlow backend for DIGITS as an alternative to Caffe and Torch (#1714)
    • Added examples and support for GANs (#1714)
    • Added support for text classification (#1025)
    • Added more viewing options for image segmentation (#1188)

    Changes

    • HTML embedding now defaults to PNG (#1270)
    • Images that cause exceptions will now show the file name (#1636)

    Bugfixes

    • Fixed softmax visualization issue with scaled images (#1647)
    • Documentation was changed for model store with official pictures (#1650)
    • Fixed Caffe search path in Windows (#1244)
    • Fixed image file entry in Sunnybrook inference form (#1237)
    • Fixed bugs when visiting nested image folder (#1477)

    Known Issues

    • Visualization for Caffe models does not currently work. (#1738)
  • v5.0.0(Feb 2, 2017)

    See release notes for the 5.0 release candidate.

    New since 5.0 RC

    • Enable the DIGITS Model Store (https://github.com/NVIDIA/DIGITS/pull/1308)
    • Fix calculations related to batch accumulation for Caffe (https://github.com/NVIDIA/DIGITS/pull/1307)
    • Various documentation updates
  • v5.0.0-rc.1(Oct 15, 2016)

    279 commits since v4.0.0

    New Features

    • Import pretrained models from a model "store" (#896, #1077, #1161)
    • Support for image segmentation workflows (#830, #961, #1131)
    • Online data augmentation with Torch (#777)
    • Show CPU and system memory utilization during training (#800)
    • Improved bounding-box visualizations for object detection models (#869)
    • Create groups of jobs for easier display on the home page (#734)
    • Reuse data extensions for inference (#1024)
    • Support for plugin extensions (#1093, #927, #947)
    • Add documentation for the REST API (#964)

    Changes

    • Use environment variables for configuration instead of a file (#1091)
    • Remove digits-server and dependency on gunicorn (#1127)
    • digits-devserver is now just a small shell script instead of a Python script (#1121)
    • New design for Torch multi-GPU training (#828)
    • Add Ubuntu 16.04 support by updating dependency versions (#965)
    • Allow testing of only Caffe or only Torch with the testsuite (#1143)
    • Return more info when downloading a model tarball or json (#891)

    Bugfixes

    • Fix bug with Torch and CUDA_VISIBLE_DEVICES (#1130)
    • Fix issues with browsers returning incorrectly cached css and js files (#904)

    Known Issues

    • Training goes on longer than required when using batch accumulation (#1240)
  • v4.0.0(Jul 19, 2016)

    529 commits since v3.0.0

    New Features

    • Add support for object-detection networks like DetectNet (#735) with documentation (#803)
    • Parameter sweep over batch size and learning rate (#708)
    • Show accuracy confusion matrix for "Classify Many" (#608)
    • Test a model with an LMDB (#638)
    • Add basic login functionality (#463)

    Changes

    • Major revamp of home page (#728, #790)
    • Allow use of BVLC/caffe (#769)
    • Run inference jobs in separate processes (#573)

    Bugfixes

    • Made device_query compatible with CUDA 8.0 (#890)

    For more information, see the release notes for v3.1, v3.2, v3.3, and the 4.0 RC.

  • v4.0.0-rc.2(Jul 19, 2016)

    211 commits since v3.3.0

    New Features

    • Add support for object-detection networks like DetectNet (#735) with documentation (#803)
    • Parameter sweep over batch size and learning rate (#708)
    • Add plugin systems for data formats (#731) and inference visualizations (#756)
    • Expose Caffe's iter_size solver option (#744)
    • Add syntax highlighting when editing custom networks (#751)
    • View list of related jobs (#767)
    • Explore generic datasets (#822)
    • Add example for doing text classification with Torch (#684)

    Changes

    • Major revamp of home page (#728, #790)
    • Allow use of BVLC/caffe (#769)
    • New Torch multi-GPU programming model (#732)
    • Make small improvements to standard networks (#733, #749)
    • Set weight_decay to lr / 100 (#792)
    • Make major improvements to TravisCI build system (#766, #788)
  • v3.3.0(Apr 25, 2016)

    New Features

    • Show accuracy confusion matrix for "Classify Many" (#608)
    • Test a model with an LMDB (#638)
    • Use layer stages in network descriptions for full control over train/val/deploy networks (#628)
    • Option to limit number of images to use for "Classify/Test Many" (#592)
    • Better in-app documentation for Python layers (#651)

    Changes

    • Run inference jobs in separate processes (#573)
    • Path autocompletion returns sorted list (#621)

    Bugfixes

    • Fixed UI bugs when using Safari (#702)
    • Fixed file serving for files with absolute paths (#586)
    • Fixed some UI bugs related to permissions (#594, #596)
    • Various torch-related bugfixes (#661, #663, #681, #686, #699)
    • Windows compatibility fixes (#698)
  • v3.2.0(Feb 18, 2016)

    New Features

    • Add support for new solvers - RMSprop, AdaDelta and Adam (#564)
    • AlexNet for Torch now works for multiple GPUs (#539)
    • New documentation for installing CUDA toolkit, drivers, etc. (#558)

    Changes

    • Only look in one location for config files (#541)
    • Re-use weights when retraining a model on the same dataset (#538)
    • Functional improvements and documentation changes for examples/classification (#559, #557, #579, #582)
    • Better error-checking for caffe networks referencing invalid layer "bottoms" (#576)

    Bugfixes

    • Fixes for multistep learning rate (#549, #550)
  • v3.1.0(Jan 22, 2016)

    New Features

    • Enable multi-GPU for Torch (#480)
    • Add basic login functionality (#463)
    • Allow Torch to fine-tune pretrained models (#499)
    • Allow Caffe to fine-tune from multiple pretrained models (#498)
    • New tutorials
      • Fine-tuning (#500)
      • Siamese networks (#453)
      • Weight initialization (#522)
    • Allow optional specification of image folder during multiple inference (#526)

    Changes

    • Torch performance improvements (#368, #390, #441, #339)
    • Disable colormap for "Top N" feature (#481)
    • Better real-time updates for dataset creation (#473)
    • Better display for device_query tool (#497)
    • Display the job directory for all job types (#469)
    • Use Flask "Blueprints" to cleanup routing code (#507)
    • Cleanup and alphabetize imports throughout the project (#501)
    • Removed docs/API.md and docs/FlaskRoutes.md (a05356ebfe0fe462f20143625ec8c942847348de)

    Bugfixes

    • Enable importing of LMDBs created with Caffe's convert_imageset tool (#517)
  • v3.0.0(Jan 22, 2016)

    See release notes for v3.0 RC.

    New since 3.0 RC

    • Fix handling of unencoded LMDBs in Torch (#475)
    • Significant performance enhancement for creating datasets (#491)
    • Various documentation fixes / updates
  • v3.0.0-rc.3(Dec 10, 2015)

    New Features

    • Add Torch7 as an alternative backend to Caffe (#324, #345)
    • Make using python layers easier by [optionally] attaching a python file to each model (#329)
    • Add the ability to clone previous jobs with a click (#334)
    • Update the homepage to show job updates in real-time (#240)
    • Enable mean subtraction by subtracting the mean file as well as subtracting the mean pixel (#321)
    • Support NVcaffe v0.14 (#341, #336)
    • Display the job directory size for each DatasetJob and ModelJob (#309)
    • Add a backend badge (LMDB/HDF5) to DatasetJobs on the homepage (#323)
    • Explore images in LMDB datasets (#331)

    Changes

    • Use port 34448 for the digits-server instead of port 8080 (#392)
    • Remove digits-walkthrough (#352)
    • Enforce standard UI for file input fields across different browsers (#325)

    Bugfixes

    • Fix PicklingError issues on all platforms (#307)
    • Fix issue when running inference on many images at once (#361)

    Known Issues

    • Large inference requests (i.e. "Classify many") may cause timeouts or even crashes (#479)
    • Incorrect handling of unencoded LMDB in Torch wrapper (#477)
  • v2.2.1(Sep 17, 2015)

  • v2.2.0(Sep 16, 2015)

    New Features

    • Add [initial] support for HDF5 datasets (#226)
    • Zoom in on weight/activation visualizations (#267)
    • Add a new page for comparing training results (#195)
    • Add notes to jobs (#283)

    Changes

    • Open inference results in a new browser tab (#244)
    • Various improvements for using prebuilt LMDBs (#268)
    • Sort subfolders when parsing a folder of images (#296)
    • Use input_shape instead of input_dim for deploy network prototxt (#231)

    Known Issues

    • Using a snapshot from a previous network doesn't work unless the network is on the first page (#285)
    • Parameter counting fails for some layer types (like PReLU) (#317)
  • v2.1.0(Sep 14, 2015)

    New Features

    • Add support for "Generic Inference" (i.e. non-classification) networks (#189)
    • Display number of learned parameters in a model (#221)
    • Show ground truth in "Classify Many" if provided (#110)
    • Zoom in on a selection of the loss/accuracy graph (#113)
    • Add autocomplete for server-side path input fields (#183)
    • Select max/min images per class when parsing a folder of images (#161)
    • Allow user to download log from CreateDb tasks (#221)
    • Show number of available GPUs on home page (#207)
    • Allow local file upload for image lists (#106)
    • Display DIGITS version in top right of page header (#153) and in the console output (c181797cdf3ce27bf65a22fd39fbc61b95ecaab6)

    Changes

    • Double the LMDB map_size when running out of memory instead of setting to 1TB (#209)
      • requires py-lmdb 0.87
    • Rename default GoogLeNet layers and tops (9ff246eed47ec04461956b133495260855168e2e)
    • Add pagination to Previous Networks list (c181797cdf3ce27bf65a22fd39fbc61b95ecaab6)
    • Various changes that help with Windows compatibility (#199)
    • Major refactoring of tests (#192)
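
The map_size change above (#209) amounts to a grow-on-demand policy: start with a modest map and double it each time a write reports that the map is full. The sketch below illustrates that policy only. `FakeEnv` and the local `MapFullError` are hypothetical stand-ins so the example runs without py-lmdb installed; with the real library the corresponding pieces would be `lmdb.MapFullError` and `Environment.set_mapsize()`, and a failed write transaction would have to be retried as a whole.

```python
class MapFullError(Exception):
    """Stand-in for lmdb.MapFullError (assumption: keeps the sketch dependency-free)."""


class FakeEnv:
    """Hypothetical in-memory store with a byte budget that mimics an LMDB map_size."""

    def __init__(self, map_size):
        self.map_size = map_size
        self.data = {}

    def used(self):
        return sum(len(k) + len(v) for k, v in self.data.items())

    def put(self, key, value):
        # Refuse the write if it would exceed the current budget.
        if self.used() + len(key) + len(value) > self.map_size:
            raise MapFullError(key)
        self.data[key] = value

    def set_mapsize(self, new_size):
        self.map_size = new_size


def write_all(env, items):
    """Write every (key, value) pair, doubling map_size whenever the map fills up."""
    for key, value in items:
        while True:
            try:
                env.put(key, value)
                break
            except MapFullError:
                # Double instead of jumping straight to a huge fixed size.
                env.set_mapsize(env.map_size * 2)
    return env.map_size


env = FakeEnv(map_size=16)
items = [(("k%d" % i).encode(), b"x" * 10) for i in range(4)]  # 12 bytes per record
final_size = write_all(env, items)
```

Doubling keeps the number of resizes logarithmic in the final dataset size while avoiding a 1TB reservation up front.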

    Known Issues

    • Parameter counting fails for some layer types (like PReLU) (#317)
  • v2.0.0(Sep 3, 2015)

    New Features

    • Enabled support for multi-GPU Caffe (#92)
      • Select multiple and/or specific GPUs for training (#92, #104)
    • Created new routes for JSON REST API (#134, #136)
    • Started using GPU for inference (#66)
    • Added NVML info about GPU memory/utilization (#93)
    • Enabled ADAGRAD and NESTEROV as alternative solver types (@drozdvadym in #102)
    • Added scripts to download standard datasets MNIST and CIFAR
    • Added option to set server name (#111)
    • Added support for PPM images (#123)
    • Enabled path autocompletion while setting values in the configuration (#96)

    Changes

    • Added a python classification example (#147)
    • Subtract mean pixel during training (#169)
    • Added TravisCI integration to run tests (#28)
    • Added Coveralls integration for test coverage
    • Added Landscape integration to inspect code
    • Added auto-generated documentation of the webapp’s HTTP routes
    • Switched to loading config files from new, more logical locations (#96)
    • Started suppressing most of Caffe’s raw output (b382e99b8a143c9bbbf659ba74e67bf2ef12718e, 019bc6ca750601396a502ad0fd2b0d47b239f0d7)
    • Added a CLA

    Bugfixes

    • Fixed various OSX platform-specific issues (#32, @trivedigaurav in #94)

    Known Issues

    • Some motherboards cause P2P bandwidth issues (https://github.com/NVIDIA/caffe/issues/10)
  • v2.0.0-rc3(Jul 31, 2015)

    See release notes for v2.0.0-preview.

    New since 2.0 Preview

    • Recommend NVIDIA/Caffe v0.13 (https://github.com/NVIDIA/DIGITS/commit/5dc0f8e646d28587c07ff6fe9bcd1990820b41c2)
      • Requires cuDNN v3
    • Subtract mean pixel during training (#169)
    • Fixes regarding deployment of digits-server (c9a9dce2fcf7bb12363e6cccc44a6dd0a26a8271, e7bbc63213a10bbea516ee51adc5ffcf160494e8)
  • v2.0.0-preview(Jul 7, 2015)

    New Features

    • Enabled support for multi-GPU Caffe (#92)
      • Select multiple and/or specific GPUs for training (#92, #104)
    • Created new routes for JSON REST API (#134, #136)
    • Started using GPU for inference (#66)
    • Added NVML info about GPU memory/utilization (#93)
    • Enabled ADAGRAD and NESTEROV as alternative solver types (@drozdvadym in #102)
    • Added scripts to download standard datasets MNIST and CIFAR
    • Added option to set server name (#111)
    • Added support for PPM images (#123)
    • Enabled path autocompletion while setting values in the configuration (#96)

    Changes

    Bugfixes

    • Fixed various OSX platform-specific issues (#32, @trivedigaurav in #94)

    Known Issues

    • Some motherboards cause P2P bandwidth issues (https://github.com/NVIDIA/caffe/issues/10)
  • v1.1.2(Jun 26, 2015)

  • v1.1.0(Apr 24, 2015)

    New Features

    • Add GoogLeNet as a default network (#11)
    • "Classify Many Images" shows classification results of many images at once (#61)
    • Show statistics (mean, standard deviation, histogram of values) for each layer of the network at inference time (#67)
    • Allow saving images in database with PNG encoding (#73)
    • Optionally turn off shuffling when creating a dataset (#72)
    • Optionally provide a random seed to caffe (73fe257)

    Changes

    • Upgrade to NVIDIA/caffe version 0.11.0 (e2bcb27)
    • Update pip requirements list to match packages available on Ubuntu 14.04 where possible (4162db4, 133213d)
    • Use C3.js instead of Google Charts to enable DIGITS to run without an internet connection (#34)
    • Change default image resize mode from HALF_CROP to SQUASH (b4f3261)

    Bugfixes

    • Save images in BGR order instead of RGB because caffe uses OpenCV to read encoded images (#59)
    • Scale the LeNet standard network by the standard deviation of MNIST (~80) during train, val and test phases (5a38aa5, 23c1a78)
    • Use a white background when removing transparency from images (#85)
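
Two of the fixes above are simple per-pixel operations: flattening transparency onto a white background (#85) and storing channels in BGR rather than RGB order (#59). The following is a minimal sketch of both on single pixels, not the code DIGITS uses; the function names are hypothetical, and channels are assumed to be 8-bit integers in the range 0 to 255.

```python
def flatten_on_white(rgba):
    """Composite one RGBA pixel over a white background and drop the alpha channel."""
    r, g, b, a = rgba
    alpha = a / 255.0
    # Standard "over" compositing against white (255, 255, 255).
    return tuple(int(round(alpha * c + (1.0 - alpha) * 255)) for c in (r, g, b))


def rgb_to_bgr(pixel):
    """Reverse the channel order from RGB to BGR, the order OpenCV uses."""
    r, g, b = pixel
    return (b, g, r)


# A fully transparent pixel becomes white; an opaque pixel is left unchanged.
transparent = flatten_on_white((0, 0, 0, 0))
opaque = flatten_on_white((10, 20, 30, 255))
stored = rgb_to_bgr(opaque)
```

Compositing onto white rather than black means fully transparent regions end up as (255, 255, 255) instead of (0, 0, 0).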

    Known Issues

    • The GoogLeNet standard network is not behaving correctly when trained on the full ImageNet dataset (#82)
    • "Classify Many Images" may time out if too many images are uploaded and the server takes too long to respond (#70)