SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Overview

SLIDE

The SLIDE package contains the source code for reproducing the main experiments in this paper.

Dataset

The dataset can be downloaded at Amazon-670K. Note that the data is sorted by label, so please shuffle at least the validation/testing data.
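
Because the records are label-sorted, the evaluation files should be reshuffled before use. Below is a minimal illustrative sketch of such a shuffle, not part of the package; it assumes one record per line (if your copy of the data starts with a header line of counts, keep that line in place and shuffle only the rest):

// shuffle_lines.cpp (illustrative sketch): shuffle a label-sorted data file.
// Assumes one record per line; set any header line aside before shuffling.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main(int argc, char **argv) {
  if (argc != 3) {
    std::cerr << "usage: " << argv[0] << " <in_file> <out_file>\n";
    return 1;
  }
  // Read all records into memory.
  std::ifstream in(argv[1]);
  std::vector<std::string> lines;
  for (std::string line; std::getline(in, line); )
    lines.push_back(line);

  // Shuffle with a randomly seeded generator.
  std::mt19937 rng{std::random_device{}()};
  std::shuffle(lines.begin(), lines.end(), rng);

  // Write the shuffled records back out.
  std::ofstream out(argv[2]);
  for (const auto &l : lines)
    out << l << '\n';
  return 0;
}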

TensorFlow Baselines

We suggest getting the TensorFlow Docker image directly to install TensorFlow-GPU. For TensorFlow-CPU compiled with AVX2, we recommend using this precompiled build.

There is also a TensorFlow Docker image built specifically for CPUs with AVX-512 instructions; to get it, use:

docker pull clearlinux/stacks-dlrs_2-mkl    

config.py controls the TensorFlow training parameters, such as the learning rate. example_full_softmax.py and example_sampled_softmax.py are example files for the Amazon-670K dataset with full softmax and sampled softmax, respectively.

Build/Run on Intel platform

Prerequisites:

CMake >= 3.0
Intel Compiler (ICC) >= 19

Build with ICC compiler

source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh -arch intel64 -platform linux
cd /path/to/slide-root
mkdir -p bin && cd bin 
# BDW (AVX2)
cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc
# SKX/CLX (AVX512)
cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1
# CPX (AVX512 + BF16)
cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1 -DOPT_AVX512_BF16=1
make -j

Run on Intel SKX/CLX/CPX

cd bin
OMP_NUM_THREADS=<N> KMP_HW_SUBSET=<s>s,<c>c,<t>t KMP_AFFINITY=compact,granularity=fine KMP_BLOCKTIME=200 ./runme ../SLIDE/Config_amz.csv

Here KMP_HW_SUBSET selects <s> sockets, <c> cores per socket, and <t> threads per core, and OMP_NUM_THREADS should equal their product. For example, on a CLX 8280 (2 sockets x 28 cores x 2 threads per core = 112 logical cores):

OMP_NUM_THREADS=112 KMP_HW_SUBSET=2s,28c,2t KMP_AFFINITY=compact,granularity=fine KMP_BLOCKTIME=200 ./runme ../SLIDE/Config_amz.csv

For best performance, set Batchsize in SLIDE/Config_amz.csv to a multiple of the number of logical cores (e.g., a multiple of 112 in the example above).

Results can be checked in the log file under the dataset directory:

tail -f dataset/log.txt
Comments
  • Query regarding comparison of Cascade Lake & Cooper Lake performance

    Hello @nmeisburger, @uyongw & @iitkgpanshu,

    The MLSys '21 paper doesn't seem to mention how many cores (and hence, threads) were used on each machine to gather data, but based on the README file in this repo, it seems that the experiments were performed with a different number of cores (and hence, threads) on each machine.

    Besides the data reported in the paper, had you also compared performance (without BF16) on Cascade Lake and Cooper Lake using an equal number of cores on both?

    I'm curious whether you observed any improvement in AVX512 performance (besides BF16 support) in Cooper Lake over Cascade Lake, since Ice Lake SP (like Cooper Lake, also a 3rd-gen Xeon SP, but with 1 or 2 sockets and a 48 KB L1D cache on each core) reportedly has improvements in frequency behavior (less downclocking) when AVX512 instructions are used. GCP/AWS/Microsoft Azure don't offer Cooper Lake, so it's not possible for me to gauge its performance myself.

    Thank you!

    opened by imaginary-person 0
  • Why use log() to combine instead of log2()

    https://github.com/IntelLabs/SLIDE_opt_ia/blob/29e40b45d89d62d50bc4a86df5b804b0594ce514/SLIDE/LSH.cpp#L115

    Could you please explain the reason for changing log2() to log()? Given that each of the K hash functions has log2(binsize) bits, is it not correct to shift the hash values by that number of bits in order to combine them? Wouldn't log2() give a larger shift length, and hence increase the range of the hash function?
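
    For context, the combining scheme in question packs the K per-table hash values into a single bucket index by left-shifting each value by the hash bit-width. A minimal sketch of that scheme (illustrative only, not the repo's code; it assumes binsize is a power of two):

    // Illustrative only: pack K hash values, each in [0, binsize), into one
    // index by shifting log2(binsize) bits per value.
    #include <cmath>
    #include <cstdint>

    uint64_t combineHashes(const int *hashes, int K, int binsize) {
      const int bits = static_cast<int>(std::log2(binsize));  // bits per hash
      uint64_t index = 0;
      for (int i = 0; i < K; ++i)
        index = (index << bits) | static_cast<uint64_t>(hashes[i]);
      return index;  // range [0, binsize^K)
    }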

    opened by its-sandy 0
  • Problems during compilation

    Hi,

    My name is André, and I'm trying to run some experiments with AVX512 and deep learning. I read your paper and found it exciting, so I decided to test your optimized version of SLIDE, but I'm facing an error (below) during compilation of the code.

    [ 58%] Building CXX object CMakeFiles/SLIDE_LIB.dir/SLIDE/DataLayerOpt.cpp.o
    /opt/intel/parallel_studio_xe_2020/compilers_and_libraries_2020.2.254/linux/bin/intel64/icpc  -DOPT_AVX512=1 -DOPT_AVX512_BF16=1 -DOPT_IA=1 -I${HOMEDIR}/SLIDE/SLIDE_opt_ia/bin/ep/include  -qopenmp   -std=c++14 -O2 -DNDEBUG -std=gnu++14 -o CMakeFiles/SLIDE_LIB.dir/SLIDE/DataLayerOpt.cpp.o -c ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.cpp
    In file included from ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.cpp(5):
    ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.h(27): error: expected a ";"
        DataLayerOpt() numRecords_{0}, numFeatures_ {0}, numLabels_ {0} {};
                       ^
    
    compilation aborted for ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.cpp (code 2)
    make[2]: ** [CMakeFiles/SLIDE_LIB.dir/SLIDE/DataLayerOpt.cpp.o] Erro 2
    

    The relevant portion of DataLayerOpt.h is as follows:

    ...
    22   // labels
    23   std::vector<int> labelOffsets_;
    24   std::vector<int> labelLengths_;
    25   std::vector<int> labels_;
    26
    27   DataLayerOpt() numRecords_{0}, numFeatures_ {0}, numLabels_ {0} {};
    28   void loadData(const std::string &srcFile);
    29
    30   inline int lengthByRecordIndex(size_t n) {
    31     return lengths_[n];
    32   }
    ...

    Any suggestion on how to solve this?
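
    For reference, the error points at line 27: C++ requires a colon to introduce a constructor's member initializer list. A sketch of the likely fix (illustrative only; the member types here are assumed, not taken from the repo):

    // Illustrative sketch: the member initializer list must be introduced
    // by ':' after the constructor's parameter list. Types are assumed.
    struct DataLayerOpt {
      size_t numRecords_, numFeatures_, numLabels_;

      // Line 27 as shipped omits the ':' and so fails to parse:
      //   DataLayerOpt() numRecords_{0}, numFeatures_ {0}, numLabels_ {0} {};
      DataLayerOpt() : numRecords_{0}, numFeatures_{0}, numLabels_{0} {}
    };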

    I used Intel PSXE 2020 and 2019, with both resulting in the same error. Below are the commands used to download and compile the code:

    git clone https://github.com/RUSH-LAB/SLIDE.git
    cd SLIDE
    git submodule init
    git submodule update
    
    cd SLIDE_opt_ia
    
    module load cmake/3.17.3 
    module load intel_psxe/2020 
    
    mkdir -p bin && cd bin 
    # SKX/CLX (AVX512)
    cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1
    # CPX (AVX512 + BF16)
    cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1 -DOPT_AVX512_BF16=1
    
    VERBOSE=1 make
    

    Thank you for your attention and best regards.

    opened by arcarneiro 3
Owner

Intel Labs