SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Overview

SLIDE

The SLIDE package contains the source code for reproducing the main experiments in this paper.

Dataset

The dataset can be downloaded at Amazon-670K. Note that the data is sorted by label, so please shuffle at least the validation/testing data.
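
Because the records are label-sorted, the evaluation files should be reshuffled before use. Below is a minimal illustrative sketch of such a shuffle, not part of the package; it assumes one record per line (if your copy of the data starts with a header line of counts, keep that line in place and shuffle only the rest):

// shuffle_lines.cpp (illustrative sketch): shuffle a label-sorted data file.
// Assumes one record per line; set any header line aside before shuffling.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main(int argc, char **argv) {
  if (argc != 3) {
    std::cerr << "usage: " << argv[0] << " <in_file> <out_file>\n";
    return 1;
  }
  // Read all records into memory.
  std::ifstream in(argv[1]);
  std::vector<std::string> lines;
  for (std::string line; std::getline(in, line); )
    lines.push_back(line);

  // Shuffle with a randomly seeded generator.
  std::mt19937 rng{std::random_device{}()};
  std::shuffle(lines.begin(), lines.end(), rng);

  // Write the shuffled records back out.
  std::ofstream out(argv[2]);
  for (const auto &l : lines)
    out << l << '\n';
  return 0;
}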

TensorFlow Baselines

We suggest getting the TensorFlow Docker image directly to install TensorFlow-GPU. For TensorFlow-CPU compiled with AVX2, we recommend using this precompiled build.

There is also a TensorFlow Docker image built specifically for CPUs with AVX-512 instructions; to get it, use:

docker pull clearlinux/stacks-dlrs_2-mkl    

config.py controls the TensorFlow training parameters, such as the learning rate. example_full_softmax.py and example_sampled_softmax.py are example files for the Amazon-670K dataset with full softmax and sampled softmax, respectively.

Build/Run on Intel platform

Prerequisites:

CMake >= 3.0
Intel Compiler (ICC) >= 19

Build with ICC compiler

source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh -arch intel64 -platform linux
cd /path/to/slide-root
mkdir -p bin && cd bin 
# BDW (AVX2)
cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc
# SKX/CLX (AVX512)
cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1
# CPX (AVX512 + BF16)
cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1 -DOPT_AVX512_BF16=1
make -j

Run on Intel SKX/CLX/CPX

cd bin
OMP_NUM_THREADS=<N> KMP_HW_SUBSET=<s>s,<c>c,<t>t KMP_AFFINITY=compact,granularity=fine KMP_BLOCKTIME=200 ./runme ../SLIDE/Config_amz.csv

Here KMP_HW_SUBSET selects <s> sockets, <c> cores per socket, and <t> threads per core, and OMP_NUM_THREADS should equal their product. For example, on a CLX 8280 (2 sockets x 28 cores x 2 threads per core = 112 logical cores):

OMP_NUM_THREADS=112 KMP_HW_SUBSET=2s,28c,2t KMP_AFFINITY=compact,granularity=fine KMP_BLOCKTIME=200 ./runme ../SLIDE/Config_amz.csv

For best performance, set Batchsize in SLIDE/Config_amz.csv to a multiple of the number of logical cores (e.g., a multiple of 112 in the example above).

Results can be checked in the log file under the dataset directory:

tail -f dataset/log.txt
Comments
  • Query regarding comparison of Cascade Lake & Cooper Lake performance

    Hello @nmeisburger, @uyongw & @iitkgpanshu,

    The MLSys '21 paper doesn't seem to mention how many cores (and hence, threads) were used on each machine to gather data, but based on the README file in this repo, it seems that the experiments were performed with a different number of cores (and hence, threads) on each machine.

    Besides the data reported in the paper, had you also compared performance (without BF16) on Cascade Lake and Cooper Lake using an equal number of cores on both?

    I'm curious whether you observed any improvement in AVX512 performance (besides BF16 support) in Cooper Lake over Cascade Lake, since Ice Lake SP (like Cooper Lake, also a 3rd-gen Xeon SP, but with 1 or 2 sockets and a 48 KB L1D cache on each core) reportedly has improvements in frequency behavior (less downclocking) when AVX512 instructions are used. GCP/AWS/Microsoft Azure don't offer Cooper Lake, so it's not possible for me to gauge its performance myself.

    Thank you!

    opened by imaginary-person 0
  • Why use log() to combine instead of log2()

    https://github.com/IntelLabs/SLIDE_opt_ia/blob/29e40b45d89d62d50bc4a86df5b804b0594ce514/SLIDE/LSH.cpp#L115

    Could you please explain the reason for changing log2() to log()? Given that each of the K hash functions has log2(binsize) bits, is it not correct to shift the hash values by that number of bits in order to combine them? Wouldn't log2() give a larger shift length, and hence increase the range of the hash function?
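
    For context, the combining scheme in question packs the K per-table hash values into a single bucket index by left-shifting each value by the hash bit-width. A minimal sketch of that scheme (illustrative only, not the repo's code; it assumes binsize is a power of two):

    // Illustrative only: pack K hash values, each in [0, binsize), into one
    // index by shifting log2(binsize) bits per value.
    #include <cmath>
    #include <cstdint>

    uint64_t combineHashes(const int *hashes, int K, int binsize) {
      const int bits = static_cast<int>(std::log2(binsize));  // bits per hash
      uint64_t index = 0;
      for (int i = 0; i < K; ++i)
        index = (index << bits) | static_cast<uint64_t>(hashes[i]);
      return index;  // range [0, binsize^K)
    }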

    opened by its-sandy 0
  • Problems during compilation

    Hi,

    My name is André, and I'm trying to run some experiments with AVX512 and deep learning. I read your paper and found it exciting, so I decided to test your optimized version of SLIDE, but I'm facing an error (below) during compilation of the code.

    [ 58%] Building CXX object CMakeFiles/SLIDE_LIB.dir/SLIDE/DataLayerOpt.cpp.o
    /opt/intel/parallel_studio_xe_2020/compilers_and_libraries_2020.2.254/linux/bin/intel64/icpc  -DOPT_AVX512=1 -DOPT_AVX512_BF16=1 -DOPT_IA=1 -I${HOMEDIR}/SLIDE/SLIDE_opt_ia/bin/ep/include  -qopenmp   -std=c++14 -O2 -DNDEBUG -std=gnu++14 -o CMakeFiles/SLIDE_LIB.dir/SLIDE/DataLayerOpt.cpp.o -c ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.cpp
    In file included from ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.cpp(5):
    ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.h(27): error: expected a ";"
        DataLayerOpt() numRecords_{0}, numFeatures_ {0}, numLabels_ {0} {};
                       ^
    
    compilation aborted for ${HOMEDIR}/SLIDE/SLIDE_opt_ia/SLIDE/DataLayerOpt.cpp (code 2)
    make[2]: ** [CMakeFiles/SLIDE_LIB.dir/SLIDE/DataLayerOpt.cpp.o] Erro 2
    

    The relevant portion of DataLayerOpt.h is as follows:

    ...
    22   // labels
    23   std::vector<int> labelOffsets_;
    24   std::vector<int> labelLengths_;
    25   std::vector<int> labels_;
    26
    27   DataLayerOpt() numRecords_{0}, numFeatures_ {0}, numLabels_ {0} {};
    28   void loadData(const std::string &srcFile);
    29
    30   inline int lengthByRecordIndex(size_t n) {
    31     return lengths_[n];
    32   }
    ...

    Any suggestion on how to solve this?
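
    For reference, the error points at line 27: C++ requires a colon to introduce a constructor's member initializer list. A sketch of the likely fix (illustrative only; the member types here are assumed, not taken from the repo):

    // Illustrative sketch: the member initializer list must be introduced
    // by ':' after the constructor's parameter list. Types are assumed.
    struct DataLayerOpt {
      size_t numRecords_, numFeatures_, numLabels_;

      // Line 27 as shipped omits the ':' and so fails to parse:
      //   DataLayerOpt() numRecords_{0}, numFeatures_ {0}, numLabels_ {0} {};
      DataLayerOpt() : numRecords_{0}, numFeatures_{0}, numLabels_{0} {}
    };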

    I used Intel PSXE 2020 and 2019, with both resulting in the same error. Below are the commands used to download and compile the code:

    git clone https://github.com/RUSH-LAB/SLIDE.git
    cd SLIDE
    git submodule init
    git submodule update
    
    cd SLIDE_opt_ia
    
    module load cmake/3.17.3 
    module load intel_psxe/2020 
    
    mkdir -p bin && cd bin 
    # SKX/CLX (AVX512)
    cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1
    # CPX (AVX512 + BF16)
    cmake .. -DCMAKE_CXX_COMPILER=icpc -DCMAKE_C_COMPILER=icc -DOPT_AVX512=1 -DOPT_AVX512_BF16=1
    
    VERBOSE=1 make
    

    Thank you for your attention and best regards.

    opened by arcarneiro 3
Owner

Intel Labs