GraphLily: A Graph Linear Algebra Overlay on HBM-Equipped FPGAs

Overview

GraphLily is the first FPGA overlay for graph processing. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra kernels. GraphLily effectively utilizes the high bandwidth of HBM to accelerate SpMV and SpMSpV, the two widely-used kernels in GraphBLAS, by co-designing the data layout and the accelerator architecture. GraphLily further builds a middleware to provide runtime support, enabling users to easily port existing GraphBLAS programs from CPUs/GPUs to FPGAs.
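To illustrate the GraphBLAS-style formulation that GraphLily accelerates, here is a minimal software sketch of level-synchronous BFS expressed as repeated SpMV in SciPy. The toy graph and the `bfs_levels` helper are ours for illustration; they are not part of GraphLily's API.

```python
import numpy as np
import scipy.sparse as sp

# Toy directed graph with edges 0->1, 0->2, 1->3, 2->3, stored in CSR
A = sp.csr_matrix(
    (np.ones(4, dtype=np.float32), ([0, 0, 1, 2], [1, 2, 3, 3])),
    shape=(4, 4),
)

def bfs_levels(A, src):
    """Level-synchronous BFS as repeated SpMV: each iteration multiplies
    the transposed adjacency matrix by the current frontier vector."""
    n = A.shape[0]
    levels = np.full(n, -1, dtype=np.int64)   # -1 means unvisited
    frontier = np.zeros(n, dtype=np.float32)
    frontier[src] = 1.0
    levels[src] = 0
    level = 0
    while frontier.any():
        level += 1
        # SpMV: candidate next frontier = vertices reachable from the frontier
        nxt = (A.T @ frontier) > 0
        nxt &= levels < 0                     # mask out visited vertices
        levels[nxt] = level
        frontier = nxt.astype(np.float32)
    return levels

print(bfs_levels(A, 0))  # [0 1 1 2]
```

A hardware implementation replaces the masked multiply with a Boolean-like semiring SpMV, but the iteration structure is the same.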

For more information, refer to our ICCAD'21 paper.

@inproceedings{hu2021graphlily,
  title={GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs},
  author={Hu, Yuwei and Du, Yixiao and Ustun, Ecenur and Zhang, Zhiru},
  booktitle={International Conference on Computer-Aided Design (ICCAD)},
  year={2021}
}

Prerequisites

  • Platform: Xilinx Alveo U280
  • Tool: Xilinx Vitis 2019.2

Run Benchmarking

Clone the repo

git clone git@github.com:cornell-zhang/GraphLily.git
export GRAPHLILY_ROOT_PATH=/path/to/GraphLily

Get the bitstream

  • A pre-compiled bitstream (166 MHz) is provided here.
  • To generate a new bitstream:
cd GraphLily/generate_bitstream
make synthesize

Prepare datasets

The input is an adjacency matrix in CSR format stored as a SciPy `.npz` file. Please install cnpy, which is required for data loading.
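As an example, a small adjacency matrix can be built in CSR form and saved to `.npz` with SciPy as sketched below. The graph and file name are illustrative; check GraphLily's data-loading code for the exact array names it expects.

```python
import numpy as np
import scipy.sparse as sp

# Build a tiny adjacency matrix in CSR format (edges 0->1, 0->2, 1->3, 2->3)
row = np.array([0, 0, 1, 2])
col = np.array([1, 2, 3, 3])
data = np.ones(len(row), dtype=np.float32)
A = sp.csr_matrix((data, (row, col)), shape=(4, 4))

# Store as an npz file; scipy writes the CSR arrays (data, indices, indptr,
# shape) as named entries, which cnpy can read back on the C++ side
sp.save_npz("toy_graph.npz", A)
```

The same script pattern applies to real datasets: load the graph, convert to CSR, and dump with `save_npz`.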

Our ICCAD'21 paper evaluated six graph datasets; see the paper for the list.

Run

Go to the GraphLily/benchmark folder, modify the cnpy path in Makefile, modify the bitstream path and the datasets path in run_bfs.sh, then:

bash run_bfs.sh
Comments
  • Deadlock happens after scaling SpMSpV to 8 HBM channels

    After fusing the three interfaces of one matrix partition (partptr, indptr, packets) into a single interface (commit: 1409b0d4085d37d975e2a64ebb1afbdbadaeb682), we are now able to scale SpMSpV to up to 32 channels.

    Deadlock happens when I benchmark an 8-channel SpMSpV. On uniform_10K_10, it passed the first iteration and ran into deadlock after several iterations; on ogbl_ppa_576K_42M, it ran into deadlock at the first iteration. With hardware emulation, there is no deadlock. I tried increasing the FIFO depth, but deadlock still happens.

    @yxd97 Please take a look at this issue later when you have time. It might be related to the shuffler. Also check the data loader and stream merger.

    opened by Huyuwei 3
  • Reproduce 165 MHz

    To reproduce the 165 MHz design in our paper, this PR makes four changes:

    1. Use a 3-D output buffer for SpMSpV instead of 2-D
    2. Set the latency of both URAM and BRAM to 4
    3. Use interleaving (not clear why it helps with timing)
    4. Set the target frequency to 200 MHz
    opened by Huyuwei 0
  • remove results_.resize in SpMSpVModule::send_results_device_to_host

    An error occurs when we call BFS::pull_push multiple times on the same dataset with different source vertices. This is due to the results_.resize function call in SpMSpVModule::send_results_device_to_host. Specifically, the following piece of code causes wrong results:

    // first call: results_ is shrunk after the device-to-host copy
    this->command_queue_.enqueueMigrateMemObjects({this->results_buf}, CL_MIGRATE_MEM_OBJECT_HOST);
    this->command_queue_.finish();
    this->results_.resize(a_small_number);
    return this->results_;
    
    // second call on the same dataset with a different source vertex
    this->command_queue_.enqueueMigrateMemObjects({this->results_buf}, CL_MIGRATE_MEM_OBJECT_HOST);
    this->command_queue_.finish();
    this->results_.resize(a_large_number);
    return this->results_;  // wrong: results_ was resized to a small number by the first call
    

    This PR fixes the bug by removing results_.resize, which turns out to be unnecessary. I don't remember why we added it; perhaps it was meant to make clear that the results vector is sparse, with a size determined at run time.

    @zzczzc20 Zhongchun, please check if this PR solves the bug you encountered and then merge it.

    opened by Huyuwei 0
Owner
Cornell Zhang Research Group