10x faster matrix and vector operations

Last update: Jan 9, 2023

Related tags

Overview

Bolt is an algorithm for compressing vectors of real-valued data and running mathematical operations directly on the compressed representations.

If you have a large collection of mostly-dense vectors and can tolerate lossy compression, Bolt can probably save you 10-200x space and compute time.

Bolt also has theoretical guarantees bounding the errors in its approximations.

EDIT: this repo now also features the source code for MADDNESS, our shiny new algorithm for approximate matrix multiplication. MADDNESS has no Python wrapper yet, and is referred to as "mithral" in the source code. Name changed because apparently I'm the only who gets Lord of the Rings references. MADDNESS runs ridiculously fast and, under reasonable assumptions, requires zero multiply-adds. Realistically, it'll be most useful for speeding up neural net inference on CPUs, but it'll take another couple papers to get it there; we need to generalize it to convolution and write the CUDA kernels to allow GPU training.

NOTE: All below code refers to the Python wrapper for Bolt and has nothing to do with MADDNESS. It also seems to be no longer building for many people. If you want to use MADDNESS, see the Python Implementation driven by amm_main.py or C++ implementation. All code is ugly, but Python code should be pretty easy to add new AMM methods/variations to.

Installing

Python

  $ brew install swig  # for wrapping C++; use apt-get, yum, etc, if not OS X
  $ pip install numpy  # bolt installation needs numpy already present
  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt && python setup.py install
  $ pytest tests/  # optionally, run the tests

If you run into any problems, please don't hesitate to mention it in the Python build problems issue.

C++

Install Bazel, Google's open-source build system. Then

  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt/cpp && bazel run :main

The bazel run command will build the project and run the tests and benchmarks.

If you want to integrate Bolt with another C++ project, include cpp/src/include/public.hpp and add the remaining files under cpp/src to your builds. You should let me know if you're interested in doing such an integration because I'm hoping to see Bolt become part of many libraries and thus would be happy to help you.

Notes

Bolt currently only supports machines with AVX2 instructions, which basically means x86 machines from fall 2013 or later. Contributions for ARM support are welcome. Also note that the Bolt Python wrapper is currently configured to require Clang, since GCC apparently runs into issues.

How does it work?

Bolt is based on vector quantization. For details, see the Bolt paper or slides.

Benchmarks

Bolt includes a thorough set of speed and accuracy benchmarks. See the experiments/ directory. This is also what you want if you want to reproduce the results in the paper.

Note that all of the timing results use the raw C++ implementation. At present, the Python wrapper is slightly slower due to Python overhead. If you're interested in having a full-speed wrapper, let me know and I'll allocate time to making this happen.

Basic usage

X, queries = some N x D array, some iterable of length D arrays

# these are approximately equal (though the latter are shifted and scaled)
enc = bolt.Encoder(reduction='dot').fit(X)
[np.dot(X, q) for q in queries]
[enc.transform(q) for q in queries]

# same for these
enc = bolt.Encoder(reduction='l2').fit(X)
[np.sum((X - q) * (X - q), axis=1) for q in queries]
[enc.transform(q) for q in queries]

# but enc.transform() is 10x faster or more

Example: Matrix-vector multiplies

import bolt
import numpy as np
from scipy.stats import pearsonr as corr
from sklearn.datasets import load_digits
import timeit

# for simplicity, use the sklearn digits dataset; we'll split
# it into a matrix X and a set of queries Q
X, _ = load_digits(return_X_y=True)
nqueries = 20
X, Q = X[:-nqueries], X[-nqueries:]

enc = bolt.Encoder(reduction='dot', accuracy='lowest') # can tweak acc vs speed
enc.fit(X)

dot_corrs = np.empty(nqueries)
for i, q in enumerate(Q):
    dots_true = np.dot(X, q)
    dots_bolt = enc.transform(q)
    dot_corrs[i] = corr(dots_true, dots_bolt)[0]

# dot products closely preserved despite compression
print "dot product correlation: {} +/- {}".format(
    np.mean(dot_corrs), np.std(dot_corrs))  # > .97

# massive space savings
print(X.nbytes)  # 1777 rows * 64 cols * 8B = 909KB
print(enc.nbytes)  # 1777 * 2B = 3.55KB

# massive time savings (~10x here, but often >100x on larger
# datasets with less Python overhead; see the paper)
t_np = timeit.Timer(
    lambda: [np.dot(X, q) for q in Q]).timeit(5)        # ~9ms
t_bolt = timeit.Timer(
    lambda: [enc.transform(q) for q in Q]).timeit(5)    # ~800us
print "Numpy / BLAS time, Bolt time: {:.3f}ms, {:.3f}ms".format(
    t_np * 1000, t_bolt * 1000)

# can get output without offset/scaling if needed
dots_bolt = [enc.transform(q, unquantize=True) for q in Q]

Example: K-Nearest Neighbor / Maximum Inner Product Search

# search using squared Euclidean distances
# (still using the Digits dataset from above)
enc = bolt.Encoder('l2', accuracy='high').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

# search using dot product (maximum inner product search)
enc = bolt.Encoder('dot', accuracy='medium').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

Miscellaneous

Bolt stands for "Based On Lookup Tables". Feel free to use this exciting fact at parties.

Comments

Python unit test is broken?

OS: MacOS Monterey 12.5 (Intel chip) Python: 3.10.5


❯ pytest tests
============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/xiao/development/github.com/XiaoConstantine/bolt-1
collected 4 items

tests/test_encoder.py ..F.                                                                                                                                                 [100%]

==================================================================================== FAILURES ====================================================================================
________________________________________________________________________________ test_unquantize _________________________________________________________________________________

    def test_unquantize():
        X, Q = _load_digits_X_Q(nqueries=20)
>       enc = bolt.Encoder('dot', accuracy='high').fit(X)

tests/test_encoder.py:151:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:466: in fit
    centroids = _learn_centroids(X, ncentroids=ncentroids,
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:142: in _learn_centroids
    centroids, labels = kmeans(X_in, ncentroids)
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:106: in kmeans
    seeds = kmc2.kmc2(X, k).astype(np.float32)
kmc2.pyx:97: in kmc2.kmc2
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: probabilities contain NaN

mtrand.pyx:935: ValueError

opened by XiaoConstantine 5

Update README with additional steps for python build

The existing steps on the main README seem a bit outdated(for python at least)

This PR is mainly moving existing information around to avoid frictions to get started with using python api

opened by XiaoConstantine 0

Old AVX test fails due to GPF/sigsegv

Summary

The avx sgemm test fails at n=14 on my system, apparently due to a general protection fault reading unaligned memory.

Problem Information

_mm256_load_ps raises a general protection fault if an address is passed to it that is not aligned to 32 bytes. Documentation: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_load_ps&ig_expand=4279

System information

$ cat .git/refs/heads/master
e7726a4c165cc45ac117e9eabd8761013a26640e
$ uname -a
Linux DESKTOP-E3P7TC0 3.10.0-1160.62.1.el7.x86_64 #1 SMP Wed Mar 23 09:04:02 UTC 2022 x86_64 GNU/Linux
$ cat /etc/system-release
Red Hat Enterprise Linux Workstation release 7.9 (Maipo)

Diagnostic Information

$ gdb cpp/bolt-build/bolt
(gdb) run
Starting program: /shared/src/bolt/cpp/bolt-build/bolt 
brute force sgemm test, n = 1...
brute force sgemm test, n = 2...
brute force sgemm test, n = 5...
brute force sgemm test, n = 14...

Program received signal SIGSEGV, Segmentation fault.
0x00000000005a1af9 in _mm256_load_ps (__P=0x886cf8) at /usr/local/lib/gcc/x86_64-pc-linux-gnu/9.4.1/include/avxintrin.h:874
874       return *(__m256 *)__P;
(gdb) up
#1  (anonymous namespace)::sgemm_colmajor_narrow_padded<1, 2> (A=0x8878e0, B=0x875d80, N=14, D=1, M=2, out=0x886cc0, add_to_output=false, A_col_stride=14, B_col_stride=1, 
    out_col_stride=14, nrows_per_chunk=512) at /home/user/src/bolt/cpp/src/utils/avx_utils.hpp:394
394                             sums[mm] = _mm256_load_ps(out_ptr);
(gdb) p out_ptr
$1 = (float *) 0x886cf8
(gdb) bt
#0  0x00000000005a1af9 in _mm256_load_ps (__P=0x886cf8) at /usr/local/lib/gcc/x86_64-pc-linux-gnu/9.4.1/include/avxintrin.h:874
#1  (anonymous namespace)::sgemm_colmajor_narrow_padded<1, 2> (A=0x8878e0, B=0x875d80, N=14, D=1, M=2, out=0x886cc0, add_to_output=false, A_col_stride=14, B_col_stride=1, 
    out_col_stride=14, nrows_per_chunk=512) at /home/user/src/bolt/cpp/src/utils/avx_utils.hpp:394
#2  0x000000000059ff9e in sgemm_colmajor (A=0x8878e0, B=0x875d80, N=14, D=1, M=2, out=0x886cc0) at /home/user/src/bolt/cpp/src/utils/avx_utils.cpp:18
#3  0x00000000005ee949 in _test_sgemm_colmajor<-1, -1> (N=14, D=1, M=2, simple_entries=false) at /home/user/src/bolt/cpp/test/test_avx_utils.cpp:54
#4  0x00000000005ec3f5 in ____C_A_T_C_H____T_E_S_T____100 () at /home/user/src/bolt/cpp/test/test_avx_utils.cpp:155
#5  0x00000000005bb56e in Catch::FreeFunctionTestCase::invoke (this=0x86ef90) at /home/user/src/bolt/cpp/test/external/catch.hpp:5507
#6  0x00000000005aa337 in Catch::TestCase::invoke (this=0x889280) at /home/user/src/bolt/cpp/test/external/catch.hpp:6389
#7  0x00000000005b972b in Catch::RunContext::runCurrentTest (this=0x7fffffffd560, redirectedCout="", redirectedCerr="") at /home/user/src/bolt/cpp/test/external/catch.hpp:5131
#8  0x00000000005b8737 in Catch::RunContext::runTest (this=0x7fffffffd560, testCase=...) at /home/user/src/bolt/cpp/test/external/catch.hpp:5001
#9  0x00000000005ba095 in Catch::Runner::runTests (this=0x7fffffffd810) at /home/user/src/bolt/cpp/test/external/catch.hpp:5275
#10 0x00000000005bae1b in Catch::Session::run (this=0x7fffffffdb10) at /home/user/src/bolt/cpp/test/external/catch.hpp:5395
#11 0x00000000005bace8 in Catch::Session::run (this=0x7fffffffdb10, argc=1, argv=0x7fffffffdce8) at /home/user/src/bolt/cpp/test/external/catch.hpp:5378
#12 0x00000000005ae8bb in main (argc=1, argv=0x7fffffffdce8) at /home/user/src/bolt/cpp/test/main.cpp:22

$ valgrind cpp/bolt-build/bolt
==5087== Memcheck, a memory error detector                                                     
==5087== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.                       
==5087== Using Valgrind-3.18.0.GIT and LibVEX; rerun with -h for copyright info                
==5087== Command: cpp/bolt-build/bolt                                                          
==5087==                                                                                       
brute force sgemm test, n = 1...                                                               
brute force sgemm test, n = 2...                                                               
brute force sgemm test, n = 5...                                                                                                                                                              
brute force sgemm test, n = 14...                                                                                                                                                             
==5087==                                                                                                                                                                                      
==5087== Process terminating with default action of signal 11 (SIGSEGV)                                                                                                                       
==5087==  General Protection Fault                                                                                                                                                            
==5087==    at 0x5A1AF9: _mm256_load_ps (avxintrin.h:874)                                                                                                                                     
==5087==    by 0x5A1AF9: void (anonymous namespace)::sgemm_colmajor_narrow_padded<1, 2>(float const*, float const*, int, int, int, float*, bool, int, int, int, int) (avx_utils.hpp:394)      
==5087==    by 0x59FF9D: sgemm_colmajor(float const*, float const*, int, int, int, float*) (avx_utils.cpp:18)                                                                                 
==5087==    by 0x5EE948: void _test_sgemm_colmajor<-1, -1>(int, int, int, bool) (test_avx_utils.cpp:54)                                                                                       
==5087==    by 0x5EC3F4: ____C_A_T_C_H____T_E_S_T____100() (test_avx_utils.cpp:155)                                                                                                           
==5087==    by 0x5BB56D: Catch::FreeFunctionTestCase::invoke() const (catch.hpp:5507)                                                                                                         
==5087==    by 0x5AA336: Catch::TestCase::invoke() const (catch.hpp:6389)                                                                                                                     
==5087==    by 0x5B972A: Catch::RunContext::runCurrentTest(std::string&, std::string&) (catch.hpp:5131)
==5087==    by 0x5B8736: Catch::RunContext::runTest(Catch::TestCase const&) (catch.hpp:5001)
==5087==    by 0x5BA094: Catch::Runner::runTests() (catch.hpp:5275)
==5087==    by 0x5BAE1A: Catch::Session::run() (catch.hpp:5395)
==5087==    by 0x5BACE7: Catch::Session::run(int, char* const*) (catch.hpp:5378)
==5087==    by 0x5AE8BA: main (main.cpp:22)

Work to Resolve

I pursued this just a little bit before changing tasks. Following the control flow, it looked to me like the unaligned memory arose from passing a matrix with a nonaligned stride greater than 8, to sgemm_colmajor. I've come up with the below so far, but have not yet tested and debugged it and make frequent mistakes so likely something is wrong. The patch is pasted here from a tmux pane and then hand edited, so may need manual application.

diff --git a/cpp/test/test_avx_utils.cpp b/cpp/test/test_avx_utils.cpp
index 4ec4e00..34c318a 100644
--- a/cpp/test/test_avx_utils.cpp
+++ b/cpp/test/test_avx_utils.cpp
@@ -44,6 +44,8 @@ void _test_sgemm_colmajor(int N, int D, int M, bool simple_entries=false) {
         B.setRandom();
     }
     C = (C.array() + -999).matrix();  // value we won't accidentally get
+    int aligned_rows = C.rows() - (C.rows() % (-32 / int(sizeof(float))));
+    C.resize(aligned_rows, C.cols());

opened by xloem 0

setup.py does not work on osx (`error: command swig failed with exit status 1`)

Hello,

I've seen the other thread about installation but did not find an answer to my problem.

I can install swig and numpy but when I do python3 setup.py install I get the following

....
running install
running bdist_egg
running egg_info
writing pybolt.egg-info/PKG-INFO
writing dependency_links to pybolt.egg-info/dependency_links.txt
writing requirements to pybolt.egg-info/requires.txt
writing top-level names to pybolt.egg-info/top_level.txt
reading manifest file 'pybolt.egg-info/SOURCES.txt'
writing manifest file 'pybolt.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.9-x86_64/egg
running install_lib
running build_py
creating build/lib.macosx-10.9-x86_64-3.7
creating build/lib.macosx-10.9-x86_64-3.7/bolt
copying python/bolt/bolt.py -> build/lib.macosx-10.9-x86_64-3.7/bolt
copying python/bolt/bolt_api.py -> build/lib.macosx-10.9-x86_64-3.7/bolt
copying python/bolt/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/bolt
copying python/bolt/native.i -> build/lib.macosx-10.9-x86_64-3.7/bolt
running build_ext
building '_bolt' extension
swigging python/bolt/native.i to python/bolt/native_wrap.c
swig -python -o python/bolt/native_wrap.c python/bolt/native.i
python/bolt/config.i:50: Warning 301: class keyword used, but not in C++ mode.
python/bolt/config.i:51: Error: Syntax error - possibly a missing semicolon.
error: command 'swig' failed with exit status 1

It seems swig is not installed but I can do swig -version and I get


SWIG Version 4.0.2

Compiled with clang++ [x86_64-apple-darwin21.1.0]

Configured options: +pcre

Please see http://www.swig.org for reporting bugs and further information

opened by davidbp 0

mithral C++ parameter optimization

Does the mithral C++ codebase provide functions for parameter optimization? I have problems finding it and executing run_matmul() in the struct mithral_amm_task returns a wrong output matrix, since the optimization parameters are set randomly. Thank you!

opened by fjrdev 16

10x faster matrix and vector operations

Related tags

Overview

Installing

Python

C++

Notes

How does it work?

Benchmarks

Basic usage

Example: Matrix-vector multiplies

Example: K-Nearest Neighbor / Maximum Inner Product Search

Miscellaneous

Comments

Python unit test is broken?

Update README with additional steps for python build

Old AVX test fails due to GPF/sigsegv

Summary

Problem Information

System information

Diagnostic Information

Work to Resolve

setup.py does not work on osx (`error: command swig failed with exit status 1`)

mithral C++ parameter optimization

Owner

A faster pytorch implementation of faster r-cnn

A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization components are included and optional.

A series of convenience functions to make basic image processing operations such as translation, rotation, resizing, skeletonization, and displaying Matplotlib images easier with OpenCV and Python.

Linear algebra python - Number of operations and problems in Linear Algebra and Numerical Linear Algebra

Implementation of Geometric Vector Perceptron, a simple circuit for 3d rotation equivariance for learning over large biomolecules, in Pytorch. Idea proposed and accepted at ICLR 2021

Deep Learning Datasets Maker is a QGIS plugin to make datasets creation easier for raster and vector data.

A script written in Python that returns a consensus string and profile matrix of a given DNA string(s) in FASTA format.

Deep learning operations reinvented (for pytorch, tensorflow, jax and others)

Torch-mutable-modules - Use in-place and assignment operations on PyTorch module parameters with support for autograd

Vector Neurons: A General Framework for SO(3)-Equivariant Networks

General Virtual Sketching Framework for Vector Line Art (SIGGRAPH 2021)

Neural implicit reconstruction experiments for the Vector Neuron paper

Geometric Vector Perceptron --- a rotation-equivariant GNN for learning from biomolecular structure

This repository contains a set of codes to run (i.e., train, perform inference with, evaluate) a diarization method called EEND-vector-clustering.

The all new way to turn your boring vector meshes into the new fad in town; Voxels!

[SIGGRAPH Asia 2021] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning.

[SIGGRAPH Asia 2021] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning.

Vector.ai assignment