10x faster matrix and vector operations

Overview

Bolt

Bolt is an algorithm for compressing vectors of real-valued data and running mathematical operations directly on the compressed representations.

If you have a large collection of mostly-dense vectors and can tolerate lossy compression, Bolt can probably save you 10-200x space and compute time.

Bolt also has theoretical guarantees bounding the errors in its approximations.

EDIT: this repo now also features the source code for MADDNESS, our shiny new algorithm for approximate matrix multiplication. MADDNESS has no Python wrapper yet, and is referred to as "mithral" in the source code. Name changed because apparently I'm the only who gets Lord of the Rings references. MADDNESS runs ridiculously fast and, under reasonable assumptions, requires zero multiply-adds. Realistically, it'll be most useful for speeding up neural net inference on CPUs, but it'll take another couple papers to get it there; we need to generalize it to convolution and write the CUDA kernels to allow GPU training.

NOTE: All below code refers to the Python wrapper for Bolt and has nothing to do with MADDNESS. It also seems to be no longer building for many people. If you want to use MADDNESS, see the Python Implementation driven by amm_main.py or C++ implementation. All code is ugly, but Python code should be pretty easy to add new AMM methods/variations to.

Installing

Python

  $ brew install swig  # for wrapping C++; use apt-get, yum, etc, if not OS X
  $ pip install numpy  # bolt installation needs numpy already present
  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt && python setup.py install
  $ pytest tests/  # optionally, run the tests

If you run into any problems, please don't hesitate to mention it in the Python build problems issue.

C++

Install Bazel, Google's open-source build system. Then

  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt/cpp && bazel run :main

The bazel run command will build the project and run the tests and benchmarks.

If you want to integrate Bolt with another C++ project, include cpp/src/include/public.hpp and add the remaining files under cpp/src to your builds. You should let me know if you're interested in doing such an integration because I'm hoping to see Bolt become part of many libraries and thus would be happy to help you.

Notes

Bolt currently only supports machines with AVX2 instructions, which basically means x86 machines from fall 2013 or later. Contributions for ARM support are welcome. Also note that the Bolt Python wrapper is currently configured to require Clang, since GCC apparently runs into issues.

How does it work?

Bolt is based on vector quantization. For details, see the Bolt paper or slides.

Benchmarks

Bolt includes a thorough set of speed and accuracy benchmarks. See the experiments/ directory. This is also what you want if you want to reproduce the results in the paper.

Note that all of the timing results use the raw C++ implementation. At present, the Python wrapper is slightly slower due to Python overhead. If you're interested in having a full-speed wrapper, let me know and I'll allocate time to making this happen.

Basic usage

X, queries = some N x D array, some iterable of length D arrays

# these are approximately equal (though the latter are shifted and scaled)
enc = bolt.Encoder(reduction='dot').fit(X)
[np.dot(X, q) for q in queries]
[enc.transform(q) for q in queries]

# same for these
enc = bolt.Encoder(reduction='l2').fit(X)
[np.sum((X - q) * (X - q), axis=1) for q in queries]
[enc.transform(q) for q in queries]

# but enc.transform() is 10x faster or more

Example: Matrix-vector multiplies

import bolt
import numpy as np
from scipy.stats import pearsonr as corr
from sklearn.datasets import load_digits
import timeit

# for simplicity, use the sklearn digits dataset; we'll split
# it into a matrix X and a set of queries Q
X, _ = load_digits(return_X_y=True)
nqueries = 20
X, Q = X[:-nqueries], X[-nqueries:]

enc = bolt.Encoder(reduction='dot', accuracy='lowest') # can tweak acc vs speed
enc.fit(X)

dot_corrs = np.empty(nqueries)
for i, q in enumerate(Q):
    dots_true = np.dot(X, q)
    dots_bolt = enc.transform(q)
    dot_corrs[i] = corr(dots_true, dots_bolt)[0]

# dot products closely preserved despite compression
print "dot product correlation: {} +/- {}".format(
    np.mean(dot_corrs), np.std(dot_corrs))  # > .97

# massive space savings
print(X.nbytes)  # 1777 rows * 64 cols * 8B = 909KB
print(enc.nbytes)  # 1777 * 2B = 3.55KB

# massive time savings (~10x here, but often >100x on larger
# datasets with less Python overhead; see the paper)
t_np = timeit.Timer(
    lambda: [np.dot(X, q) for q in Q]).timeit(5)        # ~9ms
t_bolt = timeit.Timer(
    lambda: [enc.transform(q) for q in Q]).timeit(5)    # ~800us
print "Numpy / BLAS time, Bolt time: {:.3f}ms, {:.3f}ms".format(
    t_np * 1000, t_bolt * 1000)

# can get output without offset/scaling if needed
dots_bolt = [enc.transform(q, unquantize=True) for q in Q]

Example: K-Nearest Neighbor / Maximum Inner Product Search

# search using squared Euclidean distances
# (still using the Digits dataset from above)
enc = bolt.Encoder('l2', accuracy='high').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

# search using dot product (maximum inner product search)
enc = bolt.Encoder('dot', accuracy='medium').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

Miscellaneous

Bolt stands for "Based On Lookup Tables". Feel free to use this exciting fact at parties.

Comments
  • Python unit test is broken?

    Python unit test is broken?

    OS: MacOS Monterey 12.5 (Intel chip) Python: 3.10.5

    
    ❯ pytest tests
    ============================================================================== test session starts ===============================================================================
    platform darwin -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
    rootdir: /Users/xiao/development/github.com/XiaoConstantine/bolt-1
    collected 4 items
    
    tests/test_encoder.py ..F.                                                                                                                                                 [100%]
    
    ==================================================================================== FAILURES ====================================================================================
    ________________________________________________________________________________ test_unquantize _________________________________________________________________________________
    
        def test_unquantize():
            X, Q = _load_digits_X_Q(nqueries=20)
    >       enc = bolt.Encoder('dot', accuracy='high').fit(X)
    
    tests/test_encoder.py:151:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:466: in fit
        centroids = _learn_centroids(X, ncentroids=ncentroids,
    ../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:142: in _learn_centroids
        centroids, labels = kmeans(X_in, ncentroids)
    ../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:106: in kmeans
        seeds = kmc2.kmc2(X, k).astype(np.float32)
    kmc2.pyx:97: in kmc2.kmc2
        ???
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    >   ???
    E   ValueError: probabilities contain NaN
    
    mtrand.pyx:935: ValueError
    
    opened by XiaoConstantine 5
  • Update README with additional steps for python build

    Update README with additional steps for python build

    The existing steps on the main README seem a bit outdated(for python at least)

    This PR is mainly moving existing information around to avoid frictions to get started with using python api

    opened by XiaoConstantine 0
  • Old AVX test fails due to GPF/sigsegv

    Old AVX test fails due to GPF/sigsegv

    Summary

    The avx sgemm test fails at n=14 on my system, apparently due to a general protection fault reading unaligned memory.

    Problem Information

    _mm256_load_ps raises a general protection fault if an address is passed to it that is not aligned to 32 bytes. Documentation: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_load_ps&ig_expand=4279

    System information

    $ cat .git/refs/heads/master
    e7726a4c165cc45ac117e9eabd8761013a26640e
    $ uname -a
    Linux DESKTOP-E3P7TC0 3.10.0-1160.62.1.el7.x86_64 #1 SMP Wed Mar 23 09:04:02 UTC 2022 x86_64 GNU/Linux
    $ cat /etc/system-release
    Red Hat Enterprise Linux Workstation release 7.9 (Maipo)
    

    Diagnostic Information

    $ gdb cpp/bolt-build/bolt
    (gdb) run
    Starting program: /shared/src/bolt/cpp/bolt-build/bolt 
    brute force sgemm test, n = 1...
    brute force sgemm test, n = 2...
    brute force sgemm test, n = 5...
    brute force sgemm test, n = 14...
    
    Program received signal SIGSEGV, Segmentation fault.
    0x00000000005a1af9 in _mm256_load_ps (__P=0x886cf8) at /usr/local/lib/gcc/x86_64-pc-linux-gnu/9.4.1/include/avxintrin.h:874
    874       return *(__m256 *)__P;
    (gdb) up
    #1  (anonymous namespace)::sgemm_colmajor_narrow_padded<1, 2> (A=0x8878e0, B=0x875d80, N=14, D=1, M=2, out=0x886cc0, add_to_output=false, A_col_stride=14, B_col_stride=1, 
        out_col_stride=14, nrows_per_chunk=512) at /home/user/src/bolt/cpp/src/utils/avx_utils.hpp:394
    394                             sums[mm] = _mm256_load_ps(out_ptr);
    (gdb) p out_ptr
    $1 = (float *) 0x886cf8
    (gdb) bt
    #0  0x00000000005a1af9 in _mm256_load_ps (__P=0x886cf8) at /usr/local/lib/gcc/x86_64-pc-linux-gnu/9.4.1/include/avxintrin.h:874
    #1  (anonymous namespace)::sgemm_colmajor_narrow_padded<1, 2> (A=0x8878e0, B=0x875d80, N=14, D=1, M=2, out=0x886cc0, add_to_output=false, A_col_stride=14, B_col_stride=1, 
        out_col_stride=14, nrows_per_chunk=512) at /home/user/src/bolt/cpp/src/utils/avx_utils.hpp:394
    #2  0x000000000059ff9e in sgemm_colmajor (A=0x8878e0, B=0x875d80, N=14, D=1, M=2, out=0x886cc0) at /home/user/src/bolt/cpp/src/utils/avx_utils.cpp:18
    #3  0x00000000005ee949 in _test_sgemm_colmajor<-1, -1> (N=14, D=1, M=2, simple_entries=false) at /home/user/src/bolt/cpp/test/test_avx_utils.cpp:54
    #4  0x00000000005ec3f5 in ____C_A_T_C_H____T_E_S_T____100 () at /home/user/src/bolt/cpp/test/test_avx_utils.cpp:155
    #5  0x00000000005bb56e in Catch::FreeFunctionTestCase::invoke (this=0x86ef90) at /home/user/src/bolt/cpp/test/external/catch.hpp:5507
    #6  0x00000000005aa337 in Catch::TestCase::invoke (this=0x889280) at /home/user/src/bolt/cpp/test/external/catch.hpp:6389
    #7  0x00000000005b972b in Catch::RunContext::runCurrentTest (this=0x7fffffffd560, redirectedCout="", redirectedCerr="") at /home/user/src/bolt/cpp/test/external/catch.hpp:5131
    #8  0x00000000005b8737 in Catch::RunContext::runTest (this=0x7fffffffd560, testCase=...) at /home/user/src/bolt/cpp/test/external/catch.hpp:5001
    #9  0x00000000005ba095 in Catch::Runner::runTests (this=0x7fffffffd810) at /home/user/src/bolt/cpp/test/external/catch.hpp:5275
    #10 0x00000000005bae1b in Catch::Session::run (this=0x7fffffffdb10) at /home/user/src/bolt/cpp/test/external/catch.hpp:5395
    #11 0x00000000005bace8 in Catch::Session::run (this=0x7fffffffdb10, argc=1, argv=0x7fffffffdce8) at /home/user/src/bolt/cpp/test/external/catch.hpp:5378
    #12 0x00000000005ae8bb in main (argc=1, argv=0x7fffffffdce8) at /home/user/src/bolt/cpp/test/main.cpp:22
    
    $ valgrind cpp/bolt-build/bolt
    ==5087== Memcheck, a memory error detector                                                     
    ==5087== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.                       
    ==5087== Using Valgrind-3.18.0.GIT and LibVEX; rerun with -h for copyright info                
    ==5087== Command: cpp/bolt-build/bolt                                                          
    ==5087==                                                                                       
    brute force sgemm test, n = 1...                                                               
    brute force sgemm test, n = 2...                                                               
    brute force sgemm test, n = 5...                                                                                                                                                              
    brute force sgemm test, n = 14...                                                                                                                                                             
    ==5087==                                                                                                                                                                                      
    ==5087== Process terminating with default action of signal 11 (SIGSEGV)                                                                                                                       
    ==5087==  General Protection Fault                                                                                                                                                            
    ==5087==    at 0x5A1AF9: _mm256_load_ps (avxintrin.h:874)                                                                                                                                     
    ==5087==    by 0x5A1AF9: void (anonymous namespace)::sgemm_colmajor_narrow_padded<1, 2>(float const*, float const*, int, int, int, float*, bool, int, int, int, int) (avx_utils.hpp:394)      
    ==5087==    by 0x59FF9D: sgemm_colmajor(float const*, float const*, int, int, int, float*) (avx_utils.cpp:18)                                                                                 
    ==5087==    by 0x5EE948: void _test_sgemm_colmajor<-1, -1>(int, int, int, bool) (test_avx_utils.cpp:54)                                                                                       
    ==5087==    by 0x5EC3F4: ____C_A_T_C_H____T_E_S_T____100() (test_avx_utils.cpp:155)                                                                                                           
    ==5087==    by 0x5BB56D: Catch::FreeFunctionTestCase::invoke() const (catch.hpp:5507)                                                                                                         
    ==5087==    by 0x5AA336: Catch::TestCase::invoke() const (catch.hpp:6389)                                                                                                                     
    ==5087==    by 0x5B972A: Catch::RunContext::runCurrentTest(std::string&, std::string&) (catch.hpp:5131)
    ==5087==    by 0x5B8736: Catch::RunContext::runTest(Catch::TestCase const&) (catch.hpp:5001)
    ==5087==    by 0x5BA094: Catch::Runner::runTests() (catch.hpp:5275)
    ==5087==    by 0x5BAE1A: Catch::Session::run() (catch.hpp:5395)
    ==5087==    by 0x5BACE7: Catch::Session::run(int, char* const*) (catch.hpp:5378)
    ==5087==    by 0x5AE8BA: main (main.cpp:22)
    

    Work to Resolve

    I pursued this just a little bit before changing tasks. Following the control flow, it looked to me like the unaligned memory arose from passing a matrix with a nonaligned stride greater than 8, to sgemm_colmajor. I've come up with the below so far, but have not yet tested and debugged it and make frequent mistakes so likely something is wrong. The patch is pasted here from a tmux pane and then hand edited, so may need manual application.

    diff --git a/cpp/test/test_avx_utils.cpp b/cpp/test/test_avx_utils.cpp
    index 4ec4e00..34c318a 100644
    --- a/cpp/test/test_avx_utils.cpp
    +++ b/cpp/test/test_avx_utils.cpp
    @@ -44,6 +44,8 @@ void _test_sgemm_colmajor(int N, int D, int M, bool simple_entries=false) {
             B.setRandom();
         }
         C = (C.array() + -999).matrix();  // value we won't accidentally get
    +    int aligned_rows = C.rows() - (C.rows() % (-32 / int(sizeof(float))));
    +    C.resize(aligned_rows, C.cols());
    
    opened by xloem 0
  • setup.py does not work on osx (`error: command swig failed with exit status 1`)

    setup.py does not work on osx (`error: command swig failed with exit status 1`)

    Hello,

    I've seen the other thread about installation but did not find an answer to my problem.

    I can install swig and numpy but when I do python3 setup.py install I get the following

    ....
    running install
    running bdist_egg
    running egg_info
    writing pybolt.egg-info/PKG-INFO
    writing dependency_links to pybolt.egg-info/dependency_links.txt
    writing requirements to pybolt.egg-info/requires.txt
    writing top-level names to pybolt.egg-info/top_level.txt
    reading manifest file 'pybolt.egg-info/SOURCES.txt'
    writing manifest file 'pybolt.egg-info/SOURCES.txt'
    installing library code to build/bdist.macosx-10.9-x86_64/egg
    running install_lib
    running build_py
    creating build/lib.macosx-10.9-x86_64-3.7
    creating build/lib.macosx-10.9-x86_64-3.7/bolt
    copying python/bolt/bolt.py -> build/lib.macosx-10.9-x86_64-3.7/bolt
    copying python/bolt/bolt_api.py -> build/lib.macosx-10.9-x86_64-3.7/bolt
    copying python/bolt/__init__.py -> build/lib.macosx-10.9-x86_64-3.7/bolt
    copying python/bolt/native.i -> build/lib.macosx-10.9-x86_64-3.7/bolt
    running build_ext
    building '_bolt' extension
    swigging python/bolt/native.i to python/bolt/native_wrap.c
    swig -python -o python/bolt/native_wrap.c python/bolt/native.i
    python/bolt/config.i:50: Warning 301: class keyword used, but not in C++ mode.
    python/bolt/config.i:51: Error: Syntax error - possibly a missing semicolon.
    error: command 'swig' failed with exit status 1
    

    It seems swig is not installed but I can do swig -version and I get

    
    SWIG Version 4.0.2
    
    Compiled with clang++ [x86_64-apple-darwin21.1.0]
    
    Configured options: +pcre
    
    Please see http://www.swig.org for reporting bugs and further information
    
    opened by davidbp 0
  • mithral C++ parameter optimization

    mithral C++ parameter optimization

    Does the mithral C++ codebase provide functions for parameter optimization? I have problems finding it and executing run_matmul() in the struct mithral_amm_task returns a wrong output matrix, since the optimization parameters are set randomly. Thank you!

    opened by fjrdev 16
Owner
Machine learning PhD Student at MIT. I build fast machine learning algorithms.
null
A faster pytorch implementation of faster r-cnn

A Faster Pytorch Implementation of Faster R-CNN Write at the beginning [05/29/2020] This repo was initaited about two years ago, developed as the firs

Jianwei Yang 7.1k Jan 1, 2023
A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization components are included and optional.

Description A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization co

AoxiangFan 9 Nov 10, 2022
A series of convenience functions to make basic image processing operations such as translation, rotation, resizing, skeletonization, and displaying Matplotlib images easier with OpenCV and Python.

imutils A series of convenience functions to make basic image processing functions such as translation, rotation, resizing, skeletonization, and displ

Adrian Rosebrock 4.3k Jan 8, 2023
Linear algebra python - Number of operations and problems in Linear Algebra and Numerical Linear Algebra

Linear algebra in python Number of operations and problems in Linear Algebra and

Alireza 5 Oct 9, 2022
Implementation of Geometric Vector Perceptron, a simple circuit for 3d rotation equivariance for learning over large biomolecules, in Pytorch. Idea proposed and accepted at ICLR 2021

Geometric Vector Perceptron Implementation of Geometric Vector Perceptron, a simple circuit with 3d rotation equivariance for learning over large biom

Phil Wang 59 Nov 24, 2022
Deep Learning Datasets Maker is a QGIS plugin to make datasets creation easier for raster and vector data.

Deep Learning Dataset Maker Deep Learning Datasets Maker is a QGIS plugin to make datasets creation easier for raster and vector data. How to use Down

deepbands 25 Dec 15, 2022
A script written in Python that returns a consensus string and profile matrix of a given DNA string(s) in FASTA format.

A script written in Python that returns a consensus string and profile matrix of a given DNA string(s) in FASTA format.

Zain 1 Feb 1, 2022
Deep learning operations reinvented (for pytorch, tensorflow, jax and others)

This video in better quality. einops Flexible and powerful tensor operations for readable and reliable code. Supports numpy, pytorch, tensorflow, and

Alex Rogozhnikov 6.2k Jan 1, 2023
Torch-mutable-modules - Use in-place and assignment operations on PyTorch module parameters with support for autograd

Torch Mutable Modules Use in-place and assignment operations on PyTorch module p

Kento Nishi 7 Jun 6, 2022
Vector Neurons: A General Framework for SO(3)-Equivariant Networks

Vector Neurons: A General Framework for SO(3)-Equivariant Networks Created by Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacc

Congyue Deng 332 Dec 29, 2022
General Virtual Sketching Framework for Vector Line Art (SIGGRAPH 2021)

General Virtual Sketching Framework for Vector Line Art - SIGGRAPH 2021 Paper | Project Page Outline Dependencies Testing with Trained Weights Trainin

Haoran MO 118 Dec 27, 2022
Neural implicit reconstruction experiments for the Vector Neuron paper

Neural Implicit Reconstruction with Vector Neurons This repository contains code for the neural implicit reconstruction experiments in the paper Vecto

Congyue Deng 35 Jan 2, 2023
Geometric Vector Perceptron --- a rotation-equivariant GNN for learning from biomolecular structure

Geometric Vector Perceptron Code to accompany Learning from Protein Structure with Geometric Vector Perceptrons by B Jing, S Eismann, P Suriana, RJL T

Dror Lab 85 Dec 29, 2022
This repository contains a set of codes to run (i.e., train, perform inference with, evaluate) a diarization method called EEND-vector-clustering.

EEND-vector clustering The EEND-vector clustering (End-to-End-Neural-Diarization-vector clustering) is a speaker diarization framework that integrates

null 45 Dec 26, 2022
The all new way to turn your boring vector meshes into the new fad in town; Voxels!

Voxelator The all new way to turn your boring vector meshes into the new fad in town; Voxels! Notes: I have not tested this on a rotated mesh. With fu

null 6 Feb 3, 2022
[SIGGRAPH Asia 2021] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning.

DeepVecFont This is the homepage for "DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning". Yizhi Wang and Zhouhui Lian. WI

Yizhi Wang 17 Dec 22, 2022
[SIGGRAPH Asia 2021] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning.

DeepVecFont This is the homepage for "DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning". Yizhi Wang and Zhouhui Lian. WI

Yizhi Wang 5 Oct 22, 2021
Vector.ai assignment

fabio-tests-nisargatman Low Level Approach: ###Tables: continents: id*, name, population, area, createdAt, updatedAt countries: id*, name, population,

Ravi Pullagurla 1 Nov 9, 2021