SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

Michael Recachinas

Last update: Oct 14, 2022

Related tags

Machine Learning python c avx edit-distance simd hexadecimal sse42 hamming-distance

Overview

`hexhamming`

What does it do?

This module performs a fast bitwise hamming distance of two hexadecimal strings.

This looks like:

DEADBEEF = 11011110101011011011111011101111
00000000 = 00000000000000000000000000000000
XOR      = 11011110101011011011111011101111
Hamming  = number of ones in DEADBEEF ^ 00000000 = 24

This essentially amounts to

>>> import gmpy
>>> gmpy.popcount(0xdeadbeef ^ 0x00000000)
24

except with Python strings, so

>>> import gmpy
>>> gmpy.popcount(int("deadbeef", 16) ^ int("00000000", 16))
24

A few assumptions are made and enforced:

this is a valid hexadecimal string (i.e., [a-fA-F0-9]+)
the strings are the same length
the strings do not begin with "0x"

Why yet another Hamming distance library?

There are a lot of fantastic (python) libraries that offer methods to calculate various edit distances, including Hamming distances: Distance, textdistance, scipy, jellyfish, etc.

In this case, I needed a hamming distance library that worked on hexadecimal strings (i.e., a Python str) and performed blazingly fast. Furthermore, I often did not care about hex strings greater than 256 bits. That length constraint is different vs all the other libraries and enabled me to explore vectorization techniques via numba, numpy, and SSE/AVX intrinsics.

Lastly, I wanted to minimize dependencies, meaning you do not need to install numpy, gmpy, cython, pypy, pythran, etc.

Eventually, after playing around with gmpy.popcount, numba.jit, pythran.run, numpy, I decided to write what I wanted in essentially raw C. At this point, I'm using raw char* and int*, so exploring re-writing this in Fortran makes little sense.

Installation

To install, ensure you have Python 2.7 or 3.4+. Run

pip install hexhamming

or to install from source

git clone https://github.com/mrecachinas/hexhamming
cd hexhamming
python setup.py install # or pip install .

If you want to contribute to hexhamming, you should install the dev dependencies

pip install -r requirements-dev.txt

and make sure the tests pass with

python -m pytest -vls .

Example

Using hexhamming is as simple as

>>> from hexhamming import hamming_distance_string
>>> hamming_distance_string("deadbeef", "00000000")
24

New in v2.0.0 : hexhamming now supports byte`s via ``hamming_distance_bytes`. You use it in the exact same way as before, except you pass in a byte string.

>>> from hexhamming import hamming_distance_bytes
>>> hamming_distance_bytes(b"\xde\xad\xbe\xef", b"\x00\x00\x00\x00")
24

Benchmark

Below is a benchmark using pytest-benchmark with hexhamming==v1.3.2 my 2020 2.0 GHz quad-core Intel Core i5 16 GB 3733 MHz LPDDR4 macOS Catalina (10.15.5) with Python 3.7.3 and Apple clang version 11.0.3 (clang-1103.0.32.62).

Name	Mean (ns)	Std (ns)	Median (ns)	Rounds	Iterations
test_hamming_distance_bench_3	93.8	10.5	94.3	53268	200
test_hamming_distance_bench_3_same	94.2	15.2	94.9	102146	100
test_check_hexstrings_within_dist_bench	231.9	104.2	216.5	195122	22
test_hamming_distance_bench_256	97.5	34.1	94.0	195122	22
test_hamming_distance_bench_1000	489.8	159.4	477.5	94411	20
test_hamming_distance_bench_1000_same	497.8	87.8	496.6	18971	20
test_hamming_distance_bench_1024	509.9	299.5	506.7	18652	10
test_hamming_distance_bench_1024_same	467.4	205.9	450.4	181819	10

Comments

`check_bytes_arrays_within_dist` function added.

Also hamming_distance_sse41_byte_max_dist and hamming_distance_loop_byte_max_dist added as underlying functions. Added 2 tests(for input args checks and calculations check) and 1 benchmark(with result check too). Version changed from 1.4 to 2.1
enhancement

opened by bigcat88 14
Add function to find closest bytestring/string in list
In #7, @bigcat88 wrote:

Also imho, will be interesting such function for python:

int check_bytes_arrays_within_dist( byte* arrays_to_compare, int nElementsInFirstArg, byte* what_to_compare, int nBytesInElement, int max_dist )

Return index of first array for which distance <(or <=) max_dist, or -1 otherwise.

Like current check_hexstrings_within_dist with ability to compare many arrays(arg1) to one(arg=what_to_compare) to see if for specified distance there is any equality. Thanks, for your work and library - anyway 👍

enhancement
opened by mrecachinas 12
Is there any plan to support bytesin Python?
Thank you for great work 👍🏼 I'm wondering is there any plan to support bytes format?

For example:

hamming_distance(b'\xde\xad\xbe\xef', b'\xde\xad\xbe\xef')
enhancement
opened by TsingJyujing 6
Additional changes to make github actions happy.
Added -s parameter to pytest, to see output of tests.

In tests added this output: print(f'Warning: Skipping {algorithm}, reason: {result}')

set_algo now checks if cpu can run those algorithm.

P.S.: On GitHub MacOS sysctl -a | grep machdep.cpu.features reports that AVX is present(avx1.0) but cpuid reports that avx2.0 is not. So now set_algo checks for cpu capabilities and tests for extra on MacOS are currently skipped. It's fine with this.
opened by bigcat88 5
Arm, sdist, musli, windows builds.
Moved version to separate file, to edit it in one place.

Slightly changed setup.py

In most functions changed return value from int to uint64_t.

The same for input length, changed it to uint64_t.

Added ARM implementation for hamming_distance_bytes__extra.

Added to YML file building of arm, sdist, musli, windows.

After publishing of wheels I will test them on M1(did not test cross building in cibuildwheel yet).

Note: 32 bit version fail tests, so did not added x86 build to Windows and i686 to linux.

#11 #12
opened by bigcat88 4
2.2.3 , Illegal Instruction fixed.
Changed version to 2.2.3

Add sdist upload to pypi.

Fixed Illegal Instruction.

P.S: We can't add O2 flag to Linux builds, because then he automaticly replaces SSE with AVX. We cant use snprintf without args on Linux, because for some reason g++ replace it with SSE instructions.
opened by bigcat88 2
Update test_hexhamming.py

Don't found why this happens on github, looked at docker images, they doesn't change CPU flags, github actions has good CPU flags(checked on ubuntu and macos)... Need to write check inside set_algo to deny change of algorithm that CPU didn't support, will add that later, with arm cpu support.

Commented out new lines in tests, will work now.

opened by bigcat88 1
Fix for `illegal instruction`

Good time of a day. I was wrong that we dont need a check for SSE CPU capabilities. I got many bug reports, some people got Illegal instruction immediately after install, during import of module and two persons had that problem during run. I Didn't know that time that Proxmox software by default uses kvm64 type of CPU(Pentium 4), and some people uses docker with qemu with custom CPU flags. My bad.

This pull request fixes that situations, i reverted calculation of popcount for bytes to your old code that uses SSE4, removed all static const vars of SSE init, added check for SSE support, and added function set_algo to change algorithms. That new func we might be used for test of algorithms(i suppose it is only for internal use). Made some small code refactor(mostly moved functions in different order for further development of arm algo version).

I added you name and email to setup.py file, some paranoic people check this fields during manual install.

Bumped version to 2.1.1 cause nothing new was added, only refactoring and bug fixes.

Made full tests that i was able to, on proxmox(qemu) / macos intel 8xxx / linux Intel 10xxx.

Waiting for this fixes appear on pypi =)

opened by bigcat88 1
Binary wheels for linux aarch64 and macos(x64,aarch64) - example

Good time of day. This is example of yaml file which I use to generate wheels. I can reply on any question about any step in in that file. It works good on macos/linux. MacOS currently is commented there due to bug in delocate, but this bug will not affect project that uses C code directly, as yours.

Will be good to see more wheels :) If you wish I can try to write such file myself for your project and make a pull request with changes.
enhancement

opened by bigcat88 1
Byte support

Per #7, this PR adds support for Python bytes via hexhamming.hamming_distance_bytes.

Note: to maintain naming consistency, hexhamming.hamming_distance has been renamed hexhamming.hamming_distance_string.
enhancement

opened by mrecachinas 0

Owner

Michael Recachinas

Husband to @erinrecachinas, Dad, 🐶 Dad, he/him/his

GitHub https://github.com/mrecachinas/hexhamming

Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Neighbourhood Retrieval with Distance Correlation Assign Pseudo class labels to datapoints in the latent space. NNDC is a slim wrapper around FAISS. N

1 Jan 16, 2022

Python module for machine learning time series:

seglearn Seglearn is a python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extr

536 Dec 29, 2022

A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

31 Dec 16, 2022

Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Naive-Bayes Spam Classificator Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm. Main goal is to code a

1 Jun 27, 2022

Python module for performing linear regression for data with measurement errors and intrinsic scatter

Linear regression for data with measurement errors and intrinsic scatter (BCES) Python module for performing robust linear regression on (X,Y) data po

56 Sep 27, 2022

Python module for data science and machine learning users.

dsnk-distributions package dsnk distribution is a Python module for data science and machine learning that was created with the goal of reducing calcu

1 Nov 23, 2021

A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

802 Jan 1, 2023

Module for statistical learning, with a particular emphasis on time-dependent modelling

Operating system Build Status Linux/Mac Windows tick tick is a Python 3 module for statistical learning, with a particular emphasis on time-dependent

410 Dec 14, 2022

Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

pyspark-anonymizer Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark envir

6 Jun 30, 2022

Educational python for Neural Networks, written in pure Python/NumPy.

127 Oct 27, 2022

learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

6 Nov 5, 2022

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

3k Jan 8, 2023

SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

Related tags

Overview

hexhamming

What does it do?

Why yet another Hamming distance library?

Installation

Example

Benchmark

Comments

Owner

Michael Recachinas

Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Python module for machine learning time series:

A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Python module for performing linear regression for data with measurement errors and intrinsic scatter

Python module for data science and machine learning users.

A scikit-learn based module for multi-label et. al. classification

Module for statistical learning, with a particular emphasis on time-dependent modelling

Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Educational python for Neural Networks, written in pure Python/NumPy.

learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

A modular active learning framework for Python

A library of extension and helper modules for Python's data analysis and machine learning libraries.

Sequence learning toolkit for Python

Simple structured learning framework for python

Python implementation of the rulefit algorithm

Metric learning algorithms in Python

[HELP REQUESTED] Generalized Additive Models in Python

`hexhamming`