SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

Overview

hexhamming

Pip Prs Github

What does it do?

This module performs a fast bitwise hamming distance of two hexadecimal strings.

This looks like:

DEADBEEF = 11011110101011011011111011101111
00000000 = 00000000000000000000000000000000
XOR      = 11011110101011011011111011101111
Hamming  = number of ones in DEADBEEF ^ 00000000 = 24

This essentially amounts to

>>> import gmpy
>>> gmpy.popcount(0xdeadbeef ^ 0x00000000)
24

except with Python strings, so

>>> import gmpy
>>> gmpy.popcount(int("deadbeef", 16) ^ int("00000000", 16))
24

A few assumptions are made and enforced:

  • this is a valid hexadecimal string (i.e., [a-fA-F0-9]+)
  • the strings are the same length
  • the strings do not begin with "0x"

Why yet another Hamming distance library?

There are a lot of fantastic (python) libraries that offer methods to calculate various edit distances, including Hamming distances: Distance, textdistance, scipy, jellyfish, etc.

In this case, I needed a hamming distance library that worked on hexadecimal strings (i.e., a Python str) and performed blazingly fast. Furthermore, I often did not care about hex strings greater than 256 bits. That length constraint is different vs all the other libraries and enabled me to explore vectorization techniques via numba, numpy, and SSE/AVX intrinsics.

Lastly, I wanted to minimize dependencies, meaning you do not need to install numpy, gmpy, cython, pypy, pythran, etc.

Eventually, after playing around with gmpy.popcount, numba.jit, pythran.run, numpy, I decided to write what I wanted in essentially raw C. At this point, I'm using raw char* and int*, so exploring re-writing this in Fortran makes little sense.

Installation

To install, ensure you have Python 2.7 or 3.4+. Run

pip install hexhamming

or to install from source

git clone https://github.com/mrecachinas/hexhamming
cd hexhamming
python setup.py install # or pip install .

If you want to contribute to hexhamming, you should install the dev dependencies

pip install -r requirements-dev.txt

and make sure the tests pass with

python -m pytest -vls .

Example

Using hexhamming is as simple as

>>> from hexhamming import hamming_distance_string
>>> hamming_distance_string("deadbeef", "00000000")
24

New in v2.0.0 : hexhamming now supports byte`s via ``hamming_distance_bytes`. You use it in the exact same way as before, except you pass in a byte string.

>>> from hexhamming import hamming_distance_bytes
>>> hamming_distance_bytes(b"\xde\xad\xbe\xef", b"\x00\x00\x00\x00")
24

Benchmark

Below is a benchmark using pytest-benchmark with hexhamming==v1.3.2 my 2020 2.0 GHz quad-core Intel Core i5 16 GB 3733 MHz LPDDR4 macOS Catalina (10.15.5) with Python 3.7.3 and Apple clang version 11.0.3 (clang-1103.0.32.62).

Name Mean (ns) Std (ns) Median (ns) Rounds Iterations
test_hamming_distance_bench_3 93.8 10.5 94.3 53268 200
test_hamming_distance_bench_3_same 94.2 15.2 94.9 102146 100
test_check_hexstrings_within_dist_bench 231.9 104.2 216.5 195122 22
test_hamming_distance_bench_256 97.5 34.1 94.0 195122 22
test_hamming_distance_bench_1000 489.8 159.4 477.5 94411 20
test_hamming_distance_bench_1000_same 497.8 87.8 496.6 18971 20
test_hamming_distance_bench_1024 509.9 299.5 506.7 18652 10
test_hamming_distance_bench_1024_same 467.4 205.9 450.4 181819 10
Comments
  • `check_bytes_arrays_within_dist` function added.

    `check_bytes_arrays_within_dist` function added.

    Also hamming_distance_sse41_byte_max_dist and hamming_distance_loop_byte_max_dist added as underlying functions. Added 2 tests(for input args checks and calculations check) and 1 benchmark(with result check too). Version changed from 1.4 to 2.1

    enhancement 
    opened by bigcat88 14
  • Add function to find closest bytestring/string in list

    Add function to find closest bytestring/string in list

    In #7, @bigcat88 wrote:

    Also imho, will be interesting such function for python:

    int check_bytes_arrays_within_dist(
        byte* arrays_to_compare,
        int nElementsInFirstArg,
        byte* what_to_compare,
        int nBytesInElement,
        int max_dist
    )
    

    Return index of first array for which distance <(or <=) max_dist, or -1 otherwise.

    Like current check_hexstrings_within_dist with ability to compare many arrays(arg1) to one(arg=what_to_compare) to see if for specified distance there is any equality. Thanks, for your work and library - anyway ๐Ÿ‘

    enhancement 
    opened by mrecachinas 12
  • Is there any plan to support bytesin Python?

    Is there any plan to support bytesin Python?

    Thank you for great work ๐Ÿ‘๐Ÿผ I'm wondering is there any plan to support bytes format?

    For example:

    hamming_distance(b'\xde\xad\xbe\xef', b'\xde\xad\xbe\xef')
    
    enhancement 
    opened by TsingJyujing 6
  • Additional changes to make github actions happy.

    Additional changes to make github actions happy.

    1. Added -s parameter to pytest, to see output of tests.
    2. In tests added this output: print(f'Warning: Skipping {algorithm}, reason: {result}')
    3. set_algo now checks if cpu can run those algorithm.

    P.S.: On GitHub MacOS sysctl -a | grep machdep.cpu.features reports that AVX is present(avx1.0) but cpuid reports that avx2.0 is not. So now set_algo checks for cpu capabilities and tests for extra on MacOS are currently skipped. It's fine with this.

    opened by bigcat88 5
  • Arm, sdist, musli, windows builds.

    Arm, sdist, musli, windows builds.

    • Moved version to separate file, to edit it in one place.
    • Slightly changed setup.py
    • In most functions changed return value from int to uint64_t.
    • The same for input length, changed it to uint64_t.
    • Added ARM implementation for hamming_distance_bytes__extra.
    • Added to YML file building of arm, sdist, musli, windows.
    • After publishing of wheels I will test them on M1(did not test cross building in cibuildwheel yet).

    Note: 32 bit version fail tests, so did not added x86 build to Windows and i686 to linux.

    #11 #12

    opened by bigcat88 4
  • 2.2.3 , Illegal Instruction fixed.

    2.2.3 , Illegal Instruction fixed.

    • Changed version to 2.2.3
    • Add sdist upload to pypi.
    • Fixed Illegal Instruction.

    P.S: We can't add O2 flag to Linux builds, because then he automaticly replaces SSE with AVX. We cant use snprintf without args on Linux, because for some reason g++ replace it with SSE instructions.

    opened by bigcat88 2
  • Update test_hexhamming.py

    Update test_hexhamming.py

    Don't found why this happens on github, looked at docker images, they doesn't change CPU flags, github actions has good CPU flags(checked on ubuntu and macos)... Need to write check inside set_algo to deny change of algorithm that CPU didn't support, will add that later, with arm cpu support.

    Commented out new lines in tests, will work now.

    opened by bigcat88 1
  • Fix for `illegal instruction`

    Fix for `illegal instruction`

    Good time of a day. I was wrong that we dont need a check for SSE CPU capabilities. I got many bug reports, some people got Illegal instruction immediately after install, during import of module and two persons had that problem during run. I Didn't know that time that Proxmox software by default uses kvm64 type of CPU(Pentium 4), and some people uses docker with qemu with custom CPU flags. My bad.

    This pull request fixes that situations, i reverted calculation of popcount for bytes to your old code that uses SSE4, removed all static const vars of SSE init, added check for SSE support, and added function set_algo to change algorithms. That new func we might be used for test of algorithms(i suppose it is only for internal use). Made some small code refactor(mostly moved functions in different order for further development of arm algo version).

    I added you name and email to setup.py file, some paranoic people check this fields during manual install.

    Bumped version to 2.1.1 cause nothing new was added, only refactoring and bug fixes.

    Made full tests that i was able to, on proxmox(qemu) / macos intel 8xxx / linux Intel 10xxx.

    Waiting for this fixes appear on pypi =)

    opened by bigcat88 1
  • Binary wheels for linux aarch64 and  macos(x64,aarch64) - example

    Binary wheels for linux aarch64 and macos(x64,aarch64) - example

    Good time of day. This is example of yaml file which I use to generate wheels. I can reply on any question about any step in in that file. It works good on macos/linux. MacOS currently is commented there due to bug in delocate, but this bug will not affect project that uses C code directly, as yours.

    Will be good to see more wheels :) If you wish I can try to write such file myself for your project and make a pull request with changes.

    enhancement 
    opened by bigcat88 1
  • Byte support

    Byte support

    Per #7, this PR adds support for Python bytes via hexhamming.hamming_distance_bytes.

    Note: to maintain naming consistency, hexhamming.hamming_distance has been renamed hexhamming.hamming_distance_string.

    enhancement 
    opened by mrecachinas 0
Owner
Michael Recachinas
Husband to @erinrecachinas, Dad, ๐Ÿถ Dad, he/him/his
Michael Recachinas
Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Neighbourhood Retrieval with Distance Correlation Assign Pseudo class labels to datapoints in the latent space. NNDC is a slim wrapper around FAISS. N

The Learning Machines 1 Jan 16, 2022
Python module for machine learning time series:

seglearn Seglearn is a python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extr

David Burns 536 Dec 29, 2022
A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

Stox 31 Dec 16, 2022
Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Naive-Bayes Spam Classificator Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm. Main goal is to code a

Viktoria Maksymiuk 1 Jun 27, 2022
Python module for performing linear regression for data with measurement errors and intrinsic scatter

Linear regression for data with measurement errors and intrinsic scatter (BCES) Python module for performing robust linear regression on (X,Y) data po

Rodrigo Nemmen 56 Sep 27, 2022
Python module for data science and machine learning users.

dsnk-distributions package dsnk distribution is a Python module for data science and machine learning that was created with the goal of reducing calcu

Emmanuel ASIFIWE 1 Nov 23, 2021
A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

null 802 Jan 1, 2023
Module for statistical learning, with a particular emphasis on time-dependent modelling

Operating system Build Status Linux/Mac Windows tick tick is a Python 3 module for statistical learning, with a particular emphasis on time-dependent

X - Data Science Initiative 410 Dec 14, 2022
Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

pyspark-anonymizer Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark envir

null 6 Jun 30, 2022
Educational python for Neural Networks, written in pure Python/NumPy.

Educational python for Neural Networks, written in pure Python/NumPy.

null 127 Oct 27, 2022
learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

BDFD 6 Nov 5, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
A modular active learning framework for Python

Modular Active Learning framework for Python3 Page contents Introduction Active learning from bird's-eye view modAL in action From zero to one in a fe

modAL 1.9k Dec 31, 2022
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

Sebastian Raschka 4.2k Dec 29, 2022
Sequence learning toolkit for Python

seqlearn seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and offer as similar as possible an API. Comp

Lars 653 Dec 27, 2022
Simple structured learning framework for python

PyStruct PyStruct aims at being an easy-to-use structured learning and prediction library. Currently it implements only max-margin methods and a perce

pystruct 666 Jan 3, 2023
Python implementation of the rulefit algorithm

RuleFit Implementation of a rule based prediction algorithm based on the rulefit algorithm from Friedman and Popescu (PDF) The algorithm can be used f

Christoph Molnar 326 Jan 2, 2023
Metric learning algorithms in Python

metric-learn: Metric Learning in Python metric-learn contains efficient Python implementations of several popular supervised and weakly-supervised met

null 1.3k Dec 28, 2022
[HELP REQUESTED] Generalized Additive Models in Python

pyGAM Generalized Additive Models in Python. Documentation Official pyGAM Documentation: Read the Docs Building interpretable models with Generalized

daniel servรฉn 747 Jan 5, 2023