BREP : Binary Search in plaintext and gzip files
Search large files in O(log n) time using binary search.
We support plaintext and Gzipped files.
grep
on a 2GB dataset !
Benchmark : 8x faster than brep
is usually faster than grep
for >1GB datasets.
Check tests/benchmark.py
to reproduce the results.
grep ^777 test.txt : 1.594 s (15 runs)
brep 777 test.txt : 206.8 ms (15 runs)
Installation
pip install brep
or pip install .
from this repo
Index your file
In order to conduct binary search, your file needs to be sorted.
We recommend GNU sort
, as it's multithreaded and supports large files.
LC_ALL=C sort -u -o output_file input_file
BREP supports compressed file in the GZIP format.
We recommend pigz
for quick multicore compression : pigz file
Usage
Provide 1 prefix search term and 1 filepath
brep 77777 test/large.gz
You can also search from our Python class
from brep import Search
for result in Search("77777", "test/large.gz"):
print(result)
Contribute
PRs are welcome!
Install dev dependencies: pip install -e .[dev]
Test and lint before submitting: pytest && flake8
Todo
- Reimplement in Rust
- Faster gz size estimation
- Search multiple strings at once