Simple archive format designed for quickly reading some files without extracting the entire archive

Overview

hop

Simple archive format designed for quickly reading some files without extracting the entire archive. Possibly will be used in Bun.

25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)

Format Random access Fast extraction Fast archiving Compression Encryption Append
hop
tar
zip (when small)

Features:

  • Faster at printing individual files than tar & zip (compression disabled)
  • Faster extraction than zip, comparable to tar (compression disabled)
  • Faster archiving than zip, comparable to tar (compression disabled)

Anti-features:

  • Single-threaded (but doesn't need to be)
  • I wrote it in about 3 hours and there are no tests
  • No checksums yet. Probably not a good idea to use this for untrusted data until that's fixed.
  • Ignores symlinks
  • Can't be larger than 4 GB
  • Archives are read-only and file names are not normalized across platforms

Usage

Download the binary from /releases

To create an archive:

hop ./path-to-folder

To extract an archive:

hop archive.hop

To print one file from the archive:

hop archive.hop package.json

Why?

Why can't software read many tiny files with similar performance characteristics as individual files?

  • Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. Zip files are unacceptably slow to read from like a directory. tar files extract quickly, but are slow at non-sequential access.
  • Reading directory entries (ls) in large directory trees is slow

Some benchmarks

On macOS 12 with an M1X

Using tigerbeetle github repo as an example

Extracting:

image

Archiving:

image

On an Ubuntu AMD64 server

Extracting a node_modules folder

image

Why faster?

  • It stores an array of hashes for each file path and the list of files are sorted lexigraphically. This makes non-sequential access faster than tar, but can make creating new archives slower.
  • Does not store directories, only files
  • .hop files are read-only (more precisely, one could append but would have to rewrite all metadata)
  • copy_file_range
  • packed struct makes serialization & deserialization very fast because there is very little encoding/decoding step.

How does it work?

  1. File contents go at the top, file metadata goes at the bottom
  2. This is the metadata it currently stores:
package Hop;

struct StringPointer {
    uint32 off;
    uint32 len;
}

struct File {
    StringPointer name;
    uint32 name_hash;
    uint32 chmod;
    uint32 mtime;
    uint32 ctime;
    StringPointer data;
}

message Archive {
    uint32 version = 1;
    uint32 content_offset = 2;
    File[] files = 3;
    uint32[] name_hashes = 4;
    byte[] metadata = 5;
}
You might also like...
🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.
🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.

🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.
Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series. The Fourier series can be animated and visualized, the function can be output as a two dimensional vector for Desmos and there is a method to output the coefficients as LaTeX code.

Extract an archive file (zip file or tar file) stored on AWS S3

S3 Extract Extract an archive file (zip file or tar file) stored on AWS S3. Details Downloads archive from S3 into memory, then extract and re-upload

csv2ir is a script to convert ir .csv files to .ir files for the flipper.

csv2ir csv2ir is a script to convert ir .csv files to .ir files for the flipper. For a repo of .ir files, please see https://github.com/logickworkshop

Various technical documentation, in electronically parseable format

a-pile-of-documentation Various technical documentation, in electronically parseable format. You will need Python 3 to run the scripts and programs in

A simple file module for creating, editing and saving files.

A simple file module for creating, editing and saving files.

A simple library for temporary storage of small files

TemporaryStorage An simple library for temporary storage of small files. Navigation Install Usage In Python console As a standalone application List o

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it
This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it. In the current state, it outputs a PrettyTable to txt file as well as the raw data from that table into a csv.

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem. These files are exposed either in their original format, or as PDF files that contain your annotations. This lets you manage files in the reMarkable Cloud using the same tools you use on your local system.

Comments
  • just curious: why use zig?

    just curious: why use zig?

    I'm not familiar with zig and honestly heard it today first in my life. At first glance, s cool language I think!

    Do you have any specific reasons to use zig?

    opened by roeniss 0
  • front page results about

    front page results about "25x faster" are incorrect

    The tests on the front page have extremely small total times, a single milliseconds range. This means you are measuring mostly startup overhead and not actually decompression (the benchmarking harness tells you the same thing as well: "Command took less than 5 milliseconds, results may be inaccurate")

    In order for performance tests to be accurate, you need to re-measure it with larger archives and more data, so that overall time is no longer dominated by program startup.

    opened by theamk 1
Owner
Jarred Sumner
Jarred Sumner
Dragon Age: Origins toolset to extract/build .erf files, patch language-specific .dlg files, and view the contents of files in the ERF or GFF format

DAOTools This is a set of tools for Dragon Age: Origins modding. It can patch the text lines of .dlg files, extract and build an .erf file, and view t

null 8 Dec 6, 2022
Some-tasks - Files for some of the tasks for the group sessions

Files for some of the tasks for the group sessions Here you can find some of the

INF142@UiB.no Computer Networks 0 Aug 25, 2022
Python code snippets for extracting PDB codes from .fasta files

Python_snippets_for_bioinformatics Python code snippets for extracting PDB codes from .fasta files If you have a single .fasta file for all protein se

Sofi-Mukhtar 3 Feb 9, 2022
Python interface for reading and appending tar files

Python interface for reading and appending tar files, while keeping a fast index for finding and reading files in the archive. This interface has been

Lawrence Livermore National Laboratory 1 Nov 12, 2021
Uproot is a library for reading and writing ROOT files in pure Python and NumPy.

Uproot is a library for reading and writing ROOT files in pure Python and NumPy. Unlike the standard C++ ROOT implementation, Uproot is only an I/O li

Scikit-HEP Project 164 Dec 31, 2022
An universal file format tool kit. At present will handle the ico format problem.

An universal file format tool kit. At present will handle the ico format problem.

Sadam·Sadik 1 Dec 26, 2021
Pti-file-format - Reverse engineering the Polyend Tracker instrument file format

pti-file-format Reverse engineering the Polyend Tracker instrument file format.

Jaap Roes 14 Dec 30, 2022
Nintendo Game Boy music assembly files parser into musicxml format

GBMusicParser Nintendo Game Boy music assembly files parser into musicxml format This python code will get an file.asm from the disassembly of a Game

null 1 Dec 11, 2021
Fast Python reader and editor for ASAM MDF / MF4 (Measurement Data Format) files

asammdf is a fast parser and editor for ASAM (Association for Standardization of Automation and Measuring Systems) MDF (Measurement Data Format) files

Daniel Hrisca 440 Dec 31, 2022
A Python library that provides basic functions to read / write Aseprite format files

A Python library that provides basic functions to read / write Aseprite format files

Joe Trewin 1 Jan 13, 2022