Simple archive format designed for quickly reading some files without extracting the entire archive

Jarred Sumner

Last update: Dec 30, 2022

Overview

hop

Simple archive format designed for quickly reading some files without extracting the entire archive. Possibly will be used in Bun.

25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)

Format	Random access	Fast extraction	Fast archiving	Compression	Encryption	Append
hop	✅	✅	✅	❌	❌	❌
tar	❌	✅	✅	❌	❌	✅
zip	✅ (when small)	❌	❌	✅	✅	✅

Features:

Faster at printing individual files than tar & zip (compression disabled)
Faster extraction than zip, comparable to tar (compression disabled)
Faster archiving than zip, comparable to tar (compression disabled)

Anti-features:

Single-threaded (but doesn't need to be)
I wrote it in about 3 hours and there are no tests
No checksums yet. Probably not a good idea to use this for untrusted data until that's fixed.
Ignores symlinks
Archives can't be larger than 4 GB (offsets and lengths are stored as uint32)
Archives are read-only and file names are not normalized across platforms
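
The 4 GB ceiling is consistent with the uint32 offsets and lengths in the metadata schema further down. The check below is a small sketch of that arithmetic, assuming the limit comes from those 32-bit fields; the helper name fits_in_uint32 is made up for illustration.

```python
import struct

UINT32_MAX = 2**32 - 1  # largest value a uint32 field can hold


def fits_in_uint32(offset: int) -> bool:
    """Return True if `offset` can be stored in an unsigned 32-bit field."""
    try:
        struct.pack("<I", offset)
        return True
    except struct.error:
        return False


# 2**32 bytes is exactly 4 GiB, so byte 2**32 is the first one that cannot
# be addressed by a uint32 offset.
assert UINT32_MAX + 1 == 4 * 1024**3
assert fits_in_uint32(UINT32_MAX)
assert not fits_in_uint32(UINT32_MAX + 1)
```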

Usage

Download the binary from /releases

To create an archive:

hop ./path-to-folder

To extract an archive:

hop archive.hop

To print one file from the archive:

hop archive.hop package.json

Why?

Why can't software read many tiny files with similar performance characteristics as individual files?

Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. Zip files are unacceptably slow when read from like a directory. Tar files extract quickly, but are slow at non-sequential access.
Reading directory entries (ls) in large directory trees is slow.

Some benchmarks

On macOS 12 with an M1X

Using the tigerbeetle GitHub repo as an example

Extracting: [benchmark figure]

Archiving: [benchmark figure]

On an Ubuntu AMD64 server

Extracting a node_modules folder: [benchmark figure]

Why faster?

It stores an array of hashes, one per file path, and the list of files is sorted lexicographically. This makes non-sequential access faster than tar, but can make creating new archives slower.
Does not store directories, only files
.hop files are read-only (more precisely, one could append but would have to rewrite all metadata)
Uses copy_file_range (a Linux syscall) to copy file data between file descriptors inside the kernel, without passing it through user space
Packed structs make serialization & deserialization very fast because there is very little encoding or decoding to do.
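
The hash array and sorted file list can be pictured with a small Python sketch. This is not hop's code: CRC-32 stands in for whatever hash hop uses (the README does not say), and the function names are hypothetical. It shows two ways to find one entry without walking the archive sequentially.

```python
import zlib
from bisect import bisect_left


def name_hash(path: str) -> int:
    # Stand-in hash (CRC-32). The hash hop really uses is not stated here.
    return zlib.crc32(path.encode("utf-8")) & 0xFFFFFFFF


def build_index(paths):
    # Sort the paths lexicographically and keep a parallel array of hashes.
    names = sorted(paths)
    hashes = [name_hash(p) for p in names]
    return names, hashes


def find_by_hash(names, hashes, path):
    # Scan the compact hash array; compare the full name only on a hash
    # match, which rules out collisions.
    wanted = name_hash(path)
    for i, h in enumerate(hashes):
        if h == wanted and names[i] == path:
            return i
    return None


def find_by_name(names, path):
    # Alternative: binary search over the sorted names.
    i = bisect_left(names, path)
    return i if i < len(names) and names[i] == path else None


names, hashes = build_index(["src/b.js", "package.json", "src/a.js"])
assert find_by_hash(names, hashes, "src/a.js") == find_by_name(names, "src/a.js")
```

Either lookup returns an index into the file list, which is all a reader needs to jump to that file's data. Keeping the list sorted is also why creating an archive costs more than appending to a tar stream.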

How does it work?

File contents go at the top, file metadata goes at the bottom.
This is the metadata it currently stores:

package Hop;

struct StringPointer {
    uint32 off;
    uint32 len;
}

struct File {
    StringPointer name;
    uint32 name_hash;
    uint32 chmod;
    uint32 mtime;
    uint32 ctime;
    StringPointer data;
}

message Archive {
    uint32 version = 1;
    uint32 content_offset = 2;
    File[] files = 3;
    uint32[] name_hashes = 4;
    byte[] metadata = 5;
}
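
To make the "contents at the top, metadata at the bottom" layout concrete, here is a toy Python sketch. It is not the real .hop encoding. The assumptions are: each File record is eight little-endian uint32 fields with no padding, and a 12-byte trailer (invented for this example) records where the metadata starts. The names build_archive and read_file are hypothetical.

```python
import struct

# Assumed record layout: name.off, name.len, name_hash, chmod, mtime, ctime,
# data.off, data.len as eight little-endian uint32 values (32 bytes total).
FILE_RECORD = struct.Struct("<8I")


def build_archive(files):
    """files: dict mapping path -> bytes. Returns a toy archive as bytes."""
    contents, names, records = bytearray(), bytearray(), bytearray()
    for path in sorted(files):
        data = files[path]
        encoded = path.encode("utf-8")
        records += FILE_RECORD.pack(
            len(names), len(encoded),   # name: offset and length in the name blob
            0, 0o644, 0, 0,             # name_hash, chmod, mtime, ctime (placeholders)
            len(contents), len(data),   # data: offset and length in the content area
        )
        names += encoded
        contents += data
    # Hypothetical trailer: content size, record count, name blob size.
    trailer = struct.pack("<3I", len(contents), len(files), len(names))
    return bytes(contents + records + names + trailer)


def read_file(archive, path):
    """Return one file's bytes without touching the other files' contents."""
    content_size, count, _names_size = struct.unpack_from("<3I", archive, len(archive) - 12)
    records_at = content_size
    names_at = records_at + count * FILE_RECORD.size
    wanted = path.encode("utf-8")
    for i in range(count):
        fields = FILE_RECORD.unpack_from(archive, records_at + i * FILE_RECORD.size)
        n_off, n_len, d_off, d_len = fields[0], fields[1], fields[6], fields[7]
        if archive[names_at + n_off : names_at + n_off + n_len] == wanted:
            return archive[d_off : d_off + d_len]
    return None


archive = build_archive({"package.json": b'{"name":"x"}', "index.js": b"console.log(1)"})
assert read_file(archive, "package.json") == b'{"name":"x"}'
```

Because every record has a fixed size, the reader can step through the metadata with plain offset arithmetic and then slice the wanted file straight out of the content area.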

Comments

just curious: why use zig?

I'm not familiar with zig and honestly heard of it today for the first time in my life. At first glance, it's a cool language I think!

Do you have any specific reasons to use zig?

opened by roeniss

front page results about "25x faster" are incorrect

The tests on the front page have extremely small total times, in the single-digit millisecond range. This means you are measuring mostly startup overhead and not actual extraction (the benchmarking harness tells you the same thing as well: "Command took less than 5 milliseconds, results may be inaccurate").

In order for performance tests to be accurate, you need to re-measure it with larger archives and more data, so that overall time is no longer dominated by program startup.

opened by theamk

Releases(v0.0.0)

v0.0.0(Nov 10, 2021)
