xpdt: eXPeditious Data Transfer
About
xpdt is (yet another) language for defining data types and generating code to serialize and deserialize them. It aims to produce code with little or no overhead and is based on fixed-length representations, which allow zero-copy deserialization and at-most-one-copy writes (source to buffer).
The generated C code, in particular, is highly optimized and often permits the elimination of data-copying for writes and enables optimizations such as loop-unrolling for fixed-length objects. This can lead to read speeds in excess of 500 million objects per second (~1.8 nsec per object).
Examples
The xpdt source language looks similar to C struct definitions:
struct timestamp {
    u32 tv_sec;
    u32 tv_nsec;
};

struct point {
    i32 x;
    i32 y;
    i32 z;
};

struct line {
    timestamp time;
    point line_start;
    point line_end;
    bytes comment;
};
Fixed-width integer types from 8 to 128 bits are supported, along with the bytes type, which is a variable-length sequence of bytes.
Target Languages
The following target languages are currently supported:
- C
- Python
The C code is very highly optimized.
The Python code is about as well optimized for CPython as I can make it. It uses typed NamedTuple for objects, which has some small overhead over regular tuples, and it uses struct.Struct to do the packing/unpacking. I have also code-golfed the generated bytecodes down to what I think is minimal given the design constraints. As a result, performance of the pure Python code is comparable to a JSON library implemented in C or Rust.
For better performance in Python, it may be desirable to develop a Cython target. In some instances CFFI structs may be more performant since they can avoid the creation/destruction of an object for each record.
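As a rough illustration of the struct-overlay idea, the sketch below uses ctypes from the standard library as a stand-in for CFFI (a swapped-in technique, not something xpdt generates). The CPoint class and the little-endian layout are assumptions.

```python
import ctypes
import struct


class CPoint(ctypes.LittleEndianStructure):
    # Three 4-byte fields, so the layout has no padding and matches a packed struct.
    _fields_ = [
        ("x", ctypes.c_int32),
        ("y", ctypes.c_int32),
        ("z", ctypes.c_int32),
    ]


# A buffer holding three records; fill the second one as if it came off the wire.
buf = bytearray(3 * ctypes.sizeof(CPoint))
struct.pack_into("<iii", buf, ctypes.sizeof(CPoint), 7, 8, 9)

# Overlay an array of structs on the buffer: no bytes are copied.
points = (CPoint * 3).from_buffer(buf)
print(points[1].x, points[1].y, points[1].z)   # 7 8 9

# Writes through the overlay land directly in the underlying buffer.
points[2].z = -1
print(bytes(buf[32:36]))                       # b'\xff\xff\xff\xff'
```

Fields are only converted to Python integers when they are accessed, which is the property that makes overlays attractive for sparse reads.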
Target languages are implemented purely as jinja2 templates.
Serialization format
The serialization format for fixed-length objects is simply a packed C struct.
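Because every fixed-length record has the same size, a buffer of records can be indexed by plain offset arithmetic. The following is a minimal sketch of that property for the point struct, assuming little-endian byte order; the POINT name is illustrative only.

```python
import struct

# point: three packed i32 fields, 12 bytes per record (byte order assumed).
POINT = struct.Struct("<iii")

# Four records written back to back, with no framing between them.
buf = b"".join(POINT.pack(i, i * 2, -i) for i in range(4))
print(len(buf))   # 48

# Sequential read: iter_unpack walks the buffer one record at a time.
records = list(POINT.iter_unpack(buf))
print(records[3])   # (3, 6, -3)

# Random access: record n starts at n * POINT.size.
third = POINT.unpack_from(buf, 2 * POINT.size)
print(third)        # (2, 4, -2)
```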
For any object which contains bytes type fields:
- a 32-bit unsigned record length is prepended to the struct
- all bytes type fields are converted to u32 and contain the length of the bytes
- all bytes contents are appended after the struct in the order in which they appear
For example, a line from the example above with the comment "Hello" would be serialized as:
u32 tot_len # = 41
u32 time.tv_sec
u32 time.tv_nsec
i32 line_start.x
i32 line_start.y
i32 line_start.z
i32 line_end.x
i32 line_end.y
i32 line_end.z
u32 comment # = 5
u8 'H'
u8 'e'
u8 'l'
u8 'l'
u8 'o'
Features
The feature-set is, as of now, pretty slim.
There are no array / sequence / map types, and no keyed unions.
Support for such things may be added in future provided that suitable implementations exist. An implementation is suitable if:
- It admits a zero (or close to zero) overhead implementation
- It causes no overhead when the feature isn't being used
License
The compiler is released under the GPLv3.
The C support code/headers are released under the MIT license.
The generated code is yours.