This introduces a DOM-like API that parses JSON with forward-only streaming--combining the ease of traditional DOM parsers with the performance of SAX. One major virtue of this approach is that we know what type the user wants a value to be before we parse it, so we eliminate the typical "type check" employed by DOM and SAX, instead parsing with a parser dedicated to that type.
It is far faster than using the DOM (4.0 GB/s vs. 2.3 GB/s). It is also much easier than the SAX approach (15 lines for a Tweet reader versus 300 in SAX), as well as being slightly faster (results on the Clang 10 compiler, on a Skylake machine):
| Benchmark | Generic DOM | SAX | On-Demand |
|------------|---|---|---|
| Tweets | 2.3 GB/s | 3.5 GB/s | 4.0 GB/s |
| LargeRandom | 0.50 GB/s | 0.71 GB/s | 0.71 GB/s |
Examples
The benchmarks have some real, fairly simple examples with equivalent DOM and SAJ implementations.
points.json
This parses a giant array of points: [ { "x": 1.0, "y": 1.2, "z": 1.3 }, ... ]
ondemand::parser parser;
std::vector<my_point> container;
for (ondemand::object p : parser.parse(json)) {
container.emplace_back(my_point{p["x"], p["y"], p["z"]});
}
twitter.json
This parses a list of Tweets (from the Twitter API) into C++ structs, serializing text, screen names, ids, and favore/retweet count.
// Walk the document, parsing the tweets as we go
std::vector<twitter::tweet> tweets;
ondemand::parser parser;
auto doc = parser.parse(json);
for (ondemand::object tweet : doc["statuses"]) {
tweets.emplace_back(twitter::tweet{
tweet["created_at"],
tweet["id"],
tweet["text"],
nullable_int(tweet["in_reply_to_status_id"]),
read_user(tweet["user"]),
tweet["retweet_count"],
tweet["favorite_count"]
});
}
With these auxiliary functions:
simdjson_really_inline twitter::twitter_user read_user(ondemand::object && u) {
return { u["id"], u["screen_name"] };
}
simdjson_really_inline uint64_t nullable_int(ondemand::value && value) {
if (value.is_null()) { return 0; }
return std::move(value);
}
Principles
- Inline Iteration: Iterating arrays or objects is done through exactly the kind of for loop you'd expect. You can nest them, iterating an array of arrays or an array of objects with array values through nested for loops. Under the hood, the iterator checks for the "[", passes you the index of the value, and when you finish with a value, it checks for "," and passes the next value until it sees "]".
- Forward-Only Iteration: To prevent reiteration of the same values and to keep the number of variables down (literally), only a single index is maintained and everything uses it (even if you have nested for loops). This means when you're going through an array of arrays, for example, that the inner array loop will advance the index to the next comma, and the array can just pick it up and look at it.
- Inline, On-Demand Parsing: Parses exactly the type you want and nothing else. Because it's inline this means way fewer branches per value, and they're more predictable as well. For example, if you ask for an unsigned integer, we just start parsing digits. If there were no digits, we toss an error. With a generic parser you have to do a big switch statement checking whether it's a digit before you even start parsing, so it's both an extra branch, and a hard to predict one (because you are also checking other values).
- Streaming Output: This is streaming in the sense of output, but not input. You still have to pass the whole file as input; it just doesn't parse anything besides the marks until you ask. This also means the parser memory has to grow as the file grows (because of structural indexes). Streaming input is a whole nother problem, however.
- Validate What You Use: It deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed so that there is no confusion over whether you got the right value. But it leaves the possibility that the JSON as a whole is invalid. A full-validation mode is possible and planned, but I think this mode should be the default, personally, or at least pretty heavily advertised. Full-validation mode should really only be for debug.
- Avoids Genericity Pitfalls I think it avoids the pitfalls of generating a generic DOM, which are that you don't know what to expect in the file so you can't tune the parser to expect it (and thus branch mispredictions abound). Even SAX falls into this, though definitely less than others: the core of SAX still has to have a giant switch statement in a loop, and that's just going to be inherently branchy.
Impact on DOM parse (skylake haswell gcc10.2)
As expected / hoped, this is entirely neutral with respect to our existing performance, all the way down to identical instruction count:
| File | Blocks | master Cycles | PR Cycles | +Throughput | master Instr. | PR Instr. | -Instr. |
| --- | --: | --: | --: | --: | --: | --: | --: |
| gsoc-2018 | 51997 | 72.7 | 71.6 | 1% | 160.1 | 160.1 | 0% |
| instruments | 3442 | 108.2 | 107.3 | 0% | 370.6 | 370.6 | 0% |
| github_events | 1017 | 78.6 | 78.4 | 0% | 256.7 | 256.7 | 0% |
| numbers | 2345 | 284.1 | 283.4 | 0% | 791.0 | 791.0 | 0% |
| apache_builds | 1988 | 84.9 | 84.7 | 0% | 295.1 | 295.1 | 0% |
| mesh | 11306 | 319.3 | 318.5 | 0% | 984.2 | 984.2 | 0% |
| twitterescaped | 8787 | 188.1 | 187.8 | 0% | 493.0 | 493.0 | 0% |
| marine_ik | 46616 | 318.4 | 318.1 | 0% | 895.7 | 895.7 | 0% |
| update-center | 8330 | 113.2 | 113.2 | 0% | 326.5 | 326.5 | 0% |
| mesh.pretty | 24646 | 189.0 | 188.9 | 0% | 571.3 | 571.3 | 0% |
| twitter | 9867 | 92.0 | 92.0 | 0% | 281.6 | 281.6 | 0% |
| citm_catalog | 26987 | 81.7 | 81.7 | 0% | 287.5 | 287.5 | 0% |
| canada | 35172 | 311.2 | 311.4 | 0% | 946.7 | 946.7 | 0% |
| semanticscholar-corpus | 134271 | 108.8 | 109.0 | 0% | 274.4 | 274.4 | 0% |
| random | 7976 | 141.2 | 142.2 | 0% | 482.1 | 482.1 | 0% |
Design
The primary classes are:
ondemand::parser
: The equivalent of dom::parser
.
- This handles allocation and parse calls, and keeps memory around between parses.
ondemand::document
: Holds iteration state. Can be cast to array, object or scalar.
- Forward-Only: This is a forward-only input iterator. You may only get the document's value once. Once you have retrieved an array, object, or scalar, subsequent attempts to get other values fail.
- Iteration Owner: If you let go of the document object, iteration will fail. This is not checked, but the failure will be really obvious :) Moves are disallowed after iteration has started, because array/object/value all point to the document.
- Locks the Parser: Only one iteration is allowed at a time. If you attempt to parse a new document before destroying the old one, you will get an error. document cannot be copied.
ondemand::array
: Manages array iteration.
- Forward-Only: Retrieving the same array element twice will fail. There is currently no check on whether it has handed out a value, it's just something you shouldn't do multiple times without
++
. This is consistent with C++'s "input iterator" concept.
- Child Blindness: Once you get an array element, it has no control over whether you do anything with it. For example, you could decide not to handle a value if it's an array or object. To control for this, when you
++
the array checks whether there is an unfinished array or object by checking if we're at the current depth. If so, it skips tokens until it's returned to the current depth.
- Chainable: We allow you to pass an error into the iterator, which it will yield on its first iteration and then stop. This allows error chaining to make its way all the way to the loop:
for (auto o : parser.parse(json))
works!
- C++ Iterator: Because C++ breaks what could be a single next() call into !=, ++, and * calls, we have to break up the algorithm into parts and keep some state between them.
operator *
Reads key, :
and a value, advancing exactly three times. Returns an error if key or :
is invalid. value
takes up the slack from there, incrementing depth if there is a [
or {
even if the user doesn't use the value. If there is an error to report, it decrements depth so that the loop will terminate, and returns it.
operator ++
Checks if we have a ]
(decrementing depth) or ,
(setting error if no comma).
operator !=
lets the loop continue if current depth >= array depth.
- Zero Overhead: It keeps state, but that state is all either constant and knowable at compile time, or only has an effect for a single iteration (the first or last). Our goal is to get the compiler to elide them all.
document
: This member variable is in many objects, but always has the same value. It is likely the compiler will elide it.
at_start
: This member variable is set the first time true. If it is true, we check for }
before the first !=
and then set it to false
.
error
: Whether this member variable is passed in initially or detected by ++
, error
has no effect unless it is nonzero, and when it is zero, the loop always terminates after the next iteration. We hope this will be elided, therefore, into a trailing control flow.
depth
: This member variable is constant and knowable at compile time, because depth
will have been incremented a constant number of times based on how many nested objects you have. Whether the compiler recognizes this is anybody's game, however :/
ondemand::object
: Manages object iteration and object["key"]
.
- Forward-Only:
[]
will never go backwards; you must do []
in order to get all the fields you want. Retrieving the same field twice with *
will fail. There is currently no check on whether it has handed out a value, it's just something you shouldn't do multiple times without ++
. This is consistent with C++'s "input iterator" concept.
- Child Blindness: Once you get a field or value, the object has no control over whether you do anything with it. For example, you could decide not to handle a value if it's an array or object. To control for this, when you
++
or do a second []
, the array checks whether there is an unfinished array or object by checking if we're at the current depth. If so, it skips tokens until it's returned to the current depth.
- Chainable: We allow you to pass an error into the iterator, which it will yield on its first iteration and then stop. This allows error chaining to make its way all the way to the loop:
for (auto o : parser.parse(json))
works!
- C++ Iterator: Because C++ breaks what could be a single next() call into !=, ++, and * calls, we have to break up the algorithm into parts and keep some state between them.
operator *
Reads key, :
and a value, advancing exactly three times. Returns an error if key or :
is invalid. value
takes up the slack from there, incrementing depth if there is a [
or {
even if the user doesn't use the value. If there is an error to report, it decrements depth so that the loop will terminate, and returns it.
operator ++
Checks if we have a ]
(decrementing depth) or ,
(setting error if no comma).
operator !=
lets the loop continue if current depth >= array depth.
- Zero Overhead: It keeps state, but that state is all either constant and knowable at compile time, or only has an effect for a single iteration (the first or last). Our goal is to get the compiler to elide them all.
document
: This member variable is in many objects, but always has the same value. It is likely the compiler will elide it.
at_start
: This member variable is set the first time true. If it is true, we check for }
before the first !=
and then set it to false
therafter. We expect this to be elided in favor of leading control flow.
error
: Whether this member variable is passed in initially or detected by ++
, error
has no effect unless it is nonzero, and when it is zero, the loop always terminates after the next iteration. We hope this will be elided, therefore, into a trailing control flow.
depth
: This member variable is constant and knowable at compile time, because depth
will have been incremented a constant number of times based on how many nested objects you have. Whether the compiler recognizes this is anybody's game, however :/
ondemand::value
: A transient object giving you the opportunity to convert a JSON value into an array, object, or scalar.
- Forward-Only: This is transient: its value can only be retrieved once. Further retrievals will fail. It is an error to keep multiple value objects around at once (it is also hard to do, but possible).
- Skippable: If you don't use the value (for example, if you have a field and don't care about the key), the destructor will check if it's
{
or [
and increment depth, to keep things consistent.
ondemand::raw_json_string
: Represents the raw json string inside the buffer, terminated by "
. This allows you to inspect and do comparisons against the raw, escaped json string without performance penalty.
- The
unescape()
method on it will parse it (escapes and all) into a string buffer of your choice.
ondemand::token_iterator
: Internal. Used to actually track structural iteration. document is the only way users will see this.
Concerns / Rough Edges
- Compiler Flags: You have to compile all of simdjson.cpp/simdjson.h with the target flags. Otherwise, simdjson_result<> and other things in the
include/
headers can't be inlined with your Haswell-specific code.
- Heavy Reliance On Optimizers: The object and array structs have four member variables each, which I'm expecting the compiler to elide completely in normal cases. I think these are largely unavoidable given C++'s iterator design. Without that elision, register pressure will be intense and stuff will get shoved into memory. I made some assumptions about how optimizers should work, particularly that it can deduce the depth value since it's constant, and that it can elide variables like
at_start
into control flow since it only affects the header of the loop.
- Unconsumed Values: Because the user drives the parse, we hand them array, object and value objects which they can then iterate or call get_string/etc. on. However, if they don't do this, or only partially iterate, then the structural_index and depth won't get updated. I added a destructor to
value
to check whether the value has been used, and check for start/end array if not. We also keep track of depth so that if child iterations are left partially iterated, we can skip everything until we get back to our own depth. All of this can be optimized away if the compiler is smart enough ... but I'm not convinced it will be :) We'll see.
Performance
clang does a lot better on the (relatively complex) twitter benchmark with ondemand than g++. I assume this has to do with its affinity for SSA optimizations:
Haswell clang10.0 (Skylake)
| Benchmark | Generic DOM | On-Demand | SAX |
|------------|---|---|---|
| PartialTweets | 2.3 GB/s | 4.0 GB/s | 3.5 GB/s |
| LargeRandom | 0.50 GB/s | 0.71 GB/s | 0.71 GB/s |
Haswell gcc10 (Skylake)
GCC is more or less on par with clang:
| Benchmark | DOM | On-Demand | SAX |
|------------|---|---|---|
| PartialTweets | 2.5 GB/s | 3.8 GB/s | 3.7 GB/s |
| LargeRandom | 0.50 GB/s | 0.78 GB/s | 0.74 GB/s |
Running the Benchmark
You can see several examples in benchmark/bench_ondemand.cpp. To compile for your native platform, do this:
rm -rf build
cd build
cmake -DCMAKE_CXX_FLAGS="-march=native" ..
make bench_ondemand
benchmark/bench_ondemand --benchmark_counters_tabular=true
Raw Data: Haswell clang10.0 (Skylake)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss bytes bytes_per_second cache_miss cache_ref cycles cycles_per_byte docs_per_sec frequency instructions instructions_per_byte instructions_per_cycle items items_per_second
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PartialTweets<OnDemand> 165179 ns 165149 ns 4237 1.538k 4.056G 0 58.152k 574.945k 0.910422 6.42265k 3.69267G 1.93943M 3.07108 3.37325 642.265k 1.71353k 631.515k 3.56129G/s 8.96861m 58.1731k 580.103k 0.91859 6.05512k/s 3.5126G/s 1.93943M 3.07108 3.34325 100 605.512k/s [best: throughput= 4.06 GB/s doc_throughput= 6422 docs/s instructions= 1939432 cycles= 574945 branch_miss= 1538 cache_miss= 0 cache_ref= 58152 items= 100 avg_time= 157210 ns]
PartialTweets<Iter> 182774 ns 182773 ns 3828 2.825k 3.66457G 0 58.28k 636.406k 1.00774 5.80282k 3.69295G 1.84M 2.91363 2.89124 580.282k 3.08751k 631.515k 3.21789G/s 3.9185m 58.158k 644.682k 1.02085 5.47127k/s 3.52723G/s 1.84M 2.91363 2.85412 100 547.127k/s [best: throughput= 3.66 GB/s doc_throughput= 5802 docs/s instructions= 1840002 cycles= 636406 branch_miss= 2825 cache_miss= 0 cache_ref= 58280 items= 100 avg_time= 174683 ns]
PartialTweets<Dom> 282433 ns 282382 ns 2480 3.149k 2.31979G 0 92.404k 1005.16k 1.59167 3.67338k 3.69235G 2.9721M 4.7063 2.95682 367.338k 3.2693k 631.515k 2.0828G/s 0.439516 92.8304k 1011.84k 1.60225 3.5413k/s 3.58324G/s 2.9721M 4.7063 2.93731 100 354.13k/s [best: throughput= 2.32 GB/s doc_throughput= 3673 docs/s instructions= 2972097 cycles= 1005165 branch_miss= 3149 cache_miss= 0 cache_ref= 92404 items= 100 avg_time= 274248 ns]
Creating a source file spanning 44921 KB
LargeRandom<Dom> 91468995 ns 91467607 ns 8 968.537k 503.312M 10.8974M 15.4273M 337.109M 7.32865 10.9419 3.6886G 1041.23M 22.6361 3.08872 10.9419M 967.752k 45.9988M 479.6M/s 10.9539M 15.4278M 337.335M 7.33356 10.9328/s 3.68803G/s 1041.23M 22.6361 3.08665 1000k 10.9328M/s [best: throughput= 0.50 GB/s doc_throughput= 10 docs/s instructions= 1041233885 cycles= 337109080 branch_miss= 968537 cache_miss=10897426 cache_ref= 15427295 items= 1000000 avg_time= 91455371 ns]
LargeRandomSum<Dom> 89605540 ns 89588793 ns 8 968.435k 514.102M 10.3415M 14.5719M 330.07M 7.17562 11.1764 3.689G 1022.23M 22.2231 3.09702 11.1764M 968.274k 45.9988M 489.658M/s 10.3269M 14.5726M 330.174M 7.17789 11.1621/s 3.68544G/s 1022.23M 22.2231 3.09604 1000k 11.1621M/s [best: throughput= 0.51 GB/s doc_throughput= 11 docs/s instructions= 1022233883 cycles= 330069936 branch_miss= 968435 cache_miss=10341543 cache_ref= 14571861 items= 1000000 avg_time= 89591929 ns]
LargeRandom<OnDemand> 64622779 ns 64622223 ns 11 929.755k 712.493M 5.63348M 8.01073M 238.12M 5.17666 15.4894 3.68834G 648.69M 14.1023 2.72422 15.4894M 930.709k 45.9988M 678.835M/s 5.66211M 8.01242M 238.304M 5.18066 15.4746/s 3.68765G/s 648.69M 14.1023 2.72212 1000k 15.4746M/s [best: throughput= 0.71 GB/s doc_throughput= 15 docs/s instructions= 648690465 cycles= 238120090 branch_miss= 929755 cache_miss= 5633485 cache_ref= 8010731 items= 1000000 avg_time= 64609673 ns]
LargeRandomSum<OnDemand> 64569064 ns 64568969 ns 11 951.323k 714.434M 4.98723M 7.12271M 237.524M 5.16371 15.5316 3.68913G 641.69M 13.9502 2.70158 15.5316M 958.366k 45.9988M 679.395M/s 5.0381M 7.12423M 238.12M 5.17667 15.4873/s 3.68785G/s 641.69M 13.9502 2.69481 1000k 15.4873M/s [best: throughput= 0.71 GB/s doc_throughput= 15 docs/s instructions= 641690192 cycles= 237524226 branch_miss= 951323 cache_miss= 4987231 cache_ref= 7122709 items= 1000000 avg_time= 64556393 ns]
LargeRandom<Iter> 60862746 ns 60863442 ns 11 990.089k 757.035M 5.62286M 7.98759M 224.156M 4.87309 16.4577 3.6891G 581.692M 12.6458 2.59503 16.4577M 995.475k 45.9988M 720.759M/s 5.65907M 7.98859M 224.456M 4.8796 16.4302/s 3.68786G/s 581.692M 12.6458 2.59157 1000k 16.4302M/s [best: throughput= 0.76 GB/s doc_throughput= 16 docs/s instructions= 581691751 cycles= 224156097 branch_miss= 990089 cache_miss= 5622863 cache_ref= 7987593 items= 1000000 avg_time= 60850056 ns]
LargeRandomSum<Iter> 59778441 ns 59777987 ns 12 981.57k 770.555M 5.01271M 7.15428M 220.194M 4.78696 16.7516 3.68861G 570.691M 12.4067 2.59177 16.7516M 986.227k 45.9988M 733.846M/s 5.05344M 7.15511M 220.459M 4.79271 16.7286/s 3.68796G/s 570.691M 12.4067 2.58865 1000k 16.7286M/s [best: throughput= 0.77 GB/s doc_throughput= 16 docs/s instructions= 570691393 cycles= 220194110 branch_miss= 981570 cache_miss= 5012710 cache_ref= 7154280 items= 1000000 avg_time= 59765975 ns]
Creating a source file spanning 134087 KB
Kostya<Dom> 94265351 ns 94266428 ns 7 1045.8k 1.45789G 15.7114M 22.3239M 347.272M 2.5292 10.6179 3.6873G 975.883M 7.10741 2.81014 5.56684M 1045.46k 137.305M 1.35653G/s 15.7288M 22.2886M 347.637M 2.53186 10.6082/s 3.68782G/s 975.883M 7.10741 2.80719 524.288k 5.56177M/s [best: throughput= 1.46 GB/s doc_throughput= 10 docs/s instructions= 975882556 cycles= 347271978 branch_miss= 1045795 cache_miss=15711430 cache_ref= 22323923 items= 524288 avg_time= 94251475 ns]
KostyaSum<Dom> 93482419 ns 93481371 ns 7 1048.42k 1.47166G 15.4012M 21.9002M 344.184M 2.50671 10.7182 3.68902G 970.115M 7.0654 2.8186 5.6194M 1049.48k 137.305M 1.36792G/s 15.4135M 21.7919M 344.773M 2.511 10.6973/s 3.68815G/s 970.115M 7.0654 2.81378 524.288k 5.60848M/s [best: throughput= 1.47 GB/s doc_throughput= 10 docs/s instructions= 970115386 cycles= 344183956 branch_miss= 1048419 cache_miss=15401206 cache_ref= 21900243 items= 524288 avg_time= 93468372 ns]
Kostya<OnDemand> 59869079 ns 59857167 ns 12 468.968k 2.2974G 9.97175M 13.9606M 220.483M 1.60579 16.7321 3.68914G 635.858M 4.63099 2.88393 8.77243M 472.341k 137.305M 2.13634G/s 9.99057M 13.9227M 220.746M 1.60771 16.7064/s 3.68788G/s 635.858M 4.63099 2.88049 524.288k 8.75898M/s [best: throughput= 2.30 GB/s doc_throughput= 16 docs/s instructions= 635857684 cycles= 220483050 branch_miss= 468968 cache_miss= 9971751 cache_ref= 13960563 items= 524288 avg_time= 59856199 ns]
KostyaSum<OnDemand> 60280325 ns 60279422 ns 12 469.529k 2.27999G 9.67866M 13.4947M 222.157M 1.61799 16.6053 3.68898G 630.615M 4.5928 2.83859 8.70594M 469.704k 137.305M 2.12137G/s 9.67636M 13.4264M 222.299M 1.61902 16.5894/s 3.68781G/s 630.615M 4.5928 2.83678 524.288k 8.69763M/s [best: throughput= 2.28 GB/s doc_throughput= 16 docs/s instructions= 630614946 cycles= 222157441 branch_miss= 469529 cache_miss= 9678657 cache_ref= 13494711 items= 524288 avg_time= 60267368 ns]
Kostya<Iter> 61758017 ns 61757605 ns 11 497.377k 2.22614G 9.95741M 13.9293M 227.537M 1.65716 16.2131 3.68908G 606.497M 4.41715 2.66549 8.50035M 497.937k 137.305M 2.0706G/s 9.99912M 13.9207M 227.752M 1.65873 16.1923/s 3.68785G/s 606.497M 4.41715 2.66297 524.288k 8.48945M/s [best: throughput= 2.23 GB/s doc_throughput= 16 docs/s instructions= 606497405 cycles= 227536701 branch_miss= 497377 cache_miss= 9957411 cache_ref= 13929258 items= 524288 avg_time= 61745137 ns]
KostyaSum<Iter> 59370790 ns 59359390 ns 12 464.597k 2.31499G 9.64345M 13.4522M 218.801M 1.59354 16.8602 3.68902G 597.061M 4.34843 2.72878 8.83958M 464.774k 137.305M 2.15425G/s 9.67782M 13.4532M 218.89M 1.59419 16.8465/s 3.68754G/s 597.061M 4.34843 2.72767 524.288k 8.83244M/s [best: throughput= 2.31 GB/s doc_throughput= 16 docs/s instructions= 597060518 cycles= 218801084 branch_miss= 464597 cache_miss= 9643448 cache_ref= 13452173 items= 524288 avg_time= 59357614 ns]
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss bytes bytes_per_second cache_miss cache_ref cycles cycles_per_byte docs_per_sec frequency instructions instructions_per_byte instructions_per_cycle items items_per_second
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PartialTweets<Sax> 191072 ns 191072 ns 3670 1.323k 3.47984G 0 59.009k 670.174k 1.06122 5.51031k 3.69287G 2.17912M 3.45062 3.25157 551.031k 1.47164k 631.515k 3.07814G/s 0.013624 58.9976k 674.469k 1.06802 5.23364k/s 3.52993G/s 2.17912M 3.45062 3.23086 100 523.364k/s [best: throughput= 3.48 GB/s doc_throughput= 5510 docs/s instructions= 2179117 cycles= 670174 branch_miss= 1323 cache_miss= 0 cache_ref= 59009 items= 100 avg_time= 182754 ns]
Creating a source file spanning 44921 KB
LargeRandom<Dom> 91475275 ns 91476347 ns 8 933.026k 503.328M 11.1398M 15.6303M 337.144M 7.32942 10.9422 3.6891G 1040.23M 22.6144 3.08543 10.9422M 932.62k 45.9988M 479.554M/s 11.1208M 15.6317M 337.404M 7.33507 10.9318/s 3.68843G/s 1040.23M 22.6144 3.08305 1000k 10.9318M/s [best: throughput= 0.50 GB/s doc_throughput= 10 docs/s instructions= 1040233883 cycles= 337144298 branch_miss= 933026 cache_miss=11139804 cache_ref= 15630295 items= 1000000 avg_time= 91461980 ns]
LargeRandomSum<Dom> 90329253 ns 90329264 ns 8 932.216k 509.83M 10.4788M 14.7649M 332.849M 7.23603 11.0836 3.68915G 1022.23M 22.2231 3.07117 11.0836M 932.897k 45.9988M 485.644M/s 10.522M 14.7658M 333.176M 7.24315 11.0706/s 3.68846G/s 1022.23M 22.2231 3.06815 1000k 11.0706M/s [best: throughput= 0.51 GB/s doc_throughput= 11 docs/s instructions= 1022233881 cycles= 332848515 branch_miss= 932216 cache_miss=10478836 cache_ref= 14764904 items= 1000000 avg_time= 90315728 ns]
LargeRandom<Sax> 67111081 ns 67111854 ns 10 973.014k 686.397M 5.6914M 8.09484M 247.225M 5.3746 14.9221 3.6891G 675.692M 14.6893 2.73311 14.9221M 975.305k 45.9988M 653.653M/s 5.7521M 8.0964M 247.525M 5.38113 14.9005/s 3.68825G/s 675.692M 14.6893 2.72979 1000k 14.9005M/s [best: throughput= 0.69 GB/s doc_throughput= 14 docs/s instructions= 675691776 cycles= 247224897 branch_miss= 973014 cache_miss= 5691399 cache_ref= 8094842 items= 1000000 avg_time= 67098397 ns]
Raw Data: Haswell gcc 10.2 (Skylake)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss bytes bytes_per_second cache_miss cache_ref cycles cycles_per_byte docs_per_sec frequency instructions instructions_per_byte instructions_per_cycle items items_per_second
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PartialTweets<OnDemand> 177528 ns 177530 ns 3941 1.657k 3.74809G 0 54.903k 622.194k 0.98524 5.93507k 3.69277G 2.11159M 3.34369 3.39378 593.507k 1.80355k 631.515k 3.31293G/s 0.0284192 54.9838k 626.235k 0.991639 5.63285k/s 3.52748G/s 2.11159M 3.34369 3.37188 100 563.285k/s [best: throughput= 3.75 GB/s doc_throughput= 5935 docs/s instructions= 2111588 cycles= 622194 branch_miss= 1657 cache_miss= 0 cache_ref= 54903 items= 100 avg_time= 169681 ns]
PartialTweets<Iter> 384499 ns 384501 ns 1819 2.968k 1.68681G 0 55.31k 1.38216M 2.18865 2.67105k 3.69182G 4.40862M 6.98102 3.18965 267.105k 3.19399k 631.515k 1.52963G/s 0.239142 55.4276k 1.38969M 2.20056 2.60078k/s 3.61426G/s 4.40862M 6.98102 3.17239 100 260.078k/s [best: throughput= 1.69 GB/s doc_throughput= 2671 docs/s instructions= 4408621 cycles= 1382163 branch_miss= 2968 cache_miss= 0 cache_ref= 55310 items= 100 avg_time= 376595 ns]
PartialTweets<Dom> 266111 ns 266112 ns 2630 3.53k 2.46696G 0 87.587k 945.164k 1.49666 3.90642k 3.69221G 2.91945M 4.62293 3.08883 390.642k 3.70404k 631.515k 2.21014G/s 0.0285171 87.3744k 952.087k 1.50762 3.75782k/s 3.57777G/s 2.91945M 4.62293 3.06637 100 375.782k/s [best: throughput= 2.47 GB/s doc_throughput= 3906 docs/s instructions= 2919449 cycles= 945164 branch_miss= 3530 cache_miss= 0 cache_ref= 87587 items= 100 avg_time= 257984 ns]
Creating a source file spanning 44921 KB
LargeRandom<Dom> 91849988 ns 91849888 ns 8 889.651k 502.486M 10.9371M 15.2509M 337.691M 7.34129 10.9239 3.6889G 970.316M 21.0944 2.87339 10.9239M 889.185k 45.9988M 477.604M/s 11.0046M 15.2543M 338.763M 7.36461 10.8873/s 3.68822G/s 970.316M 21.0944 2.86429 1000k 10.8873M/s [best: throughput= 0.50 GB/s doc_throughput= 10 docs/s instructions= 970315574 cycles= 337690544 branch_miss= 889651 cache_miss=10937148 cache_ref= 15250891 items= 1000000 avg_time= 91836392 ns]
LargeRandomSum<Dom> 92188755 ns 92188563 ns 8 889.635k 499.861M 10.36M 14.4073M 339.484M 7.38027 10.8668 3.68911G 974.316M 21.1813 2.86999 10.8668M 889.272k 45.9988M 475.849M/s 10.4213M 14.4099M 340.028M 7.3921 10.8473/s 3.68839G/s 974.316M 21.1813 2.8654 1000k 10.8473M/s [best: throughput= 0.50 GB/s doc_throughput= 10 docs/s instructions= 974315578 cycles= 339483518 branch_miss= 889635 cache_miss=10360012 cache_ref= 14407265 items= 1000000 avg_time= 92175359 ns]
LargeRandom<OnDemand> 58992605 ns 58991725 ns 12 869.377k 781.677M 5.62944M 7.89403M 217.093M 4.71954 16.9934 3.68916G 615.695M 13.385 2.83609 16.9934M 868.032k 45.9988M 743.627M/s 5.67331M 7.89681M 217.57M 4.72991 16.9515/s 3.68815G/s 615.695M 13.385 2.82987 1000k 16.9515M/s [best: throughput= 0.78 GB/s doc_throughput= 16 docs/s instructions= 615694894 cycles= 217093162 branch_miss= 869377 cache_miss= 5629445 cache_ref= 7894027 items= 1000000 avg_time= 58980167 ns]
LargeRandomSum<OnDemand> 56594492 ns 56594225 ns 12 876.324k 813.963M 5.01997M 7.05425M 208.485M 4.5324 17.6953 3.68921G 606.695M 13.1894 2.91002 17.6953M 876.066k 45.9988M 775.129M/s 5.066M 7.05672M 208.735M 4.53784 17.6696/s 3.68828G/s 606.695M 13.1894 2.90653 1000k 17.6696M/s [best: throughput= 0.81 GB/s doc_throughput= 17 docs/s instructions= 606694893 cycles= 208485037 branch_miss= 876324 cache_miss= 5019967 cache_ref= 7054246 items= 1000000 avg_time= 56582402 ns]
LargeRandom<Iter> 53364551 ns 53364201 ns 13 894.323k 863.44M 5.63805M 7.89683M 196.541M 4.27273 18.7709 3.68925G 570.695M 12.4067 2.9037 18.7709M 894.787k 45.9988M 822.046M/s 5.66445M 7.89822M 196.82M 4.2788 18.7392/s 3.68823G/s 570.695M 12.4067 2.89958 1000k 18.7392M/s [best: throughput= 0.86 GB/s doc_throughput= 18 docs/s instructions= 570694596 cycles= 196540518 branch_miss= 894323 cache_miss= 5638049 cache_ref= 7896828 items= 1000000 avg_time= 53352485 ns]
LargeRandomSum<Iter> 54883627 ns 54883439 ns 13 871.251k 841.314M 5.02219M 7.05069M 201.706M 4.38502 18.2899 3.68918G 577.695M 12.5589 2.86405 18.2899M 871.778k 45.9988M 799.291M/s 5.06164M 7.05285M 202.423M 4.40061 18.2204/s 3.68823G/s 577.695M 12.5589 2.85391 1000k 18.2204M/s [best: throughput= 0.84 GB/s doc_throughput= 18 docs/s instructions= 577695426 cycles= 201705764 branch_miss= 871251 cache_miss= 5022193 cache_ref= 7050692 items= 1000000 avg_time= 54871279 ns]
Creating a source file spanning 134087 KB
Kostya<Dom> 86984857 ns 86984354 ns 8 494.739k 1.58086G 15.8348M 22.1883M 320.4M 2.33349 11.5135 3.68893G 936.468M 6.82035 2.92281 6.0364M 494.617k 137.305M 1.47009G/s 15.849M 22.1757M 320.827M 2.3366 11.4963/s 3.68833G/s 936.468M 6.82035 2.91892 524.288k 6.02738M/s [best: throughput= 1.58 GB/s doc_throughput= 11 docs/s instructions= 936467833 cycles= 320400264 branch_miss= 494739 cache_miss=15834791 cache_ref= 22188266 items= 524288 avg_time= 86971545 ns]
KostyaSum<Dom> 86970961 ns 86970134 ns 8 495.135k 1.58123G 15.5502M 21.6331M 320.352M 2.33314 11.5162 3.68924G 938.565M 6.83562 2.92979 6.03781M 494.743k 137.305M 1.47034G/s 15.6095M 21.6834M 320.783M 2.33628 11.4982/s 3.68842G/s 938.565M 6.83562 2.92586 524.288k 6.02837M/s [best: throughput= 1.58 GB/s doc_throughput= 11 docs/s instructions= 938564987 cycles= 320352061 branch_miss= 495135 cache_miss=15550182 cache_ref= 21633093 items= 524288 avg_time= 86957969 ns]
Kostya<OnDemand> 60343057 ns 60343775 ns 12 456.213k 2.28197G 10.1305M 13.9803M 221.966M 1.61659 16.6197 3.68902G 647.782M 4.71783 2.91838 8.71353M 456.138k 137.305M 2.11911G/s 10.1682M 13.868M 222.551M 1.62085 16.5717/s 3.68805G/s 647.782M 4.71783 2.91072 524.288k 8.68835M/s [best: throughput= 2.28 GB/s doc_throughput= 16 docs/s instructions= 647782119 cycles= 221966325 branch_miss= 456213 cache_miss=10130481 cache_ref= 13980257 items= 524288 avg_time= 60330609 ns]
KostyaSum<OnDemand> 58642231 ns 58641748 ns 12 453.15k 2.3471G 9.82263M 13.5381M 215.814M 1.57178 17.094 3.68913G 643.064M 4.68347 2.97972 8.9622M 453.464k 137.305M 2.18062G/s 9.84356M 13.5389M 216.28M 1.57518 17.0527/s 3.68816G/s 643.064M 4.68347 2.97329 524.288k 8.94052M/s [best: throughput= 2.35 GB/s doc_throughput= 17 docs/s instructions= 643063529 cycles= 215813521 branch_miss= 453150 cache_miss= 9822627 cache_ref= 13538124 items= 524288 avg_time= 58629573 ns]
Kostya<Iter> 59348929 ns 59348585 ns 12 452.97k 2.32237G 10.0784M 13.9769M 218.108M 1.58849 16.9139 3.68906G 642.015M 4.67583 2.94357 8.86776M 453.045k 137.305M 2.15465G/s 10.1584M 13.924M 218.88M 1.59412 16.8496/s 3.68804G/s 642.015M 4.67583 2.93318 524.288k 8.83404M/s [best: throughput= 2.32 GB/s doc_throughput= 16 docs/s instructions= 642015174 cycles= 218107739 branch_miss= 452970 cache_miss=10078433 cache_ref= 13976859 items= 524288 avg_time= 59336271 ns]
KostyaSum<Iter> 121340358 ns 121337200 ns 6 453.895k 1.13236G 9.935M 13.6518M 447.335M 3.25796 8.24704 3.68919G 1.31992G 9.61305 2.95063 4.32383M 454.037k 137.305M 1079.18M/s 9.95978M 13.6513M 447.577M 3.25973 8.2415/s 3.68871G/s 1.31992G 9.61305 2.94903 524.288k 4.32092M/s [best: throughput= 1.13 GB/s doc_throughput= 8 docs/s instructions= 1319919270 cycles= 447334605 branch_miss= 453895 cache_miss= 9934995 cache_ref= 13651799 items= 524288 avg_time= 121327101 ns]
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss bytes bytes_per_second cache_miss cache_ref cycles cycles_per_byte docs_per_sec frequency instructions instructions_per_byte instructions_per_cycle items items_per_second
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PartialTweets<Sax> 181416 ns 181385 ns 3854 1.384k 3.67147G 0 58.443k 635.165k 1.00578 5.81375k 3.69269G 2.07459M 3.2851 3.26623 581.375k 1.53157k 631.515k 3.24252G/s 0.044629 58.4857k 640.112k 1.01361 5.51314k/s 3.52903G/s 2.07459M 3.2851 3.24098 100 551.314k/s [best: throughput= 3.67 GB/s doc_throughput= 5813 docs/s instructions= 2074593 cycles= 635165 branch_miss= 1384 cache_miss= 0 cache_ref= 58443 items= 100 avg_time= 173490 ns]
Creating a source file spanning 44921 KB
LargeRandom<Dom> 88806316 ns 88806323 ns 8 871.991k 518.893M 10.8318M 15.3949M 326.983M 7.10851 11.2806 3.68855G 970.316M 21.0944 2.96748 11.2806M 872.063k 45.9988M 493.972M/s 10.8779M 15.3961M 327.465M 7.11899 11.2605/s 3.68741G/s 970.316M 21.0944 2.96311 1000k 11.2605M/s [best: throughput= 0.52 GB/s doc_throughput= 11 docs/s instructions= 970315579 cycles= 326982612 branch_miss= 871991 cache_miss=10831777 cache_ref= 15394882 items= 1000000 avg_time= 88792975 ns]
LargeRandomSum<Dom> 89824596 ns 89807350 ns 8 872.01k 513.081M 10.3216M 14.5612M 330.716M 7.18967 11.1542 3.68889G 974.316M 21.1813 2.94608 11.1542M 871.863k 45.9988M 488.466M/s 10.3039M 14.5623M 331.208M 7.20036 11.1349/s 3.68798G/s 974.316M 21.1813 2.9417 1000k 11.1349M/s [best: throughput= 0.51 GB/s doc_throughput= 11 docs/s instructions= 974315579 cycles= 330716103 branch_miss= 872010 cache_miss=10321626 cache_ref= 14561194 items= 1000000 avg_time= 89811290 ns]
LargeRandom<Sax> 62784008 ns 62783943 ns 11 913.123k 738.521M 5.61415M 8.01956M 229.784M 4.99545 16.0552 3.68924G 672.695M 14.6242 2.9275 16.0552M 918.166k 45.9988M 698.711M/s 5.65478M 8.02137M 231.532M 5.03344 15.9276/s 3.68776G/s 672.695M 14.6242 2.90541 1000k 15.9276M/s [best: throughput= 0.74 GB/s doc_throughput= 16 docs/s instructions= 672694521 cycles= 229784372 branch_miss= 913123 cache_miss= 5614148 cache_ref= 8019564 items= 1000000 avg_time= 62771549 ns]
Loose Ends
Things that likely won't get finished in this checkin, but might be considered for a full release (and also might not :)):
- Bugs
- Possible corruption with incomplete arrays / objects ("[ 1, 2, ")
- Win32 / VS2019 failure (#1208)
- Rough Edges
x["a"]["b"]
unsupported (right now x["a"] would be released early and the element never gets fully skipped)
- parser.load()
- parser.iterate(buf, len)
- get_c_str()
- Out-of-order key lookup support
- Features
- Print / minify
- document_stream
- Validation of skipped values
- Strict object support (don't ever skip keys, error immediately if the next key is not what you expect)
- Nullable value support: .get_nullable_int64()
- .is_false_or_null() is probably useful ...
- SIMDJSON_ONDEMAND_SAFETY_RAILS tests
- Compile-time safety tests (make sure bad things don't compile)
- Performance
- Add ondemand to competitions
- Make more competitions
- Performance optimization for recursion
- Tuple support [ x, y, z ]? It would force-unroll loops, basically, possibly improving performance.
- Ability to stop iterating when finished (right now it will inspect and skip all remaining elements)
- Sanity checks:
- Sanity review of & and && versions of methods (I hate the error messages but I hate letting people compile programs that will fail at runtime even more). Can we make it so people can use things in more flexible ways (for example, ["a"]["b"] above)? Can we make error messages better?
- Sanity review of document vs. value. I don't like that they behave almost identically but have different code. Even with the root value parsing difference notwithstanding, the fact that document owns the iterator and value has a reference makes the code fundamentally non-reusable. We should look into doing something about that.
Next Steps
- [X] Supported way to use it on multiple platforms ("single kernel mode"? Plug in to simdjson's architecture selection?)
- [X] Parse numbers/booleans at the root correctly, without overrun
- [X] Don't overrun when objects/arrays are unbalanced
- [X] Thorough type tests similar to DOM API tests
- [X] Error tests for common error cases (to make sure errors are actually raised and iteration stops)
- [X] Last performance check to ensure we haven't dropped below 4.0GB/s in the final round of fixes
- [X] Resolve compiler failures on other platforms
enhancement performance research