pysimdjson

Python bindings for the simdjson project, a SIMD-accelerated JSON parser. If SIMD instructions are unavailable a fallback parser is used, making pysimdjson safe to use anywhere.

Bindings are currently tested on OS X, Linux, and Windows for Python versions 3.5 to 3.9.

📝 Documentation

The latest documentation can be found at https://pysimdjson.tkte.ch.

If you've checked out the source code (for example to review a PR), you can build the latest documentation by running cd docs && make html.

🎉 Installation

If binary wheels are available for your platform, you can install from pip with no further requirements:

pip install pysimdjson

Binary wheels are available for the following:

|                  | py3.5 | py3.6 | py3.7 | py3.8 | py3.9 | pypy3 |
|------------------|-------|-------|-------|-------|-------|-------|
| OS X (x86_64)    | y     | y     | y     | y     | y     | y     |
| Windows (x86_64) | x     | x     | y     | y     | y     | x     |
| Linux (x86_64)   | y     | y     | y     | y     | y     | x     |
| Linux (ARM64)    | y     | y     | y     | y     | y     | x     |

If binary wheels are not available for your platform, you'll need a C++11-capable compiler to compile the sources:

pip install pysimdjson --no-binary :all:

Both simdjson and pysimdjson support FreeBSD and Linux on ARM when built from source.

Development and Testing

This project comes with a full test suite. To install development and testing dependencies, use:

pip install -e ".[test]"

To also install 3rd party JSON libraries used for running benchmarks, use:

pip install -e ".[benchmark]"

To run the tests, just type pytest. To also run the benchmarks, use pytest --runslow.
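The `--runslow` switch is typically wired up in a `conftest.py`. This is a hypothetical sketch following the standard pytest recipe for custom options; the project's actual conftest may differ:

```python
# conftest.py (hypothetical sketch of the usual --runslow recipe)
import pytest


def pytest_addoption(parser):
    parser.addoption(
        '--runslow', action='store_true', default=False,
        help='also run slow benchmark tests'
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption('--runslow'):
        return  # run everything, including benchmarks
    skip_slow = pytest.mark.skip(reason='need --runslow option to run')
    for item in items:
        if 'slow' in item.keywords:
            item.add_marker(skip_slow)
```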

To properly test on Windows, you need both a recent version of Visual Studio (VS) and VS2015 Update 3. Older versions of CPython required portable C/C++ extensions to be built with the same version of VS as the interpreter. Use the Developer Command Prompt to easily switch between versions.

How It Works

This project uses pybind11 to generate the low-level bindings on top of the simdjson project. You can use it just like the built-in json module, or use the simdjson-specific API for much better performance.

import simdjson
doc = simdjson.loads('{"hello": "world"}')

🚀 Making things faster

pysimdjson provides an API compatible with the built-in json module for convenience, and this API is pretty fast (beating or tying all other Python JSON libraries). However, it also provides a simdjson-specific API that can perform significantly better.

Don't load the entire document

95% of the time spent loading a JSON document into Python is spent in the creation of Python objects, not the actual parsing of the document. You can avoid all of this overhead by ignoring parts of the document you don't want.

pysimdjson supports this in two ways - the use of JSON pointers via at_pointer(), or proxies for objects and lists.

import simdjson
parser = simdjson.Parser()
doc = parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')

For our sample above, we just want the name of the second entry in res; we don't care about anything else. We can get it in two ways:

assert doc['res'][1]['name'] == 'second' # True
assert doc.at_pointer('/res/1/name') == 'second' # True

Both of these approaches will be much faster than using load/s(), since they avoid loading the parts of the document we didn't care about.

Both Object and Array have a mini property that returns their entire content as a minified Python str. A message router, for example, could parse the document, retrieve just a single property (the destination), and forward the payload without ever turning it into a full Python object. Here's a (bad) example:

import simdjson

# Assumes a Flask-style `app` and `request`, and a `redis` client.
@app.route('/store', methods=['POST'])
def store():
    parser = simdjson.Parser()
    doc = parser.parse(request.data)
    redis.set(doc['key'], doc.mini)

With this, doc could contain thousands of objects, but the only one loaded into a Python object was key, and we even minified the content as we went.
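As a point of reference, the minified text mini returns has the same shape as what the stdlib produces with compact separators. This sketch uses only the built-in json module (not pysimdjson) to show what "minified" means; the difference is that mini never materializes the untouched parts as Python objects:

```python
import json

# Stdlib stand-in showing what "minified" means: no insignificant
# whitespace between tokens.
doc = {"res": [{"name": "first"}, {"name": "second"}]}
mini = json.dumps(doc, separators=(',', ':'))
print(mini)  # {"res":[{"name":"first"},{"name":"second"}]}
```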

Re-use the parser.

One of the easiest performance gains if you're working on many documents is to re-use the parser.

import simdjson
parser = simdjson.Parser()

for i in range(0, 100):
    doc = parser.parse(b'{"a": "b"}')

This will drastically reduce the number of allocations being made, as it will reuse the existing buffer when possible. If it's too small, it'll grow to fit.

📈 Benchmarks

pysimdjson compares well against most libraries for the default load/loads(), which creates full Python objects immediately.

pysimdjson performs significantly better when only part of the document is of interest. For each test file we show the time taken to completely deserialize the document into Python objects, as well as the time to get the deepest key in each file. The second approach avoids all unnecessary object creation.

jsonexamples/canada.json deserialization

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{canada} | 10.67130 | 22.89260 | 0.00465 | 60.30257 |
| yyjson-{canada} | 11.29230 | 29.90640 | 0.00568 | 53.27890 |
| orjson-{canada} | 11.90260 | 34.88260 | 0.00507 | 54.49605 |
| ujson-{canada} | 18.17060 | 48.99410 | 0.00718 | 36.24892 |
| simplejson-{canada} | 39.24630 | 52.62860 | 0.00483 | 21.81617 |
| rapidjson-{canada} | 41.04930 | 53.10800 | 0.00445 | 21.19078 |
| json-{canada} | 44.68320 | 59.44410 | 0.00440 | 19.71509 |

jsonexamples/canada.json deepest key

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{canada} | 3.21360 | 6.88010 | 0.00044 | 285.83978 |
| yyjson-{canada} | 10.62770 | 46.10050 | 0.01000 | 43.29310 |
| orjson-{canada} | 12.54010 | 39.16080 | 0.00779 | 44.28928 |
| ujson-{canada} | 17.93980 | 35.44960 | 0.00697 | 36.78481 |
| simplejson-{canada} | 38.58160 | 54.33290 | 0.00699 | 21.37382 |
| rapidjson-{canada} | 40.69030 | 58.23460 | 0.00700 | 20.30349 |
| json-{canada} | 43.88300 | 65.04480 | 0.00722 | 18.55929 |

jsonexamples/twitter.json deserialization

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| orjson-{twitter} | 2.36070 | 14.03050 | 0.00123 | 346.94307 |
| simdjson-{twitter} | 2.41350 | 12.01550 | 0.00117 | 359.49272 |
| yyjson-{twitter} | 2.48130 | 12.03680 | 0.00112 | 353.03313 |
| ujson-{twitter} | 2.62890 | 11.39370 | 0.00090 | 346.87994 |
| simplejson-{twitter} | 3.34600 | 11.08840 | 0.00098 | 270.58797 |
| json-{twitter} | 3.35270 | 11.82610 | 0.00116 | 260.01943 |
| rapidjson-{twitter} | 4.29320 | 13.81980 | 0.00128 | 197.91107 |

jsonexamples/twitter.json deepest key

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{twitter} | 0.33840 | 0.67200 | 0.00002 | 2800.32496 |
| orjson-{twitter} | 2.38460 | 13.53120 | 0.00131 | 352.70788 |
| yyjson-{twitter} | 2.48180 | 13.67470 | 0.00156 | 320.56731 |
| ujson-{twitter} | 2.65230 | 11.65150 | 0.00125 | 331.69430 |
| json-{twitter} | 3.34910 | 12.44890 | 0.00116 | 263.25854 |
| simplejson-{twitter} | 3.35760 | 15.61900 | 0.00137 | 262.36758 |
| rapidjson-{twitter} | 4.31870 | 12.77490 | 0.00119 | 201.86510 |

jsonexamples/github_events.json deserialization

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| orjson-{github_events} | 0.18080 | 0.67020 | 0.00004 | 5041.29485 |
| simdjson-{github_events} | 0.19470 | 0.61450 | 0.00003 | 4725.63489 |
| yyjson-{github_events} | 0.19710 | 0.53970 | 0.00004 | 4584.50870 |
| ujson-{github_events} | 0.23760 | 1.33490 | 0.00004 | 3904.08715 |
| json-{github_events} | 0.29030 | 1.32040 | 0.00009 | 3034.22530 |
| simplejson-{github_events} | 0.30210 | 0.82260 | 0.00005 | 3067.99997 |
| rapidjson-{github_events} | 0.33010 | 0.92400 | 0.00005 | 2793.93274 |

jsonexamples/github_events.json deepest key

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{github_events} | 0.03630 | 0.66110 | 0.00001 | 25259.19598 |
| orjson-{github_events} | 0.18210 | 0.71230 | 0.00003 | 5073.48086 |
| yyjson-{github_events} | 0.20030 | 0.61270 | 0.00003 | 4589.71299 |
| ujson-{github_events} | 0.24260 | 1.05100 | 0.00007 | 3644.08240 |
| json-{github_events} | 0.29310 | 2.38770 | 0.00011 | 2967.79019 |
| simplejson-{github_events} | 0.30580 | 1.39670 | 0.00007 | 2931.01646 |
| rapidjson-{github_events} | 0.33340 | 0.80440 | 0.00004 | 2795.27887 |

jsonexamples/citm_catalog.json deserialization

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| orjson-{citm_catalog} | 5.40140 | 17.76900 | 0.00314 | 130.33847 |
| yyjson-{citm_catalog} | 5.77340 | 23.09490 | 0.00421 | 113.78942 |
| simdjson-{citm_catalog} | 6.00620 | 26.87570 | 0.00444 | 104.41073 |
| ujson-{citm_catalog} | 6.34300 | 25.06400 | 0.00473 | 96.01414 |
| simplejson-{citm_catalog} | 9.54910 | 23.96350 | 0.00392 | 78.99315 |
| json-{citm_catalog} | 10.21250 | 23.52610 | 0.00329 | 78.72180 |
| rapidjson-{citm_catalog} | 10.81700 | 21.85400 | 0.00343 | 73.94939 |

jsonexamples/citm_catalog.json deepest key

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{citm_catalog} | 0.81040 | 2.11090 | 0.00015 | 1088.17698 |
| orjson-{citm_catalog} | 5.37260 | 18.37890 | 0.00451 | 120.86345 |
| yyjson-{citm_catalog} | 5.61430 | 23.18500 | 0.00548 | 110.29924 |
| ujson-{citm_catalog} | 6.25850 | 30.79090 | 0.00604 | 95.50805 |
| simplejson-{citm_catalog} | 9.36560 | 24.44860 | 0.00510 | 77.50571 |
| json-{citm_catalog} | 10.07650 | 25.29490 | 0.00450 | 76.18267 |
| rapidjson-{citm_catalog} | 10.69120 | 27.84880 | 0.00493 | 70.98005 |

jsonexamples/mesh.json deserialization

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| yyjson-{mesh} | 2.33710 | 13.01130 | 0.00171 | 331.50569 |
| simdjson-{mesh} | 2.52960 | 13.19230 | 0.00159 | 311.37935 |
| orjson-{mesh} | 2.88770 | 12.13010 | 0.00152 | 287.31080 |
| ujson-{mesh} | 3.64020 | 18.23620 | 0.00227 | 193.35645 |
| json-{mesh} | 5.97130 | 13.58290 | 0.00136 | 150.01621 |
| rapidjson-{mesh} | 7.54270 | 16.14480 | 0.00155 | 119.37806 |
| simplejson-{mesh} | 8.64370 | 16.35320 | 0.00136 | 106.25888 |

jsonexamples/mesh.json deepest key

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{mesh} | 1.02020 | 2.74930 | 0.00013 | 919.93044 |
| yyjson-{mesh} | 2.30970 | 13.06730 | 0.00182 | 347.76076 |
| orjson-{mesh} | 2.85260 | 12.41860 | 0.00156 | 290.19432 |
| ujson-{mesh} | 3.59400 | 16.68610 | 0.00227 | 201.03704 |
| json-{mesh} | 5.96300 | 19.18900 | 0.00185 | 146.04645 |
| rapidjson-{mesh} | 7.43860 | 16.32260 | 0.00164 | 121.84979 |
| simplejson-{mesh} | 8.62160 | 21.89280 | 0.00221 | 101.30905 |

jsonexamples/gsoc-2018.json deserialization

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{gsoc-2018} | 5.52590 | 16.27430 | 0.00178 | 145.59797 |
| yyjson-{gsoc-2018} | 5.62040 | 16.46250 | 0.00168 | 155.97459 |
| orjson-{gsoc-2018} | 5.78420 | 13.87300 | 0.00140 | 148.84293 |
| simplejson-{gsoc-2018} | 7.76200 | 15.26480 | 0.00142 | 114.98827 |
| ujson-{gsoc-2018} | 7.96570 | 21.53840 | 0.00188 | 110.29162 |
| json-{gsoc-2018} | 8.63300 | 19.26320 | 0.00172 | 102.78744 |
| rapidjson-{gsoc-2018} | 10.55570 | 19.20210 | 0.00159 | 85.84087 |

jsonexamples/gsoc-2018.json deepest key

| Name | Min (ms) | Max (ms) | StdDev | Ops/s |
|------|---------:|---------:|-------:|------:|
| simdjson-{gsoc-2018} | 1.56020 | 4.20200 | 0.00024 | 570.15046 |
| yyjson-{gsoc-2018} | 5.49930 | 14.89760 | 0.00158 | 161.14242 |
| orjson-{gsoc-2018} | 5.72650 | 15.88270 | 0.00160 | 153.18169 |
| simplejson-{gsoc-2018} | 7.70780 | 18.78120 | 0.00169 | 116.90299 |
| ujson-{gsoc-2018} | 7.91720 | 21.35300 | 0.00227 | 103.06755 |
| json-{gsoc-2018} | 8.65190 | 19.99580 | 0.00188 | 103.86934 |
| rapidjson-{gsoc-2018} | 10.52410 | 20.98870 | 0.00158 | 87.78973 |
Issues
  • Rewrite for code quality and move to simdjson 0.4.*. (Issue #31)


    This will become the version 2.0.0 release.

    • [x] Update embedded simdjson to 0.3.0 (#31)
    • [x] Update embedded simdjson to 0.4.0 (#31)
    • [x] Move from cython to pybind11
    • [ ] Rewrite documentation
    • [ ] Better CI-generated benchmarks against json, ujson, rapidjson, and orjson.
    • [x] Try to match the json.load, json.loads, json.dump and json.dumps interfaces. Will impact performance over the native simdjson API but users want plug-and-play.
    • [x] Move from appveyor and circleci to github actions for CI tasks.
    • [x] simdjson no longer requires C++17. We can greatly expand the versions of Python on Windows we can provide binary wheels for. This comes from older versions of CPython requiring C extensions to be built with the same compiler they were built with.
    packaging 
    opened by TkTech 44
  • The Python overhead is about 95% of the processing time


    From simdjson/scripts/javascript, I generated a file called large.json. In C++, parsing this file takes about 0.25 s.

    $ parse large.json
    Min:  0.252188 bytes read: 203130424 Gigabytes/second: 0.805471
    

    I wrote the following Python script...

    import simdjson
    from timeit import default_timer as timer

    with open('large.json', 'rb') as fin:
        x = fin.read()

    for i in range(10):
        start = timer()
        doc = simdjson.loads(x)
        end = timer()
        print(end - start)
    

    I get...

    $ time python3 test.py
    3.471898762974888
    3.9210079659242183
    3.3614078611135483
    3.72252986789681
    3.7506914171390235
    3.756883286871016
    3.752689895918593
    3.751842977013439
    3.7484844669234008
    (...)
    

    If my analysis is correct (and it could be wrong), pysimdjson takes 3.7 s to parse the file, and of that, 0.25 s are due to simdjson, leaving about 95% of the processing time to overhead.
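As a quick sanity check of that estimate, plain arithmetic on the timings quoted above:

```python
# Timings from the issue above: total Python-side time vs. the
# portion attributable to the native simdjson parse.
total_s = 3.7
parse_s = 0.25
overhead = (total_s - parse_s) / total_s
print(f"{overhead:.0%}")  # 93%
```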

    I know that this is known, but I wanted to provide a data point.

    opened by lemire 24
  • File causes a crash in pysimdjson (reliably)


    I am copying over issue https://github.com/simdjson/simdjson/issues/921 from simdjson. We do not see a crash in simdjson itself, but there is a crash in pysimdjson:

    import simdjson
    a = open("test.txt").read()
    b = simdjson.loads(a.encode())
    

    Using the file https://github.com/simdjson/simdjson/files/4749603/test.txt

    opened by lemire 17
  • This parser can't support a document that big


    [email protected]:~$ time python convert-to-pickle.py
    Traceback (most recent call last):
      File "convert-to-pickle.py", line 10, in <module>
        data = simdjson.loads(ch.read())
      File "/usr/local/lib/python3.8/dist-packages/simdjson/__init__.py", line 52, in loads
        return parser.parse(s, True)
      File "simdjson/csimdjson.pyx", line 468, in csimdjson.Parser.parse
    ValueError: This parser can't support a document that big

    invalid zero-effort 
    opened by brianmingus2 17
  • Define typings for simdjson


    Closes: #75

    opened by kornicameister 12
  • Segfault when not assigning the parser to a variable


    Here is a Python session that segfaults:

    >>> import simdjson
    >>> pa=simdjson.Parser().parse('{"a": 9999}')
    >>> pa["a"]
    zsh: segmentation fault (core dumped)  python
    

    And here is one that works:

    >>> import simdjson
    >>> p = simdjson.Parser()
    >>> pa = p.parse('{"a": 9999}')
    >>> pa["a"]
    9999
    

    It's unclear to me why the first one segfaults, and it looks like a bug?

    I imagine the parser is garbage collected by Python in the first example, but it's still clearly in use by the "pa" variable?

    bug 
    opened by palkeo 9
  • Fairly high overhead on the boundary Python/C++


    We are parsing a very high number of ~2KB JSON files in our Python-based application.

    • The native (C++) SIMDJSON library delivers ~700k parser cycles per second.
    • The pysimdjson delivers ~350k parser cycles per second.
    • The Cython-based PoC implementation (in-house, so far) delivers ~700k parser cycles per second (very close to C++ implementation).

    I also conducted a rather artificial test of "how many parser cycles" can I get with basically empty JSON ({}). The issue here is quite visible: the overhead of the Python<->pysimdjson boundary crossing is high relative to other possible implementations.

    A "parser cycle" is defined as one call to parser.parse(json) on an existing parser instance.
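The throughput measurement can be sketched as below. This is a hypothetical harness that uses the stdlib json module as a stand-in for parser.parse(), so it only illustrates the methodology, not pysimdjson's numbers:

```python
import json
import time

# Hypothetical "parser cycles per second" harness; json.loads
# stands in for parser.parse() so the sketch stays stdlib-only.
payload = '{}'
n = 100_000

start = time.perf_counter()
for _ in range(n):
    json.loads(payload)
elapsed = time.perf_counter() - start

cycles_per_second = n / elapsed
print(f"{cycles_per_second:,.0f} cycles/s")
```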

    I'm not 100% sure if this is a priority of this library, so feel free to close this one as irrelevant.

    opened by ateska 9
  • Consider upgrading to simdjson 0.4


    Version 0.4 of simdjson is now available

    Highlights

    • Test coverage has been greatly improved and we have resolved many static-analysis warnings on different systems.

    New features:

    • We added a fast (8GB/s) minifier that works directly on JSON strings.
    • We added a fast (10GB/s) UTF-8 validator that works directly on strings (any strings, including non-JSON).
    • The array and object elements have a constant-time size() method.

    Performance:

    • Performance improvements to the API (type(), get<>()).
    • The parse_many function (ndjson) has been entirely reworked. It now uses a single secondary thread instead of several new threads.
    • We have introduced a faster UTF-8 validation algorithm (lookup3) for all kernels (ARM, x64 SSE, x64 AVX).

    System support:

    • C++11 support for older compilers and systems.
    • FreeBSD support (and tests).
    • We support the clang front-end compiler (clangcl) under Visual Studio.
    • It is now possible to target ARM platforms under Visual Studio.
    • The simdjson library will never abort or print to standard output/error.

    Version 0.3 of simdjson is now available

    Highlights

    • Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually. API docs / Design Details
    • Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. Docs
    • Exact Float Parsing: Now simdjson parses floats flawlessly without any performance loss (https://github.com/simdjson/simdjson/pull/558). Blog Post
    • Even Faster: The fastest parser got faster! With a shiny new UTF-8 validator and meticulously refactored SIMD core, simdjson 0.3 is 15% faster than before, running at 2.5 GB/s (where 0.2 ran at 2.2 GB/s).

    Minor Highlights

    • Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines.
    • Automatic allocation: as part of API simplification, the parser no longer has to be preallocated; it will adjust automatically when it encounters larger files.
    • Runtime selection API: We've exposed simdjson's runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
    • Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to T, and exceptions will be thrown automatically if an error happens. Use the same API either way!
    • Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
    opened by lemire 8
  • default on the right flags by including the logic in setup.py


    It seems that it should not be necessary for the user to provide the flags in this manner:

    CFLAGS="-march=native -std=c++17" pip install pysimdjson
    

    Violating the standard (which is just pip install pysimdjson) is not great from a usability point of view (for obvious reasons).

    One can detect the platform with platform.system() (Windows, Darwin, and so forth), and then do compile_args.append('-std=c++17') or the like.
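A minimal sketch of that idea, assuming a hypothetical cxx_flags() helper (this is not the project's actual setup.py logic):

```python
import platform


def cxx_flags():
    """Hypothetical helper: pick the C++ standard flag for the host
    toolchain so users don't have to pass CFLAGS themselves."""
    if platform.system() == 'Windows':
        return ['/std:c++17']  # MSVC-style flag
    return ['-std=c++17']      # GCC/Clang-style flag


compile_args = cxx_flags()
```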

    enhancement packaging 
    opened by lemire 6
  • Build binary packages using clang-cl on Windows


    Support for clang-cl is coming. As part of the PR that allows CPython to build against clang-cl, distutils is updated to build with clang-cl (https://github.com/python/cpython/pull/18371). Once this PR is merged and a new CPython release includes it we can start using it for our binary releases.

    Clang has reached a point where it's safe enough for us to use with CPython builds made with MSVC2015 or newer. https://clang.llvm.org/docs/MSVCCompatibility.html

    This would alleviate poor windows performance caused by MSVC issues (https://github.com/simdjson/simdjson/issues/847, but not entirely, https://github.com/simdjson/simdjson/issues/848).

    We only need to do this if upstream simdjson doesn't figure out what's up with MSVC. @lemire

    enhancement packaging blocked 
    opened by TkTech 6
  • Update to upstream simdjson 1.0.0


    << Work In Progress >>

    opened by TkTech 0
  • Improve user experience of memory safety.


    We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing new documents while references continue to exist to the old one. This is correct, in that it ensures no errors. I wasn't terribly happy with this, but it's better than segfaulting.

    It has downsides:

    • It sucks as a user (https://github.com/TkTech/pysimdjson/issues/53#issuecomment-850494991), where you might have to del the old objects, even if you didn't intend to use them again. Very un-pythonic.
    • Doesn't work on PyPy, where del is unreliable. The objects may not be garbage collected until much later.

    Brainstorming welcome. Alternatives:

    • Probably the easiest approach would be for a Parser to keep a list of Object and Array proxies that hold a reference to it, and set a dirty bit on them when parse() is called with a different document. The performance of this would probably be unacceptable - I might be wrong.
    • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.
    enhancement help wanted 
    opened by TkTech 1
  • Provide the ability to link to system simdjson


    Bundling a library is a serious sin in our book, so provide the ability to link to the system library. I've also done some refactoring to avoid exponential growth of Extension calls. The default behavior remains the same, so it shouldn't affect existing users.

    That said, the patch isn't perfect. It still uses the bundled headers instead of system headers but it should be good enough for us.

    opened by mgorny 2
  • Expose document_stream interface


    The pysimdjson library could support our document_stream interface (parse_many function). It is well tested as of release 0.7 (with fuzz testing) and works well today. It supports streams of indefinite size.

    See https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md

    Related to https://github.com/TkTech/pysimdjson/issues/70

    enhancement 
    opened by lemire 4
  • as_buffer() support for Object values.


    We now support as_buffer() (#59), which drastically improves performance when loading arrays from JSON into numpy.

    We should also support as_buffer() for Objects, which would retrieve the values as an array. This is to support calling it like this:

    https://github.com/riddell-stan/pystan-next/blob/1e027a8bded88d51c6b957a841b859938c0ac86d/stan/fit.py#L93

    For @riddell-stan's use case, this should use less memory and be roughly 4x faster (since we also avoid the iterator creation for values())

    enhancement 
    opened by TkTech 1
  • Abandon Travis for ARM & PPC builds


    Travis.ci (predictably, once they were bought by a fund) is starting to go down and is no longer a viable option for OSS that doesn't want the headache of constantly re-applying for minutes.

    We have a couple of options:

    1. We cross-compile on a native host with a pre-built docker image, then transfer as an artifact to run in Qemu for tests.
    2. Find an alternative CI with native platform support (Drone for ARM? I think @ddevault was working on ARM and PPC support for sh? Preferably, we find a one-shop-does-it-all)
    3. We could host github runners ourselves (and we did for a short few minutes), except it turns out these are incredibly exploitable on public repos. No one wants to deal with a PR turning our box into a bitcoin miner.

    @lemire

    packaging 
    opened by TkTech 2
  • Support for Decimal


    I've read in the documentation there's plan to support this. What status are these plans in? Thanks.

    enhancement upstream change required blocked 
    opened by gaborbernat 3
  • Prove we're worth using in real-world cases


    We should experiment with adding simdjson to some real-world projects where performance matters. This is both to prove we're worth using, and to ensure our API is extensive enough for real-world problems, extending it where needed. Basically "success stories". If you've used pysimdjson successfully, please feel free to contribute.

    • @ericls's https://github.com/fellowinsights/prosemirror-py

      • An ~8% gain on tiny documents (the document found in the project's example.py) from just switching import json to import simdjson. before:
        ------------------------------------------------- benchmark: 1 tests -------------------------------------------------
        Name (time in us)           Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
        ----------------------------------------------------------------------------------------------------------------------
        test_decoding_steps     26.0000  78.1000  28.3224  4.1256  27.3000  0.7000   559;930       35.3077    7571           1
       ----------------------------------------------------------------------------------------------------------------------
      

      after:

        ------------------------------------------------- benchmark: 1 tests -------------------------------------------------
        Name (time in us)           Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
        ----------------------------------------------------------------------------------------------------------------------
        test_decoding_steps     24.2000  74.5000  25.6597  2.8573  25.0000  0.5000   365;674       38.9716    5702           1
        ----------------------------------------------------------------------------------------------------------------------
      

      Our high max time is the one-time overhead from selecting an algorithm implementation, which can be avoided.

      • On more realistic documents with many attributes that the server doesn't care about, and using the simdjson-specific API for lazy dicts, the speed gain was drastic: a 4-8x increase on synthetic test documents. However, minimal documents performed much worse, as every key was accessed anyway. Both are things that could be improved by rewriting parts of prosemirror-py to be simdjson-aware rather than patching it in. Additionally, there are a few checks for isinstance(x, list) that needed to be updated, as our Array type isn't a true list.
    • Kinto is Mozilla's simple key-value database used for some production services like bookmark syncing. Its JSON performance is abysmal, and although it (sometimes) uses ujson it does some odd bug workarounds. As an example, using the memory backend will cause a JSON decode (built-in) -> encode (built-in) -> decode (ujson) -> encode (built-in) when creating (https://github.com/Kinto/kinto/blob/5f8ba312d0af8cac8d6f2ee5371bd26d5501be7e/kinto/core/storage/memory.py#L205) in an attempt to fix an issue where keys might be byte strings. Sure, Kinto isn't trying to be speedy, but this is silly and we can improve it. WIP.

    enhancement help wanted 
    opened by TkTech 3
  • Properly populate JSONDecodeError & UnicodeDecodeError


    We currently raise a ValueError instead of JSONDecodeError (which is a ValueError subclass) in some cases where JSONDecodeError would be more appropriate. This is because upstream simdjson does not report errors with enough granularity to populate lineno and colno (https://docs.python.org/3/library/json.html#json.JSONDecodeError)

    Might be viable to return 0 and 0 for now?
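A sketch of the "return 0 and 0" idea using the stdlib exception. The assumption here is that we pass position 0 whenever simdjson gives no location, which JSONDecodeError maps to lineno 1, colno 1:

```python
import json

# With no granular position from simdjson, construct the error at
# pos 0; JSONDecodeError derives lineno=1 and colno=1 from that.
err = json.JSONDecodeError('simdjson reported a parse error', '{bad}', 0)
assert isinstance(err, ValueError)  # callers catching ValueError still work
print(err.lineno, err.colno)  # 1 1
```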

    bug upstream change required blocked 
    opened by TkTech 2
  • Filtering outside CPython


    We used to have a little toy state machine that would let us do basic queries against documents before simdjson got JSONPointer support in 0.3.0. Anything we can do to reduce the number of objects crossing the C++ -> CPython barrier will greatly improve real-world performance, so it's time to bring this back.

    This is a draft for experimentation and planning on syntax.

    Selection:

    • . -> Current element.
    • .[] -> "For each item in an array". (returns array)
    • .[1] -> For single item in array (returns scalar)
    • .[1:2] -> For slice of array (returns array)
    • .<prop> -> Property of current object. (returns scalar)
    • <string> -> String literal, escapable with quotes and backslash.

    Construction:

    • {name: .name, description: .description} -> Construct an object from current object, ignoring unspecified fields.
    • [.name] -> Construct an array from current object, ignoring unspecified fields.
    • When an object key is not specified, the field it accesses is used. {.price < 2} is the same as {price: .price < 2}

    Filtering:

    • Given document [0, 1, 2, 3, 4], query .[] <= 2 returns [0, 1, 2]
    • Given document [{"price": 1.00}, {"price": 2.00}, {"price": 3.00}], query .[].price < 2 returns [{"price": 1.00}]
    • Given document {"price": 1.00}, query .price returns 1.00
    • When given multiple key filters, they're implicitly AND. Given document [{"price": 1.00, "available": true, "unwantedfield": false}, {"price": 2.00, "available": false}, {"price": 3.00}], query .[] | {.price < 2, .available == true} returns [{"price": 1.00, "available": true}]
    enhancement 
    opened by TkTech 0