We (mostly @pquentin and I) have been working on a proof of concept for adding pluggable async support to urllib3, with the hope of eventually getting this into the upstream urllib3. It's reached the point where there's still lots of missing bits, but there's an end-to-end demo working and we don't see any major obstacles to getting everything else working, so I wanted to start getting feedback from the urllib3 maintainers about whether this looks like something you'd be interested in eventually merging, and what it would take to get there.
Demo
So: hi! Check it out – a single py2.py3 wheel that keeps the classic synchronous API, and on python 3.6+ it can also run in async mode on both Trio and Twisted:
# Demo from commit 1ca67ee53e18f823d0cb in the python-trio/urllib3 bleach-spike branch
$ curl -O https://vorpus.org/~njs/tmp/async-urllib3-demo.zip
$ unzip async-urllib3-demo.zip
$ cd async-urllib3-demo
$ ls
sync-demo.py
async-demo.py
urllib3-2.0.dev0+bleach.spike.proof.of.concept.dont.use-py2.py3-none-any.whl
$ virtualenv -p python3.6 py36-venv
$ py36-venv/bin/pip install trio twisted[tls] urllib3-2.0.dev0+bleach.spike.proof.of.concept.dont.use-py2.py3-none-any.whl
$ py36-venv/bin/python sync-demo.py
--- urllib3 using synchronous sockets ---
URL: http://httpbin.org/uuid
Status: 200
Data: b'{\n "uuid": "a2c28245-47b8-4a50-b64c-da09d27bf626"\n}\n'
$ py36-venv/bin/python async-demo.py
--- urllib3 using Trio ---
URL: http://httpbin.org/uuid
Status: 200
Data: b'{\n "uuid": "dab50c1a-1b20-483f-903e-fe74494629f2"\n}\n'
--- urllib3 using Twisted ---
URL: http://httpbin.org/uuid
Status: 200
Data: b'{\n "uuid": "72196b66-7caa-40dd-9c7c-af65bb4f7fb6"\n}\n'
$ virtualenv -p python2 py2-venv
$ py2-venv/bin/pip install urllib3-2.0.dev0+bleach.spike.proof.of.concept.dont.use-py2.py3-none-any.whl
$ py2-venv/bin/python sync-demo.py
--- urllib3 using synchronous sockets ---
URL: http://httpbin.org/uuid
Status: 200
Data: '{\n "uuid": "2a2fce90-d853-4940-b111-0969f92b7678"\n}\n'
Things that (probably) don't work yet include: HTTPS, timeouts, proper connection reuse, using python 2 to run setup.py bdist_wheel, running the test suite. OTOH some non-trivial things do work, like chunked transfer encoding, and there's at least code in there for proxy support and handling early responses from the server.
what sorcery is this
This is based on @Lukasa's "v2" branch, so it's using h11. (Goodbye httplib :wave:.) The v2 branch is currently stalled in a pretty half-finished state (e.g. the I/O layer is a bunch of raw select loops open-coded everywhere that IIRC don't work; I'm sure @Lukasa would have cleaned it up if he had more time); we cleaned all that up and made it work. The major new code is here: https://github.com/python-trio/urllib3/blob/bleach-spike/urllib3/_async/connection.py
And then we made the low-level I/O pluggable – there's a small-ish ad hoc API providing the set of I/O operations that urllib3 actually needs, and several implementations using different backends. Note that the I/O backend interface is internal, so we can adjust it as needed; there's no attempt to provide a generic abstract I/O interface that would work for anyone else. The code for the backends is here: https://github.com/python-trio/urllib3/tree/bleach-spike/urllib3/_backends
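To give a feel for what "a small-ish ad hoc API plus several implementations" can look like, here is a minimal sketch. It is not the interface from the branch: the names `InMemoryBackend`, `InMemorySocket`, `connect`, `send_all`, `receive_some`, `fetch_raw` and `run_without_event_loop` are all made up for illustration, and the only "backend" shown is an in-memory fake so the example runs without a network or an event loop.

```python
# Hypothetical sketch: none of these names are taken from the urllib3 branch.


class InMemorySocket:
    """Fake connection that replays canned bytes instead of touching the network."""

    def __init__(self, incoming):
        self._incoming = incoming
        self.sent = b""

    async def send_all(self, data):
        self.sent += data

    async def receive_some(self, max_bytes=4096):
        # Hand back at most max_bytes; an empty bytes object signals end of stream.
        chunk = self._incoming[:max_bytes]
        self._incoming = self._incoming[max_bytes:]
        return chunk


class InMemoryBackend:
    """One possible backend; a Trio or Twisted backend would expose the same methods."""

    def __init__(self, canned_response):
        self._canned_response = canned_response

    async def connect(self, host, port):
        return InMemorySocket(self._canned_response)


async def fetch_raw(backend, host, port, request_bytes):
    """Library-side code: it only ever talks to the backend interface."""
    sock = await backend.connect(host, port)
    await sock.send_all(request_bytes)
    received = b""
    while True:
        chunk = await sock.receive_some(4096)
        if not chunk:
            return received
        received += chunk


def run_without_event_loop(coro):
    """Drive a coroutine that never actually suspends (true for the fake backend)."""
    try:
        coro.send(None)
    except StopIteration as stop:
        return stop.value
    coro.close()
    raise RuntimeError("coroutine suspended; a real event loop is needed")


backend = InMemoryBackend(b"HTTP/1.1 200 OK\r\ncontent-length: 0\r\n\r\n")
request = b"GET /uuid HTTP/1.1\r\nhost: example.invalid\r\n\r\n"
print(run_without_event_loop(fetch_raw(backend, "example.invalid", 80, request)))
```

The point of the sketch is only that `fetch_raw` never imports a socket or event-loop library itself, so swapping the backend object is enough to change how the I/O happens.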
Then we started adding async/await annotations to the rest of urllib3's code. That gave us a version that could work on Trio and Twisted. But wait, we also want to support python 2! And everyone using the existing synchronous API on python 3 too, for that matter. But we don't want to maintain two copies of the code. Unfortunately, Python absolutely insists that async APIs and synchronous APIs be loaded from different copies of the code; there's no way around this. (Mayyybe if we dropped python 2 support and were willing to maintain a giant shim layer then it would be possible, but I don't think either of those things is true.)
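To make the "different copies of the code" constraint concrete, here is a tiny illustration using nothing but the standard library. The function names are made up. Calling an `async def` function from synchronous code does not run its body; it just hands back a coroutine object, so one function definition can't serve both kinds of caller.

```python
import inspect


async def get_status_async():
    # In real code this body would contain "await" expressions.
    return 200


def get_status_sync():
    return 200


# A synchronous caller gets the result directly...
sync_result = get_status_sync()

# ...but calling the async version only creates a coroutine object.
pending = get_status_async()
is_coroutine = inspect.iscoroutine(pending)
pending.close()  # avoid a "never awaited" warning in this demo

print(sync_result, is_coroutine)
```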
Solution: we maintain one copy of the code – the version with async/await annotations – and then a little script maintains the synchronous copy by automatically stripping them out again. It's not beautiful, but as far as I can tell all the alternatives are worse.
Currently the async version (source of truth) lives in urllib3/_async/, and then setup.py has the code to automatically generate urllib3/_sync/... at build time. (This script should probably get factored out into a separate project.) The script is not at all clever; you can see the exact set of transformations in the file, but basically it just tokenizes the source, deletes async and await tokens, and renames a few other tokens like __aenter__ → __enter__. There's no complicated AST manipulation and comments are preserved. Then urllib3/__init__.py imports things from urllib3._sync and (if it's running on a new enough Python) urllib3._async.
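As a rough idea of what such a token-level transformation can look like, here is a minimal sketch built on the standard library's `tokenize` module. It is not the script from setup.py: the function name `strip_async` is made up, the only rename taken from the description above is `__aenter__` → `__enter__` (the other entries in `RENAME` are assumptions), and it assumes Python 3.7+, where `async` and `await` are reported as ordinary NAME tokens.

```python
import io
import tokenize

# Tokens to delete outright.
DROP = {"async", "await"}

# Tokens to rename. Only the first entry comes from the text above;
# the remaining entries are assumed for illustration.
RENAME = {
    "__aenter__": "__enter__",
    "__aexit__": "__exit__",
    "__aiter__": "__iter__",
    "__anext__": "__next__",
}


def strip_async(source):
    """Return a synchronous copy of `source`, leaving comments and layout alone."""
    lines = source.split("\n")
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Strings and comments are not NAME tokens, so they are never touched.
        if tok.type != tokenize.NAME:
            continue
        row, start = tok.start
        end = tok.end[1]
        if tok.string in DROP:
            # Also swallow the whitespace that followed the deleted keyword.
            line = lines[row - 1]
            while end < len(line) and line[end] in " \t":
                end += 1
            edits.append((row, start, end, ""))
        elif tok.string in RENAME:
            edits.append((row, start, end, RENAME[tok.string]))
    # Apply edits right-to-left so earlier column offsets stay valid.
    for row, start, end, new in sorted(edits, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:start] + new + line[end:]
    return "\n".join(lines)


SAMPLE = '''\
class Pool:
    async def __aenter__(self):
        return self

async def fetch(pool):
    async with pool as conn:  # comment with the word await in it
        return await conn.get()
'''

print(strip_async(SAMPLE))
```

Because the edits are spliced into the original text rather than regenerated from an AST, anything that is not a matching token (comments, blank lines, spacing) comes through unchanged.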
The resulting dev experience is sort of half-way between that of a pure-python project and one with a C extension: instead of an edit/run cycle, you need to do an edit/compile/run cycle. But you still end up with a nice py2.py3-none-any wheel at the end, so you don't need to stress out about providing binary builds for different platforms, and the builds are extremely fast because it's just some shallow text manipulation, not invoking a C compiler.
Oh, and for backcompat we also added some shim files like urllib3/connectionpool.py, that just re-export stuff from the corresponding files in urllib3/_sync/..., since these submodules are documented as part of the API. This also has the benefit of making it easier to avoid accidentally exporting things in the future; urllib3's public API is perhaps larger than it should be.
Importantly, this basic strategy would also work for libraries that use urllib3, so it provides a path forward for higher-level projects like requests, botocore, etc. to provide dual sync/async APIs while supporting python 2 + multiple async backends.
Backwards compatibility
So far this is looking surprisingly good. Switching to h11 means losing most of urllib3.connection, since that's directly exposing httplib APIs. In async mode we can't support lazily loading response.data, but that's fine, and we can still support it in sync mode if we want. Right now the branch is keeping the "close" operations synchronous, but we might want to switch to making them async, because it might make things easier for HTTP/2 later. If we do switch to async close, then that will force some broader changes – in particular RecentlyUsedContainer assumes that it can close things from __setitem__ and __delitem__, which can't be made async. It wouldn't be hard to rewrite RecentlyUsedContainer, but right now technically it's a public API.
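To illustrate why an async close doesn't fit behind `__setitem__` / `__delitem__`: `container[key] = value` and `del container[key]` are plain statements, so there is nowhere to put an `await` if evicting an entry has to close something asynchronously. One possible shape for a rewrite is to expose explicit async methods instead. The sketch below is hypothetical (the class name `AsyncRecentlyUsed` and its `put` / `remove` / `get` methods are not from urllib3 or the branch), and it uses asyncio only as a convenient standard-library way to run the demo.

```python
import asyncio
from collections import OrderedDict


class AsyncRecentlyUsed:
    """Hypothetical LRU container whose dispose function is a coroutine function."""

    def __init__(self, maxsize, dispose_func):
        self._maxsize = maxsize
        self._dispose_func = dispose_func
        self._items = OrderedDict()

    def get(self, key):
        # Lookups do no I/O, so they can stay synchronous.
        self._items.move_to_end(key)
        return self._items[key]

    async def put(self, key, value):
        # Replaces "container[key] = value": evictions may need to await a close.
        evicted = []
        if key in self._items:
            evicted.append(self._items.pop(key))
        self._items[key] = value
        while len(self._items) > self._maxsize:
            _, oldest = self._items.popitem(last=False)
            evicted.append(oldest)
        for item in evicted:
            await self._dispose_func(item)

    async def remove(self, key):
        # Replaces "del container[key]".
        await self._dispose_func(self._items.pop(key))


async def demo():
    closed = []

    async def close_pool(pool):
        closed.append(pool)

    cache = AsyncRecentlyUsed(maxsize=2, dispose_func=close_pool)
    await cache.put("a", "pool-a")
    await cache.put("b", "pool-b")
    cache.get("a")  # "a" is now the most recently used entry
    await cache.put("c", "pool-c")  # evicts "b", the least recently used
    return closed


print(asyncio.run(demo()))
```

The cost of this shape is exactly the one mentioned above: callers that use the mapping syntax today would have to change, which matters as long as the container counts as public API.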
I'm not sure about some of the more obscure stuff like urllib3.contrib. In particular urllib3.contrib.pyopenssl, urllib3.contrib.socks, and urllib3.contrib.securetransport seem difficult to handle in an I/O-backend-agnostic way – maybe we could hack them to keep working on the sync backend only? And I don't know what to think about the appengine support. Maybe it's fine because it doesn't actually use urllib3 internals at all?
But overall, my impression is that we could keep quite a high degree of backwards compatibility if we want. Perhaps we'd want to break things for other reasons if we're doing a v2, but that's a separate discussion.
Questions
Basically at this point it seems clear that this approach can work. And what's left is just straightforward work. So:
- Does this seem like the direction we want urllib3 to go?
- If so, then how would you like that process to go?
I know @Lukasa is wary of the code generation approach, but the alternatives seem to be (a) maintaining a ton of redundant code, or (b) not supporting async in urllib3, in which case the community will... end up maintaining a ton of redundant code as each I/O library separately implements its own fake copy of requests. And I don't believe that we have enough HTTP experts in the community to do that well; I think there are huge benefits to concentrating all our maintainers on a single library if we can.
CC: @kennethreitz @markrwilliams