AntiStampede Cache leaves orphaned threading.Event object on 304 Not Modified response and results in a 30 second timeout on subsequent request.
We are not able to reliably reproduce this problem, we only know that it happens and have the logs to scrutinize the TOOLS.CACHING code path followed when hitting the bug.
We've narrowed down the bug to
- static files
- an 'If-Modified-Since' request provoking a 304 Not modified response, leaving an orphaned threading.Event, causing
- a 30 seconds timeout on subsequent (normal) request
The logs show that the 304 provoking request follows the AntistampeCache.wait path, setting the threading.Event while returning a None to MemoryCache.get() variant object, causing it to return None to get()'s cache_data, causing it to follow the 'request is not cached' path of get(), returning False, but MemoryCache.put() is never called, thus leaving the threading.Event object orphaned.
The subsequent request for the same static file also follows the AntistampeCache.wait path. Encountering the orphaned threading.Event object causing it to fruitlessly wait 30 seconds after which it resolves the problem by diligently populating the cache object with the (eventually) responded static file.
After this, normal operation is resumed.
This happens mostly once a day, probably after cache has expired, after one of the clients was the last to requests an 'If-Modified-Since' static file response. But until now, we were not able to come up with a clean reproducible test case.
CherryPy is part of the Pyff daemon we deployed.
https://github.com/leifj/pyFF/issues/116
- CherryPy version: 11.1.0
- Python version: 2.7 due to Pyff dependancy
- OS: Debian GNU/Linux 9
- Browser: Chrome 63
The logging showing the problem, which was produced by us inserted extra debugging lines looks as follows. I understand that interpreting these logs without knowledge of the location of these statements is awkward. Nevertheless it clearly shows the timeout after the offending 304'd static file request.
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING get https:/***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING Wait result for key: (), type(value) <type 'NoneType'>
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING Cache was None, set Event <threading._Event object at 0x7f7b6c623a50>
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING variant found
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING request is not cached
[...]
Jan 19 08:43:26 proxy2 pyffd[4100]: 145.97.144.156 - - [19/Jan/2018:08:43:26] "GET /static/bootstrap/css/bootstrap.min.css HTTP/1.1" 304 - "https://***/role/idp.ds?return=https%3A%2F%2F***%2FSaml2SP%2Fdisco&entit
yID=https%3A%2F%2F***%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING get https://***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Wait result for key: (), type(value) <class 'threading._Event'>
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Event <threading._Event object at 0x7f7b6c623a50>
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Waiting up to 30 seconds
[...]
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING Timed out 30 seconds
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING variant found
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING request is not cached
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING get() status: None
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING Storing status: 200 OK, uri:https://***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:48:22 proxy2 pyffd[4100]: 2001:610:514:172:31da:c3b9:12ad:1b76 - - [19/Jan/2018:08:48:22] "GET /static/bootstrap/css/bootstrap.min.css HTTP/1.1" 200 109518 "https://***/role/idp.ds?return=https%3A%2F%2F***%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2F***%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
What we noticed is that the caching.py get() cherrypy.HTTPRedirect 304 exception code path is not touched on the 304 response. This must mean that the 304 is generated in static.py in serve_file() by cptools.validate_since().
Up to now, we were unable to explain why this specific 304 response would provoke the cacheobject not to be populated by the corresponding cache, while having set a threading.Event.
What we did see was that MemoryCache.expire_cache() clears (del) the AntistampedeCache variant object in the store dictionary, but keeps the store uri key. This means that after cache expiration a store[uri] key exists but has no cache object attached. This might explain the 304 following the Antistampede.wait path without replacing the threading.Event object but we were not sure and unable to force faster cache expiration to produce a test-case (up to now).
For now, we had to decide to let the bug go and work-around it by letting nginx serve Pyff's static files, although this is a less than ideal solution of course.
bug CherryPy code reproducer: missing