For the record, the Elusive Signing Bug is a subtle bug where when a file fails the repo signing checks, it's still sometimes served to the code that was looking for it. It was extremely difficult to reproduce for some reason (even though some people could reproduce it every time).
I spent a horrific amount of time tracking this down. I finally determined that
cp.cache_dir was never meant to be a proper sync. The original authors of hubble.audit.sync provided a
clean argument that would
file.remove the cache dir to prevent objects deleted from the repo sticking around after being elided.
The problem is that removing the whole cache and re-downloading it on every schedule loop is expensive, so someone disabled it later. It was probably thought that the regular
file_client.channel.fs.update() (which invokes the
fileserver.reap_fileserver_cache_dir()) would properly handle these cases; but it definitely does not appear to do so.
Essentially the problem is the double cache copy situation:
- we first copy from the fileserver (roots, or gitfs or whatever) into
cp.cache_dir then copies from
- the daemon
file_client.channel.fs.update() updates the
roots (but never attempts to clean up the
- signing needs these files to not be found, which is handled in
roots but not
It's worth noting that the unwelcome leftover files were at least downloaded in good faith. To get downloaded at all, they either had to be downloaded before signing was a thing or at some point when signing verified they were OK...
But they should definitely get cleaned up when they fail the signing pass.
The fix: I taught
cp.cache_dir how to notice these files missing in
roots that exist in
files. It can now remove them automatically during what should have been a proper sync. Note that the old behavior can be restored by setting
cleanup_existing=False but I can't think of a scenario where this would be desirable.
There's still the remaining question:
Why did this affect
hubble.audit but not
cp.cache_dir to copy the whole thing all at once; but
cp.cache_file and checks the result. Specifically checking one file, will indeed reveal the file is missing in
roots even if it exists in
files as a spurious older file. (
module_runner.runner.make_file_available also uses
cp.cache_file, so the
hubble.audit issue is entirely avoided in the successor).
my old (wrong) analysis:
~~Basically the problem is that various parts of the hubble code base read from the fileserver cache directly (rather than using the intended interfaces for such tasks). The repo signing checks are injected into the fileserver
find_file() mechanisms, which normally trigger a
fileserver.reap_cache_dir() hit, causing the file to be deleted from the cache.~~
~~But in some cases, apparently, a cache refresh isn't even part of the invocation of the execution module(s) in question. So the "reaping" never happens (at least not during the execution) and the failing file is serv^H^H^H^H read from the cache as if it was OK.~~
~~All this means, that it feels like a race condition to me. Eventually there would have been a cache refresh that would have cleared the failing file (via the reaping mechanism). This patch simply forces the reaping of files that fail the signing check so there's no need to wait for the reaping and no question about when it happens in the case of a signature check failure.~~