For the record, the Elusive Signing Bug is a subtle bug where when a file fails the repo signing checks, it's still sometimes served to the code that was looking for it. It was extremely difficult to reproduce for some reason (even though some people could reproduce it every time).
I spent a horrific amount of time tracking this down. I finally determined that cp.cache_dir
was never meant to be a proper sync. The original authors of hubble.audit.sync provided a clean
argument that would file.remove
the cache dir to prevent objects deleted from the repo sticking around after being elided.
The problem is that removing the whole cache and re-downloading it on every schedule loop is expensive, so someone disabled it later. It was probably thought that the regular file_client.channel.fs.update()
(which invokes the fileserver.reap_fileserver_cache_dir()
) would properly handle these cases; but it definitely does not appear to do so.
Essentially the problem is the double cache copy situation:
- we first copy from the fileserver (roots, or gitfs or whatever) into
roots
(i.e. /var/cache/hubble/roots
)
cp.cache_dir
then copies from roots
to files
(i.e. /var/cache/hubble/files
)
- the daemon
file_client.channel.fs.update()
updates the roots
(but never attempts to clean up the files
copy).
- signing needs these files to not be found, which is handled in
roots
but not files
It's worth noting that the unwelcome leftover files were at least downloaded in good faith. To get downloaded at all, they either had to be downloaded before signing was a thing or at some point when signing verified they were OK...
But they should definitely get cleaned up when they fail the signing pass.
The fix: I taught cp.cache_dir
how to notice these files missing in roots
that exist in files
. It can now remove them automatically during what should have been a proper sync. Note that the old behavior can be restored by setting cleanup_existing=False
but I can't think of a scenario where this would be desirable.
There's still the remaining question:
Why did this affect hubble.audit
but not audit.run
?
Answer:
hubble.audit
uses cp.cache_dir
to copy the whole thing all at once; but audit.run
uses cp.cache_file
and checks the result. Specifically checking one file, will indeed reveal the file is missing in roots
even if it exists in files
as a spurious older file. (module_runner.runner.make_file_available
also uses cp.cache_file
, so the hubble.audit
issue is entirely avoided in the successor).
my old (wrong) analysis:
~~Basically the problem is that various parts of the hubble code base read from the fileserver cache directly (rather than using the intended interfaces for such tasks). The repo signing checks are injected into the fileserver find_file()
mechanisms, which normally trigger a fileserver.reap_cache_dir()
hit, causing the file to be deleted from the cache.~~
~~But in some cases, apparently, a cache refresh isn't even part of the invocation of the execution module(s) in question. So the "reaping" never happens (at least not during the execution) and the failing file is serv^H^H^H^H read from the cache as if it was OK.~~
~~All this means, that it feels like a race condition to me. Eventually there would have been a cache refresh that would have cleared the failing file (via the reaping mechanism). This patch simply forces the reaping of files that fail the signing check so there's no need to wait for the reaping and no question about when it happens in the case of a signature check failure.~~