Elastic-Agent fails to clean up when download fails #6680
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
The logs in question:
A 0-byte file downloaded with an HTTP 200 status is most likely a CDN error. We can look for it and handle it, but this shouldn't be happening regularly. We already have logic to delete the file on failure: elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/verifier.go, lines 100 to 102 in 093fb58.
We probably need explicit detection of an HTTP 200 with a 0-byte file, followed by a retry of the hash download: elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go, lines 153 to 168 in 093fb58.
We download the agent upgrade artifact and its hash in the same retry loop; they probably need to be separated into independent retry loops. #6276 added retries on the .asc. I would bet this 0-byte file problem can happen to any of the elastic-agent package, .sha512, and .asc files, and we probably need explicit detection for it, and to figure out why we don't already detect it as an error.
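To make the two suggestions above concrete, here is a minimal Go sketch, not the agent's actual code. It assumes the downloader can report the HTTP status and the number of bytes written for each file. The names `checkDownload`, `retry`, and `downloadEach` are hypothetical and do not exist in elastic-agent.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errEmptyBody marks a download that succeeded at the HTTP level but carried no data.
var errEmptyBody = errors.New("HTTP 200 with a 0-byte body")

// checkDownload (hypothetical helper) rejects non-200 responses and
// 200 responses that wrote zero bytes.
func checkDownload(statusCode int, bytesWritten int64) error {
	if statusCode != http.StatusOK {
		return fmt.Errorf("unexpected HTTP status %d", statusCode)
	}
	if bytesWritten == 0 {
		return errEmptyBody
	}
	return nil
}

// retry runs op up to attempts times (at least once) and returns the last
// error if every attempt fails.
func retry(attempts int, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", attempts, err)
}

// downloadEach gives every file (package, .sha512, .asc) its own retry loop,
// so a bad hash download does not force the package to be fetched again.
func downloadEach(names []string, attempts int, fetch func(name string) error) error {
	for _, name := range names {
		name := name // capture the per-iteration value for the closure
		if err := retry(attempts, func() error { return fetch(name) }); err != nil {
			return fmt.Errorf("%s: %w", name, err)
		}
	}
	return nil
}
```

In this sketch `fetch` would perform the HTTP GET, write the body to disk, and return `checkDownload(resp.StatusCode, n)`, so an empty 200 response is retried like any other failure.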
I see we unconditionally create the file we intend to download, but on error it should be getting deleted: elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go, lines 100 to 108 in 093fb58.
If the overall download fails, then we should delete everything here: elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go, lines 205 to 212 in 093fb58.
There are at least two places there that would clean up a failed download if we explicitly detected it as failed. We need more logs to know what happened.
@cmacknz what happens if the Elastic-Agent crashes while downloading the file and this 0-byte file is left behind? I remember seeing this scenario while testing the Elastic-Agent, so it is possible for the Elastic-Agent to crash or get killed during a download that is stuck or about to time out. What I believe the Elastic-Agent does not do is validate the files it finds and, if they are not valid, remove them and download them again. By valid, I mean not corrupted: a zip/tar can be extracted, nothing is zero bytes, etc.
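A rough sketch of the kind of validity check described above, limited to .tar.gz packages (Windows .zip packages would need a separate path). `validateTarGz` is a hypothetical name, not an existing agent function. It only catches empty, truncated, or malformed archives and is not a substitute for the checksum and signature checks.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
	"os"
)

// validateTarGz (hypothetical helper) reports an error when path is missing,
// not a regular file, zero bytes, or cannot be read through as a tar.gz archive.
func validateTarGz(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if !info.Mode().IsRegular() {
		return fmt.Errorf("%s is not a regular file", path)
	}
	if info.Size() == 0 {
		return fmt.Errorf("%s is zero bytes", path)
	}

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return fmt.Errorf("%s: invalid gzip header: %w", path, err)
	}
	defer gz.Close()

	// Walk every entry and drain its contents so truncated data is noticed.
	tr := tar.NewReader(gz)
	for {
		_, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil // reached the end of the archive without errors
		}
		if err != nil {
			return fmt.Errorf("%s: invalid tar entry: %w", path, err)
		}
		if _, err := io.Copy(io.Discard, tr); err != nil {
			return fmt.Errorf("%s: unreadable tar entry: %w", path, err)
		}
	}
}
```

A caller that found a leftover file from an earlier attempt could run this check and, on error, remove the file and download it again.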
As far as I can tell, there is no logic to skip downloading a file if it already exists; the download happens unconditionally. If there were such logic, it would have to re-download all the files when signature and checksum validation fails, as the agent can't easily tell which of the three (agent package, signature, or checksum) is the invalid one.
During an upgrade, downloading some files (either the archive or the SHA512) can fail. This leaves a zero-byte file behind, which on the next try makes the Elastic-Agent assume the file was correctly downloaded (the file exists), but then fail to extract/use it.
The workaround is to manually remove the zero-byte file and retry the upgrade.
We need to implement more resilience and better cleanup in the download/upgrade process, at a minimum making sure we do not leave any file behind when a download fails.
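One common way to avoid leaving a partial file behind is to write to a temporary file and rename it into place only after a complete, non-empty write. The Go sketch below illustrates that pattern under that assumption. It is not how the agent currently downloads files, and `saveAtomically` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// saveAtomically (hypothetical helper) streams src into a temporary file next
// to dst and renames it to dst only after a complete, non-empty write. On any
// failure the temporary file is removed, so dst is never left partial.
func saveAtomically(dst string, src io.Reader) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(dst), filepath.Base(dst)+".*.partial")
	if err != nil {
		return fmt.Errorf("creating temporary file: %w", err)
	}
	defer func() {
		if err != nil {
			tmp.Close()           // best effort; the file may already be closed
			os.Remove(tmp.Name()) // never leave a failed download behind
		}
	}()

	n, err := io.Copy(tmp, src)
	if err != nil {
		return fmt.Errorf("writing %s: %w", tmp.Name(), err)
	}
	if n == 0 {
		return errors.New("download produced 0 bytes")
	}
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	// The rename happens within one directory, so dst appears only when complete.
	return os.Rename(tmp.Name(), dst)
}
```

If the process is killed mid-download, a `.partial` file can still be left on disk, but it never carries the final name, so a later attempt would not mistake it for a finished download.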