Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elastic-Agent fails to clean up when download fails #6680

Open
belimawr opened this issue Jan 31, 2025 · 5 comments
Open

Elastic-Agent fails to clean up when download fails #6680

belimawr opened this issue Jan 31, 2025 · 5 comments
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@belimawr
Copy link
Contributor

It can happen that during an upgrade there will be a failure in downloading some files (either the archive or the SHA512), this will leave a zero bytes file behind, which in the next try will make the Elastic-Agent assume the file was correctly downloaded (the file exists), but fail to extract/use it.

The solution is to manually remove the zero-bytes file and retry the upgrade.

We need to implement more resilience and better clean up in the download/upgrade process. At least making sure we do not leave any file behind when there is a download failure.

@belimawr belimawr added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jan 31, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2025

The logs in question:

upgrade to version 8.17.1 failed: failed verification of agent binary: failed to verify SHA512 hash: could not read checksum file: checksum for "elastic-agent-8.17.1-darwin-x86_64.tar.gz" was not found in "/Library/Elastic/Agent/data/elastic-agent-8.17.1-b46c28/downloads/elastic-agent-8.17.1-darwin-x86_64.tar.gz.sha512" failed to verify SHA512 hash: could not read checksum file: checksum for "elastic-agent-8.17.1-darwin-x86_64.tar.gz" was not found in "/Library/Elastic/Agent/data/elastic-agent-8.17.1-b46c28/downloads/elastic-agent-8.17.1-darwin-x86_64.tar.gz.sha512"

A 0 byte file downloading with an HTTP 200 code is an error in the CDN most likely. We can look for it and handle it but this shouldn't be happening regularly.

We already have the logic to delete the file on failure

if err = download.VerifySHA512HashWithCleanup(v.log, artifactPath); err != nil {
return fmt.Errorf("failed to verify SHA512 hash: %w", err)
}

Probably we need to explicit detection of an HTTP 200 with a 0 byte file and then to retry downloading the hash

func (e *Downloader) downloadHash(ctx context.Context, remoteArtifact string, operatingSystem string, a artifact.Artifact, version agtversion.ParsedSemVer) (string, error) {
filename, err := artifact.GetArtifactName(a, version, operatingSystem, e.config.Arch())
if err != nil {
return "", errors.New(err, "generating package name failed")
}
fullPath, err := artifact.GetArtifactPath(a, version, operatingSystem, e.config.Arch(), e.config.TargetDirectory)
if err != nil {
return "", errors.New(err, "generating package path failed")
}
filename = filename + ".sha512"
fullPath = fullPath + ".sha512"
return e.downloadFile(ctx, remoteArtifact, filename, fullPath)
}

We download the agent upgrade artifact and hash in the same retry loop, they probably need to be separated into independent retry loops.

#6276 added retries on the .asc, I would bet this 0 byte file problem can happen to any of the elastic-agent package, .sha512, .asc and we probably need explicit detection for it to to figure out why we don't detect it as an error.

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2025

I see we unconditionally create the file we intend to download but on error it should be getting deleted

defer func() {
if err != nil {
for _, path := range downloadedFiles {
if err := os.Remove(path); err != nil {
e.log.Warnf("failed to cleanup %s: %v", path, err)
}
}
}
}()

If the overall download fails then we should delete everything here

archivePath, err := u.downloadArtifact(ctx, parsedVersion, sourceURI, det, skipVerifyOverride, skipDefaultPgp, pgpBytes...)
if err != nil {
// Run the same pre-upgrade cleanup task to get rid of any newly downloaded files
// This may have an issue if users are upgrading to the same version number.
if dErr := cleanNonMatchingVersionsFromDownloads(u.log, u.agentInfo.Version()); dErr != nil {
u.log.Errorw("Unable to remove file after verification failure", "error.message", dErr)
}

There are at least two places there that would clean up a failed download if we explicitly detected it at failed. We need more logs to know what happened.

@belimawr
Copy link
Contributor Author

belimawr commented Feb 3, 2025

@cmacknz what happens if the Elastic-Agent crashes while downloading the file and this 0 bytes file is left behind?

I remember seeing this scenario while testing the Elastic-Agent, so there is a possibility of the Elastic-Agent crashing/getting killed during a download that's stuck/will time out.

What I believe the Elastic-Agent does not do is to validate the found files are valid, if they're not valid, then removing and downloading them again.

By valid, I mean not corrupted, like a zip/tar can be extracted, nothing is zero bytes, etc.

@cmacknz
Copy link
Member

cmacknz commented Feb 3, 2025

@cmacknz what happens if the Elastic-Agent crashes while downloading the file and this 0 bytes file is left behind?

As far as I can tell there is no logic to skip downloading a file if it already exists. It happens unconditionally. If there were logic to do this, it would have to re-download all the files when the signature and checksum validation fails as agent can't tell which of the 3 is the invalid one easily (agent package, signature, or checksum).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

3 participants