
[AWSs3 Input] Handle unexpectedEOF errors #42420

Open · wants to merge 13 commits into base: main

Conversation

@gizas (Contributor) commented Jan 24, 2025

  • Enhancement

Proposed commit message

Please explain:

  • WHAT: Wraps the UnexpectedEOF error in the result of the readJSON function, so the caller knows it is not a permanent failure (see the sketch below)
  • WHY: To be able to retry the processing of S3 JSON files when such errors occur. Until now these errors were treated as permanent failures, so the affected objects were never retried.
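
A minimal sketch of the idea (errS3DownloadFailed mirrors the sentinel error discussed later in this PR; classifyDecodeErr is a hypothetical helper for illustration, not the actual diff):

package awss3sketch

import (
	"errors"
	"fmt"
	"io"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// classifyDecodeErr marks an unexpected EOF coming out of the JSON decoder
// as a (retryable) download failure instead of a permanent one.
func classifyDecodeErr(err error) error {
	if errors.Is(err, io.ErrUnexpectedEOF) {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}
	return err
}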

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Related issues

Screenshots

Logs

gizas added 5 commits January 24, 2025 13:04
Signed-off-by: Andreas Gkizas <[email protected]>
@botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on Jan 24, 2025
@gizas added the Team:obs-ds-hosted-services label (Observability Hosted Services team) on Jan 24, 2025
@botelastic bot removed the needs_team label on Jan 24, 2025
@mergify bot assigned gizas on Jan 24, 2025
mergify bot commented Jan 24, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gizas? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot commented Jan 24, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify bot added the backport-8.x label (automated backport to the 8.x branch with mergify) on Jan 24, 2025
@gizas marked this pull request as ready for review on February 17, 2025 16:33
@gizas requested review from a team as code owners on February 17, 2025 16:33
@elasticmachine (Collaborator) commented:

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@gizas (Contributor, Author) commented Feb 17, 2025

@faec, @zmoog, @Kavindu-Dodan sorry for delaying this a bit.

The change is minimal; tl;dr:

  • We return the errS3DownloadFailed error even if unexpectedEOF occurs
  • We now use a buffer to download the file and then process it. If the download error happens in the first part, we return the error.

Do you have specific tests in mind for this?
I will run some functional tests, but do you think we should also run tests to check memory consumption?

@pierrehilbert added the Team:Elastic-Agent-Data-Plane label (Agent Data Plane team) on Feb 18, 2025
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@pierrehilbert requested a review from faec on February 18, 2025 07:21
Signed-off-by: Andreas Gkizas <[email protected]>
var buf bytes.Buffer
if _, err := io.Copy(&buf, r); err != nil {
	// If an error occurs during the download, handle it here
	return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
}
@Kavindu-Dodan (Contributor) commented Feb 18, 2025

This error may not be appropriate inside the JSON reader, because at this point the download is complete and we are processing the object body. So we could derive a dedicated error (e.g. errBodyReading) that would help us add debug logs in the error handler.

@faec (Contributor) replied:

The error type looks OK to me; this is not in the JSON reader itself, it's in s3ObjectProcessor, which is currently responsible for the download. The problem is that at this point in execution the download isn't complete: the io.Reader is still a live HTTP pipe, and that's why it can encounter "EOF" while parsing.

An alternative that would actually separate the network issues from the parsing is to move this io.Copy call and its error check to ProcessS3Object, right after the p.download() call, so we always start with a completed download before doing any parsing. I think it would be slightly clearer that way too, so if that would address your concerns, let's try it that way.
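
A rough sketch of that restructuring (illustrative only; the s3ObjectProcessor fields below are simplified stand-ins for the input's actual types and method signatures):

package awss3sketch

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// Simplified stand-in for the real s3ObjectProcessor.
type s3ObjectProcessor struct {
	download func() (io.ReadCloser, error) // performs the S3 GetObject call
	readJSON func(io.Reader) error         // parses the (already complete) body
}

func (p *s3ObjectProcessor) ProcessS3Object() error {
	body, err := p.download()
	if err != nil {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}
	defer body.Close()

	// Buffer the whole object first, so any network failure (including an
	// unexpected EOF from a dropped connection) surfaces here as a
	// retryable download error, before any parsing starts.
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, body); err != nil {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}

	// Parsing now runs on a completed download; an unexpected EOF at this
	// point means genuinely truncated or corrupt JSON and stays fatal.
	return p.readJSON(bytes.NewReader(buf.Bytes()))
}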

@Kavindu-Dodan (Contributor) replied:

Thank you @faec for the explanation. I am fine with the error usage given that background. But I have another doubt about the bytes.Buffer usage.

With this change, we load the full object data into memory, and this could trigger a panic with ErrTooLarge. So I think it is best to detect unexpectedEOF with the existing stream reading, where the error would be thrown from decoder.Decode().
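
For contrast, the streaming shape would look roughly like this (decodeStream is hypothetical, not the input's actual readJSON); note that an unexpected EOF surfacing here cannot distinguish a dropped connection from genuinely truncated JSON, which is the concern raised in the next comment:

package awss3sketch

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// decodeStream reads JSON values straight off the (possibly still live)
// HTTP body, without buffering the whole object in memory first.
func decodeStream(r io.Reader, handle func(json.RawMessage) error) error {
	dec := json.NewDecoder(r)
	for {
		var v json.RawMessage
		err := dec.Decode(&v)
		switch {
		case errors.Is(err, io.EOF):
			return nil // clean end of stream
		case errors.Is(err, io.ErrUnexpectedEOF):
			// Could be a dropped connection or truncated JSON; while
			// streaming there is no way to tell the two apart.
			return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
		case err != nil:
			return fmt.Errorf("decoding JSON: %w", err)
		}
		if err := handle(v); err != nil {
			return err
		}
	}
}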

@faec (Contributor) replied:

I'm hesitant about that: if the download is successful, then loading the whole object into memory is the intent. For very large objects there is some overhead to buffering first instead of streaming directly to the decoder, but short-lived buffers are cheap for the garbage collector to recover, and if a single event is so big that we can't hold two copies of it in memory for the length of a function call then we are going to have serious issues ingesting that data regardless (since e.g. the default memory queue holds 3200 events). Most inputs load the full event data into a memory buffer before decoding it, and I think the clarity and reliability we get from doing it here is worth the extra buffer during the initial download.

In particular, I'm worried about the case I described: if a response does have deterministically invalid JSON and we classify that EOF as a retryable error, then we will keep retrying that object forever despite deterministic failure.

@Kavindu-Dodan (Contributor) replied:

if a single event is so big that we can't hold two copies of it in memory for the length of a function call then we are going to have serious issues ingesting that data regardless

Most inputs load the full event data into a memory buffer before decoding it, and I think the clarity and reliability we get from doing it here is worth the extra buffer during the initial download

+1 on these points.

@gizas what do you think?

@Kavindu-Dodan (Contributor) commented:

We return the errS3DownloadFailed error even if unexpectedEOF occurs

Since you tested with the dataset that caused the error, were you able to see why it occurred?

We now use a buffer to download the file and then process it. If the download error happens in the first part, we return the error.

Since I couldn't find the actual root cause: is it enough to only handle the first part? Can this error be caused by other sections of the stream?

@faec (Contributor) commented Feb 18, 2025

Since I couldn't find the actual root cause: is it enough to only handle the first part? Can this error be caused by other sections of the stream?

The EOF that this is fixing is caused specifically by a dropped connection during the download process. It's still possible in theory to get that same error later, but only if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.

@Kavindu-Dodan (Contributor) commented Feb 18, 2025

Since I couldn't find the actual root cause: is it enough to only handle the first part? Can this error be caused by other sections of the stream?

The EOF that this is fixing is caused specifically by a dropped connection during the download process. It's still possible in theory to get that same error later, but only if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.

Got it. I thought the error came from invalid JSON.

It's still possible in theory to get that same error later

Since we use a bytes.Buffer, we are reading the full body into memory (this is what I understand from the io.Copy internals; I raised this concern here: #42420 (comment)).

if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.

Agree and +1 for this.

@gizas (Contributor, Author) commented Feb 19, 2025

Since you tested with the dataset that caused the error, were you able to see why it occurred?

I have tested it with some files that the customer sent but could not reproduce the error.
As @faec said above, we suspect the unexpectedEOF comes from an incomplete download of the file.

That is why, when/if it occurs, we mark it as errS3DownloadFailed again and retry it.

Since we use a bytes.Buffer, we are reading the full body into memory

@Kavindu-Dodan that is a totally fair concern. Indeed I am also wary and want to do some more tests. I am also not sure yet whether S3 has a size limit that might protect us, but that is not something we should rely on.

So tl;dr:

  • I will need to run some load tests and will come back with memory consumption numbers
  • I don't know whether we should revert the buffer change and just mark all these unexpectedEOF errors as retriable. The drawback is that we might be too aggressive, report lots of errors, and keep retrying files in a way that could overload the input (see the sketch below). WDYT?
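
For reference, the caller-side decision could look roughly like this (a conceptual sketch only; how the real awss3 input actually schedules retries and acknowledges objects is outside this snippet):

package awss3sketch

import (
	"errors"
	"log"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// handleObjectResult decides, purely from the wrapped error, whether the
// object should be retried later or dropped as a permanent failure.
func handleObjectResult(objKey string, err error) (retry bool) {
	switch {
	case err == nil:
		return false
	case errors.Is(err, errS3DownloadFailed):
		// Transient: likely a dropped connection; leave the object for retry.
		log.Printf("retryable download failure for %s: %v", objKey, err)
		return true
	default:
		// Permanent: e.g. genuinely corrupt JSON in a completed download.
		log.Printf("permanent failure for %s, dropping: %v", objKey, err)
		return false
	}
}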

@gizas (Contributor, Author) commented Feb 20, 2025

[Screenshot: 2025-02-20, 5:46 PM, load-test output]
At the moment I have only run some sporadic load tests; for example, see above: 6K records in 5 seconds with no errors.

I will need to make the tests more structured and capture some memory results.

@Kavindu-Dodan (Contributor) commented Feb 20, 2025

I will need to make the tests more structured and capture some memory results.

Per the details in this AWS Go SDK issue [1], I don't think it will be easy to recreate this. Maybe we would see it with large files, but that's only a maybe.

I think it is best to focus on the workaround where we retry if we see the error. However, we should be cautious and limit retries to prevent any unintended impact.
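
A rough sketch of one way to cap the retries (maxDownloadRetries and retryTracker are hypothetical, not existing settings or types in the input):

package awss3sketch

import "errors"

var errS3DownloadFailed = errors.New("S3 download failure")

const maxDownloadRetries = 3 // assumed limit, not an existing setting

// retryTracker counts how often a given object key has failed with a
// download error and stops retrying after maxDownloadRetries attempts.
type retryTracker struct {
	attempts map[string]int
}

func newRetryTracker() *retryTracker {
	return &retryTracker{attempts: make(map[string]int)}
}

func (t *retryTracker) shouldRetry(objKey string, err error) bool {
	if !errors.Is(err, errS3DownloadFailed) {
		return false // permanent failure, never retry
	}
	t.attempts[objKey]++
	return t.attempts[objKey] <= maxDownloadRetries
}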

Footnotes

  1. https://github.com/aws/aws-sdk-go/issues/4510#issuecomment-1416406191
