
[AWSs3 Input] Handle unexpectedEOF errors #42420

Open · wants to merge 13 commits into base: main

Conversation

@gizas (Contributor) commented Jan 24, 2025

  • Enhancement

Proposed commit message

Please explain:

  • WHAT: Wraps the UnexpectedEOF error in the result of the readJSON function, so the caller knows it is not a permanent failure (see the sketch below)
  • WHY: To be able to retry the processing of S3 JSON files when such errors occur. Until now these errors were treated as permanent failures, so the affected objects were never retried.
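
A minimal sketch of the idea (errS3DownloadFailed mirrors the sentinel error discussed later in this PR; classifyDecodeErr is a hypothetical helper for illustration, not the actual diff):

package awss3sketch

import (
	"errors"
	"fmt"
	"io"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// classifyDecodeErr marks an unexpected EOF coming out of the JSON decoder
// as a (retryable) download failure instead of a permanent one.
func classifyDecodeErr(err error) error {
	if errors.Is(err, io.ErrUnexpectedEOF) {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}
	return err
}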

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Related issues

Screenshots

Logs

gizas added 5 commits January 24, 2025 13:04
Signed-off-by: Andreas Gkizas <[email protected]>
@botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on Jan 24, 2025
@gizas added the Team:obs-ds-hosted-services label (Observability Hosted Services team) on Jan 24, 2025
@botelastic bot removed the needs_team label on Jan 24, 2025
@mergify bot assigned gizas on Jan 24, 2025
mergify bot commented Jan 24, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gizas? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot commented Jan 24, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify bot added the backport-8.x label (automated backport to the 8.x branch with mergify) on Jan 24, 2025
@gizas marked this pull request as ready for review on February 17, 2025 16:33
@gizas requested review from a team as code owners on February 17, 2025 16:33
@elasticmachine (Collaborator) commented:

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@gizas (Contributor, Author) commented Feb 17, 2025

@faec, @zmoog, @Kavindu-Dodan sorry for delaying this a bit.

The change is minimal; tl;dr:

  • We return the errS3DownloadFailed error even if unexpectedEOF occurs
  • We now use a buffer to download the file and then process it. If the download error happens in the first part, we return the error.

Do you have specific tests in mind for this?
I will run some functional tests, but do you think we should also run tests to check memory consumption?

@pierrehilbert added the Team:Elastic-Agent-Data-Plane label (Agent Data Plane team) on Feb 18, 2025
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@pierrehilbert requested a review from faec on February 18, 2025 07:21
Signed-off-by: Andreas Gkizas <[email protected]>
var buf bytes.Buffer
if _, err := io.Copy(&buf, r); err != nil {
	// If an error occurs during the download, handle it here
	return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
}
@Kavindu-Dodan (Contributor) commented Feb 18, 2025

This error may not be appropriate inside the JSON reader, because at this point the download is complete and we are processing the object body. So we could derive a dedicated error (e.g. errBodyReading) that would help us add debug logs in the error handler.

@faec (Contributor) replied:

The error type looks OK to me; this is not in the JSON reader itself, it's in s3ObjectProcessor, which is currently responsible for the download. The problem is that at this point in execution the download isn't complete: the io.Reader is still a live HTTP pipe, and that's why it can encounter "EOF" while parsing.

An alternative that would actually separate the network issues from the parsing is to move this io.Copy call and its error check to ProcessS3Object, right after the p.download() call, so we always start with a completed download before doing any parsing. I think it would be slightly clearer that way too, so if that would address your concerns, let's try it that way.
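
A rough sketch of that restructuring (illustrative only; the s3ObjectProcessor fields below are simplified stand-ins for the input's actual types and method signatures):

package awss3sketch

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// Simplified stand-in for the real s3ObjectProcessor.
type s3ObjectProcessor struct {
	download func() (io.ReadCloser, error) // performs the S3 GetObject call
	readJSON func(io.Reader) error         // parses the (already complete) body
}

func (p *s3ObjectProcessor) ProcessS3Object() error {
	body, err := p.download()
	if err != nil {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}
	defer body.Close()

	// Buffer the whole object first, so any network failure (including an
	// unexpected EOF from a dropped connection) surfaces here as a
	// retryable download error, before any parsing starts.
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, body); err != nil {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}

	// Parsing now runs on a completed download; an unexpected EOF at this
	// point means genuinely truncated or corrupt JSON and stays fatal.
	return p.readJSON(bytes.NewReader(buf.Bytes()))
}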

@Kavindu-Dodan (Contributor) replied:

Thank you @faec for the explanation. I am fine with the error usage given that background. But I have another doubt about the bytes.Buffer usage.

With this change, we load the full object data into memory, and this could trigger a panic with ErrTooLarge. So I think it is best to detect unexpectedEOF with the existing stream reading, where the error would be thrown from decoder.Decode().
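
For contrast, the streaming shape would look roughly like this (decodeStream is hypothetical, not the input's actual readJSON); note that an unexpected EOF surfacing here cannot distinguish a dropped connection from genuinely truncated JSON, which is the concern raised in the next comment:

package awss3sketch

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// decodeStream reads JSON values straight off the (possibly still live)
// HTTP body, without buffering the whole object in memory first.
func decodeStream(r io.Reader, handle func(json.RawMessage) error) error {
	dec := json.NewDecoder(r)
	for {
		var v json.RawMessage
		err := dec.Decode(&v)
		switch {
		case errors.Is(err, io.EOF):
			return nil // clean end of stream
		case errors.Is(err, io.ErrUnexpectedEOF):
			// Could be a dropped connection or truncated JSON; while
			// streaming there is no way to tell the two apart.
			return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
		case err != nil:
			return fmt.Errorf("decoding JSON: %w", err)
		}
		if err := handle(v); err != nil {
			return err
		}
	}
}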

@faec (Contributor) replied:

I'm hesitant about that: if the download is successful, then loading the whole object into memory is the intent. For very large objects there is some overhead to buffering first instead of streaming directly to the decoder, but short-lived buffers are cheap for the garbage collector to recover, and if a single event is so big that we can't hold two copies of it in memory for the length of a function call then we are going to have serious issues ingesting that data regardless (since e.g. the default memory queue holds 3200 events). Most inputs load the full event data into a memory buffer before decoding it, and I think the clarity and reliability we get from doing it here is worth the extra buffer during the initial download.

In particular, I'm worried about the case I described: if a response does have deterministically invalid JSON and we classify that EOF as a retryable error, then we will keep retrying that object forever despite deterministic failure.

@Kavindu-Dodan (Contributor) replied:

if a single event is so big that we can't hold two copies of it in memory for the length of a function call then we are going to have serious issues ingesting that data regardless

Most inputs load the full event data into a memory buffer before decoding it, and I think the clarity and reliability we get from doing it here is worth the extra buffer during the initial download

+1 on these points.

@gizas what do you think?

@Kavindu-Dodan (Contributor) commented:

We return the errS3DownloadFailed error even if unexpectedEOF occurs

Since you tested with the dataset that caused the error, were you able to see why it occurred?

We now use a buffer to download the file and then process it. If the download error happens in the first part, we return the error.

Since I couldn't find the actual root cause: is it enough to only handle the first part? Can this error be caused by other sections of the stream?

@faec (Contributor) commented Feb 18, 2025

Since I couldn't find the actual root cause: is it enough to only handle the first part? Can this error be caused by other sections of the stream?

The EOF that this is fixing is caused specifically by a dropped connection during the download process. It's still possible in theory to get that same error later, but only if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.

@Kavindu-Dodan (Contributor) commented Feb 18, 2025

Since I couldn't find the actual root cause: is it enough to only handle the first part? Can this error be caused by other sections of the stream?

The EOF that this is fixing is caused specifically by a dropped connection during the download process. It's still possible in theory to get that same error later, but only if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.

Got it. I thought the error came from invalid JSON.

It's still possible in theory to get that same error later

Since we use a bytes.Buffer, we are reading the full body into memory (this is what I understand from the io.Copy internals; I raised this concern here: #42420 (comment)).

if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.

Agree and +1 for this.

@gizas (Contributor, Author) commented Feb 19, 2025

Since you tested with the dataset that caused the error, were you able to see why it occurred?

I have tested it with some files that the customer sent but could not reproduce the error.
As @faec said above, we suspect the unexpectedEOF comes from an incomplete download of the file.

That is why, when/if it occurs, we mark it as errS3DownloadFailed again and retry it.

Since we use a bytes.Buffer, we are reading the full body into memory

@Kavindu-Dodan that is a totally fair concern. Indeed I am also wary and want to do some more tests. I am also not sure yet whether S3 has a size limit that might protect us, but that is not something we should rely on.

So tl;dr:

  • I will need to run some load tests and will come back with memory consumption numbers
  • I don't know whether we should revert the buffer change and just mark all these unexpectedEOF errors as retriable. The drawback is that we might be too aggressive, report lots of errors, and keep retrying files in a way that could overload the input (see the sketch below). WDYT?
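
For reference, the caller-side decision could look roughly like this (a conceptual sketch only; how the real awss3 input actually schedules retries and acknowledges objects is outside this snippet):

package awss3sketch

import (
	"errors"
	"log"
)

var errS3DownloadFailed = errors.New("S3 download failure")

// handleObjectResult decides, purely from the wrapped error, whether the
// object should be retried later or dropped as a permanent failure.
func handleObjectResult(objKey string, err error) (retry bool) {
	switch {
	case err == nil:
		return false
	case errors.Is(err, errS3DownloadFailed):
		// Transient: likely a dropped connection; leave the object for retry.
		log.Printf("retryable download failure for %s: %v", objKey, err)
		return true
	default:
		// Permanent: e.g. genuinely corrupt JSON in a completed download.
		log.Printf("permanent failure for %s, dropping: %v", objKey, err)
		return false
	}
}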

@gizas (Contributor, Author) commented Feb 20, 2025

[Screenshot: 2025-02-20, 5:46 PM, load-test output]
At the moment I have only run some sporadic load tests; for example, see above: 6K records in 5 seconds with no errors.

I will need to make the tests more structured and capture some memory results.

@Kavindu-Dodan (Contributor) commented Feb 20, 2025

I will need to make the tests more structured and capture some memory results.

Per the details in this AWS Go SDK issue [1], I don't think it will be easy to recreate this. Maybe we would see it with large files, but that's only a maybe.

I think it is best to focus on the workaround where we retry if we see the error. However, we should be cautious and limit retries to prevent any unintended impact.
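
A rough sketch of one way to cap the retries (maxDownloadRetries and retryTracker are hypothetical, not existing settings or types in the input):

package awss3sketch

import "errors"

var errS3DownloadFailed = errors.New("S3 download failure")

const maxDownloadRetries = 3 // assumed limit, not an existing setting

// retryTracker counts how often a given object key has failed with a
// download error and stops retrying after maxDownloadRetries attempts.
type retryTracker struct {
	attempts map[string]int
}

func newRetryTracker() *retryTracker {
	return &retryTracker{attempts: make(map[string]int)}
}

func (t *retryTracker) shouldRetry(objKey string, err error) bool {
	if !errors.Is(err, errS3DownloadFailed) {
		return false // permanent failure, never retry
	}
	t.attempts[objKey]++
	return t.attempts[objKey] <= maxDownloadRetries
}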

Footnotes

  1. https://github.com/aws/aws-sdk-go/issues/4510#issuecomment-1416406191
