[AWSs3 Input] Handle unexpectedEOF errors #42420
base: main
Conversation
Signed-off-by: Andreas Gkizas <[email protected]>
This pull request does not have a backport label.
To fix up this pull request, you need to add the backport labels for the needed
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)
@faec, @zmoog, @Kavindu-Dodan sorry for delaying this a bit. The change seems minimal, tl;dr:
Do you have any specific tests in mind that you think are needed for this?
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
var buf bytes.Buffer
if _, err := io.Copy(&buf, r); err != nil {
	// If an error occurs during the download, handle it here
	return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
This error may not be appropriate inside the JSON reader, because at this point the download is complete and we are processing the object body. So we could derive a dedicated error (e.g. errBodyReading), which would help us add debug logs in the error handler.
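For illustration only, a minimal Go sketch of the dedicated-error idea; errBodyReading and copyBody are hypothetical names taken from this suggestion, not existing input code:

```go
package s3input // illustrative package name, not the real one

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// Hypothetical sentinel from the suggestion above; it is not defined in the
// input today, but would let the error handler log body-reading failures
// separately from download failures.
var errBodyReading = errors.New("failed reading S3 object body")

// copyBody drains the object body into memory, wrapping any failure with the
// dedicated body-reading error instead of the download error.
func copyBody(r io.Reader) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, r); err != nil {
		return nil, fmt.Errorf("%w: %w", errBodyReading, err)
	}
	return &buf, nil
}
```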
The error type looks OK to me; this is not in the JSON reader itself, it's in s3ObjectProcessor, which is currently responsible for the download. The problem is that at this point in execution the download isn't complete: the io.Reader is still a live HTTP pipe, and that's why it's able to encounter "EOF" while parsing.
An alternative that would actually separate the network issues from the parsing is to move this io.Copy call and its error check to ProcessS3Object, right after the p.download() call, so we are always starting with a completed download before we do any parsing. I think it would be slightly clearer that way too, so if that would address your concerns then let's try it that way.
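A rough sketch of that restructuring, with downloadFn and parse standing in for the real p.download() call and the JSON reader (the actual ProcessS3Object signature and the sentinel's message text differ):

```go
package s3input // illustrative package name, not the real one

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// Repeated here only so the sketch compiles; the input already defines this
// sentinel, and its message text here is assumed.
var errS3DownloadFailed = errors.New("failed to download the S3 object")

// processObject buffers the whole body right after the download, so any EOF
// from the live HTTP pipe is reported as a (retryable) download failure and
// parsing always starts from a completed, in-memory copy.
func processObject(downloadFn func() (io.ReadCloser, error), parse func(io.Reader) error) error {
	body, err := downloadFn()
	if err != nil {
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}
	defer body.Close()

	var buf bytes.Buffer
	if _, err := io.Copy(&buf, body); err != nil {
		// Still a network problem: keep the download error classification.
		return fmt.Errorf("%w: %w", errS3DownloadFailed, err)
	}

	// From here on, the parser only ever sees fully downloaded bytes, so an
	// unexpected EOF during parsing means the object itself is truncated.
	return parse(bytes.NewReader(buf.Bytes()))
}
```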
Thank you @faec for the explanation. I am fine with the error usage given that background. But I have another doubt about the bytes.Buffer usage.
With this change, we load the full object data into memory, and this could trigger a panic with ErrTooLarge. So I think it is best to detect unexpectedEOF with the existing stream reading, where the error should be thrown from decoder.Decode().
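As context, a hedged sketch of what detecting the EOF from the streaming decoder could look like; decodeStream and its callback are illustrative, and the input's real reader plumbing is more involved than this:

```go
package s3input // illustrative

import (
	"encoding/json"
	"errors"
	"io"
)

// decodeStream reads JSON values directly from the (possibly still live)
// reader. A clean io.EOF means the stream ended between values; an
// io.ErrUnexpectedEOF from Decode means it ended mid-value, which is where
// the dropped-connection case surfaces under this approach.
func decodeStream(r io.Reader, handle func(json.RawMessage) error) error {
	dec := json.NewDecoder(r)
	for {
		var msg json.RawMessage
		err := dec.Decode(&msg)
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			// Includes io.ErrUnexpectedEOF; the caller has to decide whether
			// this particular failure should be retried.
			return err
		}
		if err := handle(msg); err != nil {
			return err
		}
	}
}
```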
I'm hesitant about that: if the download is successful, then loading the whole object into memory is the intent. For very large objects there is some overhead to buffering first instead of streaming directly to the decoder, but short-lived buffers are cheap for the garbage collector to recover, and if a single event is so big that we can't hold two copies of it in memory for the length of a function call then we are going to have serious issues ingesting that data regardless (since e.g. the default memory queue holds 3200 events). Most inputs load the full event data into a memory buffer before decoding it, and I think the clarity and reliability we get from doing it here is worth the extra buffer during the initial download.
In particular, I'm worried about the case I described: if a response does have deterministically invalid JSON and we classify that EOF as a retryable error, then we will keep retrying that object forever despite deterministic failure.
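A tiny sketch of the classification being argued for here; isRetryable is a hypothetical helper, and the sentinel's message text is assumed:

```go
package s3input // illustrative

import "errors"

// Message text is illustrative; the input defines its own sentinel.
var errS3DownloadFailed = errors.New("failed to download the S3 object")

// isRetryable: only failures that happened while the body was still being
// downloaded are retried; a decode error on a fully buffered object is
// deterministic and must not be retried forever.
func isRetryable(err error) bool {
	return errors.Is(err, errS3DownloadFailed)
}
```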
if a single event is so big that we can't hold two copies of it in memory for the length of a function call then we are going to have serious issues ingesting that data regardless
Most inputs load the full event data into a memory buffer before decoding it, and I think the clarity and reliability we get from doing it here is worth the extra buffer during the initial download
+1 on these points.
@gizas what do you think?
Since you tested with the dataset that caused the error, were you able to see why this error was caused?
Since I couldn't find the actual root cause: is it enough to process only the first part? Can this error be caused by other sections of the stream?
The EOF that this is fixing is caused specifically by a dropped connection during the download process. It's still possible in theory to get that same error later, but only if the completed download contains corrupt JSON, and in that case we do want to return a fatal error.
Got it. I thought the error came from invalid JSON.
Since we use a
Agree and +1 for this.
I have tested it with some files that the customer sent, but could not reproduce the error. That is why, when/if it comes, we mark it again as errS3DownloadFailed and retry it.
@Kavindu-Dodan that is a totally fair concern. Indeed I am also worried and want to do some more tests. Also, I am still not sure whether S3 has a size limit that might protect us, but that is not something to rely on. So tl;dr:
Per the details in this issue of the AWS Go SDK [1], I don't think it will be easy to recreate this. Maybe we will see it with large files, but that's only a maybe. I think it is best to focus on the workaround where we retry if we see the error. However, we should be cautious and limit retries to prevent any unintended impact.
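A minimal sketch of the bounded-retry idea, assuming that only download failures are retryable and that a hypothetical maxDownloadRetries cap exists; the input's real retry and backoff handling lives in its SQS/S3 state machine and is more involved:

```go
package s3input // illustrative

import (
	"errors"
	"fmt"
)

// Message text is illustrative; the input defines its own sentinel.
var errS3DownloadFailed = errors.New("failed to download the S3 object")

// maxDownloadRetries is a hypothetical cap, not an existing config option.
const maxDownloadRetries = 3

// processWithRetry re-runs process only for download failures, retrying at
// most maxDownloadRetries times so a persistent failure cannot loop forever.
func processWithRetry(process func() error) error {
	var err error
	for attempt := 0; attempt <= maxDownloadRetries; attempt++ {
		err = process()
		if err == nil || !errors.Is(err, errS3DownloadFailed) {
			return err // success, or a non-retryable (deterministic) failure
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxDownloadRetries, err)
}
```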
Proposed commit message
Please explain:
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
Related issues
Screenshots
Logs