When OpenSearch rejects individual messages in a bulk index request because of problems on the OpenSearch side (load, networking, ...), Graylog treats them as permanent errors: the messages are dropped and logged as indexer failures.
I've seen this with the AWS OpenSearch Service and OR1 instances, which use S3 for remote storage.
These are the exceptions observed:
OpenSearchException[OpenSearch exception [type=rejected_execution_exception, reason=rejected execution on primary shard:[redacted_1234][6] due to remote segments lagging behind local segments.time_lag:12074 ms dynamic_time_lag_threshold:4048.0 ms]]
OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Failed to upload [redacted_filename]]; nested: OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Unable to upload object [redacted_path] using a single upload]]; nested: OpenSearchException[OpenSearch exception [type=sdk_client_exception, reason=sdk_client_exception: Unable to execute HTTP request: Connection reset by peer]]; nested: OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Connection reset by peer]];
OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Failed to upload 1 files during transfer]]
Expected Behavior
Messages in a bulk request that fail with temporary errors should be retried.
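To make the expectation concrete, here is a minimal sketch of splitting the per-item results of one bulk request into "retry" and "drop". It is not Graylog code: `ItemResult`, `Outcome`, and `partition` are hypothetical names, and the decision about which error types count as temporary is passed in by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not Graylog's implementation.
final class BulkRetryPartition {
    // One entry per message in the bulk request; errorType is null when indexing succeeded.
    record ItemResult(String messageId, String errorType) {}

    // Message ids to send again vs. ids to record as indexer failures.
    record Outcome(List<String> retry, List<String> dropped) {}

    static Outcome partition(List<ItemResult> results, Predicate<String> isTemporary) {
        List<String> retry = new ArrayList<>();
        List<String> dropped = new ArrayList<>();
        for (ItemResult result : results) {
            if (result.errorType() == null) {
                continue; // indexed successfully, nothing to do
            }
            if (isTemporary.test(result.errorType())) {
                retry.add(result.messageId());
            } else {
                dropped.add(result.messageId());
            }
        }
        return new Outcome(retry, dropped);
    }
}
```

Only the ids in `retry` would be resubmitted; successfully indexed messages are not sent twice.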
Current Behavior
Unknown exception types are logged as indexer failures and the messages are dropped.
Possible Solution
If I understood the code correctly, MessagesAdapterOS2::errorTypeFromResponse should return IndexingError.Type.IndexBlocked for either the specific exception types or the status code (429).
It would also be very helpful to log the status code, not only the exception message, as I'm not sure whether all the exceptions shown above use status 429.
see also https://github.com/opensearch-project/OpenSearch/blob/main/libs/core/src/main/java/org/opensearch/ExceptionsHelper.java#L95
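A rough sketch of the classification I have in mind follows. It is not the actual `errorTypeFromResponse` code: `BulkItemErrorClassifier`, `Kind`, and `describe` are hypothetical names, and treating every `rejected_execution_exception` and `i_o_exception` as temporary is an assumption based only on the log lines above.

```java
import java.util.Set;

// Hypothetical sketch; none of these names exist in Graylog or OpenSearch.
final class BulkItemErrorClassifier {
    enum Kind { RETRYABLE, PERMANENT }

    // Assumption: these types, taken from the exceptions in this report, are temporary.
    private static final Set<String> ASSUMED_TEMPORARY_TYPES =
            Set.of("rejected_execution_exception", "i_o_exception");

    // Retry when the item status is 429 or the error type is on the assumed list.
    static Kind classify(String errorType, int httpStatus) {
        boolean temporaryType = errorType != null && ASSUMED_TEMPORARY_TYPES.contains(errorType);
        if (httpStatus == 429 || temporaryType) {
            return Kind.RETRYABLE;
        }
        return Kind.PERMANENT;
    }

    // Log line that includes the status code alongside the exception type and reason.
    static String describe(String errorType, int httpStatus, String reason) {
        return String.format("status=%d type=%s reason=%s", httpStatus, errorType, reason);
    }
}
```

Checking both the status and the type covers the case where one of these exceptions is reported with a status other than 429.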
Steps to Reproduce (for bugs)
I don't know which circumstances trigger the exceptions on the OpenSearch side; they do not occur at peak load time.
They seem to be related to S3 storage load/performance.
I would guess the same issue also happens for other rejections of individual messages in a bulk request, for example if a specific instance in a cluster has issues and an index is not spread across all instances.
Context
Data loss when OpenSearch instances have load issues. In our setup, this happens roughly once a day.
Your Environment
Graylog Version: 6.1.5 (docker image)
Java Version: (docker image)
OpenSearch Version: AWS OpenSearch Service with OS 2.17 and OR1 instances
MongoDB Version: 7.0.16
Operating System: (docker image)
Browser version:
I do not think these errors are returned as 429 by OpenSearch, because in that case they would be thrown as a TooManyRequestsException, which is retried after reducing the chunk size. Is there a chance you could find out which actual response OpenSearch returns at this point?
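For readers unfamiliar with that retry path, the idea can be sketched roughly as below. This is not the Graylog implementation: `ChunkedBulkSender` and `trySend` are hypothetical, halving the chunk size is an assumed reduction strategy, and a real sender would also wait between attempts.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "retry with a smaller chunk after a 429".
final class ChunkedBulkSender {
    // trySend returns false when the server rejects the chunk with 429.
    // Returns the number of bulk requests that were attempted.
    static <T> int send(List<T> messages, int initialChunkSize, Predicate<List<T>> trySend) {
        int chunkSize = Math.max(1, initialChunkSize);
        int offset = 0;
        int attempts = 0;
        while (offset < messages.size()) {
            int end = Math.min(offset + chunkSize, messages.size());
            attempts++;
            if (trySend.test(messages.subList(offset, end))) {
                offset = end; // chunk accepted, move on to the next one
            } else if (chunkSize > 1) {
                chunkSize = chunkSize / 2; // assumption: halve and retry the same offset
            } else {
                throw new IllegalStateException("rejected even with a chunk size of 1");
            }
        }
        return attempts;
    }
}
```

Errors that arrive per item inside an otherwise successful bulk response never reach this path, which would match the behaviour described in the report.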