Graylog drops messages as indexer failure, which were rejected due to load issues on Opensearch side #21616

Open
kcyd opened this issue Feb 12, 2025 · 3 comments

kcyd commented Feb 12, 2025

When OpenSearch rejects individual messages in a bulk index request because of problems on the OpenSearch side (load, networking, ...), Graylog treats the rejections as permanent errors: the messages are dropped and logged as indexer failures.

I've seen this with AWS OpenSearch Service on OR1 instances, which use S3 for remote storage.

These are the exceptions observed:

OpenSearchException[OpenSearch exception [type=rejected_execution_exception, reason=rejected execution on primary shard:[redacted_1234][6] due to remote segments lagging behind local segments.time_lag:12074 ms dynamic_time_lag_threshold:4048.0 ms]]

OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Failed to upload [redacted_filename]]; nested: OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Unable to upload object [redacted_path] using a single upload]]; nested: OpenSearchException[OpenSearch exception [type=sdk_client_exception, reason=sdk_client_exception: Unable to execute HTTP request: Connection reset by peer]]; nested: OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Connection reset by peer]];

OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Failed to upload 1 files during transfer]]

Expected Behavior

Messages in a bulk request that fail with temporary errors should be retried.

Current Behavior

Unknown exception types are logged as indexer failures and the messages are dropped.

Possible Solution

If I understood the code correctly, MessagesAdapterOS2::errorTypeFromResponse should return IndexingError.Type.IndexBlocked for either the specific exception types or the status code (429).
It would also be very helpful to log the status code, not only the exception message, as I'm not sure whether all the exceptions shown above use status 429.

See also https://github.com/opensearch-project/OpenSearch/blob/main/libs/core/src/main/java/org/opensearch/ExceptionsHelper.java#L95
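
To illustrate the idea, here is a minimal sketch of such a classification. It is not Graylog's actual code: `BulkErrorClassifier`, `Kind`, `classify` and `describe` are hypothetical names, and treating `rejected_execution_exception` and `i_o_exception` as transient is an assumption based only on the exceptions listed above.

```java
import java.util.Set;

// Hypothetical helper for illustration; not part of Graylog or OpenSearch.
final class BulkErrorClassifier {
    enum Kind { RETRYABLE, PERMANENT }

    // Assumption: these exception types (taken from the logs above) are transient.
    private static final Set<String> TRANSIENT_TYPES =
            Set.of("rejected_execution_exception", "i_o_exception");

    private static final int TOO_MANY_REQUESTS = 429;

    // Retryable if either the status code or the exception type looks transient.
    static Kind classify(String errorType, int status) {
        boolean transientType = errorType != null && TRANSIENT_TYPES.contains(errorType);
        if (status == TOO_MANY_REQUESTS || transientType) {
            return Kind.RETRYABLE;
        }
        return Kind.PERMANENT;
    }

    // Builds a log line that includes the status code next to the exception details.
    static String describe(String errorType, int status, String reason) {
        return String.format("type=%s status=%d reason=%s", errorType, status, reason);
    }
}
```

Checking both the type and the status covers the case where a transient exception is not reported with status 429, and `describe` shows one way to get the status code into the log.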

Steps to Reproduce (for bugs)

I don't know which circumstances trigger these exceptions on the OpenSearch side; they do not occur at peak load times.
They seem to be related to S3 storage load or performance.
I would guess the same issue also occurs for other rejections of individual messages in a bulk request, for example when a specific instance in a cluster has issues and an index is not spread across all instances.
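
Since a bulk response reports a status and an error per item, the individual rejections could be split from the permanent failures. The following is only a rough sketch: `BulkItem` and `BulkOutcome` are hypothetical types, and the rule "429 or any 5xx status is temporary" is an assumption, not something confirmed for the exceptions above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical view of one item of a bulk response; not a Graylog or OpenSearch class.
final class BulkItem {
    final String messageId;
    final int status;        // per-item HTTP status from the bulk response
    final String errorType;  // null when the item was indexed successfully

    BulkItem(String messageId, int status, String errorType) {
        this.messageId = messageId;
        this.status = status;
        this.errorType = errorType;
    }
}

// Splits failed items into those worth retrying and those to record as failures.
final class BulkOutcome {
    final List<String> retry = new ArrayList<>();
    final List<String> dropped = new ArrayList<>();

    static BulkOutcome of(List<BulkItem> items) {
        BulkOutcome outcome = new BulkOutcome();
        for (BulkItem item : items) {
            if (item.errorType == null) {
                continue; // indexed successfully, nothing to do
            }
            // Assumption: 429 and 5xx indicate a temporary problem on the server side.
            if (item.status == 429 || item.status >= 500) {
                outcome.retry.add(item.messageId);
            } else {
                outcome.dropped.add(item.messageId);
            }
        }
        return outcome;
    }
}
```

With this split, only the messages in `retry` would be resubmitted, while `dropped` would still end up as indexer failures.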

Context

Data loss when OpenSearch instances have load issues. In our setup, this happens roughly once a day.

Your Environment

  • Graylog Version: 6.1.5 (docker image)
  • Java Version: (docker image)
  • OpenSearch Version: AWS OpenSearch Service with OS 2.17 and OR1 instances
  • MongoDB Version: 7.0.16
  • Operating System: (docker image)
  • Browser version:

kcyd added the bug label Feb 12, 2025

dennisoelkers (Member) commented

Refs #21313.
Refs #21282.

dennisoelkers (Member) commented

Thanks for reporting this, @kcyd!

I do not think these errors are returned as 429 from OpenSearch, because in that case they would be thrown as a TooManyRequestsException, which is retried after reducing the chunk size. Is there a chance you could find out which actual response OpenSearch returns at this point?
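
For readers unfamiliar with that retry path, the general technique can be sketched as follows. This is a simplified illustration, not Graylog's implementation: `TooManyRequests` is a hypothetical stand-in for the exception named above, and `ChunkedSender` and `sendAll` are made-up names. A real implementation would presumably also wait between attempts.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an "overloaded, try again" signal; not Graylog's class.
final class TooManyRequests extends RuntimeException {}

final class ChunkedSender {
    // Sends all messages in chunks and halves the chunk size whenever the
    // sender signals overload. Returns the chunk size in use at the end.
    static <T> int sendAll(List<T> messages, int initialChunkSize, Consumer<List<T>> sender) {
        int chunkSize = Math.max(1, initialChunkSize);
        int offset = 0;
        while (offset < messages.size()) {
            int end = Math.min(offset + chunkSize, messages.size());
            try {
                sender.accept(messages.subList(offset, end));
                offset = end; // chunk accepted, move on
            } catch (TooManyRequests e) {
                if (chunkSize == 1) {
                    throw e; // cannot shrink further, give up
                }
                chunkSize = chunkSize / 2; // retry the same offset with a smaller chunk
            }
        }
        return chunkSize;
    }
}
```

The point of the sketch is that a rejected chunk is never skipped: the offset only advances after a successful send, so no message is lost while the chunk size shrinks.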

kmerz added the triaged label Feb 18, 2025

kmerz (Member) commented Feb 18, 2025

@dennisoelkers is that a dup, then?
