Graylog drops messages as indexer failure, which were rejected due to load issues on Opensearch side #21616

kcyd · 2025-02-12T08:30:23Z

When OpenSearch rejects individual messages in a bulk index request because of issues on the OpenSearch side (load, networking, ...), Graylog treats the rejections as permanent errors: the messages are dropped and logged as indexer failures.

I've seen this with AWS OpenSearch Service on OR1 instances, which use S3 for remote storage.

These are the exceptions observed:

OpenSearchException[OpenSearch exception [type=rejected_execution_exception, reason=rejected execution on primary shard:[redacted_1234][6] due to remote segments lagging behind local segments.time_lag:12074 ms dynamic_time_lag_threshold:4048.0 ms]]

OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Failed to upload [redacted_filename]]; nested: OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Unable to upload object [redacted_path] using a single upload]]; nested: OpenSearchException[OpenSearch exception [type=sdk_client_exception, reason=sdk_client_exception: Unable to execute HTTP request: Connection reset by peer]]; nested: OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Connection reset by peer]];

OpenSearchException[OpenSearch exception [type=i_o_exception, reason=Failed to upload 1 files during transfer]]

Expected Behavior

Messages in a bulk request that fail with temporary errors should be retried.

Current Behavior

Unknown exception types are logged as indexer failures and the messages are dropped.

Possible Solution

If I understood the code correctly, MessagesAdapterOS2::errorTypeFromResponse should return IndexingError.Type.IndexBlocked for either the specific exception types or the status code (429).
It would also be very helpful to log the status code, not only the exception message, as I'm not sure whether all the exceptions shown above use status 429.

see also https://github.com/opensearch-project/OpenSearch/blob/main/libs/core/src/main/java/org/opensearch/ExceptionsHelper.java#L95
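
To make the proposal concrete, here is a minimal sketch of the kind of classification meant above. It is not Graylog's actual code: the class `BulkItemErrorClassifier`, the `Disposition` enum and the `classify` method are hypothetical names used for illustration only. The exception type strings are taken from the errors quoted in this report. Treating every `i_o_exception` as temporary is an assumption that may be too broad in practice.

```java
import java.util.Set;

// Hypothetical helper for illustration; not part of Graylog or OpenSearch.
public class BulkItemErrorClassifier {

    // Hypothetical outcome of classifying one failed item of a bulk response.
    public enum Disposition { RETRY, DROP }

    // Exception types seen in this report. Assumption: all of them are temporary.
    private static final Set<String> TRANSIENT_TYPES = Set.of(
            "rejected_execution_exception",
            "i_o_exception",
            "sdk_client_exception");

    // errorType: the "type" field of the per-item error.
    // status: the per-item HTTP status, or 0 if it is not known.
    public static Disposition classify(String errorType, int status) {
        if (status == 429) {
            // Too Many Requests: the cluster asks the client to back off.
            return Disposition.RETRY;
        }
        if (errorType != null && TRANSIENT_TYPES.contains(errorType)) {
            return Disposition.RETRY;
        }
        // Anything else (for example a mapping error) is treated as permanent.
        return Disposition.DROP;
    }

    public static void main(String[] args) {
        System.out.println(classify("rejected_execution_exception", 0));
        System.out.println(classify("mapper_parsing_exception", 400));
    }
}
```

Checking the status code and the exception type independently means a rejection is still retried when only one of the two is recognizable, which matters here because it is unclear whether all of the exceptions above use status 429.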

Steps to Reproduce (for bugs)

I don't know which circumstances trigger the exceptions on the OpenSearch side; they do not occur at peak load time.
They seem to be related to S3 storage load/performance.
I would guess the same issue also occurs for other rejections of individual messages in a bulk request, for example if a specific instance in a cluster has issues and an index is not spread across all instances.

Context

Data loss when OpenSearch instances have load issues. In our setup, this happens roughly once a day.

Your Environment

Graylog Version: 6.1.5 (docker image)
Java Version: (docker image)
OpenSearch Version: AWS OpenSearch Service with OS 2.17 and OR1 instances
MongoDB Version: 7.0.16
Operating System: (docker image)
Browser version:


dennisoelkers · 2025-02-12T08:33:06Z

Refs #21313.
Refs #21282.

dennisoelkers · 2025-02-12T16:27:09Z

Thanks for reporting this, @kcyd!

I do not think these errors are returned as 429 from OpenSearch, because in that case they would be thrown as a TooManyRequestsException, which is retried after reducing the chunk size. Is there a chance you could find out which actual response OpenSearch returns at this point?
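
For readers unfamiliar with that retry path, the following is a rough sketch of what "retried after reducing the chunk size" can look like in general. It is not the Graylog implementation: `ChunkedBulkRetry`, `BulkSender`, `indexAll` and the nested `TooManyRequests` exception are hypothetical stand-ins, and halving the chunk size is an assumed strategy chosen for illustration.

```java
import java.util.List;

// Hypothetical sketch; names and the halving strategy are assumptions.
public class ChunkedBulkRetry {

    // Stand-in for an exception raised when the server answers with status 429.
    public static class TooManyRequests extends RuntimeException {}

    // Stand-in for whatever sends one bulk request to the search cluster.
    public interface BulkSender<T> {
        void send(List<T> chunk);
    }

    // Sends all messages in chunks. When a chunk is rejected with
    // TooManyRequests, the chunk size is halved and the same chunk is retried.
    // Returns the chunk size in use at the end.
    public static <T> int indexAll(List<T> messages, int initialChunkSize, BulkSender<T> sender) {
        int chunkSize = Math.max(1, initialChunkSize);
        int offset = 0;
        while (offset < messages.size()) {
            int end = Math.min(offset + chunkSize, messages.size());
            try {
                sender.send(messages.subList(offset, end));
                offset = end; // chunk accepted, move on
            } catch (TooManyRequests e) {
                if (chunkSize == 1) {
                    throw e; // cannot shrink further, give up
                }
                chunkSize = chunkSize / 2; // retry the same offset with a smaller chunk
            }
        }
        return chunkSize;
    }
}
```

The point relevant to this issue is that such a path only triggers when the whole bulk request is rejected with 429. Per-item failures inside an otherwise successful bulk response would not reach it.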

kmerz · 2025-02-18T12:53:58Z

@dennisoelkers is that a dup, then?

kcyd added the bug label Feb 12, 2025

kmerz added the triaged label Feb 18, 2025
