
Retry individual messages/requests when failing with 429/Data too large. #21313

Merged
dennisoelkers merged 20 commits into master from feat/retry-data-too-large-errors on Feb 17, 2025

Conversation

dennisoelkers (Member) commented Jan 10, 2025

Note: This needs a backport to 6.0 & 6.1.

Description

Motivation and Context

Before this PR, when OpenSearch nodes run out of heap space and circuit breakers are tripped during indexing, two things can happen:

  • The request for the entire batch is answered with a 429 Too Many Requests. Currently this leads to halving the batch and retrying; once the batch size reaches zero, retrying is abandoned.
  • Some individual bulk items succeed while others fail with a Data too large error. These failures are treated as permanent errors (like mapping exceptions) and are written to the failure processing collection if available, or simply dropped.

In order to improve this and avoid potential data loss, this PR changes the behavior as follows (a sketch of the resulting retry loop is shown further below):

  • When the overall request fails with a circuit_breaking_exception, we now use the retryer that is already used when the target index does not exist yet or the indexer master has not been discovered, retrying indefinitely with an exponential backoff while halving the chunk size until it reaches 1.
  • When individual items of a bulk request fail with a Data too large exception, we retry those indefinitely with an exponential backoff as well, just as we do for blocked indices.

Fixes #21282.
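
For illustration, the following is a minimal, self-contained Java sketch of the retry strategy described above. All names (ChunkedRetrySketch, ChunkIndexer, indexChunk) are hypothetical; this is not the actual ChunkedBulkIndexer code from this PR, and it already includes the transient-durability check that is discussed further down in this conversation.

import java.util.List;

public class ChunkedRetrySketch {
    // Thrown by the hypothetical chunk indexer when a circuit breaker trips.
    static class CircuitBreakerException extends RuntimeException {
        final boolean isTransient;
        CircuitBreakerException(boolean isTransient) { this.isTransient = isTransient; }
    }

    // Hypothetical callback that submits one chunk as a bulk request.
    interface ChunkIndexer {
        void indexChunk(List<String> chunk) throws CircuitBreakerException;
    }

    static void indexWithRetries(List<String> batch, ChunkIndexer indexer) throws InterruptedException {
        int chunkSize = batch.size();
        long backoffMs = 100;
        int offset = 0;
        while (offset < batch.size()) {
            final List<String> chunk = batch.subList(offset, Math.min(offset + chunkSize, batch.size()));
            try {
                indexer.indexChunk(chunk);
                offset += chunk.size();
                chunkSize = batch.size(); // reset the chunk size after a successful ingest
                backoffMs = 100;          // reset the backoff as well
            } catch (CircuitBreakerException e) {
                if (!e.isTransient) {
                    throw e; // permanent circuit breaker trips are not retried forever
                }
                chunkSize = Math.max(1, chunkSize / 2); // halve the chunk, but never below 1
                Thread.sleep(backoffMs);                // exponential backoff between attempts
                backoffMs = Math.min(backoffMs * 2, 30_000);
            }
        }
    }
}

The important property is that a transient circuit breaker trip never drops messages: the loop keeps shrinking the chunk down to a single message and keeps retrying with growing pauses until the node has room again.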

How Has This Been Tested?

I wrote an integration test that simulates this condition by setting indices.breaker.total.limit to a very low value. The same procedure was used to test it locally as well.
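
For reference, the parent circuit breaker can be lowered on a running node through the cluster settings API. Below is a minimal sketch assuming an unsecured OpenSearch node on localhost:9200; the exact percentage needs tuning per environment, and the class name LowerBreakerLimit is made up for illustration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LowerBreakerLimit {
    public static void main(String[] args) throws Exception {
        // Lower the parent breaker so that even moderate bulk requests trip it.
        final String body = "{\"transient\": {\"indices.breaker.total.limit\": \"10%\"}}";
        final HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        final HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

Restoring the limit afterwards works the same way, by sending null for the setting to reset it to its default.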

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@dennisoelkers dennisoelkers force-pushed the feat/retry-data-too-large-errors branch from 8217c3d to 8727f5a on January 10, 2025 10:55
thll (Contributor) commented Jan 28, 2025

In general, I think it makes sense to retry requests that failed because of resource constraints indefinitely. I'm wondering if we could run into a situation where the circuit-breaking exception has a durability of PERMANENT for a single message in a bulk indexing request. In that case we would grind indexing on the Graylog node to a halt, whereas before we would have a couple of messages written to the processing and indexing failures collection/index.
We should make sure that we never get into such a situation, or that the new behavior is correct even then. Even more so before considering a backport of the change.

dennisoelkers (Member, Author):

@thll: Thanks for the hint! I am now parsing the durability of the exception and retrying only if it is transient.
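
For context, a circuit_breaking_exception in the OpenSearch error response carries a durability field (TRANSIENT or PERMANENT). The snippet below is only a rough illustration of the idea with made-up names (DurabilityParser, isTransient); it is not the actual code from this PR.

import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurabilityParser {
    private static final Pattern DURABILITY = Pattern.compile("\"durability\"\\s*:\\s*\"(\\w+)\"");

    // Returns true if the error JSON reports a transient circuit breaker trip.
    static boolean isTransient(String errorJson) {
        final Matcher matcher = DURABILITY.matcher(errorJson);
        return matcher.find() && "TRANSIENT".equals(matcher.group(1).toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        final String example = "{\"error\":{\"type\":\"circuit_breaking_exception\"," +
                "\"reason\":\"[parent] Data too large\",\"durability\":\"TRANSIENT\"}}";
        System.out.println(isTransient(example)); // prints: true
    }
}

Only failures reported as TRANSIENT are retried indefinitely; anything reported as PERMANENT is not retried forever.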

@dennisoelkers dennisoelkers marked this pull request as ready for review January 28, 2025 13:41
thll (Contributor) left a comment

I think there is a situation where this change could make things worse, although I'm not sure whether it is rather theoretical.

We are no longer splitting up bulk requests when we hit certain circuit breaking exceptions. Assume we get a really large bulk request because of unusually large individual messages. OpenSearch might respond with a Data too large circuit breaking exception. Depending on the circuit breaker settings of the node and the current heap pressure, the only way to ever get the bulk request successfully indexed would be to split it up further.

I can replicate this with OpenSearch by slightly modifying one integration test like this:

Index: graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java
===================================================================
diff --git a/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java b/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java
--- a/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java	(revision f83ded02d570ab33fe4883feee31cf76d93bd240)
+++ b/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java	(date 1738141051158)
@@ -67,6 +67,8 @@
 
     @Test
     public void testIfLargeBatchesGetSplitUpOnCircuitBreakerExceptions() throws Exception {
+        client().setRequestCircuitBreakerLimit("10%");
+
         // This test assumes that ES is running with only 256MB heap size.
         // This will trigger the circuit breaker when we are trying to index large batches
         final int MESSAGECOUNT = 50;

The setting might need a bit of tuning; for me it needed to be a lot higher for Elasticsearch. The above worked with OpenSearch for me. If chosen correctly, the test passes on current master but fails with this PR.

This might be a rather artificial situation, because I think that most of the time other circuit breakers, like Request Entity Too Large or Too Many Requests, will kick in. Those are still handled correctly by splitting up the batch like before.

If we want to be on the safe side, I suggest we try splitting up the bulk request further for any circuit breaking exception, like before. But instead of giving up once we cannot split the batch further, we could keep trying (with some kind of backoff mechanism) if the circuit breaker exception is transient.

@dennisoelkers dennisoelkers requested a review from thll February 12, 2025 17:35
thll (Contributor) left a comment

The code LGTM and my concerns have been addressed. I did not test this manually, though.

@thll thll self-requested a review February 17, 2025 08:38
@dennisoelkers dennisoelkers merged commit 9b26b2e into master Feb 17, 2025
6 checks passed
@dennisoelkers dennisoelkers deleted the feat/retry-data-too-large-errors branch February 17, 2025 08:43
dennisoelkers added a commit that referenced this pull request Feb 17, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313)

* Retry individual messages/requests when failing with `429`/`Data too large`.

* Adding changelog snippet.

* Adding license header.

* Parsing durability from exception, retry only if transient.

* Minor improvement.

* Providing wrapper function for lowering/restoring circuit breaker limits.

* Retry circuit breaker exceptions forever while halving chunks, moving test.

* Adding license headers.

* Retry only transient circuit breaker exceptions forever, add test.

* Extracting common logic into `ChunkedBulkIndexer`.

* Adjusting test.

* Reset chunk size after successful ingest when previous attempt resulted in `CircuitBreakerException`.

* Fixing up test to avoid infinite loop.

---------

Co-authored-by: Bernd Ahlers <[email protected]>
dennisoelkers added a commit that referenced this pull request Feb 17, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313)

dennisoelkers added a commit that referenced this pull request Feb 18, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313) (#21675)

dennisoelkers added a commit that referenced this pull request Feb 18, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313) (#21677)

Successfully merging this pull request may close these issues.

OpenSearch indexing error causes lost messages