
Retry individual messages/requests when failing with 429/Data too large. #21313

Merged
dennisoelkers merged 20 commits into master from feat/retry-data-too-large-errors on Feb 17, 2025

Conversation

dennisoelkers (Member) commented Jan 10, 2025

Note: This needs a backport to 6.0 & 6.1.

Description

Motivation and Context

Before this PR, when OpenSearch nodes run out of heap space and circuit breakers are tripped during indexing, two things can happen:

  • The request for the entire batch is answered with a 429 Too Many Requests. Currently this leads to halving the batch and retrying; once the batch size reaches zero, retrying is abandoned.
  • Some individual bulk items succeed while others fail with a Data too large error. These failures are treated as permanent errors (like mapping exceptions) and are written to the failure processing collection if available, or simply dropped.

In order to improve this and avoid potential data loss, this PR changes the behavior as follows (a sketch of the resulting retry loop is shown further below):

  • When the overall request fails with a circuit_breaking_exception, we now use the retryer that is already used when the target index does not exist yet or the indexer master has not been discovered, retrying indefinitely with an exponential backoff while halving the chunk size until it reaches 1.
  • When individual items of a bulk request fail with a Data too large exception, we retry those indefinitely with an exponential backoff as well, just as we do for blocked indices.

Fixes #21282.
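
For illustration, the following is a minimal, self-contained Java sketch of the retry strategy described above. All names (ChunkedRetrySketch, ChunkIndexer, indexChunk) are hypothetical; this is not the actual ChunkedBulkIndexer code from this PR, and it already includes the transient-durability check that is discussed further down in this conversation.

import java.util.List;

public class ChunkedRetrySketch {
    // Thrown by the hypothetical chunk indexer when a circuit breaker trips.
    static class CircuitBreakerException extends RuntimeException {
        final boolean isTransient;
        CircuitBreakerException(boolean isTransient) { this.isTransient = isTransient; }
    }

    // Hypothetical callback that submits one chunk as a bulk request.
    interface ChunkIndexer {
        void indexChunk(List<String> chunk) throws CircuitBreakerException;
    }

    static void indexWithRetries(List<String> batch, ChunkIndexer indexer) throws InterruptedException {
        int chunkSize = batch.size();
        long backoffMs = 100;
        int offset = 0;
        while (offset < batch.size()) {
            final List<String> chunk = batch.subList(offset, Math.min(offset + chunkSize, batch.size()));
            try {
                indexer.indexChunk(chunk);
                offset += chunk.size();
                chunkSize = batch.size(); // reset the chunk size after a successful ingest
                backoffMs = 100;          // reset the backoff as well
            } catch (CircuitBreakerException e) {
                if (!e.isTransient) {
                    throw e; // permanent circuit breaker trips are not retried forever
                }
                chunkSize = Math.max(1, chunkSize / 2); // halve the chunk, but never below 1
                Thread.sleep(backoffMs);                // exponential backoff between attempts
                backoffMs = Math.min(backoffMs * 2, 30_000);
            }
        }
    }
}

The important property is that a transient circuit breaker trip never drops messages: the loop keeps shrinking the chunk down to a single message and keeps retrying with growing pauses until the node has room again.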

How Has This Been Tested?

I wrote an integration test that simulates this condition by setting indices.breaker.total.limit to a very low value. The same procedure was used to test it locally as well.
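
For reference, the parent circuit breaker can be lowered on a running node through the cluster settings API. Below is a minimal sketch assuming an unsecured OpenSearch node on localhost:9200; the exact percentage needs tuning per environment, and the class name LowerBreakerLimit is made up for illustration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LowerBreakerLimit {
    public static void main(String[] args) throws Exception {
        // Lower the parent breaker so that even moderate bulk requests trip it.
        final String body = "{\"transient\": {\"indices.breaker.total.limit\": \"10%\"}}";
        final HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        final HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

Restoring the limit afterwards works the same way, by sending null for the setting to reset it to its default.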

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@dennisoelkers dennisoelkers force-pushed the feat/retry-data-too-large-errors branch from 8217c3d to 8727f5a on January 10, 2025 10:55
thll (Contributor) commented Jan 28, 2025

In general, I think it makes sense to retry requests that failed because of resource constraints indefinitely. I'm wondering if we could run into a situation where the circuit-breaking exception has a durability of PERMANENT for a single message in a bulk indexing request. In that case we would grind indexing on the Graylog node to a halt, whereas before we would have a couple of messages written to the processing and indexing failures collection/index.
We should make sure that we never get into such a situation, or that the new behavior is correct even then. Even more so before considering a backport of the change.

dennisoelkers (Member, Author):

@thll: Thanks for the hint! I am now parsing the durability of the exception and retrying only if it is transient.
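
For context, a circuit_breaking_exception in the OpenSearch error response carries a durability field (TRANSIENT or PERMANENT). The snippet below is only a rough illustration of the idea with made-up names (DurabilityParser, isTransient); it is not the actual code from this PR.

import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurabilityParser {
    private static final Pattern DURABILITY = Pattern.compile("\"durability\"\\s*:\\s*\"(\\w+)\"");

    // Returns true if the error JSON reports a transient circuit breaker trip.
    static boolean isTransient(String errorJson) {
        final Matcher matcher = DURABILITY.matcher(errorJson);
        return matcher.find() && "TRANSIENT".equals(matcher.group(1).toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        final String example = "{\"error\":{\"type\":\"circuit_breaking_exception\"," +
                "\"reason\":\"[parent] Data too large\",\"durability\":\"TRANSIENT\"}}";
        System.out.println(isTransient(example)); // prints: true
    }
}

Only failures reported as TRANSIENT are retried indefinitely; anything reported as PERMANENT is not retried forever.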

@dennisoelkers dennisoelkers marked this pull request as ready for review January 28, 2025 13:41
thll (Contributor) left a comment

I think there is a situation where this change could make things worse, although I'm not sure whether it is rather theoretical.

We are no longer splitting up bulk requests when we hit certain circuit breaking exceptions. Assume we get a really large bulk request because of unusually large individual messages. OpenSearch might respond with a Data too large circuit breaking exception. Depending on the circuit breaker settings of the node and the current heap pressure, the only way to ever get the bulk request successfully indexed would be to split it up further.

I can replicate this with OpenSearch by slightly modifying one integration test like this:

Index: graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java
===================================================================
diff --git a/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java b/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java
--- a/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java	(revision f83ded02d570ab33fe4883feee31cf76d93bd240)
+++ b/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java	(date 1738141051158)
@@ -67,6 +67,8 @@
 
     @Test
     public void testIfLargeBatchesGetSplitUpOnCircuitBreakerExceptions() throws Exception {
+        client().setRequestCircuitBreakerLimit("10%");
+
         // This test assumes that ES is running with only 256MB heap size.
         // This will trigger the circuit breaker when we are trying to index large batches
         final int MESSAGECOUNT = 50;

The setting might need a bit of tuning; for me it needed to be a lot higher for Elasticsearch. The above worked with OpenSearch for me. If chosen correctly, the test passes on current master but fails with this PR.

This might be a rather artificial situation, because I think that most of the time other circuit breakers, like Request Entity Too Large or Too Many Requests, will kick in. Those are still handled correctly by splitting up the batch like before.

If we want to be on the safe side, I suggest we try splitting up the bulk request further for any circuit breaking exception, like before. But instead of giving up once we cannot split the batch further, we could keep trying (with some kind of backoff mechanism) if the circuit breaker exception is transient.

@dennisoelkers dennisoelkers requested a review from thll February 12, 2025 17:35
thll (Contributor) left a comment

The code LGTM and my concerns have been addressed. I did not test this manually, though.

@thll thll self-requested a review February 17, 2025 08:38
@dennisoelkers dennisoelkers merged commit 9b26b2e into master Feb 17, 2025
6 checks passed
@dennisoelkers dennisoelkers deleted the feat/retry-data-too-large-errors branch February 17, 2025 08:43
dennisoelkers added a commit that referenced this pull request Feb 17, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313)

* Retry individual messages/requests when failing with `429`/`Data too large`.

* Adding changelog snippet.

* Adding license header.

* Parsing durability from exception, retry only if transient.

* Minor improvement.

* Providing wrapper function for lowering/restoring circuit breaker limits.

* Retry circuit breaker exceptions forever while halving chunks, moving test.

* Adding license headers.

* Retry only transient circuit breaker exceptions forever, add test.

* Extracting common logic into `ChunkedBulkIndexer`.

* Adjusting test.

* Reset chunk size after successful ingest when previous attempt resulted in `CircuitBreakerException`.

* Fixing up test to avoid infinite loop.

---------

Co-authored-by: Bernd Ahlers <[email protected]>
dennisoelkers added a commit that referenced this pull request Feb 17, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313)

dennisoelkers added a commit that referenced this pull request Feb 18, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313) (#21675)

dennisoelkers added a commit that referenced this pull request Feb 18, 2025
Retry individual messages/requests when failing with `429`/`Data too large`. (#21313) (#21677)

Successfully merging this pull request may close these issues.

OpenSearch indexing error causes lost messages