Retry individual messages/requests when failing with `429`/`Data too large` #21313
Conversation
In general, I think it makes sense to retry requests that failed because of resource constraints indefinitely. I'm wondering if we could run into a situation where the circuit-breaking exception has a permanent durability, so we would keep retrying a request that can never succeed.
@thll: Thanks for the hint! I am now parsing the durability of the exception and retrying it only if it is transient.
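For reference, recent OpenSearch versions include a `durability` field (`TRANSIENT` or `PERMANENT`) in the error body of a `circuit_breaking_exception`. A minimal sketch of the check, assuming the response is available as a Jackson `JsonNode`; the actual parsing in graylog2-server may look different:

```java
import com.fasterxml.jackson.databind.JsonNode;

class CircuitBreakerDurability {
    /**
     * Returns true if the error body describes a transient circuit_breaking_exception,
     * i.e. one that may succeed later once heap pressure goes down.
     * Sketch only; field names are taken from the OpenSearch error format.
     */
    static boolean isTransientCircuitBreaker(JsonNode errorResponse) {
        final JsonNode error = errorResponse.path("error");
        if (!"circuit_breaking_exception".equals(error.path("type").asText())) {
            return false;
        }
        return "TRANSIENT".equalsIgnoreCase(error.path("durability").asText());
    }
}
```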
I think there is a situation where this change could make things worse, although it may be rather theoretical.
We are no longer splitting up bulk requests if we hit certain circuit breaking exceptions. Assume we get a really large bulk request because of unusually large individual messages. OpenSearch might respond with a `Data too large` circuit breaking exception. Depending on the circuit breaker settings of the node and the current heap pressure, the only way to ever get the bulk request successfully indexed would be to split it up further.
I can replicate this with OpenSearch by slightly modifying one integration test like this:
Index: graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java
===================================================================
diff --git a/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java b/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java
--- a/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java (revision f83ded02d570ab33fe4883feee31cf76d93bd240)
+++ b/graylog2-server/src/test/java/org/graylog2/indexer/messages/MessagesBatchIT.java (date 1738141051158)
@@ -67,6 +67,8 @@
     @Test
     public void testIfLargeBatchesGetSplitUpOnCircuitBreakerExceptions() throws Exception {
+        client().setRequestCircuitBreakerLimit("10%");
+
         // This test assumes that ES is running with only 256MB heap size.
         // This will trigger the circuit breaker when we are trying to index large batches
         final int MESSAGECOUNT = 50;
The setting might need a bit of tuning; for me it needed to be a lot higher for Elasticsearch, but the above worked with OpenSearch. If chosen correctly, the test passes on current master but fails with this PR.
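For context, a helper like `setRequestCircuitBreakerLimit` would presumably lower `indices.breaker.request.limit` through the cluster settings API. A minimal sketch, assuming the low-level OpenSearch `RestClient`; the actual test client in graylog2-server may implement this differently:

```java
import java.io.IOException;

import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

class CircuitBreakerTestHelper {
    /** Lowers the request circuit breaker so that moderately sized bulk requests trip it (sketch only). */
    static void setRequestCircuitBreakerLimit(RestClient restClient, String limit) throws IOException {
        final Request request = new Request("PUT", "/_cluster/settings");
        request.setJsonEntity("{\"transient\":{\"indices.breaker.request.limit\":\"" + limit + "\"}}");
        restClient.performRequest(request);
    }
}
```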
This might be a rather artificial situation because I think most of the time, other circuit breakers like `Request Entity Too Large` or `Too many requests` will kick in. Those are still correctly handled by splitting up the batch like before.
If we want to be on the safe side, I suggest we try splitting up the bulk request further for any circuit breaking exception, like before. But instead of giving up once we cannot split the batch further, we could keep trying (with some kind of backoff mechanism) if the circuit breaker exception is transient.
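A rough sketch of that suggestion: halve the chunk on any circuit breaker exception, and once the chunk cannot be split further, keep retrying with a backoff only while the exception is transient. All type names here (`BulkIndexer`, `CircuitBreakerException`) are illustrative placeholders, not the actual graylog2-server API:

```java
import java.util.List;

// Illustrative placeholders, not the actual graylog2-server types.
interface BulkIndexer {
    void bulkIndex(List<String> chunk) throws CircuitBreakerException;
}

class CircuitBreakerException extends Exception {
    private final boolean transientBreaker;
    CircuitBreakerException(boolean transientBreaker) { this.transientBreaker = transientBreaker; }
    boolean isTransient() { return transientBreaker; }
}

class ChunkedRetrySketch {
    /** Halve the chunk on circuit breaker errors; once at size 1, back off and retry only if transient. */
    static void indexWithChunking(List<String> messages, BulkIndexer indexer) throws Exception {
        int chunkSize = Math.max(1, messages.size());
        long backoffMs = 500;
        int offset = 0;
        while (offset < messages.size()) {
            final List<String> chunk = messages.subList(offset, Math.min(offset + chunkSize, messages.size()));
            try {
                indexer.bulkIndex(chunk);
                offset += chunk.size();
                backoffMs = 500; // reset the backoff after a successful chunk
            } catch (CircuitBreakerException e) {
                if (chunkSize > 1) {
                    chunkSize = chunkSize / 2; // try again with a smaller chunk first
                } else if (e.isTransient()) {
                    Thread.sleep(backoffMs); // wait for heap pressure to ease, then retry the same chunk
                    backoffMs = Math.min(backoffMs * 2, 30_000);
                } else {
                    throw e; // permanent circuit breaker: splitting/retrying will not help
                }
            }
        }
    }
}
```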
The code LGTM and my concerns have been addressed. I did not test this manually, though.
…large`. (#21313)
* Retry individual messages/requests when failing with `429`/`Data too large`.
* Adding changelog snippet.
* Adding license header.
* Parsing durability from exception, retry only if transient.
* Minor improvement.
* Providing wrapper function for lowering/restoring circuit breaker limits.
* Retry circuit breaker exceptions forever while halving chunks, moving test.
* Adding license headers.
* Retry only transient circuit breaker exceptions forever, add test.
* Extracting common logic into `ChunkedBulkIndexer`.
* Adjusting test.
* Reset chunk size after successful ingest when previous attempt resulted in `CircuitBreakerException`.
* Fixing up test to avoid infinite loop.
Co-authored-by: Bernd Ahlers <[email protected]>
…large`. (#21313) (#21675)
…large`. (#21313) (#21677)
Note: This needs a backport to 6.0 & 6.1.
Description
Motivation and Context
Before this PR, when OpenSearch nodes run out of heap space or circuit breakers are tripped during indexing, two things can happen:
- `429 Too Many Requests` - Currently this leads to halving the batch and retrying. When the batch size has reached zero, retrying will be given up.
- `Data too large` error - Failures will be treated as permanent errors (like mapping exceptions) and will be written to the failure processing collection if available, or just dropped.
In order to improve this and avoid potential data loss, this PR changes this to:
- For a `circuit_breaking_exception`, we are now using the retryer used for conditions where the target index does not exist or the indexer master is not discovered yet, to retry indefinitely with an exponential backoff, while halving chunks until they reach a size of 1 (a sketch of such a retryer follows below).
- For a `Data too large` exception, we will retry indefinitely with an exponential backoff as well, just as for blocked indices.
Fixes #21282.
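The indefinite retry with exponential backoff could, for example, be built with the guava-retrying library; whether this matches the retryer actually used in graylog2-server is an assumption, and the retry condition below (`IOException`) is only a stand-in for the real transient circuit breaker check:

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

import com.github.rholder.retry.Retryer;
import com.github.rholder.retry.RetryerBuilder;
import com.github.rholder.retry.StopStrategies;
import com.github.rholder.retry.WaitStrategies;

class IndefiniteRetrySketch {
    /**
     * Runs the indexing attempt until it succeeds, never giving up, with an
     * exponentially growing wait capped at 30 seconds between attempts.
     */
    static <T> T retryIndefinitely(Callable<T> indexingAttempt) throws Exception {
        final Retryer<T> retryer = RetryerBuilder.<T>newBuilder()
                .retryIfExceptionOfType(IOException.class) // stand-in for "transient circuit breaker"
                .withWaitStrategy(WaitStrategies.exponentialWait(30, TimeUnit.SECONDS))
                .withStopStrategy(StopStrategies.neverStop())
                .build();
        return retryer.call(indexingAttempt);
    }
}
```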
How Has This Been Tested?
I wrote an integration test trying to simulate this condition by setting the `indices.breaker.total.limit` setting to a very low limit. The same procedure was used to test it locally as well.
Screenshots (if appropriate):
Types of changes
Checklist: