fix: Be more tolerant of intra-cluster latency when running mutations on shards #29229
base: master
Conversation
PR Summary
Added a `RetryPolicy` class to handle intra-cluster latency issues when running mutations on shards, making the system more resilient when checking mutation status.
- Implemented `RetryPolicy` in `posthog/clickhouse/cluster.py` that wraps callables with configurable retry logic (max attempts, delay, exception types)
- Applied the retry policy to mutation status checks in the `run_on_shards` method with 3 retry attempts and a 10-second delay
- Added comprehensive test cases in `test_retry_policy()` covering successful, flaky, and consistently failing functions
- Improved error handling with proper exception propagation and logging of retry attempts
- Added helper function `format_exception_summary()` to provide concise error messages in logs
2 file(s) reviewed, no comment(s)
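The summary above describes the retry wrapper only in broad strokes. Below is a minimal sketch of what such a policy could look like; the class name comes from the summary, but the constructor parameters and the decorator-style usage are assumptions rather than the PR's actual interface:

```python
import logging
import time
from collections.abc import Callable
from typing import Any

logger = logging.getLogger(__name__)


class RetryPolicy:
    """Sketch of a retry wrapper; parameter names are assumptions, not the PR's API."""

    def __init__(
        self,
        max_attempts: int = 3,
        delay: float = 10.0,
        exceptions: tuple[type[Exception], ...] = (Exception,),
    ) -> None:
        self.max_attempts = max_attempts
        self.delay = delay
        self.exceptions = exceptions

    def __call__(self, callable: Callable[..., Any]) -> Callable[..., Any]:
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, self.max_attempts + 1):
                try:
                    return callable(*args, **kwargs)
                except self.exceptions:
                    if attempt == self.max_attempts:
                        # Out of attempts: let the original exception propagate.
                        raise
                    logger.warning(
                        "Failed to invoke %r (attempt #%s), retrying in %s...",
                        callable,
                        attempt,
                        self.delay,
                    )
                    time.sleep(self.delay)

        return wrapped
```

Applied to the per-shard status check, a policy like this would tolerate a few "mutation not found yet" failures before surfacing the error.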
logger.warning(
    "Failed to invoke %r (attempt #%s), retrying in %s...", self.callable, attempt, self.delay
)
I'd like to figure out if it's possible to get these logs in the Dagster UI without passing their logger down the stack everywhere, but that seems reasonable to do as a follow-up.
this will def be important to have
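One option worth evaluating for the follow-up (my suggestion, not something this PR implements) is Dagster's `get_dagster_logger()`, which returns a standard Python logger whose records are captured into the run's event log when the code executes inside a Dagster run, so no logger has to be threaded down the stack:

```python
from dagster import get_dagster_logger

# Any module deep in the call stack can grab this logger directly; when the
# code runs inside a Dagster op or asset, the records appear in the run's
# structured log in the Dagster UI. Outside of a run it behaves like a
# normal Python logger.
logger = get_dagster_logger()


def wait_for_mutation_on_shard(shard: int) -> None:
    # Hypothetical helper, just to illustrate logging without passing context.log around.
    logger.warning("mutation not yet visible on shard %s, retrying...", shard)
```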
Problem
`MutationRunner.run_on_shards` sometimes tries to check mutation status on shards before the shard is aware of the mutation. This seems especially likely when those servers are under heavy load.
Changes
Wraps the mutation waiter in a retry policy that retries several times before giving up when a mutation is not yet available.
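Concretely, the change amounts to something like the sketch below, reusing the `RetryPolicy` sketch from earlier; the helper names and the error type are hypothetical, and only the 3 attempts and 10-second delay come from the PR summary:

```python
# Hypothetical illustration of wrapping a per-shard status check; not the PR's exact code.


class MutationNotFound(Exception):
    """Raised when a shard does not yet know about the mutation (hypothetical error type)."""


def check_mutation_status(shard: int) -> bool:
    # Placeholder for the real query against system.mutations on that shard.
    raise MutationNotFound(f"mutation not visible on shard {shard} yet")


retry_policy = RetryPolicy(max_attempts=3, delay=10.0, exceptions=(MutationNotFound,))

# Each shard's waiter gets wrapped, so a shard that has not registered the
# mutation yet gets a few more chances before the job fails.
wait_for_shard = retry_policy(lambda: check_mutation_status(0))
```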
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Added unit test for retry policy, change to the job is covered by existing integration test