
fix: Be more tolerant of intra-cluster latency when running mutations on shards #29229

Open · wants to merge 4 commits into master
Conversation

tkaemming
Contributor

Problem

MutationRunner.run_on_shards sometimes tries to check mutation status on a shard before that shard is aware of the mutation. This seems especially likely when those servers are under heavy load.

Changes

Wraps the mutation waiter in a retry policy that retries several times before giving up when a mutation is not yet available.
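
As a rough illustration of the approach, here is a minimal sketch of a retry wrapper along these lines; the class name RetryPolicy comes from this PR, but the constructor signature, parameter names, and defaults shown here are assumptions, not the actual implementation:

import logging
import time
from typing import Any, Callable

logger = logging.getLogger(__name__)

class RetryPolicy:
    """Wraps a callable and retries it on selected exceptions before giving up."""

    def __init__(
        self,
        callable: Callable[..., Any],
        max_attempts: int = 3,
        delay: float = 10.0,
        exceptions: tuple[type[Exception], ...] = (Exception,),
    ) -> None:
        self.callable = callable
        self.max_attempts = max_attempts
        self.delay = delay
        self.exceptions = exceptions

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.callable(*args, **kwargs)
            except self.exceptions:
                if attempt == self.max_attempts:
                    raise  # out of attempts: let the final error propagate
                logger.warning(
                    "Failed to invoke %r (attempt #%s), retrying in %s...", self.callable, attempt, self.delay
                )
                time.sleep(self.delay)

With this shape, the mutation waiter can be wrapped once and invoked per shard, and a failure that exhausts the attempt budget still surfaces the original exception.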

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Added a unit test for the retry policy; the change to the job is covered by an existing integration test.

@tkaemming tkaemming marked this pull request as ready for review February 26, 2025 01:57
@tkaemming tkaemming requested a review from a team February 26, 2025 01:57
Contributor

@greptile-apps greptile-apps bot left a comment


PR Summary

Added a RetryPolicy class to handle intra-cluster latency issues when running mutations on shards, making the system more resilient when checking mutation status.

  • Implemented RetryPolicy in posthog/clickhouse/cluster.py that wraps callables with configurable retry logic (max attempts, delay, exception types)
  • Applied retry policy to mutation status checks in run_on_shards method with 3 retry attempts and 10-second delay
  • Added comprehensive test cases in test_retry_policy() covering successful, flaky, and consistently failing functions
  • Improved error handling with proper exception propagation and logging of retry attempts
  • Added helper function format_exception_summary() to provide concise error messages in logs

2 file(s) reviewed, no comment(s)
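
For reference, a minimal sketch of what the successful, flaky, and consistently failing cases described in the summary above could look like, reusing the illustrative RetryPolicy signature from the Changes section (the real test_retry_policy() may be structured differently):

import pytest

from posthog.clickhouse.cluster import RetryPolicy  # module named in the summary above

def test_retry_policy_sketch() -> None:
    # Successful callable: returns on the first attempt, no retries needed.
    assert RetryPolicy(lambda: "ok", max_attempts=3, delay=0)() == "ok"

    # Flaky callable: fails twice, then succeeds within the attempt budget.
    outcomes = iter([ValueError("mutation not found"), ValueError("mutation not found"), "ok"])

    def flaky():
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    assert RetryPolicy(flaky, max_attempts=3, delay=0)() == "ok"

    # Consistently failing callable: exhausts all attempts and re-raises the last error.
    def always_fails():
        raise ValueError("mutation not found")

    with pytest.raises(ValueError):
        RetryPolicy(always_fails, max_attempts=3, delay=0)()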

Comment on lines +339 to +341
logger.warning(
    "Failed to invoke %r (attempt #%s), retrying in %s...", self.callable, attempt, self.delay
)
Contributor Author

@tkaemming tkaemming Feb 26, 2025


I'd like to figure out whether it's possible to get these logs into the Dagster UI without passing their logger down the stack everywhere, but that seems reasonable to do as a follow-up.
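
One possible route for that follow-up, noted here only as an assumption: Dagster's get_dagster_logger() returns a logger whose records appear in the run logs in the Dagster UI without threading an op or asset context through the call stack, and Dagster can also be configured to capture standard-library loggers (managed_python_loggers), which might avoid code changes entirely. A sketch of the first option:

from dagster import get_dagster_logger

# Hypothetical: emit the retry warning through Dagster's logger so it appears in
# the run logs in the UI; the message mirrors the snippet above, and the callable
# name passed here is a placeholder.
dagster_logger = get_dagster_logger()
dagster_logger.warning("Failed to invoke %r (attempt #%s), retrying in %s...", "check_mutation_status", 1, 10.0)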

Member

@fuziontech fuziontech left a comment


this will def be important to have
