fix: Be more tolerant of intra-cluster latency when running mutations on shards #29229
base: master
Conversation
PR Summary
Added a `RetryPolicy` class to handle intra-cluster latency issues when running mutations on shards, making the system more resilient when checking mutation status.
- Implemented `RetryPolicy` in `posthog/clickhouse/cluster.py` that wraps callables with configurable retry logic (max attempts, delay, exception types)
- Applied the retry policy to mutation status checks in the `run_on_shards` method with 3 retry attempts and a 10-second delay
- Added comprehensive test cases in `test_retry_policy()` covering successful, flaky, and consistently failing functions
- Improved error handling with proper exception propagation and logging of retry attempts
- Added helper function `format_exception_summary()` to provide concise error messages in logs
2 file(s) reviewed, no comment(s)
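The summary above describes the retry wrapper only in broad strokes. Below is a minimal sketch of what such a policy could look like; the class name comes from the summary, but the constructor parameters and the decorator-style usage are assumptions rather than the PR's actual interface:

```python
import logging
import time
from collections.abc import Callable
from typing import Any

logger = logging.getLogger(__name__)


class RetryPolicy:
    """Sketch of a retry wrapper; parameter names are assumptions, not the PR's API."""

    def __init__(
        self,
        max_attempts: int = 3,
        delay: float = 10.0,
        exceptions: tuple[type[Exception], ...] = (Exception,),
    ) -> None:
        self.max_attempts = max_attempts
        self.delay = delay
        self.exceptions = exceptions

    def __call__(self, callable: Callable[..., Any]) -> Callable[..., Any]:
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, self.max_attempts + 1):
                try:
                    return callable(*args, **kwargs)
                except self.exceptions:
                    if attempt == self.max_attempts:
                        # Out of attempts: let the original exception propagate.
                        raise
                    logger.warning(
                        "Failed to invoke %r (attempt #%s), retrying in %s...",
                        callable,
                        attempt,
                        self.delay,
                    )
                    time.sleep(self.delay)

        return wrapped
```

Applied to the per-shard status check, a policy like this would tolerate a few "mutation not found yet" failures before surfacing the error.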
logger.warning(
    "Failed to invoke %r (attempt #%s), retrying in %s...", self.callable, attempt, self.delay
)
I'd like to figure out if it's possible to get these logs in the Dagster UI without passing their logger down the stack everywhere, but that seems reasonable to do as a follow-up.
this will def be important to have
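One option worth evaluating for the follow-up (my suggestion, not something this PR implements) is Dagster's `get_dagster_logger()`, which returns a standard Python logger whose records are captured into the run's event log when the code executes inside a Dagster run, so no logger has to be threaded down the stack:

```python
from dagster import get_dagster_logger

# Any module deep in the call stack can grab this logger directly; when the
# code runs inside a Dagster op or asset, the records appear in the run's
# structured log in the Dagster UI. Outside of a run it behaves like a
# normal Python logger.
logger = get_dagster_logger()


def wait_for_mutation_on_shard(shard: int) -> None:
    # Hypothetical helper, just to illustrate logging without passing context.log around.
    logger.warning("mutation not yet visible on shard %s, retrying...", shard)
```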
Problem
`MutationRunner.run_on_shards` sometimes tries to check mutation status on shards before the shard is aware of the mutation. This seems especially likely when those servers are under heavy load.
Changes
Wraps the mutation waiter in a retry policy that retries several times before giving up when a mutation is not yet available.
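Concretely, the change amounts to something like the sketch below, reusing the `RetryPolicy` sketch from earlier; the helper names and the error type are hypothetical, and only the 3 attempts and 10-second delay come from the PR summary:

```python
# Hypothetical illustration of wrapping a per-shard status check; not the PR's exact code.


class MutationNotFound(Exception):
    """Raised when a shard does not yet know about the mutation (hypothetical error type)."""


def check_mutation_status(shard: int) -> bool:
    # Placeholder for the real query against system.mutations on that shard.
    raise MutationNotFound(f"mutation not visible on shard {shard} yet")


retry_policy = RetryPolicy(max_attempts=3, delay=10.0, exceptions=(MutationNotFound,))

# Each shard's waiter gets wrapped, so a shard that has not registered the
# mutation yet gets a few more chances before the job fails.
wait_for_shard = retry_policy(lambda: check_mutation_status(0))
```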
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Added unit test for retry policy, change to the job is covered by existing integration test