-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve errant GTID detection in ERS to handle more cases. #16926
base: main
Are you sure you want to change the base?
Improve errant GTID detection in ERS to handle more cases. #16926
Conversation
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #16926 +/- ##
==========================================
+ Coverage 68.98% 69.37% +0.38%
==========================================
Files 1562 1571 +9
Lines 200690 204339 +3649
==========================================
+ Hits 138449 141750 +3301
- Misses 62241 62589 +348 ☔ View full report in Codecov by Sentry. |
…parentJournalInfo implemented Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
@@ -212,7 +215,52 @@ func (s *ReplicationStatus) FindErrantGTIDs(otherReplicaStatuses []*ReplicationS | |||
// Copy set for final diffSet so we don't mutate receiver. | |||
diffSet := make(Mysql56GTIDSet, len(relayLogSet)) | |||
for sid, intervals := range relayLogSet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we continue to name this relayLogSet
?
} | ||
|
||
if len(diffSet) == 0 { | ||
// If diffSet is empty, then we have no errant GTIDs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
reparentJournalLen, err := erp.gatherReparenJournalInfo(ctx, validCandidates, tabletMap, waitReplicasTimeout) | ||
if err != nil { | ||
// If we run into an error finding the reparent journal lengths, we use the legacy errant GTID detection logic. | ||
return erp.legacyFindErrantGTIDS(validCandidates, statusMap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
message ReadReparentJournalInfoResponse { | ||
int32 length = 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// This check is only required because we want to support the case where the vttablet doesn't have | ||
// `ReadReparentJournalInfo` RPC implemented. So, the new logic for only using the tablets with the maximum | ||
// length of reparent journal won't filter any candidates. We'll need this logic to continue supporting the cases | ||
// that we did before. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a // TODO(manan): remove in v22
or something similar.
) (map[string]replication.Position, error) { | ||
// First we need to collect the reparent journal length for all the candidates. | ||
// This will tell us, which of the tablets are severly lagged, and haven't even seen all the primary promotions. | ||
// These tablets cannot be trusted for errant GTID detection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// These tablets cannot be trusted for errant GTID detection. | |
// Such severely lagging tablets cannot be used to find errant GTIDs in other tablets, seeing that they themselves don't have enough information. |
continue | ||
} | ||
maxLenPositions = append(maxLenPositions, validCandidates[candidate]) | ||
updatedValidCandidates[candidate] = validCandidates[candidate] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
status, ok := statusMap[candidate] | ||
if !ok { | ||
// If the tablet is not in the status map, and has the maximum length of the reparent journal, | ||
// then it is probably the latest primary and we don't need to run any errant GTID detection on it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the use of the word "probably". It is certain to be the primary? What are other potential scenarios?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if it is the primary, will it hurt to nonetheless run errant GTID detection? I'm hoping to reduce clutter get rid of this conditional.
// Here we don't want to send the source UUID. The reason is that all of these tablets are lagged, | ||
// so we don't need to use the source UUID to discount any GTIDs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit unclear. Can you rephrase that please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the tests, it is often the case that zone1-0000000102
is the "winner" of candidate replicas election. However, it is also the first in the list of replica candidates. I would like to see more shuffling or tablets and some changes of roles, such that 103
or 104
would be the "winners". If only to ensure that the algorithm doesn't pick the first tablet by chance or mistake.
Description
This PR adds the code changes for reworking the errant GTID detection in ERS. As proposed in #16724 (comment), we now also use the reparent journal length as an extra data point for GTID detection. All the different cases listed in #16274 have been added as unit tests in this PR, and the expectations of the algorithm have been verified.
Since,
ReadReparentJournalInfo
is a new RPC, there can be customers that upgrade Vitess multiple versions at a time (we are adding the new RPC in v21, but it is not available in releases prior to that). In this case, the vttablets won't have the RPC implemented. Since we don't want ERS to stop working in this situation, we have to keep the legacy errant GTID code around for this scenario. So, if reading the reparent journal information fails on any tablet, then we revert to using that legacy errant GTID detection code.Related Issue(s)
Checklist
Deployment Notes