DAOS-17111 cart: Fix csm_alive_count (#15945) #16030
Draft
In SWIM, csm_alive_count may underflow because some changes to cst->cst_state.sms_status in csm do not update the count. Moreover, not counting SUSPECT members appears to be a mistake. Consider a membership of three, {x, y, z}. If x enters a state in which it cannot receive any SWIM messages and picks y in the next period, it will suspect y, and csm_alive_count drops from 3 to 2, which prevents x from declaring an "outage". (In the following period, x will suspect z, and csm_alive_count quickly drops from 2 to 1.) Since x keeps pinging SUSPECT members, it seems reasonable to count them and to expect them to send messages to x until they become DEAD.
This patch fixes the underflow and counts SUSPECT members in addition to ALIVE members in csm_alive_count (renamed to csm_alive_or_suspect_count).
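The counting rule described above can be illustrated with a minimal, standalone sketch. It is not the code in this patch: the types and names (struct membership, member_set_status, and the demo functions) are hypothetical stand-ins. It assumes that every status change goes through a single helper, so the counter is adjusted only when a member crosses the boundary between "ALIVE or SUSPECT" and DEAD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SWIM member states; not the DAOS definitions. */
enum member_status { MEMBER_ALIVE, MEMBER_SUSPECT, MEMBER_DEAD };

#define MAX_MEMBERS 8

struct membership {
	enum member_status status[MAX_MEMBERS];
	size_t             nmembers;
	uint64_t           alive_or_suspect_count;
};

/* A member is counted while it is ALIVE or SUSPECT. */
static int counted(enum member_status s)
{
	return s == MEMBER_ALIVE || s == MEMBER_SUSPECT;
}

static void membership_init(struct membership *m, size_t n)
{
	size_t i;

	assert(n <= MAX_MEMBERS);
	m->nmembers = n;
	m->alive_or_suspect_count = n;
	for (i = 0; i < n; i++)
		m->status[i] = MEMBER_ALIVE;
}

/*
 * Single entry point for status changes. The counter moves only when the
 * member enters or leaves the counted set, so repeating a transition
 * (e.g. DEAD -> DEAD) cannot decrement it twice and cause an underflow.
 */
static void member_set_status(struct membership *m, size_t idx,
			      enum member_status new_status)
{
	enum member_status old_status;

	assert(idx < m->nmembers);
	old_status = m->status[idx];
	if (counted(old_status) && !counted(new_status)) {
		assert(m->alive_or_suspect_count > 0);
		m->alive_or_suspect_count--;
	} else if (!counted(old_status) && counted(new_status)) {
		m->alive_or_suspect_count++;
	}
	m->status[idx] = new_status;
}

/* The {x, y, z} scenario: x (index 0) suspects y, then z. */
uint64_t demo_suspect_keeps_count(void)
{
	struct membership m;

	membership_init(&m, 3);
	member_set_status(&m, 1, MEMBER_SUSPECT);
	member_set_status(&m, 2, MEMBER_SUSPECT);
	return m.alive_or_suspect_count;
}

/* y becomes DEAD, and the same DEAD update is applied a second time. */
uint64_t demo_dead_drops_once(void)
{
	struct membership m;

	membership_init(&m, 3);
	member_set_status(&m, 1, MEMBER_SUSPECT);
	member_set_status(&m, 1, MEMBER_DEAD);
	member_set_status(&m, 1, MEMBER_DEAD);
	return m.alive_or_suspect_count;
}
```

In this sketch, suspecting y and z leaves the count at 3, while declaring y DEAD lowers it to 2 exactly once, even if the DEAD update is repeated.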