DAOS-17111 cart: Fix csm_alive_count (#15945) #16030

Draft
wants to merge 1 commit into
base: release/2.6

@liw (Contributor) commented Mar 5, 2025

In swim, csm_alive_count may underflow because some code paths in csm that change cst->cst_state.sms_status do not update the count. Moreover, not counting SUSPECT members seems to be a mistake. Consider a membership of three, {x, y, z}. If x enters a state where it can't receive any SWIM messages and picks y as its ping target in the next period, it will suspect y, causing csm_alive_count to drop from 3 to 2, which prevents x from declaring an "outage". (In the subsequent period, x will suspect z, and csm_alive_count drops further from 2 to 1.) Since x keeps pinging SUSPECT members, it seems reasonable to count them and expect them to send messages to x until they become DEAD.

This patch fixes the underflow and counts SUSPECT members in addition to ALIVE members in csm_alive_count (renamed to csm_alive_or_suspect_count).
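
The counting rule described above can be sketched roughly as follows. This is only an illustration, not the actual patch: the types and helpers (`struct membership`, `membership_init`, `set_status`, `is_counted`, `MAX_MEMBERS`) are hypothetical names, and the sketch assumes that the local member is included in the count and that every status change goes through one helper so the counter cannot be overlooked.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MEMBERS 8 /* illustrative bound, not taken from the patch */

/* Hypothetical stand-ins for the SWIM member states named in the PR. */
enum member_status { MEMBER_ALIVE, MEMBER_SUSPECT, MEMBER_DEAD };

struct membership {
	enum member_status status[MAX_MEMBERS];
	uint32_t           nmembers;
	uint32_t           alive_or_suspect_count; /* mirrors the renamed counter */
};

/* A member contributes to the counter while it is ALIVE or SUSPECT. */
static int is_counted(enum member_status s)
{
	return s == MEMBER_ALIVE || s == MEMBER_SUSPECT;
}

/* Start with n members, all ALIVE, so the counter equals n. */
static void membership_init(struct membership *ms, uint32_t n)
{
	assert(n <= MAX_MEMBERS);
	ms->nmembers = n;
	ms->alive_or_suspect_count = n;
	for (uint32_t i = 0; i < n; i++)
		ms->status[i] = MEMBER_ALIVE;
}

/*
 * Assumed single entry point for status changes. The counter moves only
 * when a member crosses the counted/uncounted boundary, so repeating a
 * transition (e.g. DEAD -> DEAD) cannot decrement it twice.
 */
static void set_status(struct membership *ms, uint32_t idx,
		       enum member_status new_status)
{
	enum member_status old_status;

	assert(idx < ms->nmembers);
	old_status = ms->status[idx];

	if (is_counted(old_status) && !is_counted(new_status)) {
		assert(ms->alive_or_suspect_count > 0); /* would underflow */
		ms->alive_or_suspect_count--;
	} else if (!is_counted(old_status) && is_counted(new_status)) {
		ms->alive_or_suspect_count++;
	}
	ms->status[idx] = new_status;
}
```

With a membership of three, marking y and then z as SUSPECT through `set_status` leaves the counter at 3 in this sketch; it drops only when a member becomes DEAD.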

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Li Wei <[email protected]>
github-actions bot commented Mar 5, 2025

Ticket title is '[SWIM] Zombie Node Messes Up SWIM'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-17111

@liw added the clean-cherry-pick label (cherry-pick from another branch that did not require additional edits) on Mar 5, 2025
Labels
clean-cherry-pick: Cherry-pick from another branch that did not require additional edits