Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent outdated/broken mysql slave being promoted as master #1005

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

vaLski
Copy link
Contributor

@vaLski vaLski commented Jul 26, 2017

This merge request greatly enhance mysql resource stability. It tries to address cases that happened in live environments:

  • broken/outdated slave promoted as mysql master due to insufficient error checking in mysql monitor for slaves

  • broken/outdated slave promoted as mysql master due to specific (not-so-smart default) configuration requirements unrelated to check_slave state (test_table set and OCF_CHECK_LEVEL > 0)

  • broken/outdated slave promoted as mysql master due to master scores not being reclaimed/removed from slaves that have serious replication errors or are in state unsuitable for master promotion.

This merge request is a result of multiple hours spent during on-call and out-of-call firefighting, manual out-of-sync master-merging, postmortem writing, debugging crm and its default mysql resource agent.

IMPORTANT changes

  • check_slave is now ALWAYS called on slave in ms mode - Previously it was called only if test_table was set, which I assume to be due to an error. To prevent stale/broken slave from being promoted as master I strongly recommend calling check_slave by default on slaves.

  • in check_slave $Slave_SQL_Running/$Slave_IO_Running == 'NO' verify slave vs cluster master before START SLAVE is now performed only if ensure_follow_correct_master=true (new var. default is false)

  • Create empty file mysql_master_lock_file=/var/lock/replication_master (new var.) on master promote, erase it on master demote. Useful for for external scripts.

Introducing new function quit_on_error that is used across monitor:

  • uses common function to terminate with success or error - quit_on_error
  • quit_on_error always clear reader vip so erroneous slave does not receive reads
  • quit_on_error always delete master score so erroneous does not get promoted as master
  • quit_on_error log the error
  • quit_on_error can terminate with the desired error code

Enhanced check_slave to detect failed or stale slaves:

  • Consider slave as failed if Last_SQL_Error/Last_SQL_Errno
  • Consider slave as failed if Last_IO_Error/Last_IO_Errno
  • Consider slave as failed if Last_Error/Last_Errno
  • Attempt to start slave $Slave_SQL_Running/$Slave_IO_Running NO
  • Code simplification for evict_outdated_slaves handling
  • Code simplification for slave master score handling

Code readability greatly simplified.

vaLski added 23 commits July 26, 2017 14:55
…reated whenever instance is promoted as MySQL master in ms configuration.

Lock file is removed whenever machine is no longer MySQL master.

Lock usefull for usage by external scripts.

Default mysql_master_lock_file path is /var/lock/replication_master
…ually it is being executed in pre-demote and should be showing pre-demote in logs.
…ch trigger specific check only if this parameter is set to true.

If slave is disconnected this script attempts start slave.

However in certain cases slave might be following incorrect master.

If ensure_follow_correct_master is set to true, start slave is attempted only in case slave master matches crm elected master.
…d whenever replication error occur:

- In master/slave mode set reader vip to 0/false so clients does not read from this machine
- Delete master score for the specific machine so it never get promoted as master. As slave ran into an error, and promoting it to master may lead to dataloss.
- Log specific error
- Quit with the requested error code. It might be success as well so we don't trigger resource fail (on start slave attempts during temporary slave disconnects)
…ode that can be read easily and also avoid nested code
…ast_SQL_Errno too as those show vital information about that if specific mysql slave instance is healthy
From now on setting test_table in config ONLY controls if we will run SELECT COUNT(*) from that table on both master and slave.

Previously test_table option was controlling (by mistake?) if we will be running extensive slave tests via check_slave

However I strongly advocate on running check_slave anytime on slaves otherwise they can end up, being broken but still promoted as masters at certain point.

This commit also greatly simplify the code structure. In ms mode:

* slave

- is always checked in details with call to check_slave
- check_slave on its end should be in charge to set correct reader vip status and master scoring on its own
- should remove vip and master score if check_slave fails with call to quit_on_error

* master

- is always set as vip readable and with greatest score
- return to parent that we are $OCF_RUNNING_MASTER

* in non ms mode we simply report resource as running

Checks are as follows:

- is mysql running? (via mysql_common_status)
- Can I SELECT COUNT(*) FROM $OCF_RESKEY_test_table (if test_table is set and $OCF_CHECK_LEVEL > 0)
- For slave run extensive slave tests in check_slave
- For master, reader vip 1, scoring highest, say ok
- For non-ms mode say ok.
…w on fail slave checks anytime we got Last_SQL_Error/Last_SQL_Errno, Last_IO_Error/Last_IO_Errno or Last_Error/Last_Errno errors.

This is critical to avoid promoting outdated/broken replication slave as master.
… if statement

From now on we should compare $master_host vs $new_master before 'START SLAVE' ONLY if ensure_follow_correct_master config option is set to true. Why?

If while creating mysql replication we used CHANGE MASTER TO='ip.ip.ip.ip':

- value of $master_host will be this ip address
- Even the IP in questions is assigneed on $new_master
- Check will fail as we will compare ip vs $new_master
- This will fail the check

After 'START SLAVE' attempt, this particular check should:

- remove reader vip as we are not connected to master thus outdated (avoid reading from outdated slave)
- remove master score as we are not connected to master thus outdated (avoid promoting outdated slave)
- quit with success so we can allow mysql to reconnect or throw error after reconnect attempt
…ith quit_on_error

simplified vip reader scoring code

simplified master scoring code

check_slave is in charge of setting correct master scoring preferences for EACH slave

if we reach the end of check_slave than slave should be considered healthy thus we should always return $OCF_SUCCESS at the end.
… in master mode always set mysql_master_lock_file
@krig
Copy link
Contributor

krig commented Aug 8, 2017

Is there any chance you could clean up this series of commits a bit? It's difficult to review when there are so many commits to look at. Thanks!

@vaLski
Copy link
Contributor Author

vaLski commented Aug 8, 2017

@krig I intentionally split each change in different commits.

  • Each commit addresses single problem at a time.
  • This should allow the one who review the code to follow the idea of the author and see why exactly is he doing particular modification
  • I assume that it should be easier for one to review changes of 5-10 lines per commit instead of getting 300 lines diff where one can easily lose track of small but important details.
  • If there are multiple commits that should be merged like in my request one can easily see the entire diff.

However if you insist I can make the modifications requested. Just explain what do you mean by clean-up. Doing more than 1 change per commit?

@krig
Copy link
Contributor

krig commented Aug 8, 2017

I guess maybe it's just a lot of changes! It looked more like an evolution over time that could perhaps be condensed. I will take another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants