-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent outdated/broken mysql slave being promoted as master #1005
base: main
Are you sure you want to change the base?
Conversation
…reated whenever instance is promoted as MySQL master in ms configuration. Lock file is removed whenever machine is no longer MySQL master. Lock usefull for usage by external scripts. Default mysql_master_lock_file path is /var/lock/replication_master
…ually it is being executed in pre-demote and should be showing pre-demote in logs.
…ch trigger specific check only if this parameter is set to true. If slave is disconnected this script attempts start slave. However in certain cases slave might be following incorrect master. If ensure_follow_correct_master is set to true, start slave is attempted only in case slave master matches crm elected master.
…d whenever replication error occur: - In master/slave mode set reader vip to 0/false so clients does not read from this machine - Delete master score for the specific machine so it never get promoted as master. As slave ran into an error, and promoting it to master may lead to dataloss. - Log specific error - Quit with the requested error code. It might be success as well so we don't trigger resource fail (on start slave attempts during temporary slave disconnects)
…ode that can be read easily and also avoid nested code
…ast_SQL_Errno too as those show vital information about that if specific mysql slave instance is healthy
From now on setting test_table in config ONLY controls if we will run SELECT COUNT(*) from that table on both master and slave. Previously test_table option was controlling (by mistake?) if we will be running extensive slave tests via check_slave However I strongly advocate on running check_slave anytime on slaves otherwise they can end up, being broken but still promoted as masters at certain point. This commit also greatly simplify the code structure. In ms mode: * slave - is always checked in details with call to check_slave - check_slave on its end should be in charge to set correct reader vip status and master scoring on its own - should remove vip and master score if check_slave fails with call to quit_on_error * master - is always set as vip readable and with greatest score - return to parent that we are $OCF_RUNNING_MASTER * in non ms mode we simply report resource as running Checks are as follows: - is mysql running? (via mysql_common_status) - Can I SELECT COUNT(*) FROM $OCF_RESKEY_test_table (if test_table is set and $OCF_CHECK_LEVEL > 0) - For slave run extensive slave tests in check_slave - For master, reader vip 1, scoring highest, say ok - For non-ms mode say ok.
… more readable and avoid nested code
…error. this should be the very first check.
…w on fail slave checks anytime we got Last_SQL_Error/Last_SQL_Errno, Last_IO_Error/Last_IO_Errno or Last_Error/Last_Errno errors. This is critical to avoid promoting outdated/broken replication slave as master.
… if statement From now on we should compare $master_host vs $new_master before 'START SLAVE' ONLY if ensure_follow_correct_master config option is set to true. Why? If while creating mysql replication we used CHANGE MASTER TO='ip.ip.ip.ip': - value of $master_host will be this ip address - Even the IP in questions is assigneed on $new_master - Check will fail as we will compare ip vs $new_master - This will fail the check After 'START SLAVE' attempt, this particular check should: - remove reader vip as we are not connected to master thus outdated (avoid reading from outdated slave) - remove master score as we are not connected to master thus outdated (avoid promoting outdated slave) - quit with success so we can allow mysql to reconnect or throw error after reconnect attempt
…ith quit_on_error simplified vip reader scoring code simplified master scoring code check_slave is in charge of setting correct master scoring preferences for EACH slave if we reach the end of check_slave than slave should be considered healthy thus we should always return $OCF_SUCCESS at the end.
… in master mode always set mysql_master_lock_file
Is there any chance you could clean up this series of commits a bit? It's difficult to review when there are so many commits to look at. Thanks! |
@krig I intentionally split each change in different commits.
However if you insist I can make the modifications requested. Just explain what do you mean by clean-up. Doing more than 1 change per commit? |
I guess maybe it's just a lot of changes! It looked more like an evolution over time that could perhaps be condensed. I will take another look. |
This merge request greatly enhance mysql resource stability. It tries to address cases that happened in live environments:
broken/outdated slave promoted as mysql master due to insufficient error checking in mysql monitor for slaves
broken/outdated slave promoted as mysql master due to specific (not-so-smart default) configuration requirements unrelated to check_slave state (test_table set and OCF_CHECK_LEVEL > 0)
broken/outdated slave promoted as mysql master due to master scores not being reclaimed/removed from slaves that have serious replication errors or are in state unsuitable for master promotion.
This merge request is a result of multiple hours spent during on-call and out-of-call firefighting, manual out-of-sync master-merging, postmortem writing, debugging crm and its default mysql resource agent.
IMPORTANT changes
check_slave is now ALWAYS called on slave in ms mode - Previously it was called only if test_table was set, which I assume to be due to an error. To prevent stale/broken slave from being promoted as master I strongly recommend calling check_slave by default on slaves.
in check_slave $Slave_SQL_Running/$Slave_IO_Running == 'NO' verify slave vs cluster master before START SLAVE is now performed only if ensure_follow_correct_master=true (new var. default is false)
Create empty file mysql_master_lock_file=/var/lock/replication_master (new var.) on master promote, erase it on master demote. Useful for for external scripts.
Introducing new function quit_on_error that is used across monitor:
Enhanced check_slave to detect failed or stale slaves:
Code readability greatly simplified.