galera: better monitoring of long-running SST #954
@@ -254,6 +257,7 @@ Cluster check user password
<action name="monitor" depth="0" timeout="30" interval="20" />
<action name="monitor" role="Master" depth="0" timeout="30" interval="10" />
<action name="monitor" role="Slave" depth="0" timeout="30" interval="30" />
<action name="monitor" role="Slave" depth="0" timeout="30" interval="8" on-fail="block"/>
Why not name="sync-check"?
Nothing says you need to use monitor for everything :-)
I did so because while pacemaker is perfectly fine with a random action name, pcs expects specific names in actions during resource creation (https://github.com/ClusterLabs/pcs/blob/master/pcs/lib/resource_agent.py#L24).
An action with an unexpected name is not set in the CIB, and the whole dual monitor approach breaks :/
status) mysql_common_status err;;
monitor) galera_monitor;;
status) galera_status err;;
monitor)
Make these differently named actions.
Eg. monitor -> galera_probe(), sync-check -> galera_sync_monitor(), health-check -> galera_unmanaged_monitor(), bootstrap-or-join -> galera_monitor()
I'm not sure how to convince pacemaker to call the resource agent with a specific action name when a probe is required, or when the resource is in a certain state (e.g. "resource is currently unmanaged on that node").
Anyway, it seems that pcs is a limiting layer here :/
On Tue, Mar 28, 2017 at 07:14:28AM -0700, Damien Ciabrini wrote:
> I did so because while pacemaker is perfectly fine with a random action name, pcs expects specific names in actions during resource creation (https://github.com/ClusterLabs/pcs/blob/master/pcs/lib/resource_agent.py#L24).
> An action with an unexpected name is not set in the CIB, and the whole dual monitor approach breaks :/
Did you report this?
You shouldn't work around pcs problems. Pacemaker is the ultimate
source and the UI (pcs) needs to be fixed.
@dmuhamedagic fair enough, ClusterLabs/pcs#132 created to discuss that.
With large databases, SST can take a longer time than the configured promotion
timeout, which makes the resource fail and block.
In order to overcome the timeout limit, start joining nodes during a
monitor operation. A new dedicated monitor operation tracks the SST
progress and triggers promotion to Master once the synchronization with
the galera cluster is finished.
This is a rework of #684 and #762 which mimics the current behaviour of the agent when
a failure happens on joining (on-fail=block). In particular, a failing SST will block the resource until manual cleanup, as it does today.
This rework essentially splits the monitoring of galera into four kinds of monitor operation:
the usual monitor operation (for starting up and Master monitoring)
a new Slave monitor to track the synchronization of joining nodes (SST and IST). Any failure in this monitor is blocking (on-fail=block)
a dedicated probe monitor, which updates the attributes based on the probed state (Stopped, Slave or Master)
a dedicated monitor for unmanaged resources
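
As a rough illustration of the four-way split described above, the sketch below classifies a monitor call into one of the four kinds. It is not the code from this PR: the function name `classify_monitor`, its three arguments, the order of the checks and the returned labels are all hypothetical, and the agent may use different criteria (for instance, the operation interval) to tell the Slave monitors apart. The only convention borrowed from pacemaker is that a probe is a monitor with an interval of 0.

```shell
#!/bin/sh
# Hypothetical helper (not part of the agent): decide which monitor kind applies.
#   $1  monitor interval passed by the cluster manager (0 means probe)
#   $2  role attached to the operation ("Master", "Slave" or empty)
#   $3  "true" when the resource is managed, "false" otherwise (assumed input)
classify_monitor() {
    interval=$1
    role=$2
    managed=$3

    if [ "$managed" = "false" ]; then
        echo "unmanaged"   # dedicated monitor for an unmanaged resource
    elif [ "$interval" = "0" ]; then
        echo "probe"       # one-off probe: report Stopped, Slave or Master
    elif [ "$role" = "Slave" ]; then
        echo "sync"        # track SST/IST progress of a joining node
    else
        echo "regular"     # usual monitor: start-up and Master monitoring
    fi
}
```

For example, `classify_monitor 8 Slave true` prints `sync`, while `classify_monitor 0 "" true` prints `probe`. A real agent would then call the matching function for each label and, for the sync case, rely on on-fail=block so that a failed SST blocks the resource until manual cleanup.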