galera: better monitoring of long-running SST #954

Open
wants to merge 1 commit into base: main

Conversation

dciabrin
Contributor

@dciabrin dciabrin commented Mar 23, 2017

With large databases, SST can take longer than the configured promotion timeout, which makes the resource fail and block.

In order to overcome the timeout limit, start joining nodes during a
monitor operation. A new dedicated monitor operation tracks the SST
progress and triggers promotion to Master once the synchronization with
the galera cluster is finished.
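
Conceptually, that dedicated monitor only needs to ask the local node whether it has finished syncing with the cluster and, once it has, raise its master score so pacemaker can promote it. A minimal sketch of the idea, not the actual agent code (the function name, the unauthenticated local mysql call and the score value are illustrative; crm_master and the wsrep_local_state_comment status variable are standard pacemaker/galera pieces):

# Sketch only: poll the local Galera state and allow promotion once synced.
galera_sync_monitor_sketch() {
    local state
    state=$(mysql --skip-column-names -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')
    if [ "$state" = "Synced" ]; then
        # SST/IST finished: make this node promotable to Master
        crm_master -v 100
    else
        # still joining: report progress and let the next monitor re-check
        ocf_log info "node still syncing with the cluster (state: $state)"
    fi
    return $OCF_SUCCESS
}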

This is a rework of #684 and #762 which mimics the current behaviour of the agent when a failure happens on joining (on-fail=block). In particular, a failing SST will block the resource until manual cleanup, as it does today.

This rework essentially splits the monitoring of galera into 4 kinds of monitor operations (a configuration sketch follows the list):

  • the usual monitor operation (for starting up and Master monitoring)

  • a new Slave monitor to track the synchronization of joining nodes (SST and IST). Any failure during this monitor is blocking (on-fail=block)

  • a dedicated probe monitor, which updates the attributes based on the probed state (Stopped, Slave or Master)

  • a dedicated monitor for unmanaged resources
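
As a rough illustration of how these operations are meant to be consumed, the resource could be created with the Slave monitor declared up front. This is a sketch only: the resource name, cluster addresses, intervals and meta attributes below are placeholders based on the advertised defaults, not something this PR mandates.

pcs resource create galera galera \
    wsrep_cluster_address="gcomm://node1,node2,node3" \
    op monitor timeout=30 interval=20 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=8 on-fail=block \
    --master meta master-max=3 ordered=true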

@dciabrin dciabrin changed the title galera: start joining nodes during monitor op to better track long-ru… galera: better monitoring of long-running SST Mar 23, 2017
@oalbrigt oalbrigt requested a review from beekhof March 24, 2017 07:49
galera: start joining nodes during monitor op to better track long-running SST

@@ -254,6 +257,7 @@ Cluster check user password
<action name="monitor" depth="0" timeout="30" interval="20" />
<action name="monitor" role="Master" depth="0" timeout="30" interval="10" />
<action name="monitor" role="Slave" depth="0" timeout="30" interval="30" />
<action name="monitor" role="Slave" depth="0" timeout="30" interval="8" on-fail="block"/>
Member

Why not name="sync-check"?
Nothing says you need to use monitor for everything :-)

Contributor Author

I did so because, while pacemaker is perfectly fine with a random action name, pcs expects specific names in actions during resource creation (https://github.com/ClusterLabs/pcs/blob/master/pcs/lib/resource_agent.py#L24).
An action with an unexpected name is not set in the CIB, and the whole dual-monitor approach breaks :/
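
For what it's worth, the effect is easy to verify after creating the resource; assuming the primitive is named galera, something along these lines shows which operations actually made it into the CIB (both commands are standard pcs/pacemaker tooling, the resource id is just an example):

pcs resource show galera
# or query the raw CIB for the primitive's configured operations
cibadmin --query --xpath "//primitive[@id='galera']/operations"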

status) mysql_common_status err;;
monitor) galera_monitor;;
status) galera_status err;;
monitor)
Member

Make these differently named actions.
E.g. monitor -> galera_probe(), sync-check -> galera_sync_monitor(), health-check -> galera_unmanaged_monitor(), bootstrap-or-join -> galera_monitor()
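
Roughly, the dispatch at the bottom of the agent would then read something like this (a sketch following the naming above, not actual code):

case "$1" in
    monitor)            galera_probe;;             # one-shot probe: report Stopped/Slave/Master
    sync-check)         galera_sync_monitor;;      # recurring Slave op: track SST/IST progress
    health-check)       galera_unmanaged_monitor;; # monitoring while the resource is unmanaged
    bootstrap-or-join)  galera_monitor;;           # drive bootstrap / joining of the cluster
    *)                  exit $OCF_ERR_UNIMPLEMENTED;;
esac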

Contributor Author

I'm not sure how to convince pacemaker to call the resource agent with a specific action name when a probe is required, or when the resource is in a certain state (e.g. "resource is currently unmanaged on that node").
Anyway, it seems that pcs is the limiting layer here :/

@dmuhamedagic
Contributor

dmuhamedagic commented Mar 28, 2017 via email

@dciabrin
Contributor Author

@dmuhamedagic fair enough, ClusterLabs/pcs#132 created to discuss that.
