galera: better monitoring of long-running SST #954

Open
wants to merge 1 commit into base: main

Conversation

dciabrin
Contributor

@dciabrin dciabrin commented Mar 23, 2017

With large databases, SST can take longer than the configured promotion timeout, which makes the resource fail and block.

In order to overcome the timeout limit, start joining nodes during a
monitor operation. A new dedicated monitor operation tracks the SST
progress and triggers promotion to Master once the synchronization with
the galera cluster is finished.
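
Conceptually, that dedicated monitor only needs to ask the local node whether it has finished syncing with the cluster and, once it has, raise its master score so pacemaker can promote it. A minimal sketch of the idea, not the actual agent code (the function name, the unauthenticated local mysql call and the score value are illustrative; crm_master and the wsrep_local_state_comment status variable are standard pacemaker/galera pieces):

# Sketch only: poll the local Galera state and allow promotion once synced.
galera_sync_monitor_sketch() {
    local state
    state=$(mysql --skip-column-names -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')
    if [ "$state" = "Synced" ]; then
        # SST/IST finished: make this node promotable to Master
        crm_master -v 100
    else
        # still joining: report progress and let the next monitor re-check
        ocf_log info "node still syncing with the cluster (state: $state)"
    fi
    return $OCF_SUCCESS
}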

This is a rework of #684 and #762 which mimics the current behaviour of the agent when a failure happens on joining (on-fail=block). In particular, a failing SST will block the resource until manual cleanup, as it does today.

This rework essentially splits the monitoring of galera into 4 kinds of monitor operations (a configuration sketch follows the list):

  • the usual monitor operation (for starting up and Master monitoring)

  • a new Slave monitor to track the synchronization of joining nodes (SST and IST). Any failure during this monitor is blocking (on-fail=block)

  • a dedicated probe monitor, which updates the attributes based on the probed state (Stopped, Slave or Master)

  • a dedicated monitor for unmanaged resources
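
As a rough illustration of how these operations are meant to be consumed, the resource could be created with the Slave monitor declared up front. This is a sketch only: the resource name, cluster addresses, intervals and meta attributes below are placeholders based on the advertised defaults, not something this PR mandates.

pcs resource create galera galera \
    wsrep_cluster_address="gcomm://node1,node2,node3" \
    op monitor timeout=30 interval=20 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=8 on-fail=block \
    --master meta master-max=3 ordered=true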

@dciabrin dciabrin changed the title galera: start joining nodes during monitor op to better track long-ru… galera: better monitoring of long-running SST Mar 23, 2017
@oalbrigt oalbrigt requested a review from beekhof March 24, 2017 07:49
galera: start joining nodes during monitor op to better track long-running SST

@@ -254,6 +257,7 @@ Cluster check user password
<action name="monitor" depth="0" timeout="30" interval="20" />
<action name="monitor" role="Master" depth="0" timeout="30" interval="10" />
<action name="monitor" role="Slave" depth="0" timeout="30" interval="30" />
<action name="monitor" role="Slave" depth="0" timeout="30" interval="8" on-fail="block"/>
Member

Why not name="sync-check"?
Nothing says you need to use monitor for everything :-)

Contributor Author

I did so because, while pacemaker is perfectly fine with a random action name, pcs expects specific names in actions during resource creation (https://github.com/ClusterLabs/pcs/blob/master/pcs/lib/resource_agent.py#L24).
An action with an unexpected name is not set in the CIB, and the whole dual-monitor approach breaks :/
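
For what it's worth, the effect is easy to verify after creating the resource; assuming the primitive is named galera, something along these lines shows which operations actually made it into the CIB (both commands are standard pcs/pacemaker tooling, the resource id is just an example):

pcs resource show galera
# or query the raw CIB for the primitive's configured operations
cibadmin --query --xpath "//primitive[@id='galera']/operations"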

status) mysql_common_status err;;
monitor) galera_monitor;;
status) galera_status err;;
monitor)
Member

Make these differently named actions.
E.g. monitor -> galera_probe(), sync-check -> galera_sync_monitor(), health-check -> galera_unmanaged_monitor(), bootstrap-or-join -> galera_monitor()
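
Roughly, the dispatch at the bottom of the agent would then read something like this (a sketch following the naming above, not actual code):

case "$1" in
    monitor)            galera_probe;;             # one-shot probe: report Stopped/Slave/Master
    sync-check)         galera_sync_monitor;;      # recurring Slave op: track SST/IST progress
    health-check)       galera_unmanaged_monitor;; # monitoring while the resource is unmanaged
    bootstrap-or-join)  galera_monitor;;           # drive bootstrap / joining of the cluster
    *)                  exit $OCF_ERR_UNIMPLEMENTED;;
esac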

Contributor Author

I'm not sure how to convince pacemaker to call the resource agent with a specific action name when a probe is required, or when the resource is in a certain state (e.g. "resource is currently unmanaged on that node").
Anyway, it seems that pcs is the limiting layer here :/

@dmuhamedagic
Contributor

dmuhamedagic commented Mar 28, 2017 via email

@dciabrin
Contributor Author

@dmuhamedagic fair enough, ClusterLabs/pcs#132 created to discuss that.
