galera: start joining nodes during monitor op to better track long-running SST

With large databases, SST can take longer than the configured promotion
timeout, which makes the resource fail and block.

In order to overcome the timeout limit, start joining nodes during a
monitor operation. A new dedicated monitor operation tracks the SST
progress and triggers promotion to Master once the synchronization with
the galera cluster is finished.
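
The monitor-driven flow described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: `wsrep_state`, `request_promotion`, `monitor_joining_node` and the `FAKE_WSREP_STATE` variable are hypothetical names, and the state query is stubbed so the sketch runs without a cluster. The exit codes are the standard OCF values.

```shell
#!/bin/sh
# Standard OCF exit codes used by resource agents.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stub: a real agent would ask mysqld for its wsrep state.
# Here the state is read from an environment variable.
wsrep_state() {
    echo "${FAKE_WSREP_STATE:-Joining}"
}

# Hypothetical stub: a real agent would ask the cluster manager to
# promote the node. Here we only record that promotion was requested.
request_promotion() {
    PROMOTION_REQUESTED=1
}

# Called on every recurring monitor while the node is joining.
# Each call is short, so a long SST never runs into the timeout of a
# single operation.
monitor_joining_node() {
    case "$(wsrep_state)" in
        Synced)
            # SST is over: trigger promotion to Master.
            request_promotion
            return $OCF_SUCCESS
            ;;
        Joining*|Joined)
            # SST still in progress: report success and check again
            # at the next monitor.
            return $OCF_SUCCESS
            ;;
        *)
            # Unexpected state: report a failure.
            return $OCF_ERR_GENERIC
            ;;
    esac
}
```

The point of the design is that the slow work is observed by many short monitor calls rather than performed inside one long promote call.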
dciabrin committed Mar 23, 2017
1 parent 3599492 commit bbc7dd9
Showing 2 changed files with 439 additions and 137 deletions.
19 changes: 11 additions & 8 deletions heartbeat/README.galera
@@ -77,6 +77,13 @@ General idea for starting Galera:
* Only when SST is over on joiner nodes, the agent promotes them
to Master. At this point, the entire Galera cluster is up.

+* SST failures can't always be recovered automatically, so if
+a failure occurs while galera is syncing on a node (attribute
+sync-needed set), pacemaker will prevent the resource from
+restarting on that node.
+The user will have to run "pcs cleanup galera" on the node to unblock
+the resource and automatically trigger an SST at next restart.


Attribute usage and liveness
====
@@ -133,14 +140,10 @@ Non-primary state, which would make `galera_monitor()` fail.

### no-grastate

-If a galera node was unexpectedly killed in a middle of a replication,
-InnoDB can retain the equivalent of a XA transaction in prepared state
-in its redo log. If so, mysqld cannot recover state (nor last seqno)
-automatically, and special recovery heuristic has to be used to
-unblock the node.
-
-This transient attribute is used to keep track of forced recoveries to
-prevent bootstrapping a cluster from a recovered node when possible.
+This transient attribute is used to keep track of nodes which did not
+shut down cleanly or failed to join the cluster during SST. It is also
+used to prevent bootstrapping a cluster from a recovered node when
+possible.
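
One way such a flag could influence the election of the bootstrap node is sketched below. This is an assumption made for illustration, not the logic of `detect_first_master()`: the function name `elect_bootstrap_node`, the input format and the tie-break rule (highest seqno wins, and among equal seqnos a node without the flag is preferred) are all hypothetical.

```shell
#!/bin/sh
# Reads one line per node on stdin:
#   <node-name> <last-seqno> <no-grastate flag: 0 or 1>
# Prints the name of the node chosen to bootstrap the cluster.
elect_bootstrap_node() {
    awk '
    {
        seqno = $2 + 0        # force numeric comparison
        recovered = $3 + 0    # 1 if the no-grastate flag is set
        # Take this node if it is the first one seen, if it has a
        # higher seqno, or if it ties on seqno with a flagged node
        # while not being flagged itself.
        if (!found || seqno > best_seqno ||
            (seqno == best_seqno && best_recovered && !recovered)) {
            best = $1
            best_seqno = seqno
            best_recovered = recovered
            found = 1
        }
    }
    END { if (found) print best }
    '
}
```

For example, feeding it `node1 100 1`, `node2 100 0` and `node3 90 0` selects `node2`: it shares the highest seqno with `node1` but was not recovered. A flagged node is still chosen when it alone holds the highest seqno, which matches the "when possible" wording above.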

- Used : during `detect_first_master()` to elect the bootstrap node
- Created: in `detect_last_commit()` if the node has a pending XA