High Availability (HA) using Pacemaker
Known issues:
High Availability
=================
1). When pcs cannot stop pacemaker on a node, it does not stop cman/corosync
on the remaining nodes. If one or more pacemaker cluster nodes are powered
off and you use the --all flag with the "pcs cluster stop" command to shut
down pacemaker on all cluster nodes, the command successfully kills all the
pacemaker processes but does NOT kill the corosync processes.
As a result, quorum may be established when it should not be: the cluster can
reach quorum even though pacemaker is not running on a sufficient number of
cluster nodes.
Workaround: To stop pacemaker on all cluster nodes, issue the command
"pcs cluster stop --force" on each active cluster node individually.
Do not use the --all flag if any cluster nodes are unreachable.
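The workaround can be scripted roughly as follows. This is a minimal sketch,
not a supported tool: the node names, the use of ssh to reach each node, and
the NODES and RUNNER variables are all assumptions made for illustration. By
default it is a dry run that only prints the command for each node.

```shell
#!/bin/sh
# Hypothetical node list (assumption): replace with your reachable cluster nodes.
NODES="${NODES:-node1 node2 node3}"
# RUNNER is how each node is reached (assumption). The default "echo ssh" is a
# dry run that prints the commands; set RUNNER=ssh to actually execute them.
RUNNER="${RUNNER:-echo ssh}"

stop_pacemaker_each_node() {
    for node in $NODES; do
        # Stop the cluster on this one node only; --all is deliberately not used.
        $RUNNER "$node" pcs cluster stop --force ||
            echo "could not stop cluster services on $node" >&2
    done
}

stop_pacemaker_each_node
```

List only the nodes that are currently reachable in NODES. Powered-off nodes
are left out, which is the reason for avoiding --all in the first place.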
2). If a resource has failed, a failure message appears when you display the
cluster status. Once you have resolved the problem with that resource, you can
clear the failure status with the pcs resource cleanup command. This command
resets the resource status and failcount, telling the cluster to forget the
operation history of the resource and re-detect its current state. If the logs
keep repeating messages about a failed operation of the resource, this command
also helps to stop those repeated messages.
The following command cleans up the resource specified by resource_id:
pcs resource cleanup resource_id
If you do not specify a resource_id, this command resets the resource status
and failcount for all resources.
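The two forms of the command can be wrapped as in the sketch below. The
resource name my_vm_resource, the PCS variable, and the cleanup_resource
helper are hypothetical and exist only for this example. By default the
sketch is a dry run that prints the pcs commands instead of running them.

```shell
#!/bin/sh
# PCS defaults to "echo pcs" (dry run, assumption); set PCS=pcs to run for real.
PCS="${PCS:-echo pcs}"

cleanup_resource() {
    if [ -n "$1" ]; then
        # Reset status and failcount for the named resource only.
        $PCS resource cleanup "$1"
    else
        # No resource_id given: reset status and failcount for all resources.
        $PCS resource cleanup
    fi
}

cleanup_resource my_vm_resource   # hypothetical resource_id
cleanup_resource                  # all resources
```

After a real cleanup, displaying the cluster status again should show whether
the failure message has been cleared.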
3). A temporary interruption of the interface that corosync uses for
communication may cause resources to be started on multiple nodes. This might
cause data corruption if the resource is a VirtualDomain resource.
Workaround: Set up corosync with a redundant ring. This redundancy reduces
the risk of losing corosync communication.
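One way to check that a second ring is in place is to count the rings that
corosync reports on a node. The sketch below assumes that
"corosync-cfgtool -s" prints one line beginning with "RING ID" per ring, as
corosync 1.x and 2.x do; other versions may format the status differently.
The count_rings helper is hypothetical.

```shell
#!/bin/sh
# Count the rings in ring-status text read from standard input.
# Assumption: each ring is introduced by a line starting with "RING ID".
count_rings() {
    grep -c '^RING ID'
}

# Query the local node only when the tool is installed.
if command -v corosync-cfgtool >/dev/null 2>&1; then
    rings=$(corosync-cfgtool -s | count_rings)
    if [ "$rings" -ge 2 ]; then
        echo "redundant ring configured: $rings rings"
    else
        echo "only $rings ring(s) reported; no redundancy" >&2
    fi
fi
```

A count of 2 or more only shows that the rings are configured. The status
line printed for each ring still needs to be read to confirm that none of
them is marked faulty.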