Applies to SUSE OpenStack Cloud 6

J Recovering Clusters to a Healthy State

If one node in your cluster refuses to rejoin the cluster, it is most likely that the node has not been shut down cleanly. This can either be due to manual intervention or because the node has been fenced (shut down) by the STONITH mechanism of the cluster, to protect the integrity of data in case of a split-brain scenario.

The following sections cover problems with the Control Node cluster and show how to restore a degraded cluster to full strength. Recovery takes the following basic steps:

  1. Re-adding the Node to the Cluster

  2. Recovering Crowbar and Chef

  3. In addition, you may need to reset resource failcounts in order to allow resources to start on the node you have re-added to the cluster. See Section J.4, “Cleaning Up Resources”.

  4. You may also need to manually remove the maintenance mode flag from a node. See Section J.5, “Removing the Maintenance Mode Flag from a Node”.

For a list of possible symptoms that help you to diagnose a degraded cluster, see Section J.1, “Symptoms of a Degraded Control Node Cluster”.

J.1 Symptoms of a Degraded Control Node Cluster

The following incidents may occur if a Control Node in your cluster has been shut down in an unclean state:

  • A VM reboots although the SUSE OpenStack Cloud administrator did not trigger this action.

  • One of the Control Nodes is in the status Problematic in the Crowbar Web interface, signified by a red dot next to the node.

  • The Hawk Web interface stops responding on one of the Control Nodes, while it is still responding on the others.

  • The SSH connection to one of the Control Nodes freezes.

  • The OpenStack services stop responding for a short while.
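
If several of these symptoms appear at once, a one-shot status query from any still-responsive Control Node usually identifies the unhealthy member. A minimal sketch, assuming the Pacemaker command line tools are installed (as they are on the cluster nodes); the guard only makes the snippet harmless on hosts without them:

```shell
# One-shot cluster status: on a healthy cluster every node is "Online";
# a fenced or hung node shows up as OFFLINE or UNCLEAN instead.
# Guarded so the sketch does nothing harmful where Pacemaker is absent.
if command -v crm_mon >/dev/null 2>&1; then
    crm_mon -1 || true
else
    echo "crm_mon not available on this host"
fi
checked=yes
```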

J.2 Re-adding the Node to the Cluster

  1. Reboot the node.

  2. Connect to the node via SSH from the Administration Server.

  3. If you have a 2-node cluster, remove the block file that is created on a node during start of the cluster service:

    root # rm /var/spool/corosync/block_automatic_start

    The block file avoids STONITH deathmatches for 2-node clusters (where each node kills the other one, resulting in both nodes rebooting all the time). When Corosync shuts down cleanly, the block file is automatically removed. Otherwise the block file is still present and prevents the cluster service from (re-)starting on that node.

  4. Start the cluster service on the cluster node:

    root # systemctl start pacemaker
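
Steps 3 and 4 can be combined defensively: the block file is only removed when it actually exists, so the same snippet is safe on clusters of any size. A sketch (the path is the one from step 3):

```shell
# Conditional variant of step 3: remove the Corosync block file only if an
# unclean shutdown left it behind; a no-op when the file is absent.
BLOCK_FILE=/var/spool/corosync/block_automatic_start
if [ -f "$BLOCK_FILE" ]; then
    rm "$BLOCK_FILE"
    state="removed"
else
    state="absent"
fi
echo "block file $state"
# Once the file is gone, step 4 (systemctl start pacemaker) can follow.
```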

J.3 Recovering Crowbar and Chef

Making the Pacemaker node rejoin the cluster is not enough. All nodes in the cloud (including the Administration Server) need to be aware that this node is back online. This requires the following steps for Crowbar and Chef:

  1. Log in to the node you have re-added to the cluster.

  2. Re-register the node with Crowbar by executing:

    root # service crowbar_join start

  3. Log in to one of the other Control Nodes.

  4. Trigger a Chef run:

    root # chef-client

J.4 Cleaning Up Resources

A resource is automatically restarted if it fails, but each failure increases the resource's failcount. If a migration-threshold has been set for the resource, the node is no longer allowed to run the resource when the number of failures reaches the migration threshold. To allow the resource to start on that node again, reset the resource's failcount by cleaning up the resource manually. You can clean up individual resources using the Hawk Web interface, or all of them in one go as described below:

  1. Log in to one of the cluster nodes.

  2. Clean up all stopped and timed-out resources with the following command:

    root # crm_resource -o | \
      awk '/\tStopped |Timed Out/ { print $1 }' | \
      xargs -n1 crm resource cleanup
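
To see what the pipeline does, its filtering step can be run against sample text instead of a live cluster. In the following sketch the resource names and states are invented for illustration; only the awk filter is taken from the command above:

```shell
# How the filter works, shown on made-up "crm_resource -o"-style output
# (tab-separated name, agent, and state; the entries are illustrative):
stopped=$({
  printf 'vip-admin\t(ocf::heartbeat:IPaddr2):\tStarted node1\n'
  printf 'rabbitmq\t(ocf::rabbitmq:rabbitmq-server):\tStopped \n'
  printf 'neutron-l3\t(systemd:neutron-l3-agent):\tTimed Out\n'
} | awk '/\tStopped |Timed Out/ { print $1 }')
echo "$stopped"
# Prints only rabbitmq and neutron-l3; xargs -n1 then runs
# "crm resource cleanup" once per printed resource name.
```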

J.5 Removing the Maintenance Mode Flag from a Node

During normal operation, chef-client sometimes needs to place a node into maintenance mode. The node is kept in maintenance mode until the chef-client run finishes. However, if the chef-client run fails, the node may be left in maintenance mode. In that case, cluster management tools such as crmsh and Hawk show all resources on that node as unmanaged. To remove the maintenance flag:

  1. Log in to the cluster node.

  2. Disable the maintenance mode with:

    root # crm node ready
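
crmsh also accepts the node name as an argument, which is useful when you run the command from a different cluster node. A guarded sketch, where the node name is a hypothetical placeholder:

```shell
# Clear the maintenance flag for a named node. NODE_NAME is a placeholder,
# and the guard makes the sketch harmless where crmsh is not installed.
NODE_NAME="d52-54-00-01-02-03"   # hypothetical node name
if command -v crm >/dev/null 2>&1; then
    crm node ready "$NODE_NAME" || true   # ignore errors for unknown nodes
    action="cleared"
else
    action="skipped"
    echo "crm not installed; would run: crm node ready $NODE_NAME"
fi
```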