In such situations, manual intervention is needed to start the package, either on the same site or on another. The failback policy determines what action to take when a package is not running on its primary node and the primary node is again capable of running it.
The "MANUAL" policy means no attempt will be made to move the package back to its primary node when it is running on an adoptive node. This policy means Serviceguard will attempt to move the package back to its primary node as soon as the primary node is capable of running the package. This is an optional parameter. A smaller number indicates higher priority. If a number is specified, it must be unique in the cluster. To help assign unique priorities, HP recommends you use priorities in increments of This will allow you to add new packages without having to reassign priorities.
Multi-node and system multi-node packages cannot be assigned a priority. The priority parameter is used only when a weight has been defined for the package, the package depends on other packages, or other packages depend on it, but it can be specified even when no weights or dependencies have yet been configured.
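In a modular package configuration file, the failback policy and priority appear as ordinary parameters. A minimal sketch follows; the values shown are illustrative, not taken from the cluster discussed here:

```shell
# Excerpt from a modular package configuration file (illustrative values).
# failback_policy: manual = stay on the adoptive node; automatic = move back
# to the primary node as soon as it is capable of running the package.
failback_policy   manual

# Optional package priority: a smaller number means higher priority, and each
# value must be unique in the cluster. Leaving gaps between values makes it
# easier to slot in new packages later without renumbering existing ones.
priority          20
```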
Serviceguard gives preference to running the higher-priority package. This means that, if necessary, Serviceguard will halt a lower-priority package, or halt it and restart it on another node, in order to run a higher-priority package. For example, suppose package pkg1 depends on package pkg2 being up on the same node, both have package switching enabled, and both are currently up on node node1. If pkg1 needs to fail over to node2, it will also need pkg2 to move to node2.
If pkg1 has higher priority than pkg2, it can force pkg2 to move to node2. Once the down system is rebooted, we can have that node join the cluster with the command cmrunnode. In a cluster, you probably want to know that everything is working as expected.
Here is the output from my cmviewcl command; it details the starting of the cluster. It takes time to get used to the output from different cluster operations. That's why we are going to test cluster functionality at the end of this section. Part of that will be to check syslog. Here's the healthy output I received in my syslog: Connection from localhost. Daemon Initialization - Maximum number of packages supported for this incarnation. Heartbeat Interval is 1 second.
Node Timeout is 2 seconds. Network Polling Interval is 2 seconds. Auto Start Timeout is reported. Information specific to node hpeos follows: Connection from hpeos. The maximum number of concurrent local connections to the daemon that will be supported is reported. Starting cluster management protocols.
Attempting to form a new cluster. There are no ATS devices on this cluster. New node hpeos is joining the cluster. Clearing Cluster Lock. Turning on safety time protection. The new active cluster membership is then reported. Basically, we start the cmcld daemon. In initializing, cmcld outputs our cluster timing parameters. It then identifies which LAN cards are active and which are standby cards.
We then work out whether there are any shared tape devices. We then see the other node hpeos joining the cluster, giving us two members. Finally, the cluster LVM daemon is started. The entries you see for hacl-cfg come from the cluster configuration daemon, cmclconfd, which gathers information about LAN cards and volume groups.
It also distributes the cluster binary file. During the startup of the cluster, all nodes are communicating with each other to ensure that the cluster is formed correctly and also to elect a cluster coordinator.
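The timing values cmcld reports at startup correspond to parameters in the cluster ASCII configuration file, which Serviceguard expresses in microseconds. A sketch follows; the values are illustrative, chosen to match the intervals reported in the syslog excerpt above:

```shell
# Excerpt from a cluster ASCII configuration file (illustrative values).
HEARTBEAT_INTERVAL        1000000    # 1 second between heartbeat packets
NODE_TIMEOUT              2000000    # declare a node failed after 2 seconds
AUTO_START_TIMEOUT        600000000  # how long to wait for all nodes at auto-start
NETWORK_POLLING_INTERVAL  2000000    # poll LAN interfaces every 2 seconds
```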
This causes weird results; I once saw an "Error:" message that made no sense. We checked everything from cables to linkloop commands and tried resetting LAN cards with lanadmin. The only reason I managed to fix it was that my suspicions were aroused by the lack of entries in syslog.
In the end, the customer involved admitted that he had recently uninstalled and reinstalled Serviceguard a few times. Don't ask me why; he just did. The key was getting familiar with the expected output in syslog. We should see similar output on all nodes in the cluster. Here is a brief overview of the election protocol every time a cluster reforms: a. Start a reconfiguration timer.
b. Search for the existing cluster coordinator. c. If the Cluster Coordinator replies, send it your "vote"; retry until the "reconfiguration timeout" expires. d. Attempt to grab the cluster lock and form the cluster; if this fails, halt the node. e. Wait for the "quiescence" timeout, an elapsed time to allow other nodes to halt. f. The new Cluster Coordinator informs the cluster members of the status and membership of the cluster. g. Start heartbeat packets to all cluster members. h. Clear the cluster lock. The current Cluster Coordinator does not perform steps b and c. If you are interested in finding the cluster coordinator, you need to increase the Serviceguard logging level. This is achieved by using the contributed command cmsetlog. Use of this command by customers is normally only under the guidance of HP Support personnel.
HP does not offer official support for this command, so be very careful if you are going to use it. The command and its options are discussed on various HP Education Services courses covering Serviceguard. If you are unfamiliar with the command, it is strongly suggested you do not use it. With increased logging, you would see a syslog message such as "This node is now cluster coordinator". We have the option of starting cluster services every time a node starts up. Once started, why would a node need rebooting?
If it does reboot, I would want to know why. Let's look at an example of when a node crashes due to a hardware problem. This will cause a cluster reformation. If the hardware problem is intermittent, the fault may not recur for some time; alternately, it could happen almost immediately. A cluster reformation in itself is not too much to ask, but it is something we want to avoid if at all possible.
Having spurious reformations can confuse everyone involved and may actually "hide" real problems when they do occur. We will now run a number of tests against the cluster. Some of them are quite straightforward and test the basic functionality of the cluster; we use Serviceguard commands to accomplish these tests.
I will call these Standard Tests. Other tests are designed to uncover whether Serviceguard can provide the high availability features it claims it can.
For these tests, we use "unorthodox" methods to test Serviceguard. I call these Stress Tests. We need to be sure that Serviceguard will react promptly and correctly in the event of an unexpected incident, e.g., a sudden node failure. Let's start with the Standard Tests: Cluster can start and stop successfully. You should be able to start and stop the cluster from any node; this is basic functionality. Do not proceed until you are satisfied that the cluster can be started and stopped from every node.
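Starting and stopping the whole cluster is done with cmruncl and cmhaltcl. A sketch of this first Standard Test, run from any node in the cluster (remember that cmruncl can only be used when the cluster is not already running):

```shell
cmruncl -v     # start cluster services on all configured nodes
cmviewcl -v    # verify that the cluster and all nodes are up
cmhaltcl -v    # halt cluster services on all nodes
```

Checking syslog on every node after each step, as discussed above, is part of the test.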
Individual nodes can leave the cluster. When we are performing critical maintenance on an individual node, we want to stop cluster services only on that node. Some administrators feel that if the node is going to be "out of commission" for a considerable time, then we should take it out of the cluster altogether.
I can see some logic in that. My only concern is that we will have to recompile the cluster binary configuration file to remove and then add the node into the cluster. What would happen if another node were not running during this recompilation? We could be in a position where we want to re-add the original node, but we are having to wait until the second node comes back online to ensure that every node has the most up-to-date cluster binary file.
For this reason alone, I would leave the node as a member of the cluster but just stop cluster services. To stop cluster services, we would run cmhaltnode, which can be run from any node in the cluster and can also halt cluster services for a node other than our own. Individual nodes can join the cluster. In this instance, we want a node to rejoin a running cluster. Maybe we have concluded our critical maintenance, or the machine crashed and we have finished our investigations and repairs.
We want to start cluster services only on this node. To accomplish this, we run cmrunnode. Like cmhaltnode, it can be run from any node in the cluster and can start cluster services for a node other than our own. These tests are a little "unorthodox" only insofar as we are trying to think of situations that may happen in a production environment and that could threaten access to our applications.
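Assuming a node named node2 (the hostname here is illustrative), halting and restarting cluster services on an individual node can be sketched as:

```shell
cmhaltnode -v          # stop cluster services on the local node only
cmhaltnode -v node2    # stop cluster services on another node
cmrunnode -v           # rejoin the running cluster from the local node
cmrunnode -v node2     # start cluster services on another node
```

In all cases the node remains a configured member of the cluster, so no recompilation of the cluster binary file is needed.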
We want to test these situations in a controlled way to ensure that Serviceguard is behaving as expected. Remove an active LAN card. There should be no perceived problems when we perform this test. One word of warning: you may see a cluster reformation when you do this. This is due to timing issues with sending and receiving heartbeat packets. Because Serviceguard can relocate the IP address almost instantaneously, we see the cluster reform at the same time as the IP address is relocated. The next test covers a situation where cmcld is starved of resources.
This is a particularly critical situation. As we now know, cmcld is a critical part of the suite of Serviceguard daemons and runs at a high priority. This means that when it wants to run, there's a high probability that it will be the most important process on the system. There are few processes with a higher priority.
However, I have come across many installations where Real-Time priorities have been used to improve the responsiveness of critical application processes. The administrators felt that it was highly unlikely that at any one time all the database processes would be executing requests to such an intensity that cmcld would not get execution time on any processor. As we know from Murphy's Law, such a situation did arise. The database processes spawned a significant number of child processes.
Because cmcld was starved of processor time, the cluster coordinator decided that the node had failed and instigated a cluster reformation. On reforming the cluster (a two-node cluster), the original node had, by that time, "resolved" its starvation problem and won the resulting election, and hence was the only node left in the cluster. The other node instigated a Transfer Of Control (TOC) to preserve data integrity (avoiding split-brain syndrome) because it did not obtain the cluster lock.
The application running on the node that instigated the TOC had to be restarted on the remaining node. The moral of the story is twofold: be very careful when assigning real-time priorities to application processes, and become familiar with how Serviceguard reacts when cmcld is starved of resources. Below, we look at analyzing the resulting crashdump.
We are interested in establishing a number of facts: Check out the cluster configuration files and syslog. Was the crash a TOC instigated by Serviceguard? When was the last time cmcld ran? This is obviously something I do not suggest that you undertake on a live system. In syslog we see entries such as: Timed out node hpeos. It may have failed. Attempting to adjust cluster membership. Reset requested from above -- lbolt: Obtaining Cluster Lock. Resetting SCSI -- lbolt: Reset detected -- lbolt: Unable to obtain Cluster Lock.
Until it is fixed, a single failure could... Turning off safety time protection since the cluster... If Serviceguard... Should you see any "SCSI:" messages, investigate them. Finally: I am now the only member of the cluster. Here's the crashdump analysis I performed on the resulting TOC of hpeos