Service group does not fail over to another node on forced power down.

Created: 12 Mar 2013 • Updated: 12 Mar 2013 | 16 comments
This issue has been solved. See solution.

VCS 6.0.1

Hi, I have configured a two-node cluster with local storage, running two service groups. Both run fine and I am able to switch them over to either node in the cluster. But when I forcibly power down the node where both service groups are active, only one service group fails over to the other node; the one running the Apache resource fails and does not fail over.

The contents of the main.cf file are pasted below.

==========================================

 

 

cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"
 
cluster mycluster (
        UserNames = { admin = IJKcJEjGKfKKiSKeJH, root = ejkEjiIhjKjeJh }
        ClusterAddress = "192.168.25.101"
        Administrators = { admin, root }
        )
 
system server3 (
        )
 
system server4 (
        )
 
group ClusterService (
        SystemList = { server3 = 0, server4 = 1 }
        AutoStartList = { server3, server4 }
        OnlineRetryLimit = 3
        OnlineRetryInterval = 120
        )
 
        IP webip (
                Device = eth0
                Address = "192.168.25.101"
                NetMask = "255.255.255.0"
                )
 
        NIC csgnic (
                Device = eth0
                )
 
        webip requires csgnic
 
 
        // resource dependency tree
        //
        //      group ClusterService
        //      {
        //      IP webip
        //          {
        //          NIC csgnic
        //          }
        //      }
 
 
group httpsg (
        SystemList = { server3 = 0, server4 = 1 }
        AutoStartList = { server3, server4 }
        OnlineRetryLimit = 3
        OnlineRetryInterval = 15
        )
 
        Apache apachenew (
                httpdDir = "/usr/sbin"
                ConfigFile = "/etc/httpd/conf/httpd.conf"
                )
 
        IP ipresource (
                Device = eth0
                Address = "192.168.25.102"
                NetMask = "255.255.255.0"
                )
 
        apachenew requires ipresource
 
 
        // resource dependency tree
        //
        //      group httpsg
        //      {
        //      Apache apachenew
        //          {
        //          IP ipresource
        //          }
        //      }
#
=====================
 

The engine log at the time of the power-down shows:

 

 

2013/03/12 16:33:02 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 16:33:02 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x1, DDNA: 0x0
2013/03/12 16:33:02 VCS ERROR V-16-1-10079 System server4 (Node '1') is in Down State - Membership: 0x1
2013/03/12 16:33:02 VCS ERROR V-16-1-10322 System server4 (Node '1') changed state from RUNNING to FAULTED
2013/03/12 16:33:02 VCS NOTICE V-16-1-10449 Group httpsg autodisabled on node server4 until it is probed
2013/03/12 16:33:02 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node server4 until it is probed
2013/03/12 16:33:02 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 16:33:02 VCS NOTICE V-16-1-10446 Group httpsg is offline on system server4
2013/03/12 16:33:02 VCS ERROR V-16-1-10205 Group ClusterService is faulted on system server4
2013/03/12 16:33:02 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 16:33:02 VCS INFO V-16-1-10493 Evaluating server3 as potential target node for group ClusterService
2013/03/12 16:33:02 VCS INFO V-16-1-10493 Evaluating server4 as potential target node for group ClusterService
2013/03/12 16:33:02 VCS INFO V-16-1-10494 System server4 not in RUNNING state
2013/03/12 16:33:02 VCS NOTICE V-16-1-10301 Initiating Online of Resource webip (Owner: Unspecified, Group: ClusterService) on System server3
2013/03/12 16:33:02 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP; Current status =eth1, DOWN.
2013/03/12 16:33:02 VCS INFO V-16-6-15015 (server3) hatrigger:/opt/VRTSvcs/bin/triggers/sysoffline is not a trigger scripts directory or can not be executed
2013/03/12 16:33:14 VCS INFO V-16-1-10298 Resource webip (Owner: Unspecified, Group: ClusterService) is online on server3 (VCS initiated)
2013/03/12 16:33:14 VCS NOTICE V-16-1-10447 Group ClusterService is online on system server3
 
As per the above logs, the default service group ClusterService failed over to the other node, but the httpsg service group failed.
 
Please advise.
 
Thanks....
 
 

mikebounds

How do you power down your server? Commands from the O/S like "halt" and "reboot" are sometimes not severe enough: VCS knows the shutdown was initiated by the user, as opposed to a power outage, so it treats it as an administrative power-down and does not fail over service groups. ClusterService is a "special" service group, though, so it is failed over.

The best O/S command I got to work for this test was "uadmin 2 0", and even this was sometimes not quick enough at bringing down the server, so VCS knew the command had been run. The best way is to power down the system boards if this server is part of a logical domain, or to flick the power switch.

If you are doing a severe powerdown and still having issues, can you provide the output of "hastatus -sum" before you do your powerdown test.

Mike

 

 

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has helped you, please vote or mark as solution

mikebounds

As an aside the settings on the httpsg are probably not what you intend:

 

OnlineRetryLimit = 3
OnlineRetryInterval = 15

This means that if the Apache process dies, VCS will try to restart the whole group locally, and if it faults again after 15 seconds the previous fault will be ignored, which probably means it will never try the other node.

OnlineRetryLimit is normally set at the service group level only for the ClusterService group. When it is set for "normal" service groups, it is normally set to 1, which means VCS will try to restart the whole group locally once and then try another node.

If you want Apache to restart locally then you should set this at a resource type level like "hatype -modify Apache RestartLimit 1", so this will JUST restart Apache, not all resources in the service group.
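
As a rough sketch of how the two settings described above could be applied from the command line (assuming a running cluster and root access): the helper name set_restart_policy is made up for illustration, and the value 1 for OnlineRetryLimit is simply the one mentioned in this post, not a recommendation for every group.

```shell
# Hypothetical helper: cap group-level retries and let the Apache agent
# restart a failed httpd locally before the fault is escalated.
set_restart_policy() {
    group=$1          # service group name, e.g. httpsg
    retry_limit=$2    # e.g. 1 = restart the whole group locally once

    haconf -makerw                                          # open the configuration read-write
    hagrp -modify "$group" OnlineRetryLimit "$retry_limit"  # group-level retry
    hatype -modify Apache RestartLimit 1                    # restart only the Apache resource
    haconf -dump -makero                                    # save and close the configuration
}

# Example:
# set_restart_policy httpsg 1
```

Note that RestartLimit is changed on the Apache resource type here, so it would apply to every Apache resource in the cluster.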

Mike


Shivam_HCL

To power down, I simply unplug the power cable. I am currently working in a test environment and will go live in production once it works as expected.

My intention is to keep httpsg available even if one node suddenly goes offline due to a hardware failure or panic. In my case ClusterService works well, but unfortunately httpsg, which I created, doesn't fail over to the other node from the faulted server.

    OnlineRetryLimit = 3
    OnlineRetryInterval = 15
These two attributes were set by me while troubleshooting.
===============

[root@server3 log]# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server3              RUNNING              0
A  server4              RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  server3              Y          N               ONLINE
B  ClusterService  server4              Y          N               OFFLINE
B  httpsg          server3              Y          N               OFFLINE
B  httpsg          server4              Y          N               ONLINE
[root@server3 log]#

[root@server3 log]# hagrp -display | egrep -i  "SystemList|FailOverPolicy|AutoStart|AutoFailOver"
ClusterService AutoFailOver          global     1
ClusterService AutoStart             global     1
ClusterService AutoStartIfPartial    global     1
ClusterService AutoStartList         global     server3 server4
ClusterService AutoStartPolicy       global     Order
ClusterService ClusterFailOverPolicy global     Manual
ClusterService FailOverPolicy        global     Priority
ClusterService SystemList            global     server3 0       server4 1
httpsg         AutoFailOver          global     1
httpsg         AutoStart             global     1
httpsg         AutoStartIfPartial    global     1
httpsg         AutoStartList         global     server3 server4
httpsg         AutoStartPolicy       global     Order
httpsg         ClusterFailOverPolicy global     Manual
httpsg         FailOverPolicy        global     Priority
httpsg         SystemList            global     server3 0       server4 1
[root@server3 log]#

[root@server3 log]# hares -display | grep -i  critical
apachenew    Critical              global     1
csgnic       Critical              global     1
ipresource   Critical              global     1
webip        Critical              global     1
[root@server3 log]#

 

Shivam_HCL

I am actually working in a test environment before going live in production. So far I have created just one service group, and it does not fail over when I intentionally pull the power cable from the server. So I am not currently using any command to power down the server; I did try the reboot command once as well, but no luck with that either.

 

 

 

 

Satish K. Pagare

Hi Shivam,

Please confirm that, before you reboot any of the nodes, the httpsg group and all the resources that are part of it are in a steady "OFFLINE" state in VCS. That means they are fully probed in VCS and detected in a steady state - OFFLINE. Then try the tests.

Thanks,

Satish

mikebounds

The hastatus -sum you gave is not from before the test, as it shows the service groups on different systems. If you show the engine log from before the power-down, I can work out what the state was before the test. I would need to see the engine log from the last instance of the message "Group httpsg is online on system" before you did the power-down.
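
One way to pull out that part of the log is sketched below. The helper name log_since_last is hypothetical, and the engine log path in the example comment is an assumption about the default location; adjust it to wherever the engine log lives on your system.

```shell
# Hypothetical helper: print a log file from the LAST line containing a
# fixed string through to the end of the file.
log_since_last() {
    pattern=$1
    logfile=$2
    awk -v pat="$pattern" '
        { lines[NR] = $0 }                  # buffer every line
        index($0, pat) { start = NR }       # remember the most recent match
        END {
            if (start) {
                for (i = start; i <= NR; i++) print lines[i]
            }
        }
    ' "$logfile"
}

# Example (log path is an assumption):
# log_since_last "Group httpsg is online on system" /var/VRTSvcs/log/engine_A.log
```

The match is a plain substring test rather than a regular expression, so the message text can be pasted in as-is. Run it before the power-down if you want only the pre-test state, since anything logged afterwards is printed too.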

Mike


Shivam_HCL

Hi Mike

Please find below the logs and the status of the service groups and resources during the test.

 

++++++++++++++++++++Before test+++++++++++++++++++++++++++++

[root@server3 log]#  hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server3              RUNNING              0
A  server4              RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  server3              Y          N               OFFLINE
B  ClusterService  server4              Y          N               ONLINE
B  httpsg          server3              Y          N               OFFLINE
B  httpsg          server4              Y          N               ONLINE

[root@server3 log]# hagrp -state
#Group         Attribute             System     Value
ClusterService State                 server3    |OFFLINE|
ClusterService State                 server4    |ONLINE|
httpsg         State                 server3    |OFFLINE|
httpsg         State                 server4    |ONLINE|

[root@server3 log]# hares -state
#Resource    Attribute             System     Value
apachenew    State                 server3    OFFLINE
apachenew    State                 server4    ONLINE
csgnic       State                 server3    ONLINE
csgnic       State                 server4    ONLINE
ipresource   State                 server3    OFFLINE
ipresource   State                 server4    ONLINE
webip        State                 server3    OFFLINE
webip        State                 server4    ONLINE
[root@server3 log]#

[root@server3 log]# hares -disp | grep -i group
apachenew    Group                 global     httpsg
csgnic       Group                 global     ClusterService
ipresource   Group                 global     httpsg
webip        Group                 global     ClusterService
[root@server3 log]#

============ While Powered off server4 =====================

(Both service groups should have failed over to server3 as expected, but only "ClusterService" failed over, not "httpsg".)

[root@server3 log]# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server3              RUNNING              0
A  server4              FAULTED              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  server3              Y          N               ONLINE
B  httpsg          server3              Y          N               OFFLINE
B  httpsg          server4              Y          Y               OFFLINE

[root@server3 log]# hares -state
#Resource    Attribute             System     Value
apachenew    State                 server3    OFFLINE
apachenew    State                 server4    OFFLINE
csgnic       State                 server3    ONLINE
csgnic       State                 server4    ONLINE
ipresource   State                 server3    OFFLINE
ipresource   State                 server4    OFFLINE
webip        State                 server3    ONLINE
webip        State                 server4    OFFLINE

[root@server3 log]# hagrp -state
#Group         Attribute             System     Value
ClusterService State                 server3    |ONLINE|
ClusterService State                 server4    |OFFLINE|
httpsg         State                 server3    |OFFLINE|
httpsg         State                 server4    |OFFLINE|
[root@server3 log]#

ENGINE LOG WHILE Server4 WAS DOWN (Collected from server3)
---------------------------------
2013/03/12 18:54:47 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP; Current status =eth1, DOWN.
2013/03/12 18:54:49 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 18:54:49 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x1, DDNA: 0x0
2013/03/12 18:54:49 VCS ERROR V-16-1-10079 System server4 (Node '1') is in Down State - Membership: 0x1
2013/03/12 18:54:49 VCS ERROR V-16-1-10322 System server4 (Node '1') changed state from RUNNING to FAULTED
2013/03/12 18:54:49 VCS NOTICE V-16-1-10449 Group httpsg autodisabled on node server4 until it is probed
2013/03/12 18:54:49 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node server4 until it is probed
2013/03/12 18:54:49 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 18:54:49 VCS NOTICE V-16-1-10446 Group httpsg is offline on system server4
2013/03/12 18:54:49 VCS ERROR V-16-1-10205 Group ClusterService is faulted on system server4
2013/03/12 18:54:49 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 18:54:49 VCS INFO V-16-1-10493 Evaluating server3 as potential target node for group ClusterService
2013/03/12 18:54:49 VCS INFO V-16-1-10493 Evaluating server4 as potential target node for group ClusterService
2013/03/12 18:54:49 VCS INFO V-16-1-10494 System server4 not in RUNNING state
2013/03/12 18:54:49 VCS NOTICE V-16-1-10301 Initiating Online of Resource webip (Owner: Unspecified, Group: ClusterService) on System server3
2013/03/12 18:54:49 VCS INFO V-16-6-15015 (server3) hatrigger:/opt/VRTSvcs/bin/triggers/sysoffline is not a trigger scripts directory or can not be executed
2013/03/12 18:55:02 VCS INFO V-16-1-10298 Resource webip (Owner: Unspecified, Group: ClusterService) is online on server3 (VCS initiated)
2013/03/12 18:55:02 VCS NOTICE V-16-1-10447 Group ClusterService is online on system server3

============ Status after i powered on server4 =================

[root@server3 log]# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  server3              RUNNING              0
A  server4              RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  ClusterService  server3              Y          N               ONLINE
B  ClusterService  server4              Y          N               OFFLINE
B  httpsg          server3              Y          N               ONLINE
B  httpsg          server4              Y          N               OFFLINE

[root@server3 log]# hares -state
#Resource    Attribute             System     Value
apachenew    State                 server3    ONLINE
apachenew    State                 server4    OFFLINE
csgnic       State                 server3    ONLINE
csgnic       State                 server4    ONLINE
ipresource   State                 server3    ONLINE
ipresource   State                 server4    OFFLINE
webip        State                 server3    ONLINE
webip        State                 server4    OFFLINE

[root@server3 log]# hagrp -state
#Group         Attribute             System     Value
ClusterService State                 server3    |ONLINE|
ClusterService State                 server4    |OFFLINE|
httpsg         State                 server3    |ONLINE|
httpsg         State                 server4    |OFFLINE|
[root@server3 log]#

ENGINE LOG While Server4 was coming up and came up.
---------------------

2013/03/12 19:02:59 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 19:02:59 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x1, DDNA: 0x2
2013/03/12 19:02:59 VCS ERROR V-16-1-10113 System server4 (Node '1') is in DDNA Membership - Membership: 0x1, Visible: 0x0
2013/03/12 19:03:02 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, DOWN; Current status =eth1, UP.
2013/03/12 19:03:08 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 19:03:08 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x3, DDNA: 0x2
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System  (Node '1') changed state from UNKNOWN to INITING
2013/03/12 19:03:08 VCS ERROR V-16-1-10111 System server4 (Node '1') is in Regular and Jeopardy Memberships - Membership: 0x3, Jeopardy: 0x2
2013/03/12 19:03:08 VCS NOTICE V-16-1-10453 Node: 1 changed name from: 'server4' to: 'server4'
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from FAULTED to INITING
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from INITING to CURRENT_DISCOVER_WAIT
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from CURRENT_DISCOVER_WAIT to REMOTE_BUILD
2013/03/12 19:03:09 VCS INFO V-16-1-10455 Sending snapshot to node membership: 0x2
2013/03/12 19:03:10 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from REMOTE_BUILD to RUNNING
2013/03/12 19:03:12 VCS INFO V-16-1-10304 Resource ipresource (Owner: Unspecified, Group: httpsg) is offline on server4 (First probe)
2013/03/12 19:03:12 VCS INFO V-16-1-10304 Resource webip (Owner: Unspecified, Group: ClusterService) is offline on server4 (First probe)
2013/03/12 19:03:14 VCS INFO V-16-6-15015 (server4) hatrigger:/opt/VRTSvcs/bin/triggers/injeopardy is not a trigger scripts directory or can not be executed
2013/03/12 19:03:14 VCS INFO V-16-6-15015 (server4) hatrigger:/opt/VRTSvcs/bin/triggers/sysjoin is not a trigger scripts directory or can not be executed
2013/03/12 19:03:14 VCS INFO V-16-6-15023 (server4) dump_tunables:
########## VCS Environment Variables ##########
VCS_CONF=/etc/VRTSvcs
VCS_DIAG=/var/VRTSvcs
VCS_HOME=/opt/VRTSvcs
VCS_LOG_AGENT_NAME=
VCS_LOG_CATEGORY=6
VCS_LOG_SCRIPT_NAME=hatrigger
VCS_LOG=/var/VRTSvcs
########## Other Environment Variables ##########
CONSOLE=/dev/pts/0
HOME=/
INIT_VERSION=sysvinit-2.86
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/opt/VRTSvcs/lib:
PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/sbin:/usr/sbin:/bin:/usr/bin:/opt/VRTSvcs/bin
previous=N
PREVLEVEL=N
PWD=/var/VRTSvcs/diag/had
runlevel=5
RUNLEVEL=5
SELINUX_INIT=YES
SHLVL=5
TERM=linux
_=/usr/bin/env

2013/03/12 19:03:14 VCS INFO V-16-6-15002 (server4) hatrigger:hatrigger executed /opt/VRTSvcs/bin/internal_triggers/dump_tunables server4 1   successfully
2013/03/12 19:03:16 VCS NOTICE V-16-1-10438 Group ClusterService has been probed on system server4
2013/03/12 19:03:16 VCS NOTICE V-16-1-10438 Group VCShmg has been probed on system server4
2013/03/12 19:03:16 VCS NOTICE V-16-1-10435 Group VCShmg will not start automatically on System server4 as the system is not a part of AutoStartList attribute of the group.
2013/03/12 19:03:18 VCS INFO V-16-1-10304 Resource apachenew (Owner: Unspecified, Group: httpsg) is offline on server4 (First probe)
2013/03/12 19:03:18 VCS NOTICE V-16-1-10438 Group httpsg has been probed on system server4
2013/03/12 19:03:18 VCS INFO V-16-1-50007 Initiating auto-start online of group httpsg
2013/03/12 19:03:18 VCS INFO V-16-1-10493 Evaluating server3 as potential target node for group httpsg
2013/03/12 19:03:18 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group httpsg on all nodes
2013/03/12 19:03:18 VCS NOTICE V-16-1-10301 Initiating Online of Resource ipresource (Owner: Unspecified, Group: httpsg) on System server3
2013/03/12 19:03:30 VCS INFO V-16-1-10298 Resource ipresource (Owner: Unspecified, Group: httpsg) is online on server3 (VCS initiated)
2013/03/12 19:03:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource apachenew (Owner: Unspecified, Group: httpsg) on System server3
2013/03/12 19:03:30 VCS NOTICE V-16-10061-20494 (server3) Apache:apachenew:online:<Apache::Start> Command exit code [0]. Command output [Application started successfully.]
2013/03/12 19:03:42 VCS INFO V-16-1-10298 Resource apachenew (Owner: Unspecified, Group: httpsg) is online on server3 (VCS initiated)
2013/03/12 19:03:42 VCS NOTICE V-16-1-10447 Group httpsg is online on system server3

 

Shivam

mikebounds

Sorry, should have spotted this earlier - you have message:

 System (server3) - Membership: 0x1, DDNA: 0x0

DDNA means "Daemon Down Node Alive", so VCS thinks server4 is still up but the "had" daemon is down, which is why service groups are not failed over. But I don't quite understand why this is happening if you are pulling the power cable. Another possible issue is having only one heartbeat, so could you provide the output from "lltstat -nvv"?

You may find the following post useful:

https://www-secure.symantec.com/connect/forums/failover-secondary-system-upon-ungraceful-shutdown

Mike


Wally_Heim

Hi Mike and Shivam,

During the time of the power-off testing there were no nodes in the DDNA membership. The hex code after DDNA (and after the other memberships, for that matter) tells you which nodes are in that membership. In this case, 0x0 indicates that no node is in that specific membership at that time.

During the reboot of the powered-off node, you can see that the DDNA membership changes to 0x2. To decode this, convert the number from hex to binary. In binary, the positions of the 1s indicate which node IDs are in that membership. Each node takes up 1 bit and the node IDs start at 0 (counting from right to left). So in this case hex 0x2 equals 0010 in binary, and the second position, where the 1 is, corresponds to node ID 1.
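
The decoding can be sketched as a small shell function. The name decode_membership is made up for illustration; it only does the bit arithmetic described above and does not talk to VCS.

```shell
# Hypothetical helper: print the node IDs whose bits are set in a
# membership mask such as 0x2 (bit 0 = node 0, bit 1 = node 1, ...).
decode_membership() {
    mask=$(( $1 ))      # shell arithmetic accepts hex with a 0x prefix
    node=0
    ids=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            ids="$ids $node"            # lowest bit set: this node is a member
        fi
        mask=$(( mask >> 1 ))           # move on to the next node's bit
        node=$(( node + 1 ))
    done
    echo "${ids# }"                     # drop the leading space
}

# Values taken from the engine log in this thread:
# decode_membership 0x1    # prints: 0      (node 0, server3)
# decode_membership 0x2    # prints: 1      (node 1, server4)
# decode_membership 0x3    # prints: 0 1    (both nodes)
```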

I'm not sure why the httpsg group is not being marked as "Faulted" at the 18:54:49 timeframe. However, the "ClusterService" group is a special group that the cluster takes extra precautions to keep online, and it fails over when other groups do not.

During the startup of the node, I see that the cluster goes into Jeopardy. Jeopardy means that you are down to a single heartbeat. Jeopardy will prevent the cluster from failing over any service group other than the ClusterService group.

Here is what I would recommend:

1. Increase the ShutdownTimeout value to 300 on all servers.  This is a per-server setting but can be set for all servers from a connection to a single node.  Be sure to save and close the cluster configuration when done making this change.

2. Ensure that the cluster is not in Jeopardy membership prior to running the power off/shutdown test.

3. Perform the power off/shutdown test again.
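
A minimal sketch of steps 1 and 2, assuming a running cluster and root access. The helper name set_shutdown_timeout is hypothetical, and the attribute is written here as ShutdownTimeout, which is how I believe the VCS documentation spells it; check the exact name on your version with hasys -display before relying on it.

```shell
# Hypothetical helper: set ShutdownTimeout to the same value on each
# listed system, then save and close the configuration.
set_shutdown_timeout() {
    seconds=$1
    shift                               # remaining arguments are system names
    haconf -makerw                      # open the configuration read-write
    for sys in "$@"; do
        hasys -modify "$sys" ShutdownTimeout "$seconds"
    done
    haconf -dump -makero                # save and close the configuration
}

# Example for the two nodes in this thread:
# set_shutdown_timeout 300 server3 server4
#
# Before the power-off test, check membership and links:
# gabconfig -a      # look for any "jeopardy" lines in the port memberships
# lltstat -nvv      # every configured link should show UP for both nodes
```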

Let us know how your testing goes.

Thank you,

Wally

Shivam_HCL

Thanks Mike & Wally,

Below is the output of lltstat -nvv, as asked by Mike.

 

[root@server3 log]# lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
   * 0 server3           OPEN
                                  eth1   UP      00:0C:29:76:C4:C0
     1 server4           OPEN
                                  eth1   UP      00:0C:29:02:73:38
     2                   CONNWAIT
                                  eth1   DOWN
     3                   CONNWAIT
                                  eth1   DOWN
     4                   CONNWAIT
                                  eth1   DOWN
     5                   CONNWAIT
                                  eth1   DOWN
     6                   CONNWAIT
                                  eth1   DOWN
     7                   CONNWAIT
                                  eth1   DOWN
     8                   CONNWAIT
                                  eth1   DOWN
     9                   CONNWAIT
                                  eth1   DOWN
    10                   CONNWAIT
                                  eth1   DOWN
    11                   CONNWAIT
                                  eth1   DOWN
    12                   CONNWAIT
                                  eth1   DOWN
    13                   CONNWAIT
                                  eth1   DOWN
    14                   CONNWAIT
                                  eth1   DOWN
    15                   CONNWAIT
                                  eth1   DOWN
    16                   CONNWAIT
                                  eth1   DOWN
    17                   CONNWAIT
                                  eth1   DOWN
    18                   CONNWAIT
                                  eth1   DOWN
    19                   CONNWAIT
                                  eth1   DOWN
    20                   CONNWAIT
                                  eth1   DOWN
    21                   CONNWAIT
                                  eth1   DOWN
    22                   CONNWAIT
                                  eth1   DOWN
    23                   CONNWAIT
                                  eth1   DOWN
    24                   CONNWAIT
                                  eth1   DOWN
    25                   CONNWAIT
                                  eth1   DOWN
    26                   CONNWAIT
                                  eth1   DOWN
    27                   CONNWAIT
                                  eth1   DOWN
    28                   CONNWAIT
                                  eth1   DOWN
    29                   CONNWAIT
                                  eth1   DOWN
    30                   CONNWAIT
                                  eth1   DOWN
    31                   CONNWAIT
                                  eth1   DOWN
    32                   CONNWAIT
                                  eth1   DOWN
    33                   CONNWAIT
                                  eth1   DOWN
    34                   CONNWAIT
                                  eth1   DOWN
    35                   CONNWAIT
                                  eth1   DOWN
    36                   CONNWAIT
                                  eth1   DOWN
    37                   CONNWAIT
                                  eth1   DOWN
    38                   CONNWAIT
                                  eth1   DOWN
    39                   CONNWAIT
                                  eth1   DOWN
    40                   CONNWAIT
                                  eth1   DOWN
    41                   CONNWAIT
                                  eth1   DOWN
    42                   CONNWAIT
                                  eth1   DOWN
    43                   CONNWAIT
                                  eth1   DOWN
    44                   CONNWAIT
                                  eth1   DOWN
    45                   CONNWAIT
                                  eth1   DOWN
    46                   CONNWAIT
                                  eth1   DOWN
    47                   CONNWAIT
                                  eth1   DOWN
    48                   CONNWAIT
                                  eth1   DOWN
    49                   CONNWAIT
                                  eth1   DOWN
    50                   CONNWAIT
                                  eth1   DOWN
    51                   CONNWAIT
                                  eth1   DOWN
    52                   CONNWAIT
                                  eth1   DOWN
    53                   CONNWAIT
                                  eth1   DOWN
    54                   CONNWAIT
                                  eth1   DOWN
    55                   CONNWAIT
                                  eth1   DOWN
    56                   CONNWAIT
                                  eth1   DOWN
    57                   CONNWAIT
                                  eth1   DOWN
    58                   CONNWAIT
                                  eth1   DOWN
    59                   CONNWAIT
                                  eth1   DOWN
    60                   CONNWAIT
                                  eth1   DOWN
    61                   CONNWAIT
                                  eth1   DOWN
    62                   CONNWAIT
                                  eth1   DOWN
    63                   CONNWAIT
                                  eth1   DOWN
[root@server3 log]#
 
 
Meanwhile I am applying the recommended changes and will test again.
 
Shivam.

mikebounds

My guess is that you have only defined one heartbeat, eth1, in /etc/llttab, and that is why the group does not fail over. If this is the case, adding eth0 as a lowpri heartbeat will resolve your issue. Can you provide the contents of /etc/llttab?

Mike


Shivam_HCL

Sure mike,

 

 

[root@server3 log]# cat /etc/llttab
set-node server3
set-cluster 36349
link eth1 eth-00:0c:29:76:c4:c0 - ether - -
[root@server3 log]#
 
[root@server4 ~]# cat /etc/llttab
set-node server4
set-cluster 36349
link eth1 eth-00:0c:29:02:73:38 - ether - -
[root@server4 ~]#
 
Sorry for this question, which may sound strange, but how does having one heartbeat affect failover when I directly pull the power cable of a running server, which completely powers the server down and drops it from the cluster? Even if I configure two heartbeats, pulling the cable would cause failure on both links.
 
Shivam

 

mikebounds

Just missed your email - so I can now see you have only defined one heartbeat. This means VCS cannot distinguish between an eth1 failure and a system failure, and therefore it will not fail over any service groups (apart from ClusterService). You need at least 2 heartbeats (which need to be independent in a live cluster, i.e. not on a dual-port card, though that is OK for testing).

Mike


SOLUTION
Wally_Heim

Hi Shivam,

VCS clustering is designed to have 2 or more heartbeats.  If you are down to a single heartbeat, the cluster goes into jeopardy membership and assumes that the cluster is not faulted but that there is another issue going on.  Since an unknown failure is happening in the environment, VCS decides to do nothing.

However, if 2 or more heartbeats are lost at the same time, VCS decides that the node is dead and marks service groups on that node as "Faulted".  The surviving nodes then attempt to online the faulted service groups.

Basically, if there is only 1 heartbeat then VCS will not react when it is lost.  However, if there are 2 or more heartbeats that are lost at the same time, then VCS will react because it assumes that the node is dead.

In your case, you need to configure a second heartbeat so that VCS comes out of Jeopardy membership.  Then it will react when the power is pulled from the active node.

Thank you,

Wally

mikebounds

If your 2 heartbeats are truly independent, it is very unlikely that they would fail at the same time. Therefore, if a node sees 2 heartbeat links go down at the same time, it assumes the other node has gone down. With only one heartbeat, when a node sees that it is gone, it has no way of telling whether the link went down (e.g. a NIC or switch failure) or the node went down. This is why the cluster goes into Jeopardy when you only have one link.

Mike


Shivam_HCL

Thanks Mike & Wally.

 

Creating another heartbeat link with the method below solved the issue.

 

added this line in /etc/llttab file. (Taken help from 

link-lowpri eth0 eth-00:50:56:91:03:30 - ether - -

with MAC address for eth0 on each system
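
For anyone checking their own nodes, here is a small sketch that counts the heartbeat links defined in an llttab file and warns when there are fewer than two. The function names are made up for illustration, and the check only reads the file; it says nothing about whether the links are actually up (lltstat -nvv shows that).

```shell
# Hypothetical helper: count "link" and "link-lowpri" directives in an llttab file.
count_llt_links() {
    awk '$1 == "link" || $1 == "link-lowpri" { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: warn (and return non-zero) when fewer than two
# heartbeat links are configured.
check_llt_links() {
    links=$(count_llt_links "$1")
    if [ "$links" -lt 2 ]; then
        echo "WARNING: only $links heartbeat link(s) in $1"
        return 1
    fi
    echo "OK: $links heartbeat links in $1"
}

# Example:
# check_llt_links /etc/llttab
```

With the original one-link file from this thread the check would warn; after adding the link-lowpri line it reports two links.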

 

Thanks a Lot to all for the quick help.....

Shivam