
VCS incorrectly reports the status of a Service Group when it is "autodisabled"

Created: 11 Dec 2013 • Updated: 12 Dec 2013
Matthew Leger

Under certain circumstances the VCS engine (had) may fault and the hashadow process may be unable to restart it. Currently, if this issue arises, VCS reports the state of the "service group" as OFFLINE, which may not be entirely true and can lead to undesirable results. The state should instead appear as "UNKNOWN", since that is the true reflection of the cluster. Although any attempt to online the "service group" results in an error indicating that the group must be "autoenabled", an administrator who has not been properly trained to handle this situation may inadvertently "autoenable" and ONLINE the service group, potentially causing data corruption.

Although the "service groups" are initially reported as "autodisabled", the messages in engine_A.log are misleading because the message that follows reports the "service group" as OFFLINE. This could easily be interpreted to mean that the group is OFFLINE and is safe to fail over to another system.

 

Here is an example, with test results, of the current behavior:

 

Nodes: walv215-a1e & walv215-a1f

Version:  SFHA 6.0.1

 

 

Initial state of the systems and service groups before the failure:

 

[root@walv215-a1e:adm]#hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  walv215-a1e          RUNNING              0
A  walv215-a1f          RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  cvm             walv215-a1e          Y          N               ONLINE
B  cvm             walv215-a1f          Y          N               ONLINE

 

Manually kill HAD and HASHADOW to reproduce the issue experienced by the customer.

 

[root@walv215-a1e:adm]#ps -ef | grep ha
    root 10511     1   0   Dec 04 ?           0:00 /opt/VRTSvcs/bin/hashadow
    root 10509     1   0   Dec 04 ?           6:43 /opt/VRTSvcs/bin/had
[root@walv215-a1e:adm]#kill -9 10511 10509

 

 

Monitor log from other node (walv215-a1f):

 

walv215-a1f#tail -f /var/VRTSvcs/log/engine_A.log

 

2013/12/09 17:43:27 VCS INFO V-16-1-10077 Received new cluster membership
2013/12/09 17:43:27 VCS NOTICE V-16-1-10112 System (walv215-a1f) - Membership: 0x2, DDNA: 0x1
2013/12/09 17:43:27 VCS ERROR V-16-1-10113 System walv215-a1e (Node '0') is in DDNA Membership - Membership: 0x2, Visible: 0x0
2013/12/09 17:43:27 VCS ERROR V-16-1-10322 System walv215-a1e (Node '0') changed state from RUNNING to FAULTED
2013/12/09 17:43:27 VCS NOTICE V-16-1-10449 Group cvm autodisabled on node walv215-a1e until it is probed
2013/12/09 17:43:27 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node walv215-a1e until it is probed
2013/12/09 17:43:27 VCS NOTICE V-16-1-10449 Group app_failover autodisabled on node walv215-a1e until it is probed
2013/12/09 17:43:27 VCS NOTICE V-16-1-10446 Group cvm is offline on system walv215-a1e
2013/12/09 17:43:27 VCS NOTICE V-16-1-10446 Group app_failover is offline on system walv215-a1e
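The misleading pairing in the log above can be spotted mechanically. The following is a minimal sketch, not part of VCS: the function name find_misleading_offline is hypothetical, and it assumes the message layout shown in this excerpt (message ID in the fifth field, with "Group <name>", "node <name>" and "system <name>" appearing as adjacent words). It prints every group that was logged as autodisabled (V-16-1-10449) and then logged as offline (V-16-1-10446) on the same system.

```shell
#!/bin/sh
# Hypothetical helper (not a VCS command): reads engine_A.log lines on stdin.
# Assumes the message layout seen in the excerpt above.
find_misleading_offline() {
    awk '
        # Remember every group/system pair that was autodisabled.
        $5 == "V-16-1-10449" {
            grp = ""; sys = ""
            for (i = 1; i < NF; i++) {
                if ($i == "Group") grp = $(i + 1)
                if ($i == "node")  sys = $(i + 1)
            }
            autodisabled[grp "," sys] = 1
        }
        # Flag an offline message for a pair that is already autodisabled.
        $5 == "V-16-1-10446" {
            grp = ""; sys = ""
            for (i = 1; i < NF; i++) {
                if ($i == "Group")  grp = $(i + 1)
                if ($i == "system") sys = $(i + 1)
            }
            if (autodisabled[grp "," sys])
                printf "%s on %s: logged offline while autodisabled\n", grp, sys
        }
    '
}
```

It could be run as `find_misleading_offline < /var/VRTSvcs/log/engine_A.log` on the surviving node. Any group it lists should be treated as being in an unknown state on that system, not as safely offline.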

 

However, communication via LLT is still available, so VCS is able to determine that the node is alive.

=================================================================================================================

NOTE: Although the "service groups" are initially reported as "autodisabled", the messages in the engine log are misleading because the messages that follow report the "service groups" as OFFLINE.

==================================================================================================================

 

walv215-a1f#lltstat -nvv | more
LLT node information:
    Node                 State    Link  Status  Address
     0 walv215-a1e       OPEN
                                  bge2   UP      00:14:4F:70:0C:66
                                  bge3   UP      00:14:4F:70:0C:67
   * 1 walv215-a1f       OPEN
                                  bge2   UP      00:14:4F:72:22:E4
                                  bge3   UP      00:14:4F:72:22:E5
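To cross-check a FAULTED system against LLT, the node state can be pulled out of this output. This is a rough sketch only: llt_node_state is a hypothetical helper name, and it assumes the column layout shown above, where the state is the word immediately after the node name.

```shell
#!/bin/sh
# Hypothetical helper (not part of LLT): prints the LLT state of one node.
# $1 is the node name; stdin is the output of "lltstat -nvv".
llt_node_state() {
    awk -v node="$1" '
        {
            # The state column follows the node name, whether or not the
            # line starts with the "*" marker for the local node.
            for (i = 1; i < NF; i++) {
                if ($i == node) { print $(i + 1); exit }
            }
        }
    '
}
```

For example, `lltstat -nvv | llt_node_state walv215-a1e` prints OPEN for the output above. A system that VCS reports as FAULTED while LLT still reports it as OPEN is alive, so the state of its service groups should not be assumed to be offline.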

 

 

Nonetheless, the “hastatus” output shows that the Service Group is “autodisabled”, so bringing it online would require manual steps by the administrator.

 

[root@walv215-a1f:/]#hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  walv215-a1e          FAULTED              0
A  walv215-a1f          RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  app_failover    walv215-a1e          Y          Y               OFFLINE                      <--- AUTODISABLED FLAG
B  app_failover    walv215-a1f          Y          N               OFFLINE
B  cvm             walv215-a1e          Y          Y               OFFLINE
B  cvm             walv215-a1f          Y          N               ONLINE
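Until VCS itself reports such groups differently, the output above can be post-processed to show what this post argues the state should be. The sketch below is only an illustration: report_autodisabled is a hypothetical helper name, and it assumes the "hastatus -sum" column order shown above (group, system, Probed, AutoDisabled, State on lines that start with B).

```shell
#!/bin/sh
# Hypothetical helper (not a VCS command): reads "hastatus -sum" output on
# stdin and relabels every autodisabled group instance as UNKNOWN.
report_autodisabled() {
    awk '$1 == "B" && $5 == "Y" {
        # $2 = group, $3 = system, $6 = state as reported by VCS
        printf "%s %s UNKNOWN (reported %s)\n", $2, $3, $6
    }'
}
```

Run as `hastatus -sum | report_autodisabled`, this prints one line per autodisabled group instance, for example "app_failover walv215-a1e UNKNOWN (reported OFFLINE)" for the output above, and prints nothing when no group is autodisabled.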

 

Any attempt to online the group results in a message stating that the group is “autodisabled”, which is by design.

 

[root@walv215-a1f:/]#hagrp -online app_failover -sys walv215-a1f

VCS WARNING V-16-1-10159 Group app_failover is auto-disabled in cluster. This can happen if group is not probed on all alive nodes in group's SystemList or VCS engine is not running on all alive nodes in group's SystemList