
Cluster fails after Solaris server is brought online after hardware replacement

Created: 21 Sep 2013 • Updated: 30 Oct 2013 | 17 comments
This issue has been solved. See solution.

We replaced one of our Solaris servers (swapped the hard drives into the new server) after a hardware failure. When the server came back up, all the applications we have on the servers in the cluster stopped functioning. All the servers' logs show that the resource could not be contacted; then it attempts to run clean and repeats this process until the server we brought up is taken offline. I am not sure why this is occurring and could not find any documentation concerning the steps needed to re-introduce a server to the cluster.

 

Sep 20 20:17:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(16) Resource(app1) - monitor procedure did not complete within the expected time.
Sep 20 20:17:52 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(17) Resource(app2) - monitor procedure did not complete within the expected time.
Sep 20 20:17:58 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(18) Resource(app3) - monitor procedure did not complete within the expected time.
Sep 20 20:18:02 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(19) Resource(app4) - monitor procedure did not complete within the expected time.
Sep 20 20:18:13 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(20) Resource(app5) - monitor procedure did not complete within the expected time.
Sep 20 20:18:28 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(21) Resource(app6) - monitor procedure did not complete within the expected time.
Sep 20 20:22:17 app_server1 AgentFramework[1105]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(4) Resource(app7) - monitor procedure did not complete within the expected time.
Sep 20 20:23:41 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13210 Thread(34) Agent is calling clean for resource(app1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13068 Thread(34) Resource(app1) - clean completed successfully.
Sep 20 20:23:42 app_server1 AgentFramework[1107]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13074 Thread(34) The monitoring program for resource(app1) has consistently failed to determine the resource      

17 Comments

mikebounds's picture

To re-introduce a server to the cluster, the steps are:

  1. Install VCS and agents
     
  2. Copy the following files from an existing node:
    /etc/llthosts /etc/gabtab /etc/llttab /etc/vx/.uuids/clusuuid (and /etc/vxfen* if you use I/O fencing)
     
  3. Create /etc/VRTSvcs/conf/sysname containing the hostname of the node
     
  4. Edit /etc/llttab so that set-node is set to either /etc/VRTSvcs/conf/sysname or the node name
     
  5. Start llt and gab on new node and check "lltstat -nvv" shows all heartbeats are connected and "gabconfig -a" shows port a membership
     
  6. Run "hastart" - this should do a remote build and create main.cf and types.cf files in /etc/VRTSvcs/conf/config
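Collected into one place, steps 2-6 above can be sketched as a small shell script. This is a hedged sketch, not an official Symantec procedure: EXISTING_NODE is an assumed healthy node name, and commands are only printed (RUN defaults to echo) until you clear RUN.

```shell
#!/bin/sh
# Sketch of steps 2-6 above. RUN defaults to echo so the commands are
# only printed; set RUN= (empty) on a real node to execute them.
# EXISTING_NODE is an assumption - use any healthy cluster node.
RUN=${RUN:-echo}
EXISTING_NODE=${EXISTING_NODE:-lsappp09}

rejoin_node() {
    # Step 2: copy the cluster configuration files from a healthy node
    for f in /etc/llthosts /etc/gabtab /etc/llttab /etc/vx/.uuids/clusuuid; do
        $RUN scp "$EXISTING_NODE:$f" "$f"
    done
    # Step 3: sysname must contain this node's name as known to the cluster
    $RUN sh -c 'uname -n > /etc/VRTSvcs/conf/sysname'
    # Step 5: start LLT and GAB, then verify heartbeats and port a
    $RUN lltconfig -c
    $RUN sh /etc/gabtab
    $RUN lltstat -nvv     # all heartbeat links should show UP
    $RUN gabconfig -a     # should show port a membership
    # Step 6: hastart does a remote build of main.cf and types.cf
    $RUN hastart
}

rejoin_node
```

Step 4 (editing set-node in /etc/llttab) is left as a manual edit, since the right value depends on whether the node name matches the copied llthosts.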

Mike

 

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

mike_ohio's picture

Thank you for your reply. The hard drives were transferred to the new hardware, so the files still exist. What confuses me is that when the server was brought up (booted), the monitor could not get the status of all resources across all the nodes in the cluster.

mikebounds's picture

Sorry, I read the post wrong: I thought you had put new disks into the existing server, but after reading again I see you put the old disks into a new server.

This may mean the devices for the network cards have changed, so you need to check references to network cards in llttab and main.cf.

However, your logs say the monitor timed out for resource app1, which doesn't sound like a NIC - can you provide details of this resource (hares -display app1)?
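As an illustrative sketch (mine, not from the thread, and assuming the llttab format posted later in this discussion), the link devices referenced in an llttab can be checked against what actually exists on the rebuilt hardware:

```shell
#!/bin/sh
# Illustrative check, not a Symantec tool: list each "link" line in an
# llttab and report whether its device path exists on this host.
check_llt_links() {
    awk '$1 == "link" { print $2, $3 }' "$1" |
    while read name dev; do
        path=${dev%%:*}                  # /dev/nxge:2 -> /dev/nxge
        if [ -e "$path" ]; then
            echo "OK      $name ($dev)"
        else
            echo "MISSING $name ($dev): $path not found"
        fi
    done
}

# On a real node: check_llt_links /etc/llttab
# Demo against a sample matching the llttab posted in this thread:
cat > /tmp/llttab.sample <<'EOF'
set-node lsappp10
set-cluster 2
link nxge2 /dev/nxge:2 - ether - -
link e1000g2 /dev/e1000g:2 - ether - -
EOF
check_llt_links /tmp/llttab.sample
```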

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

mike_ohio's picture

Sorry, I cannot get the hares output right now. My thinking on this is: if the device names changed, why would the monitor service on other members of the cluster have trouble getting the resource status? I would think that this server would just not be able to bring resources online or respond to the cluster.

mikebounds's picture

Are there errors when VCS starts on the new server and joins the cluster?

Is the new server the same hardware as the old one?

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

mike_ohio's picture

The hardware is the same. Since this is a production environment, we had to shut the server down because it was causing issues. I am planning on booting the server into single-user mode to view the logs and check for any other issues.

mike_ohio's picture

If I remove the server from the cluster, what is required to add it back in?

mike_ohio's picture

From the messages log on the problem server: the port a messages repeat until the server was shut down.

 

Sep 20 20:14:17 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 2 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 5 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 5 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 2 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 0 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 3 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 0 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 6 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 3 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 1 (e1000g2) node 1 active
Sep 20 20:14:17 lsappp10.itlogon.com llt: [ID 860062 kern.notice] LLT INFO V-14-1-10024 link 0 (nxge2) node 1 active
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: /etc/default/SUNWsneep is from a system with ID "84ac2f88"
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: saved /etc/default/SUNWsneep as /etc/default/SUNWsneep.84ac2f88
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: /etc/default/SUNWsneep successfully (re)initialized
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: cannot use backup file to restore missing values to eeprom
Sep 20 20:14:18 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Chassis Serial not available from system eeprom
Sep 20 20:14:18 lsappp10.itlogon.com last message repeated 1 time
Sep 20 20:14:19 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Chassis Serial is not in backup file
Sep 20 20:14:20 lsappp10.itlogon.com root: [ID 702911 daemon.notice] S99sneep:root: Warning: cannot use backup file for this recovery
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=287)
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=288)
Sep 20 20:14:21 lsappp10.itlogon.com picld[194]: [ID 276222 daemon.error] PICL snmpplugin: sunPlatSensorClass 0 unsupported (row=289)
Sep 20 20:14:23 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:14:23 lsappp10.itlogon.com nrpe[1397]: [ID 601491 daemon.notice] Starting up daemon
Sep 20 20:14:23 lsappp10.itlogon.com nrpe[1397]: [ID 627629 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Sep 20 20:14:37 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:14:43 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:14:57 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:02 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:16 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:21 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:35 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:40 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:15:54 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:15:59 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:13 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:18 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:32 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:33 lsappp10.itlogon.com syslog[1784]: [ID 702911 daemon.notice] VCS INFO V-16-1-11240 Command Server: running with security OFF
Sep 20 20:16:33 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10619 'HAD' starting on: lsappp10
Sep 20 20:16:33 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10620 Waiting for local cluster configuration status
Sep 20 20:16:35 lsappp10.itlogon.com genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@0 (zcons0) online
Sep 20 20:16:35 lsappp10.itlogon.com genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-10625 Local cluster configuration valid
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11034 Registering for cluster membership
Sep 20 20:16:35 lsappp10.itlogon.com gab: [ID 843912 kern.notice] GAB INFO V-15-1-20005 Port h registration waiting for seed port membership
Sep 20 20:16:35 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS NOTICE V-16-1-11035 Waiting for cluster membership
Sep 20 20:16:37 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:16:50 lsappp10.itlogon.com Had[1781]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-11306 Did not receive cluster membership, manual intervention may be needed for seeding
Sep 20 20:16:51 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:16:56 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:10 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:17:15 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:29 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:17:34 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:17:48 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-
Sep 20 20:17:53 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:07 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:11 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:26 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:30 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:18:45 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:18:49 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:19:04 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:19:09 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-1-20032 Port a closed
Sep 20 20:19:24 lsappp10.itlogon.com gab: [ID 222459 kern.notice] GAB INFO V-15-1-20026 Port a registration waiting for seed port membership
Sep 20 20:19:29 lsappp10.itlogon.com gab: [ID 397130 kern.notice] GAB INFO V-15-
 
mikebounds's picture

If only one node appears in GAB membership, this usually means it can't see the other nodes over LLT - please provide from the problem node:

output from "lltstat -nvv" and "gabconfig -a"

file /etc/gabtab

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

mike_ohio's picture

The node is disconnected from the network because of the issue it caused with the cluster.

mike_ohio's picture

lltstat -nvv from the problem host
   * 7 lsappp10          OPEN
                                  nxge2   UP      00:14:4F:DD:60:A8
                                  e1000g2   UP      00:14:4F:D4:D3:40
root@lsappp10.itlogon.com # cat /etc/gabtab
/sbin/gabconfig -c -n8
root@lsappp10.itlogon.com # /sbin/gabconfig -c -n8
root@lsappp10.itlogon.com # cat /etc/llttab
set-node lsappp10
set-cluster 2
link nxge2 /dev/nxge:2 - ether - -
link e1000g2 /dev/e1000g:2 - ether - -

 

mikebounds's picture

If your heartbeats are disconnected so that LLT node IDs 0-6 show DOWN and only node ID 7 (itself) shows UP, then this is why GAB is not seeding, hence the "port a" messages.

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

SOLUTION
mike_ohio's picture

It is only showing node 7 on this server because it cannot see the other servers, as the network is not up on this node. The other cluster node shows all members:
 

root@lsappp09 # lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
     0 lsappp01          OPEN
                                  e1000g3   UP      00:14:4F:24:EF:C3
                                  nxge0   DOWN
     1 lsappp02          OPEN
                                  e1000g3   UP      00:03:BA:B4:5E:73
                                  nxge0   DOWN
     2 lsappp03          OPEN
                                  e1000g3   UP      00:03:BA:B4:60:27
                                  nxge0   DOWN
     3 lsappp04          OPEN
                                  e1000g3   UP      00:03:BA:B1:B7:07
                                  nxge0   DOWN
     4 lsappp07          CONNWAIT
                                  e1000g3   DOWN
                                  nxge0   DOWN
     5 lsappp08          OPEN
                                  e1000g3   UP      00:03:BA:B2:1C:FB
                                  nxge0   DOWN
   * 6 lsappp09          OPEN
                                  e1000g3   UP      00:14:4F:D4:09:CF
                                  nxge0   UP      00:14:4F:DD:68:26
     7 lsappp10          CONNWAIT
                                  e1000g3   DOWN
                                  nxge0   DOWN
 
mike_ohio's picture

From another node in the cluster

 

root@lsappp04 # lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
     0 lsappp01          OPEN
                                  ce3   UP      00:14:4F:24:EF:C3
                                  ce7   UP      00:03:BA:B1:B2:1F
     1 lsappp02          OPEN
                                  ce3   UP      00:03:BA:B4:5E:73
                                  ce7   UP      00:03:BA:B1:6E:2F
     2 lsappp03          OPEN
                                  ce3   UP      00:03:BA:B4:60:27
                                  ce7   UP      00:03:BA:B1:6A:2B
   * 3 lsappp04          OPEN
                                  ce3   UP      00:03:BA:B1:B7:07
                                  ce7   UP      00:03:BA:B4:60:13
     4 lsappp07          CONNWAIT
                                  ce3   DOWN
                                  ce7   DOWN
     5 lsappp08          OPEN
                                  ce3   UP      00:03:BA:B2:1C:FB
                                  ce7   UP      00:03:BA:B1:E3:17
     6 lsappp09          OPEN
                                  ce3   UP      00:14:4F:D4:09:CF
                                  ce7   DOWN
     7 lsappp10          CONNWAIT
                                  ce3   DOWN
                                  ce7   DOWN
 
 
from main.cf
root@lsappp10.itlogon.com # cat main.cf|grep lsappp09
system lsappp09 (
        SystemList = { lsappp09 = 0 }
        SystemList = { lsappp07 = 2, lsappp09 = 0, lsappp10 = 1 }
        AutoStartList = { lsappp09 }
        SystemList = { lsappp09 = 1 }
                 lsappp09 = 6 }
                 lsappp09,
                Device @lsappp09 = { e1000g0 = 0, e1000g4 = 0 }
        SystemList = { lsappp09 = 0, lsappp10 = 1 }
 
system lsappp10 (
        SystemList = { lsappp10 = 0 }
        SystemList = { lsappp07 = 2, lsappp09 = 0, lsappp10 = 1 }
        SystemList = { lsappp10 = 0 }
                 lsappp10 = 7,
                 lsappp10 }
                Device @lsappp10 = { e1000g0 = 0, nxge1 = 0 }
        SystemList = { lsappp09 = 0, lsappp10 = 1 }
        SystemList = { lsappp10 = 0 }
 

 

mike_ohio's picture

This config does not look right to me, nor does the fact that another node reports a heartbeat issue on lsappp09.

g_lee's picture

In addition to the problem on lsappp09 mentioned by Mike, note that more than one node (04 & 09) sees lsappp07 down.

As gabtab is set to seed with 8 nodes (-n8), if 07 is also down, lsappp10 won't seed unless you run "gabconfig -c -n<number-of-nodes-up>", or "-cx" to seed regardless of how many nodes are up/down.
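As a sketch of that arithmetic (an illustrative helper of mine, not part of VCS), the number of seedable nodes can be pulled straight out of lltstat output:

```shell
#!/bin/sh
# Sketch, not an official tool: count nodes reachable over LLT and print
# the matching manual seed command. A node line in "lltstat -nvv" ends
# with its state: OPEN means reachable, CONNWAIT means not.
count_open_nodes() {
    awk '$NF == "OPEN" { c++ } END { print c+0 }'
}

# On a live node: lltstat -nvv | count_open_nodes
# Demo with the node-state lines from lsappp04's output in this thread:
up=$(count_open_nodes <<'EOF'
     0 lsappp01          OPEN
     1 lsappp02          OPEN
     2 lsappp03          OPEN
   * 3 lsappp04          OPEN
     4 lsappp07          CONNWAIT
     5 lsappp08          OPEN
     6 lsappp09          OPEN
     7 lsappp10          CONNWAIT
EOF
)
echo "suggested seed command: gabconfig -c -n$up"
# -> suggested seed command: gabconfig -c -n6
```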

If this post has helped you, please vote or mark as solution

mike_ohio's picture

This has been resolved.  I worked with Max from support (who did a great job, thanks) and mikebounds, you were right about the GAB not seeding issue.

 

Output from gabconfig -a 

GAB Port Memberships 
=============================================================== 
Port h gen bf3437 membership 0123 56 
Port h gen bf3437 jeopardy ; 6 
Port h gen bf3437 visible ; 7

 

Note there is no port a status. So only HAD (port h) was showing, but this was probably not updating, since node 7, the server that was down, was still showing as visible. So GAB was effectively stuck. The hardware replacement had nothing to do with this situation; it was just a coincidence.

 

To resolve we needed to restart the cluster services across the whole cluster.

 

First, turn off HAD

hastop -all -force

 

Then on each of the nodes run

gabconfig -U (unloads GAB)

lltconfig -U (unloads LLT)

 

Then reload LLT, GAB and HAD. On the first server, GAB needs to be seeded manually because no other nodes are running yet. So on the first node do:

lltconfig -c

gabconfig -cx

Check the output of gabconfig -a to confirm port a membership

Then on all the other servers in the cluster do the following

lltconfig -c

sh /etc/gabtab

gabconfig -a (make sure port a shows the node)

Once each server is reporting correctly in gabconfig, start HAD on each server

hastart

Then check gabconfig -a for port h status
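Pulled together, the restart sequence reads as the sketch below. This is my consolidation of the steps above, not a supported script; RUN defaults to echo so it only prints the commands until you clear it.

```shell
#!/bin/sh
# Consolidated sketch of the full-cluster restart described above.
# RUN=echo (the default) prints the commands; set RUN= (empty) to execute.
RUN=${RUN:-echo}

# Once, from any node: stop HAD cluster-wide but leave applications up
stop_had_everywhere() {
    $RUN hastop -all -force
}

# Then on each node: unload and reload GAB and LLT
reload_llt() {
    $RUN gabconfig -U    # unload GAB
    $RUN lltconfig -U    # unload LLT
    $RUN lltconfig -c    # reload LLT
}

# First node only: seed GAB regardless of how many nodes are up
seed_first_node() {
    $RUN gabconfig -cx
    $RUN gabconfig -a    # confirm port a membership before continuing
    $RUN hastart
}

# Every other node: normal seeded start via gabtab
start_other_node() {
    $RUN sh /etc/gabtab
    $RUN gabconfig -a    # make sure port a shows this node
    $RUN hastart
}

stop_had_everywhere
reload_llt
seed_first_node
```

After each hastart, gabconfig -a should eventually show the node in both port a and port h membership.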