Video Screencast Help

WARNING V-16-2-13139, ERROR V-16-2-13011, ERROR V-16-2-13063, ERROR V-16-2-13006

Created: 18 Feb 2013 • Updated: 07 Mar 2013 | 10 comments
Zahid.Haseeb's picture
This issue has been solved. See solution.

Environment

OS = SOlaris 10

sfha = 5.1 SP1

 

2013/02/19 02:20:55 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (3)
2013/02/19 02:20:56 VCS ERROR V-16-2-13011 Thread(4) Resource(Mount): offline procedure did not complete within the expected time.
2013/02/19 02:20:56 VCS ERROR V-16-2-13063 Thread(4) Agent is calling clean for resource(Mount) because offline did not complete within the expected time.
2013/02/19 02:21:55 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (4)
2013/02/19 02:21:57 VCS ERROR V-16-2-13006 Thread(5) Resource(Mount): clean procedure did not complete within the expected time.
2013/02/19 02:23:57 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (5)
2013/02/19 02:23:57 VCS ERROR V-16-10001-5630 Mount:Mount:monitor:The file system on /dev/vx/dsk/XXX-DG/volphx is mounted on /XXXX but not accessible. You may need to manually unmount the file system.
2013/02/19 02:23:57 VCS ERROR V-16-2-13027 Thread(6) Resource(Mount) - monitor procedure did not complete within the expected time.
2013/02/19 02:23:57 VCS ERROR V-16-2-13077 Thread(6) Agent is unable to offline resource(Mount). Administrative intervention may be required.
2013/02/19 02:24:57 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (6)
2013/02/19 02:25:58 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (7)
2013/02/19 02:25:58 VCS ERROR V-16-10001-5630 Mount:Mount:monitor:The file system on /dev/vx/dsk/XXX-DG/volphx is mounted on /XXXX but not accessible. You may need to manually unmount the file system.

 

 

Engine LOG

 

Feb 19 00:03:21 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on SEC-NODE is 100%

Feb 19 01:57:31 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13011 Thread(4) Resource(APP): offline procedure did not complete within the expected time.

Feb 19 01:57:31 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13063 Thread(4) Agent is calling clean for resource(APP) because offline did not complete within the expected time.

Feb 19 01:57:31 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13063 (SEC-NODE) Agent is calling clean for resource(APP) because offline did not complete within the expected time.

Feb 19 01:57:32 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(4) Resource(APP) - clean completed successfully.

Feb 19 02:02:51 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13012 Thread(5) Resource(APP): online procedure did not complete within the expected time.

Feb 19 02:02:51 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13065 Thread(5) Agent is calling clean for resource(APP) because online did not complete within the expected time.

Feb 19 02:02:51 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13065 (SEC-NODE) Agent is calling clean for resource(APP) because online did not complete within the expected time.

Feb 19 02:02:52 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(5) Resource(APP) - clean completed successfully.

Feb 19 02:02:52 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13071 Thread(5) Resource(APP): reached OnlineRetryLimit(0).

Feb 19 02:02:53 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10303 Resource APP (Owner: Unspecified, Group: SG) is FAULTED (timed out) on sys SEC-NODE

Feb 19 02:03:20 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on SEC-NODE is 100%

Feb 19 02:08:23 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13012 Thread(6) Resource(APP): online procedure did not complete within the expected time.

Feb 19 02:08:23 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13065 Thread(6) Agent is calling clean for resource(APP) because online did not complete within the expected time.

Feb 19 02:08:23 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13065 (SEC-NODE) Agent is calling clean for resource(APP) because online did not complete within the expected time.

Feb 19 02:08:24 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(6) Resource(APP) - clean completed successfully.

Feb 19 02:08:24 SEC-NODE AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13071 Thread(6) Resource(APP): reached OnlineRetryLimit(0).

Feb 19 02:08:25 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-1-10303 Resource APP (Owner: Unspecified, Group: SG) is FAULTED (timed out) on sys SEC-NODE

Feb 19 02:20:56 SEC-NODE AgentFramework[4120]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13011 Thread(4) Resource(Mount): offline procedure did not complete within the expected time.

Feb 19 02:20:56 SEC-NODE AgentFramework[4120]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13063 Thread(4) Agent is calling clean for resource(Mount) because offline did not complete within the expected time.

Feb 19 02:20:56 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13063 (SEC-NODE) Agent is calling clean for resource(Mount) because offline did not complete within the expected time.

Feb 19 02:21:57 SEC-NODE Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13006 (SEC-NODE) Resource(Mount): clean procedure did not complete within the expected time.

Feb 19 02:21:57 SEC-NODE AgentFramework[4120]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13006 Thread(5) Resource(Mount): clean procedure did not complete within the expected time.

Feb 19 02:23:57 SEC-NODE AgentFramework[4120]: [ID 702911 daemon.notice] VCS ERROR V-16-10001-5630 Mount:Mount:monitor:The file system on /dev/vx/dsk/DISKGROUP/VOLUME is mounted on /MountPoint but not accessible. You may need to manually unmount the file system.

Discussion Filed Under:

Comments 10 CommentsJump to latest comment

Zahid.Haseeb's picture

This happen twice this day that I was not able to go under a particular directory ( which is a SAN storage connected to my Cluster Node(s)  ) ... even I was not able to execute a script exist on the mouted SAN Storage. I tried to offline the Service Group but the Mount resource Got stuck both times. I finally restarted the Cluster Nodes both times when the incident occured and problem resolved.

Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb

zahidhaseeb.wordpress.com

mikebounds's picture

This is a probably a problem with the SAN, so you couldn't write to the LUNs, so VCS is unable to umount as you need to write to the filesystem meta data to mark the filesystem as umounted and so VCS says "not accessible.

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

Zahid.Haseeb's picture

Thanks Mark for your kind information.

Is that also possible that we can detect the problem why the VCS tried to failover and tried to unmount the Storage ? any clue through which I can come to know the reason to failing over like the storage had IO problem thats why the Node is failing over and got stuck at unmount stage.

Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb

zahidhaseeb.wordpress.com

mikebounds's picture

You need to look earlier in the log to see what triggered the failover - possiblly it was the application if this was trying to write to storage.

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

Zahid.Haseeb's picture

Application_A log

2013/02/19 01:57:30 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (3)
2013/02/19 01:57:31 VCS ERROR V-16-2-13011 Thread(4) Resource(APP): offline procedure did not complete within the expected time.

 

Mount_A log

2013/02/19 02:20:55 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (3)
2013/02/19 02:20:56 VCS ERROR V-16-2-13011 Thread(4) Resource(Mount): offline procedure did not complete within the expected time.
2013/02/19 02:20:56 VCS ERROR V-16-2-13063 Thread(4) Agent is calling clean for resource(Mount) because offline did not complete within the expected time.

 

engine_A log

2013/02/19 01:52:18 VCS NOTICE V-16-1-10300 Initiating Offline of Resource VirtualIP (Owner: Unspecified, Group: sg) on System SEC-XXX
2013/02/19 01:52:21 VCS INFO V-16-1-10305 Resource VirtualIP (Owner: Unspecified, Group: sg) is offline on SEC-XXX (VCS initiated)
2013/02/19 01:52:30 VCS NOTICE V-16-1-10300 Initiating Offline of Resource APP (Owner: Unspecified, Group: phx-sg) on System SEC-XXX
2013/02/19 01:57:31 VCS INFO V-16-2-13003 (SEC-XXX) Resource(APP): Output of the timed out operation (offline)
Stopping APPLICATION ...
2013/02/19 01:57:31 VCS WARNING V-16-2-13011 (SEC-XXX) Resource(APP): offline procedure did not complete within the expected time.
2013/02/19 01:57:31 VCS ERROR V-16-2-13063 (SEC-XXX) Agent is calling clean for resource(APP) because offline did not complete within the expected time.
2013/02/19 01:57:32 VCS INFO V-16-2-13068 (SEC-XXX) Resource(APP) - clean completed successfully.
2013/02/19 01:57:33 VCS INFO V-16-1-10305 Resource APP (Owner: Unspecified, Group: phx-sg) is offline on SEC-XXX (VCS initiated)
2013/02/19 01:57:50 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group sg on all nodes
2013/02/19 01:57:50 VCS NOTICE V-16-1-10301 Initiating Online of Resource APP (Owner: Unspecified, Group: sg) on System SEC-XXX

 

Messages OS Logs

Feb 19 01:57:31 SEC-XXX AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13011 Thread(4) Resource(APP): offline procedure did not complete within the expected time.
Feb 19 01:57:31 SEC-XXX AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13063 Thread(4) Agent is calling clean for resource(PHX-APP) because offline did not complete within the expected time.
Feb 19 01:57:31 SEC-XXX Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13063 (SEC-XXX) Agent is calling clean for resource(APP) because offline did not complete within the expected time.
Feb 19 01:57:32 SEC-XXX AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(4) Resource(APP) - clean completed successfully.
Feb 19 02:02:51 SEC-XXX AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13012 Thread(5) Resource(APP): online procedure did not complete within the expected time.
Feb 19 02:02:51 SEC-XXX AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13065 Thread(5) Agent is calling clean for resource(APP) because online did not complete within the expected time.
Feb 19 02:02:51 SEC-XXX Had[3789]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13065 (SEC-XXX) Agent is calling clean for resource(APP) because online did not complete within the expected time.

=======================================================

The only thing which I can find is below:

 

VCS ERROR V-16-2-13011

https://sort.symantec.com/ecls/umi/V-16-2-13011

EXTRACT:

This message displays when an offline  procedure does not complete on time.  An offline procedure timeout can occur when the system is overloaded, is busy processing a system call, or is handling a large number of resources.

Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb

zahidhaseeb.wordpress.com

mikebounds's picture

You need to look ealier than this.  The issue was probably the SAN or the CPU as you see "CPU usage on SEC-NODE is 100%" and the sequence of events is probably something like:

Agent is calling clean for resource(X) because 4 successive invocations of the monitor procedure did not complete within the expected time
OR
Agent is calling clean for resource X because the resource became OFFLINE unexpectedly, on its own.

Then you would see "Initiating Offline of Resource ..." for several resources and some resources they can't offline and so you see "offline procedure did not complete within the expected time"

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

g_lee's picture

Zahid,

2013/02/19 01:52:18 VCS NOTICE V-16-1-10300 Initiating Offline of Resource VirtualIP (Owner: Unspecified, Group: sg) on System SEC-XXX
2013/02/19 01:52:21 VCS INFO V-16-1-10305 Resource VirtualIP (Owner: Unspecified, Group: sg) is offline on SEC-XXX (VCS initiated)
2013/02/19 01:52:30 VCS NOTICE V-16-1-10300 Initiating Offline of Resource APP (Owner: Unspecified, Group: phx-sg) on System SEC-XXX
2013/02/19 01:57:31 VCS INFO V-16-2-13003 (SEC-XXX) Resource(APP): Output of the timed out operation (offline)
Stopping APPLICATION ...
2013/02/19 01:57:31 VCS WARNING V-16-2-13011 (SEC-XXX) Resource(APP): offline procedure did not complete within the expected time.

01:52:18 - VCS initiates offline of VirtualIP resource?

Was this offline initiated manually (ie: by adminstrator/you), or was VCS offlining the resource/group due to a resource fault?

Are the APP and VirtualIP in different groups (phx-sg? sg?), or are they in the same group (hard to tell because server/group details appear to have been edited inconsistently to obscure server-specifics).

It looks like VCS was offlining the group(s), and it timed out when offlining the Application (as you posted from https://sort.symantec.com/ecls/umi/V-16-2-13011, this may have been because the system was overloaded, or it could be that the application took longer than 5 minutes to shutdown. Again, hard to tell based on the (lack of) detail you have provided so far).

If you want to see why VCS was offlining the groups, you need to look at the logs prior to 01:52, as by that time VCS was already offlining resources.

If this post has helped you, please vote or mark as solution

Zahid.Haseeb's picture

Thanks guys for your input.

 

@g_less

- 01:52:18 - VCS initiates offline of VirtualIP resource?

(No one did manual offline so should be done by VCS)

- Are the APP and VirtualIP in different groups

(There is only one Group..sorry for improper edition)

 

you need to look at the logs prior to 01:52, (not able to see any incident prior to that time which I feel did trouble except HostMonitor_A log)

 

- Application_A log

2013/02/01 14:28:24 VCS INFO V-16-2-13805 Thread(1) (imf_init) entry point completed with return status (1)
2013/02/19 01:57:30 VCS WARNING V-16-2-13139 Thread(2) Canceling thread (3)

 

- engine_A log

2013/02/19 00:03:21 VCS CRITICAL V-16-1-50086 CPU usage on SEC-XXX is 100%
2013/02/19 01:52:18 VCS NOTICE V-16-1-10300 Initiating Offline of Resource VirtualIP (Owner: Unspecified, Group: sg) on System SEC-XXX

 

- messages OS logs

Feb 19 00:03:21 SEC-XXX Had[3789]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on SEC-XXX is 100%
Feb 19 01:57:31 SEC-XXX AgentFramework[4113]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13011 Thread(4) Resource(APP): offline procedure did not complete within the expected time.

- HostMonitor_A log

2013/02/19 01:50:49 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 18%.
2013/02/19 01:51:19 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 18%.
2013/02/19 01:51:49 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 18%.
2013/02/19 01:52:19 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 18%.
2013/02/19 01:52:49 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 17%.

Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb

zahidhaseeb.wordpress.com

mikebounds's picture

As Grace and I have said, to see why a resource offlines you need to provide logs before hand - you only need to provide the Engine log for this - possible reasons I can think of are:

  1. Resource fault due to monitor returning offline
  2. Resource fault due to monitor timeout
  3. Someone manually offlined the group or resource
  4. Someone manually ran hastop or issued a  (controlled reboot (where rc scripts were called)
  5. 3 or 4 occured in a trigger script (example in the cpuussage trigger)

As a starting point, I would suggest you search for the word "clean" in the engine log prior to you seeing " Initiating Offline of Resource" message 

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

SOLUTION
g_lee's picture

Agree with Mike's post above.

One additional point - check the engine_A.log on the other node(s) as this may show additional events / may shed some more light (eg: Someone manually offlined the group or resource - this may be recorded from the node where it was fired, etc)

If this post has helped you, please vote or mark as solution