All cluster file systems go offline unexpectedly.

Article:TECH169533  |  Created: 2011-09-14  |  Updated: 2012-07-28  |  Article URL http://www.symantec.com/docs/TECH169533
Article Type
Technical Solution


Environment

Issue



During this event, you had six resource monitor timeouts. The affected resources were:

cvm.vxconfigd, cfgnic, datadg, ocrdg, cvm_clus, and vxfsckd.

Whenever a system resource is being used to its limits, you will see multiple VCS resource timeouts of this type in a short period of time. Each VCS resource type has its own agent, and each agent has monitor, online, offline, and clean entry points. These are just UNIX processes doing interprocess communication, and they vie for system resources like any other UNIX process. When the system is running at its limits, that contention causes the timeouts you saw.
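For illustration, the VCS engine (had) and the per-type agent processes can be seen with ordinary UNIX tools; on most installations the agent binaries live under /opt/VRTSvcs/bin/<Type>/ (exact paths and process names vary by platform and VCS version):

# ps -ef | egrep 'had|Agent'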

The timeouts themselves are not the issue; the problem starts when the timeouts cause the clean entry point to be called, which is essentially a forced offline by VCS. Every VCS resource type has tunables that can be adjusted to increase the amount of time VCS will wait before calling clean; the relevant values for this case, and the commands to check and change them, are shown below.

In your case, clean was called on the CVMVolDg resource datadg, which controls all of your shared volumes and file systems. Calling clean forced the resource offline, and because all of the CFSMount resources depend on it, every cluster file system went offline with it. This is the root cause of your issue.
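The dependency is easy to confirm from the running configuration with hares -dep, which lists each parent/child resource pair (datadg is the resource name from this cluster; on another cluster, substitute your own CVMVolDg resource):

# hares -dep datadg

Every CFSMount resource listed as a parent of datadg goes offline when datadg is forced offline.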

The best fix is to lighten the load on these cluster nodes so that CPU is not driven to its limit, as this prevents the timeouts in the first place. If this can't be done, then I would recommend the tuning shown below for the CVMVolDg resource type.
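Before tuning, it is worth confirming what is consuming CPU. On a Solaris host (as the syslog entries below suggest), for example:

# vmstat 5
# prstat -s cpu

vmstat shows overall CPU saturation over five-second intervals, and prstat -s cpu lists processes sorted by CPU usage. Equivalent tools exist on other platforms.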

FaultOnMonitorTimeouts - This tunable defines the number of consecutive monitor timeouts that can occur before clean is called.

Current value - 4
Recommended value - 8

MonitorTimeout - This tunable defines how long, in seconds, the agent waits for the monitor entry point to complete before declaring it timed out.

Current value - 60
Recommended value - 120

ToleranceLimit - This tunable defines how many times the monitor can report the resource offline, when it is expected to be online, before the agent declares the resource faulted.

Current value - 0
Recommended value - 4
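Before changing anything, you can confirm the current values with hatype -value (a standard VCS command; the values shown are the current ones reported above):

# hatype -value CVMVolDg FaultOnMonitorTimeouts
4
# hatype -value CVMVolDg MonitorTimeout
60
# hatype -value CVMVolDg ToleranceLimit
0

As a rough lower bound, the recommended values mean clean is not called until 8 consecutive monitor timeouts of 120 seconds each, about 16 minutes of sustained overload, versus 4 timeouts of 60 seconds (about 4 minutes) with the current values. The agent's MonitorInterval adds further time between cycles.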


Error



Aug 25 17:28:55 rwmq903d Had[19225]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on rwmq903d is 99%
Aug 25 17:30:25 rwmq903d Had[19225]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on rwmq903d is 97%
Aug 25 17:45:55 rwmq903d Had[19225]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on rwmq903d is 94%

Aug 25 18:06:21 rwmq903d AgentFramework[12301]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(27) Resource(ocrdg) - monitor procedure did not complete within the expected time.
Aug 25 18:06:21 rwmq903d AgentFramework[12301]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 Thread(26) Resource(datadg) - monitor procedure did not complete within the expected time.
Aug 25 18:06:21 rwmq903d Had[19225]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 (rwmq903d) Resource(ocrdg) - monitor procedure did not complete within the expected time.
Aug 25 18:06:21 rwmq903d Had[19225]: [ID 702911 daemon.notice] VCS ERROR V-16-1-13027 (rwmq903d) Resource(datadg) - monitor procedure did not complete within the expected time.


Cause



The systems were very busy, and this load caused multiple CVMVolDg resource monitor timeouts at the same time.


Solution



To change these values, please follow this procedure:

# haconf -makerw
# hatype -modify CVMVolDg FaultOnMonitorTimeouts 8
# hatype -modify CVMVolDg MonitorTimeout 120
# hatype -modify CVMVolDg ToleranceLimit 4
# haconf -dump -makero
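
The new settings can then be verified; for example:

# hatype -display CVMVolDg | egrep 'FaultOnMonitorTimeouts|MonitorTimeout|ToleranceLimit'

The output should show 8, 120, and 4 for the three attributes. Note that hatype -modify changes the type-level defaults, so the new timeouts apply to every CVMVolDg resource in the cluster.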



