CFS umount abnormally
Created: 10 Dec 2012
Hi everyone!
I have ten SFCFS nodes, all on the same AIX version, running SFRAC 5.1SP1RP3. The cluster has two CFS resources that run on all nodes. Recently I hit a very strange problem with service group sg-cfs-mount01 on one specific node (db2): the resource cfsmount1, which belongs to sg-cfs-mount01, went offline abnormally. I checked engine_A.log, which shows an I/O test failure reported by the CVMVolDg agent. I also checked the system logs, dmpevent.log, etc., but found no related errors. The problem occurs only on node db2 with sg-cfs-mount01; sg-cfs-mount01 on the other nodes and sg-cfs-mount02 on all nodes (including db2) work normally, which is what I find strange.
I have attached engine_A.log, CFSMount_A.log, CVMVolDg_A.log, etc. in logs.tar.
Thanks in advance.
Comments
I have seen CVMVolDg resources time out before. That happens on busy systems, where the "dd" read on the volumes in the "CVMVolume" attribute does not have enough time to complete, and it gets worse when several volumes are specified in "CVMVolume". However, your resource is not experiencing timeouts; rather, the "dd" read itself seems to be failing. See this extract from your engine log:
VCS ERROR V-16-10011-1097 (db2) CVMVolDg:cvmvoldg1:monitor:Device /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 could not be read at offset 0
I can think of 3 causes:
The dd command the agent is running is:
dd if=/dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001 of=/dev/null count=1 skip=$_cdi_offset bs=1024
where $_cdi_offset is randomly generated, but this may be failing, as all the errors say they are trying to read from offset 0.
So you could try running independent "dd" in a cron every minute to see if these fail to try and determine where the issue is.
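A minimal sketch of such a cron check, assuming a hypothetical script /tmp/dd_check.sh and log file /tmp/dd_check.log (neither is part of the agent). It reads one 1 KB block at offset 0 from the raw device, which is the read the error message reports as failing, and appends a timestamped OK or FAIL line:

```shell
#!/bin/sh
# dd_check DEVICE LOGFILE
# Read a single 1 KB block at offset 0 from DEVICE and append a
# timestamped OK/FAIL line to LOGFILE. Returns 0 on success, 1 on failure.
# The function name and log format are illustrative, not taken from the agent.
dd_check() {
    dev=$1
    log=$2
    if dd if="$dev" of=/dev/null bs=1024 count=1 skip=0 2>/dev/null; then
        status=OK
    else
        status=FAIL
    fi
    echo "$(date '+%Y/%m/%d %H:%M:%S') $dev $status" >> "$log"
    [ "$status" = OK ]
}

# Run the check only when a device is passed on the command line, e.g. from cron:
#   * * * * * /tmp/dd_check.sh /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol001
if [ $# -ge 1 ]; then
    dd_check "$1" "${2:-/tmp/dd_check.log}"
fi
```

Any FAIL lines can then be compared against the times of the monitor errors in engine_A.log.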
As a workaround for your current issue, you could set a non-zero ToleranceLimit on the CVMVolDg type so that a certain number of monitor failures are ignored. The downside is that if there is a real failure, failover could be delayed. You can set ToleranceLimit to 2, for example, using:
hatype -modify CVMVolDg ToleranceLimit 2
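For completeness, here is a hedged sketch of the full sequence around that change: the configuration has to be made writable first and saved afterwards, as with the LogDbg change elsewhere in this thread. The run and set_tolerance helpers and the DRY_RUN variable are hypothetical names introduced here so the commands can be previewed before they touch the cluster:

```shell
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# set_tolerance N: make the configuration writable, set ToleranceLimit on the
# CVMVolDg type, save the configuration read-only, then display the new value.
set_tolerance() {
    run haconf -makerw &&
    run hatype -modify CVMVolDg ToleranceLimit "$1" &&
    run haconf -dump -makero &&
    run hatype -display CVMVolDg -attribute ToleranceLimit
}
```

Setting DRY_RUN=1 and calling set_tolerance 2 only prints the four commands; without DRY_RUN they are executed in order and the sequence stops at the first one that fails.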
Mike
UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows
If this post has helped you, please vote or mark as solution
Thanks Mike.
Today I ran some tests with the dd command, and I also enabled debug mode for the CVMVolDg agent and the HAD daemon, as advised by Symantec Support in China. I think access to the storage hardware layer from db2 is fine, because the volume /dev/vx/rdsk/yyzc-cfsdg01/cfsdg01-vol002 works well on db2. As I mentioned above, the problematic volume cfsdg01-vol001 works fine on the other nine nodes, so I don't think it is a physical problem. The test methods and logs follow:
Test for cfsdg01-vol001
=====================================================================
#more /tmp/ddvol
while [ 1 ]; do
/bin/sleep 1
date >> /tmp/ddvol.out
dd if=/dev/vx/dsk/yyzc-cfsdg01/cfsdg01-vol001 of=/dev/null count=1024 bs=512 >> /tmp/ddvol.out 2>&1
done
#/tmp/ddvol &
Test for a Hard Disk
=====================================================================
#more /tmp/dddisk
while [ 1 ]; do
/bin/sleep 1
date >> /tmp/dd.disk
dd if=/dev/hdisk98 of=/dev/null count=1024 bs=512 skip=65791 >> /tmp/dd.disk 2>&1
done
#/tmp/dddisk &
Enabling debug mode for the CVMVolDg agent and HAD daemon
====================================================================
# haconf -makerw
# /opt/VRTSvcs/bin/hatype -modify CVMVolDg LogDbg -add DBG_1 DBG_2 DBG_3 DBG_4 DBG_5 DBG_AGDEBUG DBG_AGINFO DBG_AGTRACE
# /opt/VRTSvcs/bin/haconf -dump -makero
# haconf -makerw
# halog -addtags DBG_POLICY
# halog -addtags DBG_TRACE
# halog -addtags DBG_AGTRACE
# halog -addtags DBG_AGINFO
# halog -addtags DBG_AGDEBUG
# haconf -dump -makero
Finally, the symptom appeared again at 2012/12/11 11:46:16.
I have attached all the log output from these tests, and I have also uploaded VRTSexplorer logs in the hope that they are useful for analyzing my problem.
THX.
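To line up the agent failures with the dd test output, a small helper along these lines could list when the monitor reported the read error. This is only a sketch: the function name monitor_failures is made up here, and it assumes each engine log line starts with a "YYYY/MM/DD HH:MM:SS" timestamp, as the time quoted above suggests:

```shell
#!/bin/sh
# monitor_failures LOGFILE
# Print the date and time of every CVMVolDg I/O-test failure
# (message ID V-16-10011-1097) found in LOGFILE.
# Assumes each log line begins with "YYYY/MM/DD HH:MM:SS".
monitor_failures() {
    grep 'V-16-10011-1097' "$1" | awk '{ print $1, $2 }'
}
```

For example, monitor_failures /var/VRTSvcs/log/engine_A.log prints one timestamp per failure, which can be checked against the date lines written to /tmp/ddvol.out and /tmp/dd.disk.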
As the dd's in your test were successful, there must be an issue with the agent. The agent code is located in /opt/VRTSvcs/bin/CVMVolDg, where the file "monitor" runs:
The function cvmvoldg_do_iotest comes from the file cvmvoldg.lib, which says: