Drives going into AVR mode after zoning changes on SAN
Updated: 21 May 2010 | 8 comments
I am wondering if anyone else has experienced problems with drives going down and showing as AVR mode after zoning changes on the SAN. We have an EMC CX500 SAN, and the NetBackup environment consists of 1 master server, 2 media servers and 8 SAN media servers. These are sited evenly across two data centre sites served by two robotic tape libraries on either site with 8 LTO3 drives in total. We are using SSO, and all kit is connected by fibre with QLogic HBAs.
Recently, after making zoning changes to give new servers visibility of the tape libraries and cross-site zones, we have noticed a problem with NetBackup: drives have gone down or missing and show as AVR once the NBU services have been stopped and started. The problem does not appear immediately after the zoning changes have been committed, and jobs do complete overnight, but it becomes evident when running new backups. Jobs show as running and then mounting, but hang indefinitely at that point. We have undone the zoning changes, but this has not necessarily resolved the issues; the kit in the SSO environment has been rebooted and things are still no better.
Drives continuously show as AVR, and the device manager service on the master struggles and fails to start, showing the device as active for disk only. The disk-based jobs are unaffected, and after a lengthy wait the device manager service does start on the master. When we run the scan command, drives or changers are reported missing even though the operating system can see everything correctly in Device Manager. All systems run Windows Server 2003 SP1 with NBU 6.0 MP4. Symantec have assisted, but nothing conclusive has been identified as the root cause; after 24-48 hours the system just seems to come back up as normal after running the Device Configuration Wizard. Even drives that it reports as missing in the Device Configuration Wizard show up eventually.
I am wondering whether certain servers that we haven't been able to reboot, because they are high-availability production systems, may have some odd condition on them that causes the rest of NBU to hang. It's as if something times out, after which everything starts working. No other devices that run off the SAN or fibre are affected, just NBU. Could an HBA SCSI reservation cause this, or is it to do with a reset that may occur on the fibre switches/routers when the changes are committed?
In the recent incident we also came across four tapes stuck in drives, even though no job failures had been reported and nothing was showing as active. It's as if they had been caught at the end of a backup at the moment the zoning changes were made. Resetting the drives and rebooting the libraries did not release the tapes; in the end, physically pressing the eject button on the drive sleds resolved the problem.
Everything is working fine at present, but I would like to know if anyone has come across similar issues and how they managed to resolve them.
Comments
Have you got SSO_SCAN_ABILITY settings set up on your master, media and all SAN media servers?
It sounds similar to something that happened to us: we had no SSO_SCAN_ABILITY settings, and one of our SAN media servers was constantly trying to take control of all the libraries and drives.
To prevent this from happening, we were recommended to use SSO_SCAN_ABILITY = 9 in the master server's C:\program files\veritas\volmgr\vm.conf, SSO_SCAN_ABILITY = 5 on our non-SAN media servers, and SSO_SCAN_ABILITY = 0 on our SAN media servers.
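As a rough illustration of the layout described above, each host carries a single SSO_SCAN_ABILITY entry in its own vm.conf. The host-role labels on the left are annotations for this listing only, not part of the file; only the entry on the right goes into vm.conf on that host. As I understand it, the value ranges from 0 to 9, a host with no entry behaves as 5, and 0 means the host is never chosen as scan host, but check that against the documentation for your NBU release.

```text
master server          ->  SSO_SCAN_ABILITY = 9
non-SAN media servers  ->  SSO_SCAN_ABILITY = 5
SAN media servers      ->  SSO_SCAN_ABILITY = 0
```

The device daemons would normally need restarting on a host before a changed vm.conf takes effect there.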
Phillip,
I think that's the setting I have been looking for. Will it get the Event Viewer to quit reporting "Remote scan failed on host SERVERA, drive HPUltrium3-SCSI2, Host is not the scan host for this shared drive." My Master is only a master and does not serve as a media server. It can't even see the library. Do I still want the Master to be the scan host or should the media server controlling the robot be the scan host?
Bharat,
If you are using QLogic HBAs, have you run SANsurfer, gone to the advanced settings and disabled "Target Reset"? That setting caused me major havoc before I made the change.
<Quote>
Randy:
Phillip,
I think that's the setting I have been looking for. Will it get the Event Viewer to quit reporting "Remote scan failed on host SERVERA, drive HPUltrium3-SCSI2, Host is not the scan host for this shared drive." My Master is only a master and does not serve as a media server. It can't even see the library. Do I still want the Master to be the scan host or should the media server controlling the robot be the scan host?
</quote>
In that case I would set SSO_SCAN_ABILITY = 0 on the master,
SSO_SCAN_ABILITY = 9 on the server you want to control the libraries (if it does both),
and SSO_SCAN_ABILITY = 5 on the SAN media servers.
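With this many servers it can help to confirm what each host actually has set before and after a change. The following is a minimal sketch, assuming you have already collected the text of each host's vm.conf by whatever means you prefer; the function name `read_scan_ability` and the host name in the sample are hypothetical, and treating the last matching entry as the effective one is an assumption of this sketch, not documented NetBackup behaviour.

```python
import re

# Matches lines such as "SSO_SCAN_ABILITY = 9"; surrounding whitespace is tolerated.
_PATTERN = re.compile(r"^\s*SSO_SCAN_ABILITY\s*=\s*(\d+)\s*$")


def read_scan_ability(vm_conf_text):
    """Return the SSO_SCAN_ABILITY value found in vm.conf text, or None if unset."""
    value = None
    for line in vm_conf_text.splitlines():
        match = _PATTERN.match(line)
        if match:
            # Assumption: if the entry appears more than once, keep the last one.
            value = int(match.group(1))
    return value


# Hypothetical vm.conf contents from a SAN media server.
sample = "SERVER = master1\nSSO_SCAN_ABILITY = 0\n"
print(read_scan_ability(sample))
```

A result of None means the file has no entry at all, which is the situation described earlier in the thread where a SAN media server kept trying to take control of the drives.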
Message Edited by Philip Drew on 09-21-2007 06:49 AM
Message Edited by Stumpr on 09-24-2007 06:25 PM