Drives going into AVR mode after zoning changes on SAN

Updated: 21 May 2010 | 8 comments
Uncle_Bob
I am wondering if anyone else has experienced problems with drives going down and showing as AVR after zoning changes on the SAN. We have an EMC CX500 SAN, and the NetBackup environment consists of 1 master server, 2 media servers and 8 SAN media servers. These are sited evenly across two data centre sites, served by two robotic tape libraries on either site with 8 LTO3 drives in total. We are using SSO, and all kit is connected by fibre with QLogic HBAs.
 
Recently, after making zoning changes so that new servers have visibility of the tape libraries and cross-site zones, we have noticed a problem with NetBackup where drives have gone down or missing and show as AVR when the NBU services have been stopped and started. The problem does not appear immediately after the zoning changes are committed, and jobs do complete overnight, but it becomes evident when running new backups. Jobs show as running and then mounting, but hang indefinitely at that point. We have undone the zoning changes, and this has not necessarily resolved the issues; the kit in the SSO environment has been rebooted and it is still no better.
 
Drives continuously show as AVR, and the device manager service on the master struggles and fails to start, showing device active for disk only. Disk-based jobs are unaffected, and after a lengthy wait the device manager service does start on the master. When running the scan command, drives or changers are missing even though the operating system can see everything correctly in Device Manager. All systems run Windows Server 2003 SP1 with NBU 6.0 MP4. Symantec have assisted, but nothing conclusive has been identified as the root cause; after 24-48 hrs the system just seems to come back up as normal after running the device configuration wizard. Even drives that it reports as missing in the device configuration wizard show up eventually. I am wondering whether certain servers, which we haven't been able to reboot because they are high-availability production systems, may have some odd condition on them that causes the rest of NBU to hang. It's as if something times out, after which everything starts working. No other devices that run off the SAN or fibre are affected, just NBU. I am wondering whether an HBA SCSI reservation can cause this, or whether it is to do with a reset that may occur on the fibre switches/routers when changes are committed.
 
In the recent incident we also came across four tapes stuck in drives, even though no job failures had been reported and nothing was showing as active. It's as if they had been caught at the end of a backup at the time the zoning changes were made. Resetting the drives and rebooting the libraries did not release the tapes; in the end, physically pressing the eject button on the drive sleds resolved the problem.
 
Everything is working fine at present, but I would like to know if anyone has come across similar issues and how they managed to resolve them.

Comments

Philip Drew, 21 Sep 2007

Have you got SSO_SCAN_ABILITY settings set up on your master, media and all SAN media servers?

Sounds similar to something that happened to us: we had no SSO_SCAN_ABILITY settings, and one of our SAN media servers was constantly trying to take control of all the libraries and drives.

We were recommended to use SSO_SCAN_ABILITY = 9 in the Master server's C:\program files\veritas\volmgr\vm.conf.

and SSO_SCAN_ABILITY = 5 in our non-SAN Media servers

and SSO_SCAN_ABILITY = 0 in our SAN Media servers to prevent this from happening.
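
As a rough illustration of the settings above, here is a minimal Python sketch that reads an SSO_SCAN_ABILITY entry out of vm.conf text and picks the host that would be preferred as scan host. The function names, host names and sample file contents are hypothetical. The default of 5 and the "highest non-zero value wins" rule are simplifying assumptions for illustration, not a description of how NetBackup actually elects a scan host.

```python
import re

# Assumed default when vm.conf has no SSO_SCAN_ABILITY entry.
DEFAULT_SCAN_ABILITY = 5

# Matches a line such as "SSO_SCAN_ABILITY = 9" (single digit, 0 to 9).
_ENTRY = re.compile(r"^\s*SSO_SCAN_ABILITY\s*=\s*(\d)\s*$")


def scan_ability(vm_conf_text):
    """Return the SSO_SCAN_ABILITY value found in vm.conf text, or the default."""
    for line in vm_conf_text.splitlines():
        match = _ENTRY.match(line)
        if match:
            return int(match.group(1))
    return DEFAULT_SCAN_ABILITY


def preferred_scan_host(vm_conf_by_host):
    """Pick the host with the highest non-zero value (simplified model).

    Hosts set to 0 are treated as never being scan host. Ties are broken
    alphabetically so the result is deterministic.
    """
    eligible = {}
    for host, text in vm_conf_by_host.items():
        value = scan_ability(text)
        if value > 0:
            eligible[host] = value
    if not eligible:
        return None
    return max(sorted(eligible), key=eligible.get)


if __name__ == "__main__":
    # Hypothetical hosts mirroring the recommendation in this reply.
    confs = {
        "master": "SSO_SCAN_ABILITY = 9\n",
        "media1": "SSO_SCAN_ABILITY = 5\n",
        "sanmedia1": "SSO_SCAN_ABILITY = 0\n",
    }
    print(preferred_scan_host(confs))
```

With these sample values the master is preferred and the SAN media server is never considered, which is the effect the 9/5/0 split is meant to have.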

Randy Samora, 21 Sep 2007

Philip,

I think that's the setting I have been looking for. Will it get the Event Viewer to quit reporting "Remote scan failed on host SERVERA, drive HPUltrium3-SCSI2, Host is not the scan host for this shared drive"? My Master is only a master and does not serve as a media server. It can't even see the library. Do I still want the Master to be the scan host, or should the media server controlling the robot be the scan host?

 

Bharat,

If you are using QLogic HBAs, have you run SANsurfer, gone to the advanced settings and disabled "Target Reset"? That caused me major havoc before I made that change.

Philip Drew, 21 Sep 2007


I would then set SSO_SCAN_ABILITY = 0 on the master,

SSO_SCAN_ABILITY = 9 on the server you want to control the libraries (if it does both),

and SSO_SCAN_ABILITY = 5 on the SAN media servers.




Uncle_Bob, 21 Sep 2007

Philip
 
Thanks for the information. I have already got this set on the master and media servers and have double-checked. If the value isn't specified, NetBackup sets a default, which seems to be 5.
 
You can check this by going to C:\Program Files\Veritas\Netbackup\Bin\Admincmd
 
Then issue the following command:
 
nbemmcmd -listhosts -verbose

Uncle_Bob, 21 Sep 2007

Randy, I have checked the 'Target Reset' option on the QLogic HBA. It is currently enabled, and I think this will be the case across the board on all the servers within our NetBackup SSO environment. When you say you had a lot of problems because of this, can you be more specific about what the symptoms were?

Randy Samora, 24 Sep 2007

Bharat,
 
Before I made the change, I was literally spending hours every night restarting jobs that failed with some kind of media write error; typically Status 84.  I was also seeing a lot of Status 134's because NetBackup would try to send a job to what looked like an available drive but by the time the job was queued, the drive was no longer available.  I still see that periodically but it's expected with SSO.  What was different before I made the change was that I was seeing hundreds of 134's and/or 84's every night.  I rebuilt my SSO configuration and that would seem to make everything run fine for a day or two and then the nightmares would begin again.  If I looked at Device Monitor, usually 2 or 3 or more drives were AVR or PEND or sometimes DOWN.  The drives would never remain 100% UP for more than one night.
 
That setting sends resets to the tape drives. With SSO, you're basically trying to fool each Media Server into thinking he owns the robot and he's the only one using it. If SERVERA tries to write to a tape drive and SERVERB is already using it, SERVERA will never get a response. Since SERVERA thinks that's his tape drive, SERVERA assumes something is wrong with the drive because the drive isn't responding to the request. SERVERA sends a "reset" to the drive to try to get it back online. In the meantime, SERVERB keeps writing data to the drive until suddenly SERVERB no longer sees the drive because SERVERA sent the "reset" command, and the job on SERVERB fails.
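
The sequence described above can be sketched as a toy Python model. It is only an illustration of the reasoning in this reply: the TapeDrive class and run_scenario function are hypothetical names, and nothing here models real SCSI or HBA behaviour beyond "a host that gets no response sends a reset, and the reset interrupts whoever is writing".

```python
class TapeDrive:
    """Toy shared tape drive: one writer at a time, a reset drops the writer."""

    def __init__(self):
        self.writer = None  # host currently writing, if any

    def request_write(self, host):
        """Return True if the host gets the drive, False if it gets no response."""
        if self.writer is None:
            self.writer = host
            return True
        return self.writer == host

    def reset(self):
        """Reset the drive and return the host whose job was interrupted."""
        interrupted = self.writer
        self.writer = None
        return interrupted


def run_scenario(target_reset_enabled):
    """Replay the SERVERA/SERVERB sequence and return the hosts whose jobs failed."""
    drive = TapeDrive()
    failed_jobs = []

    # SERVERB is already writing to the shared drive.
    drive.request_write("SERVERB")

    # SERVERA believes the drive is its own and gets no response.
    if not drive.request_write("SERVERA"):
        if target_reset_enabled:
            # SERVERA assumes the drive is faulty and resets it,
            # which kills the job SERVERB had in progress.
            victim = drive.reset()
            if victim is not None:
                failed_jobs.append(victim)

    return failed_jobs


if __name__ == "__main__":
    print("Target Reset enabled: ", run_scenario(True))
    print("Target Reset disabled:", run_scenario(False))
```

In this model, enabling the reset makes SERVERB's job fail even though SERVERB did nothing wrong, while disabling it leaves SERVERB's job untouched and SERVERA simply does not get the drive.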
 
This is no exaggeration: I got over 3000 emails from NetBackup one night before I made the setting change. Each email was from a failed attempt on a backup job. Most nights weren't that bad, but that was probably the worst night. We are primarily an HP shop, so HP came in and made sure of 2 things before they left:
1. The HBA setting was changed, and
2. Only servers that were supposed to talk to the tape drive were zoned into the same zone as the tape drive.  No other server could even see the library when we were done unless it was a Master or Media server.
 
I hope this helps.  If you have more questions, please ask.  I spent weeks trying to get this resolved and I spent a lot of time with HP and with Veritas.

Stumpr2, 24 Sep 2007

Randy,
You may have just helped me :-)
Is there an equivalent “Target Reset” option on Emulex?
THANKS!
 
 




sdw303, 26 Sep 2007

Also keen to know the equivalent for Emulex LP8000.  I've just checked parameters of the HBAs using "lputilnt" (Windows 2000 master) and I can't see an equivalent.