
Tape Drives Down

Updated: 07 Sep 2010 | 10 comments
Kirk_Davis

We are running NetBackup 6.5.4 with a number of media servers, libraries, and tape drives. Nearly every night, drives are unexpectedly downed. The drives remain available to the underlying OS, so the issue seems to be NBU related.

We've checked the firmware revisions of the tape drives in each library and these match.

Any ideas?

Comments

Marianne van den Berg, 05 Mar 2010

Have you been able to determine if the problem is related to specific media server(s), specific tape drive(s), or specific media-id(s)?
Check Media Logs report. Filter report to exclude Severity of type Info. The report should then display Warning and Error info.
I always add VERBOSE entry to vm.conf on all media servers. This will ensure all hardware related errors are logged to syslog on Unix servers and to Event Viewer Application log on Windows servers. Device Management service/daemon must be restarted after adding the VERBOSE entry.
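
A quick way to confirm the entry is present on each media server is to scan the file contents. The following is only a minimal sketch: the helper name `has_verbose_entry` and the sample contents are made up for illustration, and it assumes the entry is a line consisting solely of the word VERBOSE.

```python
def has_verbose_entry(vm_conf_text: str) -> bool:
    """Return True if some line of the vm.conf text is exactly VERBOSE."""
    for raw_line in vm_conf_text.splitlines():
        if raw_line.strip() == "VERBOSE":
            return True
    return False


# Illustrative contents only, not taken from a real server.
sample = "MM_SERVER_NAME = media1\nVERBOSE\n"
print(has_verbose_entry(sample))
```

On a real media server you would read the text from vm.conf first; remember that the Device Management service/daemon still has to be restarted after the entry is added.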

Also have a look at this TechNote: http://seer.entsupport.symantec.com/docs/336503.htm

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows.
Handy NBU links

rjrumfelt, 05 Mar 2010

You can attempt to run a cleaning job on the drives that are giving you issues. For me this typically does not fix a drive issue, but it's one of those things you can do first to get it out of the way.

Tom Grimes, 05 Mar 2010

What do you do to get them back up? Do the drives come back up again when you select 'up' in NBU or do they produce an error?

babyd, 05 Mar 2010

We had similar issues a couple of times, and most of the errors were resolved after a firmware upgrade of the tape drives.

In other cases I cleaned the tape drives and they worked after that. Bad media may also cause the tape drives to go down.

In another scenario, with Windows media servers, if the device drivers are in the Unknown state, the drives will go down during the backup.

rjrumfelt, 05 Mar 2010

Also, what is your OS?

If Unix, check the messages file (I know for Solaris it's located in /var/adm; not sure about other Unix variants) to see if any errors are being reported there.

If Windows, check the event logs for anything related to your tape drives.

David McMullin, 05 Mar 2010

probably more than you wanted to know...

drives go down due to errors netbackup sees, and these can be caused by a number of things...

dirty heads/tapes - you only get X number of read/write errors every 24 hours before it downs the drive. (see note from BOB at bottom)
stuck tapes - if there is a tape in the drive, amazingly enough, you cannot load another in it - this happens more than you think...
bad path to drive - if you do not have persistent bindings, the path to your drive can change - this is amusing as loading tape A in drive 1 can actually put it in drive 2, then trying to load a tape in drive 2 will fail...

You really have to check your logs...

There are firmware revisions of drives that have known issues - I found one that writes bad headers, so it will do one good backup, and when you try to reuse the tape, it cannot read the header and freezes the tape...

check this out:

Details:

Issue: Media is being frozen on the second backup attempt because the medium identifiers do not match.

Troubleshooting: This has been known to occur when using NetBackup with an IBM 3581 LTO Ultrium 2 tape drive at firmware revision 67U1.

Upon first load of new media, the drive is expected to generate and write the media information to the header area of the tape before it releases the drive to NetBackup for write operations. NetBackup then scans this information and updates the EMM (Enterprise Media Manager) database with the medium manufacturer and medium serial number.  There are certain situations where the drive is released to NetBackup before this information is written.

When NetBackup does not find this information on the tape, it writes generic information ([MedMfgr], [MediumSerialNumber]) to the EMM database before writing the first backup. After the first backup is written, the drive is then generating and writing the manufacturer and serial numbers to the media. When NetBackup uses this media for subsequent operations, the header is scanned and the new information on the media does not match what is in the EMM, therefore, NetBackup will generate a medium identifier mismatch and freeze the media.

Log Files:
<install_path>\NetBackup\logs\bptm
13:14:26.395 [3200.3450] <2> set_job_details: LOG 1178564713 16 bptm 3200 FREEZING media id 0182L1, Medium identifiers do not match
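
To see which media this affects, the bptm log can be scanned for FREEZING entries. The sketch below is only an illustration: the regular expression is inferred from the single log line quoted above, and the function name `frozen_media` is made up, so other bptm messages may need a different pattern.

```python
import re

# Pattern inferred from the one sample line in this note (an assumption).
FREEZE_RE = re.compile(r"FREEZING media id ([^,\s]+),\s*(.+)$")


def frozen_media(log_lines):
    """Return (media_id, reason) pairs for every FREEZING entry found."""
    found = []
    for line in log_lines:
        match = FREEZE_RE.search(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found


sample_lines = [
    "13:14:26.395 [3200.3450] <2> set_job_details: LOG 1178564713 16 bptm 3200 "
    "FREEZING media id 0182L1, Medium identifiers do not match",
]
for media_id, reason in frozen_media(sample_lines):
    print(media_id, "-", reason)
```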

Resolution:
In configurations where this issue is seen with NetBackup, the issue has been known to be resolved after downgrading to firmware revision 5AT0.  The latest revision release also may resolve this issue. Visit IBM's site to determine the latest firmware release available for the IBM 3581 LTO Ultrium 2 tape drive.

Check out this note from Bob - from this thread: https://www-secure.symantec.com/connect/forums/media-damage-can-nbu-select-new-tape

I have this note from a VERITAS Software NetBackup Engineer
MEDIA_ERROR_THRESHOLD:
Touch this file and add a value in the file. The default is 2 - meaning 2 media errors within TIME_WINDOW will freeze the media. Also see TIME_WINDOW. See technote 234412

and for TIME_WINDOW:
Specifies the amount of time that BPTM will look backwards in the Errors DB for problems with drives\tapes to determine what action to take. Used with MEDIA_ERROR_THRESHOLD and DRIVE_ERROR_THRESHOLD. See technote 234412

and for DRIVE_ERROR_THRESHOLD:
Touch this file and add a value in the file. The default is 2 - meaning 2 drive errors within TIME_WINDOW downs the drive. See technote 234412

Unfortunately I can no longer get to the technote.
Message was edited by: Bob Stump

NBU 7.0.1 on Solaris 10
writing to EMC 4206 VTL
duplicating to LTO2 in SL8500
(Soon to be LTO5)
using ACSLS 7.3.1

Marianne van den Berg, 05 Mar 2010

*_ERROR_THRESHOLDs and TIME_WINDOW settings are now stored in EMM database:

# nbemmcmd -listsettings -machinename <media_server>
NBEMMCMD, Version:6.5.4
The following configuration settings were found:
ALLOW_MULTIPLE_RETENTIONS_PER_MEDIA="no"
DISABLE_DISK_STU_JOB_THROTTLING="no"
DISABLE_STANDALONE_DRIVE_EXTENSIONS="no"
MEDIA_REQUEST_DELAY="0"
MUST_USE_LOCAL_DRIVE="no"
NON_ROBOTIC_MEDIA_ID_PREFIX="A"
MAX_REALLOC_TRIES="1000"
DISABLE_BACKUPS_SPANNING_DISK="no"
DISALLOW_NONNDMP_ON_NDMP_DRIVE="no"
DO_NOT_EJECT_STANDALONE="no"
PREFER_NDMP_PATH_FOR_RESTORE="yes"
DONT_USE_SLAVE="no"
DRIVE_ERROR_THRESHOLD="2"
MEDIA_ERROR_THRESHOLD="2"
TIME_WINDOW="12"

SCSI_PROTECTION="SR"
NBUFS_DUP_TSU_TO_DSU="no"
NBUFS_DESTINATION_DSU="NONE"
NBUFS_RETENTION_LEVEL="0"
MPMS_DISABLE_RANK="0"
MPMS_DISABLE_EVENTS="no"
UNRESTRICTED_SHARING="no"
FATPIPE_USAGE_PREFERENCE="Preferred"
FATPIPE_WAIT_PERIOD="15"
FATPIPE_RESTORE_WAIT_PERIOD="5"
FT_MAX_CLIENT_PORTS_PER_SERVER="2"
FT_MAX_CLIENTS_PER_PORT="2"
SHAREDDISK_MOUNT_POINT="/nbushareddisk"
RETURN_UNASSIGNED_MEDIA_TO_SCRATCH_POOL="yes"
VAULT_CLEAR_MEDIA_DESC="no"
SCSI_PERSISTENT_RESERVE="0"
Command completed successfully.
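
If you want to compare these values across several media servers, the output can be parsed into a dictionary. This is only a rough sketch: it assumes every setting appears as a NAME="value" line as in the listing above, the function name `parse_listsettings` is made up, and the sample text is a shortened copy of that listing.

```python
import re

# One setting per line, in the NAME="value" form shown in the listing.
SETTING_RE = re.compile(r'^([A-Z0-9_]+)="(.*)"$')


def parse_listsettings(output: str) -> dict:
    """Collect NAME="value" lines into a dict, skipping banner lines."""
    settings = {}
    for line in output.splitlines():
        match = SETTING_RE.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


sample_output = '''NBEMMCMD, Version:6.5.4
The following configuration settings were found:
DRIVE_ERROR_THRESHOLD="2"
MEDIA_ERROR_THRESHOLD="2"
TIME_WINDOW="12"
Command completed successfully.
'''

settings = parse_listsettings(sample_output)
for name in ("DRIVE_ERROR_THRESHOLD", "MEDIA_ERROR_THRESHOLD", "TIME_WINDOW"):
    print(name, "=", settings.get(name))
```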


Dion, 05 Mar 2010

Another thing to check

Saw something like this a while ago and it was related to configuration. Another thing to check is that you don't have conflicting storage units. If it is a single library with the same drive types, make sure you only have a single storage unit per media server. If you have the drives set up in multiple storage units, they will conflict with each other and NetBackup will randomly down the drives.

Bill_Burditzman, 05 Mar 2010

Check your system logs

NBU will throw a 'DOWN' instance to the OS syslog, if it is configured to log critical errors.
Check the OS syslog for the root cause logged before the application marked the device as down/unavailable.

One can modify the thresholds before NBU gives up via nbemmcmd commands:
DRIVE_ERROR_THRESHOLD="2"  (3 soft errors in 12 hours downs the drive)
MEDIA_ERROR_THRESHOLD="2" (3 soft errors in 12 hours freezes the media)
TIME_WINDOW="12"
but this only serves to mask the trouble.
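
For what it's worth, the threshold/time-window rule can be sketched like this. It is an illustration of the reading given in this post (the third error inside the window downs the drive when the threshold is 2), not NBU's actual code; the note quoted earlier in the thread reads the threshold as "2 errors", so the comparison is an assumption. Function names are made up.

```python
from datetime import datetime, timedelta


def errors_in_window(error_times, now, window_hours=12):
    """Count errors whose timestamp falls within the last window_hours."""
    cutoff = now - timedelta(hours=window_hours)
    return sum(1 for stamp in error_times if cutoff <= stamp <= now)


def would_down_drive(error_times, now, threshold=2, window_hours=12):
    # Assumption: action is taken once the count exceeds the threshold,
    # i.e. on the 3rd error when the threshold is 2.
    return errors_in_window(error_times, now, window_hours) > threshold


now = datetime(2010, 3, 5, 6, 0)
recent = [now - timedelta(hours=h) for h in (1, 5, 11)]   # all inside 12 hours
spread = [now - timedelta(hours=h) for h in (1, 5, 13)]   # one is too old
print(would_down_drive(recent, now))
print(would_down_drive(spread, now))
```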

Check the file
/usr/openv/netbackup/db/media/errors
(Windows: same NBU path)
on any media server where drives are attached for a history of hardware issues since the first installation.
You'll find a date/time, the media involved, the drive index used and the error type.
If you see the same volume giving grief as it is passed around between drives, it's most likely a bad volume.
Mark it frozen and/or eject it, if it wasn't already frozen due to errors.

Likewise, a drive logging errors while using different volumes indicates a bad drive.
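
The bad-volume versus bad-drive check can be roughed out as below. This is only a sketch: it assumes the error records have already been split into (date/time, media id, drive index, error type) tuples, since the exact layout of the errors file is not shown in this thread. The function name `classify`, the media ids and the error type strings are all made up for illustration.

```python
from collections import defaultdict


def classify(records):
    """Return (suspect_media, suspect_drives) from parsed error records."""
    drives_per_media = defaultdict(set)
    media_per_drive = defaultdict(set)
    for _when, media_id, drive_index, _error_type in records:
        drives_per_media[media_id].add(drive_index)
        media_per_drive[drive_index].add(media_id)
    # Same volume failing on several drives: the volume is the likely culprit.
    suspect_media = sorted(m for m, drives in drives_per_media.items() if len(drives) > 1)
    # Same drive failing with several volumes: the drive is the likely culprit.
    suspect_drives = sorted(d for d, media in media_per_drive.items() if len(media) > 1)
    return suspect_media, suspect_drives


# Hypothetical records, not copied from a real errors file.
sample_records = [
    ("03/04/10 01:10:00", "A00001", 0, "WRITE_ERROR"),
    ("03/04/10 02:30:00", "A00001", 1, "WRITE_ERROR"),
    ("03/04/10 03:15:00", "B00002", 3, "POSITION_ERROR"),
    ("03/04/10 04:40:00", "C00003", 3, "WRITE_ERROR"),
]
bad_media, bad_drives = classify(sample_records)
print("suspect media:", bad_media)
print("suspect drives:", bad_drives)
```

Here A00001 fails on two different drives, so it is flagged as a suspect volume, while drive index 3 fails with two different volumes, so it is flagged as a suspect drive.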