DOCUMENTATION: How Symantec NetBackup(tm) determines if a tape should be frozen or the status of a tape drive should be changed to down, and how to change this behavior
Article: TECH9748 | Created: 2001-01-09 | Updated: 2011-05-06 | Article URL: http://www.symantec.com/docs/TECH9748
When a read, write, or position error occurs on a tape, it is difficult to know whether the error is caused by the media or by the drive itself, because the only error reported comes from the operating system and says only "I/O ERROR". To prevent bad media or drives from causing all backups in a given timeframe to fail, NetBackup uses a method that attempts to determine, based on past history, whether a media or drive is bad.
Each time an I/O error occurs on a read, write, or position, bptm logs the error into an errors file. Each entry consists of the time of the error, the media ID, the drive index, and the type of error.
Sample entries in this file are:
05/21/06 04:15:17 A00167 4 WRITE_ERROR
05/26/06 12:37:47 A00168 4 READ_ERROR
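As an illustration only, entries in this format could be parsed as sketched below. The sketch assumes the fields are whitespace-separated and that the date is in MM/DD/YY order (consistent with the samples above); the names ErrorEntry and parse_error_line are hypothetical and are not part of NetBackup.

```python
from datetime import datetime
from typing import NamedTuple


class ErrorEntry(NamedTuple):
    """One line of the errors file (hypothetical structure for illustration)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def parse_error_line(line: str) -> ErrorEntry:
    """Split an entry into date, time, media ID, drive index, and error type."""
    date_part, time_part, media_id, drive_index, error_type = line.split()
    # Assumption: MM/DD/YY date order and a 24-hour clock.
    when = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")
    return ErrorEntry(when, media_id, int(drive_index), error_type)


entry = parse_error_line("05/21/06 04:15:17 A00167 4 WRITE_ERROR")
print(entry.media_id, entry.drive_index, entry.error_type)  # A00167 4 WRITE_ERROR
```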
Each time an entry is made, past entries in the file are scanned to determine if the same media ID or drive has had the same type of error in the past "n" hours, where "n" is the TIME_WINDOW. The default time window is 12 hours. NetBackup does not normally freeze a media or down a drive the first time the error is encountered. Two other parameters, MEDIA_ERROR_THRESHOLD and DRIVE_ERROR_THRESHOLD, set how many errors are required; the default value for each is 3.
- If the same media ID gets write errors three times within the time window, on more than one drive, it is assumed that the media is bad and NetBackup freezes the media.
- If different media IDs get the same error three times within the time window on the same drive, it is assumed that the drive is bad and NetBackup places that drive into a "DOWN" state.
- If the same drive gets errors three times within the time window with the same media ID, NetBackup assumes that the media is bad and freezes it.
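The rules above can be approximated with the following sketch. It is not bptm's actual code: the function decide_action, the ErrorEntry structure, and the return strings are hypothetical, and the sketch assumes that the media check runs before the drive check and that an error is counted only if it has the same type as the new one.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class ErrorEntry(NamedTuple):
    """One logged I/O error (hypothetical structure for illustration)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def decide_action(history, new, time_window_hours=12,
                  media_threshold=3, drive_threshold=3) -> Optional[str]:
    """Return "FREEZE_MEDIA", "DOWN_DRIVE", or None for the newly logged error."""
    cutoff = new.when - timedelta(hours=time_window_hours)
    # Consider only errors of the same type that fall inside the time window.
    recent = [e for e in list(history) + [new]
              if e.error_type == new.error_type and cutoff <= e.when <= new.when]

    # Media check first: the same media ID failing repeatedly,
    # whether on one drive or on several.
    media_hits = [e for e in recent if e.media_id == new.media_id]
    if len(media_hits) >= media_threshold:
        return "FREEZE_MEDIA"

    # Drive check: the same drive failing repeatedly. With the default (equal)
    # thresholds, reaching this point means more than one media ID was involved,
    # because a single media ID would already have been frozen above.
    drive_hits = [e for e in recent if e.drive_index == new.drive_index]
    if len(drive_hits) >= drive_threshold:
        return "DOWN_DRIVE"
    return None


t0 = datetime(2006, 5, 21, 4, 0, 0)
history = [
    ErrorEntry(t0, "A00167", 4, "WRITE_ERROR"),
    ErrorEntry(t0 + timedelta(hours=1), "A00168", 4, "WRITE_ERROR"),
]
new = ErrorEntry(t0 + timedelta(hours=2), "A00169", 4, "WRITE_ERROR")
print(decide_action(history, new))  # DOWN_DRIVE
```

In this example three different media IDs fail on drive 4 within two hours, so the sketch reports the drive rather than any single media.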
The TIME_WINDOW, MEDIA_ERROR_THRESHOLD and DRIVE_ERROR_THRESHOLD values are all configurable. In NetBackup 5.x and older, there are three files which, if they exist, contain a number which is used to determine the time window, when to down drives, and when to freeze media. The files are TIME_WINDOW, MEDIA_ERROR_THRESHOLD, and DRIVE_ERROR_THRESHOLD. They belong in the /usr/openv/netbackup directory on a UNIX server, and in the <install_path>\veritas\netbackup directory on a Windows server.
If the MEDIA_ERROR_THRESHOLD or DRIVE_ERROR_THRESHOLD value is set to 0, freeze or down occurs on the first error. MEDIA_ERROR_THRESHOLD is looked at first, so if both are set to 0, the freeze of the media overrides the downing of the drive. This configuration is not recommended.
If any one or a combination of the above files exists, bptm logs a message indicating which value is used each time it goes through the algorithm. The log messages are:
"using time window of %d hours"
"using media error threshold of %d"
"using drive error threshold of %d"
where the %d comes from the number obtained from the file.
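The file lookup and the resulting messages can be pictured with the sketch below. It only illustrates the behavior described for NetBackup 5.x and older and is not the product's implementation: the function load_settings is hypothetical, and the sketch assumes each file holds a single integer, possibly surrounded by whitespace.

```python
from pathlib import Path

# Defaults stated in the article: 12-hour window, thresholds of 3.
DEFAULTS = {"TIME_WINDOW": 12, "MEDIA_ERROR_THRESHOLD": 3, "DRIVE_ERROR_THRESHOLD": 3}
MESSAGES = {
    "TIME_WINDOW": "using time window of %d hours",
    "MEDIA_ERROR_THRESHOLD": "using media error threshold of %d",
    "DRIVE_ERROR_THRESHOLD": "using drive error threshold of %d",
}


def load_settings(directory):
    """Return (settings, log_lines); a log line is produced only for files that exist."""
    settings = dict(DEFAULTS)
    log_lines = []
    for name in DEFAULTS:
        path = Path(directory) / name
        if path.is_file():
            # Assumption: the file contains one integer and nothing else.
            settings[name] = int(path.read_text().strip())
            log_lines.append(MESSAGES[name] % settings[name])
    return settings, log_lines
```

On a UNIX server this would be called as load_settings("/usr/openv/netbackup"); if none of the three files exists, the defaults are returned and no messages are produced.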
In general, the freeze and down behavior is designed to aid in getting backups completed successfully. If read errors occur during a restore attempt, freezing of the media has little effect, as it is still necessary to have that same tape to perform the restore (or another copy if it exists). In the case of a restore, downing a bad drive may help, assuming the problem is with the drive.
In NetBackup 6.0 and later, these values are set using the nbemmcmd command. For how to set these values with nbemmcmd, see the last two items in the related documents section below.