DOCUMENTATION: How Symantec NetBackup(tm) determines if a tape should be frozen or the status of a tape drive should be changed to down, and how to change this behavior

Article:TECH9748  |  Created: 2001-01-09  |  Updated: 2014-01-15  |  Article URL http://www.symantec.com/docs/TECH9748
Article Type
Technical Solution


Issue



DOCUMENTATION: How Symantec NetBackup determines if a tape should be frozen or the status of a tape drive should be changed to down, and how to change this behavior


Solution




Modification:
When a read, write, or position error occurs on a tape, it is difficult to know whether the error was caused by the media or by the drive itself. The only error available comes from the operating system, which reports just "I/O ERROR". To prevent a bad media or drive from causing all backups in a given time frame to fail, NetBackup uses the past error history to determine whether a media or drive is bad.

Each time an I/O error occurs on a read, write, or position operation, bptm logs the error in an errors file. Each entry consists of the time of the error, the media ID, the drive index, and the type of error.
 
The errors file is located on each media server at:
/usr/openv/netbackup/db/media/errors  (Unix)
<install_path>\Veritas\NetBackup\db\media\errors  (Windows)

Sample entries in this file are:
05/21/06 04:15:17 A00167 4 WRITE_ERROR
05/26/06 12:37:47 A00168 4 READ_ERROR
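
As an illustration of the entry layout described above, the following minimal Python sketch splits one line of the errors file into its four fields. It assumes the fields are whitespace-separated and that the date is in MM/DD/YY form, as in the sample entries. The names ErrorEntry and parse_error_line are hypothetical and are not part of NetBackup.

```python
from datetime import datetime
from typing import NamedTuple


class ErrorEntry(NamedTuple):
    """One line of the errors file (hypothetical helper type)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def parse_error_line(line: str) -> ErrorEntry:
    """Parse a line such as '05/21/06 04:15:17 A00167 4 WRITE_ERROR'."""
    # Assumption: exactly five whitespace-separated tokens per entry.
    date_part, time_part, media_id, drive_index, error_type = line.split()
    when = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")
    return ErrorEntry(when, media_id, int(drive_index), error_type)


entry = parse_error_line("05/21/06 04:15:17 A00167 4 WRITE_ERROR")
print(entry.media_id, entry.drive_index, entry.error_type)
```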

Each time an entry is made, past entries in the file are scanned to determine whether the same media ID or drive has had the same type of error in the past "n" hours, where "n" is the TIME_WINDOW. The default time window is 12 hours. A media is not normally frozen, and a drive is not normally downed, the first time an error is encountered. Two other parameters, MEDIA_ERROR_THRESHOLD and DRIVE_ERROR_THRESHOLD, set the number of errors that must occur within the time window; the default value for each is 3.

For example:
- If the same media ID gets write errors three times within the time window, on more than one drive, it is assumed that the media is bad and NetBackup freezes the media.
- If different media IDs get the same error three times within the time window on the same drive, it is assumed that the drive is bad and NetBackup places that drive into a "DOWN" state.
- If the same drive gets errors three times within the time window with the same media ID, NetBackup assumes that the media is bad and freezes it.
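
The three cases above can be approximated with the following Python sketch. It is not bptm's actual code; it is a simplified model that assumes the media count is checked before the drive count and that the error being logged counts toward both totals. The function name decide_action and the return strings are hypothetical.

```python
from datetime import datetime, timedelta

# Each error is a tuple: (timestamp, media_id, drive_index, error_type)


def decide_action(history, new_error, time_window_hours=12,
                  media_error_threshold=3, drive_error_threshold=3):
    """Return 'FREEZE_MEDIA', 'DOWN_DRIVE', or 'NO_ACTION' (simplified model)."""
    when, media_id, drive_index, error_type = new_error
    cutoff = when - timedelta(hours=time_window_hours)

    # Keep only errors of the same type that fall inside the time window,
    # including the error that was just logged.
    recent = [e for e in history + [new_error]
              if cutoff <= e[0] <= when and e[3] == error_type]

    media_count = sum(1 for e in recent if e[1] == media_id)
    drive_count = sum(1 for e in recent if e[2] == drive_index)

    # Assumption: the media threshold is evaluated before the drive threshold.
    if media_count >= media_error_threshold:
        return "FREEZE_MEDIA"
    if drive_count >= drive_error_threshold:
        return "DOWN_DRIVE"
    return "NO_ACTION"


t0 = datetime(2006, 5, 21, 4, 0, 0)
history = [
    (t0, "A00167", 4, "WRITE_ERROR"),
    (t0 + timedelta(hours=1), "A00168", 4, "WRITE_ERROR"),
]
# A third media fails on drive 4 within the window.
result = decide_action(history, (t0 + timedelta(hours=2), "A00169", 4, "WRITE_ERROR"))
print(result)  # DOWN_DRIVE
```

In this model, three errors for the same media ID yield FREEZE_MEDIA whether they occurred on one drive or several, which matches the first and third cases.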

The TIME_WINDOW, MEDIA_ERROR_THRESHOLD, and DRIVE_ERROR_THRESHOLD values are all configurable. If MEDIA_ERROR_THRESHOLD or DRIVE_ERROR_THRESHOLD is set to 0, the freeze or down occurs on the first error. MEDIA_ERROR_THRESHOLD is evaluated first, so if both are set to 0, freezing the media takes precedence over downing the drive. This configuration is not recommended.

If any one or a combination of the above values has been set, bptm logs a message indicating which value is used each time it goes through the algorithm. The log messages are:

"using time window of %d hours"
"using media error threshold of %d"
"using drive error threshold of %d"

where %d is the configured value.

In general, the freeze and down behavior is designed to aid in getting backups completed successfully. If read errors occur during a restore attempt, freezing of the media has little effect, as it is still necessary to have that same tape to perform the restore (or another copy if it exists). In the case of a restore, downing a bad drive may help, assuming the problem is with the drive.

To view the error threshold and window settings, run the following nbemmcmd command:
Windows
<Install_Path>\Veritas\NetBackup\bin\admincmd>nbemmcmd -listsettings -machinename <machine name>
Unix
/usr/openv/netbackup/bin/admincmd/nbemmcmd -listsettings -machinename <machine name>

Several parameters are displayed, including the following (the values on your system may differ):

DRIVE_ERROR_THRESHOLD="2"
MEDIA_ERROR_THRESHOLD="2"
TIME_WINDOW="12"
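
If these values need to be read by a script, a minimal Python sketch such as the following could extract them from captured output. It assumes each setting appears on its own line in the NAME="value" form shown above; the function name parse_settings is hypothetical, and the real output may contain other lines that this sketch simply ignores.

```python
import re


def parse_settings(output: str) -> dict:
    """Collect NAME="value" pairs from captured -listsettings output."""
    settings = {}
    for line in output.splitlines():
        # Assumption: one setting per line, upper-case name, quoted value.
        match = re.match(r'\s*([A-Z_]+)="([^"]*)"\s*$', line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


sample = '''DRIVE_ERROR_THRESHOLD="2"
MEDIA_ERROR_THRESHOLD="2"
TIME_WINDOW="12"'''
settings = parse_settings(sample)
print(int(settings["TIME_WINDOW"]))
```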

To change the error threshold and window settings, run the following nbemmcmd command:
Windows
<Install_Path>\Veritas\NetBackup\bin\admincmd>nbemmcmd -changesetting -machinename <machine name>
Unix
/usr/openv/netbackup/bin/admincmd/nbemmcmd -changesetting -machinename <machine name>
 
The parameters are specified as follows:

DRIVE_ERROR_THRESHOLD <unsigned integer>
MEDIA_ERROR_THRESHOLD <unsigned integer>
TIME_WINDOW <unsigned integer> 



 



Legacy ID



234412

