Troubleshooting Robot or Drive Issues in NetBackup

Article:TECH169477  |  Created: 2011-09-13  |  Updated: 2014-11-04  |  Article URL http://www.symantec.com/docs/TECH169477
Article Type
Technical Solution


Subject

Issue



Troubleshooting Drive/Library Issues in NetBackup 

This document provides information on, and how to resolve, various tape drive issues that may be encountered whilst using NetBackup.


Solution



It is important to understand that NetBackup does not write data directly to a tape drive. For example, when using Solaris, NetBackup relies on the operating system to write the data to the tape using the st tape driver.  NetBackup's only involvement is that it specifies the block size to use, and even this is passed to the operating system.  Other operating systems work in a similar manner.

The SCSI pass-through driver (the sg driver on Solaris) allows SCSI commands to be passed directly to the drive. For example, the 'test-unit-ready' SCSI command is used when mounting a tape.  On occasion, it is necessary to recreate/rebuild the pass-through driver. The most common symptom involving the pass-through driver is the scan command not showing all expected devices. Other issues involving the pass-through driver are very rare.

The majority of drive/tape issues have a cause outside of NetBackup.  When troubleshooting these issues it is advisable to start the troubleshooting process at the hardware/firmware level.
 
It should always be considered that although NetBackup reports an error, it does not mean that NetBackup is the cause.
  
Common drive issues include:
 
Scan command
TAPE_ALERT
ASC/ASCQ
Missing Path
Positioning errors
Read/ Write errors
I/O Errors
External event has caused rewind
Tapes not reaching capacity (for example, 300 GB of data is written to a tape with a 400 GB native capacity)
Tapes being incorrectly marked as 'read only'
Library Inventory Issues
Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"
Missing drives, or drives disappearing and reappearing
Tapes failing to mount in NetBackup, but visible and usable by operating system commands
Issues moving tapes to/ from slots or drives
Issues with Cartridge memory
Cleaning tape
 
 
In the first instance, it is always worth power cycling the library or drives reporting an issue, as well as rebooting the associated servers.  Many of the errors referenced in this tech note can sometimes be cleared this way.  If this does not clear the issue, a transient fault has at least been eliminated as the cause.
 
 
Scan Command
 

Scenario: The scan command shows no devices at all, or some or all of the devices disappear and reappear when the command is run repeatedly.

Firstly, it must be confirmed that the operating system can see and communicate correctly with the tape drives.

The devices appearing in (for example)  'Device Manager'  (Windows) or cfgadm (Solaris) is NOT necessarily sufficient confirmation that the devices are correctly configured to the operating system.

It has been seen that although devices appear to be visible to the operating system, SAN issues prevented full/correct communication, and as a result, the scan command failed.

Two things need to be checked before further troubleshooting is carried out:

 1.  Ensure no backups are running on the drives (only applicable if the drives are shared).  A SCSI reservation of a drive due to a backup may prevent the drive from responding to, and thus appearing in the output of, the scan command.

 2.  Rebuild the passthrough driver (Unix only). If the drive/operating system configuration has not changed, then this is very unlikely to be the issue. However, it can be eliminated from being the cause by recreating the passthrough links and files. See the device configuration guide for information on how to do this.

Aside from the exceptions above, issues with the scan command are not caused by NetBackup. Once it is understood how the scan command works, it is clear that the root of these issues is external to NetBackup.

Although the scan command is supplied by Symantec, it does not issue any NetBackup commands, or interact with NetBackup in any way. When run, it issues operating system level SCSI commands to the devices configured in the operating system, and the output of the command is sent from the devices themselves. There are no settings, tuning or troubleshooting that can be performed on the scan command.

Windows servers do not require a passthrough driver.  Providing that there are no backups running on other servers that may share the drives, then the problem will be caused by either an issue regarding the SAN, firmware, hardware or drivers.  Consideration should be given to SAN infrastructure (e.g. switches), HBAs or the physical drive/library. 

Unix servers require a passthrough driver. For example, on Solaris this is called the sg driver.  This is required as the SCSI commands issued to query the device cannot be passed to the devices via the regular operating system driver.

If the scan command shows devices disappearing and reappearing, then the passthrough driver is not the cause.  If the device(s) permanently disappear, it may be worth reconfiguring the passthrough driver.  If this does not resolve the issue, then the cause will be as per Windows servers, that is, SAN infrastructure (e.g. switches), HBAs or the physical drive/library. Consideration should also be given to HBA configuration files, as incorrect settings in these have been seen to prevent output from the scan command being returned.
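The distinction between devices that come and go and devices that vanish for good can be sketched in Python. This is a minimal illustration, not a NetBackup tool: it assumes the device serial numbers seen on each run of the scan command have already been collected (how they are extracted is left out), and the function name and serial numbers are hypothetical.

```python
from typing import Dict, List, Set


def classify_devices(expected: Set[str], runs: List[Set[str]]) -> Dict[str, str]:
    """Classify each expected device by how often it appeared across scan runs.

    expected: serial numbers that should be visible (hypothetical input).
    runs:     one set of serial numbers per run of the scan command.
    """
    if not runs:
        raise ValueError("at least one scan run is required")
    result = {}
    for serial in sorted(expected):
        seen = sum(1 for run in runs if serial in run)
        if seen == len(runs):
            result[serial] = "stable"
        elif seen == 0:
            # Never seen: reconfiguring the passthrough driver may be worth
            # trying before looking at the SAN, HBAs or hardware.
            result[serial] = "missing"
        else:
            # Comes and goes: the passthrough driver is not the cause.
            result[serial] = "intermittent"
    return result


# Hypothetical serial numbers collected from three runs.
runs = [{"SN001", "SN002"}, {"SN001"}, {"SN001", "SN002"}]
report = classify_devices({"SN001", "SN002", "SN003"}, runs)
for serial, state in report.items():
    print(serial, state)
```

A device reported as "intermittent" points towards the SAN, HBA or hardware checks described above rather than towards the passthrough driver.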

Provided the passthrough driver is configured, Symantec recommends consulting your hardware vendors and/or Operating System/SAN Administrators to further investigate scan command issues.

Known Issues:

Some 6GB SAS HBAs are not compatible with the mpt_sas driver, as detailed in Oracle's TechNote:  http://docs.oracle.com/cd/E19253-01/821-0382/821-0382.pdf

TapeAlert/Tape Alert
  
 
A tape alert message is a critical, warning, or informational alert that occurs due to a tape drive or robotic library hardware event. These "tape alert" messages are stored on the tape drive or robotic library. Applications like NetBackup query the tape device or robotic library for these "tape alert" messages and display the "tape alerts" to the user. "Tape alert" messages are reported in the NetBackup bptm log. The tape alert technology detects and logs hardware and media errors.
 
It is important to remember that while NetBackup displays these "tape alerts", the alerts occur due to a tape drive or robotic library hardware event. Check the Event Viewer/system log for any hardware related errors. Contact the Original Equipment Manufacturer (OEM) for support.
 
As a TapeAlert is sent from the drive itself, it is impossible that this can be caused by NetBackup.
 
For example:
 
Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01
 
To further investigate TapeAlert issues, Symantec recommends contacting your hardware vendor.
 
A link to the TechNote "Description of Tape Alerts and code definitions" is provided at the bottom of this TechNote
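When many such messages need to be reviewed, the fields can be pulled out of each line with a short script. The following is a sketch only: the pattern is derived from the single example line above, so other NetBackup versions or platforms may format the message differently, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern derived from the example log line in this TechNote (an assumption
# about the general format, not a documented specification).
TAPE_ALERT_RE = re.compile(
    r"TapeAlert Code: (?P<code>0x[0-9a-fA-F]+), "
    r"Type: (?P<type>[^,]+), "
    r"Flag: (?P<flag>[^,]+), "
    r"from drive (?P<drive>\S+) \(index (?P<index>\d+)\), "
    r"Media Id (?P<media_id>\S+)"
)


def parse_tape_alert(line: str) -> Optional[Dict[str, object]]:
    """Return the TapeAlert fields found in a log line, or None if absent."""
    match = TAPE_ALERT_RE.search(line)
    if match is None:
        return None
    return {
        "code": int(match.group("code"), 16),
        "type": match.group("type"),
        "flag": match.group("flag"),
        "drive": match.group("drive"),
        "index": int(match.group("index")),
        "media_id": match.group("media_id"),
    }


sample = (
    "Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] "
    "TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, "
    "from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01"
)
alert = parse_tape_alert(sample)
print(alert)
```

Grouping the parsed alerts by drive or by media ID can help show the hardware vendor whether the alerts follow a particular drive or a particular cartridge.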
 
 
ASC/ ASCQ
 
 
SCSI sense keys describe a 'state' and are returned by the device when a command ends with a 'check condition' status.
In this example, robtest was failing to load a tape into a drive.
 
Initiating MOVE_MEDIUM from address 1000 to 500 
move_medium failed, CHECK CONDITION 
sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED 
 
The analysis can be broken down as follows :
 
Sense Key 0x5 - Illegal Request
ASC/ASCQ 0x30/0x00 - Incompatible Medium Installed
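The same breakdown can be done programmatically. The sketch below assumes the line layout shown in the robtest output above; the sense key names are those defined in the SCSI standards, while the function name is hypothetical. ASC/ASCQ pairs are deliberately not decoded here, as the full table is maintained by T10.

```python
import re

# Sense key names as defined in the SCSI Primary Commands (SPC) standard.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0xB: "ABORTED COMMAND",
}

# Layout assumed from the robtest output quoted in this TechNote.
SENSE_RE = re.compile(
    r"sense key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Extract sense key, ASC and ASCQ from a line; return None if not present."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": asc,
        "ascq": ascq,
    }


decoded = decode_sense(
    "sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED"
)
print(decoded)
```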
 
In a similar manner to Tape Alerts, SCSI Sense Keys are produced by the device, not by NetBackup.
As ASC/ASCQ alerts are sent from the hardware, it is impossible for them to be caused by NetBackup. 
It has been seen that a power cycle of the drive (not soft reset) can sometimes clear ASC/ASCQ errors.
 
Further information on these values can be found at http://www.t10.org
 
To further investigate ASC/ASCQ issues, Symantec recommends contacting your hardware vendor.
 
 
Note
 
If hardware encryption is in use via NetBackup KMS, an issue with the service may cause the drives to send out ASC/ASCQ errors relating to encryption.  In this instance, although the drive is sending the message, the cause may be the KMS service, and so this should be given consideration.
 

Missing Path

 

Missing path means that the Operating System has lost connectivity to the drives.  At this point, you will find that the devices are also missing from the scan output, simply because the scan command only communicates with devices found at the Operating System level.

NetBackup is not the cause of this issue. However, once the issue is resolved, the paths to the devices may have changed, making the NetBackup configuration incorrect.  If this is the case, the devices will need to be deleted and reconfigured within NetBackup.  If the devices come back with the same operating system paths, then no further action should be required.
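The decision described above (reconfigure only if the operating system paths have changed) can be illustrated as a simple comparison. This is a sketch under assumptions: the two dictionaries map a drive serial number to a device path and are presumed to have been gathered by hand or by other means, the function name is hypothetical, and the second serial number and its paths are made up for the example.

```python
def paths_needing_reconfiguration(configured, detected):
    """Return drives whose OS path differs from the path NetBackup has configured.

    Both arguments map a drive serial number to a device path. Drives that are
    still absent from `detected` are skipped: they remain a connectivity
    problem rather than a configuration mismatch.
    """
    changed = {}
    for serial, old_path in configured.items():
        new_path = detected.get(serial)
        if new_path is not None and new_path != old_path:
            changed[serial] = (old_path, new_path)
    return changed


# Paths as configured in NetBackup before the outage (second entry hypothetical).
configured = {"HM74536FFS": "/dev/rmt/0cbn", "HM74536FFT": "/dev/rmt/1cbn"}
# Paths reported by the operating system after connectivity was restored.
detected = {"HM74536FFS": "/dev/rmt/0cbn", "HM74536FFT": "/dev/rmt/4cbn"}

changed = paths_needing_reconfiguration(configured, detected)
print(changed)
```

An empty result corresponds to the case where no further action should be required.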

 

 
Positioning Errors
  
 
Positioning errors occur when the operating system is unable to position, fast-forward or rewind the tape.
The error message seen may differ slightly, depending on when the error occurs.
 
Example 1
<2> write_data: block position check: actual 62504, expected 31254 
 
Example 2
1/11/2010 7:50:13 AM - Error bptm(pid=3364) ioctl (MTREW) failed on media id W00229, drive index 0, The I/O bus was reset. (1111) (bptm.c.8039)    
 
NetBackup requests the operating system to position the tape, at various points of the backup.  Failure to correctly position, although detected by NetBackup, is most commonly caused by:
 
1.  Hardware error
2.  Tape error
3.  Driver issue
4.  Firmware issue
 
As NetBackup does not directly position tapes, Symantec recommends contacting your hardware vendor to further investigate positioning errors.
 
Note
 
a) One known issue can be seen in the bptm log, affecting NBU 6.5.6 to 7.0.1.
 
Error bptm (pid=2164) ioctl (MTWEOF) failed on media id V01497, drive index 0, The physical end of the tape has been reached.
 
EEB 2182228 resolves this issue.
If the issue is not resolved by this EEB, or you see this issue at an earlier or later version of NetBackup (before 6.5.6 or after 7.0.1), then the issue is related to firmware or hardware.
 
b) Between NetBackup 6.5.6 and 7.1.0.3, duplications of MPX backups may result in a positioning error / status 94.  To investigate this, Symantec suggests logging a support call and quoting eTrack 2229875.
 
Read/ Write errors
 
 
The reading or writing operation is performed at the operating system/driver level.  Therefore, although this issue is detected and reported in the NetBackup logs, it is not caused by NetBackup.
The cause of read/write errors is usually an issue with the tape drive or media cartridge.
 
For example:
 
Example 1
write_data: cannot write image to media id XXXXXX, drive index #, Data error (cyclic redundancy check).

Example 2
io_write_block: write error on media id MIR107, drive index 0, writing header block, 1117

Example 3
Error bptm(pid=5268) cannot read image from media id 500507, drive index 1, err = 234
 
Note
 
a) McAfee anti-virus software is known to be a possible cause of Status 84 errors on Windows Media Servers.
b) Cyclic redundancy check errors indicate faulty hardware.
c)  MSEO is not compatible with Asynchronous Tapemarks, which were introduced in NetBackup 7.1.  Symptoms include write and/or read errors on tapes encrypted with MSEO.  Creating the empty file '/usr/openv/netbackup/db/config/DISABLE_IMMEDIATE_WEOF' will resolve the issue.
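Creating the empty file mentioned in note c) could be scripted as follows. This is only a sketch: the path is the one given in the note, the helper name is hypothetical, and the script assumes it is run with sufficient privileges on a UNIX/Linux NetBackup host where that directory exists.

```python
from pathlib import Path

# Path taken from the note above; it exists only on a NetBackup UNIX/Linux host.
TOUCH_FILE = Path("/usr/openv/netbackup/db/config/DISABLE_IMMEDIATE_WEOF")


def create_touch_file(path: Path) -> bool:
    """Create an empty file at `path`; return True if it was newly created."""
    if path.exists():
        return False
    path.touch()
    return True


if __name__ == "__main__":
    # Only attempt this where the NetBackup configuration directory exists.
    if TOUCH_FILE.parent.is_dir():
        created = create_touch_file(TOUCH_FILE)
        print("created" if created else "already present", TOUCH_FILE)
    else:
        print("NetBackup config directory not found:", TOUCH_FILE.parent)
```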
 
 
I/O Error
 
 
I/O errors are caused at a hardware level, and are only detected by NetBackup.
 
For example:
 
11:20:18.246 [8504.5292] <4> write_data: WriteFile failed with: The request could not be performed because of an I/O device error. (1117); bytes written = 65536; size = 0
 
To further investigate I/O Errors, Symantec recommends contacting your hardware vendor.
 
 
Known issues
 
 
open failed in io_open I/O error 
 
This exact error can be caused by misconfiguration of the drives, so this should be checked in the first instance.  If the issue remains after confirmation that the configuration is correct, then the issue should be further investigated as a hardware/firmware issue.
 
  
External event has caused rewind
 
 
This issue is potentially serious and requires immediate investigation, as data can be lost.  NetBackup displays this error if the block position it has calculated does not match the position reported by the drive.  A failed check does not prove that a full rewind has occurred (this is impossible to tell from a simple block check), but it does mean that the position check has failed, most likely because the position reported by the drive is less than the expected position.
 
The error will look similar to the following:
 
<2> io_terminate_tape: block position check: actual 4, expected 5 
<16> write_data: FREEZING media id XXXXXX, External event caused rewind during write, all data on media is lost
  
NetBackup keeps track of how much data it is sending to the operating system to write to the device. NetBackup will ask the tape device for its position as an integrity check after the end of each write. If this position does not match what NetBackup has calculated the position should be, then the job will fail with a media write error.
 
If a full rewind has occurred, this will overwrite the NetBackup header on the tape, making it unreadable. If this has happened, the data on the media is lost.  The most common cause of this is a SCSI reset on the SAN, which causes a rewind of the drive(s) whilst they are being written to. This event is undetectable by NetBackup, and is only discovered after the event, when the block position check is made. NetBackup cannot cause SCSI resets on the SAN because the tape positioning and read/write operations are all controlled by the Operating System itself.
 
If the issue is a position error (as opposed to a 'Full' rewind) a message similar to the following will be seen upon inspection of the bptm log.
 
<2> write_data: block position check: actual 62504, expected 31254 
<16> write_data: FREEZING media id XXXXXX, too many data blocks written, check tape/driver block size configuration
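The two "block position check" messages quoted in this section differ only in whether the actual position is behind or ahead of the expected one. A minimal sketch of that comparison follows; the pattern is taken from the bptm excerpts in this TechNote, while the function name and the verdict wording are my own and are not NetBackup messages.

```python
import re

# Pattern taken from the bptm log excerpts quoted in this TechNote.
POSITION_RE = re.compile(r"block position check: actual (\d+), expected (\d+)")


def check_block_position(line):
    """Return (actual, expected, verdict) for a position-check line, else None."""
    match = POSITION_RE.search(line)
    if match is None:
        return None
    actual, expected = int(match.group(1)), int(match.group(2))
    if actual == expected:
        verdict = "ok"
    elif actual < expected:
        # Drive is behind where the writer expects it: consistent with a rewind.
        verdict = "behind expected position (possible rewind)"
    else:
        # Drive is further along than expected: consistent with a block size
        # or driver configuration problem.
        verdict = "ahead of expected position (too many blocks written)"
    return actual, expected, verdict


behind = check_block_position(
    "<2> io_terminate_tape: block position check: actual 4, expected 5"
)
ahead = check_block_position(
    "<2> write_data: block position check: actual 62504, expected 31254"
)
print(behind)
print(ahead)
```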
 
The possible causes are numerous, and most commonly include:
 
Tape driver issue
Tape drive firmware issue
SAN fault
HBA driver or firmware issue, or other fault
Switch Fault
 
If the drives are attached to an NDMP device, it must be ensured that the SCSI reservation on the NDMP device is set to match the SCSI reservation type of NetBackup.
 
To further investigate "External event has caused rewind" issues, Symantec recommends contacting your hardware /operating system support vendor.
 
Note
 
The SCSI reservation is set/held by the Host Bus Adaptor. However, NetBackup sends the reserve command through the SCSI pass-thru path for the device, so this needs to be configured correctly. 
 
Known Issues:
 
 
NDMP
 
 
If the issue is occurring on drives that are shared (SSO) between an NDMP filer and NetBackup, and the drives are zoned directly to the filer, then the issue can be caused if the SCSI reservation type set in NetBackup is not the same as the SCSI reservation type set on the filer.
 
If this is the case the issue can be resolved by following these steps :
 
 1.  In the 'Host Properties' > 'Media Type' tab in NetBackup, check which SCSI reservation type is set (SPC2 or SCSI persistent).
 2.  Change the type of SCSI reservation on the filer to match the type set in NetBackup.
 3.  Reboot the Robotic Library to break all the current reservations.
 
The following TechNote has a detailed explanation of SCSI reservation:  http://www.symantec.com/docs/HOWTO32767

 

HP-UX 11.31 IA64 / atdd driver
 
 
Scenario: BPTM block position check fails one block short using IBM atdd driver 6.0.0.96 on HP-UX 11.31 IA64
 
This issue is actually caused by the ATDD driver writing the EOT mark incorrectly.  However, Symantec has produced a NetBackup 7.0.1 EEB to work around this issue (ETrack 2142743 / TECH155113).
Using the ATDD driver with NetBackup 7.0.1 and later on HP-UX 11.31 IA64 requires atdd driver 6.0.2.8 or later. Upgrading to the newer ATDD driver resolves the issue.
  
 
Tapes not reaching capacity
 
 
Scenario: 300 GB of Data is written to a 400 GB capacity tape
 
NetBackup passes data to the OS, one block at a time, to be written to the tape drive. NetBackup has no understanding of tape capacity. In theory, it would keep writing to the same tape "forever".
 
When the tape physically passes the logical end-of-tape, this is detected by the tape drive firmware. The tape drive firmware then sets a 'flag' in the tape driver (this would be the st driver in the case of Solaris). There is still enough physical space on the tape for the current block to be written, so this completes successfully. NetBackup then attempts to send the next block of data (via the operating system) but now the tape driver refuses, as the 'tape full' flag is set. The st driver then passes this 'tape full' message to the operating system, which passes it to NetBackup.  Only when this has happened will NetBackup request the tape to be changed.
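The sequence above can be modelled with a toy simulation. This is purely illustrative: the class is a stand-in for the tape driver and drive firmware, not a real driver interface, and all names and block counts are hypothetical. The point it shows is that the writing side contains no capacity value at all; the amount written is decided entirely by when the simulated driver raises its 'tape full' flag.

```python
class SimulatedTapeDriver:
    """Toy stand-in for the OS tape driver; not a real driver interface."""

    def __init__(self, logical_eot_block):
        # Number of the write during which logical end-of-tape is crossed.
        self.logical_eot_block = logical_eot_block
        self.blocks_written = 0
        self.tape_full = False

    def write_block(self, block):
        # Once the flag is set, every further write is refused.
        if self.tape_full:
            return False
        self.blocks_written += 1
        # Logical end-of-tape is detected during this write; the block itself
        # still completes, but the 'tape full' flag is now set.
        if self.blocks_written >= self.logical_eot_block:
            self.tape_full = True
        return True


def write_until_full(driver, block=b"\0" * 65536):
    """The writer has no notion of capacity: it loops until a write is refused."""
    sent = 0
    while driver.write_block(block):
        sent += 1
    return sent


# Same writer, two drives: one signals end-of-tape where expected, the other
# signals it early (as a firmware or hardware fault might cause).
healthy = write_until_full(SimulatedTapeDriver(logical_eot_block=400))
early = write_until_full(SimulatedTapeDriver(logical_eot_block=300))
print(healthy, early)
```

Because `write_until_full` is identical in both runs, the shortfall in the second run can only come from the simulated drive, which mirrors the reasoning in this section.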
 
Common causes of this issue are tape drive firmware, or faulty hardware.
 
There are no settings in NetBackup that influence tape capacity.  To further investigate Tape Capacity issues, Symantec recommends contacting your hardware vendor.
 
 
Tapes being incorrectly marked as 'read only'
 
 
NetBackup has no understanding of 'read only'.  This state is detected by the tape drive, usually from a small physical switch on the tape cartridge.
Therefore, if a tape is being reported as 'read only' this issue cannot be the fault of NetBackup.
 
'Read only' is reported by the firmware of the tape drive and logged by NetBackup. This is seen as a TapeAlert:

0x09: 'Cartridge write protected'
 
It has been seen on occasion that firmware issues of the tape drive have caused tape media to be incorrectly reported as read only. 
 
 
Library Inventory Issues
 
 
NetBackup does not directly 'Inventory' a library. Instead it queries the library and waits to be told what tapes (via their barcodes) are located in which element address (slots/drives). If, for example, NetBackup cannot 'see' a particular cartridge(s) it is because the library is 'hiding' the location, not because of any setting within NetBackup.
 
For example, common symptoms of library issues include tapes appearing in the incorrect/wrong slot, and tapes/slots not appearing at all. It is impossible for this to be caused by NetBackup.
 
To further investigate Library issues, Symantec recommends contacting your hardware vendor.  
    
Note
 
Issues involving NetBackup and the Virtual I/O slots on the IBM 3500 series libraries where ALMS/Virtual I/O are enabled are occasionally seen.  
 
Problems involving Virtual I/O slots cannot be caused by NetBackup because there are no settings in NetBackup that can influence the behavior of the Virtual I/O slots.
 
It has been found that the library setting "Queued Exports" should be set to 'HIDE' from within the IBM web console to allow tapes to be moved from the virtual I/O slots to the slots within the logical library. 
 
 
Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"
 
 
This error is seen in the bptm log, and depending on the logging set, may be referenced in the .../volmgr/debug log, and possibly also the operating system event log.
 
An excellent way to check this is to use the robtest command. A link to a TechNote for documentation on Robtest is available at the end of the TechNote.
The robtest command does not issue any NetBackup commands.  It only sends operating system level SCSI commands to the library, and the output seen from the command is sent from the library firmware. Given this description, it is clear to see that Robtest failures cannot be caused by NetBackup.   
 
For example, this robtest command issues a move media request from slot 86 to drive 2:
 
m s86 d2
 
move_medium failed 
sense key = 0x4, asc = 0x15, ascq = 0x1, MECHANICAL POSITIONING ERROR 
 
As robtest has only sent a SCSI move request, it is immediately clear that this failure is not caused by NetBackup.
Further, the error is referencing an 'ASC/ASCQ' error, which, as explained in the "ASC/ASCQ" section of this tech note, is never caused by NetBackup.
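For readers unfamiliar with the robtest shorthand, the move commands quoted in this TechNote can be read as "m source destination", with a one-letter prefix for the location type. The sketch below covers only the forms that appear in this document (s for slot, d for drive, p for a port of the CAP/MAP); robtest offers other commands and options that are not modelled, and the function name is hypothetical.

```python
# Location prefixes seen in this TechNote's examples only.
LOCATION_TYPES = {"s": "slot", "d": "drive", "p": "port"}


def parse_move(command):
    """Parse a robtest-style move such as 'm s86 d2' into (source, destination)."""
    parts = command.split()
    if len(parts) != 3 or parts[0] != "m":
        raise ValueError(f"not a move command: {command!r}")

    def parse_location(token):
        kind = LOCATION_TYPES.get(token[:1])
        if kind is None or not token[1:].isdecimal():
            raise ValueError(f"unrecognised location: {token!r}")
        return kind, int(token[1:])

    return parse_location(parts[1]), parse_location(parts[2])


print(parse_move("m s86 d2"))   # slot 86 to drive 2
print(parse_move("m p1 s28"))   # port 1 of the CAP to slot 28
```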
 
To further investigate robotic operation issues, Symantec recommends involving the Library's vendor.
 
 
Missing drives, or drives disappearing and reappearing
 
 
This applies in cases where, for example, tpautoconf -report_disc shows inconsistent numbers of missing devices when the command is run at different times.
 
tpautoconf -report_disc will report "Missing Device" if a device that is configured and available within NetBackup has become undetectable from the Operating System.
 
For example:
 
 ======================= Missing Device (Drive) ======================
 Inquiry = "IBM  Ultrium 3-SCSI 
 Serial Number =  HM74536FFS
 Drive Path = /dev/rmt/0cbn
 Drive Name = DRV_F2D3_LTO5
 
In this case, NetBackup is only reporting that the Operating System cannot find a device that was previously available.
 
If a different number of devices are missing at different times (that is, the devices 'disappear' and 'reappear') this is very likely a SAN issue.
 
NetBackup has no control over the communication between the devices and the operating system.
 
If a device is showing as missing, then there must be an issue outside of NetBackup. Problems on the SAN are a very common cause of this issue.
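To compare reports taken at different times, the "Missing Device" blocks can be reduced to a list of records. The following sketch assumes the layout of the single excerpt shown in this section; real output may contain other block types and fields, and the function name is hypothetical.

```python
def parse_missing_devices(report):
    """Split a report into one dict per 'Missing Device' block."""
    devices = []
    current = None
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if line.startswith("=") and "Missing Device" in line:
            # A banner line starts a new device record.
            current = {}
            devices.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            # Drop surrounding whitespace and any stray quote characters.
            current[key.strip()] = value.strip().strip('"').strip()
    return devices


# Excerpt copied from the example in this TechNote.
sample = """
 ======================= Missing Device (Drive) ======================
 Inquiry = "IBM  Ultrium 3-SCSI
 Serial Number =  HM74536FFS
 Drive Path = /dev/rmt/0cbn
 Drive Name = DRV_F2D3_LTO5
"""

devices = parse_missing_devices(sample)
for device in devices:
    print(device["Serial Number"], device["Drive Path"], device["Drive Name"])
```

Collecting the serial numbers from several runs makes it easier to show whether the same devices are always missing or whether the set changes from run to run.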

 

Tapes failing to mount in NetBackup, but visible and usable by operating system commands

Cases have been seen in which tapes are physically loaded into the tape drive, and are accessible and respond correctly to operating system commands such as mt and dd, but NetBackup is unable to mount the tape.  The job hangs on the tape mount, failing with status 98 after some time.

Understandably, this could be seen to suggest that NetBackup is at fault. However, upon investigation it was found that the fault was caused by the tape drive firmware.

 

Issues moving tapes to/from slots or drives

Failure to move tapes to/from slots or drives will have a cause outside of NetBackup.  Moving tapes is achieved via industry standard SCSI commands, not NetBackup commands.

Various messages could be seen, depending on the exact fault, for example :

Auto empty media export request rejected by TLDCD; Cannot move from media access port

Here it is seen that an operation to empty the CAP/MAP during an inventory is failing.

Attempting to move the tape using robtest produced the following error:

m p1 s28

Attempting to move the tape in port 1 of the CAP to slot 28
 

Initiating MOVE_MEDIUM from address 10 to 1027

move_medium failed

sense key = 0x4, asc = 0x40, ascq = 0x1, UNKNOWN ERROR, KEY: 0x04, ASC: 0x40, ASCQ: 0x01

As seen in the ASC/ASCQ section earlier in this tech note, errors such as this cannot be caused by NetBackup.

In this case, the cause of the issue was that the robot was unable to access its own slots.

To investigate issues moving media within the robot, Symantec recommends contacting the hardware vendor.

 

Issues with Cartridge memory

LTO tapes contain a small EEPROM chip, known as LTO-CM.

This has multiple uses. For example, it is used by the drive to determine the LTO tape generation, and it holds an 'error log' and the manufacturer details of the tape.

It also contains information on the position of data contained on the tape, which allows for fast block positioning.

Errors will be reported if the cartridge memory fails.

For example, the following messages were reported by the Library when an LTO-CM chip failed in a cartridge:

Description: The memory in the tape cartridge has failed.

Description: The tape drive encountered a problem while loading a tape cartridge.

Description: The tape drive detected an internal hardware problem.

Description: The tape drive has an error which requires the tape cartridge to be ejected for error recovery

These issues are related to hardware, and Symantec recommends contacting the hardware vendor for further investigation.

 

Cleaning Tape 

 
An unusual issue has been seen in NetBackup 7.5.  On occasion, a cleaning cycle run by NetBackup will fail.
 
The symptoms may differ slightly :
 
A. The tape cannot be unloaded. The /var/adm/messages log will show:
 
Mar 14 12:49:38 server02 tldcd[19756]: [ID 559682 daemon.notice] TLD(2) closing/unlocking robotic path
Mar 14 12:49:38  server02 tldcd[9536]: [ID 919746 daemon.notice] inquiry() function processing library ADIC     Scalar i2000     607A:
Mar 14 12:49:38  server02 tldd[9524]: [ID 583323 daemon.notice] DecodeClean: TLD(2) drive 5, Actual status: Unable to SCSI unload drive
Mar 14 12:49:39  server02 ltid[9497]: [ID 512328 daemon.notice] LTID - received ROBOT MESSAGE, Type=55, LongParam=0, Param1=1, Param2=10
Mar 14 12:49:39  server02 ltid[9497]: [ID 581313 daemon.error] Cleaning for drive 1 failed, status = Unable to SCSI unload drive
Mar 14 12:49:48  server02 bptm[19765]: [ID 946237 daemon.warning] TapeAlert Code: 0x0b, Type: Informational, Flag: CLEANING MEDIA, from drive PER-i2000-Drive5 (index 1), Media Id CLN001
Mar 14 12:49:49 server02 ltid[9497]: [ID 560358 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=1
 
B. Once the tape drive is cleaned, a new tape is loaded and reloaded repeatedly. The /usr/openv/volmgr/debug/robots log will show:
 
12:43:53.753 [3016] <4> AddTldLtiReqEntry: Processing ROBOT_CLEAN request...
12:43:53.753 [3016] <5> CleanDrive: TLD(0) Cleaning Tape 4TP012 on drive 5, from slot 41
12:43:53.758 [3424] <4> io_open:        Drive Path = /dev/rmt/6cbn
...
12:43:54.018 [3026] <5> tldcd:mount_unmount_drive: Processing MOUNT, TLD(0) drive 5, slot 41, barcode CLN4TP012L4     , vsn 4TP012
...
12:46:01.764 [3026] <5> tldcd:mount_unmount_drive: Processing UNMOUNT, TLD(0) drive 5, slot 41, barcode CLN4TP012L4     , vsn 4TP012
...
 
12:46:12.789 [3016] <5> GetResponseStatus: DecodeClean: TLD(0) drive 5, Actual status: Unable to SCSI unload drive 

This issue is caused by the 'access bit' being set to 1 on the tape drive.

The issue is resolved with EEB 2714761.

 

 

 

 






