Troubleshooting Robot or Drive Issues in NetBackup

Article:TECH169477  |  Created: 2011-09-13  |  Updated: 2012-06-26  |  Article URL http://www.symantec.com/docs/TECH169477
Article Type
Technical Solution

Problem



Troubleshooting Drive/ Library  Issues in NetBackup 

This document provides information on various tape drive issues that may be encountered whilst using NetBackup, and how to deal with them.


Solution



It is important to understand that NetBackup does not write data directly to a tape drive. For example, on Solaris, NetBackup relies on the operating system to write the data to the tape using the st tape driver.  NetBackup's only involvement is that it specifies the block size to use, and even this is passed to the operating system.  Other operating systems work in a similar manner.

The SCSI pass-through driver (the sg driver on Solaris) allows SCSI commands to be passed directly to the drive. These are SCSI commands such as 'test-unit-ready', which is used, for example, when mounting a tape.  On occasion it is necessary to recreate or rebuild the pass-through driver. The common symptom that involves the pass-through driver is that the scan command does not show the devices. Other issues involving the pass-through driver are very rare.

The majority of drive /tape issues have a cause outside of NetBackup.  When troubleshooting these issues it is advisable to start the troubleshooting process at the hardware/ firmware level.
 
It should always be remembered that although NetBackup reports an error, this does not mean that NetBackup is the cause.
  
Common drive issues include:
 
Scan command
TAPE_ALERT
ASC/ ASCQ
Positioning errors
Read/ Write errors
I/O Errors
External event has caused rewind
Tapes not reaching capacity (for example, 300GB of data is written to a tape with a 400GB native capacity)
Tapes being incorrectly marked as 'read only'
Library Inventory Issues
Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"
Missing drives, or drives disappearing and reappearing
Tapes failing to mount in NetBackup, but visible and usable by operating system commands
Issues moving tapes to/ from slots or drives
Issues with Cartridge memory
Cleaning tape
 
In the first instance, it is always worth power cycling the library or drives reporting an issue, as well as rebooting the associated servers.  Many of the errors referenced in this TechNote can sometimes be cleared this way.  If this does not clear the issue, it has at least been eliminated as the cause.
 
 
Scan Command
 

The scan command shows no devices at all, or some or all of the devices disappear and reappear when the command is run repeatedly.

Firstly, it must be confirmed that the operating system can see and communicate correctly with the tape drives.

The devices appearing in (for example)  'Device Manager'  (Windows) or cfgadm (Solaris) is NOT necessarily sufficient confirmation that the devices are correctly configured to the operating system.

It has been seen that although devices 'appear' to be visible to the operating system, SAN issues prevented full and correct communication, and as a result, the scan command failed.

 Two things need to be checked before further troubleshooting is carried out:

 1/  Check no backups are running on the drives (only applicable if the drives are shared).  A SCSI reservation of a drive due to a backup may prevent the drive from responding to, and thus appearing in the output of, the scan command.

 2/  Rebuild the 'pass through' driver (Unix only).  If the drive/ operating system configuration has not changed, this is very unlikely to be the issue, but it can be eliminated from being the cause by recreating the 'pass through' files.  See the device configuration guide for information on how to do this.

 Aside from the exceptions above, issues with the scan command are not caused by NetBackup.  Once it is understood how the scan command works, it is clear why the issues lie outside of NetBackup.

Although the scan command is supplied by Symantec, it does not issue any NetBackup commands, or interact with NetBackup in any way. When run, it issues 'operating system' SCSI commands to the devices configured in the operating system, and the output of the command is sent from the devices. There are no settings, 'tuning' or troubleshooting that can be performed on the scan command.

Windows servers do not require a pass-through driver.  Providing that there are no backups running on other servers that may share the drives, the issue will be caused by a SAN, firmware, hardware or driver issue.  Consideration should be given to the SAN infrastructure (e.g. switches), HBAs or the physical drive/library.

Unix servers require a pass-through driver; for example, on Solaris this is called the sg driver.  This is required because the SCSI commands issued to query the device cannot be passed to the devices via the regular operating system driver.

Once the sg driver is configured, providing the configuration is not changed, there should be no issue with the pass-through driver.  If the scan command shows devices disappearing and reappearing, then the pass-through driver is not the cause.  If one or more devices permanently disappear, it may be worth reconfiguring the pass-through driver.  If the issue is not resolved, then the cause will be as per Windows servers, that is, the SAN infrastructure (e.g. switches), HBAs or the physical drive/library.  Consideration should also be given to HBA configuration files, as incorrect settings in these have been seen to prevent output from the scan command being returned.

Providing the pass-through driver is configured (Unix only), Symantec recommends that the operating system or SAN administrators, or the hardware vendors, are consulted to further investigate scan command issues.

 TapeAlert / Tape Alert
  
A "tape alert" message is a critical, warning, or informational alert that occurs due to a tape drive or robotic library hardware event. These "tape alert" messages are stored on the tape drive or robotic library. Applications like NetBackup query the tape device or robotic library for these "tape alert" messages and display the "tape alerts" to the user. "Tape alert" messages are reported in the NetBackup bptm log. The tape alert technology detects and logs hardware and media errors.
 
It is important to remember that while NetBackup displays these "tape alerts," the alerts occur due to a tape drive or robotic library hardware event. Check the Event Viewer /system log for any hardware related errors.  Contact the Original Equipment Manufacturer (OEM) for support.
 
As a TapeAlert is sent from the drive it is impossible that this can be caused by NetBackup.
 
For example:
 
Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01
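
When many TapeAlert messages need to be handed to the hardware vendor, it can help to pull the individual fields out of the log. The following Python sketch is illustrative only: the pattern is modelled on the single log line quoted above, the function name parse_tape_alert is hypothetical, and other NetBackup versions may word the message differently.

```python
import re

# Pattern modelled on the bptm syslog line quoted above (an assumption:
# other NetBackup versions may format the message differently).
TAPE_ALERT_RE = re.compile(
    r"TapeAlert Code: (?P<code>0x[0-9a-fA-F]+), "
    r"Type: (?P<type>[^,]+), "
    r"Flag: (?P<flag>[^,]+), "
    r"from drive (?P<drive>\S+) \(index (?P<index>\d+)\), "
    r"Media Id (?P<media_id>\S+)"
)


def parse_tape_alert(line):
    """Return the TapeAlert fields found in a log line, or None."""
    match = TAPE_ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"], 16)   # "0x03" -> 3
    fields["index"] = int(fields["index"])
    return fields


if __name__ == "__main__":
    sample = (
        "Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] "
        "TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, "
        "from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01"
    )
    print(parse_tape_alert(sample))
```

Grouping the parsed results by drive name or by media id is a quick way to see whether the alerts follow a particular drive or a particular cartridge.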
 
To further investigate TapeAlert issues, Symantec recommends contacting your hardware vendor.
 
A link to the TechNote "Description of Tape Alerts and code definitions" is provided at the bottom of this TechNote
 
ASC/ ASCQ
 
SCSI sense keys describe a 'state', and are returned by the device when a command completes with a 'check condition' status.
In this example, robtest was failing to load a tape into a drive.
 
Initiating MOVE_MEDIUM from address 1000 to 500 
move_medium failed, CHECK CONDITION 
sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED 
 
The analysis can be broken down as follows :
 
Sense Key 0x5 - Illegal Request
ASC/ASCQ 0x30/0x00 - Incompatible Medium Installed
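
The same breakdown can be sketched in Python for use on larger volumes of robtest output. This is a minimal illustration, not a Symantec tool: the function name decode_sense_line is hypothetical, the pattern is modelled on the robtest lines quoted in this TechNote, and only the sense key is translated (the ASC/ASCQ text is taken from the line itself; the full tables are published at t10.org).

```python
import re

# Sense key names as defined in the SCSI Primary Commands standard.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Pattern modelled on the robtest output shown in this TechNote (assumption).
SENSE_RE = re.compile(
    r"sense key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), "
    r"ascq = (0x[0-9a-fA-F]+), (.*)"
)


def decode_sense_line(line):
    """Split a robtest sense line into its numeric parts and a key name."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups()[:3])
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": asc,
        "ascq": ascq,
        "text": match.group(4).strip(),  # description printed by robtest
    }


if __name__ == "__main__":
    print(decode_sense_line(
        "sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED"
    ))
```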
 
In a similar manner to Tape Alerts, SCSI Sense Keys are produced by the device, not by NetBackup.
As ASC/ASCQ alerts are sent from the hardware, it is impossible for them to be caused by NetBackup.
It has been seen that a power cycle of the drive (not soft reset) can sometimes clear ASC/ ASCQ errors.
 
Further information on these values can be found at http://www.t10.org
 
To further investigate ASC/ASCQ issues, Symantec recommends contacting your hardware vendor.
 
Note
 
If hardware encryption is in use via NetBackup KMS, an issue with the service may cause the drives to send out ASC/ASCQ errors relating to "Encryption".  In this instance, although the drive is sending the message, the cause may be the KMS service, and so this should be given consideration.
 
Positioning Errors
  
Positioning errors occur when the operating system is unable to position the tape, for example to forward space (fsf) or rewind (rew) it.
The error message seen may differ slightly, depending on when the error occurs.
 
Example 1
<2> write_data: block position check: actual 62504, expected 31254 
 
Example 2
1/11/2010 7:50:13 AM - Error bptm(pid=3364) ioctl (MTREW) failed on media id W00229, drive index 0, The I/O bus was reset. (1111) (bptm.c.8039)    
 
NetBackup requests the operating system to position the tape, at various points of the backup.  Failure to correctly position, although detected by NetBackup, is most commonly caused by:
 
1.  Hardware error
2.  Tape error
3.  Driver issue
4.  Firmware issue
 
As NetBackup does not directly position tapes, Symantec recommends contacting your hardware vendor to further investigate positioning errors.
 
Note
 
One known issue can be seen in the bptm log, affecting NBU 6.5.6 to 7.0.1.
 
Error bptm (pid=2164) ioctl (MTWEOF) failed on media id V01497, drive index 0, The physical end of the tape has been reached.
 
EEB 2182228 resolves this issue.
If the issue is not resolved by this EEB, or you see this issue at an earlier or later version of NetBackup (before 6.5.6 or after 7.0.1), then the issue is related to firmware or hardware.
 
Read/ Write errors
 
The reading or writing operation is performed at the operating system/ tapedriver level.  Therefore, although this issue is detected and reported in the NetBackup logs, it is not caused by NetBackup.
 
The cause of read/write errors is usually an issue with the tape drive or media cartridge.
 
For example:
 
Example 1
write_data: cannot write image to media id XXXXXX, drive index #, Data error (cyclic redundancy check).
 
Example 2
io_write_block: write error on media id MIR107, drive index 0, writing header block, 1117
 
Example 3
Error bptm(pid=5268) cannot read image from media id 500507, drive index 1, err = 234
 
Note
 
a) McAfee anti-virus software is known to be a possible cause of Status 84 errors on Windows Media Servers
b) Cyclic redundancy check errors indicate faulty hardware
 
I/O Error
 
I/O errors are caused at a hardware level, and are only detected by NetBackup.
 
For example:
 
11:20:18.246 [8504.5292] <4> write_data: WriteFile failed with: The request could not be performed because of an I/O device error. (1117); bytes written = 65536; size = 0
 
To further investigate I/O Errors, Symantec recommends contacting your hardware vendor.
 
Known issues
 
open failed in io_open I/O error 
 
This exact error can be caused by misconfiguration of the drives, so this should be checked in the first instance.  If the issue remains after confirmation that the configuration is correct, then the issue should be further investigated as a hardware or firmware issue.
 
  
External event has caused rewind
 
This issue is potentially serious and requires immediate investigation, as data can be lost.  NetBackup displays this error if the block position that it has calculated does not match the position reported by the drive.  It is not certain that a full rewind has occurred (this is impossible to tell from a simple block check), but the position check has failed, most likely because the position reported by the drive is less than the expected position.
 
The error will look similar to the following:
 
<2> io_terminate_tape: block position check: actual 4, expected 5 
<16> write_data: FREEZING media id XXXXXX, External event caused rewind during write, all data on media is lost
  
NetBackup keeps track of how much data it is sending to the operating system to write to the device. As an integrity check after the end of each write, NetBackup will ask the tape device for its position. If this position does not match what NetBackup has calculated the position should be, then the job will fail with a media write error.
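
The comparison described above can be illustrated with a short Python sketch that reads the 'block position check' lines from a bptm log and reports whether the drive is behind or ahead of the position NetBackup expected. The function name check_block_position and the verdict strings are hypothetical; the pattern is modelled on the log lines quoted in this TechNote.

```python
import re

# Pattern modelled on the bptm "block position check" lines in this TechNote.
POSITION_RE = re.compile(r"block position check: actual (\d+), expected (\d+)")


def check_block_position(line):
    """Return (actual, expected, verdict) for a position-check line, or None."""
    match = POSITION_RE.search(line)
    if match is None:
        return None
    actual, expected = int(match.group(1)), int(match.group(2))
    if actual == expected:
        verdict = "ok"
    elif actual < expected:
        # Fewer blocks on tape than were sent: consistent with a rewind
        # or reset during the write.
        verdict = "drive is behind the expected position"
    else:
        # More blocks on tape than were sent: consistent with a block
        # size mismatch in the tape/driver configuration.
        verdict = "drive is ahead of the expected position"
    return actual, expected, verdict


if __name__ == "__main__":
    for sample in (
        "<2> io_terminate_tape: block position check: actual 4, expected 5",
        "<2> write_data: block position check: actual 62504, expected 31254",
    ):
        print(check_block_position(sample))
```

The direction of the mismatch is a useful first clue, but it does not identify the faulty component; that still requires investigation at the hardware and operating system level.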
 
If a full rewind has occurred, the NetBackup header on the tape will be overwritten, making the tape unreadable. If this has happened, the data is lost.  The most common cause is a SCSI reset on the SAN, which causes a rewind of the drive(s) whilst they are being written to.  NetBackup cannot detect this event when it happens; it is only discovered afterwards, when the block position check is made.  NetBackup cannot cause SCSI resets on the SAN, so the cause has to be external (the tape positioning, read and write operations are controlled by the operating system).
 
If the issue is a position error (as opposed to a 'Full' rewind) a message similar to the following will be seen (bptm log).
 
<2> write_data: block position check: actual 62504, expected 31254 
<16> write_data: FREEZING media id XXXXXX, too many data blocks written, check tape/driver block size configuration
 
The possible causes are numerous, and most commonly include:
 
Tape driver issue
Tape drive firmware issue
SAN fault
HBA fault, driver or firmware issue
Switch Fault
 
If the drives are attached to an NDMP device, it must be ensured that the SCSI reservation type on the NDMP device matches the SCSI reservation type set in NetBackup.
 
To further investigate "External Event has caused rewind" issues, Symantec recommends contacting your hardware /operating system support vendor.
 
Note
 
The SCSI reservation is set /held by the Host Bus Adaptor, however NetBackup sends the reserve command through the SCSI pass-thru path for the device, so this needs to be configured correctly. 
 
Known Issues:
 
NDMP
 
If the issue is occurring on drives that are shared (SSO) between an NDMP filer and NBU, and the drives are zoned directly to the filer, the issue can be caused by the SCSI reservation type set in NBU not being the same as the SCSI reservation type set on the filer.
 
If this is the case the issue can be resolved following these steps :
 
1.  In the 'Host Properties' > 'Media Type' tab in NetBackup, check which SCSI reservation type is set: SPC2 or SCSI persistent.
2.  Change the type of SCSI reservation on the filer to match the type set in NBU.
3.  Reboot the robotic library to break all current reservations.
 
The following TechNote has a detailed explanation of SCSI reservation:  http://www.symantec.com/docs/HOWTO32767
  
HP-UX 11.31 IA64 / atdd driver
 
BPTM block position check fails one block short using IBM atdd driver 6.0.0.96 on HP-UX 11.31 IA64
 
This issue is actually caused by the HP ATDD driver writing the EOT mark incorrectly.  However, Symantec has produced a NetBackup 7.0.1 EEB to work around this issue (ETrack 2142743 /TECH155113).
Using the ATDD driver with NetBackup 7.0.1 and later on HP-UX 11.31 IA64 requires atdd driver 6.0.2.8 or later. Upgrading to the new ATDD driver resolves the problem.
  
Tapes not reaching capacity
 
 Issues where only (for example) 300GB of Data is written to a 400GB capacity tape ...
 
NBU passes data to the OS, one block at a time, to be written to the tape drive.  NBU has no understanding of tape capacity; in theory it would keep writing to the same tape 'forever'.
 
When the tape physically passes the 'logical-end-of-tape', this is detected by the tape drive firmware.  The tape drive firmware then sets a 'flag' in the tape driver (this would be the st driver in the case of Solaris).  There is physically enough tape for the current block to be written, so this is completed successfully. NBU then attempts to send the next block of data (via the operating system), but now the tape driver refuses, as the 'tape full' flag is set.  The st driver then passes this 'tape full' message to the operating system, which passes it to NetBackup.  Only when this has happened will NetBackup change the tape.
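
The sequence described above can be modelled with a toy simulation in Python. It is purely illustrative and does not represent any real driver interface: the class ToyTapeDrive, the function write_until_full and the capacity expressed in blocks are all hypothetical. The point it demonstrates is that the writer never knows the capacity; it simply keeps sending blocks until a write is refused.

```python
class ToyTapeDrive:
    """Toy model of a drive plus tape driver (not a real driver interface)."""

    def __init__(self, blocks_before_leot):
        # Hypothetical number of blocks that fit before logical end of tape.
        self.blocks_before_leot = blocks_before_leot
        self.blocks_written = 0
        self.tape_full = False  # the 'tape full' flag held by the driver

    def write_block(self):
        """Write one block; return False if the driver refuses the write."""
        if self.tape_full:
            return False
        self.blocks_written += 1
        # The block that reaches logical end of tape still completes;
        # the flag only affects the next write.
        if self.blocks_written >= self.blocks_before_leot:
            self.tape_full = True
        return True


def write_until_full(drive, blocks_to_send):
    """Send blocks one at a time until all are written or one is refused."""
    sent = 0
    while sent < blocks_to_send:
        if not drive.write_block():
            break  # only at this point would a new tape be requested
        sent += 1
    return sent


if __name__ == "__main__":
    drive = ToyTapeDrive(blocks_before_leot=3)
    print(write_until_full(drive, blocks_to_send=10), drive.tape_full)
```

In this model, a tape that 'fills early' corresponds to the drive raising the flag sooner than expected, which is why the investigation belongs with the drive firmware and hardware.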
 
Common causes of this issue are tape drive firmware, or faulty hardware.
 
There are no settings in NetBackup that influence tape capacity.  To further investigate Tape Capacity issues, Symantec recommends contacting your hardware vendor.
 
Tapes being incorrectly marked as 'read only'
 
NetBackup has no understanding of 'read only'.  This state is reported by the tape drive, and is usually set by means of a small write-protect switch on the tape cartridge.
Therefore, if a tape is being reported as 'read only', this issue cannot be the fault of NetBackup.
 
'Read only' is reported by the firmware of the tape drive and logged by NetBackup.  It is seen as a TapeAlert:
 
0x09: 'Cartridge write protected'
 
It has been seen on occasion that firmware issues of the tapedrive have caused tape media to be incorrectly reported as 'read only'. 
 
Library Inventory Issues
 
NetBackup does not directly 'inventory' a library. Instead it queries the library and waits to be told which tapes (barcodes) are located at which element addresses (slots/drives). If, for example, NetBackup 'cannot see' a particular cartridge, it is because the library is 'hiding' the location, not because of any setting within NetBackup.
 
For example, common symptoms of library issues include tapes appearing in the incorrect/ wrong slot, and tapes/ slots  not appearing at all.  It is impossible for this to be caused by NetBackup.
 
To further investigate Library issues, Symantec recommends contacting your hardware vendor.  
    
Note
 
Issues involving NetBackup and the Virtual I/O slots on the IBM 3500 series libraries where ALMS /Virtual I/O are enabled are occasionally seen.  
 
Problems involving Virtual I/O slots cannot be caused by NetBackup because there are no settings in NetBackup that can influence the behavior of the Virtual I/O slots.
 
It has been found that the library setting "Queued Exports" should be set to 'HIDE' from within the IBM web console to allow tapes to be moved from the virtual I/O slots to the slots within the logical library. 
 
Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"
 
This error is seen in the bptm log, and depending on the logging set, may be referenced in the ...volmgr/debug log, and the operating system event log
 
An excellent way to check this is to use the robtest command. A link to a TechNote documenting robtest is available at the end of this TechNote.
The robtest command does not issue any 'NetBackup' commands.  It only sends 'operating system' SCSI commands to the library, and the output seen from the command is sent from the library firmware.  Given this description, it is clear that robtest failures cannot be caused by NetBackup.
 
For example:
 
(Using robtest command to issue a move media request from slot 86 to drive 2)
 
m s86 d2
 
move_medium failed 
sense key = 0x4, asc = 0x15, ascq = 0x1, MECHANICAL POSITIONING ERROR 
 
As robtest has only sent a SCSI move request, it is immediately clear that this failure is not caused by NetBackup.
Further, the error references an 'ASC/ASCQ' code, which, as explained in the "ASC/ASCQ" section of this TechNote, is never caused by NetBackup.
 
To further investigate robtest issues, Symantec recommends contacting the hardware vendor.
   
Missing drives, or drives disappearing and reappearing
 
In some cases, tpautoconf -report_disc shows inconsistent numbers of missing devices when the command is run at different times.
 
tpautoconf -report_disc reports "Missing Device" if a device that is configured and available within NetBackup can no longer be detected by the operating system.
 
For example:
 ======================= Missing Device (Drive) ======================
 Inquiry = "IBM  Ultrium 3-SCSI 
 Serial Number =  HM74536FFS
 Drive Path = /dev/rmt/0cbn
 Drive Name = DRV_F2D3_LTO5
 
In this case, NetBackup is only reporting that the Operating System cannot find a device that was previously available.
 
If a different number of devices are missing at different times (that is, the devices 'disappear' and 'reappear') this is very likely a SAN issue.
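
To confirm that devices are disappearing and reappearing, the output of several tpautoconf -report_disc runs can be compared. The Python sketch below assumes the output has already been saved as text, one string per run; it does not run the command itself. The function names missing_serials and intermittent_devices are hypothetical, and the parsing is modelled only on the 'Missing Device' block shown above, so it may need adjusting for other output.

```python
def missing_serials(report_text):
    """Collect the serial numbers listed under 'Missing Device' headings."""
    serials = set()
    in_missing_block = False
    for raw_line in report_text.splitlines():
        line = raw_line.strip()
        if line.startswith("="):
            # A heading line: track whether it opens a Missing Device block.
            in_missing_block = "Missing Device" in line
        elif in_missing_block and line.startswith("Serial Number"):
            serials.add(line.split("=", 1)[1].strip())
    return serials


def intermittent_devices(reports):
    """Serial numbers that are missing in some runs but not in all of them."""
    per_run = [missing_serials(text) for text in reports]
    if not per_run:
        return set()
    ever_missing = set().union(*per_run)
    always_missing = set.intersection(*per_run)
    return ever_missing - always_missing


if __name__ == "__main__":
    run_1 = """
 ======================= Missing Device (Drive) ======================
 Serial Number =  HM74536FFS
 Drive Path = /dev/rmt/0cbn
"""
    run_2 = ""  # nothing reported missing on the second run
    print(intermittent_devices([run_1, run_2]))
```

A serial number returned by intermittent_devices was missing in at least one run and present in at least one other, which is the pattern described above as very likely a SAN issue.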
 
NetBackup has no control over the communication of devices between the device and the operating system.
 
If a device is showing as 'missing', it is because of an issue outside of NetBackup.  Problems on the SAN are a very common cause of this issue.

 

Tapes failing to mount in NetBackup, but visible and usable by operating system commands

Cases have been seen where  tapes are physically loaded into the tape drive, and are accessible and respond correctly to operating system commands such as mt and dd but NetBackup is unable to mount the tape.  The job hangs on the tape mount, failing with status 98 after some time.

Understandably, this could be seen to suggest NetBackup is at fault. However, on investigation it was found that the fault was caused by the tape drive firmware.

 

Issues moving tapes to/ from slots or drives

 

Failure to move tapes to/from slots or drives will have a cause outside of NetBackup.  Moving tapes is achieved via industry-standard SCSI commands, not NetBackup commands.

Various messages could be seen, depending on the exact fault, for example :

 

"Auto empty media export request rejected by TLDCD; Cannot move from media access port"

Here it is seen that an operation to empty the CAP/MAP during an inventory is failing.

Attempting to move the tape using robtest produced the following error:

Attempting to move the tape in port 1 of the CAP to slot 28

 

m p1 s28

Initiating MOVE_MEDIUM from address 10 to 1027

move_medium failed

sense key = 0x4, asc = 0x40, ascq = 0x1, UNKNOWN ERROR, KEY: 0x04, ASC: 0x40, ASCQ: 0x01

 

As seen in the ASC /ASCQ section earlier in this TechNote, errors such as this cannot be caused by NetBackup.

The cause in this case was that the robot was unable to access its own slots.

To investigate issues moving media within the robot, Symantec recommends contacting the hardware vendor.

 

 Issues with Cartridge memory

 LTO tapes contain a small EEPROM chip, known as LTO-CM.  Note that some other tape technologies contain similar technology.

This has multiple uses. For example, it is used by the drive to determine the LTO tape generation, and it holds an 'error log' and the manufacturer details of the tape.

It also contains information on the position of data contained on the tape, which allows for 'fast block positioning'.

Errors will be reported if the cartridge memory fails:

 

For example,  the following messages were reported by the Library, when a LTO-CM chip failed in a cartridge.

Description: The memory in the tape cartridge has failed.

Description: The tape drive encountered a problem while loading a tape cartridge.

Description: The tape drive detected an internal hardware problem.

Description: The tape drive has an error which requires the tape cartridge to be ejected for error recovery

 

These issues are related to hardware, and Symantec recommends contacting the hardware vendor for further investigation.

 

 Cleaning Tape
 
An unusual issue has been seen at NetBackup 7.5.  On occasion, a cleaning cycle run by NBU will fail.
 
The symptoms may differ slightly :
 
(i) - The tape cannot be unloaded. The /var/adm/messages log will show:
 
 
Mar 14 12:49:38 server02 tldcd[19756]: [ID 559682 daemon.notice] TLD(2) closing/unlocking robotic path
Mar 14 12:49:38  server02 tldcd[9536]: [ID 919746 daemon.notice] inquiry() function processing library ADIC     Scalar i2000     607A:
Mar 14 12:49:38  server02 tldd[9524]: [ID 583323 daemon.notice] DecodeClean: TLD(2) drive 5, Actual status: Unable to SCSI unload drive
Mar 14 12:49:39  server02 ltid[9497]: [ID 512328 daemon.notice] LTID - received ROBOT MESSAGE, Type=55, LongParam=0, Param1=1, Param2=10
Mar 14 12:49:39  server02 ltid[9497]: [ID 581313 daemon.error] Cleaning for drive 1 failed, status = Unable to SCSI unload drive
Mar 14 12:49:48  server02 bptm[19765]: [ID 946237 daemon.warning] TapeAlert Code: 0x0b, Type: Informational, Flag: CLEANING MEDIA, from drive PER-i2000-Drive5 (index 1), Media Id CLN001
Mar 14 12:49:49 server02 ltid[9497]: [ID 560358 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=1
 
(ii) - Once the tape drive is cleaned, a new tape is loaded and reloaded repeatedly. The /usr/openv/volmgr/debug/robots log will show:
 
12:43:53.753 [3016] <4> AddTldLtiReqEntry: Processing ROBOT_CLEAN request...
12:43:53.753 [3016] <5> CleanDrive: TLD(0) Cleaning Tape 4TP012 on drive 5, from slot 41
12:43:53.758 [3424] <4> io_open:        Drive Path = /dev/rmt/6cbn
...
12:43:54.018 [3026] <5> tldcd:mount_unmount_drive: Processing MOUNT, TLD(0) drive 5, slot 41, barcode CLN4TP012L4     , vsn 4TP012
...
12:46:01.764 [3026] <5> tldcd:mount_unmount_drive: Processing UNMOUNT, TLD(0) drive 5, slot 41, barcode CLN4TP012L4     , vsn 4TP012
...
 
12:46:12.789 [3016] <5> GetResponseStatus: DecodeClean: TLD(0) drive 5, Actual status: Unable to SCSI unload drive 
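
Both symptoms end with a 'DecodeClean' line reporting the status of the cleaning attempt. As a hedged illustration, the following Python sketch collects those lines from a list of log lines. The function name cleaning_failures is hypothetical and the pattern is modelled only on the two log excerpts above.

```python
import re

# Pattern modelled on the DecodeClean lines in the excerpts above (assumption).
DECODE_CLEAN_RE = re.compile(
    r"DecodeClean: TLD\((\d+)\) drive (\d+), Actual status: (.+)"
)


def cleaning_failures(lines):
    """Return (robot number, drive number, status) for each DecodeClean line."""
    results = []
    for line in lines:
        match = DECODE_CLEAN_RE.search(line)
        if match:
            robot, drive, status = match.groups()
            results.append((int(robot), int(drive), status.strip()))
    return results


if __name__ == "__main__":
    sample = [
        "Mar 14 12:49:38  server02 tldd[9524]: [ID 583323 daemon.notice] "
        "DecodeClean: TLD(2) drive 5, Actual status: Unable to SCSI unload drive",
    ]
    print(cleaning_failures(sample))
```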

The cause of this issue is the 'access bit' being set to 1 on the tape drive.
 
The issue is resolved with EEB 2714761
 
 Associated Documentation

http://www.symantec.com/docs/TECH124594  -  "Description of Tape Alerts and code definitions" 

http://www.symantec.com/docs/TECH83129 - "Robtest command that can be used to test the SCSI functionality of a robot"

 



