Veritas NBU6.5-Driver I/O Error

Created: 27 Feb 2013 | 24 comments
Today, when I checked the Veritas NBU 6.5 backup status, I found that one policy shows an error, as follows:
begin writing
Error bptm (pid=19808) cannot write image to media id 000054, drive index 1, I/O error
end writing
status code 84
 
Please give some suggestions.
Thanks

RamNagalla's picture

Hi,

We need more details:

1) What is the operating system of your media server?

2) What is your hardware? Please give the tape library details.

3) Are you seeing this error for a particular media or drive, or is it random across drives and media?

4) Let us know the output of the commands below:

scan

vmoprcmd -d

tpconfig -d

tpautoconf -t

5) Also send the bptm logs and /usr/openv/netbackup/db/media/errors.

6) Send the detailed status of the failed job.

Also, did you check whether the tape is write-protected?

Does it give the error after writing some data, or without writing any data?

phuong.huynh@svtech.com.vn's picture

Hi,

1) What is the operating system of your media server?

The media server is running Solaris 10 SPARC.

2) What is your hardware? Please give the tape library details.

The tape library is an SL500.

3) Are you seeing this error for a particular media or drive, or is it random across drives and media?

It is random across media.

4) Let us know the output of the commands below:

See the attached file "sl500_cmd_logs".

5) Also send the bptm logs and /usr/openv/netbackup/db/media/errors.

See the attached file "errors".

6) Send the detailed status of the failed job.

See the attached pictures image005 and image006.

Also, did you check whether the tape is write-protected?

Yes, I have checked; the tape is not write-protected.

Thank you.

image005.png image006.png
Attachment  Size
sl500_cmd_logs.txt 7.52 KB
Marianne's picture

NetBackup relies on the OS for I/O. This means that NBU is merely reporting the error, and we are not going to get a lot of info by looking at NBU alone.

If you are seeing regular status 84s, then /usr/openv/netbackup/db/media/errors will help us determine whether the I/O errors occur on a particular tape drive or particular media.

You also need to enable the following logs on the media server:

Create /usr/openv/netbackup/logs/bptm folder

Add VERBOSE entry to /usr/openv/volmgr/vm.conf and restart NBU on media server.
Device-related messages and errors will now be logged to /var/adm/messages.

Some helpful TNs:

http://www.symantec.com/docs/TECH169477

http://www.symantec.com/docs/TECH43243

Please see this extract from above doc:

As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes; OS configuration, a problem on the SCSI path, or a problem with the device.

**** PS **** Are you aware that support for NBU 6.5 ended in October last year?
PLEASE upgrade!

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

phuong.huynh@svtech.com.vn's picture

Hi,

I have sent the logs you requested; please see the attached files. Also, I cannot find vm.conf in the directory /usr/openv/volmgr.

Attachment  Size
bptm.rar 571.59 KB
errors.txt 33.47 KB
RamNagalla's picture
01:23:35.351 [19808] <2> write_data: attempting write error recovery, err = 5
01:23:35.351 [19808] <2> tape_error_rec: error recovery to block 10158002 requested
01:23:35.351 [19808] <2> tape_error_rec: attempting error recovery, delay 3 minutes before next attempt, tries left = 5
01:26:35.353 [19808] <2> io_ioctl: command (0)MTWEOF 0 from (overwrite.c.503) on drive index 1
01:26:35.354 [19808] <2> io_ioctl: MTWEOF failed during error recovery, I/O error
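
The "err = 5" in the first line above looks like the errno value the operating system returned for the failed write. On Solaris and Linux, errno 5 is EIO (I/O error), which matches the text of the status 84 message. A minimal Python sketch for pulling that code out of a bptm line, assuming log lines keep the "err = N" form shown above (the helper name errno_name is hypothetical, not part of NetBackup):

```python
import errno
import re

# Matches the "err = N" suffix seen in the bptm excerpt above.
ERR_RE = re.compile(r"err = (\d+)")

def errno_name(log_line):
    """Return the symbolic errno name found in a bptm line, or None if absent."""
    match = ERR_RE.search(log_line)
    if match is None:
        return None
    # errno.errorcode maps numeric codes to names on the machine running this.
    return errno.errorcode.get(int(match.group(1)), "UNKNOWN")

line = "01:23:35.351 [19808] <2> write_data: attempting write error recovery, err = 5"
print(errno_name(line))
```

Because the code comes from the OS, it says only that the write failed at or below the driver level; it does not say which component on the path is at fault.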
 
See the related tech notes.

Both point to updating the tape drive driver and correcting the I/O issue at the hardware level.
phuong.huynh@svtech.com.vn's picture

I have 6 tape drives. How can I tell which tape drive is faulty? I checked the service warning LED on the SL500, but it shows no errors.

mph999's picture

You do not necessarily see a warning light - the drive has no mechanical fault as such, it just can't read/write reliably.

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
mph999's picture

From the errors.txt file, covering the last 2 or 3 months, I find:

(I ran this through a script; the file alone does not contain the information in this format.)

HP.ULTRIUM4-SCSI.000 has had errors with 2 different tapes   (Total occurrences (errors) for this drive is 2)
HP.ULTRIUM4-SCSI.001 has had errors with 65 different tapes   (Total occurrences (errors) for this drive is 122)
HP.ULTRIUM4-SCSI.002 has had errors with 2 different tapes   (Total occurrences (errors) for this drive is 2)
HP.ULTRIUM4-SCSI.003 has had errors with 14 different tapes   (Total occurrences (errors) for this drive is 26)
HP.ULTRIUM4-SCSI.004 has had errors with 13 different tapes   (Total occurrences (errors) for this drive is 18)
HP.ULTRIUM4-SCSI.005 has had errors with 57 different tapes   (Total occurrences (errors) for this drive is 91)
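
The script used for the summary above was not posted. A rough Python sketch of a similar tally follows, assuming each line of the errors file has the layout seen in the excerpt posted later in this thread (date, time, media ID, drive index, error type, drive name). The function name and the choice to count only WRITE_ERROR entries are assumptions for illustration.

```python
from collections import defaultdict

def summarize_errors(lines, error_types=("WRITE_ERROR",)):
    """Return {drive_name: (distinct_tapes, total_entries)} for the chosen error types."""
    occurrences = defaultdict(int)
    tapes = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Assumed layout: date time media_id drive_index error_type drive_name [extra...]
        if len(fields) < 6 or fields[4] not in error_types:
            continue
        media_id, drive_name = fields[2], fields[5]
        occurrences[drive_name] += 1
        tapes[drive_name].add(media_id)
    return {drive: (len(tapes[drive]), occurrences[drive]) for drive in occurrences}

# A few lines copied from the errors file excerpt in this thread.
sample = [
    "03/05/13 01:20:39 000144 3 WRITE_ERROR HP.ULTRIUM4-SCSI.003",
    "03/05/13 04:40:50 000017 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001",
    "03/05/13 04:40:55 000017 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x10000000 0x00000000",
    "03/06/13 06:23:15 000141 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001",
]
summary = summarize_errors(sample)
for drive, (tape_count, total) in sorted(summary.items()):
    print(f"{drive} has had errors with {tape_count} different tapes (total occurrences: {total})")
```

To run it against the real file, pass open("/usr/openv/netbackup/db/media/errors") instead of the sample list.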
 
Two drives show many more errors than the other drives. Assuming the drives are each used a similar amount, this might suggest they have some issues.
 
The tapes that errored in these two drives (001 and 005) did not show significantly higher numbers of errors in the other drives in which the same media had errors. In other words, your drives look worn, not your tapes.
 
You seem to have write errors.
 
It is a lot 'harder' for a drive to write to a tape than read it, therefore, as a drive wears out, I would expect it to start to fail with write errors before read, which is what we see.
 
If this issue started with no changes made to the environment, I would not expect it to be driver or firmware related.
 
Martin
 
Marianne's picture

... I don't find vm.conf in the directory /usr/openv/volmgr.

In older versions of NBU the vm.conf file does not exist by default.
Please create the file and insert 
VERBOSE
in the file. 
Save the file and restart NBU.

2013's picture

Please go ahead and clean the tape drives, then run a backup and see how it goes.

Also check when each drive was last cleaned by running the command below:

/usr/openv/volmgr/bin/tpclean -L

I know the SL500 tape library has its own functionality to clean the tape drives, but you need to check what cleaning frequency is set for the drives:

/usr/openv/volmgr/bin/tpclean -F drive_name cleaning_frequency

Hope it helps!!!

phuong.huynh@svtech.com.vn's picture

Hi all,

I have upgraded the firmware for the tape drives and the library and also patched the OS, but the error is not fixed.

Please advise.

Marianne's picture

Have you created vm.conf with VERBOSE entry yet?

Can you see that the Media Manager processes are running with -v?

Have you checked /var/adm/messages for hardware errors?

There is more to the data path than just the library and tape drives - there is also the HBA in the server, the cable(s) that go to a switch, the switch port(s), and the cables that go to each of the drives.

As I've said before, looking at NBU only is not going to tell us much. You need to troubleshoot at OS level. 
Switch logs may also help.

The error log is telling us that you are experiencing errors on basically all the drives and lots of tapes. Chances are slim that all of them are faulty. What is the common factor that links all drives to the OS? The HBA comes to mind, right?
An HBA is also more than just a piece of hardware: there are firmware and drivers that must be checked along with the hardware. /var/adm/messages is a good starting point to look for device-related errors.

phuong.huynh@svtech.com.vn's picture

I have created vm.conf with the VERBOSE entry.

In the policies that I ran, the backup failed on tape drives 001 and 003 (please see the attached files). But writing to each of those drives with the OS tar command works.

root@Nbmaster2 # tar cvf /dev/rmt/10 explominer.tar
a explominer.tar 24244 tape blocks
root@Nbmaster2 #
root@Nbmaster2 #
root@Nbmaster2 # tar cvf /dev/rmt/7 explominer.tar
a explominer.tar 24244 tape blocks
root@Nbmaster2 #

Attachment  Size
billing_rman_level0_FAILED_OK.txt 5.04 KB
billing_rman_level1_FAILED.txt 2.65 KB
mobicard_rman_level0_OK.txt 2.51 KB
mobicard_rman_level1_OK.txt 3.24 KB
Marianne's picture

What is the status of the tape drives? Check with 'vmoprcmd -d'.
Have you checked the bptm log and the messages file for errors?

The ability to write with the tar command confirms that the I/O errors are intermittent.
Old firmware on an HBA is known for giving errors when the load is high.

phuong.huynh@svtech.com.vn's picture

The status of the tape drives is OK.

root@Nbmaster2 # /usr/openv/volmgr/bin/vmoprcmd -d

                                PENDING REQUESTS

                                     <NONE>

                                  DRIVE STATUS

Drv Type   Control  User      Label  RecMID  ExtMID  Ready   Wr.Enbl.  ReqId
  0 hcart    TLD                -                     No       -         0  
  1 hcart    TLD                -                     No       -         0  
  2 hcart    TLD                -                     No       -         0  
  3 hcart    TLD                -                     No       -         0  
  4 hcart  DOWN-TLD             -                     No       -         0  
  5 hcart    TLD                -                     No       -         0  

                             ADDITIONAL DRIVE STATUS

Drv DriveName            Shared    Assigned        Comment                   
  0 HP.ULTRIUM4-SCSI.000  No       -                                         
  1 HP.ULTRIUM4-SCSI.001  No       -                                         
  2 HP.ULTRIUM4-SCSI.002  No       -                                         
  3 HP.ULTRIUM4-SCSI.003  No       -                                         
  4 HP.ULTRIUM4-SCSI.004  No       -                                         
  5 HP.ULTRIUM4-SCSI.005  No       -                                         
root@Nbmaster2 #

I will suggest to the hardware team that they upgrade the firmware of the HBA card.

Marianne's picture
 4 hcart  DOWN-TLD             -                     No       -         0  

 4 HP.ULTRIUM4-SCSI.004  No       -                                         

Drive 004 is DOWN. Have you checked bptm log and messages files as suggested previously?
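
A DOWN drive is easy to miss in the 'vmoprcmd -d' listing. A small Python sketch for flagging it, assuming the DRIVE STATUS rows keep the column order shown in this thread (Drv, Type, Control, ...); the function name down_drives is hypothetical:

```python
def down_drives(vmoprcmd_output):
    """Return the drive indexes whose Control column starts with DOWN."""
    down = []
    for line in vmoprcmd_output.splitlines():
        fields = line.split()
        # Drive rows start with a numeric index; the third column is Control.
        if len(fields) >= 3 and fields[0].isdigit() and fields[2].startswith("DOWN"):
            down.append(int(fields[0]))
    return down

# Rows copied from the vmoprcmd -d output posted in this thread.
sample = """\
Drv Type   Control  User      Label  RecMID  ExtMID  Ready   Wr.Enbl.  ReqId
  3 hcart    TLD                -                     No       -         0
  4 hcart  DOWN-TLD             -                     No       -         0
  5 hcart    TLD                -                     No       -         0
Drv DriveName            Shared    Assigned        Comment
  4 HP.ULTRIUM4-SCSI.004  No       -
"""
print(down_drives(sample))
```

In practice the text could come from running /usr/openv/volmgr/bin/vmoprcmd -d through subprocess.run with capture_output=True and text=True.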

You need to do some homework before suggesting a firmware upgrade.
Check the messages file (or a backup of the messages file) for boot messages (who -b will tell you when the server was last rebooted). You will find the HBA make and model along with the firmware and driver versions.
While you have messages file open, look for hardware-related errors.
Look on hba vendor's web site for known issues with the firmware and driver versions.

About drives not getting used, check for stuck/orphaned device allocation:
nbrbutil -dump
Check the 'MDS Allocation' section at the bottom of the output for media or drive allocations that are not really in use, note the Allocation Key number, and release with:
nbrbutil -releaseMDS <mdsAllocationKey>
 

phuong.huynh@svtech.com.vn's picture

/usr/openv/netbackup/db/media/errors

03/05/13 01:20:39 000144 3 WRITE_ERROR HP.ULTRIUM4-SCSI.003
03/05/13 01:20:44 000144 3 TAPE_ALERT HP.ULTRIUM4-SCSI.003 0x10000000 0x00000000
03/05/13 04:40:50 000017 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/05/13 04:40:55 000017 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x10000000 0x00000000
03/06/13 06:23:15 000141 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/06/13 06:23:20 000141 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x10000000 0x00000000
03/06/13 21:24:19 000051 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/06/13 21:24:24 000051 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x10000000 0x00000000
03/06/13 21:59:51 000130 3 WRITE_ERROR HP.ULTRIUM4-SCSI.003
03/06/13 21:59:56 000130 3 TAPE_ALERT HP.ULTRIUM4-SCSI.003 0x10000000 0x00000000
03/06/13 23:38:21 000010 2 TAPE_ALERT HP.ULTRIUM4-SCSI.002 0x10000000 0x00000000
03/07/13 08:43:50 000019 5 TAPE_ALERT HP.ULTRIUM4-SCSI.005 0x10000000 0x00000000
03/07/13 09:01:47 000143 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/07/13 09:01:52 000143 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x10000000 0x00000000
root@Nbmaster2 #

root@Nbmaster2 # /usr/openv/volmgr/bin/tpconfig -update -drive 4 -drstatus UP
Updated drive < HP.ULTRIUM4-SCSI.004 > of type hcart in configuration
root@Nbmaster2 #
root@Nbmaster2 #
root@Nbmaster2 # /usr/openv/volmgr/bin/tpconfig -d
Id  DriveName           Type   Residence
      Drive Path                                                       Status
****************************************************************************
0   HP.ULTRIUM4-SCSI.000 hcart  TLD(0)  DRIVE=3
      /dev/rmt/8cbn                                                    UP
1   HP.ULTRIUM4-SCSI.001 hcart  TLD(0)  DRIVE=5
      /dev/rmt/10cbn                                                   UP
2   HP.ULTRIUM4-SCSI.002 hcart  TLD(0)  DRIVE=6
      /dev/rmt/11cbn                                                   UP
3   HP.ULTRIUM4-SCSI.003 hcart  TLD(0)  DRIVE=2
      /dev/rmt/7cbn                                                    UP
4   HP.ULTRIUM4-SCSI.004 hcart  TLD(0)  DRIVE=1
      /dev/rmt/6cbn                                                    UP
5   HP.ULTRIUM4-SCSI.005 hcart  TLD(0)  DRIVE=4
      /dev/rmt/9cbn                                                    UP

Currently defined robotics are:
  TLD(0)     robotic path = /dev/sg/c1tw500104f000b88092l0

EMM Server = Nbmaster2

root@Nbmaster2 #
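
The 'tpconfig -d' listing above ties each NetBackup drive name to its OS device path, which helps when matching errors-file entries to the device used in an OS-level test. A Python sketch that builds that mapping, assuming the two-line-per-drive layout shown above (the function name drive_paths is hypothetical):

```python
def drive_paths(tpconfig_output):
    """Map drive index -> (drive name, device path, status) from 'tpconfig -d' text."""
    mapping = {}
    current = None
    for line in tpconfig_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].isdigit() and len(fields) >= 2:
            # First line of a drive entry: index and drive name.
            current = (int(fields[0]), fields[1])
        elif current is not None and fields[0].startswith("/dev/") and len(fields) >= 2:
            # Second line of the entry: device path and UP/DOWN status.
            index, name = current
            mapping[index] = (name, fields[0], fields[1])
            current = None
    return mapping

# Two entries copied from the tpconfig -d output in this thread.
sample = """\
1   HP.ULTRIUM4-SCSI.001 hcart  TLD(0)  DRIVE=5
      /dev/rmt/10cbn                                                   UP
3   HP.ULTRIUM4-SCSI.003 hcart  TLD(0)  DRIVE=2
      /dev/rmt/7cbn                                                    UP
"""
for index, (name, path, status) in sorted(drive_paths(sample).items()):
    print(index, name, path, status)
```

For example, drives 001 and 003 from the errors file map to /dev/rmt/10cbn and /dev/rmt/7cbn, the same tape units written to in the earlier tar test.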

Marianne's picture

Seems you are ignoring my advice to check bptm log and /var/adm/messages.

I give up....

Good luck!

phuong.huynh@svtech.com.vn's picture

I sent the /var/adm/messages log to the hardware team for review. They concluded that the error only appears in the log when data is written to tape through NetBackup; no error is found when using OS commands.

I just want to confirm whether the NetBackup configuration is correct and whether this is a bug in Veritas NBU 6.5.

Anyway, thank you very much for your support.

Marianne's picture

I'm repeating the above extract from the Status 84 Troubleshooting Guide:

As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes; OS configuration, a problem on the SCSI path, or a problem with the device.

Your 'tar' tests are writing small amounts of data (24244 tape blocks) to one tape drive at a time.
This proves nothing.

Repeat the test with more data (+- 5 GB) and write to all 6 drives at the same time.
HBA firmware and/or driver issues normally show up when high I/O is experienced.
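
One way to drive all six tape devices at once is sketched below in Python. It assumes the device paths listed by 'tpconfig -d' earlier in this thread and a hypothetical test data path; the helper names are illustrative only. The line that actually starts the writes is left commented out because it overwrites whatever tapes are loaded, so only scratch media should be in the drives.

```python
import subprocess

def build_tar_commands(devices, source_path):
    """Build one 'tar cvf <device> <source>' command per tape device."""
    return [["tar", "cvf", device, source_path] for device in devices]

def run_in_parallel(commands):
    """Start every command at once, wait for all, and return exit codes in order."""
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL) for cmd in commands]
    return [proc.wait() for proc in procs]

# Device paths as listed by 'tpconfig -d' in this thread.
DEVICES = [
    "/dev/rmt/6cbn", "/dev/rmt/7cbn", "/dev/rmt/8cbn",
    "/dev/rmt/9cbn", "/dev/rmt/10cbn", "/dev/rmt/11cbn",
]

# /backup_test/5gb_sample is a hypothetical path holding roughly 5 GB of data.
commands = build_tar_commands(DEVICES, "/backup_test/5gb_sample")

# Uncomment to run the test; a non-zero exit code points at the failing device.
# exit_codes = run_in_parallel(commands)
```

While the writes are running, watching /var/adm/messages for device errors shows whether the load triggers anything at the OS level.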

phuong.huynh@svtech.com.vn's picture

Thank you for the advice, I will follow your instructions.

Abhishek Tomar's picture

Also check the output of iostat -En and see whether any errors are reported for the drives.

If yes, check with hardware vendor

Abhishek Tomar's picture

Run iostat -En on the media server and see whether the drives report any errors.

if yes, check with vendor

mph999's picture

NetBackup does NOT write to drives - ever.

NBU sends the data to the operating system, the operating system then writes it to the drive, using the blocksize requested by NBU.

I/O errors are not caused by NBU.

Martin
