
Medias In DOWN status

Created: 12 Jun 2012 | 14 comments

Hi All,

Below is the case:

I have media server AAA (not the robot control host) with the following details:

NetBackup Client Platform = Linux, RedHat2.6
Version Name = 6.5

The backups are failing with Error bptm (pid=20539) NBJM returned an extended error status: All compatible drive paths are down but media is available (2009)

bptm log for pid=20539 on the media server:

16:52:06.962 [20539] <16> nbjm_media_request: NBJM returned an extended error status: All compatible drive paths are down but media is available (2009)
16:52:06.962 [20539] <2> send_MDS_msg: OP_STATUS 0 40105003 RG997 8211 5 0 0 0 0 0 0 *NULL* 0
16:52:06.966 [20539] <16> send_MDS_msg: Error from emmlib_handleMessage, Master ZZZ, type 12, returned error 2005023

16:51:58.362 [20539] <8> write_backup: media id O00504 load operation reported an error
16:52:06.370 [20539] <16> RequestSpanResources: MultiResReq.cpp:2608 resource request failed [2009]
16:52:06.370 [20539] <2> RequestSpanResources: retVal = 2009    emmStatus = 2005009
16:52:06.370 [20539] <2> RequestSpanResources: returning
16:52:06.371 [20539] <4> nbjm_media_request: Error from RequestSpanResources, Master ZZZ, error 2009, resourceAllocated 0

/var/log/messages shows :

Jun 12 16:51:48 AAA ltid[18025]: LTID - Sent ROBOTIC request, Type=3, Param2=55
Jun 12 16:51:48 AAA tldd[18389]: TLD(2) DismountTape ****** from drive 5
Jun 12 16:51:48 AAA tldd[18389]: DecodeDismount: TLD(2) drive 5, Actual status: Robotic dismount failure
Jun 12 16:51:49 AAA ltid[18025]: LTID - Sent ROBOTIC request, Type=3, Param2=48
Jun 12 16:51:49 AAA tldd[18389]: TLD(2) DismountTape ****** from drive 8
Jun 12 16:51:49 AAA tldd[18389]: DecodeDismount: TLD(2) drive 8, Actual status: Robotic dismount failure
Jun 12 16:51:51 AAA ltid[18025]: Operator/EMM server has DOWN'ed drive Drive094 (device 55)
Jun 12 16:51:52 AAA ltid[18025]: LTID - Sent ROBOTIC request, Type=1, Param2=0
Jun 12 16:51:52 AAA ltid[18025]: Operator/EMM server has DOWN'ed drive Drive099 (device 48)
Jun 12 16:51:52 AAA tldd[18389]: TLD(2) MountTape O00506 on drive 17, from slot 16
Jun 12 16:51:52 AAA tldd[18389]: DecodeMount: TLD(2) drive 17, Actual status: Unable to SCSI unload drive
Jun 12 16:51:53 AAA ltid[18025]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=45, Param2=0
Jun 12 16:51:55 AAA ltid[18025]: LTID - Sent ROBOTIC request, Type=1, Param2=0
Jun 12 16:51:55 AAA tldd[18389]: TLD(2) MountTape O00504 on drive 15, from slot 14
Jun 12 16:51:56 AAA tldd[18389]: DecodeMount: TLD(2) drive 15, Actual status: Unable to SCSI unload drive
Jun 12 16:51:56 AAA xinetd[13903]: START: vnetd pid=20881 from=10.20.200.20
Jun 12 16:51:56 AAA xinetd[13903]: START: vnetd pid=20882 from=10.20.200.20
Jun 12 16:51:56 AAA xinetd[13903]: EXIT: vnetd status=0 pid=20882 duration=0(sec)
Jun 12 16:51:56 AAA xinetd[13903]: EXIT: vnetd status=0 pid=20881 duration=0(sec)
Jun 12 16:51:56 AAA ltid[18025]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=42, Param2=0
Jun 12 16:51:59 AAA xinetd[13903]: START: vnetd pid=20887 from=10.20.200.20
Jun 12 16:51:59 AAA xinetd[13903]: START: vnetd pid=20888 from=10.20.200.20
Jun 12 16:51:59 AAA xinetd[13903]: EXIT: vnetd status=0 pid=20888 duration=0(sec)
Jun 12 16:51:59 AAA avrd[18391]: MTIOCGET failed on Drive092 (device 41, /dev/nst117) ioctl failed, Input/output error
Jun 12 16:51:59 AAA avrd[18391]: MTIOCGET failed on Drive079 (device 42, /dev/nst114) ioctl failed, Input/output error
Jun 12 16:51:59 AAA avrd[18391]: MTIOCGET failed on Drive093 (device 45, /dev/nst116) ioctl failed, Input/output error
Jun 12 16:51:59 AAA avrd[18391]: MTIOCGET failed on Drive076 (device 54, /dev/nst115) ioctl failed, Input/output error
 

I tried reconfiguring the drives with the GUI wizard and restarted the NBU services on the media server and the robot control host, but the problem persists.

When I try to UP the drives, they go DOWN again after some time.

This is urgent. Can someone please help?

 

 

 


Mark_Solutions's picture

This sounds like either a drive error or, more likely, something I have seen many times when tapes have been changed whilst drives were in use.

Use robtest (in volmgr/bin, i.e. /usr/openv/volmgr/bin on UNIX/Linux) on the robot control host to access the library.

Type in:

s d

to see the drives and the tapes in them. The output should show the barcode of each tape and, hopefully, the slot it came from.

Make a note of the slots they came from, if shown.

Next type in:

s s

to show the slots and check whether the listed slots are actually empty. I suspect they are not.

While you are there, look for some empty slots and make a note of them.

To move a tape from a drive (drive 1 in this example) to a slot (slot 11 in this example) use:

m d1 s11

Do that for each loaded drive so that all drives are empty.

Now run an inventory on the library so that NetBackup knows where all the tapes are.

Now UP the drives.

Hope this helps. The moral of the story is usually to use only the MAP to load and eject tapes, or not to change tapes while the drives are in use.

If this is not the case let me know
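
If there are many drives, scanning the s d output by eye gets tedious. Below is a minimal Python sketch that picks out the drives reporting a cartridge. It assumes the one-line-per-drive format quoted later in this thread (drive N (addr A) access = X Contains Cartridge = yes/no). The function name loaded_drives and the "yes" line in the sample are made up for illustration, and any other lines robtest prints are simply ignored.

```python
import re

# Assumed line format, as quoted in this thread:
#   drive 1 (addr 500) access = 1 Contains Cartridge = no
DRIVE_LINE = re.compile(
    r"drive\s+(?P<drive>\d+)\s+\(addr\s+(?P<addr>\d+)\)\s+access\s*=\s*\d+\s+"
    r"Contains Cartridge\s*=\s*(?P<loaded>yes|no)",
    re.IGNORECASE,
)


def loaded_drives(robtest_output):
    """Return the robot drive numbers whose line says a cartridge is present."""
    drives = []
    for line in robtest_output.splitlines():
        match = DRIVE_LINE.search(line)
        if match and match.group("loaded").lower() == "yes":
            drives.append(int(match.group("drive")))
    return drives


# Sample text: the "yes" line is hypothetical, the others follow the thread.
sample = """\
drive 1 (addr 500) access = 1 Contains Cartridge = no
drive 2 (addr 501) access = 1 Contains Cartridge = yes
drive 3 (addr 502) access = 1 Contains Cartridge = no
"""

print(loaded_drives(sample))
```

Each drive number returned is a candidate for the m dN sM move described above; the target slot still has to be chosen by hand from the s s output.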

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!

Andy Welburn's picture

It has happened to us on occasion. Another excellent source for detailing possible hardware issues is the following:

 

Troubleshooting Robot or Drive Issues in NetBackup
http://www.symantec.com/business/support/index?pag...

 

[[ One of Martin's favorites, I believe ;) ]]

mph999's picture

Indeed it is ...  (we still get the blame though ... )

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Marianne's picture

Either robot hardware issue or device config mismatch:

DismountTape ****** from drive 5
DecodeDismount: TLD(2) drive 5, Actual status: Robotic dismount failure

Mount requested for drive 17, but it seems there is already a tape in the drive that cannot be unloaded:

TLD(2) MountTape O00506 on drive 17, from slot 16
 DecodeMount: TLD(2) drive 17, Actual status: Unable to SCSI unload drive

The 'rewind and unload' request is issued against the OS device name (/dev/nst...). The equivalent OS 'mt' command is something like:
mt -f /dev/nst... rewoffl

When I see errors like this, I try to get NBU out of the way: I use robtest to test robot mounts and moves of tapes in and out of drives, and the OS mt command to see whether SCSI unload works at OS level.
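
Before testing, it helps to know which drives to test. Here is a rough Python sketch that groups the tldd and avrd errors from /var/log/messages by drive. The regular expressions are based only on the log lines quoted in the original post, so treat the patterns (and the function name summarize) as assumptions that may need adjusting for other NetBackup versions.

```python
import re
from collections import defaultdict

# tldd lines, e.g.:
#   ... tldd[18389]: DecodeMount: TLD(2) drive 17, Actual status: Unable to SCSI unload drive
TLDD = re.compile(
    r"tldd\[\d+\]: Decode(?:Mount|Dismount): TLD\((\d+)\) drive (\d+), Actual status: (.+)$"
)
# avrd lines, e.g.:
#   ... avrd[18391]: MTIOCGET failed on Drive092 (device 41, /dev/nst117) ioctl failed, ...
AVRD = re.compile(r"avrd\[\d+\]: MTIOCGET failed on (\S+) \(device (\d+), ([^)\s]+)\)")


def summarize(lines):
    """Return (robot drive number -> list of statuses, NBU drive name -> OS device path)."""
    robot_errors = defaultdict(list)
    os_errors = {}
    for line in lines:
        line = line.rstrip()
        match = TLDD.search(line)
        if match:
            robot_errors[int(match.group(2))].append(match.group(3))
            continue
        match = AVRD.search(line)
        if match:
            os_errors[match.group(1)] = match.group(3)
    return dict(robot_errors), os_errors


# Three lines copied from the /var/log/messages excerpt in the original post.
sample_lines = [
    "Jun 12 16:51:48 AAA tldd[18389]: DecodeDismount: TLD(2) drive 5, Actual status: Robotic dismount failure",
    "Jun 12 16:51:52 AAA tldd[18389]: DecodeMount: TLD(2) drive 17, Actual status: Unable to SCSI unload drive",
    "Jun 12 16:51:59 AAA avrd[18391]: MTIOCGET failed on Drive092 (device 41, /dev/nst117) ioctl failed, Input/output error",
]

robot, os_level = summarize(sample_lines)
print(robot)
print(os_level)
```

The first dictionary points at drives to exercise with robtest; the second gives the OS device paths to try with mt.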

 

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

mph999's picture

Adding to the outstanding posts above ...

If none of that works, I'd delete and re-add the drives (at OS level as well).

That sometimes fixes these issues ...

Martin

 

 
Marianne's picture

If I look at the device names - it seems that there are LOTS of them.

Assume this is VTL? 

Have you been able to isolate problem to VTL devices or physical or both?

Are all media servers accessing the VTL or physical tape library experiencing similar issues?

I am thinking of your other post where you were trying to delete and re-config after MISSING_PATH issues....

 


Mark_Solutions's picture

Good spot Marianne! If it is a VTL, it probably needs a reboot so that it re-inventories itself during boot-up.


Giri_S's picture

Marianne,

Yes, all media servers are accessing the VTL.

As Mark_Solutions suggested, we rebooted the VTL, but no luck.

In robtest, s d shows that no tape is loaded in any drive:

s d
drive 1 (addr 500) access = 1 Contains Cartridge = no
drive 2 (addr 501) access = 1 Contains Cartridge = no
drive 3 (addr 502) access = 1 Contains Cartridge = no
drive 4 (addr 503) access = 1 Contains Cartridge = no
drive 5 (addr 504) access = 1 Contains Cartridge = no
drive 6 (addr 505) access = 1 Contains Cartridge = no
drive 7 (addr 506) access = 1 Contains Cartridge = no
drive 8 (addr 507) access = 1 Contains Cartridge = no
drive 9 (addr 508) access = 1 Contains Cartridge = no
drive 10 (addr 509) access = 1 Contains Cartridge = no
drive 11 (addr 510) access = 1 Contains Cartridge = no
drive 12 (addr 511) access = 1 Contains Cartridge = no
drive 13 (addr 512) access = 1 Contains Cartridge = no
drive 14 (addr 513) access = 1 Contains Cartridge = no
drive 15 (addr 514) access = 1 Contains Cartridge = no
drive 16 (addr 515) access = 1 Contains Cartridge = no
drive 17 (addr 516) access = 1 Contains Cartridge = no
drive 18 (addr 517) access = 1 Contains Cartridge = no
drive 19 (addr 518) access = 1 Contains Cartridge = no
drive 20 (addr 519) access = 1 Contains Cartridge = no
drive 21 (addr 520) access = 1 Contains Cartridge = no
 

How should I proceed now?

 

Thanks.

Netbackup Admin (Unix)

Mark_Solutions's picture

Did you reboot the media servers after rebooting the VTL? They need to re-establish the connection correctly; then ensure all drives are showing as UP.

Also check that there are no allocations stuck somewhere: run nbrbutil -resetMediaServer mediaservername or, better still, nbrbutil -resetAll to clear all locks (do this when no jobs are active, since it releases every allocation).


Marianne's picture

nbrbutil -reset... sounds like a good suggestion.

I would still prefer testing at OS level as follows (You may want to stop NBU - robtest will still work):

  1. run 'tpconfig -l' to map robot drive numbers to NBU drive names and indexes as well as OS device path.
  2. Use robtest to mount a tape in a drive
  3. Check that OS device status reflects tape mount
    # mt -f /dev/rmt/nst.... status
    Compare output with one or more drives that should be empty
    (I know what to expect on Solaris server, unfortunately not Linux... Linux web sites/forums might help)
  4. See if you can SCSI unload tape drive:
    # mt -f /dev/nst... rewoffl
  5. Use robtest to move tape back to slot.

I know this sounds like a major exercise, but I come from OS background where our company sold standalone tape drives and we had only OS commands to test the hardware.
It is still the best way to eliminate third-party backup application and troubleshoot hardware and/or OS -> hardware issues.
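
If there are many drives to check, steps 3 and 4 can be scripted. This is only a sketch: mt_command and run_mt are hypothetical helper names, /dev/nst117 is borrowed from the avrd errors in the original post, and the device paths should really come from tpconfig. It wraps the same two mt operations used above (status and rewoffl) and refuses anything else.

```python
import subprocess

# Only the two mt operations used in the steps above are allowed.
ALLOWED_OPS = ("status", "rewoffl")


def mt_command(device, operation):
    """Build the argument list for: mt -f <device> <operation>."""
    if operation not in ALLOWED_OPS:
        raise ValueError(f"unsupported mt operation: {operation!r}")
    if not device.startswith("/dev/"):
        raise ValueError(f"not a device path: {device!r}")
    return ["mt", "-f", device, operation]


def run_mt(device, operation, timeout=120):
    """Run mt and return (exit code, combined output).

    Needs a real tape device and usually root, so it is not called here.
    """
    result = subprocess.run(
        mt_command(device, operation),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout + result.stderr


# Show the command that would be run for one drive; nothing is executed.
print(" ".join(mt_command("/dev/nst117", "status")))
```

A non-zero exit code from run_mt(device, "rewoffl") on a drive that robtest has just loaded would point at the OS or hardware layer rather than at NetBackup.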

 

 


Giri_S's picture

@Mark_

I have rebooted the robot control host as well as the device host after the VTL reboot.

I also ran nbrbutil -resetAll.

I am now trying Marianne's suggestion.

Thanks,

Giri


ashok.veritas's picture

Hi All,

We are also facing drive issues in our environment. The problem is mainly at weekends, when the full backups (huge database backups) run: on those days all drives go down on the master server as well as the media servers. When I checked /var/adm/messages, I noticed invalid drives and also found some drives disappearing from and reappearing in the fabric.

Please find the output of the commands below:

bash-3.00$ cat /var/adm/messages |grep -i disappear

Sep 10 04:12:41 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 disappeared from fabric
Sep 10 16:51:18 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 disappeared from fabric

cat /var/adm/messages |grep -i reappeared

Sep 10 04:12:51 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 reappeared in fabric
Sep 10 16:51:28 dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601 reappeared in fabric
------------------------------------------------------------------------------------------------------------------------------
The output of cat /var/adm/messages |grep -i drive is attached below.

Any help on this would be much appreciated.

Master server: Solaris 5.10

NetBackup version: 7.0

Tape library: IBMTS3580

Please let me know if more information is required.

 

 

Attachment: OUTPUT.txt (458.45 KB)

mph999's picture

Ashok,

You should create a new post; this post is for somebody else's issue.

You have a SAN issue, not a NetBackup issue. You should speak to your SAN team to troubleshoot.

Could well be a faulty HBA or switch.
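
Something concrete to hand to the SAN team: a small Python sketch that pairs each "disappeared from fabric" message with the next "reappeared in fabric" message for the same PWWN and reports how long the port was gone. The regular expression is based only on the fctl lines quoted above. Syslog timestamps carry no year, so the year parameter is an assumption, and the lines must be fed in chronological order (the two grep outputs merged and sorted).

```python
import re
from datetime import datetime

EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}).*PWWN=(?P<pwwn>[0-9a-fA-F]+)\s+"
    r"(?P<kind>disappeared from|reappeared in) fabric"
)


def outages(lines, year=2012):
    """Return a list of (pwwn, start time, seconds missing) tuples."""
    pending = {}  # pwwn -> time it disappeared
    result = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        # Month names are parsed with %b, so an English/C locale is assumed.
        stamp = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
        pwwn = match.group("pwwn")
        if match.group("kind").startswith("disappeared"):
            pending[pwwn] = stamp
        elif pwwn in pending:
            start = pending.pop(pwwn)
            result.append((pwwn, start, (stamp - start).total_seconds()))
    return result


# The four fctl messages from the post, in time order.
prefix = "dm4cfi fctl: [ID 517869 kern.warning] WARNING: fp(5)::N_x Port with D_ID=180cc, PWWN=500507630f50f601"
sample_lines = [
    f"Sep 10 04:12:41 {prefix} disappeared from fabric",
    f"Sep 10 04:12:51 {prefix} reappeared in fabric",
    f"Sep 10 16:51:18 {prefix} disappeared from fabric",
    f"Sep 10 16:51:28 {prefix} reappeared in fabric",
]

for pwwn, start, seconds in outages(sample_lines):
    print(pwwn, start.strftime("%b %d %H:%M:%S"), f"{seconds:.0f}s")
```

For the messages posted, both gaps work out to 10 seconds on the same PWWN, which is consistent with a flapping port or link rather than a drive being removed.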

Martin

 

 
Marianne's picture

I have moved the post to a new discussion:
https://www-secure.symantec.com/connect/forums/drives-disappearing-fabric

 
