Video Screencast Help

NDMP backups are failing with media write errors

Created: 08 Nov 2012 | 9 comments

Hi

We have NDMP backups configured for both emc celerra and sun 7410. Some of the shares in sun filer seems to be working fine but some are starting well and then fails with media write errors. The celerra backups are not working at all.

Netbackup version -7.5.0.3

OS version - Redhat 6.2

Any help would be appreciated. I have opened case with symantec and did not find any solution yet.

I can connect to both the filers from nbu servers.

[root@slprdsan6019 bptm]# tpautoconf -verify sl-emc-c1-dm2
Connecting to host "sl-emc-c1-dm2" as user "ndmp"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports TEXT authentication
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "server_2"
  os type "DartOS"
  os version "EMC Celerra File Server.T.6.0.51.6"
  host id "abc1997"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
 

Sun shares errors:

1/08/2012 15:02:11 - Info bptm (pid=31869) using 262144 data buffer size
11/08/2012 15:02:11 - Info bptm (pid=31869) start backup
11/08/2012 15:02:11 - Info bptm (pid=31869) Waiting for mount of media id 100174 (copy 1) on server slprdsan6019.lhr.stor.s.nokia.com.
11/08/2012 15:02:11 - granted resource  slprdsan6019.lhr.stor.s.nokia.com.NBU_CLIENT.MAXJOBS.sl-sun7410-1a
11/08/2012 15:02:11 - granted resource  slprdsan6019.lhr.stor.s.nokia.com.NBU_POLICY.MAXJOBS.sun-ndmp-vshards-prod-pr3
11/08/2012 15:02:11 - granted resource  100174
11/08/2012 15:02:11 - granted resource  HP.ULTRIUM4-SCSI.000
11/08/2012 15:02:11 - granted resource  slprdsan1001-hcart-robot-tld-0-sun7410a
11/08/2012 15:02:11 - estimated 0 kbytes needed
11/08/2012 15:02:11 - Info nbjm (pid=15579) started backup (backupid=sl-sun7410-1a_1352386931) job for client sl-sun7410-1a, policy sun-ndmp-vshards-prod-pr3, schedule Daily_Clnc on storage unit slprdsan1001-hcart-robot-tld-0-sun7410a
11/08/2012 15:02:11 - started process bpbrm (pid=31866)
11/08/2012 15:02:11 - connecting
11/08/2012 15:02:11 - connected; connect time: 0:00:00
11/08/2012 15:02:11 - mounting 100174
11/08/2012 15:03:16 - Info bptm (pid=31869) media id 100174 mounted on drive index 0, drivepath /dev/rmt/3n, drivename HP.ULTRIUM4-SCSI.000, copy 1
11/08/2012 15:03:16 - mounted 100174; mount time: 0:01:05
11/08/2012 15:03:21 - positioning 100174 to file 38
11/08/2012 15:04:25 - Info ndmpagent (pid=31868) sl-sun7410-1a: Direct Access Restore information is supported
11/08/2012 15:04:25 - positioned 100174; position time: 0:01:04
11/08/2012 15:04:25 - begin writing
11/08/2012 15:04:26 - Info ndmpagent (pid=31868) sl-sun7410-1a: Backing up "/export/vshards-pr-21".
11/08/2012 15:04:26 - Info ndmpagent (pid=31868) sl-sun7410-1a: Tape record size: 262144.
11/08/2012 15:04:26 - Info ndmpagent (pid=31868) sl-sun7410-1a: File history: Y.
11/08/2012 15:04:26 - Info ndmpagent (pid=31868) sl-sun7410-1a: Date of the last level '0': Sat Sep  8 09:00:20 2012.
11/08/2012 15:04:26 - Info ndmpagent (pid=31868) sl-sun7410-1a: Date of this level '1': Thu Nov  8 15:04:25 2012.
11/08/2012 15:04:26 - Info ndmpagent (pid=31868) sl-sun7410-1a: Update: TRUE.
11/08/2012 15:09:43 - Error ndmpagent (pid=31868) MOVER_HALTED media write error - reason = 5 (NDMP_MOVER_HALT_MEDIA_ERROR)
11/08/2012 15:09:43 - Info ndmpagent (pid=31868) sl-sun7410-1a: Runtime [/export/vshards-pr-21] 872876032 bytes (872876032): 312 seconds
11/08/2012 15:09:43 - Error ndmpagent (pid=31868) NDMP backup failed, path = /export/vshards-pr-21
11/08/2012 15:09:43 - Error ndmpagent (pid=31868) sl-sun7410-1a: Filesystem traverse error.
11/08/2012 15:09:45 - Error bptm (pid=31869) cannot write image to media id 100174, drive index 0
11/08/2012 15:09:45 - Error bptm (pid=31869) io_ioctl_ndmp (MTBSF) failed on media id 100174, drive index 0, return code 7 (NDMP_IO_ERR) (bptm.c.8757)
11/08/2012 15:09:50 - Info bptm (pid=31869) EXITING with status 84 <----------
11/08/2012 15:09:50 - Info ndmpagent (pid=0) done. status: 84: media write error
11/08/2012 15:09:50 - end writing; write time: 0:05:25
11/08/2012 15:19:50 - Info bpbrm (pid=516) sl-sun7410-1a is the host to backup data from
11/08/2012 15:19:50 - Info bpbrm (pid=516) reading file list from client
11/08/2012 15:19:50 - Info bpbrm (pid=516) starting ndmpagent on client
11/08/2012 15:19:50 - Info ndmpagent (pid=518) Backup started
11/08/2012 15:19:50 - Info bpbrm (pid=516) bptm pid: 519
11/08/2012 15:19:51 - Info bptm (pid=519) start
11/08/2012 15:19:51 - Info bptm (pid=519) using 32 data buffers
11/08/2012 15:19:51 - Info bptm (pid=519) using 262144 data buffer size
11/08/2012 15:19:51 - Info bptm (pid=519) start backup
11/08/2012 15:19:51 - Info bptm (pid=519) Waiting for mount of media id 100174 (copy 1) on server slprdsan6019.lhr.stor.s.nokia.com.
media write error  (84)
 

Celerra Errors

1/08/2012 14:32:15 - Info nbjm (pid=15579) starting backup job (jobid=4319) for client sl-emc-c1-dm2, policy emc-ndmp-mapr-access-prod, schedule Weekly_Full
11/08/2012 14:32:16 - Info bpbrm (pid=29762) sl-emc-c1-dm2 is the host to backup data from
11/08/2012 14:32:16 - Info bpbrm (pid=29762) reading file list from client
11/08/2012 14:32:16 - Info bpbrm (pid=29762) starting ndmpagent on client
11/08/2012 14:32:16 - Info ndmpagent (pid=29764) Backup started
11/08/2012 14:32:16 - Info bpbrm (pid=29762) bptm pid: 29765
11/08/2012 14:32:16 - Info bptm (pid=29765) start
11/08/2012 14:32:16 - estimated 0 kbytes needed
11/08/2012 14:32:16 - Info nbjm (pid=15579) started backup (backupid=sl-emc-c1-dm2_1352385136) job for client sl-emc-c1-dm2, policy emc-ndmp-mapr-access-prod, schedule Weekly_Full on storage unit slprdsan0021-hcart-robot-tld-0-emc-c1-dm2
11/08/2012 14:32:16 - started process bpbrm (pid=29762)
11/08/2012 14:32:16 - connecting
11/08/2012 14:32:16 - connected; connect time: 0:00:00
11/08/2012 14:32:17 - Info bptm (pid=29765) using 32 data buffers
11/08/2012 14:32:17 - Info bptm (pid=29765) using 1048576 data buffer size
11/08/2012 14:32:17 - Info bptm (pid=29765) start backup
11/08/2012 14:32:17 - Info bptm (pid=29765) Waiting for mount of media id 100118 (copy 1) on server slprdsan6019.lhr.stor.s.nokia.com.
11/08/2012 14:32:17 - mounting 100118
11/08/2012 14:47:24 - Error bptm (pid=29765) error requesting media, TpErrno = Robot operation failed
11/08/2012 14:47:24 - Warning bptm (pid=29765) media id 100118 load operation reported an error
11/08/2012 14:47:24 - current media 100118 complete, requesting next media Any
11/08/2012 14:47:59 - granted resource  100118
11/08/2012 14:47:59 - granted resource  HP.ULTRIUM4-SCSI.001
11/08/2012 14:47:59 - granted resource  slprdsan0021-hcart-robot-tld-0-emc-c1-dm2
11/08/2012 14:48:00 - Info bptm (pid=29765) Waiting for mount of media id 100118 (copy 1) on server slprdsan6019.lhr.stor.s.nokia.com.
11/08/2012 14:48:00 - mounting 100118
11/08/2012 14:53:06 - Info ndmpagent (pid=0) done
11/08/2012 14:53:06 - Error bptm (pid=29765) media manager terminated during mount of media id 100118, possible media mount timeout
11/08/2012 14:53:06 - Error bptm (pid=29765) media manager terminated by parent process
11/08/2012 14:53:06 - Error ndmpagent (pid=29764) NDMP backup failed, path = UNKNOWN
11/08/2012 14:53:06 - Info ndmpagent (pid=0) done. status: 150: termination requested by administrator
11/08/2012 14:53:06 - Error ndmpagent (pid=29764) connection 0x12f6fc0 ndmp_message_process_one failed, status = 18 (NDMP_XDR_DECODE_ERR)
11/08/2012 14:53:06 - Error ndmpagent (pid=29764) connection 0x12f9790 ndmp_message_process_one failed, status = 18 (NDMP_XDR_DECODE_ERR)
11/08/2012 14:53:06 - end writing
termination requested by administrator  (150)

Comments 9 CommentsJump to latest comment

Nicolai's picture

Sun Server:

11/08/2012 15:09:43 - Error ndmpagent (pid=31868) MOVER_HALTED media write error - reason = 5 (NDMP_MOVER_HALT_MEDIA_ERROR)

Celerra Errors

1/08/2012 14:47:24 - Error bptm (pid=29765) error requesting media, TpErrno = Robot operation failed

Both errors are related to tape/robot and not Netbackup itself. Verify the health of the tape drives on the Sun server. Use a cleaning cartridge to see if that help, else you may need to replace the drive

On the Celerra the robot can't mount a tape. Check the drives,look out for jammed tapes or obstructions that prevent the robot free movement. The robot control host should have status messages in the event log /syslog that indicate why (SCSI sense keys).

 

Assumption is the mother of all mess ups.

If this post answered your'e qustion -  Please mark as a soloution.

spenukon's picture

I tried doing a robtest and that seems to be working fine. We did have a jammed tape in a drive couple of weeks back and that has been removed. I did open a case with oracle to check the library anyways.

jim dalton's picture

Try putting that exact tape in the exact drive.Look at the timings 14:32 mounting 100118...15 minutes later it has still not done it.Maybe the robot door was open, theres a fault on it ....you have an admin gui for the robot? Have a look on there at the time of the errors. It does appear you have more than one error.

We have issues with ops forgetting theres live backups scheduled and they go to empty the robot...30 mins later the CAP is still open and we have mount failures.

spenukon's picture

Sorry for the late reply. We have couple of drive issues and that has been replaced. But we still have issue with emc filer mounting a tape. It tries to mount a tape and then it says "Robot operation failed". Neither emc nor symantec has helped me so far. This is really confusing. Any help would be greatly appreciated.

revaroo's picture

Has NetBckup requested the tape? If so, it's down to the OS/Hardware to mount the tape not NetBackup.

spenukon's picture

Netbackup does says it's mounting a tape but nothing is happening after that. I also have sun filer doing the ndmp backups ad that was able to mount the tape in the same drive. Only emc filer is struggling with all the 3 drives.

 

1/14/2012 19:53:50 - mounting 100134
11/14/2012 20:08:51 - Error bptm (pid=8543) error requesting media, TpErrno = Robot operation failed
11/14/2012 20:08:52 - Warning bptm (pid=8543) media id 100134 load operation reported an error
11/14/2012 20:08:52 - current media 100134 complete, requesting next media Any
11/14/2012 20:09:26 - current media -- complete, awaiting next media Any. Waiting for resources.
          Reason: Drives are in use, Media server: slprdsan6019.lhr.stor.s.nokia.com,
          Robot Type(Number): TLD(0), Media ID: N/A, Drive Name: N/A,
          Volume Pool: NetBackup, Storage Unit: slprdsan0021-hcart-robot-tld-0-emc-c1-dm2, Drive Scan Host: N/A,
          Disk Pool: N/A, Disk Volume: N/A
11/14/2012 20:10:31 - Waiting for scan drive stop HP.ULTRIUM4-SCSI.007, Media server:
11/14/2012 20:10:32 - Info bptm (pid=8543) Waiting for mount of media id 100134 (copy 1) on server slprdsan6019.
11/14/2012 20:10:32 - granted resource  100134
11/14/2012 20:10:32 - granted resource  HP.ULTRIUM4-SCSI.007
11/14/2012 20:10:32 - granted resource  slprdsan0021-hcart-robot-tld-0-emc-c1-dm2
11/14/2012 20:10:32 - mounting 100134
11/14/2012 20:25:32 - Error bptm (pid=8543) error requesting media, TpErrno = Robot operation failed
11/14/2012 20:25:32 - Warning bptm (pid=8543) media id 100134 load operation reported an error
11/14/2012 20:25:32 - current media 100134 complete, requesting next media Any
11/14/2012 20:26:06 - Info bptm (pid=8543) Waiting for mount of media id 100500 (copy 1) on server
11/14/2012 20:26:06 - granted resource  100500
11/14/2012 20:26:06 - granted resource  HP.ULTRIUM4-SCSI.002
11/14/2012 20:26:06 - granted resource  slprdsan0021-hcart-robot-tld-0-emc-c1-dm2
11/14/2012 20:26:06 - end writing
11/14/2012 20:26:06 - mounting 100500
 

revaroo's picture

Have you tried deleting and re-adding the drives and robot - that can sometimes kick things back to life?

spenukon's picture

I tried doing that already, did not help. Rebooted the library, cleaned the drives, no luck.

Andy Welburn's picture

Have a look to see if you can utilise Martins excellent script or parts thereof:

https://www-secure.symantec.com/connect/downloads/...

- altho' written for Solaris, you *may* be able to get it to work on Linux relatively easily if not directly.

 

Basically, it collates information in the media errors file to help determine if there are issues with specific media or specific drives.

Personally I'm of the opinion that you still have issues with your drives if not your robot (arm), but if you can investigate using Martins script or similar then you have possibly the right ammunition to raise a support ticket with your library hardware vendor.

 

I used to have to pull information from the errors file manually for this exact reason - shame I didn't have access to Martins script then!