
NDMP Backups failing

Updated: 21 May 2010 | 9 comments
Rwigg
This issue has been solved.

Greetings,

We recently moved two of our NetApp filers to a new datacenter, and now when I try to back up some of the larger volumes I receive the error "NDMP backup failure(99)". The detailed status shows the following lines:

12/15/2009 2:24:21 PM - Error ndmpagent(pid=3704) MOVER_HALTED unexpected reason = 3 (NDMP_MOVER_HALT_INTERNAL_ERROR)      
12/15/2009 2:24:23 PM - Error ndmpagent(pid=3704) NDMP backup failed, path = /vol/epvol5/

I have run successful backups from these two filers since the move, and I am not seeing any errors in the NetApp syslogs. Has anyone else run into this problem, or does anyone have any ideas?
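
For anyone who wants to sift a long detailed status for just the error lines, here is a minimal Python sketch. It assumes the line layout shown in the two lines quoted above (timestamp, " - Error ", process name, pid, message); the function name `parse_errors` is only illustrative and is not part of NetBackup.

```python
import re
from datetime import datetime

# Matches lines such as:
# 12/15/2009 2:24:21 PM - Error ndmpagent(pid=3704) MOVER_HALTED ...
ERROR_LINE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) - Error "
    r"(?P<proc>\w+)\(pid=(?P<pid>\d+)\) (?P<msg>.*)$"
)


def parse_errors(text):
    """Return one dict per 'Error' line found in a job's detailed status."""
    errors = []
    for line in text.splitlines():
        m = ERROR_LINE.match(line.strip())
        if m:
            errors.append({
                "time": datetime.strptime(m["ts"], "%m/%d/%Y %I:%M:%S %p"),
                "process": m["proc"],
                "pid": int(m["pid"]),
                "message": m["msg"].strip(),
            })
    return errors


# The two lines from the post above, pasted as-is.
SAMPLE = """\
12/15/2009 2:24:21 PM - Error ndmpagent(pid=3704) MOVER_HALTED unexpected reason = 3 (NDMP_MOVER_HALT_INTERNAL_ERROR)
12/15/2009 2:24:23 PM - Error ndmpagent(pid=3704) NDMP backup failed, path = /vol/epvol5/
"""

for err in parse_errors(SAMPLE):
    print(err["time"].isoformat(), err["process"], err["pid"], err["message"])
```

Non-error lines (mounting, positioning, and so on) are simply skipped, so the same function can be pointed at a full job log.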

Comments

lu, 17 Dec 2009

Can you run "tpautoconf -verify yournas"?

Rwigg, 17 Dec 2009

tpautoconf -verify

Yes, I can run tpautoconf -verify against both filers from the media server. The output is below.

C:\Program Files\Veritas\Volmgr\bin>tpautoconf -verify filer301c
Connecting to host "filer301c" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "filer301c"
  os type "NetApp"
  os version "NetApp Release 7.2.6.1P2"
  host id "0151703198"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Host has SnapVault Secondary license installed

C:\Program Files\Veritas\Volmgr\bin>tpautoconf -verify filer300c
Connecting to host "filer300c" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "filer300c"
  os type "NetApp"
  os version "NetApp Release 7.2.6.1P2"
  host id "0151703082"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Host has SnapVault Secondary license installed
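
If you have several filers to compare, the fields that matter in this output can be pulled out with a few regular expressions. The following is a rough Python sketch that works on captured text only (it does not run tpautoconf itself); `summarize_verify` is a made-up helper name, and the patterns assume the exact wording shown in the output above.

```python
import re


def summarize_verify(output):
    """Pull a few fields out of captured 'tpautoconf -verify' text."""
    def find(pattern):
        m = re.search(pattern, output)
        return m.group(1) if m else None

    version = find(r"successful with NDMP protocol version (\d+)")
    return {
        "host": find(r'host name "([^"]+)"'),
        "os_version": find(r'os version "([^"]+)"'),
        "ndmp_version": int(version) if version else None,
        "login_ok": "Login was successful" in output,
        "three_way": "Host supports 3-way backup/restore" in output,
    }


# Output for filer301c, copied from the post above.
SAMPLE = '''Connecting to host "filer301c" as user "root"...
Waiting for connect notification message...
Opening session--attempting with NDMP protocol version 4...
Opening session--successful with NDMP protocol version 4
  host supports MD5 authentication
Getting MD5 challenge from host...
Logging in using MD5 method...
Host info is:
  host name "filer301c"
  os type "NetApp"
  os version "NetApp Release 7.2.6.1P2"
  host id "0151703198"
Login was successful
Host supports LOCAL backup/restore
Host supports 3-way backup/restore
Host has SnapVault Secondary license installed
'''

print(summarize_verify(SAMPLE))
```

A clean result here only shows that the NDMP control connection and login work; it says nothing about the data path to the tape drive.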

g_man, 04 Jan 2010

I cannot speak to NetApp, but this could be similar to the EMC Celerra. With my Celerra, I think I can only run 4 concurrent backups per data mover. If another backup kicks off while 4 are running, it will end in a status 99.

Are your backups by chance taking longer to run, so that you have more concurrent backups running now than you used to?
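
One way to answer that question from job history is to compute the peak number of overlapping jobs. Below is a small Python sketch of that idea. The job windows are invented purely for illustration, and the limit of 4 is the Celerra figure mentioned above, not a known NetApp limit.

```python
from datetime import datetime


def peak_concurrency(jobs):
    """Return the largest number of jobs running at the same time.

    `jobs` is an iterable of (start, end) pairs with comparable values.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))    # a job begins
        events.append((end, -1))     # a job finishes
    # At an identical timestamp, process finishes (-1) before starts (+1)
    # so back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical job windows, not taken from any real job log.
day = lambda h, m=0: datetime(2010, 1, 4, h, m)
jobs = [
    (day(20), day(23)),
    (day(21), day(23, 30)),
    (day(22), day(23, 45)),
    (day(23), day(23, 59)),
]

LIMIT = 4  # per-data-mover figure quoted for the Celerra above
peak = peak_concurrency(jobs)
print("peak concurrent jobs:", peak, "(limit assumed:", LIMIT, ")")
```

If the peak after the move is higher than before, longer-running jobs piling up on each other would be worth ruling out.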

Manoj Siricilla, 04 Jan 2010

ndmp failure 99 - further troubleshooting

- Start a backup and then, at the filer prompt, run the command

>ndmpd status

Check the session list and verify that ndmpd is on.

Also, review the backup log on the NAS filer by running the command

>rdfile /etc/log/backup

99 is a generic error code.

You can get a 99 even if the filer simply cannot take the snapshot it needs for the dump.

Cheers!
Manoj
------------------
Time isn't running out, but life is...

MattS, 26 Jan 2010

Rwigg,

Did you get this issue resolved? I was having the same issue, with the same errors in the logs.

After much troubleshooting I found that I was using the e0m interface on the NetApp for NetBackup media/master communication. I went over this with the SAN guys and they weren't sure why that interface had a DNS entry with the NetApp's hostname. It should have been something like hostname-e0m.

So I edited the hosts files on the master/media servers to use the IP address of the vif-master interface instead, and that seems to have resolved the issue.
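
A quick way to check which address the filer's name actually resolves to on a given server is a few lines of Python. This is only a sketch: the hostnames and the 192.0.2.x addresses are placeholders (documentation range), and the `resolver` parameter exists so the example can run without touching real DNS.

```python
import socket


def check_resolution(hostname, expected_ip, resolver=socket.gethostbyname):
    """Compare what `hostname` resolves to against the address we expect."""
    try:
        actual = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"{hostname}: lookup failed ({exc})"
    if actual == expected_ip:
        return True, f"{hostname} -> {actual} (as expected)"
    return False, f"{hostname} -> {actual}, expected {expected_ip}"


# Placeholder data: pretend the filer name currently points at the
# management (e0m) address instead of the vif used for backups.
FAKE_TABLE = {"filer300c": "192.0.2.10"}      # hypothetical e0m address
EXPECTED_VIF = "192.0.2.20"                   # hypothetical vif address

ok, message = check_resolution("filer300c", EXPECTED_VIF,
                               resolver=FAKE_TABLE.__getitem__)
print(ok, message)
```

On a real master or media server, drop the `resolver` argument so the system resolver (hosts file first, then DNS, depending on configuration) is used.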

Let me know if this works for you.

Matt

lu, 26 Jan 2010

Can you try creating the file /usr/openv/netbackup/db/config/ndmp.cfg and putting the following keyword in it: NDMP_MOVER_CLIENT_DISABLE
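
As a sketch of how that change could be scripted without adding the keyword twice, here is a short Python example. It writes to a temporary directory so it is safe to run anywhere; the real path is the one given above on UNIX, and the equivalent location under a Windows install directory is something you would need to confirm for your own server. `ensure_keyword` is an illustrative helper, not a NetBackup tool.

```python
import os
import tempfile

KEYWORD = "NDMP_MOVER_CLIENT_DISABLE"


def ensure_keyword(cfg_path, keyword=KEYWORD):
    """Append `keyword` on its own line unless it is already present.

    Returns True if the file was changed, False if it was left alone.
    """
    text = ""
    if os.path.exists(cfg_path):
        with open(cfg_path) as fh:
            text = fh.read()
    if keyword in (line.strip() for line in text.splitlines()):
        return False
    with open(cfg_path, "a") as fh:
        if text and not text.endswith("\n"):
            fh.write("\n")  # keep the keyword on a line of its own
        fh.write(keyword + "\n")
    return True


# Demonstration against a throwaway file rather than the live config.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ndmp.cfg")
    first = ensure_keyword(demo_path)    # creates the file
    second = ensure_keyword(demo_path)   # finds the keyword, does nothing
    with open(demo_path) as fh:
        contents = fh.read()

print(first, second, repr(contents))
```

Keep a copy of any existing ndmp.cfg before changing it, so the setting can be backed out if it makes no difference.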

Rwigg, 26 Jan 2010

Still having issues

Thanks, MattS. I checked the hosts file on the master/media server and we are already using the vif-master interface. I had thought we fixed the issue by upgrading the tape drive firmware, but I received the error again today. After rerunning the job I received a new error, which leads me to believe this is still related to the tape drive. Below is the status from the failed job. I am going to follow up with IBM again.

1/26/2010 10:47:38 AM - requesting resource ares05c-hcart-robot-tld-3-Filer300c
1/26/2010 10:47:38 AM - requesting resource ares02.NBU_CLIENT.MAXJOBS.filer300c
1/26/2010 10:47:38 AM - requesting resource ares02.NBU_POLICY.MAXJOBS.NDMP_Vol_Dotnextvirt01
1/26/2010 10:47:38 AM - granted resource ares02.NBU_CLIENT.MAXJOBS.filer300c
1/26/2010 10:47:38 AM - granted resource ares02.NBU_POLICY.MAXJOBS.NDMP_Vol_Dotnextvirt01
1/26/2010 10:47:38 AM - granted resource P00045
1/26/2010 10:47:38 AM - granted resource IBM.ULT3580-TD4.016
1/26/2010 10:47:38 AM - granted resource ares05c-hcart-robot-tld-3-Filer300c
1/26/2010 10:47:38 AM - estimated 0 kbytes needed
1/26/2010 10:47:40 AM - started process bpbrm (3520)
1/26/2010 10:47:41 AM - connecting
1/26/2010 10:47:41 AM - connected; connect time: 00:00:00
1/26/2010 10:47:46 AM - mounting P00045
1/26/2010 10:49:00 AM - mounted; mount time: 00:01:14
1/26/2010 10:49:00 AM - positioning P00045 to file 9
1/26/2010 10:50:59 AM - positioned P00045; position time: 00:01:59
1/26/2010 10:50:59 AM - begin writing
1/26/2010 11:38:14 AM - current media P00045 complete, requesting next resource Any
1/26/2010 11:38:15 AM - current media -- complete, awaiting next media Any Reason: Drives are in use, Media Server: ares05c,
  Robot Number: 3, Robot Type: TLD, Media ID: N/A, Drive Name: N/A,
  Volume Pool: NDMPFiler01, Storage Unit: ares05c-hcart-robot-tld-3-Filer300c, Drive Scan Host: N/A
 
1/26/2010 11:39:02 AM - granted resource P00079
1/26/2010 11:39:02 AM - granted resource IBM.ULT3580-TD4.016
1/26/2010 11:39:02 AM - granted resource ares05c-hcart-robot-tld-3-Filer300c
1/26/2010 11:39:02 AM - mounting P00079
1/26/2010 11:40:15 AM - mounted; mount time: 00:01:13
1/26/2010 11:40:16 AM - positioning P00079 to file 1
1/26/2010 11:40:19 AM - positioned P00079; position time: 00:00:03
1/26/2010 11:40:19 AM - begin writing
1/26/2010 11:48:31 AM - Error ndmpagent(pid=1452) NDMP backup failed, path = /vol/dotnextvirt01      
1/26/2010 11:48:32 AM - Error bptm(pid=2632) io_ioctl_ndmp (MTBSF) failed on media id P00079, drive index 2, return code 7 (NDMP_IO_ERR) (bptm.c.21479)
1/26/2010 11:48:33 AM - end writing; write time: 00:08:14
1/26/2010 11:48:37 AM - Error ndmpagent(pid=1452) terminated by parent process        
1/26/2010 11:48:38 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:40 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
media position error(86)
1/26/2010 11:48:41 AM - Error ndmpagent(pid=1452) MoverGetState called with no session       
1/26/2010 11:48:42 AM - Error ndmpagent(pid=1452) NDMP backup failed, path = /vol/dotnextvirt01      
1/26/2010 11:48:43 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:45 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:46 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:48 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:49 AM - Error ndmpagent(pid=1452) MoverGetState called with no session       
1/26/2010 11:48:50 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
1/26/2010 11:48:52 AM - Error ndmpagent(pid=1452) Connection was closed but has not yet been destroyed.   
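
To see whether failures like this keep landing on the same drive, the bptm line can be broken into its fields and tallied across jobs. The Python sketch below assumes the message wording shown in the log above; the only input used is the single bptm line from this job, and `tally_by_drive` is an illustrative name.

```python
import re
from collections import Counter

# Matches the bptm message format seen in the job status above.
BPTM_IOCTL = re.compile(
    r"io_ioctl_ndmp \((?P<op>\w+)\) failed on media id (?P<media>\w+), "
    r"drive index (?P<drive>\d+), return code (?P<rc>\d+) \((?P<name>\w+)\)"
)


def parse_bptm_errors(text):
    """Return one dict per io_ioctl_ndmp failure found in `text`."""
    found = []
    for m in BPTM_IOCTL.finditer(text):
        found.append({
            "operation": m["op"],
            "media_id": m["media"],
            "drive_index": int(m["drive"]),
            "return_code": int(m["rc"]),
            "code_name": m["name"],
        })
    return found


def tally_by_drive(errors):
    """Count failures per drive index."""
    return Counter(e["drive_index"] for e in errors)


# The bptm line from the failed job above.
SAMPLE = (
    "1/26/2010 11:48:32 AM - Error bptm(pid=2632) io_ioctl_ndmp (MTBSF) "
    "failed on media id P00079, drive index 2, return code 7 (NDMP_IO_ERR) "
    "(bptm.c.21479)"
)

errors = parse_bptm_errors(SAMPLE)
print(errors)
print(tally_by_drive(errors))
```

If several jobs on different media all fail on the same drive index, that points at the drive rather than the tapes.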

MattS, 27 Jan 2010

That does sound tape drive related. Maybe try running a large, long standard backup to that drive to see if any errors pop up? Is it SAN connected? If so, you might want to check for errors on your fiber switch ports.

Rwigg, 03 Feb 2010

Resolved

The issue was with the tape drive.  IBM replaced it and now all is well.  Thanks everyone for your suggestions.