Netbackup 6.5.4 and Netapp Ontap 7.2.6.1 NDMP backup problems
Hi,
We're using Netbackup 6.5.4 on a Solaris 10 host (master and media server) for backing up our Netapp FAS 3140 with ONTAP 7.2.6.1 on a Quantum PX 506 LTO tape library. Both media server and Netapp appliance are connected with fibrechannel to the tape library. Two drives are assigned to the Netapp appliance.
Since a few days we're having problems doing incremental backups on one of the volumes (/vol/home) on the NetApp filer. The incremental backup of other volumes run fine. The /vol/home which is causing the problems is our biggest volume. It contains about 30 million files and 1.6 TB (1.2 TB with NetApp block deduplication):
adnnfs03> df -h home Filesystem total used avail capacity Mounted on /vol/home/ 1440GB 1238GB 201GB 86% /vol/home/ /vol/home/.snapshot 360GB 253GB 106GB 70% /vol/home/.snapshot adnnfs03> df -i home Filesystem iused ifree %iused Mounted on /vol/home/ 30047941 4952040 86% /vol/home/ adnnfs03> df -hs home Filesystem used saved %saved /vol/home/ 1239GB 426GB 26%
Are there any known limitations or problems with this constellation and that amount of data? We didn't change our setup in the last few months and backing up the smaller volumes works fine. The backup process of /vol/home stalls for hours and then aborts with this error message:
03/15/2010 09:32:31 - requesting resource adnnfs03-ndmp 03/15/2010 09:32:31 - requesting resource adnbackup.NBU_CLIENT.MAXJOBS.adnnfs03 03/15/2010 09:32:31 - requesting resource adnbackup.NBU_POLICY.MAXJOBS.adnnfs03 03/15/2010 09:32:31 - granted resource adnbackup.NBU_CLIENT.MAXJOBS.adnnfs03 03/15/2010 09:32:31 - granted resource adnbackup.NBU_POLICY.MAXJOBS.adnnfs03 03/15/2010 09:32:31 - granted resource LM2727 03/15/2010 09:32:31 - granted resource HP.ULTRIUM3-SCSI.003 03/15/2010 09:32:31 - granted resource adnnfs03-ndmp 03/15/2010 09:32:32 - estimated 1905263392 kbytes needed 03/15/2010 09:32:32 - started process bpbrm (pid=29229) 03/15/2010 09:32:32 - connecting 03/15/2010 09:32:32 - connected; connect time: 0:00:00 03/15/2010 09:32:35 - mounting LM2727 03/15/2010 09:33:27 - mounted LM2727; mount time: 0:00:52 03/15/2010 09:33:31 - positioning LM2727 to file 39 03/15/2010 09:33:47 - positioned LM2727; position time: 0:00:16 03/15/2010 09:33:47 - begin writing 03/15/2010 17:34:18 - Error ndmpagent (pid=29234) aborting operation - no mover progress 03/15/2010 17:34:18 - Error ndmpagent (pid=29234) NDMP backup failed, path = /vol/home 03/15/2010 17:35:13 - end writing; write time: 8:01:26 NDMP backup failure (99)
During that time the NDMP debug logfile of the Netapp filer shows these messages:
Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: Error code: NDMP_NO_TAPE_LOADED_ERR Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: Device name: nrst0a Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: Mode: 0 Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: IOException: Device cannot be opened. Device may have no tape. Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: NDMP message type: NDMP_CONNECT_CLOSE Mar 15 23:59:46 GMT+01:00 [ndmpd:35]: NDMP message replysequence: 7
I doubt that this message is true. There is a tape in the drive. Backing up other volumes works and a second drive is unused. In the meantime we begged our users to clean up their home directories and reduce the number of files and data size, hoping that this will help.
Thanks in advance for any hint.
Best regards,
Bernd
Comments
Nothing in NDMP logs between
Nothing in NDMP logs between 09:30 and 17:35?
The log that you've posted is for 23:59 and seems to be correct - the NBU job has failed at 17:35 due to "no mover progress" and would've unloaded the tape within 5 minutes of the failure.
Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows.
Handy NBU links
Hi Marianne, Yes, there
Hi Marianne,
Yes, there is NDP log on the Netapp, but it's just repeating the following sequence. Looks normal to me, except the "IOException: Device cannot be opened. Device may have no tape." line. They appear for both drives (nrst0a and nrst1a).
According to [1] the error code 99 most likely means in this case "An Incremental backup is run, but no files have changed since the last backup.". But files have changed. I assume that the incremental backup process takes too long to find all the files that have changed and aborts after 8 hours. The last full backup ran on 8 march and terminated fine. Our backup operator says that the incremental backups fail sometimes. I have to check if a full backup is working now.
Best regards,
Bernd
[1] http://seer.entsupport.symantec.com/docs/275330.htm
Netapp NDMP debug log:
OK. Full backups are not
OK. Full backups are not affected - at least a full backup of /vol/home runs now for 5.5 hours and has written 105 GB onto tape. Only incremental is failing.
Mount Time Out
I think that media errors is because a mount time out are all your drives healthy? most of the time 99 errors are because the qtrees are not available.
Omar A Villa
Netbackup Expert
These are my personal views and not those of the company I work for
VxUL log for NDMP
Also you can always enable unify logging to see if there is any issue with the NDMP Agent.
Omar A Villa
Netbackup Expert
These are my personal views and not those of the company I work for
Omar, there is no media mount
Omar, there is no media mount timeout here:
Also - unified logging is enabled by default
Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows.
Handy NBU links
Hi, as I wrote before: a full
Hi, as I wrote before: a full backup of that volume runs fine. An incremental backup does nothing for 8 hours and then terminates. Therefore I assume no media or drive error. To confirm that, I ran a full backup yesterday. Here is the log output of that full backup job that worked:
Best regards,
Bernd
Hi, Our users deleted some
Hi, Our users deleted some files in their home directories and the number of inodes in that volume is reduced from 30 million to 27 million:
The incremental backup ran now fine:
So there must be a bug when a Netapp volume contains more than 30 million inodes and ist backed up with NDMP. Is there a limitation known with Netbackup 6.5.4 and Data ONTAP 7.2.6.1? Restoring files to that volume also fails:
Restoring a backup from /vol/home to another volume with less used inodes on the same Netapp filer is being tested. Best regards, Bernd
Hi, Can you turn on logging
Hi,
Can you turn on logging on your netapp and post the results here?
I've never had to turn it on, but from the documentation the command is:
options backup.log.enable [on | off]
From the (sry have no link):
"Data ONTAP® 7.3
Data Protection
Tape Backup and Recovery Guide"
Also, are there any firewalls between your master and media server?
Netapp backup log files
Hi Matt,
Master and media server is the same host and there is no firewall between. The backup server and the Netapp filer are both directly attached with fibre channel to the tape library. Backup log is enabled by default. Here's the content of the backup log from the failed incremental backup on 15 march:
The log of the successful full backup from 16 march:
The time of netapp filer and backup server differ by 24 minutes because NTP daemon wasn't running. We also have sometimes problems restoring files onto that volume. I'm currently testing this issue iteratively to find out under wich conditions this happens. Restoring takes many hours until it has found the requested file.
Best regards,
Bernd
Looking at the log for the failed job
specifically the length of time taken for Phase1 before failure.
From the NetApp site:
"...
Phase I : In this phase, we generate the list of files that need to be backed up. The output of this phase is what is called an inode map. The inode map contains an entry for every inode on the volume. The inode that is to be backed up has its corresponding entry set to 1 whereas the inode that is not supposed to be backed up has its entry set to 0.
With Data Ontap 6.4 onwards, for backups that are initiated by NDMP with file history turned on, we need to do some processing to estimate the amount of space each file (inode) will take on tape. This processing is invaluable during Phase IIIb (Once again introduced in Data ONTAP 6.4) where we write the offset map to the backup image. This adds some cost to NDMP backups w/ file history in this Phase. However, this allows us to provide some extraordinary features (DAR Enhancements) during restore.
..."
It then goes on to say:
"...
Phase I of dump behaves differently based on whether the user initiated the backup of a Volume/Qtree or a sub-directory.
With Sub-directory dumps, each directory and file below the specified root of the backup must be examined to determine whether it should be included in the backup.
This is a time-consuming operation especially if the sub-directory contains several million inodes (files).
With Volume and Qtree based backups, it is not necessary to go through the entire directory tree to determine what needs to be backed up. Instead we just go through the inode file for the backup snapshot. By just looking at an inode entry, it is possible to determine whether the inode exists and whether it is a part of a given qtree/volume.
This is a much faster and an efficient process as compared to the directory traversal. However, the inode file does not contain enough information to be able to do the same for sub-directory dumps. So we cannot apply this efficient process for sub-directory dumps.
..."
Looking more closely at your log, can I ask what your backup selection looks like for this volume? Comparing it with the log for one of my own, it looks like it is /vol/home/. If it has got that trailing / then is it treating it as a "sub-directory dump" & not a "volume or q-tree based backup" hence the length of time it's taking to do incrementals? Or am I just reading too much into this & it's simply due to the number of inodes as you've already said?
Regards Andy
"It's not too late to panic ..."
Hi Andy, In the backup
Hi Andy,
In the backup policy we have /vol/home (without trailing slash) together with 19 other volumes. So it's a volume dump without directory traversal. We don't have qtrees in the user home volume for about 200 users. That policy ran fine for about a year with full and incremental backups. We upgraded then to Netbackup 6.5.4 due to a backup problem caused by unicode file names.
In the meantime the number of inodes sank from 30 mio to 27 mio. An incremental backup ran yesterday. Also a restore from a full backup with less than 30 mio inodes ran through. I'm currently testing restores from a full backup with more than 30 mio inodes some weeks in the past. That one failed previously.
30 mio is an odd number and not close to the border of computer data types. I would understand it, if were crossed 2^31 inodes. It looks more like a timing issue. Calculating the difference takes longer than a certain timeout. The backup and media server is this model:
[root@adnbackup ~]$ prtdiag | more System Configuration: Sun Microsystems sun4us Fujitsu Siemens PRIMEPOWER450 2x SPARC64 V System clock frequency: 220 MHz Memory size: 4096 Megabytes ========================= CPUs ========================= Run Ecache CPU CPU Brd CPU Module MHz MB Impl. Mask --- --- ------- ----- ------ ------ ---- 0 0 0 1869 3.0 5 5.1 0 1 1 1869 3.0 5 5.1Not the most powerful model of today, but it ran fine for years.
Best regards,
Bernd
Solution
Hi,
Got a solution from Symantec support. By default the NDMP progress timeout is at 8 hours. Startign with patch levels 6.0MP7 and 6.5.4, a higher timeout can be defined. I increased it to 24 hours (1440 minutes):
http://support.veritas.com/docs/249241
Best regards,
Bernd
Great stuff! Thanks for
Great stuff! Thanks for sharing this info!
Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows.
Handy NBU links
Thanks for update
Thanks for update AdNovum!
Putting this on my internal wiki as im sure ill run into this someday.
Thanks again!
Matt
Would you like to reply?
Login or Register to post your comment.