
NetBackup 5200 Duplication Jobs Fail with Error 191, then keep re-spawning and failing

Created: 12 Nov 2012 • Updated: 23 Jan 2013 | 3 comments
This issue has been solved. See solution.

Hi,

We are having problems with duplication jobs on our NetBackup 5200 appliance. We use storage lifecycle policies to back up to the 5200 as the primary copy, and then duplicate off to tape. This has worked fine for months, but recently the duplication jobs have started failing with error 191. They start OK and copy a few GB of data, but then fail before the end (the primary backup jobs to the appliance complete OK). Looking more closely at the job log reveals the following:

Critical bpdm(pid=14456) sts_read_image failed: error 2060017 system call failed     
Critical bpdm(pid=14456) image read failed: error 2060017: system call failed    
Error bpdm(pid=14456) cannot read image from disk, Invalid argument     
 

I originally thought this was some kind of corruption on the appliance, but the strange thing is that we can still back up to the appliance OK, and even restore from some of the images that are failing to duplicate. I've adjusted the PoolUsageMaximum and PagedPoolSize registry parameters on the Windows media server, as suggested on some sites, without success.
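(For reference, one way to sanity-check whether a particular image can still be read back from the disk pool is bpverify, which lives in admincmd under the NetBackup install on the master and reads an image back much as a duplication would - the backup ID below is just a placeholder:

bpverify -backupid client01_1352700000 -v

A clean run against one of the failing images would point away from simple whole-image corruption.)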

Another problem is that when the duplication jobs fail they re-spawn, so we have numerous duplication jobs running that are hogging our tape drives, which is having a knock-on effect on normal backups to tape. We have temporarily suspended duplication to help with this (see the commands below), but it isn't a long-term fix.
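For anyone wondering how to suspend the duplications: SLP processing can be deactivated from the master server with nbstlutil (also in admincmd) - the lifecycle name below is a placeholder for your own SLP:

nbstlutil inactive -lifecycle Our_SLP_Name

This stops new duplication jobs from spawning without affecting the primary backups, and nbstlutil active -lifecycle Our_SLP_Name turns processing back on once the underlying problem is fixed.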

Our environment comprises: NBU Master Server 7.1.0.4 (Win2003 R2), NBU Media Server 7.1.0.4 (Win2003 R2), NBU 5200 Appliance (2.0.2)

I've got a ticket open with Symantec about this and they are currently investigating, but just wondered if anyone else in the community had ever seen this before. Any help would be much appreciated.

Thanks

Comments (3)

Mark_Solutions

The main thing that I have seen cause this is simple communications: the master is checking that the media server and its disks are OK, as well as checking itself (it's a bit deeper than that, but that is the basic principle).

I have found that adding the following touch file on the appliance (and any storage server, for that matter) has helped stop this happening:

/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

with a value of 800 inside it - this really helps keep things running smoothly
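On the appliance that just means creating the touch file with the timeout value in it, e.g. as root:

echo 800 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

(As far as I know the value is picked up on the next job without restarting any services, but check with support if in doubt.)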

If you have a case open then I guess there is nothing more obviously wrong (queue processing, disk space, etc.), so give this a go and see if it helps

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!

Stevo

This was resolved with help from a Symantec engineer. It turns out it was due to corruption after all. The engineer provided a tool called "recoverCR" which was run on the appliance. This identified a large number of corrupt backups.

The corruption was fixed by disabling the cached mode on the appliance and re-running a full backup of all clients. The premise for this was that the corruption could be traced back to some shared data that duplication depended on for a number of subsequent backups. By disabling the cached mode, the full backup overwrote the corrupted source data. Re-running the recoverCR tool after the full backup showed that the level of corruption had been reduced to a minuscule level: just one backup image, which we ended up deleting.

Cache mode was enabled again and everything worked fine after that.

For info, cache mode is disabled by setting the "CACHE_DISABLED = 1" entry in the pd.conf file on the appliance: 1 indicates caching disabled, 0 indicates caching enabled.
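For anyone who needs to do the same, the relevant pd.conf entry looked like this while cache mode was disabled (on our appliance pd.conf lives under /usr/openv/lib/ost-plugins/, but check the path on your version):

# pd.conf excerpt: 1 disables caching, 0 enables it
CACHE_DISABLED = 1

We set it back to 0 once the clean full backups had completed.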

 

 

SOLUTION