
Image Cleanup Causing performance issues

Created: 10 Jul 2013 | 6 comments

Recently it seems that whenever an Image Cleanup job runs, all of my backup jobs freeze until it finishes.  This is causing my backups to either sit active for a very long time or fail altogether.

The appliance is a 5200 running 2.5.1.  I just recently upgraded some clients and the master to 7.5.0.6, but this seemed to start happening prior to the upgrade.

I did adjust some retentions to a shorter time frame about a month ago.  Could this be causing the issue?

Cleanup jobs run throughout the day and night; some finish fine, while others will run for an hour or so.


Comments (6)

Mark_Solutions:

The long-running cleanup jobs are probably the regular default image cleanup jobs (bpimage or nbdelete processes), which can take longer.

What O/S is your Master Server? (Guessing Win 2008 from your signature?)

If so then it may be struggling for resources during that period.

Check its System and Application event logs for any errors - especially win2k (I think) or application popups.

When the image cleanup runs it fires off nbdelete processes, which can use a lot of desktop heap, and that heap can become exhausted on a very busy system.

Check for lots of cmd, nbdelete and bpdbm processes running - if there are a LOT while nothing much else is going on, then this could be your issue.
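For illustration only, a quick sketch of one way to count those processes on a Windows master using tasklist - the process names are just the ones mentioned above, so adjust to whatever you actually see piling up:

```python
# Rough sketch: count NetBackup-related processes on a Windows master server.
# The watched names (cmd.exe, nbdelete.exe, bpdbm.exe) are assumptions taken
# from the advice above - not an official Symantec check.
import csv
import subprocess
from collections import Counter

WATCHED = {"cmd.exe", "nbdelete.exe", "bpdbm.exe"}

def count_processes():
    # "tasklist /FO CSV /NH" prints one CSV row per process, no header line
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True
    ).stdout
    names = [row[0].lower() for row in csv.reader(out.splitlines()) if row]
    return Counter(n for n in names if n in WATCHED)

if __name__ == "__main__":
    for name, count in count_processes().items():
        print(f"{name}: {count}")
```

A handful of each is normal while cleanups run; dozens hanging around while nothing much is active would point at the desktop heap problem described above.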

Let me know what you find

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!

Stanleyj:

I checked the event logs and, surprisingly, they are very clean.  I only see one error and it was not related to anything with NetBackup.

My master is a Win2008 R2 system with 16 GB of memory and a 16-core processor.  Plenty of horsepower to handle the load, I would hope.

After putting up this post I decided to reboot the master and media server.  Once everything came back up, all of the frozen jobs retried and ran flawlessly.  Some jobs that had been running for almost 10 hours retried and completed in 3 minutes or less.  Replication that had been moving at around 10 KB/s jumped to around 10 MB/s and finished in less than 5 minutes.  Last night all jobs finished in less than 7 hours; for the last week or so they had been stretching into the 12-14 hour range.

So something apparently was just killing one of these servers.  I don't know, buddy, but thank you for offering up some direction for me.

I did open a ticket yesterday and sent off some logs, so hopefully I can find an answer as to why this coincidentally seems to have started happening around my 7.5.0.6 upgrade.

Mark_Solutions:

If a reboot has cleared things up then it does suggest a process issue, maybe on the Master or maybe on the Media Servers - but I suspect the Master (unless your appliances do VMware backups and have multiple paths to the VMware LUNs, in which case they slowly run out of memory and eventually stop).

If it was desktop heap, it would normally show the odd application popup, or you would get parent jobs not finishing even though their child backup jobs had.

The trouble is that when desktop heap exhaustion gets really bad, Windows cannot even write the events to the event viewer, which makes it hard work to troubleshoot! Win2008 handles desktop heap better than 2003, though.

Do remember that despite having 16 GB of RAM and all of those cores, the paged pool memory, desktop heap and number of TCP/IP ports you have are all still limited.

On a standard Master implementation I always increase the paged pool setting and maximise the number of TCP/IP ports available - if you need some pointers for this let me know.
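As a rough, hypothetical sketch of just checking the current values (the registry locations below are the standard Windows ones, not something confirmed in this thread - don't change anything without Symantec/Microsoft guidance):

```python
# Sketch only: read the current paged pool and MaxUserPort settings on the
# master. A missing value means Windows is using its default. Changing these
# involves a registry edit plus a reboot and should follow vendor guidance.
import subprocess
import winreg

def read_dword(path, name):
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
            value, _ = winreg.QueryValueEx(key, name)
            return value
    except FileNotFoundError:
        return None  # value not set, so the OS default applies

paged_pool = read_dword(
    r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management",
    "PagedPoolSize")
max_user_port = read_dword(
    r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
    "MaxUserPort")

print("PagedPoolSize:", paged_pool if paged_pool is not None else "default")
print("MaxUserPort:", max_user_port if max_user_port is not None else "default")

# On Win2008 the ephemeral port range is managed by netsh rather than MaxUserPort:
print(subprocess.run(
    ["netsh", "int", "ipv4", "show", "dynamicport", "tcp"],
    capture_output=True, text=True).stdout)
```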

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!

Stanleyj:

Thanks Mark. 

I didn't even think to check the processes on the master server when all this was going on.  Because of the issues I have had in the past with my media server, I always seem to blame that guy first.

If the issue creeps back up on me I will pay closer attention to the master.  Thanks again.

Stanleyj:

Mark,

I've had a ticket open now for about 4 weeks on this issue and we still have no clue what is going on.  I can't figure out if it's the appliance or the master server at this point.

What I do know is that this problem consistently happens every 7 days.  I went back and looked at my logs about two weeks ago, and I have had to reboot every 7th day since upgrading to 7.5.0.6.  Just this morning I had to reboot the entire system again.

For 6 days all backups run just fine, and then, normally around the time my SQL jobs start at 10pm on the 6th night, things start getting VERY slow, to the point of failing.

I see these errors: read from input socket failed (636) and media open error (83).

Today I did the reboots in phases, hoping to narrow down the culprit, but unless I just wasn't being patient enough, I still can't figure it out.

1st: Rebooted the appliance
  - This cancelled all pending/running jobs
  - It took 30 minutes for the appliance to come back up and connect to the master
  - Jobs retried but just sat idle

2nd: Restarted all processes on the master (15 minutes after the media server was connected)
  - Waited another 15 minutes and still no action

3rd: Fully rebooted the master
  - 5 minutes after it was back up, a few jobs started running successfully
  - After the catalog and two SQL jobs ran successfully, a replication job and another SQL job started.  Both are running terribly slowly - I'm talking around 20 KB/s when normally both run near 30 MB/s.

I just don't get it.  At this point I can't figure out if this is a bug with 7.5.0.6 or if something has just freaked out on the appliance and the timing is coincidental.

Only two jobs have changed since this started happening: I created an Exchange 2013 policy, and I added a new EVault job, which is being run manually at this point because something isn't working with it at the moment.

Mark_Solutions:

If the appliances were back up in 30 minutes then that is not too bad, though you need to run crcontrol commands on them to see if they really are up, as they don't come fully live until they have loaded up the queues (they give a "cannot connect" error when running the crcontrol command if they are not "really" up).
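For example, a rough sketch run on the appliance itself that waits until crcontrol stops returning a connect error - the crcontrol path and the --getmode option are my assumptions from memory, so verify both on your 5200 before relying on this:

```python
# Sketch: poll crcontrol on the appliance until the content router reports a
# mode instead of a connection error. Path and flag below are assumptions -
# check the actual location and options on your appliance version.
import subprocess
import time

CRCONTROL = "/usr/openv/pdde/pdcr/bin/crcontrol"  # assumed appliance path

def wait_until_up(timeout_s=3600, interval_s=60):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = subprocess.run([CRCONTROL, "--getmode"],
                                capture_output=True, text=True)
        output = (result.stdout + result.stderr).lower()
        if result.returncode == 0 and "connect" not in output:
            print("Content router is responding:", result.stdout.strip())
            return True
        print("Not up yet, waiting...")
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    wait_until_up()
```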

It is sounding like a memory leak with a process on the Master so possibly a bug, but hard to say without digging deeper.

Do you do VMware backups using the appliances? That really can make them run out of memory after a week, which could be the culprit - a reboot of them gets everything back to normal.

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!