
Scheduled jobs are not starting

Created: 29 Apr 2013 • Updated: 10 May 2013 | 33 comments
This issue has been solved. See solution.

I've been working on a problem for a little over a month and have even managed to stump the fine folks at Symantec so far, so I thought I would reach out to all of you and see if I can get this fixed.

Back in March, my scheduled jobs suddenly just stopped working.  They just don't start. If I restart my servers, or services, the jobs will queue up and run fine for approximately 24 hours, then not run again. I have approximately 150 policies that will show up if I type in:

nbpemreq -predict -date (next 24 hours)

If I run the same command the next day, only 4 policies show up. These policies have worked seamlessly throughout the problem, which doesn't make troubleshooting any easier. The policies that work are wide-ranging: a flat file backup, Exchange, and RMAN. Some write to a Data Domain, some write direct to tape. New backup policies behave in the same manner. Manually starting a job works fine.
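One way to keep track of which policies drop out is to save the policy names from each day's predict run and compare them. A minimal sketch, assuming the names have already been pulled out into plain lists (the layout of the nbpemreq -predict output is not shown in this thread, so parsing is left out, and the policy names below are made up for illustration):

```python
def dropped_policies(yesterday, today):
    """Return policy names that were predicted yesterday but not today."""
    return sorted(set(yesterday) - set(today))


# Hypothetical policy names, for illustration only.
day1 = ["Exchange_Daily", "FlatFile_Daily", "RMAN_Prod", "SQL_Weekly"]
day2 = ["Exchange_Daily", "RMAN_Prod"]

missing = dropped_policies(day1, day2)
print(missing)
```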

Master - W2K8 running NBU 7.5.0.3

Clients - vary between Windows and Red Hat (both fail)

Any suggestions would be greatly appreciated. I will post any log files that you need.



LucSkywalker1957's picture

Do you see anything in your Windows system and application event logs that might indicate something going on with the server? Did you make or introduce any changes or patches last March?

Dan Giberson's picture

Nope, I looked at that too.....the only work I was doing was troubleshooting SLP replication.

NBU 7.01 on Windows 2003 with LTO2 Library

Marianne's picture

What type of scheduling are you using? Calendar or Frequency?

If Calendar - do the schedules span midnight?

Were any backups kicked off manually after the backup window closed?

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

Dan Giberson's picture

I only use Calendar based scheduling in my environment. Some of these jobs will span midnight, but not all. I can manually kick a backup off at any time and it will run perfectly fine.


Marianne's picture

The problem with manual backups is that they will affect scheduled backups. If a daily has run during the day, another backup will not be started that night.


LucSkywalker1957's picture

When's the last time you rebooted your master?

Jaykullar's picture

I have seen this previously, although in 7.1.0.4.

Can you pick a policy that does not auto-run and delete the schedules from it, then run nbpemreq -updatepolicies? Then re-create your schedules in the same policy and run nbpemreq -updatepolicies again.

Then run nbpemreq -predict_all -date **/**/**

Let us know how that figures.

Dan Giberson's picture

Luc, I have rebooted my master several times in the last couple of weeks. If I reboot, everything appears to work fine for the first 24 hours, then it halts again.

Jaykullar, I tried that with the same result. The only jobs in the predict list are the ones that have never stopped working, but good suggestion. If I create a new policy, it's the same as well. Even if I copy from one of the policies that still work, it fails....


mph999's picture

" ... and have managed to even stump the fine folks at Symantec "

Yikes, this will be fun then ...

I doubt the type of job is relevant, can't see why pem would care.

Anyhows ...

in the pem log, for a given client/ policy you will see lines like this :

[PolicyClientTask::cancel] scheduling for policy <policy name>, client <client name>, has been abandoned because of image expiration, will recalculate
 
then, a bit later you should see :
 
[PolicyClientTask::run] policy <policy name>, client <client name>, schedule <schedule name> will be submitted for execution at <date /time> 
 
For a given policy/ client do you find that at some point, the "submitted for execution" lines suddenly stop ?
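To answer that question without reading the log by eye, something like the following could be run over the pem log text. It is only a sketch: the regular expression follows the "[PolicyClientTask::run]" line quoted above, while the sample lines, the policy and client names, and the timestamp format are made up for illustration.

```python
import re

# Pattern modelled on the "[PolicyClientTask::run]" line quoted above.
SUBMITTED = re.compile(
    r"\[PolicyClientTask::run\] policy (?P<policy>.+?), client (?P<client>.+?), "
    r"schedule (?P<schedule>.+?) will be submitted for execution at (?P<when>.+)"
)


def last_submissions(lines):
    """Map (policy, client) to the last 'submitted for execution' time seen."""
    last = {}
    for line in lines:
        match = SUBMITTED.search(line)
        if match:
            key = (match["policy"], match["client"])
            last[key] = match["when"].strip()
    return last


# Made-up sample lines in the quoted format (timestamp layout is assumed).
sample = [
    "[PolicyClientTask::run] policy PolA, client host1, schedule Daily "
    "will be submitted for execution at 05/01/13 20:00:00",
    "[PolicyClientTask::cancel] scheduling for policy PolA, client host1, "
    "has been abandoned because of image expiration, will recalculate",
    "[PolicyClientTask::run] policy PolA, client host1, schedule Daily "
    "will be submitted for execution at 05/02/13 20:00:00",
]

result = last_submissions(sample)
print(result)
```

Any policy/client pair whose last entry is days old is one where the submissions have stopped.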
 
Cheers,
 
M
Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Jaykullar's picture

Sounds like a bit of problem there bud.

Have Symantec recommended MP5 at all?

I know this sounds silly, but are you using Java or Admin Console? Have you tried creating policies in both?

Dan Giberson's picture

MPH, are you talking about PEM logs on the client itself? If not, then none of my logs for the last 6 days have a line that says "submitted for execution".

The one odd thing I have noticed in my troubleshooting is that I can't get full output from "nbpem subsystems screen all"; it fails on screen 1, which I found out is the Task / Job Factory. If I do a "nbpemreq subsystems screen 1" it just fills up about 2-3 log files and says it can't connect to nbpem. I have attached a sample for your light reading.

So far that's all I have to go on.....

Jaykullar, no one has suggested going up to MP5 yet, and I would prefer to avoid it if possible as I'm waiting for 7.6 to be released. However, if it will fix it, I might have to do it. And I just use the Admin console.

Attachment: NBPEM.txt (429.89 KB)


mph999's picture

There are no pem logs on the client, master only.

You'll need to process the logs for me, I haven't got access to a machine at the moment;

vxlogview -p 51216 -i 116 -d all -t 07:00:00

... for example, would give the last 7 hours of logs.
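If you want to pull a different number of hours, the command line can be built rather than typed. A small sketch, assuming only the options shown above (-p, -i, -d, -t); the function name and the 1-99 hour limit are my own choices for illustration, not something vxlogview documents here:

```python
def vxlogview_args(hours, product_id=51216, originator_id=116):
    """Build the vxlogview argument list for the last `hours` hours of logs."""
    # The 1-99 range is an assumption made so the hh:mm:ss field stays two digits.
    if not 0 < hours < 100:
        raise ValueError("hours must be between 1 and 99")
    return [
        "vxlogview",
        "-p", str(product_id),
        "-i", str(originator_id),
        "-d", "all",
        "-t", f"{hours:02d}:00:00",
    ]


cmd = vxlogview_args(7)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run on the master, with the output redirected to a file for posting.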

Thanks,

Martin

 
Dan Giberson's picture

Hey,

Just noticed this line, which exists for every failed policy...any thoughts?

Prediction data not available for WestNet_MKSC/srvmksq01 because schedule calculation is pending(PolicyClientTask.cpp:1325),34:PolicyClientTask::formatPrediction,1
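To list every policy/client pair stuck in this state, the log output can be filtered for that message. A minimal sketch, assuming the message always has the shape shown above and that policy names contain no slash or whitespace (both are assumptions on my part):

```python
import re

# Pattern modelled on the "Prediction data not available" line quoted above.
PENDING = re.compile(
    r"Prediction data not available for (?P<policy>[^/\s]+)/(?P<client>\S+) "
    r"because schedule calculation is pending"
)


def pending_pairs(lines):
    """Return sorted, de-duplicated (policy, client) pairs reported as pending."""
    found = set()
    for line in lines:
        match = PENDING.search(line)
        if match:
            found.add((match["policy"], match["client"]))
    return sorted(found)


sample = [
    "Prediction data not available for WestNet_MKSC/srvmksq01 because schedule "
    "calculation is pending(PolicyClientTask.cpp:1325),34:PolicyClientTask::formatPrediction,1",
    "some unrelated log line",
]

pairs = pending_pairs(sample)
print(pairs)
```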


Dyneshia's picture

You could give the following a try (it will interrupt backups).

1. Shut down NBU services.

2. Use bpps or Services to check that no NBU services, including the PBX service, are running.

3. Delete, or move to a safe directory, the files below:
C:\program files\veritas\netbackup\bin\bpsched.d\pempersist
C:\program files\veritas\netbackup\bin\bpsched.d\retirepersist
C:\program files\veritas\netbackup\bin\dbdbm.lock
C:\program files\veritas\netbackup\db\jobs\restart\*
C:\program files\veritas\netbackup\db\jobs\pempersist
C:\program files\veritas\netbackup\db\jobs\pempersist2
C:\program files\veritas\netbackup\var\TaoNotifSvc*.*
C:\program files\veritas\netbackup\db\failure_history\*

Rename the following files by adding ".old" to the end:

C:\program files\veritas\netbackup\var\nbproxy_jm.ior
C:\program files\veritas\netbackup\var\nbproxy_pem.ior
C:\program files\veritas\netbackup\var\nbproxy_pem_email.ior

4. Start up NBU.

5. Run the following commands:
C:\program files\veritas\netbackup\bin\admincmd\nbrbutil -resetAll
C:\program files\veritas\netbackup\bin\admincmd\nbpemreq -updatepolicies
C:\program files\veritas\netbackup\bin\admincmd\nbpemreq -tables screen

For more info, please refer to the technote below:

http://www.symantec.com/docs/TECH62714
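Step 3 can be scripted so that nothing is deleted outright. The sketch below only builds a plan of moves and renames from the paths listed above and applies it on request. The function names and the safe-directory layout are my own, the install root is passed in rather than assumed, and it must only be run with NBU services stopped. Note that "TaoNotifSvc*.*" is written as "TaoNotifSvc*" here because Python's glob treats the dot literally.

```python
import shutil
from pathlib import Path

# Paths relative to the NetBackup install directory, taken from the list above.
MOVE_PATTERNS = [
    "bin/bpsched.d/pempersist",
    "bin/bpsched.d/retirepersist",
    "bin/dbdbm.lock",
    "db/jobs/restart/*",
    "db/jobs/pempersist",
    "db/jobs/pempersist2",
    "var/TaoNotifSvc*",
    "db/failure_history/*",
]
RENAME_FILES = [
    "var/nbproxy_jm.ior",
    "var/nbproxy_pem.ior",
    "var/nbproxy_pem_email.ior",
]


def plan_cleanup(root, safe_dir):
    """Return (source, destination) pairs without touching any file."""
    root, safe_dir = Path(root), Path(safe_dir)
    plan = []
    for pattern in MOVE_PATTERNS:
        for src in sorted(root.glob(pattern)):
            # Keep the same relative layout under the safe directory.
            plan.append((src, safe_dir / src.relative_to(root)))
    for name in RENAME_FILES:
        src = root / name
        if src.exists():
            plan.append((src, src.with_name(src.name + ".old")))
    return plan


def apply_cleanup(plan):
    """Carry out the moves and renames produced by plan_cleanup()."""
    for src, dst in plan:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
```

Print the plan first and check it against the list above before calling apply_cleanup.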

Dyneshia's picture

There was an issue with nbpem in 7.5.0.3; please see the 7.5 news bulletin: http://www.symantec.com/docs/TECH178334

The fix is in 7.5.0.4, and since you are going to patch up, you might as well go to 7.5.0.5

http://www.symantec.com/docs/TECH199269

(ET2838857) <<Fixed in 7.5.0.4>> NB_7.5.0.3_ET2838857_4.zip is an Emergency Engineering Binary (EEB) replacement for nbpem for NetBackup 7.5.0.3.
 http://www.symantec.com/docs/TECH192530

This EEB includes resolutions for the following issues:

(ET2836511) <<Fixed in 7.5.0.4>> After upgrading to NetBackup 7.5.0.3 virtual machine (VMware) backups run multiple times, eventually failing with status code 196 reported.
 http://www.symantec.com/docs/TECH192104

(ET2746518) <<Fixed in 7.5.0.4>> Calendar schedules will be run multiple times in the backup window in 7.5 if the Backup window spans midnight and the backup starts prior to midnight and finished on the next day.
 http://www.symantec.com/docs/TECH189216

(ET2836015) <<Fixed in 7.5.0.4>> Query base VMware Backup using Calendar schedule may fail with Status: 196 (client backup was not attempted because backup window closed)
 http://www.symantec.com/docs/TECH190338

Dan Giberson's picture

I will be upgrading to 7.5.0.5 in the hope that it will fix it; however, the issues listed are not quite the problem I'm having. It's not just policies that span midnight that are failing. I will keep everyone posted on the Symantec root cause if they find one.


Dyneshia's picture

7.5.0.5 includes a later version of nbpem, which I hope resolves your issue. I know the issues are not an exact match, but we did have numerous issues. In addition, if support needs to escalate your case, backline will push back until you are at 7.5.0.5. Please let us know how it goes.

LucSkywalker1957's picture

I know you likely know, but it's worth saying: make sure you get a 100% successful catalog backup before you upgrade. :)

Jaykullar's picture

If support have not suggested an upgrade to MP5, then they are unaware of any fixes for this in MP5. Is your case with backline?

I've had some very weird problems with NBU that have taken time to resolve, but backline have always come good.

Dan Giberson's picture

There is someone from backline looking at my logs. I'm not upgrading until tomorrow so I hope they come up with a good root cause / solution.

Dyneshia, I will try your suggestion if my manual backups clear up later today.


Dan Giberson's picture

Just an update for everyone: I updated to 7.5.0.5 and can still repeat the problem. I did see a couple of odd issues during the upgrade. One of my media servers kept trying to run bpcoverage and wouldn't clear the process until I re-ran bpcoverage against an actual client.

So...back to square one and would appreciate any other suggestions.


Mark_Solutions's picture

Hadn't noticed this thread before - been a bit busy! - but love a challenge!

So how many jobs per night do you run and how much memory does your Master Server have?

Here are my thoughts ...

1. You have a TCP/IP issue preventing communication - TcpTimedWaitDelay and more ports can help here:

HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\ add DWORD TcpTimedWaitDelay with a decimal value of 30.

For 32 bit servers also add MaxUserPort with a decimal value of 65534

For 64 bit servers run netsh int ipv4 set dynamicport tcp start=10000 num=50000

Then reboot

2. Paged pool memory is not sufficient:

In the registry go to HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management

Add two new DWORDs - PoolUsageMaximum with a decimal value of 40 and PagedPoolSize with a hex value of FFFFFFFF. Then reboot (so do at the same time as the TCPIP ones)

With both of the above you would expect to see something logged somewhere, but the changes will help anyway
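If you prefer to import the DWORD values rather than type them into regedit, they can be written out as a .reg file. The sketch below generates that text for the two 64-bit-relevant keys above and shows the decimal-to-hex conversion a .reg file needs. The helper names are my own, .reg files need the long HKEY_LOCAL_MACHINE form rather than HKLM, and MaxUserPort is left out because it was only suggested for 32-bit servers.

```python
def dword_line(name, value):
    """Format one REG_DWORD entry as a .reg file expects (8 hex digits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD out of range")
    return f'"{name}"=dword:{value:08x}'


def reg_file(sections):
    """Build .reg text from {key_path: {value_name: integer}}."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, values in sections.items():
        lines.append(f"[{key}]")
        lines.extend(dword_line(name, val) for name, val in values.items())
        lines.append("")
    return "\n".join(lines)


# Values as suggested above: decimal 30, decimal 40, hex FFFFFFFF.
settings = {
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters": {
        "TcpTimedWaitDelay": 30,
    },
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management": {
        "PoolUsageMaximum": 40,
        "PagedPoolSize": 0xFFFFFFFF,
    },
}

text = reg_file(settings)
print(text)
```

Export the existing keys before importing anything, so the change can be backed out.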

3. And this is a good one! If you do run a LOT of jobs per night Windows can run out of desktop heap - the great thing is that when it does this it gets so unhappy it cannot even log it anywhere, not even to the event viewer - so it is a bit difficult to spot!

On the Master in the registry go to :

HKLM\System\CurrentControlSet\Control\Session Manager\SubSystems

There is a Windows value in here containing a SharedSection entry with three numbers:

Windows 32-bit servers: SharedSection=1024,3072,512
Windows 64-bit servers: SharedSection=1024,20480,768

You need to change the 512 or 768 value (depending on your server O/S type) to 1024 initially - you can later try 2048 or even 4096 if you need to - then re-boot

This can make the server unstable, so go up one step at a time and see how it goes.
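Since the SharedSection numbers sit inside a longer string, it is easy to mistype the edit. A small sketch of the change as a string operation, assuming the entry always has the form SharedSection=a,b,c as shown above (the function name is my own, and this only produces the new string, it does not write to the registry):

```python
import re

SHARED = re.compile(r"SharedSection=(\d+),(\d+),(\d+)")


def set_third_shared_value(windows_value, new_size):
    """Return the string with the third SharedSection number replaced."""
    def replace(match):
        return f"SharedSection={match.group(1)},{match.group(2)},{new_size}"

    updated, count = SHARED.subn(replace, windows_value, count=1)
    if count == 0:
        raise ValueError("no SharedSection=a,b,c entry found")
    return updated


before = "SharedSection=1024,20480,768"
after = set_third_shared_value(before, 1024)
print(after)
```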

With all three of the above in place, hopefully it will help - if no one can spot an error, I am thinking the desktop heap setting may help

Let me know how you get on!

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!.

watsons's picture

At first glance at your first post, I would suspect nbpem is crashing, so no further schedules can run at all. But if you have contacted support, this should have been identified promptly from the event logs.

Another thing would be the nbpem message you saw:

Prediction data not available for WestNet_MKSC/srvmksq01 because schedule calculation is pending(PolicyClientTask.cpp:1325),34:PolicyClientTask::formatPrediction,1

If nbpem does not crash, it might be stuck doing the calculation and unable to process and run the schedules, in which case it is most likely a bug.

Would very much like to hear the finding from support (backline).

Dan Giberson's picture

Hey guys,

Thanks for all the suggestions so far, I am still working with support. Once we have a solution I will post it on here.

Mark, I would have agreed with some of your suggestions, but my nbpemreq -predict doesn't even show a tenth of my scheduled jobs. I will go through and look at the tuning you suggested though as they are good ideas no matter what.

Dan


Mark_Solutions's picture

My guess is that they will say that something has corrupted your policies and get you to re-create them!

As a test make a new policy and see if that appears in the predict list - I am guessing it will

You could try making a small change to all of your policies - maybe change the date that they go active - save this minor change for each one and then run nbpemreq -updatepolicies

Then try a predict again to see if they appear (maybe just try one first that is currently missing)

Another way of kicking things into life is to use nbpemreq -suspend_scheduling followed by -resume_scheduling to re-set all timers and see if that helps

I assume you have checked all media servers for orphaned processes (especially bpbrm and bptm) that could make the system think the last set of jobs are still running?

Let us know how things go anyway


Dan Giberson's picture

Hey Mark,

I've already tried recreating a policy. New policies act the same as existing policies. If I do a suspend / resume, things will work for about 24 hours then die again....


Mark_Solutions's picture

Ok - in that case I go back to where I was before with the tuning - and would add another couple of things: whether the EMM database is OK, and stuck ior files

Your server.conf file may need tuning for the cache sizes and/or your databases may need defragmenting - have a look through your server.log file to see what it is saying about the state of your databases (search for warning and cache)

The ior files go into the netbackup\var directory and relate to nbproxy linking with nbpem and nbjm

They can get locked out, causing your issues, and need cleaning up - but it is why this happens that needs investigating

You get some idea about these files here http://www.symantec.com/docs/TECH45841 but they need looking at carefully to see why the errors occur

Have we asked about Anti Virus exclusions etc yet?


Dan Giberson's picture

Found this a couple of times...not sure what it's telling me

05/06 01:05:01. Disconnecting shared memory client, process id not found
I. 05/06 01:05:01. Disconnected SharedMemory client's AppInfo: IP=10.246.123.110;HOST=SRVYCNBMAS01;OSUSER=SYSTEM;OS='Windows 2008R2 Build 7601 Service Pack 1';EXE=E:\NBU\NetBackup\bin\bpdbm.exe;PID=0x1504;THREAD=0xbe8;VERSION=11.0.1.2867;API=ODBC;TIMEZONEADJUSTMENT=-360
 

And a few fragmentation warnings, but nothing I would consider critical.


Dyneshia's picture

Are you still seeing :

Prediction data not available for WestNet_MKSC/srvmksq01 because schedule calculation is pending(PolicyClientTask.cpp:1325),34:PolicyClientTask::formatPrediction,1

Dan Giberson's picture

Yes...it's been escalated to engineering now. The calculate process never completes.


Dyneshia's picture

If you are still seeing the "calculation is pending" message, I found one case where it was resolved by the procedure I posted back on April 30th (shut down NBU, move the pempersist and related files listed there to a safe directory, rename the nbproxy .ior files, start NBU, then run nbrbutil -resetAll and nbpemreq -updatepolicies). Did you give it a try? For more info, please refer to the technote:

http://www.symantec.com/docs/TECH62714

If you have done this, could you give me the etrack number you currently have open? I would like to track this.

Thank you !

Dan Giberson's picture

Hey Dyneshia,

I tried your suggestion as well...the E-track number is ET 3190406.

Dan


Dan Giberson's picture

Ok....I think we have a fix...I hope. It turns out there was a wildcard (*) in a policy that caused NBPEM to try and calculate over 700,000 files into new streams which caused it to constantly hang. As such, I have now gone through and removed all wildcards from my policies. 

I will continue to monitor this through the weekend. Thanks everyone for all the suggestions.
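For anyone wanting to audit their own policies for the same thing, the check itself is simple once the backup selections are in hand. A minimal sketch, assuming the selections have already been exported into a dictionary by whatever means you prefer; the policy names and paths below are made up, and only the "*" wildcard mentioned above is checked:

```python
def wildcard_entries(policies):
    """Return {policy: [selections containing '*']} for policies that have any."""
    flagged = {}
    for policy, selections in policies.items():
        hits = [entry for entry in selections if "*" in entry]
        if hits:
            flagged[policy] = hits
    return flagged


# Hypothetical policies and backup selections, for illustration only.
policies = {
    "FlatFile_Daily": ["D:\\Data\\*", "E:\\Exports"],
    "Exchange_Daily": ["Microsoft Information Store:\\"],
}

flagged = wildcard_entries(policies)
print(flagged)
```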


SOLUTION