
NetBackup's inability to prioritise jobs properly - seeking workaround

Created: 23 Jul 2012 | 12 comments

I do NDMP backups of two NetApp filers using a shared tape library. All policies use the same tape pool and retention period. I want to be able to prioritise jobs so that more important ones run first. Unfortunately, NetBackup is incapable of honouring job priorities for jobs that run via different NDMP devices (specifically, this seems to be a storage unit limitation: job priority is ignored for jobs running across different storage units).

What happens is that when a job completes on a particular filer, if there are any other jobs queued against the same filer then these jobs will always take priority over any other job queued against a different filer regardless of job priority. So a job with a priority of 0 on the same filer will be picked over a job with a priority of 999999 on the other filer.

There is no good reason for this, since all jobs have the same retention period and use the same media pool, so they can share the same tapes. When switching from one filer to another the tape is not unloaded, so please don't offer this as a reason for the undesirable behaviour.

The reason I know Symantec doesn't support job prioritisation across storage units is that I had a call open with them for over a year trying to get this fixed. Eventually they decided it was easier to say the product was working as designed rather than to fix what I consider to be a major design flaw in what is apparently an enterprise backup solution. The only solution that Symantec could come up with was to buy a new tape drive and dedicate it to the filer that keeps losing out. This is on NetBackup 7.1. I'm happy to upgrade to 7.5, but since I have no reason to believe the issue will ever be fixed by Symantec, I don't see the point of upgrading now.

So what I'm looking for is a workaround to force jobs for filer A to run ahead of jobs queued against filer B when a job is already running against filer B. Currently the only way is to manually cancel all queued filer B jobs and then re-queue them after filer A gets hold of the tape drive. I was hoping to be able to programmatically remove a tape drive from filer B by downing the path, but this can't be done while a backup is running that uses that path.

Regards,
Dale

12 Comments

Marianne

Work through your local Symantec SE to submit a formal 'request for enhancement' to NetBackup Product Management. 

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

revaroo

>> So what I'm looking for is a workaround to force jobs for filer A to run ahead of jobs queued against filer B when a job is already running against filer B

So you want any jobs already running on filer B to be cancelled so filer A can have its backups started?

Are these jobs on filer B Active or Queued?

If Active, there is nothing you can do but wait. If that is the case, working as designed.

What happens if filer A's backup hits the queue and then filer B's backup hits the queue - does filer B still run first?

Marianne

If I read 'between the lines' it seems that you have media sharing enabled? Therefore no need for media unload?

It seems as if NBU handles an STU change in the same manner as a media change; hence my suggestion to submit a request for enhancement.

For everyone else's benefit - here is how NBU uses Job Priorities and why media unload is applicable:

http://www.symantec.com/docs/HOWTO33114

 

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

jim dalton

Hi Dale

There has to be a workaround somewhere somehow even if it means reverting to a manual schedule and some hacking.

Try this for size: I have a number of policies that have no schedule but they are generally active and are all the same 'flavour'.

I have a script which, via cron on the master, checks these policies, discards the ones that aren't active and manually executes the remainder.

Not knowing quite how you are set up, maybe something like this could be used: have the policies for A be autoscheduled, then script the backup for B. Check if there are any A backups still running or queued; if not, execute the B policy, then take care not to re-execute it (stick a local flag someplace, which gets removed by another cron job or a pre/post step around policy A).
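Very roughly, and completely untested, the cron script could look something like this in Python. The FILER_A_/FILER_B_ policy names, the "Full" schedule name and the bpdbjobs column positions are just guesses to illustrate the idea - adjust them for your setup, and it assumes the NetBackup commands are on the PATH:

#!/usr/bin/env python
# Untested sketch of the cron-driven check described above. Assumptions:
#   * filer A / filer B policies use the made-up prefixes "FILER_A_" / "FILER_B_"
#   * "bpdbjobs -report" has the job state in column 3 and the policy name
#     in column 5 (verify against your own output first)
#   * the "Full" schedule name passed to bpbackup is invented
import os
import subprocess

A_PREFIX = "FILER_A_"                          # hypothetical policy naming
B_POLICIES = ["FILER_B_vol1", "FILER_B_vol2"]  # hypothetical unscheduled policies
FLAG = "/usr/openv/tmp/filer_b_kicked"         # guard against re-execution

def filer_a_busy():
    """True if any filer A job is still Active or Queued."""
    out = subprocess.check_output(["bpdbjobs", "-report"]).decode()
    for line in out.splitlines():
        cols = line.split()
        if len(cols) > 4 and cols[2] in ("Active", "Queued") \
                and cols[4].startswith(A_PREFIX):
            return True
    return False

if not os.path.exists(FLAG) and not filer_a_busy():
    for policy in B_POLICIES:
        # kick off a manual backup of the unscheduled filer B policy
        subprocess.check_call(["bpbackup", "-i", "-p", policy, "-s", "Full"])
    # set the flag; clear it again from another cron job or a pre/post
    # step around the filer A policies
    open(FLAG, "w").close()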

It might help if you described the storage arrangement too.

Jim

Will Restore

"So what I'm looking for is a workaround to force jobs for filer A to run ahead of jobs queued against filer B when a job is already running against filer B"

Once a job has started it will run to completion. As revaroo suggests, there is no mechanism in NetBackup to supersede an active job.

Will Restore -- where there is a Will there is a way

DaleW

"So what I'm looking for is a workaround to force jobs for filer A to run ahead of jobs queued against filer B when a job is already running against filer B"

I'm not looking for any active jobs to be cancelled, just for the ones in the queue that run on filer A to have priority over the ones in the queue that run on filer B. So when a filer B job finishes it would be nice to have a way to make a filer A job run ahead of the next filer B job in the queue. In other words, it would be nice if Symantec would implement a priority mechanism that actually works in a sensible way.

Will Restore

Understanding the Job Priority setting

NetBackup uses the Job Priority setting as a guide. Requests with a higher priority do not always receive resources before a request with a lower priority.

The NetBackup Resource Broker (NBRB) maintains resource requests for jobs in a queue. NBRB evaluates the requests sequentially and sorts them based on the following criteria:

■ The request's first priority.
■ The request's second priority.
■ The birth time (when the Resource Broker receives the request).

The first priority is weighted more heavily than the second priority, and the second priority is weighted more heavily than the birth time.

Because a request with a higher priority is listed in the queue before a request with a lower priority, the request with a higher priority is evaluated first. Even though the chances are greater that the higher priority request receives resources first, it is not always definite.

The following scenarios present situations in which a request with a lower priority may receive resources before a request with a higher priority:

■ A higher priority job needs to unload the media in a drive because the retention level (or the media pool) of the loaded media is not what the job requires. A lower priority job can use the media that is already loaded in the drive. To maximize drive utilization, the Resource Broker gives the loaded media and drive pair to the job with the lower priority.

■ A higher priority job is not eligible to join an existing multiplexing group but a lower priority job is eligible to join the multiplexing group. To continue spinning the drive at the maximum rate, the lower priority job joins the multiplexing group and runs.

■ The Resource Broker receives resource requests for jobs and places the requests in a queue before processing them. New resource requests are sorted and evaluated every 5 minutes. Some external events (a new resource request or a resource release, for example) can also trigger an evaluation. If the Resource Broker receives a request of any priority while it processes requests in an evaluation cycle, the request is not evaluated until the next evaluation cycle starts.
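As a toy illustration of that documented ordering (plain Python, just to show the sort order - this is not how NetBackup actually implements it, and the job names and numbers are made up):

# Toy illustration only -- not NetBackup code. It just shows the documented
# sort: first priority, then second priority, then birth time (earlier wins
# ties). The job names and numbers are made up.
from collections import namedtuple

Request = namedtuple("Request", "job first_priority second_priority birth_time")

queue = [
    Request("filer_B_full", 0,      0, 100),   # low priority, submitted earlier
    Request("filer_A_full", 999999, 0, 200),   # high priority, submitted later
]

ordered = sorted(queue, key=lambda r: (-r.first_priority,
                                       -r.second_priority,
                                       r.birth_time))

for r in ordered:
    print(r.job)
# filer_A_full is evaluated first, but as the scenarios above show, being
# evaluated first does not guarantee it actually gets the drive.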

Will Restore -- where there is a Will there is a way

DaleW

If only it worked that way, but it doesn't. They left out the bit about jobs from the same storage unit ALWAYS taking precedence over ANY other job if a job from that storage unit currently has a lock on a tape drive. I suggested they update the documentation to reflect this limitation, but I doubt they'll bother.

DaleW

I have a plan for a workaround:

  • When the filer A jobs are queued, cancel all filer B jobs in the queue (not the active ones)
  • Run nbpemreq -suspend_scheduling to prevent any filer B jobs entering the queue
  • Wait for a filer A job to grab a tape drive
  • Re-queue the cancelled filer B jobs
  • Run nbpemreq -resume_scheduling

It's messy but will hopefully do the job.
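To save anyone else the typing, here's a rough, untested Python sketch of the above. The FILER_A_/FILER_B_ policy prefixes and the bpdbjobs column positions are assumptions for illustration only, it expects the NetBackup commands to be on the PATH, and "re-queueing" is approximated by re-running the cancelled policies with bpbackup -i:

#!/usr/bin/env python
# Untested sketch of the plan above. Assumptions to adjust:
#   * filer A / filer B policies use the made-up prefixes "FILER_A_" / "FILER_B_"
#   * "bpdbjobs -report" has the job ID in column 1, the state in column 3
#     and the policy in column 5 (check your version's output)
#   * cancelled filer B jobs are "re-queued" by simply re-running their
#     policies with bpbackup -i, which is close enough here because all the
#     policies share the same pool and retention
import subprocess
import time

A_PREFIX, B_PREFIX = "FILER_A_", "FILER_B_"    # hypothetical naming

def jobs(states):
    """Yield (jobid, state, policy) for jobs whose state is in `states`."""
    out = subprocess.check_output(["bpdbjobs", "-report"]).decode()
    for line in out.splitlines():
        cols = line.split()
        if len(cols) > 4 and cols[2] in states:
            yield cols[0], cols[2], cols[4]

# 1. Stop new jobs entering the queue while we shuffle things around.
subprocess.check_call(["nbpemreq", "-suspend_scheduling"])

# 2. Cancel the queued (not active) filer B jobs, remembering their policies.
cancelled = set()
for jobid, state, policy in jobs({"Queued"}):
    if policy.startswith(B_PREFIX):
        subprocess.check_call(["bpdbjobs", "-cancel", jobid])
        cancelled.add(policy)

# 3. Wait for a filer A job to go active, i.e. to grab a tape drive.
while not any(p.startswith(A_PREFIX) for _, _, p in jobs({"Active"})):
    time.sleep(60)

# 4. Put the filer B work back and resume normal scheduling.
for policy in cancelled:
    subprocess.check_call(["bpbackup", "-i", "-p", policy])  # add -s <schedule> if needed
subprocess.check_call(["nbpemreq", "-resume_scheduling"])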

Mark_Solutions

Just out of interest, what job priority numbers do you actually use?

You say a job with a value of 0 takes priority over one with a value of 999999 - are those the values you actually use?

I only ask because (a long time ago) I was told that there is some sort of binary thing going on with the priority value, where if you go too high with a number it actually ends up counting for less (wow - does that make any sense!!).

So I always try to keep things between 1 and 9 if possible - I haven't noticed that this doesn't work, but maybe I haven't watched it all going on in real time?

Hope this helps in some way

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!

DaleW

I've tried a range of priorities, from one higher than the filer B jobs to the maximum, and a few in between. I've even tried reversing the priorities, so that filer A jobs have a lower priority than filer B jobs (yeah, I wasn't really expecting that one to work).

The response from Symantec support was that what I want is not possible due to the way resource allocation works, so I don't think there's a magic priority value that'll make it suddenly work. I think the issue is that the next filer B job starts before the storage unit releases the tape drive, and a filer A job can't start until it gets access to the tape drive.

If there were a way to suspend a queued job to prevent it running but leave it in the queue, then a workaround would be easy, but alas, the backup gods were not smiling on me when we chose NetBackup as our backup solution.

At least the NetBackup scheduler is better than the [lack of] one in Data Protector, which we had before NetBackup.

Andy Welburn

"I have a plan for a workaround:

  • When the filer A jobs are queued, cancel all filer B jobs in the queue (not the active ones)
  • Run nbpemreq -suspend_scheduling to prevent any filer B jobs entering the queue
  • Wait for a filer A job to grab a tape drive
  • Re-queue the cancelled filer B jobs
  • Run nbpemreq -resume_scheduling

It's messy but will hopefully do the job."

This is the only way I've ever managed to overcome the shortcoming in job prioritisation due to resource allocation.