Video Screencast Help
Search Video Help Close Back
to help
New in the Rewards Catalog: Vouchers for "Symantec Technical Specialist" and "Symantec Certified Specialist" exams.

Many Jobs delayed "granting resources" and "connecting"

Updated: 28 Oct 2010 | 29 comments
Saul Grysman's picture
+1 1 Vote
Login to vote

We've recently upgraded from 6.5.3 to 6.5.4 and we've added an additional Media Server into the environment (all Solaris based).  We are experiencing significant delays in starting jobs - Unix, Windows, SQL and Oracle/RMAN - and writing to disk or tape.  The example below is typical: 12 minute delay in granting resources and 15 minute delay is connecting.  With the prior version, these times were in seconds.  We've restarted our entire environment to no avail.  Any help would be appreciated.

03/03/2010 11:14:51 - requesting resource nburobot
03/03/2010 11:14:51 - requesting resource nbucore1.NBU_CLIENT.MAXJOBS.sqlmisp

03/03/2010 11:14:51 - requesting resource nbucore1.NBU_POLICY.MAXJOBS.sql-sqlmisp

03/03/2010 11:27:27 - granted resource  nbucore1.NBU_CLIENT.MAXJOBS.sqlmisp
03/03/2010 11:27:27 - granted resource  nbucore1.NBU_POLICY.MAXJOBS.sql-sqlmisp
03/03/2010 11:27:27 - granted resource  T01171
03/03/2010 11:27:27 - granted resource  sl8500_drv20
03/03/2010 11:27:27 - granted resource  nbucore1_sl8500
03/03/2010 11:27:29 - estimated 0 kbytes needed
03/03/2010 11:27:30 - started process bpbrm (pid=4546)
03/03/2010 11:27:31 - connecting
03/03/2010 11:27:42 - connected; connect time: 0:00:00
03/03/2010 11:42:43 - mounting T01171
03/03/2010 11:43:30 - mounted T01171; mount time: 0:00:47
03/03/2010 11:43:31 - positioning T01171 to file 596
03/03/2010 11:44:53 - positioned T01171; position time: 0:01:22
03/03/2010 11:44:53 - begin writing
03/03/2010 11:45:14 - end writing; write time: 0:00:21
the requested operation was successfully completed (0)

discussion Filed Under:

Comments

babyd's picture
03
Mar
2010
1 Vote +1
Login to vote

Check the nbrb.conf file and

Check the nbrb.conf file and nbrb log file for any errors;

Example format for parameters in nbrb.conf:
<install_path>\Veritas\NetBackup\var\global\nbrb.conf

By default all values are 0.

SECONDS_FOR_EVAL_LOOP_RELEASE = 180
RESPECT_REQUEST_PRIORITY = 0
DO_INTERMITTENT_UNLOADS = 1

More info:
http://seer.entsupport.symantec.com/docs/300442.htm

Saul Grysman's picture
03
Mar
2010
1 Vote +1
Login to vote

The path for Unix is

The path for Unix is /usr/openv/var/global/nbrb.conf which is what we're using.

We did not have the conf file or the log directory.  I added both & will monitor, and will update this thread.

Thanks for the feedback.  If anyone else might have any suggestions, please don't hesitate.

Saul Grysman's picture
03
Mar
2010
1 Vote +1
Login to vote

Jobs are still queuing.  They

Jobs are still queuing.  They get hung getting resources granted (see output below) and they get hung after connecting:

From original post above:
03/03/2010 11:27:42 - connected; connect time: 0:00:00
03/03/2010 11:42:43 - mounting T01171

The job below was queued from 16:37 until 16:43 when it listed an "info" message.

03/03/2010 16:37:14 - requesting resource nburobot
03/03/2010 16:37:14 - requesting resource nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc
03/03/2010 16:37:14 - requesting resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
03/03/2010 16:43:45 - Info nbrb (pid=21875) Limit has been reached for the logical resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
03/03/2010 16:43:47 - granted resource  nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc
03/03/2010 16:43:47 - granted resource  nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc

The nbrb log had no reference to the job until 16:43
03/03/10 16:43:44.555 [ResBroker_i::requestResources] RequestSeq:
Request=0 provider=MPXProvider resourcename=MpxRequest-577611 MPXGroupRequest maxmpx=1
    SubRequest=0 provider=DriveOperationProvider resourcename=nburobot userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=nburobot mediaPool=DB retentionLevel=2 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=0 usageType=1 client=webdev15mtc shareGroup=*ANY* } }
Request=1 provider=NamedResourceProvider resourcename=nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc CountedResourceRequest { max=99 }
Request=2 provider=NamedResourceProvider resourcename=nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc CountedResourceRequest { max=1 }

03/03/10 16:43:45.577 [CountedProvider::allocate] cannot allocate, resource is busy nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc (current_ref_count=1, max_ref_count=1)
03/03/10 16:43:45.577 [Error] V-118-171 request ID {EF1BA70C-1DD1-11B2-BFFC-0003BA21D3A5} failed with status 1 (Resource is currently in use); releasing 1 allocated resources
03/03/10 16:43:45.578 [ResBroker_i::holdOne] tracking error status (RBErrorInfo: errorStatus=2005034 retryClass="RB Defined Retry" retryReason=24 retryText="LOG 1267652625 4 nbrb 21875 Limit has been reached for the logical resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc" retryResource=nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc)
03/03/10 16:43:45.578 [ResBroker_i::doEvaluation] <EVALPERF> Evaluated 2 requests, evalTime=106 msec, numSucceeded=0, numQueued=1, numPended=0, numFailed=1, numMultipleCopyJobs=0, numMdsCacheMisses=0, restartedEvalCycle=0, numDriveReleasesBeforeEval=0, numDrivesAvailStart=7, numDrivesAvailEnd=7, requestQueueSize=1
03/03/10 16:43:45.749 [ResBroker_i::doEvaluation] CONF: releaseTimeSeconds = 180, respectRequestPriority = false, doIntermittentUnloads = true, maxHighPriorityQueueSize = 100

Anything else I can provide, please advise.

babyd's picture
04
Mar
2010
0 Votes 0
Login to vote

Can you enable bpbrm,bpdbm

Can you enable bpbrm,bpdbm logs ?
Change the verbose level to 5 and rerun the backup.

venkat6.k@gmail.com's picture
24
Mar
2010
0 Votes 0
Login to vote

For windows <inastall

For windows <inastall path>\vertias\netbackup\logs directory, there you can find  xxx.bat file. excute that file will create the all log file for netbackup.
and in registry, set value debuglogs =5 for detailed logs

Saul Grysman's picture
04
Mar
2010
1 Vote +1
Login to vote

I was just told by Symantec

I was just told by Symantec that this is "a well known problem" and that they have engineering binaries to fix it.  I'll get them, install and try to report back.

(I suppose that it isn't as well known as they think.  Wish I knew about this before.)

David McMullin's picture
05
Mar
2010
1 Vote +1
Login to vote

can you let us know the

can you let us know the ETRACK numbers so we can check them out?

NBU 7.0.1 on Solaris 10
writing to EMC 4206 VTL
duplicating to LTO2 in SL8500
(Soon to be LTO5)
using ACSLS 7.3.1

bcblake's picture
05
Mar
2010
1 Vote +1
Login to vote

having same problem

tried everything in:

http://seer.entsupport.symantec.com/docs/297322.htm

Just moved EMM (/usr/openv/db/data) to it's own LUN (Clariion CX-3) this week, hoping the problem would be fixed but it's gotten worse.

Cleared all resource requests with nbrbutil -resetAll, started a few jobs which all went active fine, but as we add more jobs, the time to get a resource gets longer and longer... it's absolutely ridiculous. Jobs now waiting over 25 minutes just to get a MAXJOBS resource... this is getting very frustrating.

Solaris 10 master (which was just updated to Update 7), T5220 64 procs and 64G of RAM and nbrb makes this whole thing come to a crawl.

Current nbrb.conf:

RESPECT_REQUEST_PRIORITY = 1
SECONDS_FOR_EVAL_LOOP_RELEASE = 270
DO_INTERMITTENT_UNLOADS = 1

I'm going to let it bake for a while and see how it behaves but it's not looking promising.

Please let us know if you have an Etracks and if they have EEBs for this.

Thanks.

cgoliver's picture
12
Mar
2010
1 Vote +1
Login to vote

We have been experiencing this problem, also

It's been four of five months of this behavior and no solution from Symantec.

Originally, the problem started when we added two new tape drives. Upgrading from 6.5.3.1 to 6.5.4 didn't resolve this specific problem.

Partial analysis of the problem suggests that a communication delay is occuring between the client, media server, master. One month ago we made changes for some clients, media servers to ensure all connection went through vnetd, after we found that callback were being done on high random ports, which is a no-no for our firewall.

Sincerely,
Chris O.

bcblake's picture
16
Mar
2010
0 Votes 0
Login to vote

Found eTrack info and EEB for nbrb

All-

Thanks to one of the advanced TSEs (Jim Harris, highly recommended!), my case 411-519-814, yielded an EEB for nbrb for 6.5.4.

Mention any of these eTracks and you should be able to get it:

 

iPeng_Ops.book

ET1596861, ET1702148, ET1634241, ET1600913, ET1785238, ET1860650, ET1873458, ET1870125, ET1867225, ET1886274, ET1915680, ET1977194

One of these (ET1977194) was actually happening to me where bpduplicates would exit with a 96 error and nbrb would report 2005000 (EMM_ERROR_MediaUnavailable). So that saves me an extra call in and case to Symantec.

I am planning to apply this tomorrow and will report back status.

Brian

 

 

iPeng_Ops.book

bcblake's picture
16
Mar
2010
0 Votes 0
Login to vote

Also check out these kernel parms

A couple of other kernel parms that I ended up using that seem to help as well on the master (Solaris 10):

/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 8192

/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0 8192

/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 15000

/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 3600000

(I put these in /lib/svc/method/net-init to automatically set on boot time, and manually set them while I had NetBackup down)

Definitely disable tcp_fusion (see http://seer.entsupport.symantec.com/docs/306694.htm).

And setup project for kernel parameters (see http://seer.entsupport.symantec.com/docs/307610.htm on how to set them up):

project.max-msg-ids 256

project.max-sem-ids 1024

project.max-shm-ids 256

I didn't bother with shm-max since 8GB (1/4 of system memory) is fine for my master, as it doesn't handle client backups.

Modified /usr/openv/var/global/emm.conf to contain:

NUM_DB_CONNECTIONS=51

NUM_DB_BROWSE_CONNECTIONS=50

NUM_ORB_THREADS=61

 

And /usr/openv/var/global/server.conf (adding the -gn to be 6 more than NUM_DB_CONNECTIONS), and also up'ed the cache to 75M to start (500M high water mark):

 -n NB_<master>

   -x tcpip(LocalOnly=YES;ServerPort=13785)  -gp 4096 -ct+ -gd DBA -gk DBA -gl DBA -ti 0 -c 75M -ch 500M

 -cl 75M -gn 57 -zl -os 1M  -o /usr/openv/db//log/server.log

 -ud


Hope those can help some of you!


 

cgoliver's picture
17
Mar
2010
0 Votes 0
Login to vote

Can you post the name of the

Can you post the name of the EEB? Back on 6.5.3.1, they had me install an eeb for nbrb. Didn't help this problem, but it did addresss a different issue.

I've changed some of our tcp params, but we are running Solaris 9 so no application profiles for kernel params :(....

MILB65's picture
17
Mar
2010
0 Votes 0
Login to vote

We initially had the same problem. Here is how we fixed it:

By the way, there were no ebb's that we put into place.

Turn up the logging for nbemm
vxlogcfg -a -p NB -o 111 -s DebugLevel-6 -s DiagnosticLevel=6

Then monitor, wait till the long delay then run:
vxlogview -i 111 --display D,T,p,t,c,w,x -t 02:00:00 > <somewhere>

then run
grep 'gethostbyname failed for` <somewhere> |grep dns_hosts_equal |awk '{print $11}' |sort | uniq > <someplace new>

so now for the explanation.

nbrb is probably getting hung up in a "learning mode" for a server that it can not resolve.  however, while it learns, it is hanging up the backups of the servers that it does not have to learn about.

Omar Villa's picture
17
Mar
2010
0 Votes 0
Login to vote

NBRB EEB Versions

Symantec is on NBRB EEB version 5 after 6.5.4, be sure you get version 6, because they still have a lot of issues with the resource handeling.

Omar A Villa

Netbackup Expert

These are my personal views and not those of the company I work for

bcblake's picture
17
Mar
2010
0 Votes 0
Login to vote

nbrb fun

Not sure if I got version 5 of nbrb EEB (would be nice to know how to tell which version!). but it was for NBU 6.5.4.

cgoliver- I had same problems in 6.5.3.1, they told me there were fixes in 6.5.4... I continued to run into problems (one of which was the one from the ET I mentioned above).

I applied the EEB I just got (no idea which version) and it seems to be running ok for now.

It still takes a heckuva lot more time than I would think when there are only a couple hundred jobs in queue, and looking at the evaluation performance

Looking at evaluation cycle performance, its looking a little better (evaluated 400-500 requests in the 3 minutes we have set for SECONDS_FOR_EVAL_LOOP_RELEASE), whereas it used to be in the 30-40 number of requests.

Also wondered if it has anything to do with how often the catalog backup runs, which cleans out the Sybase transaction log. So make sure you get good hot catalog backups often!

And as for Solaris 9, use the appropriate shared memory, semaphore and message queues kernel parameters (see the NBU 6.5 Perf Tuning Guide, they should be set in /etc/system I believe for Solaris 9), or check with your Solaris admin or google it.

CRZ's picture
18
Mar
2010
0 Votes 0
Login to vote

EEBs and nbrb - the latest news I found

You'd probably be better off applying 6.5.5, because you can get the nbrb EEB for 6.5.5 via a TechFile:

DOCUMENTATION: Issues resolved in the nbrb package for NetBackup Resource Broker, included in Etrack 1924099.
 http://support.veritas.com/docs/346920

The 6.5.4 bundle is still only available through a support case.  (I think we're only at version 5, by the way - I haven't seen a version 6)

 

cgoliver's picture
18
Mar
2010
0 Votes 0
Login to vote

We usually stay back at least

We usually stay back at least one revision.

CRZ's picture
18
Mar
2010
0 Votes 0
Login to vote

Fair enough

The Etrack for the 6.5.4 EEB is 1747025.  (Omar will correct me if I'm wrong, I hope!)  Mention it to the engineer handling your case.

Really, though, by not installing 6.5.5 you're missing out on a lot of other fixes.  (As a representative of support, I am obligated to add this to every post.  Also, it's totally true!)

 

Omar Villa's picture
18
Mar
2010
0 Votes 0
Login to vote

NBPEM Issues...

      Today I saw the master server just not running jobs after NBRB EEV v5 was install, the difference is the jobs just didn't start, they dont go to the queue, no NBRB assigments nothing, as if the policies where disabled and more weird it didnt' happen with all the policies, so I bounce the master but previously raise the Originator 116 (NBPEM) to level 3 just to get some logs. Everything is back and scheduling, but we still need to see how runs tonight.

Have anyone of you see this behavior after the EEB was deploy?

Please let me know.
Regards

Omar A Villa

Netbackup Expert

These are my personal views and not those of the company I work for

Omar Villa's picture
18
Mar
2010
0 Votes 0
Login to vote

NBPEM Issues...

      Today I saw the master server just not running jobs after NBRB EEV v5 was install, the difference is the jobs just didn't start, they dont go to the queue, no NBRB assigments nothing, as if the policies where disabled and more weird it didnt' happen with all the policies, so I bounce the master but previously raise the Originator 116 (NBPEM) to level 3 just to get some logs. Everything is back and scheduling, but we still need to see how runs tonight.

Have anyone of you see this behavior after the EEB was deploy?

Please let me know.
Regards

Omar A Villa

Netbackup Expert

These are my personal views and not those of the company I work for

cgoliver's picture
23
Mar
2010
0 Votes 0
Login to vote

NBRB Issues...

Upgraded to 6.5.5 today, we shall see what progress has been made.

CRZ's picture
23
Mar
2010
0 Votes 0
Login to vote

Did you also apply the 6.5.5 EEB?

I posted it earlier, but here it is again:

DOCUMENTATION: Issues resolved in the nbrb package for NetBackup Resource Broker, included in Etrack 1924099.
 http://support.veritas.com/docs/346920

 

cgoliver's picture
26
Mar
2010
0 Votes 0
Login to vote

NBRB 6.5.5 -- EEB

Just grabbed it and will patch today. Came in this morning for find jobs still queued up from last night and looking at the logs, it appears that we had many jobs that didn't finish.

Back to the drawing board.

CRZ --- Can you answer this? We have been told that 30 tape drives should not be shared amongst 25 media servers. True or False?

cgoliver's picture
26
Mar
2010
0 Votes 0
Login to vote

Status

The EEB in addition to the 6.5.5 update on the master has not done much of anything to alleviate our resources problems. The conversions are slow and utilization of the tape drives is sub optimal.

We spent three months working with third level support and got no where on resolving this issue. Still limping along with a million dollar infrastructure that will not backup more than about 20TB per day.

The media servers are still running 6.5.4, maybe we need to get those up to 6.5.5 and add the EEB

cgoliver's picture
08
Apr
2010
0 Votes 0
Login to vote

NBPEM, NBJM, VMD and who knows what else

We are still experiencing resource allocation problems with mount times averaging 15 minutes to one hour.

I have updated most of our media servers to 6.5.5, which has not helped much with the daily incrementals.

Last weekend we sent ~ 36TB to tape, but during the week our incrementals struggle to finish. The process of mounting tapes takes more than 15 minutes and in many cases it takes an hour or more.

Thanks,
Chris

bcblake's picture
08
Apr
2010
0 Votes 0
Login to vote

Other optimization possibilities?

cgoliver:

Have you tried splitting off your EMM db, db/images, log files directories (both legacy and VxUL), and maybe even your EMM transaction logs onto separate storage LUNs and file systems? This helped us tremendously when we were originally having nbrb and EMM performance problems.

Also, what kind of disk is EMM sitting on (local attached, SAN, etc.)?

When you say you're seeing a 15 minute delay in mounting tapes, is this a delay in waiting for the resources to be granted, or actually physically (or virtually in the case of a VTL) mounting a tape? If the latter, have you looked at syslog/messages file/EventLog with ltid in verbose mode to see if its actually getting delayed in the tape mount request being sent/acknowledged by the tape library? If it's a TLD type library, usually you'll see those messages in the syslog (opening/closing robotic path blah-blah-blah)...

Just trying to think of any other possibilities why there would be delays...

cgoliver's picture
09
Apr
2010
0 Votes 0
Login to vote

Current situation

The EMM database is local, db/images are split onto into two different mount points (one local, one SAN) some logging is placed on the same LUN as db/images.

Amit Karia's picture
16
Apr
2010
0 Votes 0
Login to vote

Same issue

We are facing same issue while duplicating from DSSU to tape
even after upgrading 6.5.5 to all our media servers.. Please update if applying nbrb binary has helped or not..

following is the error we are facing

Copy Num: 2, NBU status: 2005000, EMM status: No media is available

crowe's picture
28
Apr
2010
0 Votes 0
Login to vote

Same/Related Issue

Master Server is Linux based NB 6.5.4

Symptoms:  Client jobs sitting at "started process bpbrm (pid = xxxxxx)" where xxxxxx is the pid number.  I view this from Activity Monitor, double click job, select Detailed Status tab.   Here is the ouput:

04/27/2010 20:03:16 - requesting resource Any
04/27/2010 20:03:16 - requesting resource TPUSA-BACKUP3.NBU_CLIENT.MAXJOBS.TWI-BB
04/27/2010 20:03:16 - requesting resource TPUSA-BACKUP3.NBU_POLICY.MAXJOBS.LEX-WIN2003
04/27/2010 20:03:16 - granted resource  TPUSA-BACKUP3.NBU_CLIENT.MAXJOBS.TWI-BB
04/27/2010 20:03:16 - granted resource  TPUSA-BACKUP3.NBU_POLICY.MAXJOBS.LEX-WIN2003
04/27/2010 20:03:16 - granted resource  WZE401
04/27/2010 20:03:16 - granted resource  HP.ULTRIUM2-SCSI.002
04/27/2010 20:03:16 - granted resource  tpusa-backup2-hcart2-robot-tld-1
04/27/2010 20:03:16 - estimated 1137440 kbytes needed
04/27/2010 20:03:16 - begin Parent Job
04/27/2010 20:03:16 - begin Snapshot: Start Notify Script
04/27/2010 20:03:16 - started process RUNCMD (pid=14974)
04/27/2010 20:03:16 - ended process 0 (pid=14974)
Operation Status: 0
04/27/2010 20:03:16 - end Snapshot: Start Notify Script; elapsed time 0:00:00
04/27/2010 20:03:16 - begin Snapshot: Step By Condition
Operation Status: 0
04/27/2010 20:03:16 - end Snapshot: Step By Condition; elapsed time 0:00:00
04/27/2010 20:03:16 - begin Snapshot: Stream Discovery
Operation Status: 0
04/27/2010 20:03:17 - end Snapshot: Stream Discovery; elapsed time 0:00:01
04/27/2010 20:03:17 - begin Snapshot: Read File List
Operation Status: 0
04/27/2010 20:03:17 - end Snapshot: Read File List; elapsed time 0:00:00
04/27/2010 20:03:17 - begin Snapshot: Create Snapshot
04/27/2010 20:03:20 - begin Create Snapshot
04/27/2010 20:03:18 - started process bpbrm (pid=11314)

Jobs like this and a few others will sit in this status (with the State type of "active" on the Activity Monitor main page) until the backup window expires which is 10 hours.  This occurs with different clients in different policies.  All the while other jobs will be completing around these.  All of the clients that this occurs on are windows boxes doing flat file backups.  Some Windows with flat file backups still run normally.  Linux backups run normally.  Windows SQL backups all run normally.

This behavior has not been observed until all Windows flat file policies were set to to Multiplexing 2 and "allow multiple data streams".  Tonight I am going to change all of these windows policies back to multiplexing 1 and uncheck "allow multiple data streams" and see if this re-occurs.

In writing this post I also found a pattern.  It appears that all of the clients stuck in this status have NB client 6.0.  I am going to put all of these 6.0 clients experiencing the issue in a separate policy without streaming and multiplexing and see if the issue re-occurs.  I have not yet found a 6.5 client that behaves this way.  Needs more investigation.  Will post more as I discover it.