Many Jobs delayed "granting resources" and "connecting"
We've recently upgraded from 6.5.3 to 6.5.4 and we've added an additional Media Server into the environment (all Solaris based). We are experiencing significant delays in starting jobs - Unix, Windows, SQL and Oracle/RMAN - and writing to disk or tape. The example below is typical: 12 minute delay in granting resources and 15 minute delay is connecting. With the prior version, these times were in seconds. We've restarted our entire environment to no avail. Any help would be appreciated.
03/03/2010 11:14:51 - requesting resource nburobot
03/03/2010 11:14:51 - requesting resource nbucore1.NBU_CLIENT.MAXJOBS.sqlmisp
03/03/2010 11:14:51 - requesting resource nbucore1.NBU_POLICY.MAXJOBS.sql-sqlmisp
03/03/2010 11:27:27 - granted resource nbucore1.NBU_CLIENT.MAXJOBS.sqlmisp
03/03/2010 11:27:27 - granted resource nbucore1.NBU_POLICY.MAXJOBS.sql-sqlmisp
03/03/2010 11:27:27 - granted resource T01171
03/03/2010 11:27:27 - granted resource sl8500_drv20
03/03/2010 11:27:27 - granted resource nbucore1_sl8500
03/03/2010 11:27:29 - estimated 0 kbytes needed
03/03/2010 11:27:30 - started process bpbrm (pid=4546)
03/03/2010 11:27:31 - connecting
03/03/2010 11:27:42 - connected; connect time: 0:00:00
03/03/2010 11:42:43 - mounting T01171
03/03/2010 11:43:30 - mounted T01171; mount time: 0:00:47
03/03/2010 11:43:31 - positioning T01171 to file 596
03/03/2010 11:44:53 - positioned T01171; position time: 0:01:22
03/03/2010 11:44:53 - begin writing
03/03/2010 11:45:14 - end writing; write time: 0:00:21
the requested operation was successfully completed (0)
Comments
Check the nbrb.conf file and
Check the nbrb.conf file and nbrb log file for any errors;
Example format for parameters in nbrb.conf:
<install_path>\Veritas\NetBackup\var\global\nbrb.conf
By default all values are 0.
SECONDS_FOR_EVAL_LOOP_RELEASE = 180
RESPECT_REQUEST_PRIORITY = 0
DO_INTERMITTENT_UNLOADS = 1
More info:
http://seer.entsupport.symantec.com/docs/300442.htm
The path for Unix is
The path for Unix is /usr/openv/var/global/nbrb.conf which is what we're using.
We did not have the conf file or the log directory. I added both & will monitor, and will update this thread.
Thanks for the feedback. If anyone else might have any suggestions, please don't hesitate.
Jobs are still queuing. They
Jobs are still queuing. They get hung getting resources granted (see output below) and they get hung after connecting:
From original post above:
03/03/2010 11:27:42 - connected; connect time: 0:00:00
03/03/2010 11:42:43 - mounting T01171
The job below was queued from 16:37 until 16:43 when it listed an "info" message.
03/03/2010 16:37:14 - requesting resource nburobot
03/03/2010 16:37:14 - requesting resource nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc
03/03/2010 16:37:14 - requesting resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
03/03/2010 16:43:45 - Info nbrb (pid=21875) Limit has been reached for the logical resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
03/03/2010 16:43:47 - granted resource nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc
03/03/2010 16:43:47 - granted resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc
The nbrb log had no reference to the job until 16:43
03/03/10 16:43:44.555 [ResBroker_i::requestResources] RequestSeq:
Request=0 provider=MPXProvider resourcename=MpxRequest-577611 MPXGroupRequest maxmpx=1
SubRequest=0 provider=DriveOperationProvider resourcename=nburobot userSequence=0 RB::StorageUnitRequest { StorageUnitRequest { storageUnit=nburobot mediaPool=DB retentionLevel=2 mustUseLocalMediaServer=no failOnError=no mpxRequired=no mustBeNdmp=no getMaxFreeSpace=no minFreeSpaceKBytes=0 usageType=1 client=webdev15mtc shareGroup=*ANY* } }
Request=1 provider=NamedResourceProvider resourcename=nbucore1.NBU_CLIENT.MAXJOBS.webdev15mtc CountedResourceRequest { max=99 }
Request=2 provider=NamedResourceProvider resourcename=nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc CountedResourceRequest { max=1 }
03/03/10 16:43:45.577 [CountedProvider::allocate] cannot allocate, resource is busy nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc (current_ref_count=1, max_ref_count=1)
03/03/10 16:43:45.577 [Error] V-118-171 request ID {EF1BA70C-1DD1-11B2-BFFC-0003BA21D3A5} failed with status 1 (Resource is currently in use); releasing 1 allocated resources
03/03/10 16:43:45.578 [ResBroker_i::holdOne] tracking error status (RBErrorInfo: errorStatus=2005034 retryClass="RB Defined Retry" retryReason=24 retryText="LOG 1267652625 4 nbrb 21875 Limit has been reached for the logical resource nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc" retryResource=nbucore1.NBU_POLICY.MAXJOBS.sql-webdev15mtc)
03/03/10 16:43:45.578 [ResBroker_i::doEvaluation] <EVALPERF> Evaluated 2 requests, evalTime=106 msec, numSucceeded=0, numQueued=1, numPended=0, numFailed=1, numMultipleCopyJobs=0, numMdsCacheMisses=0, restartedEvalCycle=0, numDriveReleasesBeforeEval=0, numDrivesAvailStart=7, numDrivesAvailEnd=7, requestQueueSize=1
03/03/10 16:43:45.749 [ResBroker_i::doEvaluation] CONF: releaseTimeSeconds = 180, respectRequestPriority = false, doIntermittentUnloads = true, maxHighPriorityQueueSize = 100
Anything else I can provide, please advise.
Can you enable bpbrm,bpdbm
Can you enable bpbrm,bpdbm logs ?
Change the verbose level to 5 and rerun the backup.
For windows <inastall
For windows <inastall path>\vertias\netbackup\logs directory, there you can find xxx.bat file. excute that file will create the all log file for netbackup.
and in registry, set value debuglogs =5 for detailed logs
I was just told by Symantec
I was just told by Symantec that this is "a well known problem" and that they have engineering binaries to fix it. I'll get them, install and try to report back.
(I suppose that it isn't as well known as they think. Wish I knew about this before.)
can you let us know the
can you let us know the ETRACK numbers so we can check them out?
NBU 7.0.1 on Solaris 10
writing to EMC 4206 VTL
duplicating to LTO2 in SL8500
(Soon to be LTO5)
using ACSLS 7.3.1
having same problem
tried everything in:
http://seer.entsupport.symantec.com/docs/297322.htm
Just moved EMM (/usr/openv/db/data) to it's own LUN (Clariion CX-3) this week, hoping the problem would be fixed but it's gotten worse.
Cleared all resource requests with nbrbutil -resetAll, started a few jobs which all went active fine, but as we add more jobs, the time to get a resource gets longer and longer... it's absolutely ridiculous. Jobs now waiting over 25 minutes just to get a MAXJOBS resource... this is getting very frustrating.
Solaris 10 master (which was just updated to Update 7), T5220 64 procs and 64G of RAM and nbrb makes this whole thing come to a crawl.
Current nbrb.conf:
RESPECT_REQUEST_PRIORITY = 1
SECONDS_FOR_EVAL_LOOP_RELEASE = 270
DO_INTERMITTENT_UNLOADS = 1
I'm going to let it bake for a while and see how it behaves but it's not looking promising.
Please let us know if you have an Etracks and if they have EEBs for this.
Thanks.
We have been experiencing this problem, also
It's been four of five months of this behavior and no solution from Symantec.
Originally, the problem started when we added two new tape drives. Upgrading from 6.5.3.1 to 6.5.4 didn't resolve this specific problem.
Partial analysis of the problem suggests that a communication delay is occuring between the client, media server, master. One month ago we made changes for some clients, media servers to ensure all connection went through vnetd, after we found that callback were being done on high random ports, which is a no-no for our firewall.
Sincerely,
Chris O.
Found eTrack info and EEB for nbrb
All-
Thanks to one of the advanced TSEs (Jim Harris, highly recommended!), my case 411-519-814, yielded an EEB for nbrb for 6.5.4.
Mention any of these eTracks and you should be able to get it:
iPeng_Ops.book
ET1596861, ET1702148, ET1634241, ET1600913, ET1785238, ET1860650, ET1873458, ET1870125, ET1867225, ET1886274, ET1915680, ET1977194
One of these (ET1977194) was actually happening to me where bpduplicates would exit with a 96 error and nbrb would report 2005000 (EMM_ERROR_MediaUnavailable). So that saves me an extra call in and case to Symantec.
I am planning to apply this tomorrow and will report back status.
Brian
iPeng_Ops.book
Also check out these kernel parms
A couple of other kernel parms that I ended up using that seem to help as well on the master (Solaris 10):
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 8192
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0 8192
/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 15000
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 3600000
(I put these in /lib/svc/method/net-init to automatically set on boot time, and manually set them while I had NetBackup down)
Definitely disable tcp_fusion (see http://seer.entsupport.symantec.com/docs/306694.htm).
And setup project for kernel parameters (see http://seer.entsupport.symantec.com/docs/307610.htm on how to set them up):
project.max-msg-ids 256
project.max-sem-ids 1024
project.max-shm-ids 256
I didn't bother with shm-max since 8GB (1/4 of system memory) is fine for my master, as it doesn't handle client backups.
Modified /usr/openv/var/global/emm.conf to contain:
NUM_DB_CONNECTIONS=51
NUM_DB_BROWSE_CONNECTIONS=50
NUM_ORB_THREADS=61
And /usr/openv/var/global/server.conf (adding the -gn to be 6 more than NUM_DB_CONNECTIONS), and also up'ed the cache to 75M to start (500M high water mark):
-n NB_<master>
-x tcpip(LocalOnly=YES;ServerPort=13785) -gp 4096 -ct+ -gd DBA -gk DBA -gl DBA -ti 0 -c 75M -ch 500M
-cl 75M -gn 57 -zl -os 1M -o /usr/openv/db//log/server.log
-ud
Hope those can help some of you!
Can you post the name of the
Can you post the name of the EEB? Back on 6.5.3.1, they had me install an eeb for nbrb. Didn't help this problem, but it did addresss a different issue.
I've changed some of our tcp params, but we are running Solaris 9 so no application profiles for kernel params :(....
We initially had the same problem. Here is how we fixed it:
By the way, there were no ebb's that we put into place.
Turn up the logging for nbemm
vxlogcfg -a -p NB -o 111 -s DebugLevel-6 -s DiagnosticLevel=6
Then monitor, wait till the long delay then run:
vxlogview -i 111 --display D,T,p,t,c,w,x -t 02:00:00 > <somewhere>
then run
grep 'gethostbyname failed for` <somewhere> |grep dns_hosts_equal |awk '{print $11}' |sort | uniq > <someplace new>
so now for the explanation.
nbrb is probably getting hung up in a "learning mode" for a server that it can not resolve. however, while it learns, it is hanging up the backups of the servers that it does not have to learn about.
NBRB EEB Versions
Symantec is on NBRB EEB version 5 after 6.5.4, be sure you get version 6, because they still have a lot of issues with the resource handeling.
Omar A Villa
Netbackup Expert
These are my personal views and not those of the company I work for
nbrb fun
Not sure if I got version 5 of nbrb EEB (would be nice to know how to tell which version!). but it was for NBU 6.5.4.
cgoliver- I had same problems in 6.5.3.1, they told me there were fixes in 6.5.4... I continued to run into problems (one of which was the one from the ET I mentioned above).
I applied the EEB I just got (no idea which version) and it seems to be running ok for now.
It still takes a heckuva lot more time than I would think when there are only a couple hundred jobs in queue, and looking at the evaluation performance
Looking at evaluation cycle performance, its looking a little better (evaluated 400-500 requests in the 3 minutes we have set for SECONDS_FOR_EVAL_LOOP_RELEASE), whereas it used to be in the 30-40 number of requests.
Also wondered if it has anything to do with how often the catalog backup runs, which cleans out the Sybase transaction log. So make sure you get good hot catalog backups often!
And as for Solaris 9, use the appropriate shared memory, semaphore and message queues kernel parameters (see the NBU 6.5 Perf Tuning Guide, they should be set in /etc/system I believe for Solaris 9), or check with your Solaris admin or google it.
EEBs and nbrb - the latest news I found
You'd probably be better off applying 6.5.5, because you can get the nbrb EEB for 6.5.5 via a TechFile:
DOCUMENTATION: Issues resolved in the nbrb package for NetBackup Resource Broker, included in Etrack 1924099.
http://support.veritas.com/docs/346920
The 6.5.4 bundle is still only available through a support case. (I think we're only at version 5, by the way - I haven't seen a version 6)
We usually stay back at least
We usually stay back at least one revision.
Fair enough
The Etrack for the 6.5.4 EEB is 1747025. (Omar will correct me if I'm wrong, I hope!) Mention it to the engineer handling your case.
Really, though, by not installing 6.5.5 you're missing out on a lot of other fixes. (As a representative of support, I am obligated to add this to every post. Also, it's totally true!)
NBPEM Issues...
Today I saw the master server just not running jobs after NBRB EEV v5 was install, the difference is the jobs just didn't start, they dont go to the queue, no NBRB assigments nothing, as if the policies where disabled and more weird it didnt' happen with all the policies, so I bounce the master but previously raise the Originator 116 (NBPEM) to level 3 just to get some logs. Everything is back and scheduling, but we still need to see how runs tonight.
Have anyone of you see this behavior after the EEB was deploy?
Please let me know.
Regards
Omar A Villa
Netbackup Expert
These are my personal views and not those of the company I work for
NBPEM Issues...
Today I saw the master server just not running jobs after NBRB EEV v5 was install, the difference is the jobs just didn't start, they dont go to the queue, no NBRB assigments nothing, as if the policies where disabled and more weird it didnt' happen with all the policies, so I bounce the master but previously raise the Originator 116 (NBPEM) to level 3 just to get some logs. Everything is back and scheduling, but we still need to see how runs tonight.
Have anyone of you see this behavior after the EEB was deploy?
Please let me know.
Regards
Omar A Villa
Netbackup Expert
These are my personal views and not those of the company I work for
NBRB Issues...
Upgraded to 6.5.5 today, we shall see what progress has been made.
Did you also apply the 6.5.5 EEB?
I posted it earlier, but here it is again:
DOCUMENTATION: Issues resolved in the nbrb package for NetBackup Resource Broker, included in Etrack 1924099.
http://support.veritas.com/docs/346920
NBRB 6.5.5 -- EEB
Just grabbed it and will patch today. Came in this morning for find jobs still queued up from last night and looking at the logs, it appears that we had many jobs that didn't finish.
Back to the drawing board.
CRZ --- Can you answer this? We have been told that 30 tape drives should not be shared amongst 25 media servers. True or False?
Status
The EEB in addition to the 6.5.5 update on the master has not done much of anything to alleviate our resources problems. The conversions are slow and utilization of the tape drives is sub optimal.
We spent three months working with third level support and got no where on resolving this issue. Still limping along with a million dollar infrastructure that will not backup more than about 20TB per day.
The media servers are still running 6.5.4, maybe we need to get those up to 6.5.5 and add the EEB
NBPEM, NBJM, VMD and who knows what else
We are still experiencing resource allocation problems with mount times averaging 15 minutes to one hour.
I have updated most of our media servers to 6.5.5, which has not helped much with the daily incrementals.
Last weekend we sent ~ 36TB to tape, but during the week our incrementals struggle to finish. The process of mounting tapes takes more than 15 minutes and in many cases it takes an hour or more.
Thanks,
Chris
Other optimization possibilities?
cgoliver:
Have you tried splitting off your EMM db, db/images, log files directories (both legacy and VxUL), and maybe even your EMM transaction logs onto separate storage LUNs and file systems? This helped us tremendously when we were originally having nbrb and EMM performance problems.
Also, what kind of disk is EMM sitting on (local attached, SAN, etc.)?
When you say you're seeing a 15 minute delay in mounting tapes, is this a delay in waiting for the resources to be granted, or actually physically (or virtually in the case of a VTL) mounting a tape? If the latter, have you looked at syslog/messages file/EventLog with ltid in verbose mode to see if its actually getting delayed in the tape mount request being sent/acknowledged by the tape library? If it's a TLD type library, usually you'll see those messages in the syslog (opening/closing robotic path blah-blah-blah)...
Just trying to think of any other possibilities why there would be delays...
Current situation
The EMM database is local, db/images are split onto into two different mount points (one local, one SAN) some logging is placed on the same LUN as db/images.
Same issue
We are facing same issue while duplicating from DSSU to tape
even after upgrading 6.5.5 to all our media servers.. Please update if applying nbrb binary has helped or not..
following is the error we are facing
Copy Num: 2, NBU status: 2005000, EMM status: No media is available
Same/Related Issue
Master Server is Linux based NB 6.5.4
Symptoms: Client jobs sitting at "started process bpbrm (pid = xxxxxx)" where xxxxxx is the pid number. I view this from Activity Monitor, double click job, select Detailed Status tab. Here is the ouput:
04/27/2010 20:03:16 - requesting resource Any
04/27/2010 20:03:16 - requesting resource TPUSA-BACKUP3.NBU_CLIENT.MAXJOBS.TWI-BB
04/27/2010 20:03:16 - requesting resource TPUSA-BACKUP3.NBU_POLICY.MAXJOBS.LEX-WIN2003
04/27/2010 20:03:16 - granted resource TPUSA-BACKUP3.NBU_CLIENT.MAXJOBS.TWI-BB
04/27/2010 20:03:16 - granted resource TPUSA-BACKUP3.NBU_POLICY.MAXJOBS.LEX-WIN2003
04/27/2010 20:03:16 - granted resource WZE401
04/27/2010 20:03:16 - granted resource HP.ULTRIUM2-SCSI.002
04/27/2010 20:03:16 - granted resource tpusa-backup2-hcart2-robot-tld-1
04/27/2010 20:03:16 - estimated 1137440 kbytes needed
04/27/2010 20:03:16 - begin Parent Job
04/27/2010 20:03:16 - begin Snapshot: Start Notify Script
04/27/2010 20:03:16 - started process RUNCMD (pid=14974)
04/27/2010 20:03:16 - ended process 0 (pid=14974)
Operation Status: 0
04/27/2010 20:03:16 - end Snapshot: Start Notify Script; elapsed time 0:00:00
04/27/2010 20:03:16 - begin Snapshot: Step By Condition
Operation Status: 0
04/27/2010 20:03:16 - end Snapshot: Step By Condition; elapsed time 0:00:00
04/27/2010 20:03:16 - begin Snapshot: Stream Discovery
Operation Status: 0
04/27/2010 20:03:17 - end Snapshot: Stream Discovery; elapsed time 0:00:01
04/27/2010 20:03:17 - begin Snapshot: Read File List
Operation Status: 0
04/27/2010 20:03:17 - end Snapshot: Read File List; elapsed time 0:00:00
04/27/2010 20:03:17 - begin Snapshot: Create Snapshot
04/27/2010 20:03:20 - begin Create Snapshot
04/27/2010 20:03:18 - started process bpbrm (pid=11314)
Jobs like this and a few others will sit in this status (with the State type of "active" on the Activity Monitor main page) until the backup window expires which is 10 hours. This occurs with different clients in different policies. All the while other jobs will be completing around these. All of the clients that this occurs on are windows boxes doing flat file backups. Some Windows with flat file backups still run normally. Linux backups run normally. Windows SQL backups all run normally.
This behavior has not been observed until all Windows flat file policies were set to to Multiplexing 2 and "allow multiple data streams". Tonight I am going to change all of these windows policies back to multiplexing 1 and uncheck "allow multiple data streams" and see if this re-occurs.
In writing this post I also found a pattern. It appears that all of the clients stuck in this status have NB client 6.0. I am going to put all of these 6.0 clients experiencing the issue in a separate policy without streaming and multiplexing and see if the issue re-occurs. I have not yet found a 6.5 client that behaves this way. Needs more investigation. Will post more as I discover it.
Would you like to reply?
Login or Register to post your comment.