Duplication Jobs running wild
After rebooting my NBU servers, I have over a thousand Duplication jobs trying to run.
A little info: we have two DD670s. Half of the backups go to one and half to the other; each backup is then duplicated to the other DD670 after it completes.
All that shows up for the job is this:
On the Job Overview Tab:
Job Type: Duplication
Master Server: <server name>
Job Policy: SLP_LCP_DD02_Weekly
Job Schedule: Dup
Priority: 0
On the Detailed Status Tab:
Nothing at top - all fields blank
in Status:
2/2/2013 1:46:29 AM - requesting resource LCM_dd01-su
2/2/2013 1:36:35 AM - Info nbrb(pid=3248) Limit has been reached for the logical resource LCM_dd01-su
I have over 1500 sitting in the queue like this. How do I keep these from launching? If I cancel them and clean them all up, 5-10 minutes later they all kick in again. It seems that even though they run and complete, they just queue up again and run again.
Baffled......
Thanks,
John
2/2/2013 1:36:35 AM - Info nbrb(pid=3248) Limit has been reached for the logical resource LCM_dd01-su
1) Do you see any other active jobs for LCM_dd01-su?
2) What is the maximum concurrent jobs setting for the STU LCM_dd01-su?
3) What is the maximum I/O streams setting for the disk pool associated with the STU LCM_dd01-su?
4) What is the output of /usr/openv/netbackup/bin/admincmd/nbstlutil report?
"It seems even though they run and complete, they just queue up again and run again." That definitely should not happen, unless the SLP defines one more copy for them.
Thanks Nagalla,
1) No, I don't even have a storage unit called LCM_dd01-su. They are only called dd02_su and dd02_su, and valid duplication jobs do not request this SU.
2) The valid dd_01 and dd02_su are both set to 90 concurrent jobs.
3) Max I/O is also set to 90 per volume.
4) Which part of that report are you specifically looking for? There are lots of options...
Hi,
"called dd02_su and dd02_su and valid duplication jobs do not request this su" ---> I assume those are dd01_su and dd02_su; correct me if I am wrong.
Please let us know the two SLP names that have the issue.
Please provide the output of the commands below:
nbstl -L -all_versions
/usr/openv/netbackup/bin/admincmd/nbstlutil report ---> that is the whole command; just run it as is.
bpstulist -label <storage unit name> ---> for both storage unit names
Please provide the output as attachments.
File attached.
Hi,
I am still looking for:
the two SLP names that have the issue.
/usr/openv/netbackup/bin/admincmd/nbstlutil report ---> that is the whole command; just run it as is.
bpstulist -label <storage unit name> ---> for both storage unit names
and also the detailed status of the latest Duplication job.
Missed that part, here they are.
Here is a current Duplication waiting to run
It looks like Max I/O streams is the problem.
You set Max concurrent jobs to 90 in each storage unit,
and Max I/O streams for each disk pool is also 90.
So at any point in time only 90 streams can be active on each disk pool. But because the storage units also allow 90 jobs, each SLP allocates all 90 streams on its source, which leaves 0 for the destination side, so all the jobs sit in the queue.
It's like this:
For DD01_SU
Max I/O streams in disk pool = 90
Max jobs in disk STU = 90
For DD02_SU
Max I/O streams in disk pool = 90
Max jobs in disk STU = 90
So when duplication starts for the DD01 SLP, all of its streams get allocated at the source end: the DD01 SLP takes 90 at its source and nothing is left for the DD02 SLP jobs, which therefore queue.
It is the same the other way around for the DD02 SLP.
It's a deadlock situation.
There are three ways out. First cancel all Duplication jobs, then implement one of the following:
1) Reduce the Max jobs count in each STU.
or
2) Increase Max I/O streams at the disk pool (not recommended, as a DD670 might not be able to handle more).
or
3) Deactivate one SLP until the other SLP completes.
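The contention described above, and why option 1 helps, can be sketched with a small Python model. This is only a toy illustration, not how NetBackup's resource broker is actually implemented: it assumes each duplication job first takes one stream on its source disk pool and then needs one stream on the destination pool, and that both count against the same Max I/O streams limit. The SLP names SLP_DD01 and SLP_DD02 and the function simulate are made up for the example.

```python
def simulate(pool_streams, max_jobs_per_stu):
    """Toy model of two SLPs duplicating in opposite directions.

    Assumption (not confirmed NetBackup behaviour): a job holds one stream
    on its source pool, then needs one stream on the destination pool.
    """
    free = {"dd01": pool_streams, "dd02": pool_streams}
    # Hypothetical SLP name -> (source pool, destination pool)
    slps = {"SLP_DD01": ("dd01", "dd02"), "SLP_DD02": ("dd02", "dd01")}

    # Phase 1: each SLP takes source streams up to the storage unit job limit.
    holding = {}
    for name, (src, _dst) in slps.items():
        granted = min(max_jobs_per_stu, free[src])
        free[src] -= granted
        holding[name] = granted

    # Phase 2: jobs holding a source stream ask for a destination stream.
    running = {}
    for name, (_src, dst) in slps.items():
        granted = min(holding[name], free[dst])
        free[dst] -= granted
        running[name] = granted

    # Jobs that hold a source stream but got no destination stream are stuck.
    queued = {name: holding[name] - running[name] for name in slps}
    return running, queued


if __name__ == "__main__":
    for max_jobs in (90, 40):
        running, queued = simulate(pool_streams=90, max_jobs_per_stu=max_jobs)
        print(f"max jobs {max_jobs}: running={running} queued={queued}")
```

In this model, a job limit of 90 leaves every job queued because each pool is already full of source-side streams, while a limit of 40 leaves enough free streams on each pool for the other SLP's writes.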
OK, but I guess what I need to know is: what is LCM_dd01-su?
There is a way to clear the Duplications, as I have had Symantec help me do this one other time, but why is there no information within the job as to what image it is duplicating to the other DD?
Is LCM_dd01-su just a logical name the internal operations of NBU call the SU?
I also noticed I am getting a Status 84 (failed media write) on some of these.
LCM_dd01-su actually refers to the dd01-su storage unit.
The status 84 might be because of the large number of I/O streams.
You will see the image ID once the job starts replicating.
The first thing you need to do is reduce the Max concurrent jobs in the STU, maybe to 40 or 50, for both storage units.
Cancel all Duplication jobs in the Activity Monitor and let them start again,
and see how they are moving.
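If you want to check which storage unit the queued jobs are waiting on, here is a rough Python sketch that tallies the "Limit has been reached" lines from Detailed Status text you have copied out of the job details. It assumes the message format shown in the first post and that "LCM_" is simply a prefix on the storage unit name, as described above; the function names are made up for the example.

```python
import re
from collections import Counter

# Matches the status message quoted in the first post.
LIMIT_RE = re.compile(r"Limit has been reached for the logical resource (\S+)")


def tally_limited_resources(lines):
    """Count how many status lines report each logical resource as limited."""
    counts = Counter()
    for line in lines:
        match = LIMIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def storage_unit_name(resource):
    """Strip the 'LCM_' prefix (assumed to wrap the storage unit name)."""
    prefix = "LCM_"
    return resource[len(prefix):] if resource.startswith(prefix) else resource


if __name__ == "__main__":
    sample = [
        "2/2/2013 1:46:29 AM - requesting resource LCM_dd01-su",
        "2/2/2013 1:36:35 AM - Info nbrb(pid=3248) Limit has been reached "
        "for the logical resource LCM_dd01-su",
    ]
    for resource, count in tally_limited_resources(sample).items():
        print(f"{storage_unit_name(resource)}: {count} job(s) waiting")
```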
----------
PS: If you do not want them to duplicate, you can cancel them using the nbstlutil command.
Waiting to see the results of the changes. Ended up rebooting (after shutting down NetBackup) all of the servers.
Just another quick question for you, I do not see a bpjobs.act.db file in the \NetBackup\db\jobs folder. Did this go away in 7.5?
Duplication jobs have finally finished after 3 solid days of running. Everything is caught up and back to normal.