Job Engine Exception
Updated: 21 May 2010 | 43 comments
We have BE V 10 installed on several servers. All of the servers are running Win2000 Server with current service packs and current updates. On one of those machines the Backup Exec Job Engine service sometimes crashes during backups with a c0000005 exception at address 10432DA7 (copy). This has happened several times.
We have done a reinstall and installed all BE updates, so we already have SR1.
All of our agents are the current BE 10 versions.
I've seen several entries in this forum that seem similar to our problem, but none of them seem to have been resolved.
The only kbase article I've found was 276242, which says to contact Veritas technical services.
How can we get this problem fixed?
Comments
Has anyone looked at this thread? Does anyone have a suggestion/answer?
This is a serious problem for us.
Check this forum; there are other threads on this topic.
Thanks for the reply, but the threads I've seen all seem to have some other stuff going on, like synthetic backup. My point with this thread is that we're just doing boring stuff here and the thing's not working.
I guess I'm having trouble understanding what the big deal with fixing this is. Services are not supposed to crash, period.
I've written services and currently write device drivers. Services are supposed to be able to get all kinds of wrong data and fail gracefully.
Hello,
We apologise for the inconvenience caused by the delayed response. Please let us know if your issue still persists. If it does, please follow the steps below. If not, we will mark this case as assumed answered and move it to the answered questions pool.
We suggest that you perform a repair installation of Backup Exec, following this document:
http://support.veritas.com/docs/253199
Service Pack 1 for Veritas Backup Exec should then be installed again after the repair installation is done.
Do let us know if the issue persists.
NOTE: If we do not receive your reply within two business days, this post would be marked "assumed answered" and would be moved to the "answered questions" pool.
The repair installation was done weeks ago. The Service Pack install was done at initial installation time as were all of the hot fixes.
This problem continues to persist. The only way that we have been able to stop it from happening is to have the BE GUI running all of the time.
It seems clear that there is an error in the job service code. When will this be corrected?
Hello,
Please elaborate on the following: "This problem continues to persist. The only way that we have been able to stop it from happening is to have the BE GUI running all of the time."
Please also tell us whether the job engine service crashes during a specific backup job or during any job. Does it give any specific error?
Waiting for your reply.
Regards
"this problem continues to persist" - that means that the problem keeps happening, the problem happens over and over again, the job engine stops with Dr watson errors.
"The only way that we have been able to stop it from happening is to have the BE GUI running all of the time" - that means that if we keep the BE application running all of the time, the error does not happen. Having a service-side component dependent on the operation of a user-side component seems like an error.
The specific error from the system event log is: c0000005 exception at address 10432DA7 (copy).
Thanks.
I should have said I have seen this (though not repeatedly the way you are), and we're just doing boring stuff also.
The whole business is getting a bit silly. This is a hard, system level failure that shouldn't be all that hard to get a handle on.
Instead of asking me to repeat the same information again and again, it seems to me that someone at Veritas should get me a version of the Job Engine service that has some instrumentation (assuming that they did not use WPP) so that I could get them some trace information.
Since that is not happening, my job is to sit here and keep saying, "See, it's still BROKEN!".
I should say that the workaround of keeping the GUI running all of the time seems to work. However, we don't like the idea of leaving a user logged in on the server console.
We are running three Backup Exec 10 SP1 servers and all three started having Backup Exec Job Engine service failures within two weeks of each other.
I'll start leaving the GUI running because I need to get data backed up but I would LOVE to hear from Veritas on this issue. These same servers ran flawlessly for months...
Kelly,
I'll be interested to hear if this works on your site the way that it does here.
I found this workaround in another thread, which even described a problem where the Job Engine tries to post an alert and gets confused if the GUI is not running. Sounds like a bug to me.
I'll let you know how it goes tonight for sure. The other frustrating side effect of this issue is that it marks my tapes with "end marker unreadable", effectively cutting the capacity way down. The only way to clear that status is to erase the tape...the joy of it all. Now where's my fresh cup of coffee.
Leaving the GUI open all night seems to have worked. Thank you for posting that workaround while we wait for Veritas to provide a solution.
Hello,
We regret the inconvenience caused.
Please do the following to resolve the issue:
I
1. Apply the latest service packs for the OS as well as BE (in this case this is already done).
2. Update to the latest MDAC version; if you have already installed the latest MDAC version, reapply it.
Please see the link below:
http://www.microsoft.com/downloads/details.aspx?Fa...
3. If that does not help, perform a repair installation of BE. For instructions on performing a repair installation of Backup Exec, please refer to the following technote:
http://support.veritas.com/docs/253199
II
Also please try doing the following steps:
1) Split the Backup jobs into smaller ones and see if the error is occurring on any particular drive or file. If the error is occurring on any particular folder then exclude that folder from the backup or check its integrity.
2) Run a backup job on the Backup to Disk folder and see if you are getting the same error.
III
Under Tools->Options->Preferences, uncheck "Display progress indicators for backup jobs" if that option is checked, and then verify whether the crash still occurs.
Please keep us updated.
Thanks,
What do we do when we've done all of that and this issue still occurs? Three servers with the same service crashing, all within two weeks of each other after working fine for months.
Kelly
All of these items have been covered in other threads and I have done all of them several times. I have not been able to narrow the problem down to a specific job or a specific client machine. I have this problem on one server frequently and two others less frequently.
Regardless, nothing that happens on a client machine should be able to make the Job Engine service crash. I should also be getting some sort of system error log entries that should help me narrow this down and I am not getting anything useful.
Hello,
Please check the Dr. Watson log for the exact error message.
The log file is drwtsn32.log.
Thanks,
Here's the Dr. Watson info from my latest crash:
Application exception occurred:
App: (pid=3232)
When: 6/2/2005 @ 04:04:21.468
Exception number: c0000005 (access violation)
Here's the state dump for the faulting thread:
State Dump for Thread Id 0x7ac
eax=012e7a30 ebx=00010000 ecx=0000068c edx=00010000 esi=028be5d0 edi=012f6000
eip=01432da7 esp=01c2bc9c ebp=01c2bd0c iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
function: Copy
01432d87 e854efffff call MemoryMappedFile::CloseMap+0x140 (01431ce0)
01432d8c 8d4db8 lea ecx, ss:02445bf2=????????
01432d8f e87cfcffff call MemoryMappedFile::Allocate+0x460 (01432a10)
01432d94 8bf0 mov esi,eax
01432d96 8d4d9c lea ecx, ss:02445bf2=????????
01432d99 e872fcffff call MemoryMappedFile::Allocate+0x460 (01432a10)
01432d9e 8bf8 mov edi,eax
01432da0 8bcb mov ecx,ebx
01432da2 8bd1 mov edx,ecx
01432da4 c1e902 shr ecx,0x2
FAULT ->01432da7 f3a5 rep movsd ds:028be5d0=001c0008 es:012f6000=????????
01432da9 8bca mov ecx,edx
01432dab 83e103 and ecx,0x3
01432dae f3a4 rep movsb ds:028be5d0=08 es:012f6000=??
01432db0 8b75ec mov esi, ss:02445bf2=????????
01432db3 03f3 add esi,ebx
01432db5 8975ec mov ,esi ss:02445bf2=????????
01432db8 8b7d18 mov edi, ss:02445bf2=????????
01432dbb 8bc7 mov eax,edi
01432dbd 2bc6 sub eax,esi
01432dbf 8945dc mov ,eax ss:02445bf2=????????
01432dc2 8d4ddc lea ecx, ss:02445bf2=????????
All of the other threads are sitting on waits except for these:
State Dump for Thread Id 0x12d4
eax=0012fd18 ebx=00000000 ecx=01010101 edx=00000000 esi=00000000 edi=00000160
eip=77f82926 esp=0012fbd0 ebp=0012fc40 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=0038 gs=0000 efl=00000246
function: NtReadFile
77f8291b b8a1000000 mov eax,0xa1
77f82920 8d542404 lea edx, ss:00949ab7=????????
77f82924 cd2e int 2e
77f82926 c22400 ret 0x24
State Dump for Thread Id 0x1408
eax=00000000 ebx=000493e0 ecx=01cbe938 edx=00000000 esi=00140820 edi=000493e0
eip=77f8289c esp=0265febc ebp=0265fee4 iopl=0 nv up ei ng nz ac po cy
cs=001b ss=0023 ds=0023 es=0023 fs=0038 gs=0000 efl=00000297
function: ZwRemoveIoCompletion
77f82891 b8a8000000 mov eax,0xa8
77f82896 8d542404 lea edx, ss:02e79da3=????????
77f8289a cd2e int 2e
77f8289c c21400 ret 0x14
I can send the whole file with several of these, but not on an open forum.
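The register values in the faulting thread's dump can be checked with a little arithmetic. The sketch below is only an illustration and assumes standard x86 `rep movsd` semantics (ecx counts the dwords still to copy, esi and edi advance by 4 per dword) and 4 KB pages. The function name `analyze_rep_movsd` is made up for this note; the numbers are copied from the dump above. It does not identify the root cause, it only shows how far the copy got.

```python
# Register values copied from the state dump for thread 0x7ac above.
EAX = 0x012E7A30  # left over from "mov edi,eax": the destination pointer before the copy
EBX = 0x00010000  # requested copy length in bytes ("mov ecx,ebx" then "shr ecx,0x2")
ECX = 0x0000068C  # dwords still to copy when the fault hit
ESI = 0x028BE5D0  # source pointer at the fault
EDI = 0x012F6000  # destination pointer at the fault (shown as unreadable in the dump)

PAGE_SIZE = 0x1000  # assumption: 4 KB pages, as on 32-bit x86 Windows


def analyze_rep_movsd(eax, ebx, ecx, esi, edi):
    """Work backwards from the fault to where the copy started."""
    total_dwords = ebx >> 2                   # what "shr ecx,0x2" loaded into ecx
    copied_bytes = (total_dwords - ecx) * 4   # bytes moved before the fault
    return {
        "requested_bytes": ebx,
        "copied_bytes": copied_bytes,
        "source_start": esi - copied_bytes,
        "dest_start": edi - copied_bytes,
        "dest_start_matches_eax": (edi - copied_bytes) == eax,
        "fault_on_page_boundary": edi % PAGE_SIZE == 0,
    }


if __name__ == "__main__":
    for name, value in analyze_rep_movsd(EAX, EBX, ECX, ESI, EDI).items():
        print(name, hex(value) if not isinstance(value, bool) else value)
```

With the values from this dump, the copy had moved 0xE5D0 of the requested 0x10000 bytes, the computed destination start equals eax (0x012E7A30), and the faulting destination address 0x012F6000 sits exactly on a page boundary. That is consistent with a copy running off the end of the mapped destination region, though the dump alone does not say why the region was that small.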
Hello,
1. Please place the job engine in debug mode and see if you can identify the faulting module and the exact error message with which it is faulting.
2. Also check whether you are receiving any alerts during the job engine crash.
3. Ensure that you have provided sufficient media for the job to complete. Does the backup span media? Does this occur only during backups or also during other jobs?
4. Please try a repair install if you haven't already done so.
http://support.veritas.com/docs/253199
I don't know how to put the job engine in debug mode. Please attach instructions or a kbase article reference.
We only get crashes during backup operations, not during verify.
This is an autoloader, so the job is spanning across tapes.
We almost always get alerts during any single job. Normally, at least one file is locked during a backup of the 5 or so systems in a typical job.
Hello,
Please see the technote below for how to put the services in debug mode:
http://support.veritas.com/docs/254212
Thanks,
We followed the instructions in KB 254212 which says to stop the job engine, then add -debug to the start parameters. The other night, the job engine crashed, but we have no log file. I noticed that when I checked the properties for the job engine service there was no -debug property. Are the instructions in the KBase article correct?
The impression I got from reading the KB article is that you could use EITHER the startup param OR the registry value change. Is that correct?
Out of three machines that had job engine crashes, we only got one log. The logfile ends at the same time as the crash time, but has no crash-specific information. I'm attaching the last chunk of the log file below:
---------------------------------------------------
ataStartBackup: ndmpSendRequest returned: 0x0, 0
07/07/05 06:46:12 TF_NDMPGetResult(): MediaServer thread done, returning TFLE 0
07/07/05 06:46:12 NDMPEngine::MessagePumpAndWaitForResults(): TF_NDMPGetResult() returned 0
07/07/05 06:46:13 data halted: SUCCESSFUL
07/07/05 06:46:13 NDMPEngine: Shutting down.
07/07/05 06:46:14 WriteEndSet( 1 ) returning 0
07/07/05 06:46:16 WriteEndSet( 1 ) returning 0
07/07/05 06:46:16 WriteEndSet( 0 ) returning 0
07/07/05 06:46:16 HARDWARE COMPRESSION ===> Setting compression off.
07/07/05 06:46:16 TF_CloseSet
07/07/05 06:46:16 ndmpConnect : Control Connection information : connection established between IP 172.20.24.25, port 2935 and IP 172.20.24.76, port 10000
07/07/05 06:46:16 NDMP version 3 connection CONNECTED
07/07/05 06:46:16 BESC: Parsing OS version info -
07/07/05 06:46:16 ndmpcSnapshotPrepare2: Warning. No devices to snap. Returning with NDMP_SNAPSHOT_NO_DEVICES2SNAP
07/07/05 06:46:16 Media Server to initiate connection for data transfer
07/07/05 06:46:16 TF_OpenSet( )
07/07/05 06:46:16 Requested Set: ID = ffffffff Seq = -1 Set = -1
07/07/05 06:46:16 Current VCB: ID = 32e5396a Seq = 2 Set = 23
07/07/05 06:46:16 PositionAtSet( ): TF Msg = 6
07/07/05 06:46:16 UI Msg = 8002
07/07/05 06:46:16 HARDWARE COMPRESSION ===> Compression is configurable.
07/07/05 06:46:16 GET_DRV_INF: bsize = 8192
07/07/05 06:46:16 SetupFormatEnv( fmt=0 )
07/07/05 06:46:16 End of TF_OpenSet: Ret_val = 0 Buffs = 2 HiWater = 0
07/07/05 06:46:16 HARDWARE COMPRESSION ===> Setting compression on.
07/07/05 06:46:16 Current Block is = 7d59f
07/07/05 06:46:16 TF_InitMediaServerReverseConnection : Data Connection information : connection established between IP 172.20.24.25, port 2938 and IP 172.20.24.76, port 4754
07/07/05 06:46:16
dataStartBackup: ndmpSendRequest returned: 0x0, 0
07/07/05 06:46:43 TF_NDMPGetResult(): MediaServer thread done, returning TFLE 0
07/07/05 06:46:43 NDMPEngine::MessagePumpAndWaitForResults(): TF_NDMPGetResult() returned 0
07/07/05 06:46:43 data halted: SUCCESSFUL
07/07/05 06:46:43 NDMPEngine: Shutting down.
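For anyone wanting to check whether a debug log really stops at the moment of a crash, here is a minimal sketch. It assumes the log lines start with a timestamp in MM/DD/YY HH:MM:SS form, as in the excerpt above (the day/month order is a guess from the excerpt). The function names and the one-minute tolerance are made up for illustration, and the crash time has to be supplied from elsewhere, for example from the Dr. Watson "When:" line.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each debug log line: MM/DD/YY HH:MM:SS
TIMESTAMP_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def last_debug_timestamp(lines):
    """Return the timestamp of the last timestamped line, or None if there is none."""
    last = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            last = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return last


def ends_at_crash(lines, crash_time, tolerance=timedelta(minutes=1)):
    """True if the log's last timestamp is within `tolerance` of the crash time."""
    last = last_debug_timestamp(lines)
    return last is not None and abs(crash_time - last) <= tolerance


if __name__ == "__main__":
    # The final lines of the excerpt above; continuation lines without a timestamp are skipped.
    excerpt = [
        "07/07/05 06:46:16",
        "dataStartBackup: ndmpSendRequest returned: 0x0, 0",
        "07/07/05 06:46:43 data halted: SUCCESSFUL",
        "07/07/05 06:46:43 NDMPEngine: Shutting down.",
    ]
    print(last_debug_timestamp(excerpt))
```

Running the same check against the logs from all three servers would at least show which crashes left a log that ends at the crash and which left nothing.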
Hello Greg,
1. Is this the entire debug log created?
2. Also verify whether the services exhibit the same behaviour when the jobs are run to backup-to-disk folders instead of tape.
3. Have you also split the job, if it is too large, as suggested earlier?
4. Try a repair installation:
http://support.veritas.com/docs/253199
Additional Information :
For information on the recent VERITAS Backup Exec security vulnerabilities, including links to the downloads for the necessary hotfixes, please refer to the following document:
Patch summary for Security Advisories VX05-001, VX05-002, VX05-003, VX05-005, VX05-006, VX05-007
http://seer.support.veritas.com/docs/277429.htm
The entire log is about 8000 lines long. I don't think that this web-forum thing can handle it. I would be happy to post it somewhere if you like.
We don't back up to file folders.
I don't know what you mean about a job being "too long". What's the difference? What's the limit? Where is this documented? In what version was this restriction added? I have about 50 jobs spread over 3 servers. I don't believe I have ever seen the same job fail twice.
If you had even read this thread you would know that we have already done a repair installation several times on each of the servers. It has made no difference.
I am getting job engine crashes on all three of my servers and I do not feel like I am receiving any reasonable help.
Greg, I feel your pain. For us, the issue ended up being the "password protected tape" feature. We have been running the option to password protect all of our tapes in every job we run. Once we disabled that option, the job engine crashes went away.
Kelly,
Here's the source of my irritation: I write windows device drivers and services for a living. The rule with all of this stuff is that the driver/service _cannot_ fault.... ever. It is up to the writer to handle every Bad Thing that can happen gracefully. Previous versions of this application were bulletproof and now it is just junk and none of these folks seem to care.
It really bothers me when someone tries to tell me that the reason that the service is crashing is because "my job is too long", or to run the repair install yet again. What nonsense!
Greg
I can't say that I blame you there and agree completely with your comments about services; I've never had another service crash like this.
I've found that these Veritas forums are great for general issues but once it gets complicated, like this one, I finally went with the phone support after seeing where this thread kept going. Different technicians getting on with different solutions yet always referring to repair installs or clean installs. I went through it all before an engineer up the Veritas chain suggested I disable the password protection.
I'll keep my eye on this thread and offer any info I run across...maybe you should try a repair install. Sorry, that was low :)
Hello,
Is an antivirus service running during the backup?
If so, stop all third-party applications during the backup.
Are you getting event ID 4097 in the application event log?
Yes, there certainly are event number 4097 entries in the log. Those events are Dr. Watson events, which occur every time the job engine crashes.
I can try to reschedule the virus scan runs to see what happens.
As to "stopping all third party applications", this IS a server after all. The whole point of a server is to run those applications. If I did not have those applications, I would not have any data that I needed to back up and I would not need BackupExec.
Hello Greg,
Sorry for the repetition that you have been facing on this thread. It is true that your problem can be resolved faster via voice support.
If you are indeed receiving event ID 4097 in the application log, refer to the following technote, which explains why you require the personal attention of a Symantec engineer.
http://support.veritas.com/docs/276242
Due to the complexity of your issue, resolution will require the personal attention of a Symantec representative.
Please contact us through your local support number. You can see a list of support numbers at our support web site:
http://support.veritas.com/prodlist_path_phone.htm
Please note that you may be charged for this service.
For the latest details on pricing, please visit http://support.veritas.com/srv_portfolio/srvc_pric...
Thanks,
Hello,
Please apply the above-mentioned solution and update us on the status of this issue.
I'm having the exact same problem as you with BEWS 10.0 SP1.
I've been experiencing the exact same thing, but for me I was trying to back up 2TB of data to tape. I have two DLT220, 16 tape libraries.
I leave the console open, everything is up-to-date... I've repaired my database and I still get these frustrating errors. What's worse is the error will pop up on the screen about the Job Engine... everything will work fine until you click the OK button... then everything comes to a grinding halt with a "Communication Stalled" message.
I'm feelin' your pain as well Greg... and I believe the only way to get answers out of these forums is for people like you and me to answer the problems.
Having a service fault is a definite no-no, and the product's quality has dropped... I miss Cheyenne!
I hope you get an answer... as opposed to mine (http://forums.veritas.com/discussions/thread.jspa?forumID=101&threadID=45708&messageID=4356219?)
The sad thing is that even if you don't get an appropriate answer or solution... you still have to say that this topic has been answered.
Rich
Thanks for the input Richard. I'm just back from a few days off. As soon as I catch up, I'll be making a service call to Veritas/Symantec about the job engine thing. I can already imagine how that is going to go....
Service: "Can't you stop that SQL server service while backups are running?"
Me: "Ugh no, our business sort of depends on it."
Service: "Okay, then let's just reboot the machine and try a test".
Me: "No, see we have these people that are using these systems to do actual work. The machines have to keep working. That's why we go to all of this trouble to back them up."
I'm really looking forward to it.
By the way, I changed the schedule for the virus scan to not conflict with the backup run and it made no difference. I understand the job engine stopped twice while I was gone.
I'm butting in here, but if you're running the Symantec AntiVirus Corp. Edition client on your VeritasBE server you might consider disabling the Filesystem Realtime Protection.
Other AntiVirus products may have a similar feature.
Message was edited by: Patrick Prather
Greg,
Lately I've been having problems with the DLO option with BEWS 10.0 SP1 ... The technician on the phone suggested that I upgrade to the latest build (5520) ... I am running now 5484.
This is supposed to fix a whole slew of bugs, security holes, etc.
I'll paste the link for ya... but unfortunately it's an in-place upgrade. Veritas has a history of painful upgrades but I'm going to try it. I'm sure this will mean you'll have to upgrade all your AOFO clients... but I'll let you know how my upgrade goes...
Link: http://seer.support.veritas.com/docs/277181.htm
Rich
Oh... one other thing. I was able to stop the job engine service from crashing... the stupidest thing too ... a bad tape. Would crash my libraries, stall my jobs... and kill the job engine. I doubt it's your problem... but that's what a bad DLT tape did to me.
Message was edited by: Richard Fleming
Thanks for your input Richard and Patrick.
I'll consider disabling realtime file protection, however at least one of my servers is a file server.
Interesting about the bad tape problem. Back to my initial rant, bad tapes happen but the service just shouldn't crash because of it. Tapes are a removable media just like floppy disks. What would you say if a service on your desktop crashed when you inserted a bad floppy or CDROM?
Hehe... if that were to happen, I'd check to see if I was running Windows 95 :)
Rich
The job engine debug log shows normal operation up to the point where it stops. A variety of things can occur next, so I can't determine what it's trying to do. Does the job engine stop when backing up a particular resource, or can you run test jobs of local resources and get it to stop? Does the function reported by drwtsn32 stay consistent for the fault? Does it always show function: Copy? It's hard to pin down what is occurring unless we can analyze a user dump file to get more information about the occurrence.
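One way to answer the "does it always fault in Copy?" question without reading every entry by hand is to tally the exception number and faulting function for each crash in drwtsn32.log. The sketch below is a rough illustration that assumes the log layout shown in the dump earlier in this thread ("Application exception occurred:", "Exception number:", "function:" and "FAULT ->" lines). The function name `summarize_crashes` is made up, and the log path is whatever you pass on the command line.

```python
import re
import sys
from collections import Counter

EXCEPTION_RE = re.compile(r"Exception number:\s*([0-9a-fA-F]+)")
FUNCTION_RE = re.compile(r"^function:\s*(\S+)")


def summarize_crashes(lines):
    """Count (exception number, faulting function) pairs across all crash entries.

    The faulting function is taken to be the most recent "function:" line
    seen before a "FAULT ->" line within the same crash entry.
    """
    counts = Counter()
    exception = None
    function = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Application exception occurred"):
            exception, function = None, None  # a new crash entry begins
            continue
        match = EXCEPTION_RE.search(line)
        if match:
            exception = match.group(1).lower()
            continue
        match = FUNCTION_RE.match(line)
        if match:
            function = match.group(1)
            continue
        if line.startswith("FAULT ->") and exception is not None:
            counts[(exception, function)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python summarize_drwtsn32.py <path to drwtsn32.log>
    with open(sys.argv[1], errors="replace") as log:
        for (exc, func), n in summarize_crashes(log).most_common():
            print(f"{n:4d}  {exc}  {func}")
```

If every entry comes back as c0000005 in Copy, that supports treating the crashes on all machines as the same fault rather than several unrelated ones.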
Thanks for your reply, Russ.
Yes, I've had job engine crashes on all three of my machines and they are all c0000005 errors in copy. We are not seeing any particular link between crashes and specific resources during backup. We have a fixed schedule of jobs, different jobs on different nights and we get job engine crashes at random. However, it is not at all unusual for some of the machines that are to be backed up not to be available sometimes, so that's another variable.
I spent time on the phone with Veritas today. They want me to load the latest driver kit and assure that the autoloader has the current version firmware. Also install the latest security hotfix. We're going to pick one of the machines and see if it makes any difference.
For those of you who have been standing by wondering what has finally happened to my servers, here's the exciting conclusion to the story.
I finally broke down and spent the $ to call Veritas support. I spent the better part of the day on the phone. I already had a couple of debug mode crash logs, otherwise he would have pushed me off to go get those. The conclusion was that I didn't have the latest build (5520) and the latest driver load (it turns out that this is part of 5520). Now, notice that nowhere in the 5520 bugfix list is there anything that talks about job engine crashes. Everything implies that this is just an accumulation of hot fixes.
It turns out that this actually fixes our problem. I would recommend that anyone who is having this problem install 5520 to see if it clears things up for you. If you are trying to use technical support, install the new drivers too, just to be able to say that you have already done it.
I hope this helps someone!
Greg
Hello,
Thank you for the update Greg.
Regards,