KNOWN ISSUE: TaskManagement Service - Event Queue thread-race condition

Article:TECH196314  |  Created: 2012-09-07  |  Updated: 2012-09-07  |  Article URL http://www.symantec.com/docs/TECH196314
Article Type
Technical Solution


Issue



When the Site Server is moderately to heavily loaded, "Task Complete" events get lost. Any event can be affected by the problem, but the loss of the "Task Complete" event is the most problematic.


Environment



Symantec Management Platform 7.1 SP2
Task Management 7.1 SP2


Cause



Known Issue.

We discovered a “thread-race” condition. Its effect was that the “task complete” event could get lost. This would only happen on a Site Server that was experiencing a certain level of load.

The series of conditions required for the problem goes like this: the Site Server sends a task to a client machine, and the client machine performs the task and responds with the “task complete” message quickly (usually within 2 or 3 seconds). The Site Server has an “Event Processing Loop” with a 5 second wait in it, so if the task starts and completes during that 5 second delay, the “task has started” and “task has completed” events begin processing on separate threads, one right after the other.

When this happens the outcome is like a roll of the dice. Here thread one handles “task has started” and thread two handles “task has completed”. If thread one gets ahead of thread two, both events get processed in the correct order and all is fine. If thread two gets ahead of thread one, both events get processed but out of order; we handle this and the job continues. If the “task has completed” thread overwrites the “task has started” event, the job continues.

However, if the “task has started” thread falls behind, but only just barely, it can overwrite the “task has completed” event. When this happens the client knows it is done with the task, but the Site Server believes the task has only started. This causes the Site Server to wait 60 minutes for the task to “retry”, then resend the task to the client.
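
The four outcomes described above can be illustrated with a small deterministic model. This is only a sketch under stated assumptions, not the product's code: it assumes each task has a single unguarded "latest event" slot that the posting threads write into, and a processing step that reads and clears that slot. The names run_schedule, server_state and PROCESS are hypothetical.

```python
STARTED = "task has started"
COMPLETED = "task has completed"
PROCESS = "process"  # hypothetical marker: the processing thread reads the slot


def run_schedule(schedule):
    """Replay one interleaving and return the events the server processed.

    Assumption: a single unguarded slot holds the latest event for the task,
    so a second post that lands before processing overwrites the first.
    """
    slot = None
    processed = []
    for action in schedule:
        if action == PROCESS:
            if slot is not None:
                processed.append(slot)
                slot = None
        else:
            slot = action  # unguarded write: any unprocessed event is lost
    return processed


def server_state(processed):
    """Completion is treated as final, so out-of-order delivery is tolerated."""
    if COMPLETED in processed:
        return "completed"
    if STARTED in processed:
        return "started"
    return "unknown"


interleavings = {
    "in order": [STARTED, PROCESS, COMPLETED, PROCESS],
    "out of order": [COMPLETED, PROCESS, STARTED, PROCESS],
    "completed overwrites started": [STARTED, COMPLETED, PROCESS],
    "started overwrites completed": [COMPLETED, STARTED, PROCESS],
}

for name, schedule in interleavings.items():
    print(name, "->", server_state(run_schedule(schedule)))
```

In this model only the last interleaving leaves the server in the "started" state, which corresponds to the case where the Site Server waits for the retry.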

The lost “task complete” event creates several different symptoms. The first is the 1 to 4 hour delay we saw, after which the task would sometimes still complete. The second is that the task would sometimes be delayed until it was removed from the NS as too old. The third, and more common, affects tasks that have the default 30 minute timeout set. These tasks reach the 30 minute “timeout” and are killed before the 60 minute “retry” comes into play. These tasks are reported as “failures” even though they may have completed successfully.
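
The third symptom comes down to which of two timers fires first. A minimal sketch, assuming the 60 minute retry wait and the 30 minute default timeout described above; the function name and the returned strings are hypothetical, and the behavior when both timers are equal is not stated in this article, so the sketch arbitrarily lets the retry win in that case.

```python
RETRY_WAIT_MIN = 60       # Site Server waits this long before resending the task
DEFAULT_TIMEOUT_MIN = 30  # default task timeout


def outcome_after_lost_completion(timeout_min=DEFAULT_TIMEOUT_MIN,
                                  retry_wait_min=RETRY_WAIT_MIN):
    """Report which timer fires first once "task complete" has been lost."""
    if timeout_min < retry_wait_min:
        return f"reported as failed after {timeout_min} min (timeout fires before retry)"
    # Assumption: when the timers are equal, the retry is taken to win.
    return f"task resent after {retry_wait_min} min (retry fires before timeout)"


print(outcome_after_lost_completion())                 # default 30 minute timeout
print(outcome_after_lost_completion(timeout_min=120))  # longer, hypothetical timeout
```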

The code fix for this issue involved moving the functionality that passes the event to the thread that processes events for a task. It now sits at a location in the code where it is contained within the same mutex semaphore as the thread wakeup process that causes the event to be processed. (It should be noted that this thread is different from the two threads above that are in a thread-race condition.) This prevents one thread from overwriting the event from another thread. This will slow large “Jobs” with multiple tasks but speed up small or single-task items that are being processed on the Site Server.
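
The general idea of holding one lock across both the event hand-off and the wakeup can be sketched as follows. This is not the actual fix, whose code is not shown in this article. It is a minimal illustration that assumes Python's threading.Condition as the lock-plus-wakeup primitive and a queue of pending events; both are choices made for this example, and the class name TaskEventChannel is hypothetical.

```python
import threading
from collections import deque


class TaskEventChannel:
    """Hands events to the processing thread without overwriting any of them."""

    def __init__(self):
        # A Condition couples a mutex with the wakeup signal.
        self._cond = threading.Condition()
        self._pending = deque()  # assumption: events are queued, not kept in one slot

    def post(self, event):
        # Hand-off and wakeup happen under the same lock, so a second
        # poster cannot slip in between them and clobber the first event.
        with self._cond:
            self._pending.append(event)
            self._cond.notify()

    def take(self, timeout=None):
        # Called by the processing thread; returns None if nothing arrives in time.
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._pending) > 0, timeout=timeout):
                return None
            return self._pending.popleft()


channel = TaskEventChannel()
posters = [
    threading.Thread(target=channel.post, args=(event,))
    for event in ("task has started", "task has completed")
]
for thread in posters:
    thread.start()
for thread in posters:
    thread.join()

received = {channel.take(timeout=1), channel.take(timeout=1)}
print(sorted(received))
```

Whichever poster runs first, both events reach the processing side. The cost is that posters serialize on the lock, which is in line with the slowdown for large multi-task “Jobs” noted above.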

Also of note: after the 5 second delay, the “Event Processing Loop” waits for a “one or more events present” event. In a lightly loaded or test environment this is where the loop spends most of its time. This has the effect of having the Site Server ready to process the “task has started” event as soon as it is posted. When this happens there is a 5 second delay before the “task has completed” event can be processed, so the two events are never picked up together. This is why the issue would never happen in a lightly loaded environment.
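
The timing difference between a lightly loaded and a loaded Site Server can be shown with a simplified model of the loop. It is a sketch only: it assumes the loop alternates between a fixed 5 second delay and a wait for at least one event, it ignores processing time, and the function name processing_batches and the example timestamps are hypothetical.

```python
LOOP_DELAY_SEC = 5.0


def processing_batches(arrivals, loop_ready_at, delay=LOOP_DELAY_SEC):
    """Group events by the loop iteration that picks them up.

    arrivals:      list of (arrival_time_sec, event) tuples
    loop_ready_at: time at which the loop finishes its current delay and
                   starts waiting for "one or more events present"
    """
    pending = sorted(arrivals)
    batches = []
    ready = loop_ready_at
    while pending:
        # The loop wakes once its delay is over AND an event is present.
        wake = max(ready, pending[0][0])
        batch = [event for t, event in pending if t <= wake]
        pending = [(t, event) for t, event in pending if t > wake]
        batches.append((wake, batch))
        ready = wake + delay  # next iteration starts with the 5 second delay
    return batches


events = [(10.0, "task has started"), (12.5, "task has completed")]

# Lightly loaded: the loop has been idle and waiting since t=0.
light = processing_batches(events, loop_ready_at=0.0)
# Loaded: the loop is still inside its delay until t=14.
loaded = processing_batches(events, loop_ready_at=14.0)

print("light :", light)
print("loaded:", loaded)
```

In the lightly loaded run the two events land in separate iterations 5 seconds apart. In the loaded run both are picked up in the same iteration, which is the situation where they start processing on separate threads one right after the other.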


Solution



This issue has been reported to the Symantec Development Team. This issue will be addressed in the next major release (currently targeted for SMP 7.1 SP2 MP1 and ITMS 7.5).

There is a pointfix available. Please see attached "Pointfix_eTrack2903733_7.1_SP2v4.zip"

MINIMUM REQUIREMENT:
Installed ITMS 7.1 SP2 v4

HOW TO INSTALL THIS POINTFIX:
1.    Download "Pointfix_eTrack2903733_7.1_SP2v4.zip".
2.    Put the script and executables in one folder that contains no other files
(for example, a new folder on the Desktop).
3.    Install the Software Management Solution plug-in on the Remote Task Server.
4.    Import a new software resource: right-click in the “Installed Software” pane and point to the *.cmd file from the files downloaded in step 1.
5.    Create a Quick Delivery task.
6.    Select the Task Server(s).
7.    Click “OK” on the Quick Delivery task.
This runs the Quick Delivery task on the Task Servers, executes the batch file and updates the DLL.
 
CHANGES MADE:
The code fix for this issue involved moving the functionality that passes the event to the thread that processes events for a task. It now sits at a location in the code where it is contained within the same mutex semaphore as the thread wakeup process that causes the event to be processed. This prevents one thread from overwriting the event from another thread. This will slow large “Jobs” with multiple tasks but speed up small or single-task items that are being processed on the Site Server.


Additional information:
This fix is for Remote Task Servers only. The fix should be reapplied whenever a new Remote Task Server is created.


Attachments

Pointfix_eTrack2903733_7_1_SP2v4.zip (2.3 MBytes)

Supplemental Materials

Source: ETrack
Value: 2903733



