Q/A regarding changes in Event Queues processing in SMP 7.1 SP2
Article: HOWTO65758 | Created: 2012-01-18 | Updated: 2012-07-18 | Article URL: http://www.symantec.com/docs/HOWTO65758
With the release of ITMS 7.1 SP2, improvements were made to speed up and better handle NSE processing in the event queues. The following information was collected from multiple sources and is placed here as a reference.
Note: Please also refer to DOC5480 "Event Queues in ITMS 7.1 SP2"
1. Agent-side changes were made around passing GET parameters to PostEvent.aspx in the format priority=level&source=sourceGuid. A new parameter is passed to PostEvent.aspx - source=<agent guid>. The agent code was modified to make sure events are sent in the proper order: the first event passed to the event queue API is now sent first.
2. NSEs with the same sourceGuid and priority level are processed in the order they are received, with no concurrency. The NS server now has to process events in the order they arrive.
3. The server-side changes are extensive, as the existing queuing structure did not support attaching metadata to each event. The queuing system therefore had to be rewritten to register each event in the database and dispatch events under the constraint that no two NSEs with the same source guid and priority level can be processed simultaneously. (An exception is made for an empty source guid, for backwards compatibility - these can process simultaneously.) Additionally, NSEs with the same source guid and priority level are processed in the exact order they are registered in the database.
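The dispatch constraint above can be sketched roughly as follows. This is a simplified model, not the actual server code; the function and variable names are illustrative only.

```python
# Minimal sketch of the dispatch rule: events sharing a (source guid,
# priority) pair must run one at a time and in registration order, while
# an empty source guid is exempt for backwards compatibility.
def next_dispatchable(pending, in_flight):
    """pending: list of (source_guid, priority) tuples in registration order.
    in_flight: set of (source_guid, priority) keys currently being processed.
    Returns the index of the next event that may be dispatched, or None."""
    for i, (source, priority) in enumerate(pending):
        if source == "" or (source, priority) not in in_flight:
            return i
    return None

pending = [("A", 1), ("A", 1), ("B", 1), ("", 1)]
in_flight = {("A", 1)}
# ("A", 1) is already in flight, so the first dispatchable event is
# ("B", 1) at index 2; the second ("A", 1) must wait its turn.
```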
4. The names of temporary NSE files stored in the "Queue" folder have changed. The name now follows the template NSE-xxx-xxxxxxxx-xxxxxxxx.tmp, where the first 3 digits equal 0xFFF minus the priority of the event, the second group is the unique event queue ID, and the third group is the unique event ID within that queue. Currently, the unique queue ID is a number representing how many times the NS agent service has run. The last two groups of digits make it possible to create a unique file name for the lifetime of the agent (assuming 2^64 events is enough for the lifetime of the agent). NSE file names are therefore unique and can be used to track the order in which they were posted by plug-ins.
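As a rough illustration of this naming scheme (a hypothetical helper, not part of the product; the hex field widths are an assumption based on the template above):

```python
import re

def parse_nse_filename(name):
    """Decode an NSE-xxx-xxxxxxxx-xxxxxxxx.tmp file name as described above.
    Returns None if the name does not match the queued-event template."""
    m = re.match(r"NSE-([0-9A-Fa-f]{3})-([0-9A-Fa-f]{8})-([0-9A-Fa-f]{8})\.tmp$", name)
    if not m:
        return None
    prio_field, queue_id, event_id = (int(g, 16) for g in m.groups())
    return {
        "priority": 0xFFF - prio_field,  # first group stores 0xFFF - priority
        "queue_id": queue_id,            # increments each time the agent service runs
        "event_id": event_id,            # unique event ID within that queue run
    }

parse_nse_filename("NSE-FFD-00000003-0000002A.tmp")
# -> {'priority': 2, 'queue_id': 3, 'event_id': 42}
```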
5. Event files copied to the "Capture Events Folder" (configured in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\Altiris Agent\Transport) can use two different name formats:
- NSE-xxx-xxxxxxxx-xxxxxxxx.tmp - used for events that have been queued
- xxxNSExxxx.tmp - the old format, used for events that have been posted directly, bypassing the event queue. The first digits represent the priority.
If you sort the tmp files by creation date, you get the order in which the events were sent to the server. Note that one event can be sent to the server before another only if it has a higher priority, was posted earlier, or was posted bypassing the event queue.
6. Since all events are now registered in the database, there is no reason to have multiple directories on disk, so all events are now stored in EvtQueue - with the exception of events registered by stream that are less than 3000 UTF-8 characters in length. Those events are now stored in the database, to avoid wasting I/O and to improve performance for these tiny messages.
7. Additionally, the 'bad' queue, which keeps all failed NSEs, is now optional and is configured by the core setting 'EventFailureBackupFolder', a sibling of the existing EventCopyFolder registry key under Notification Server. It was made optional because the new queue supports automatic retry of an NSE on failure, so there is less need to fill the hard drive with failed NSEs. The default number of retries is 3. This was introduced because strict queue ordering means that the previous customer strategy of copying failed NSEs back into the inbox would break queue ordering, so it must be avoided wherever possible.
8. Also, all the core settings related to the event queue being full are obsolete, as the EventQueue table in the database now provides accurate queue-size measurements: events are denied in real time based on whether they would cause the queue size limit to be exceeded, rather than relying on a periodically updated "full" flag.
9. The existing processing queues are the same as before and are represented by the queueId column in EventQueueEntry and the Id column in EventQueue: 0 is the priority queue, 1 is fast, 2 is normal, 3 is slow, and 4 is large.
10. The large queue is now actually processed once more, which is important for implementing exact queue ordering.
11. There was some backwards-compatibility code (believed to be for some pre-6.0 agent scenarios) that broke every NSE containing the %20 character sequence by replacing it with spaces. This no longer occurs.
12. The old queue directories are now monitored by the inbox watcher to help ensure backwards compatibility.
13. The NS Event Queue Status data class was deemed insufficient for useful reporting, so a new Event Queue Status data class was introduced. A new report on this data class was added (under SMP Console > Reports menu > Notification Server Management > Server > Event Queue Statistics), and the existing reports were marked as displaying legacy data class data.
14. The performance improvement is a small change to the messaging resource to ensure that it does not load and save a resource if the resource key information has not changed. On an average software inventory NSE, this was previously causing hundreds of item loads/saves per NSE, which added seconds of processing time per NSE for these large software inventory NSEs.
Do you know why truncating ResourceMerge helped with queue processing? Should we have some kind of check to avoid this type of issue in our code?
The "ResourceMerge" table is a "weak spot" we found recently: it keeps records of merged resources, indicating what the resource guids were before and after the merge.
The procedure that hangs loops while trying to find the current resource guid for as long as it can find the guid's "parents", and this logic can lock the whole DB if the records become cyclic, like this:
Resource A -> Resource B
Resource B -> Resource C
Resource C -> Resource A
The only reason this could happen is some race condition in the resource merge logic in the C# code, or a service crash.
As I recall, we don't have any 100% reliable way to avoid situations like this, but I think we should redo this piece of resource merging code, or at least set up some kind of fail-safe retry logic in the stored procedure so that it will not block forever.
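The guid-resolution loop described above can be sketched like this, with a guard against the cyclic records that hang the stored procedure. This is an illustrative model, not the product's code; `merges` is a dict stand-in for the ResourceMerge table (old guid -> merged-into guid).

```python
def resolve_current_guid(guid, merges):
    """Follow the merge chain until the current guid is found.
    Raises instead of looping forever if the merge records are cyclic."""
    seen = set()
    while guid in merges:
        if guid in seen:
            raise RuntimeError("cycle detected in merge records at %s" % guid)
        seen.add(guid)
        guid = merges[guid]
    return guid

merges = {"A": "B", "B": "C"}
# resolve_current_guid("A", merges) follows A -> B -> C and returns "C";
# with merges = {"A": "B", "B": "A"} it raises instead of blocking the DB.
```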
(From eTrack 2163162):
Resource merges are mapped into a table now which keeps track of old resource guid to current guid mapping, allowing fast look up of old resource guids.
Resource association and data class importing have been changed to use this table to map the incoming resource guids to their merge targets, if any such targeting exists. The current implementation has one known limitation, which is that if a resource merge occurs on a resource which is referenced as a foreign key, it will not be remapped.
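The remapping step described above amounts to a lookup in the merge-mapping table. A minimal sketch, using a dict as a stand-in for the real table (names are illustrative):

```python
def remap_guids(guids, merge_map):
    """Replace each incoming resource guid with its merge target,
    if one exists; guids without a merge record pass through unchanged."""
    return [merge_map.get(g, g) for g in guids]

remap_guids(["A", "X"], {"A": "C"})  # "A" was merged into "C" -> ["C", "X"]
```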
Is the message processing now single threaded rather than multi-threaded as it traditionally was? Is there at least a separate thread for each “queue” since technically the messages are still sorted into “queues” within their table entries in dbo.EventQueueEntry? The behavior over the last two days seems to indicate a single threaded processing model which does not seem very efficient given how easily our two servers became backlogged on NSEs due to processing issues on one of the servers.
The NSE processing is indeed multithreaded; it was single-threaded in SP1, not vice versa. It has seemed slow recently (and looked single-threaded) only because of DB locks while processing the "resource merge" logic, which effectively prevents resource-specific tables from being accessed at all: while the merge held its lock, none of the other NSE threads could do anything, because the DB was locked.
In the event of a database communication outage which could last for quite some time and potentially require a restart of the SMP services, how quickly do the failure retries occur? Do the messages get placed back into the queue and then wait for a later retry interval or do they immediately re-submit? If they are immediately re-submitted, as they appeared to do today, then we will most likely lose all of the NSEs that were submitted during the outage. This is not a good idea. The CMDB could be missing data until the next full inventory or in the case of critical software updates and/or distributions the customer will have no idea whether they were successfully installed. I realize we have the option of enabling the Event Failure Backup folder but the customer needs to be aware of this as does support and the consulting partners.
The default retry limit for an NSE is 3, and the delay between tries is fairly small (RetryNumber * 100 ms). After that, if the option is set, the NSE will be backed up.
If the DB is unavailable, the NSEs will not be put into the queue at all: "server busy" is returned to the client on the NSE post, and the client should handle this situation gracefully.
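The retry schedule described above can be written out as a short sketch (the function name is illustrative, not a product API):

```python
def retry_delays_ms(retry_limit=3):
    """Delays between NSE retries as described above: RetryNumber * 100 ms,
    for retries 1 through retry_limit (default 3, overridable via the
    EventRetryLimit core setting)."""
    return [n * 100 for n in range(1, retry_limit + 1)]

retry_delays_ms()  # [100, 200, 300] -> only 600 ms in total before backup
```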
Please give a summary of how the queues are supposed to work now. I see that when I manually copy NSEs into the EventQueue\EvtQueue folder, nothing seems to happen. I have to copy them into the EventQueue\EvtInbox folder; a process then moves them into the EventQueue\EvtQueue folder. I also see the client posting to the EventQueue\EvtQueue directory.
As I said before: you should NOT put files into EvtQueue. The safest way (if you need a manual NSE) is to use EvtInbox. The message will then be routed into EvtQueue automatically (if it is over the 3 KB threshold), and its priority will be set in the DB accordingly.
Can you confirm that the bad and process folders are no longer used?
I don't see any evidence of a "process" or "bad" folder for NSEs. Instead, we use the mapping in BadNSEFolders.config (located in the main NS Core configuration folder), which creates specific folders for the various types of exceptions that occur when we are unable to route an NSE within the default 3 retries. This works only when NSE backup is enabled ("EventFailureBackupFolder" in core settings), and the subfolders are created under the folder specified in this setting.
Can you confirm that EvtQFast, Large, Priority, and Slow are no longer used?
Yes, we don't use them internally, but if someone puts NSEs into them manually, the NSEs will be put into the main queue in the DB and moved to EvtQueue. This is done for backwards compatibility with some older solutions that have yet to move to the new NS API.
Is EventQueue\temp still used for decompression of larger NSEs, as in NS 6?
I don't see any evidence in the code that the temp folder is used there. The data from post.aspx goes to the SMP temporary folder first (if it is over 3 KB), is then decompressed directly into the EvtQueue folder (with a RANDOM-GUID.NSE file name), and is then registered in the event queue DB table with this file name.
Please describe how to use the Event Failure Backup folder feature. I see the setting in CoreSettings.config, which seems to reference a registry setting. Do I create a registry string value containing the path where I want the backup folder to be? Exactly how is this implemented?
Yes, the code queries the core setting, which indicates that the setting lives in the registry. If there is a value (a non-empty string), it works like this:
1. If the value is a rooted path (i.e. it includes a drive letter), it is used as-is (plus the per-exception subfolder, whose mapping is defined in BadNSEFolders.config).
2. If the path is relative, the folder is created under the EventQueue folder.
3. The code writes the NSE itself into this location, along with an extra file with the same name and a .log extension; if an exception led to the event backup, the log file body will be the exception message itself.
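The path rules in steps 1 and 2 can be sketched as follows. This is an illustrative model of the behavior described above, not actual product code; the function name and folder names are assumptions.

```python
import ntpath  # Windows-style path semantics, regardless of host OS

def resolve_backup_folder(setting, event_queue_dir, exception_subfolder):
    """Return the backup destination for a failed NSE, per the rules above:
    an empty setting disables backup, a rooted path is used as-is, and a
    relative path is placed under the EventQueue folder; the exception type
    selects a subfolder (the BadNSEFolders.config mapping)."""
    if not setting:
        return None
    base = setting if ntpath.isabs(setting) else ntpath.join(event_queue_dir, setting)
    return ntpath.join(base, exception_subfolder)

resolve_backup_folder("BadNse", "C:\\EQ", "Timeout")
# relative path case -> C:\EQ\BadNse\Timeout
```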
The default retry limit for NSEs is 3 (see the excerpt from the article above); "default" implies that this value can be changed. How do I modify this value to reduce the number of NSE retries? Under HKLM\Software\Altiris\eXpress\Notification Server I see the following keys, but none of them seem to apply to the NSE retry limit.
The retry limit can be set via the "EventRetryLimit" entry in Core Settings. By default, it is absent from the settings, so the hardcoded default of 3 is used.
So if files won't be posted to EvtQueue when the DB is not connected, does the agent have a trigger that says the DB is not ready, so that the NSE is not sent or is delayed for some amount of time, or will the agent just continue to send NSEs to the NS regardless? If the files are sent regardless, do they just get rejected?
All I know is that the server response should signal to the client that the server is busy when it cannot accept the posted NSE, and it is up to the agent's logic to handle that case.
So do SQL deadlocks and timeouts also prevent NSE files from being posted? If files are posted and the DB connection then gets dropped for some reason, will this cause the current set of NSE files in EvtQueue to be deleted or become invalid?
If the connection is dropped before the NSE is registered in the DB, the same "can't handle NSE" response is returned to the client. But once the event is registered, it will not die or disappear if the connection drops afterwards. The file will remain in EvtQueue (if it is large enough) and the DB entry for the NSE will stay until it is processed. If the DB is unstable or just too busy while the NSE is being registered, registration will be retried 3 times.
Where is the queue limit held, and what is the queue limit (queue size)?
The only limit for the NSE queue that I know of is in Core Settings: "MaxFileQSize(KB)", which only limits the total size of all NSEs stored as files in EvtInbox. There are no limits on the number of DB entries for NSEs, as far as I can see. The default value of this setting is 512000 (500 MB).
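The size gate implied by this setting can be sketched as follows; a simplified illustration, not the product's code, and the treatment of 0 as unlimited reflects the next Q&A below.

```python
def inbox_accepts(current_bytes, incoming_bytes, max_kb=512000):
    """Model of the EvtInbox size limit: total file size is compared
    against MaxFileQSize(KB) (default 512000 KB = 500 MB); a value of 0
    is treated as unlimited."""
    if max_kb == 0:
        return True  # unlimited
    return (current_bytes + incoming_bytes) <= max_kb * 1024

inbox_accepts(524_287_000, 1_000)  # exactly at the 500 MB cap -> accepted
inbox_accepts(524_288_000, 1)      # would exceed the cap -> rejected
```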
How advisable is it to put 0 (unlimited) into the ...\eXpress\Notification Server\MaxFileQSize(KB) registry key in order to process more NSEs?
Setting this value (MaxFileQSize) to zero will not make NSEs process faster; it is rather an option to make things safer in very large environments, where the NS cannot process that many NSEs in time. I doubt this should really be needed: if the NS cannot keep up with 500 MB, it will eventually fail to process 600, 700, etc. megabytes in time.
How do you use "EventFailureBackupFolder"?
If you don't see this setting in the registry, it means it is not effective, i.e. not used. You can define it manually by adding an "EventFailureBackupFolder" string value under the key HKLM\Software\Altiris\eXpress\Notification Server, with the path to the folder where you want to store these NSEs.
As soon as you add it, the setting becomes active: if any NSE fails to process (after retries), the file will be put into that folder.
Note: An appropriate subfolder will be created depending on the type of the exception.
How do you use "EventCopyFolder"?
You set this value in the registry the same way as the previous one: by adding an "EventCopyFolder" string value under the key HKLM\Software\Altiris\eXpress\Notification Server, with the path to the folder where you want to store these NSEs.
As soon as this setting is effective, NSE processing will copy each NSE to this folder (even before processing).
Note: "as soon as it is set" can actually mean some time span, during which Core Settings checks the registry values for changes.