
Event Queues in ITMS 7.1 SP2

Created: 29 Mar 2012 | 5 comments
Author: meggie_woodfield

Table of Contents

About Event Queues in ITMS 7.1 SP2

New Event Queue features in 7.1 SP2

New Event Queue processing features in 7.1 SP2

How to use the EventFailureBackupFolder and EventCopyFolder settings

Q&A

NSE Flow Diagrams

About Event Queues in ITMS 7.1 SP2

This article covers how Event Queues work in ITMS 7.1 SP2. Event Queues, also called Message Queues, store and queue some Notification Server events (NSEs) while those events are waiting to be processed.

New Event Queue features in 7.1 SP2

With the release of ITMS 7.1 SP2, the following general improvements were made:

  • The SMP logs and Event Queues have been moved to C:\ProgramData\Symantec\SMP.

        For information about where Event Queues were located prior to the 7.1 SP2 release, see the following article:

        www.symantec.com/docs/HOWTO45754

  • Event Queue paths can now be changed through the registry: the EventQueuePath key in HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\eXpress\Notification Server

        Changes will be applied after all services are restarted. The recommended way to change path values is as follows:

        1.      Stop all Altiris services and the World Wide Web Publishing Service.

        2.      Change registry values.

        3.      Start all Altiris services and the World Wide Web Publishing Service.

        NOTE: Do not redirect the EventQueues to the NSCap directory structure as it may stop NSEs from being processed.

        This is because old NSCap folders are monitored by an event dispatcher for legacy NSEs, which are put as files into subfolders.

        If you point the dispatcher to one of the subfolders, it will create files there and they will immediately be caught as new NSEs from legacy solutions and routed again to the dispatcher.

        For more information about this issue, please see the following article:

        www.symantec.com/docs/TECH183959

  • Default values for the Event Queue ([CommonAppDataFolder]Symantec\SMP\EventQueue) are now hard-coded in the MSI file.
  • All hard-coded values with the path to the Event Queue folder are replaced with properties from NS.Installation.ProductConfiguration.
  • Event Queue folders will be created on first call to properties in NS.Installation.ProductConfiguration.
  • If the folder for Event Queue cannot be created, then an exception will be thrown and an entry will be added to the Altiris and/or Windows Application logs.
  • During an Upgrade (In-Box)/Repair, old values should be migrated from the previous version of SMP.
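
The NSCap warning above can be turned into a quick sanity check before the registry is edited. The following Python sketch only tests whether a proposed Event Queue path falls inside an NSCap folder. The function names and the NSCap location used in the usage note are assumptions made for illustration; they are not part of the product.

```python
import ntpath


def is_inside(candidate: str, parent: str) -> bool:
    """Return True if candidate is parent itself or lies beneath it.

    Windows path rules are applied (case-insensitive, backslash separators).
    """
    cand = ntpath.normcase(ntpath.normpath(candidate))
    par = ntpath.normcase(ntpath.normpath(parent)).rstrip("\\")
    return cand == par or cand.startswith(par + "\\")


def check_event_queue_path(candidate: str, nscap_root: str) -> None:
    """Raise ValueError if the proposed EventQueuePath looks unsafe.

    nscap_root is wherever NSCap lives on your server (an assumption here).
    """
    if not ntpath.isabs(candidate):
        raise ValueError("EventQueuePath should be an absolute path")
    if is_inside(candidate, nscap_root):
        raise ValueError("EventQueuePath must not point into the NSCap folders")
```

For example, with an assumed NSCap root of C:\Program Files\Altiris\Notification Server\NSCap, the default C:\ProgramData\Symantec\SMP\EventQueue passes the check, while any subfolder of NSCap is rejected.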

New Event Queue processing features in 7.1 SP2

With the release of ITMS 7.1 SP2, some improvements were made to accelerate and improve NSE processing in the Event Queues.

Agent side

  • GET parameters are now passed to PostEvent.aspx in the format priority=level&source=sourceGuid. The source parameter (source=<agent guid>) is new.

        The agent code was modified to make sure that events are sent in the proper order.

        In SP2, the first event passed to the Event Queue API will be sent first.

  • NSEs with the same source guid and priority level will be processed in the same order they are received with no concurrency.
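
As an illustration of the query-string format described above, the sketch below builds a PostEvent.aspx URL with the priority and source parameters. The base URL, the integer priority value, and the GUID are placeholders chosen for the example; the real agent builds this request internally.

```python
from urllib.parse import urlencode


def build_post_event_url(base_url: str, priority: int, source_guid: str) -> str:
    """Append priority=<level>&source=<agent guid> to the PostEvent.aspx URL."""
    query = urlencode({"priority": priority, "source": source_guid})
    return f"{base_url}?{query}"


# Hypothetical server name and path, for illustration only.
url = build_post_event_url(
    "http://ns.example.com/Altiris/NS/Agent/PostEvent.aspx",
    priority=2,
    source_guid="11111111-2222-3333-4444-555555555555",
)
```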

Server side

Because the existing queuing structure could not attach metadata to each event, the server-side changes are extensive. The queuing system was rewritten to register each event in the database and to dispatch events under the constraint that no two NSEs with the same source guid and priority level can be processed simultaneously. The one exception is the empty source guid: for backward compatibility, these NSEs can still be processed simultaneously. Additionally, NSEs with the same source guid and priority level are processed in the exact order in which they are registered in the database.
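
The dispatch constraint can be pictured with a small Python model. This is only a sketch of the rule as described in this article, not the actual Notification Server code; the QueuedEvent type and the pick_dispatchable function are invented for the example.

```python
from dataclasses import dataclass

EMPTY_GUID = ""  # legacy events carry no source guid


@dataclass(frozen=True)
class QueuedEvent:
    seq: int       # order in which the event was registered in the database
    source: str    # source guid, empty for legacy events
    priority: int  # priority level


def pick_dispatchable(pending, in_flight):
    """Return the pending events that may start processing now.

    in_flight holds the (source, priority) pairs that are being processed.
    At most one event per (source, priority) pair is released, oldest first,
    so later events of the same pair wait and registration order is kept.
    Events with an empty source guid are never held back.
    """
    busy = set(in_flight)
    ready = []
    for event in sorted(pending, key=lambda e: e.seq):
        key = (event.source, event.priority)
        if event.source == EMPTY_GUID:
            ready.append(event)
        elif key not in busy:
            busy.add(key)
            ready.append(event)
    return ready
```

With two pending events from agent A at priority 2, only the older one is released; the second becomes eligible once the first has finished and its key is no longer in flight.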

  • Event files that are copied to the Capture Events Folder (configured in the registry through HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\Altiris Agent\Transport) can use two different name formats:
    • NSE-xxx-xxxxxxxx-xxxxxxxx.tmp – This is used for events that have been queued.
    • xxxNSExxxx.tmp – This is the old format. It can be used for events that have been posted directly, ignoring the events queue. The first digits represent priority.

        If you sort tmp files by date created, the events will be sorted in the order that they were sent to the server.

        By using priority and ignoring the Event Queue, you can cause certain events to be sent to the server sooner.

  • Since all events are registered in the database in SP2, there is no reason to have multiple directories on disk. All events are now stored in EvtQueue, with the exception of events that are registered by stream and are less than 3000 UTF-8 characters in length. Those events are stored in the database itself to avoid wasting I/O and to improve performance for these tiny messages.
  • The “bad” queue which keeps all failed NSEs is optional and can be configured by the core setting EventFailureBackupFolder (this is a sibling of the existing EventCopyFolder registry key under Notification Server). This was made optional because the new queue supports automatic retry of an NSE on failure, so there is reduced need to fill the hard drive with failed NSEs.

        The default number of retries is three. Automatic retry was put in place because, with strict queue ordering, the previous customer strategy of copying failed NSEs back into the inbox would break the queue order.

        For more information, see the following section of this document: “How to use the EventFailureBackupFolder and EventCopyFolder settings”.

  • All of the core settings associated with the Event Queue being full are obsolete, as the EventQueue table in the database now provides accurate queue size measurements. This causes events to be denied in real-time based on whether they would cause the queue size to be exceeded, rather than relying on a periodically updated full flag.
  • The existing processing queues are the same as before and are represented by the queueId column in EventQueueEntry and the Id column in EventQueue. 0 - priority queue, 1 - fast, 2 - normal, 3 - slow, 4 - large.
  • In SP2, the large queue is processed one more time than it was in previous versions. This is important for the exact queue ordering to be implemented.
  • The backward compatibility code (believed to exist for some pre-6.0 agent scenarios), which previously broke every NSE containing a %20 character sequence by replacing it with a space, now works.
  • In SP2, the old queue directories are monitored by the inbox watcher to help ensure backward compatibility.
  • A new Event Queue Status data class was introduced to provide more useful reporting information. A new report was added to report on this data class (under SMP Console>Reports>Notification Server Management>Server>Event Queue Statistics), and the existing reports were marked as displaying legacy class data.
  • A small change was made to the messaging resource to ensure that it doesn't load and save a resource if the resource key information has not changed. Before SP2, an average software inventory NSE caused hundreds of items to be loaded or saved, which added seconds of processing time per NSE for the large software inventory NSEs.
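
Two of the points above lend themselves to a short sketch: the queue numbering and the rule that tiny stream-registered events live in the database instead of on disk. The Python names below (QueueId, storage_for) are made up for the example and do not come from the product.

```python
from enum import IntEnum


class QueueId(IntEnum):
    """Queue numbering as listed for EventQueueEntry.queueId and EventQueue.Id."""
    PRIORITY = 0
    FAST = 1
    NORMAL = 2
    SLOW = 3
    LARGE = 4


INLINE_LIMIT_CHARS = 3000  # events shorter than this may be kept in the database


def storage_for(payload: str, registered_by_stream: bool) -> str:
    """Say where an event body would be kept under the rule described above."""
    if registered_by_stream and len(payload) < INLINE_LIMIT_CHARS:
        return "database"
    return "EvtQueue"
```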

How to use the EventFailureBackupFolder and EventCopyFolder settings

EventFailureBackupFolder

If you don’t see this setting in the registry, it is not in effect. However, you can define it manually by adding an EventFailureBackupFolder string value under the key HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\eXpress\Notification Server.

As soon as you add it, the setting will be active. If any of the NSEs fail to process (with retries), the file will be put into that folder. The appropriate subfolder will also be created, depending on the type of exception.

EventCopyFolder

You can define this setting manually by adding an EventCopyFolder string value under the key HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\eXpress\Notification Server.

Once this setting takes effect, NSE processing will copy each NSE to this folder. Please note that it may take some time for the setting to take effect, as Core Settings must first check the registry for changed values.

Q&A

Q: Do you know why truncating ResourceMerge helped with the Queue processing? Should we have some type of check to avoid this type of issue in our code?

A: The ResourceMerge table is a weak spot that we found recently. It keeps records of merged resources, showing what the resource guids were before and after each merge.

The procedure that hangs loops through these records, following parent links until it reaches the current resource guid. This logic can lock the whole DB if the records form a cycle, like this:

Resource A > Resource B
Resource B > Resource C
Resource C > Resource A

The only way this could happen is through a race condition in the resource merge logic in the C# code, or a service crash. We don’t have a foolproof way to avoid situations like this.

Resource merges are mapped into a table now, which keeps track of old resource guids to current guid mapping, allowing a fast lookup of old resource guids. Resource association and data class importing have been changed to use this table to map the incoming resource guids to their merge targets, if any such targeting exists. The current implementation has one known limitation: If a resource merge occurs on a resource which is referenced as a foreign key, it will not be remapped.
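
To show why a cycle is dangerous, here is a minimal Python sketch of a lookup that follows merge records to the current guid and stops as soon as it revisits a guid. It is not the logic of the actual stored procedure; the dictionary stands in for the ResourceMerge table and the function name is invented.

```python
def resolve_current_guid(merges: dict[str, str], guid: str) -> str:
    """Follow old-guid -> new-guid records until no further merge exists.

    Raises ValueError if the records loop back on themselves, which is the
    case where an unguarded loop would never finish.
    """
    seen = {guid}
    current = guid
    while current in merges:
        current = merges[current]
        if current in seen:
            raise ValueError(f"merge cycle detected at {current}")
        seen.add(current)
    return current
```

With the records A > B and B > C, looking up A returns C. With the cyclic records shown above (A > B, B > C, C > A), the lookup raises an error instead of looping forever.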

 

Q: Is the message processing now single-threaded rather than multi-threaded as it traditionally was? Is there at least a separate thread for each queue since technically the messages are still sorted into queues within their table entries in dbo.EventQueueEntry? The behavior over the last two days seems to indicate a single-threaded processing model which does not seem very efficient given how easily our two servers became backlogged on NSEs due to processing issues on one of the servers.

A: NSE processing is multithreaded. It was single-threaded in SP1, not the other way around. It has appeared slow recently (and looked single-threaded) only because DB locks occur while the resource merge logic runs, which effectively blocks access to resource-specific tables. While the merge held its locks, none of the other NSE threads could do anything because the DB was locked.

 

 Q: In the event of a database communication outage, which could last for quite some time and potentially require a restart of the SMP services, how quickly do the failure retries occur? Do the messages get placed back into the queue and then wait for a later retry interval or do they immediately re-submit? If they are immediately re-submitted, as they appeared to do today, then we will most likely lose all of the NSEs that were submitted during the outage. This is not a good idea. The CMDB could be missing data until the next full inventory; in the case of critical software updates and/or distributions I will have no idea if they were successfully installed.

A: The default retry limit for NSE is three times and the delay between tries is pretty small (RetryNumber * 100 ms). After that, if the option is set, the NSE will be backed up.

If the DB is out of order the NSEs will not be put into the queue at all. Actually, a ‘server busy’ message will be returned to the client on the NSE post, and the client should handle this situation gracefully.
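
The retry timing described in this answer (three retries, with a delay of RetryNumber * 100 ms) can be sketched as follows. This is a simplified model, not the server code. It assumes one initial attempt followed by up to three retries, and the process callback and function name are placeholders.

```python
import time


def process_with_retries(process, nse, retry_limit=3, base_delay_ms=100,
                         sleep=time.sleep):
    """Try to process an NSE, retrying with a delay of retry number * 100 ms.

    Returns True on success and False once the retries are used up, at which
    point the NSE would be backed up if EventFailureBackupFolder is set.
    """
    attempt = 0
    while True:
        try:
            process(nse)
            return True
        except Exception:
            attempt += 1
            if attempt > retry_limit:
                return False
            sleep(attempt * base_delay_ms / 1000.0)
```

Under these assumptions the delays are 100 ms, 200 ms, and 300 ms, so an NSE that keeps failing is given up on well within a second. This matches the concern in the question: the retries do not wait out a long database outage.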

 

Q: How are the queues supposed to work now? I found that when I manually copy NSEs into the EventQueue\EvtQueue folder, nothing happens. I have to copy them into the eventqueue\EvtInbox folder. That process moves them into the EventQueue\EvtQueue folder directory. I see the client posting to the EventQueue\EvtQueue directory.

A: Do NOT put files into EvtQueue. The safest way (if you need to submit an NSE manually) is to use EvtInbox. The message will then be routed into EvtQueue automatically (no matter what size if it’s over 3KB), and its priority will be set in the DB accordingly.

 

Q: Are bad and process folders no longer used?

A: There is no evidence of the process or bad folder for NSE; instead, mapping is used inside the BadNSEFolders.config (located in main NS Core configuration folder). This will create specific folders for multiple types of exceptions that might have occurred if it were unable to route the NSE in three retries. This works only when the NSE backup is set (EventFailureBackupFolder in core settings) and the subfolders will be created under the folder as specified in this setting.

 

Q: Are EvtQFast, EvtQLarge, EvtQPriority, and EvtQSlow no longer used?

A: Correct, they are not used internally. However, if someone puts NSEs into one of these folders manually, they will be put into the main queue in the DB and moved to EvtQueue. This is done for backward compatibility with some older solutions that don’t use the newest NS API.

 

Q: Is EventQueue\temp still used for decompression of larger NSEs like in NS 6?

A: There is no evidence in the code that the temp folder is used there. The data from post.aspx will go to the SMP temporary folder first (if it’s over 3KB), then it is decompressed into the EvtQueue folder directly (with a RANDOM-GUID.NSE file name), and then registered in the Event Queue DB table with this file name.

 

Q: Describe how to use the feature Event Failure Backup folder. I see the setting in coresettings.config which seems to reference a registry setting. Do I create a registry string value with the value of where I want the backup folder to be? Exactly how is this implemented?

A: Yes, the code that queries the core setting indicates that this setting is stored in the registry. If the value is a non-empty string, it is handled as follows:
1.1 - If the value is a rooted path (that is, one with a drive letter), it is used as is, plus the subfolder for the exception, which is mapped in BadNSEFolders.config.
1.2 - If the path is relative, the folder is created under the EventQueue folder.
1.3 - The code writes the NSE itself into this location, along with an extra file with the same name and a .log extension. If an exception led to the event backup, the body of the log file is the exception message itself.
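
The path rules in this answer can be summarised in a few lines of Python. This is a sketch of the described behavior only; the function name is invented, and the subfolder name used in the usage note is a placeholder, since the real names come from BadNSEFolders.config.

```python
import ntpath


def resolve_backup_folder(setting: str, event_queue_root: str,
                          exception_subfolder: str):
    """Work out where a failed NSE would be backed up.

    Returns None when the setting is empty (backup disabled). A path with a
    drive letter is used as is; a relative path is placed under the
    EventQueue folder. The exception subfolder is appended in both cases.
    """
    if not setting.strip():
        return None
    drive, _ = ntpath.splitdrive(setting)
    base = setting if drive else ntpath.join(event_queue_root, setting)
    return ntpath.join(base, exception_subfolder)
```

For example, a setting of D:\NSEBackup with a hypothetical Timeout subfolder resolves to D:\NSEBackup\Timeout, while a relative setting of Failed resolves to a Failed\Timeout folder under the EventQueue folder.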

 

Q: The default retry limit for NSEs is three. To me, default implies that this value can be changed. How do I modify this value to reduce the number of NSE retries? Under HKLM/Software/Altiris/eXpress/Notification Server I see some keys, but none of them seem to apply to the NSE retry limit.

A: Retry limit can be set by Core Settings in the EventRetryLimit entry. By default, it is absent from the settings so the hardcoded default of three is used.

 

Q: If we won’t post files to the EvtQueue if the DB is not connected, then does the agent have a trigger that says the DB is not ready, so that the NSE will not be sent or will be delayed for X amount of time, or is the agent just going to continue to send NSEs to the NS regardless? If the files are sent regardless, do they just get rejected?

A: The server response should indicate to the client that the server is busy and can’t register and save the NSE that was posted. It’s up to the agent’s logic to handle this case; the client should retry after some time.

 

Q: Do SQL deadlocks and timeouts also prevent NSE files from being posted? If files are posted and the DB connection gets dropped for some reason is this going to cause the current set of NSE files in the EvtQueue to delete or become invalid?

A: If the connection is dropped before the NSE is registered in the DB, the same “can’t handle NSE” response will be returned to the client. However, once the event is registered, it will not die or disappear if the connection drops. The file will remain in EvtQueue (if it’s large enough) and the entry for this NSE will stay in the DB until it is processed. If the DB is unstable or just too busy while the NSE is being registered, registration will be retried 3 more times. However, since the HDD usage for queues is checked before the file is saved, the server may respond with busy if there is no more space, and in that case it will not try to register again.

 

Q: Where is the queue limit held and what is the queue limit (queue size)?

A: The only limit for the NSE queue is in the Core Settings: MaxFileQSize(KB), which only limits the total size of all NSEs stored as files in EvtInbox. There are no limits on DB entries for NSEs. The default value of this setting is 512000 (500 MB).
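
As a rough illustration of how a size limit like this gates incoming NSEs, the Python sketch below checks whether a new file would push the queue past MaxFileQSize. The function name and the exact comparison are assumptions. Treating 0 as unlimited follows the wording of a later question in this Q&A, not documented behavior.

```python
DEFAULT_MAX_FILE_Q_SIZE_KB = 512000  # default from the answer above (500 MB)


def can_accept(current_queue_bytes: int, incoming_bytes: int,
               max_file_q_size_kb: int = DEFAULT_MAX_FILE_Q_SIZE_KB) -> bool:
    """Return True if storing the incoming NSE keeps the queue within the limit."""
    if max_file_q_size_kb == 0:
        return True  # assumption: 0 means no limit
    limit_bytes = max_file_q_size_kb * 1024
    return current_queue_bytes + incoming_bytes <= limit_bytes
```

When the check fails, the server would answer the post with a busy response instead of saving the file.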

 

Q: Is it recommended to put 0 (unlimited) into …/express/Notification Server/ MaxFileQSize(KB) regkey to process more NSEs?

A: Increasing the HDD space limit for the event queue is not a complete solution for every customer that experiences slow NSE processing. For most customers, the default 500 MB is more than enough to handle thousands of clients. However, raising the MaxFileQSize setting can still help in huge environments where tens of thousands of clients could send messages simultaneously. In that case the server may be able to process these messages, yet this setting limits how many it can receive and store. Here are two examples.

Example 1: If the NS is an 8 GB / 2 CPU computer handling 5,000 clients without many agents on them, then the 10k clients that are posting can cause trouble if they start to process tasks and software deliveries at the same time. The server would handle their responses fine, but it would take time. In this case, increasing the queue limit could help.

Example 2: If the NS is an 8 GB / 2 CPU computer handling 20,000 clients and you set up ITMS and all possible agents, the server might not be able to keep up with the load. In this case, increasing the setting will not help because no matter how much you increase it, the clients will send more data than the server can handle.

 

NSE Flow Diagrams

NSE Dispatch Flow

NSE Incoming Flow

Comments

DanC@BYU:

After dropping nse's into the EvtInbox folder and hoping they would process from there, they did go into the "Bad" subfolder. How do we get them to process from there? I had 9,000+ stale (up to 10 days old) nse's in the EvtQueue folder that weren't processing. Some must be processing because after moving them all I only have about 7500 in the Bad folder.

Ludovic Ferre:

From reading the article above it appears that the NSE's are only processed after DB registration (See NS Dispatch Flow above).

So dropping events into the Q folders will no longer have the expected results, of getting them processed...


Ludovic FERRÉ
Principal Remote Product Specialist
Symantec

DanC@BYU:

Quite so, I got off the phone about a week ago with support and sure enough the other folders are no longer used with SP2. Thanks for the tip! Awesome article btw.

Pascal KOTTE:

Thanks, exactly kind of article we need to understand: "how it works".

What was less clear for me (not native english), what's happen to "failed" NSE (no "bad" folder registry) ? I guess now just "deleted" and log in altiris log viewer... But, I was surprise because I got the following issue:

Where bad NSE keep hold, and grow on the EvtQueue, default limited 512MB ! I was deleting them manually.. Symantec support 1st provide the registry to extend size, but final, not a solution.

logs:

  • Failed to move file from 'C:\ProgramData\Symantec\SMP\EventQueue
  • Failed to process NSE invalid date

Only the patch this TECH195347, was solving the queue growing problem :)

Be aware your queues (not NScap one, but ProgramData/Symantec/SMP :)

~Pascal @ Kotte.net~ Do you speak French and use Altiris? Come join us on the GUASF

asnotb:

Meggie,

thank you for this post, I have 1 question. In your Q&A section you mentioned the resourcemerge table and the truncating of this table.

We have performance issues in our SQL server that points to the stored procedure "spMapMergedResourceWithIntentToMerge".

This stored procedure loops the resourcemerge table (as you probably know ;-) ) Our resourcemerge table has more than 7800 rows.

Am I correct to state that there is no problem with regularly truncating this table?
