The Altiris Agent communicates with the Notification Server through a set of web interfaces available under http://<server_name>/Altiris/NS/Agent, which it uses to check policies, package codebases, package snapshots, etc. This is true for the Core agent; sub-agents such as the Inventory Rule Agent can have additional web interfaces available under additional web-application folders.
In Altiris environments with a lot of Software Delivery packages (500+ from Software Delivery Solution or Patch Management Solution), a certain number of clients (500+) and package servers (a few unconstrained, or uPS, and 50+ constrained, or cPS), the GetPackageInfoRequest interface can be stormed by requests from the Altiris Agent and Package Server agents. In effect this looks like a self-inflicted denial-of-service attack.
This is a very standard design based on the information provided by Altiris.
However, as you can see on the diagram above (Image 1), this design has a weak point: the codebase request interface ("/Altiris/NS/Agent/GetPackageInfo.aspx") is shared between all tiers in the environment. Given the title of this article, you can guess that problems on this interface have been encountered in at least a few cases.
For more information on GetPackageInfo.aspx, please check the references at the end of this article, or search the AKB directly for GetPackageInfo.
The problem can be triggered by deploying new packages to the entire environment at once: the uPS, cPS and Agents all receive policies referencing the packages at the same time, instead of the policy being staged to apply to the uPS first, then the cPS, and finally the Agents.
Here are two quick tables showing why this can cause some serious problems:
Table 1: GetPackageInfo request distribution for 10 new packages in mid-size environment
|Tier||Count||Requests (count x 10 packages)||Share of requests|
|uPS||3||3x10 = 30||0.3%|
|cPS||100||100x10 = 1,000||9.1%|
|Agents||1,000||1,000x10 = 10,000||90.7%|
Table 2: GetPackageInfo request distribution for 10 new packages in large environment
|Tier||Count||Requests (count x 10 packages)||Share of requests|
|uPS||5||5x10 = 50||0.05%|
|cPS||800||800x10 = 8,000||7.4%|
|Agents||10,000||10,000x10 = 100,000||92.6%|
As illustrated in Table 1, the uPS request load on the server is almost insignificant. However, the package distribution hierarchy implies that packages must be available on the uPS before they can be downloaded by the cPS, and only thereafter distributed to the Altiris Agents. The uPS are therefore fighting a losing battle against the cPS and Agents, which keep the server busy with requests that return no codebases.
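The arithmetic behind Tables 1 and 2 can be reproduced with a few lines of Python. This is only a sketch of the calculation; the function name `request_distribution` and the dictionary layout are illustrative choices, not anything provided by Altiris.

```python
def request_distribution(tier_counts, new_packages):
    """Return {tier: (requests, share_percent)} for one round of codebase requests.

    Each machine is assumed to send one GetPackageInfo request per new package.
    """
    requests = {tier: count * new_packages for tier, count in tier_counts.items()}
    total = sum(requests.values())
    return {tier: (n, 100.0 * n / total) for tier, n in requests.items()}


# Tier sizes taken from Table 1 (mid-size) and Table 2 (large).
mid_size = {"uPS": 3, "cPS": 100, "Agents": 1000}
large = {"uPS": 5, "cPS": 800, "Agents": 10000}

for name, tiers in (("mid-size", mid_size), ("large", large)):
    print(name)
    for tier, (n, share) in request_distribution(tiers, 10).items():
        print(f"  {tier:<6} {n:>7,} requests  {share:6.2f}%")
```

Changing the tier sizes or the number of new packages shows how quickly the uPS share of the request load shrinks as the environment grows.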
This situation is bad at the starting point (right after the administrator has applied a policy with the new packages to all computers), but things only get worse from there.
The cPS and Agents receive empty codebases because the packages are not yet ready on the uPS and cPS (their respective higher tiers in the hierarchy). Because the response from the Notification Server contains no codebase rather than an error, the Altiris Agent does not use its back-off mechanism; it retries the codebase download (hitting the GetPackageInfo.aspx interface) after 3 minutes.
So we can extend the tables above to estimate the potential GetPackageInfo.aspx hit count over a 2-hour period:
Table 3: GetPackageInfo hit count for 10 new packages during 120 minutes in mid-size environment
|uPS||3||3x10x(120/3) = 1,200||1,200 x 0.3% = 3.6|
|cPS||100||100x10x(120/3) = 40,000||tbd|
|Agents||1,000||1,000x10x(120/3) = 400,000||tbd|
Table 4: GetPackageInfo hit count for 10 new packages during 120 minutes in large environment
|uPS||5||5x10x(120/3) = 2,000||2,000 x 0.05% = 1|
|cPS||800||800x10x(120/3) = 320,000||0|
|Agents||10,000||10,000x10x(120/3) = 4,000,000||0|
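The hit counts follow from a fixed 3-minute retry with no back-off: each machine asks for each package 120/3 = 40 times in two hours. A minimal sketch of that calculation, with `hits_in_window` being a name made up for this example:

```python
def hits_in_window(machines, new_packages, window_mins=120, retry_mins=3):
    """Potential GetPackageInfo hits when every request is retried at a fixed interval."""
    attempts_per_package = window_mins // retry_mins
    return machines * new_packages * attempts_per_package


# Large environment, 10 new packages, 120-minute window.
for tier, machines in (("uPS", 5), ("cPS", 800), ("Agents", 10000)):
    print(f"{tier:<6} {hits_in_window(machines, 10):>9,}")
```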
You probably noticed a couple of entries that seemingly don't belong in Table 3: "tbd". These entries are listed as 'to be determined' because there is another factor to account for in the process: the back-off mechanism implemented in the Altiris Agent communication executive (AeXNetComm.dll).
The back-off mechanism kicks in when the Altiris Agent receives an error message, which can be a server-busy response inside the XML response (in which case IIS itself is not overloaded, as it would be with a 50x error). Once the agent is backing off, no request is made from the SWD Agent (GetPackageInfo, GetPackageSnapshot, etc.) or from other sub-agents.
By default the back-off mechanism starts with a 3-minute retry interval that doubles after every failure (6, 12, 24, and so on, up to the maximum retry interval of 120 minutes). While the agent is backing off, any request made by the agent or other sub-agents will receive the <<SERVER_BUSY_AGENT_BACKING_OFF>> error message without any communication taking place on the wire (effectively implementing a blockout for specific interfaces).
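To get a feel for how much the back-off reduces the request rate, the doubling schedule can be modelled in a few lines. This is a simplified model, not the AeXNetComm.dll logic: it assumes every attempt fails and that the interval is simply capped at the maximum once doubling would exceed it. The function names are illustrative.

```python
from itertools import islice


def backoff_intervals(retry_delay=3, max_retry_delay=120):
    """Yield successive retry intervals in minutes, doubling up to the cap."""
    delay = min(retry_delay, max_retry_delay)
    while True:
        yield delay
        delay = min(delay * 2, max_retry_delay)


def attempts_in_window(window_mins, retry_delay=3, max_retry_delay=120):
    """Count the retries that fit in the window if every attempt fails."""
    elapsed, attempts = 0, 0
    for delay in backoff_intervals(retry_delay, max_retry_delay):
        if elapsed + delay > window_mins:
            return attempts
        elapsed += delay
        attempts += 1


print(list(islice(backoff_intervals(), 8)))  # [3, 6, 12, 24, 48, 96, 120, 120]
print(attempts_in_window(120))               # 5 retries with the default back-off
print(attempts_in_window(120, 3, 3))         # 40 retries at a fixed 3 minutes
```

Under these assumptions, an agent that is backing off makes 5 attempts in two hours instead of 40, which is why the hit counts for tiers that receive busy errors cannot be read directly from the fixed-interval formula.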
So we understand that the GetPackageInfo.aspx interface can be overwhelmed under specific conditions. What really interests us here is how to handle this type of issue (should you ever be in that unpleasant position).
Here's a list of configuration items, on the server and on the client, that can help avoid or reduce this issue. We will review each of them in detail below:
Server configuration items
- CoreSettings.Config MaxConcurrentPackageInfoRequests
- Duplication of web applications
- IIS site security to deny server access
Client configuration items
- Registry: "Package Source Expiry (mins)"
- Registry: "Retry delay (mins)" & "Maximum Retry delay (mins)"
- Network blockout
- Manual replication
Server configuration items
CoreSettings.Config MaxConcurrentPackageInfoRequests
This core setting controls the maximum number of Package Info requests the NS will handle at any given time. Any request above this count is returned an error message indicating that the server has reached MaxConcurrentPackageInfoRequests; the agent then sets the package download to "Retry" and schedules a download attempt using the standard retry interval.
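The behaviour of such a cap can be pictured with a toy admission gate. This is purely a conceptual sketch and not the Notification Server's implementation: the class `PackageInfoGate`, the function `handle_request` and the "busy" return value are all invented for the illustration.

```python
import threading


class PackageInfoGate:
    """Toy admission gate: at most `max_concurrent` requests in flight."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self):
        """Return True if the request is admitted, False if the cap is reached."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def leave(self):
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_request(gate, work):
    """Run `work()` if admitted; otherwise report busy so the caller can retry later."""
    if not gate.try_enter():
        return "busy"
    try:
        return work()
    finally:
        gate.leave()
```

The point of the design is that a rejected request costs the server almost nothing, while the caller is told explicitly to come back later instead of queuing behind everyone else.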
Duplication of web application
As illustrated above (especially in Image 1), the biggest challenge with this issue is that the GetPackageInfo.aspx bottleneck is global, so we have very few options to throttle incoming requests from the Agents or cPS whilst the uPS are synchronizing. Hence this option, which requires a little extra work: create a couple of IIS websites using different ports and pointing to the "%InstallDir%/Altiris/NS/Agent" subfolder. In this manner we can point the managed clients (Altiris Agents) to a new site for communications with the NS, and likewise serve the constrained package servers from a distinct IIS site on another port. With this in place, in the event of a GetPackageInfo storm you can throttle incoming requests from the Altiris Agents and cPS using IIS throttling to limit the bandwidth available to a website or its TCP request count. Additionally, we could use custom web applications to control the kernel queue depth and further limit the amount of data allowed in from the agents during critical periods.
Important note! I still have to warn the reader that this is very experimental. Although I have run a quick proof of concept to ensure this is feasible, there are a number of details that would need to be verified prior to implementing this option.
IIS site security to deny server access
This is a brute-force method, but it proves quite effective when push comes to shove. Using the IIS security options on the default website, you can allow or deny access to the server based on IP addresses or IP ranges. So if a couple of uPS are struggling to synchronize, you can quickly turn down all other noise (traffic) and ensure the GetPackageInfo requests are fulfilled as soon as possible for the whitelisted computers.
This has proved to work. However, the agent's retry timing is mostly out of the administrator's control, so watching package requests come in every few minutes can be very stressful, and under pressure the delay can still be seen as a problem. For faster resolution, check "Manual replication" under the client configuration items below.
Client configuration items
Registry: "Package Source Expiry (mins)"
This registry value controls how long a package codebase is kept valid by the Altiris Agent [refs]. Once the codebase has expired, the Altiris Agent checks with the Notification Server by hitting GetPackageInfo.aspx. Given that the default for this registry value is 0x0a (10 minutes), the client requests new codebases any time it needs to verify a package (even if it has nothing to download, as is most often the case for package servers).
Additionally, the maximum value allowed for this registry entry is 1 week (10,080 minutes), so there is some margin here to reduce the workload on the GetPackageInfo.aspx interface without taking too many risks (of having the agent use stale or invalid codebases).
Registry: "Retry delay (mins)" & "Maximum Retry delay (mins)"
Both of these registry entries are used by the package retry and back-off mechanism [refs].
- "Retry delay (mins)" specifies the initial amount of time the agent will wait before resending a request to the NS or to a PS (GetPackageInfo or GetPackageSnapshot). Default value is 0x3 (3 minutes).
- "Maximum Retry delay (mins)" specifies the maximum amount of time the agent will wait before resending a request to the NS or to a PS (GetPackageInfo or GetPackageSnapshot) when the agent is backing off. Default value is 0x78 (120 minutes).
There is also some room for tuning here, especially for the uPS and cPS: allowing a uPS to back off could reduce its chances of synchronizing during a stress period from little to none. The best option for the uPS is to set both the retry delay and the maximum retry delay to 1 minute. In this manner the uPS will never go more than a minute without trying to get package codebases from the server.
Network blockout
This is part of the Altiris Agent policy [refs]. It can be implemented quickly and easily, even during a stress period. If you are encountering this type of issue, or want to reduce the likelihood of it arising, you can implement the following blockout periods to ensure the uPS and cPS have time to update out of hours:
- uPS can communicate with the NS at all times
- cPS can communicate with the NS between 0400 and 2200 (blockout is in effect for 6 hours between 2200 and 0400)
- Altiris Agents can communicate with the NS between 0600 and 2000 (blockout is in effect for 10 hours between 2000 and 0600)
Please note that the above values are an illustration of what could be implemented for managed machines (clients and workstations) in a large environment.
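Blockout windows that wrap past midnight are easy to get wrong when planning, so it can help to sanity-check them before configuring the policy. The sketch below only models the illustrative schedule above; the real blockouts are set in the Altiris Agent policy, and the names `BLOCKOUTS`, `in_blockout` and `may_communicate` are invented for this example.

```python
from datetime import time


def in_blockout(now, start, end):
    """True if `now` is inside [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


# Illustrative schedule from the list above; None means no blockout.
BLOCKOUTS = {
    "uPS": None,
    "cPS": (time(22, 0), time(4, 0)),
    "Agent": (time(20, 0), time(6, 0)),
}


def may_communicate(tier, now):
    """True if a machine of the given tier is allowed to talk to the NS at `now`."""
    window = BLOCKOUTS[tier]
    return window is None or not in_blockout(now, *window)


for hour in (5, 21, 23):
    allowed = [tier for tier in BLOCKOUTS if may_communicate(tier, time(hour, 0))]
    print(f"{hour:02d}:00 -> {', '.join(allowed)}")
```

With this schedule the uPS have the server to themselves between 2200 and 0400, and the cPS share it only with the uPS from 2000 to 2200 and from 0400 to 0600.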
Manual replication
This is the ultimate option and the most efficient one, if implemented properly.
The Package Servers are not ready because they are failing to download valid codebases for the packages (and subsequently to download the package snapshot information needed to verify that the cached packages are compliant with the current versions on the server). A quick workaround for the slow (or failing) package synchronization is to replicate the package server's package delivery directory structure with only the package.xml files inside.
Replicating the package delivery structure can be done by copying the package delivery folder to a new location and removing everything but the package.xml files. Once the directory structure contains only package.xml files, you can compress the package delivery tree for deployment to the uPS. This method saves a lot of time compared with copying package cache folders that are most likely already in sync on the package servers. In all cases, once the PS has some valid codebases with which to connect to the NS, it will be able to download any missing files based on the current package snapshot.
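The copy-and-prune step can be scripted. The following is a minimal sketch under a few assumptions: the staging folder sits outside the source tree, only folders that actually contain a package.xml need to be recreated, and a zip archive is an acceptable way to ship the result. The function names and all paths are hypothetical; substitute the real package delivery folder of your package server.

```python
import shutil
from pathlib import Path


def copy_package_xml_tree(source_root, staging_root):
    """Recreate the folder structure under staging_root with only package.xml files.

    Returns the number of package.xml files copied.
    """
    source_root = Path(source_root)
    staging_root = Path(staging_root)
    copied = 0
    for xml_file in source_root.rglob("package.xml"):
        target = staging_root / xml_file.relative_to(source_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(xml_file, target)  # copy2 also preserves timestamps
        copied += 1
    return copied


def build_archive(source_root, staging_root, archive_base):
    """Stage the package.xml-only tree and zip it; returns the archive path."""
    copy_package_xml_tree(source_root, staging_root)
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(staging_root))


# Example call (hypothetical paths):
# build_archive(r"D:\PackageDelivery", r"D:\Staging\PackageDelivery", r"D:\Staging\pkgxml")
```

Copying only the package.xml files, rather than copying everything and then deleting, avoids moving the large cached package files at all.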
References
AKB #19106: Altiris Notification Server 6.0 SP3 Release Notes
AKB #18712: Altiris Notification Server 6.0 SP3 Reference
AKB #33385: How to configure constrained and unconstrained Package Server in a site hierarchy
AKB #02888: How the Altiris Agent obtains Package Codebases (download sources)
AKB #20837: Inside Notification Server agent interfaces: GetPackageInfo.aspx
AKB #04295: Package Servers are unable to download new or updated packages
AKB #01716: How often will the Altiris Agent retry communications?