
How to Handle GetPackageInfo Request Storms?

Created: 06 May 2010 | 5 comments
Ludovic Ferre's picture

The Altiris Agent uses a set of interfaces available under http://<server_name>/Altiris/NS/Agent to communicate with the Notification Server and check policies, package codebases, package snapshots and so on. This is true for the Core agent; sub-agents such as the Inventory Rule Agent can expose additional web interfaces under additional web-application folders.

In Altiris environments with a lot of Software Delivery packages (500+ from Software Delivery Solution or Patch Management Solution), a sizeable number of clients (500+) and package servers (a few unconstrained and 50+ constrained), the GetPackageInfo request interface can be stormed by requests from the Altiris Agents and Package Server agents. In effect this looks like a self-inflicted Denial-of-Service attack.

Image 1: Tiered architecture with Constrained Package Servers (cPS) and Unconstrained Package Servers (uPS)

This is a very standard design based on the information provided by Altiris [1][2].

However, as you can see on the diagram above (Image 1), this design has a weak point: the codebase request interface ("/Altiris/NS/Agent/GetPackageInfo.aspx") is shared between all tiers in the environment. Given the title of this article, you can guess that problems on this interface have been encountered in at least a few cases.

For more information on GetPackageInfo.aspx please check references [3] and [4], or search the AKB directly for GetPackageInfo.

The problem can be triggered by the deployment of new packages to the entire environment at once. This happens when the uPS, cPS and Agents all receive policies with the package reference at the same time, instead of the policy being staged to apply to the uPS first, then the cPS, and finally the Agents.

Here's a quick table indicating why this can cause some serious problems:

Table 1: GetPackageInfo request distribution for 10 new packages in mid-size environment

 Tier     Machines   Requests              Distribution
 uPS      3          3x10 = 30             0.3%
 cPS      100        100x10 = 1,000        9.1%
 Agents   1,000      1,000x10 = 10,000     90.7%

Table 2: GetPackageInfo request distribution for 10 new packages in large environment

 Tier     Machines   Requests              Distribution
 uPS      5          5x10 = 50             0.05%
 cPS      800        800x10 = 8,000        7.4%
 Agents   10,000     10,000x10 = 100,000   92.5%
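
The distribution figures in Tables 1 and 2 can be reproduced with a few lines of arithmetic. This is only a sketch: the function name and the dictionary layout are illustrative and not part of any Altiris tooling, and it assumes every machine requests every new package exactly once.

```python
def request_distribution(machines, packages):
    """Return {tier: (requests, share_percent)} for one round of requests.

    Assumes each machine issues one GetPackageInfo request per new package.
    """
    requests = {tier: count * packages for tier, count in machines.items()}
    total = sum(requests.values())
    return {tier: (n, 100.0 * n / total) for tier, n in requests.items()}


mid_size = request_distribution({"uPS": 3, "cPS": 100, "Agents": 1000}, packages=10)
large = request_distribution({"uPS": 5, "cPS": 800, "Agents": 10000}, packages=10)

for name, env in (("mid-size", mid_size), ("large", large)):
    print(name)
    for tier, (n, share) in env.items():
        print(f"  {tier:7s}{n:>9,d}{share:8.2f}%")
```

Whatever the exact environment size, the share of requests coming from the uPS stays well under 1%, which is the point the tables make.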

As illustrated in Table 1, the uPS request load on the server is almost insignificant. However, the package distribution hierarchy implies that packages must be available on the uPS before they can be downloaded by the cPS and then distributed to the Altiris Agents. The uPS are therefore fighting a losing battle against the cPS and Agents, which keep the server busy with requests that return no codebases.

This situation is bad at the starting point (right after the administrator has applied a policy with the new package to all computers), but things only get worse from there.

The cPS and Agents receive empty codebases because the packages are not ready on the uPS and cPS (namely their respective higher tiers in the hierarchy). An empty codebase response from the Notification Server is not an error, so the agent does not use its back-off mechanism: it simply retries the codebase download (hitting the GetPackageInfo.aspx interface again) after 3 minutes.

So we can extend the tables above to show the potential GetPackageInfo.aspx hit count over a 2-hour period:

Table 3: GetPackageInfo hit count for 10 new packages during 120 minutes in mid-size environment


 
 Tier     Machines   Hits                           Success
 uPS      3          3x10x(120/3) = 1,200           1,200 x 0.3% ≈ 4
 cPS      100        100x10x(120/3) = 40,000        tbd
 Agents   1,000      1,000x10x(120/3) = 400,000     tbd

Table 4: GetPackageInfo hit count for 10 new packages during 120 minutes in large environment


 Tier     Machines   Hits                              Success
 uPS      5          5x10x(120/3) = 2,000              2,000 x 0.05% = 1
 cPS      800        800x10x(120/3) = 320,000          0
 Agents   10,000     10,000x10x(120/3) = 4,000,000     0
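
The hit counts follow the formula used in the Hits column: machines x packages x (window / retry interval). A minimal sketch of that arithmetic, assuming every request comes back empty and is retried at a fixed 3-minute interval (the function name is illustrative only):

```python
def hits_in_window(machines, packages, window_mins=120, retry_mins=3):
    """Potential GetPackageInfo.aspx hits when every request is retried
    at a fixed interval for the whole window (no back-off)."""
    attempts = window_mins // retry_mins  # 120 / 3 = 40 attempts per package
    return machines * packages * attempts


for tier, machines in (("uPS", 3), ("cPS", 100), ("Agents", 1000)):
    print(f"{tier:7s}{hits_in_window(machines, packages=10):>10,d}")
```

Each machine contributes 40 hits per package in two hours, so the Agents tier alone can account for hundreds of thousands of hits in a mid-size environment.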


I am sure you noticed a couple of entries that seemingly don't belong in Table 3: "tbd". These entries are listed as 'to be determined' because another factor has to be accounted for in the process: the back-off mechanism implemented in the Altiris Agent communication executive (AeXNetComm.dll).

The back-off mechanism kicks in when the Altiris Agent receives an error message. This can be a server-busy response inside the XML response (which indicates that IIS itself is not overloaded, as it would be with a 50x error). Once the agent is backing off, no request is made by the SWD Agent (GetPackageInfo, GetPackageSnapshot, etc.) or by other sub-agents.

By default the back-off mechanism starts with a 3-minute retry interval that doubles on every failure (6, 12, 24 and so on, up to the maximum retry interval of 120 minutes). While the agent is backing off, any request made by the agent or its sub-agents receives the <<SERVER_BUSY_AGENT_BACKING_OFF>> error message without any communication taking place on the wire (effectively implementing a blockout for specific interfaces).
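
The doubling schedule described above can be sketched as a small generator. This is an illustration of the described behaviour under the default values (3 minutes initial, 120 minutes maximum), not the actual AeXNetComm.dll implementation; the function name is made up for the example.

```python
from itertools import islice


def backoff_delays(initial_mins=3, max_mins=120):
    """Yield successive retry delays: start at initial_mins, double after
    every failure, and never exceed max_mins."""
    delay = initial_mins
    while True:
        yield delay
        delay = min(delay * 2, max_mins)


schedule = list(islice(backoff_delays(), 8))
print(schedule)  # [3, 6, 12, 24, 48, 96, 120, 120]
```

After six consecutive failures the agent is already waiting the full 120 minutes between attempts.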

So we understand that the GetPackageInfo.aspx interface can be overwhelmed in specific conditions. What really interests us here is how to handle this type of issue (should you ever be in that unpleasant position).

Here's a list of configuration items, on the server or on the clients, that can help avoid or reduce this issue. We will review each of them in detail below:

  • Server configuration items
    • CoreSettings.Config MaxConcurrentPackageInfoRequests
    • Duplication of web applications
    • IIS site security to deny server access
  • Client configuration items
    • Registry: "Package Source Expiry (mins)"
    • Registry: "Retry delay (mins)" & "Maximum Retry delay (mins)"
    • Network blockout
    • Manual replication

Server Settings

CoreSettings.config MaxConcurrentPackageInfoRequests

This core setting controls the maximum number of package info requests the NS will handle at any given time. Any request above this count is returned an error message indicating the server has reached MaxConcurrentPackageInfoRequests; the agent then sets the package download to "Retry" and schedules a download attempt using the standard retry interval.
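
To make the behaviour concrete, here is a toy model of a concurrency cap: requests beyond the limit are rejected immediately instead of being queued. This is only a sketch of the general technique; the class name, the "RETRY" marker and the way the Notification Server actually enforces the setting are assumptions, not taken from the product.

```python
import threading


class PackageInfoGate:
    """Toy model of a MaxConcurrentPackageInfoRequests-style cap
    (hypothetical class, not the Notification Server implementation)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def handle(self, serve):
        # Non-blocking acquire: when all slots are busy the caller is told
        # to retry later rather than being left waiting in a queue.
        if not self._slots.acquire(blocking=False):
            return "RETRY"
        try:
            return serve()
        finally:
            self._slots.release()


gate = PackageInfoGate(max_concurrent=1)
# The inner request arrives while the outer one still holds the only slot.
nested = gate.handle(lambda: gate.handle(lambda: "codebases"))
print(nested)                            # RETRY
print(gate.handle(lambda: "codebases"))  # codebases
```

The design choice to reject rather than queue keeps the server responsive, at the cost of pushing the retry timing back onto the agents.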

Duplication of web application

As illustrated above (especially in Image 1), the biggest challenge with this issue is that the GetPackageInfo.aspx bottleneck is global, so we have very few options to throttle the incoming requests from the Agents or cPS while the uPS are synchronizing. Hence this option, which requires a little extra work: create a couple of IIS web sites using different ports and pointing to the "%InstallDir%/Altiris/NS/Agent" subfolder. In this manner we can point the managed clients (Altiris Agents) to a new site for communications with the NS, and likewise serve the constrained package servers from a distinct IIS site on another port.

With this in place, in case of a GetPackageInfo storm you can throttle incoming requests from the Altiris Agents and cPS using IIS throttling to limit the bandwidth available to a web site or its TCP request counts. Additionally, we could use custom web applications to control the kernel queue depth and further limit the amount of data allowed in from the agents during critical periods.

Important note! I still have to warn the reader that this is very experimental. Although I have run a quick proof of concept to ensure this is feasible, there are a number of details that would need to be verified prior to implementing this option.

Image 2: Adding new sites to cater for specific agent traffic

IIS site security to deny server access

This is a brute-force method, but it has proved quite effective when push comes to shove. Using the IIS security options on the default web site you can allow or deny access to the server based on IP addresses or IP ranges. So if a couple of uPS are struggling to synchronize, you can quickly turn down all other noise (traffic) and ensure the GetPackageInfo requests are fulfilled as soon as possible for the whitelisted computers.

It has proved to work. However, the timing of agent requests is mostly outside the administrator's control, so watching package requests come in every few minutes can be very stressful and, under pressure, can still be seen as a problem. For faster resolution, check "Manual replication" under the client configuration items below.
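
Before applying an allow list in IIS it can help to check which machines the list would actually let through. The sketch below uses Python's standard ipaddress module; the addresses and ranges are made-up examples, and the IIS restriction itself still has to be configured in IIS.

```python
import ipaddress


def is_allowed(client_ip, allow_list):
    """Return True if client_ip matches any single address or CIDR range
    in allow_list."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False) for entry in allow_list
    )


# Hypothetical allow list: one uPS by address plus a subnet holding the others.
UPS_ALLOW_LIST = ["10.0.0.11", "10.0.1.0/24"]

for ip in ("10.0.0.11", "10.0.1.77", "192.168.5.20"):
    print(ip, "allowed" if is_allowed(ip, UPS_ALLOW_LIST) else "denied")
```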

Client configuration items

Registry: "Package Source Expiry (mins)"

This registry value controls how long a package codebase is kept valid by the Altiris Agent [refs]. Once the codebase has expired, the Altiris Agent checks with the Notification Server by hitting GetPackageInfo.aspx. Given that the default value for this registry key is 0x0a (10 minutes), the client will request new codebases any time it needs to verify a package (even if it has nothing to download, as is the case most of the time for package servers).

Additionally, the maximum value allowed for this registry entry is 1 week (10,080 minutes), so there is some margin here to reduce the workload on the GetPackageInfo.aspx interface without taking too many risks (of having the agent use stale or invalid codebases).
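
To get a feel for the margin available, the sketch below computes an upper bound on how often a single package's codebase can expire (and therefore be re-requested) in one week for a given expiry value. It is a rough bound under the assumption that the agent re-requests as soon as the codebase expires; in practice the agent only asks when it needs to verify the package. The function name is illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080, also the maximum allowed expiry


def max_refreshes_per_week(expiry_mins):
    """Upper bound on codebase refreshes per package per week."""
    if not 1 <= expiry_mins <= MINUTES_PER_WEEK:
        raise ValueError("expiry must be between 1 and 10,080 minutes")
    return MINUTES_PER_WEEK // expiry_mins


for expiry in (10, 60, 1440, 10080):
    print(f"{expiry:>6d} mins -> at most {max_refreshes_per_week(expiry):>5,d} refreshes/week")
```

Moving from the 10-minute default to a one-day expiry cuts the upper bound from 1,008 to 7 refreshes per package per week.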

Registry: "Retry delay (mins)" & "Maximum Retry delay (mins)"

Both of these registry entries are used by the package retry and back-off mechanism [7].

  • "Retry delay (mins)" specifies the initial amount of time the agent will wait before resending a request to the NS or to a PS (GetPackageInfo or GetPackageSnapshot). Default value is 0x3 (3 minutes).
  • "Maximum Retry delay (mins)" specifies the maximum amount of time the agent will wait before resending a request to the NS or to a PS (GetPackageInfo or GetPackageSnapshot) when the agent is backing off. Default value is 0x78 (120 minutes).

There is also some room for tuning here, especially for the uPS and cPS, because allowing the uPS to back off could reduce the chances of a uPS synchronizing during a stress period from little to none. The best option for the uPS is to set the retry delay to 1 minute and the retry interval (the "Maximum Retry delay (mins)" value) to 1 minute as well. In this manner the uPS will never go more than a minute without trying to get package codebases from the server.
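
The effect of this tuning can be illustrated by listing when a machine would retry during a 2-hour stress period. The sketch assumes the worst case where every attempt fails and triggers the back-off; the function and parameter names are illustrative.

```python
def attempt_times(retry_delay, max_delay, window_mins=120):
    """Minutes (from the first failure) at which retries happen, assuming
    every attempt fails and the delay doubles up to max_delay."""
    times, now, delay = [], 0, retry_delay
    while now + delay <= window_mins:
        now += delay
        times.append(now)
        delay = min(delay * 2, max_delay)
    return times


default = attempt_times(retry_delay=3, max_delay=120)
tuned = attempt_times(retry_delay=1, max_delay=1)
print(default)     # [3, 9, 21, 45, 93]
print(len(tuned))  # 120
```

With the defaults a backing-off uPS gets only five chances in two hours, whereas the tuned values give it one chance every minute.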

Network blockout

This is part of the Altiris Agent policy [refs]. It can be implemented quickly and easily, even during a stress period. If you are encountering this type of issue, or want to reduce the likelihood of it arising, you can implement the following blockout periods to ensure the uPS and cPS have time to update out of hours:

  • uPS can communicate with the NS at all times
  • cPS can communicate with the NS between 0400 and 2200 (blockout is in effect for 6 hours between 2200 and 0400)
  • Altiris Agents can communicate with the NS between 0600 and 2000 (blockout is in effect for 10 hours between 2000 and 0600)

Please note that the above values are an illustration of what could be implemented for managed machines (clients and workstations) in a large environment.
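
Blockout windows that wrap past midnight are easy to get wrong, so here is a small sketch that checks whether a given role may talk to the NS at a given hour, using the illustrative windows listed above. The role names and the table layout are assumptions for the example; the real blockout is configured through the agent policy.

```python
# Blockout windows as (start_hour, end_hour); None means never blocked.
BLOCKOUTS = {"uPS": None, "cPS": (22, 4), "Agent": (20, 6)}


def in_blockout(hour, start, end):
    """True if hour falls inside [start, end), handling windows that wrap
    past midnight (for example 22 -> 4)."""
    if start == end:
        return False
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end


def can_communicate(role, hour):
    window = BLOCKOUTS[role]
    return window is None or not in_blockout(hour, *window)


for hour in (3, 5, 12, 21, 23):
    allowed = [role for role in BLOCKOUTS if can_communicate(role, hour)]
    print(f"{hour:02d}00: {', '.join(allowed)}")
```

Between 0400 and 0600, for instance, the uPS and cPS can reach the NS while the Agents are still blocked, which gives the package servers a head start.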

Manual replication

This is the ultimate option and the most efficient one, if implemented properly.

Package servers are not ready because they are failing to download valid codebases for the packages (and, subsequently, the package snapshot information needed to verify that the cached packages are compliant with the current versions on the server). A quick workaround for slow (or failing) package synchronization is to replicate the package server's package delivery directory structure with only the package.xml files inside.

Replicating the Package Delivery structure can be done by copying the package delivery folder to a new location and removing everything but the package.xml files. Once the directory structure contains only package.xml files, you can compress the package delivery tree for deployment to the uPS. This method saves a lot of time otherwise spent copying package cache folders that are most likely already in sync on the package servers. In all cases, once the PS has some valid codebases to connect to the NS, it will be able to download any missing files based on the current package snapshot.
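
The steps above can be scripted. The sketch below takes a slightly different route to the same result: instead of copying everything and then deleting, it recreates the directory tree and copies only the package.xml files, then zips the result. The function name and the paths in the usage comment are hypothetical, and the skeleton folder is assumed to live outside the package delivery folder.

```python
import os
import shutil


def build_skeleton(package_delivery_dir, skeleton_dir, archive_base):
    """Mirror the directory tree, keep only package.xml files, and zip it.

    Returns the path of the archive that was created.
    """
    for root, _dirs, files in os.walk(package_delivery_dir):
        rel = os.path.relpath(root, package_delivery_dir)
        target = os.path.join(skeleton_dir, rel)
        os.makedirs(target, exist_ok=True)  # keep empty folders in the tree
        for name in files:
            if name.lower() == "package.xml":
                shutil.copy2(os.path.join(root, name), os.path.join(target, name))
    return shutil.make_archive(archive_base, "zip", root_dir=skeleton_dir)


# Example call (hypothetical paths, adjust to your environment):
# build_skeleton(r"D:\PackageDelivery", r"D:\Temp\Skeleton", r"D:\Temp\pkg_skeleton")
```

The resulting archive is small because it contains no package payload, which makes it practical to push to the uPS even while the NS is under load.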

References:
[1] AKB #19106: Altiris Notification Server 6.0 SP3 Release Notes
[2] AKB #18712: Altiris Notification Server 6.0 SP3 Reference
[3] AKB #33385: How to configure constrained and unconstrained Package Server in a site hierarchy
[4] AKB #02888: How the Altiris Agent obtains Package Codebases (download sources)
[5] AKB #20837: Inside Notification Server agent interfaces: GetPackageInfo.aspx
[6] AKB #04295: Package Servers are unable to download new or updated packages
[7] AKB #01716: How often will the Altiris Agent retry communications?

Comments (5)

AlexP's picture

Great article Ludovic.
Thumbs up!

Jason Gallas's picture

We have over 400 package servers (constrained).  This has been a constant problem for us when pushing packages in the past.  I have come to realize that the easiest solution from my perspective is to pre-stage the package on "all package servers" and then release to the rest of the clients once this has been verified.

Is this article specifically targeted to an NS 6 environment?  That is what we currently have in production.

Is the NS 7 environment better at say giving preference to package servers over standard clients in order to optimize package replication to avoid these types of issues?

Ludovic Ferre's picture

BTW, we have pre-staging in place for SWD and SWU packages.

However, we can't control the environment 100% of the time. For example, a couple of months ago we had ~200 Software Update packages update overnight (due to a problem in the end-of-life supplemental PMImport).

And of course that got the NS choking over the codebase delivery. Thankfully we had had the issue before, so the customer quickly switched off IIS communication for non-uPS machines and the problem was solved overnight.

> Is this article specifically targeted to an NS 6 environment?  That is what we currently have in production.
> Is the NS 7 environment better at say giving preference to package servers over standard clients in order to optimize package replication to avoid these types of issues?

I'm not sure how things work in 7 but I suspect the interfaces have not changed. I am pretty sure that there isn't any advanced queuing for the interface, as this would be rather complex to implement (versus pointing PS to a custom interface).

I am currently off-net, on a retreat of some kind. I'll be back real soon, and you sure will hear from me then ;-).

Ludovic FERRÉ
Principal Remote Product Specialist
Symantec

JStrecko's picture

You say we can set the retry interval:

The best option for the uPS is to set the retry delay to 1 minute, and the retry interval to 1 minute as well.
 

I didn't find this registry key. I'm not sure whether it can be created, named
"retry interval" or "retry interval (mins)".

We got this storm issue with the latest Security Updates, and your article is very helpful.

Thank you
 

Ludovic Ferre's picture

Hi J

>you say we can set retry interval
>>The best option for the uPS is to set the retry delay to 1 minute, and the retry interval to 1 minute as well.
>
>I didn't find this regristry key not sure can be created named
> retry interval or retry interval (mins)


I forgot to update the references in that section. The registry keys are detailed in [7], with the details shown here:

The agent continues to try all of these basic tasks. The real questions are when and how often the retries occur.

The agent does a progressive back off which begins with a seed period of time; in all cases, this period of time is continually doubled up until a max interval is reached.

Both the seed intervals and the max intervals are defined in the registry and defaults for all intervals are three minutes.

Here are the registry details:

Request type: Configuration Requests
  Initial value: [...] Altiris Agent\Servers\<Server Name>\Policy Retry Interval (mins)
  Max value:     [...] Altiris Agent\Servers\<Server Name>\Policy Update Interval (mins)

Request type: Basic Inventory
  Initial value: [...] Altiris Agent\Servers\<Server Name>\Basic Inventory Retry Interval (mins)
  Max value:     [...] Altiris Agent\Servers\<Server Name>\Basic Inventory Update Interval (mins)

Request type: Package Delivery
  Initial value: [...] Communications\Package Delivery\Retry delay (mins)
  Max value:     [...] Communications\Package Delivery\Maximum retry delay (mins)

Request type: Agent Transport
  Initial value: [...] Altiris Agent\Servers\<Server Name>\Transport Retry Interval (mins)
  Max value:     [...] Altiris Agent\Servers\<Server Name>\Transport Maximum Retry Interval (mins)

[...] = 'HKey_Local_Machine\Software\Altiris\'

[7] AKB #01716: How often will the Altiris Agent retry communications?

