Today I spent a number of hours on the phone, with remote hands-on access (something similar to WebEx), with a customer of mine, working on the GetPackageInfo.aspx problem that turns their NS environment into a no-package-delivery world from time to time...
This time around everything fell down when the nightly package refresh generated new snapshots for all 1,500+ packages on the NS. With 110+ package servers (3 unconstrained, all others constrained) and ~7,000 clients in the middle of a software upgrade, the package servers quickly reported all their packages as not ready, whilst the clients added to the incoming load.
Cutting a long story short, we looked at ways to speed up the recovery of the Package Server agent on one of the unconstrained package servers (uPS) and found the CodebaseCache to be our worst enemy in that case!
Let me explain. The cache is there to reduce the workload on the database by storing often-used data (codebases, i.e. the locations where a package can be downloaded for given sites) when agents request the information. But in our use case only 1 package server was allowed to talk to the NS (using the IIS security option to limit access to the interface), and the cache caused our GetPackageInfo requests to take 160 to 300 seconds to complete under these conditions (with the CodebaseCache set to its defaults).
Now, being of a patient nature, this would be fine with me (if at least we were guaranteed a rate of 1 package update per 2 or 3 minutes of wait time). However, the Altiris Agent is not that patient: after waiting 120 seconds for a response from the NS, the request is considered a failure, which sends the agent into back-off mode and triggers the "Server busy" messages in the Altiris Agent.
Our Package Server agent was set with the most aggressive settings possible, so package downloads are on constant retry (Retry timeout = 0). Still, AeXNetComms.dll (the component that implements the network communication functions, such as http requests, UNC downloads, network blockout, bandwidth throttling etc.) blocks any http request from leaving the Package Server for a while (3 minutes).
This means that our single package server takes more than 5 minutes before it can try to download the next package codebases, dropping the package server's package refresh rate dramatically.
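To make the "more than 5 minutes" figure concrete, here is a rough back-of-the-envelope sketch in Python. The 120-second timeout, the 3-minute block and the 1,500 package count come from this post; the function names and the idea of one wasted attempt per package are purely illustrative assumptions, not anything measured on the customer's system.

```python
# Rough timing model for one failed GetPackageInfo attempt on the package server.
# The figures are taken from the post; the model itself is a simplification.

AGENT_TIMEOUT_S = 120      # the agent gives up on the NS after 120 seconds
COMMS_BLOCK_S = 3 * 60     # http requests are then blocked for about 3 minutes
PACKAGE_COUNT = 1500       # packages that received a new snapshot overnight


def request_succeeds(response_time_s, timeout_s=AGENT_TIMEOUT_S):
    """True if the NS answers before the agent's timeout expires."""
    return response_time_s <= timeout_s


def failed_attempt_cost_s(timeout_s=AGENT_TIMEOUT_S, block_s=COMMS_BLOCK_S):
    """Seconds lost before the agent can send its next request."""
    return timeout_s + block_s


def hours_lost(package_count, seconds_per_attempt):
    """Hours lost if every package costs one attempt of the given length
    (a hypothetical scenario, used only to show the scale)."""
    return package_count * seconds_per_attempt / 3600


if __name__ == "__main__":
    for response_time in (160, 300):
        print(response_time, "s response succeeds:", request_succeeds(response_time))
    print("cost of one failed attempt:", failed_attempt_cost_s(), "s")
    print("hours lost at one failed attempt per package:",
          hours_lost(PACKAGE_COUNT, failed_attempt_cost_s()))
```

With responses taking 160 to 300 seconds, every request misses the 120-second window, so each attempt costs the timeout plus the block, and that is where the 5 minutes go.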
We looked at the CodebaseCache configuration switches (in CoreSettings.config) and found that we could change the MaxSize. We ran the earlier tests with a cache 50,000 entries deep. Using the Altiris Performance Counters we could see that the CodebaseCache reached 47,000 entries in about an hour, and that during this time 90,000 entries were scavenged (removed to make room for newer entries) from the cache.
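For readers who have not met scavenging before, the toy Python cache below shows the mechanism: once the cache is full, every new entry pushes an older one out, and a counter keeps track of how many were removed. I do not know which eviction policy the CodebaseCache actually uses, so the least-recently-used policy here, the BoundedCache class and the key names are all assumptions made for illustration.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy size-limited cache that evicts the least recently used entry.

    LRU is an assumption; this is not claimed to be the CodebaseCache policy.
    """

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.scavenged = 0  # entries removed to make room for newer ones

    def put(self, key, value):
        if key in self.entries:
            # Refreshing an existing key makes it the most recently used.
            self.entries.move_to_end(key)
        self.entries[key] = value
        while len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # drop the oldest entry
            self.scavenged += 1

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]


# Tiny demonstration: 12 inserts into a cache that holds 5 entries.
cache = BoundedCache(max_size=5)
for i in range(12):
    cache.put(("package", i), "codebase-%d" % i)

print("entries held:", len(cache.entries))
print("entries scavenged:", cache.scavenged)
```

If we assume the cache started empty and entries only leave by being scavenged, the counters above mean about 137,000 entries (47,000 held plus 90,000 scavenged) were written in that hour, or roughly 38 per second.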
So we trebled the cache size (to 150,000 entries), and dear reader, was this a mistake... We restarted the Altiris Service and the package server received a few package codebases in the normal amount of time (anything from 60 to 180 seconds, which worked without issues as we had set the http timeout on the PS to 240 seconds), and then it all went downhill.
A PackageInfo request went totally AWOL, causing tens of thousands of SQL queries to update the codebase cache, which slowly but surely grew past the 50,000 mark up to 90,000 entries, regardless of how useless the information was (the PS didn't want to populate the entire codebase cache; it only wanted its codebase for a single package).
Anyhow, this just highlights that making software to scale from very small to very large is very difficult.
PS: Please excuse the lack of structure in this blog post. I hope I'll get a chance to mend my ways by creating a detailed article on this, if time permits.