The Story of a Very Expensive Filter
One of my customers reported a problem that caused one of their child Notification Servers to run at 100% CPU and consume almost all of its memory (out of 32 GiB available).
I first looked at the timing (it was reported last Friday) and thought this was possibly linked to the PMImport release, as last week we had Patch Tuesday (so we released the PMImport on Wednesday and replicated it to the child server on Thursday evening).
But this was not it. First, the memory ballooning problem happened in 3 different processes: the w3wp pools for the Altiris-NS-Agent and TaskManagement, as well as the AeXSvc itself.
With all three processes running we would see large chunks of memory being released in a clean drop, then climb right back up in a nice curve. This was because the 3 processes were fighting for the scarce memory resources and forcing each other to be scavenged every now and then.
Stopping one of the application pools pegged the memory at ~12 GiB for each of the other two processes, restoring access to the console but not resolving the problem.
In the end we found that this was a recurrence of an issue seen in November (before I was on the account), caused by a "rogue" filter.
The following SQL allowed us to find and clean up the culprit:
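-- Find the filter with the most membership rows (over 1 million)
select top 1 collectionguid, count(*) as members
from collectionmembership
group by collectionguid
having count(*) > 1000000
order by count(*) desc;

-- Remove the membership rows of the rogue filter
delete from collectionmembership
where collectionguid = '<guid found above>';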
This returned 2.9 million entries!
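The cleanup here was run as the single DELETE statement shown above. For reference, a batched variant can be gentler on a busy server, since one statement over millions of rows may hold locks and grow the transaction log for a long time. The following is only a minimal sketch of that idea: it uses an in-memory SQLite table as a stand-in for the real database, and the resourceguid column, the function names, the thresholds and the batch size are all assumptions made for illustration.

```python
import sqlite3


def find_oversized(conn, threshold):
    """Return (collectionguid, members) rows above the threshold, largest first."""
    return conn.execute(
        "SELECT collectionguid, COUNT(*) AS members "
        "FROM collectionmembership "
        "GROUP BY collectionguid "
        "HAVING COUNT(*) > ? "
        "ORDER BY members DESC",
        (threshold,),
    ).fetchall()


def delete_in_batches(conn, guid, batch_size):
    """Delete one filter's membership rows in small, separately committed batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM collectionmembership WHERE rowid IN ("
            "  SELECT rowid FROM collectionmembership"
            "  WHERE collectionguid = ? LIMIT ?)",
            (guid, batch_size),
        )
        conn.commit()  # each batch is its own short transaction
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Stand-in data: one "rogue" filter and one normal one (hypothetical values).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collectionmembership (collectionguid TEXT, resourceguid TEXT)"
)
rows = [("rogue", f"r{i}") for i in range(5000)]
rows += [("normal", f"r{i}") for i in range(40)]
conn.executemany("INSERT INTO collectionmembership VALUES (?, ?)", rows)
conn.commit()

oversized = find_oversized(conn, 1000)
deleted = delete_in_batches(conn, oversized[0][0], 1000)
remaining = conn.execute("SELECT COUNT(*) FROM collectionmembership").fetchone()[0]
print(oversized, deleted, remaining)
```

On SQL Server the same batching idea would typically be written as a WHILE loop around DELETE TOP (n), stopping when @@ROWCOUNT reaches 0. Whether it actually finishes faster than the single statement would need testing on the real database.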
So the deletion took 25 minutes to run, but after restarting the application pools and the Altiris Service everything was back to work. We looked at the audit information on the filter, and the person who last modified it had not, to their recollection, changed anything specific.
However, we saw from the edit view (whilst the deletion was running) that the filter was set to "Query Mode: Query Builder" instead of "Query Mode: none" (the filter is used for patch targeting, so we only need filter inclusions or exclusions).
When the same thing happened today we quickly fixed the issue, but the user again confirmed that he had not done anything wrong.
So I tested this on my server and had the same problem: when the query mode is set to Query Builder and you save the filter as is (without modifying anything), all resources are included in the filter.
This doesn't matter on my test system, as it can cope with a low 25,000 objects in the cache. But in a large environment the 2.9 million items were fully replicated in memory (we use a complete cache for the collection membership cache) in 3 different processes, demanding an awful lot of resources and grinding the server to a halt.
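As a rough check on the figures above, the arithmetic below shows why three processes could not coexist while two could. It assumes each process would settle at the ~12 GiB observed when only two were running, which is a simplification; the constant and function names are made up for this illustration.

```python
GIB_AVAILABLE = 32    # server memory, from the report above
GIB_PER_PROCESS = 12  # assumed plateau per process (observed with two running)


def memory_demand(process_count):
    """Total memory the processes would want, in GiB, under the assumption above."""
    return process_count * GIB_PER_PROCESS


def fits(process_count):
    """True if the combined demand stays within the available memory."""
    return memory_demand(process_count) <= GIB_AVAILABLE


for n in (2, 3):
    print(f"{n} processes: {memory_demand(n)} GiB wanted, fits: {fits(n)}")
```

Under that assumption, three processes want 36 GiB on a 32 GiB server, which is consistent with them repeatedly pushing each other out of memory, while two processes at 24 GiB leave some headroom.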