
Using the Altiris Profiler to Research "Blackholes" (Where Time Disappears Between Events)

Created: 14 Jun 2012 | 4 comments

I have been a big user of the profiler ever since it was added to NS with the SP3 release (let's not try to find out when this was - it makes me feel old to say the least :D).

This tool is really great for finding out what is going on in the code and how long certain operations take.

In this article, however, we will not look at the obvious; rather, we'll try to read between the lines: what can possibly happen when requests or code sections take a long time, but the time is not consumed in SQL or in taking actions? When time seemingly disappears into a black hole!

SMP 7.1 SP2 slow console issues

The server is set up on a virtual machine with an external database, and the customer reported some rather slow access times in the console. It was not possible for them to tie the slowness to a specific part of the console, nor was it possible to reproduce the issue at will. However, it wasn't a question of "feeling" like the console was slow: it was very noticeable live in the console, with 15 to 30 seconds of waiting for an action to take place.

As usual, the profiler was the only way to separate the "feelings" from the facts. And the facts pointed to web-application requests taking between 15 and 45 seconds to return to the caller (i.e. to send an HTTP response back to the browser).

As usual, images are much more interesting than text, so let's look at the trace information and comment on it.

Image 1: AltirisWebApplicationRequest for a Report (Software Bulletin Details)

What we can see here is a duration that is rather long: 46,682 ms, which is 46 seconds. Needless to say, this is rather slow. Now, what is really interesting in the top view is that the request spends roughly 1.2 seconds in SQL (execute non query + execute reader + execute scalar) and 2 seconds in InternalOverhead. So (46.6 - 1.2 - 2) = 43.4 seconds go unaccounted for. This is what I refer to as a black hole.
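The arithmetic behind the diagnosis is plain subtraction; a minimal sketch, using the rounded figures from the trace above:

```python
# Sketch of the "black hole" arithmetic: total request duration minus the
# time the profiler can account for (SQL + internal overhead).
total_s = 46.6      # AltirisWebApplicationRequest duration (46,682 ms)
sql_s = 1.2         # execute non query + execute reader + execute scalar
overhead_s = 2.0    # InternalOverhead

unaccounted_s = total_s - sql_s - overhead_s
print(f"black hole: {unaccounted_s:.1f} s")  # black hole: 43.4 s
```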

Image 2: Scrolling further down the request stack

We have 44.9 seconds spent in the CommitHandler part of SessionStart. The session-start section initializes a session for the current user, so this only happens when the browser first points to the SMP console or when the session is refreshed (which can happen for many different reasons: a timeout since the last activity, changes in a web.config, or changes to some files in the web-application path).

Image 3: Another trace with the same problem

Looking inside the SessionStart execution path, we see that this capture spent 57 seconds in the CommitHandler. We can see that the big drop occurs when we execute the Item.GetItem method for the "Symantec Management Agent for Unix, Linux and Mac". What's interesting again is that the SQL execute reader timing is ridiculously low for the time spent: 3.9 ms for 4 queries.

So, to get to the bottom of this, we need to look at the details of this Item\GetItem code section.

Image 4: Item\GetItem to Item\Load takes 32.5 seconds

In this final trace sample (for this "black hole") we can see that the 32 seconds are not spent executing SQL statements, but rather in the code that runs between the Item.GetItem() and Item.Load() method calls.

Now, without a certain intimacy with the code and the product, it would be rather difficult to tell what is going on in there, so I'll try to shed some light on it.

The SMP server loads solution classes at runtime (.NET) using the System.Activator class provided by Microsoft (part of the Base Class Library). So once we know which item should be loaded, we look up the class in the "Class" table and request that the state XML of the class be loaded into a new instance in the process.
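As an analogy only (in Python rather than .NET, and not the actual SMP code), the activation pattern looks roughly like this: resolve a class by name at runtime, create an instance, then restore its saved state. The `activate` helper and its arguments are hypothetical names for this sketch.

```python
import importlib

def activate(module_name: str, class_name: str, state: dict):
    """Rough Python analogue of Activator.CreateInstance: look the class up
    by name at runtime, instantiate it, then restore its saved state (the
    SMP equivalent would be deserializing the item's state XML)."""
    module = importlib.import_module(module_name)  # ~ assembly load
    cls = getattr(module, class_name)              # ~ type lookup by name
    instance = cls()                               # ~ Activator.CreateInstance
    instance.__dict__.update(state)                # ~ restoring item state
    return instance

# Hypothetical usage with a stdlib class:
item = activate("argparse", "Namespace", {"name": "Symantec Management Agent"})
print(item.name)  # Symantec Management Agent
```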

This is all done transparently and very fast, but in this particular case this is the process that was "broken" and causing the very slow console access.

This process was (partly) broken because the assembly loading and activation takes place for DLLs in the Global Assembly Cache. This means the assemblies are cryptographically signed, _and_ as the impacted servers were not connected to the Internet, the framework attempted to check whether the signing certificates were revoked or still valid using the standard CRL method (so it had to wait for the CRL request to time out before it could load each assembly).
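For reference, the commonly documented .NET-side mitigation for this class of delay is to stop the runtime from generating Authenticode publisher evidence (the step that triggers the CRL lookup at assembly load time) via the application's .config file. Whether this is appropriate for the SMP web application is an assumption on my part; test before deploying:

```xml
<configuration>
  <runtime>
    <!-- Skip Authenticode publisher-evidence generation, avoiding the
         CRL lookup (and its timeout) on servers without Internet access -->
    <generatePublisherEvidence enabled="false"/>
  </runtime>
</configuration>
```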

I'll probably publish something later on this, but for now you have reached the end of the article :D.

Comments

ukanoldai:

And did you find any workaround to bypass the certificate validation, or at least shorten the timeout, if the SMP is not connected to the Internet?


Rich Nahra:

I've always suspected that the .NET 2.0 web services cause bottlenecks because of the overhead required to serialize and de-serialize data. Do all the UI functions use the web services exclusively?

Ludovic Ferre:

Rich, no.

We don't use web-services calls in the AltirisWebApplication requests (they are plain aspx pages).

We do serialize and de-serialize data from the database, but that's a minor toll on overall performance, especially as we don't have the large, costly XML envelopes you'd see in a SOAP web service.

Ludovic FERRÉ
Principal Remote Product Specialist
