
Performance Troubleshooting Guide - Part 2

Updated: 04 Aug 2009 | 1 comment
Rob Wilcox

This is the second of two articles discussing performance troubleshooting techniques ...

Types of Hang
I put hangs into two categories:

  • 100% CPU usage
  • 0% CPU usage

The data to gather for these types of issue is largely the same, as you will see. The next two sections describe the process that I would typically follow to analyse each type of hang. Of course, this isn't something that you can necessarily do yourself for Enterprise Vault code, since you don't have the source code or symbol files needed to work on the data at this level of detail. However, I feel it's important for us all to have an understanding of the science (and some say art) involved.

100% CPU
As the name implies, this is when some operation in the product drives CPU usage on the Enterprise Vault server to 100% (or nearly 100%) for a prolonged period of time. This obviously isn't "good".

I have seen this on a number of occasions, and the net result in most cases is that the machine becomes sluggish, client operations become sluggish, and backlogs start to appear all over the system (MSMQ, indexing, storage and so on). I'm not talking here about a high level of activity at the time that mailbox archiving kicks in, or anything else that can be pinpointed like that. These 100% CPU issues usually occur at random times throughout the day (or night).

The first thing to do is capture some data and begin to work out what might be happening to cause the CPU usage. You might be lucky and already know which operation is causing it; otherwise the data has to be reviewed first.

As before, you should collect performance monitor data and capture some userdumps of the process that you observe (using Task Manager) taking up the high CPU. It is also appropriate to capture a DTRACE of the process, as that might tell us which area of code we need to investigate further. With some analysis by Symantec, this data might then help to work out what is causing the high CPU usage.

This part is an iterative process, and may lead to one of the fun parts of the job: debug logging, and debug modules provided with extra tracing. The userdump might show that we have gone into a particular function, but that function might be hundreds of lines of code and might call other functions that may be of interest to us. The userdump is a point in time, so unless we see a mile-long call stack showing a recursive function call going a bit mad, it might only tell us that we've hit a particular function, not why it's causing high CPU. With additional tracing (written out to DTRACE) we can see more. The process is iterative because we would typically add extra tracing throughout the function itself, but not go off into the functions that it calls (just put a trace line before each call and another after it). Once we've used that module, it might turn out that we do have to go and trace those side functions too.
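To make the "trace before a call and then after a call" idea concrete, here is a minimal Python sketch. It is not Enterprise Vault code and has nothing to do with how DTRACE is implemented; the names `trace`, `trace_lines`, `suspect_function`, `load_items` and `process_items` are all hypothetical, chosen only for illustration.

```python
import time
from contextlib import contextmanager

trace_lines = []  # hypothetical in-memory trace sink (a real module would write to a log)


@contextmanager
def trace(label):
    """Record a line before and after the wrapped call, with elapsed time."""
    trace_lines.append(f"before {label}")
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        trace_lines.append(f"after {label} ({elapsed_ms:.1f} ms)")


def load_items():
    # Hypothetical side function: not traced internally on the first pass.
    return list(range(1000))


def process_items(items):
    # Hypothetical side function: not traced internally on the first pass.
    return sum(i * i for i in items)


def suspect_function():
    # The function the userdump pointed at: trace around each call it makes.
    with trace("load_items"):
        items = load_items()
    with trace("process_items"):
        total = process_items(items)
    return total


total = suspect_function()
for line in trace_lines:
    print(line)
```

If the trace shows that nearly all of the time is spent between the "before" and "after" lines of one call, that side function is the next candidate for extra tracing on the following iteration.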

0% CPU
From my observations, a 0% CPU usage hang is rare in Enterprise Vault. I've not seen one at all in the 3 years I've been working in Engineering Support. I spoke to a colleague who also does debugging and he hasn't either, except in our FSA File System Drivers, where there have been deadlocks, but none in the main, core product. However, I can describe what it is based on experience I previously had supporting Microsoft Exchange Server for Microsoft.

A 0% CPU usage hang, as the name implies, is when the CPU usage for a particular process (or more than one) drops to 0 or close to it when you are expecting something more than that. The end result for a client of the application is that it will appear sluggish, or freeze, much like the 100% CPU usage issue.

The usual cause that I've seen in my travels is that one or more threads within a process are deadlocked. Locks are used throughout programming. I couldn't even begin to describe the many different types, but if you're interested, use your favourite search tool to find information on "Spin Locks", "Synchronization Locks", "Reader/Writer Locks", and many more. A deadlock means that one thread owns some resource but is waiting to get access to another resource which is owned by a second thread, and that second thread is in turn waiting on a resource that the first thread owns. Neither thread can proceed, and a deadlock occurs.
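The situation described above can be reproduced in a few lines. The following is a minimal Python sketch, not taken from any product code: two threads each take one lock and then ask for the other. The names `lock_a`, `lock_b`, `worker` and `outcome` are hypothetical, and the `timeout` on the second acquire is there only so that the demonstration gives up and reports instead of hanging forever.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
rendezvous = threading.Barrier(2)  # keeps the two threads in step
outcome = {}


def worker(name, first, second):
    with first:
        # Wait until both threads hold their first lock.
        rendezvous.wait()
        # A plain second.acquire() would block forever at this point.
        got_second = second.acquire(timeout=0.5)
        outcome[name] = got_second
        if got_second:
            second.release()
        # Keep holding the first lock until the other thread has also given up.
        rendezvous.wait()


t1 = threading.Thread(target=worker, args=("thread-1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("thread-2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()

print(outcome)  # neither thread managed to take its second lock
```

While both threads sit in the timed acquire they consume no CPU, which is exactly the 0% pattern. The usual remedy is for every thread to take the locks in the same order.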

Fortunately, there are a few tools for assisting with this. WinDBG can be used natively with the !locks command or, if you have the SIEEXTPUB extension (contact your friendly Microsoft TAM for it), you can use !critlist.

Here are examples of each:

!locks
CritSec +160bb140 at 160BB140
LockCount 1
RecursionCount 1
OwningThread 147
EntryCount 3
ContentionCount 3
*** Locked

CritSec +113c0590 at 113C0590
LockCount 3
RecursionCount 1
OwningThread 3e
EntryCount 22
ContentionCount 22
*** Locked

The !locks command is a less exact science, at least for me. You are typically looking for high contention counts and high lock counts. The threads that own those critical sections are the ones to investigate.
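When a dump contains many locked critical sections, sorting them can help to decide where to look first. The sketch below is a hypothetical Python helper, not part of WinDBG. It assumes the text layout shown in the sample above (real output may be laid out differently), treats the counts as decimal numbers, and keeps the owning thread as the string the debugger printed.

```python
import re

# Sample copied from the !locks output shown above.
SAMPLE = """\
CritSec +160bb140 at 160BB140
LockCount 1
RecursionCount 1
OwningThread 147
EntryCount 3
ContentionCount 3
*** Locked

CritSec +113c0590 at 113C0590
LockCount 3
RecursionCount 1
OwningThread 3e
EntryCount 22
ContentionCount 22
*** Locked
"""

HEADER = re.compile(r"^CritSec\s+\S+\s+at\s+([0-9A-Fa-f]+)$")
COUNT = re.compile(r"^(LockCount|RecursionCount|EntryCount|ContentionCount)\s+(\d+)$")
OWNER = re.compile(r"^OwningThread\s+([0-9A-Fa-f]+)$")


def parse_locks(text):
    """Return one dict per critical section found in the text."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {"address": header.group(1).lower(), "locked": False}
            sections.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first CritSec line
        count = COUNT.match(line)
        owner = OWNER.match(line)
        if count:
            current[count.group(1)] = int(count.group(2))
        elif owner:
            current["OwningThread"] = owner.group(1)
        elif line == "*** Locked":
            current["locked"] = True
    return sections


def rank_by_contention(sections):
    """Highest ContentionCount first, ties broken by LockCount."""
    return sorted(
        sections,
        key=lambda s: (s.get("ContentionCount", 0), s.get("LockCount", 0)),
        reverse=True,
    )


ranked = rank_by_contention(parse_locks(SAMPLE))
for section in ranked:
    print(section["address"], "owner", section["OwningThread"],
          "contention", section["ContentionCount"])
```

For the sample, the critical section owned by thread 3e comes out on top, so that thread's call stack would be the first one to review.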

!critlist
CritSec at a6a3c4. Owned by thread 16. Deadlocked on CritSec a6a3a4.
Waiting Threads: 15
CritSec at a6a3a4. Owned by thread 15. Deadlocked on CritSec a6a3c4.
Waiting Threads: 16 

You can see that thread 16 owns a critical section that thread 15 is waiting on, but thread 16 is also waiting on a critical section held by thread 15, creating a deadlock situation. Once you have the thread numbers you can review the call stacks for those threads.
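The reasoning used to read that output (who owns what, and who is waiting on whom) can be written down as a small wait-for graph. The Python sketch below is hypothetical and is not how the extension itself works. It assumes the exact line format of the sample above and keeps the thread and critical section identifiers as plain strings.

```python
import re

# Sample copied from the !critlist output shown above.
SAMPLE = """\
CritSec at a6a3c4. Owned by thread 16. Deadlocked on CritSec a6a3a4.
Waiting Threads: 15
CritSec at a6a3a4. Owned by thread 15. Deadlocked on CritSec a6a3c4.
Waiting Threads: 16
"""

ENTRY = re.compile(
    r"CritSec at (\w+)\. Owned by thread (\w+)\."
    r"(?: Deadlocked on CritSec (\w+)\.)?"
)


def build_wait_graph(text):
    """Map each blocked thread to the thread that owns what it is waiting for."""
    owner_of = {}    # critical section -> owning thread
    blocked_on = {}  # thread -> critical section it is stuck on
    for match in ENTRY.finditer(text):
        critsec, owner, wanted = match.groups()
        owner_of[critsec] = owner
        if wanted:
            blocked_on[owner] = wanted
    return {
        thread: owner_of[critsec]
        for thread, critsec in blocked_on.items()
        if critsec in owner_of
    }


def find_cycle(waits_for):
    """Return the threads forming a cycle, or None if there is no cycle."""
    for start in waits_for:
        path = []
        node = start
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            return path[path.index(node):]
    return None


graph = build_wait_graph(SAMPLE)
print(graph)
print(find_cycle(graph))
```

A cycle in this graph is a pure deadlock. A chain that ends at a thread which is not itself waiting on anything is a queue behind one slow owner instead.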

It's also possible that you won't have a pure deadlock like the one in the example above. Instead you'll have more of a chain, with a long list of threads waiting. Analysing another userdump taken a few minutes later might show the thread list increasing, or it might show "some" progress. This is what I refer to as "almost zero" CPU usage. In the case where things are moving on, but very slowly, it's a near-deadlock or something resource intensive. In the distant past I have seen issues with Microsoft Exchange Server and communication through the VSAPI to 3rd party antivirus scanners that fell into this very category.

There are millions of web sites dedicated to troubleshooting performance, and it's quite easy to get lost and spend days looking at them, only to find that there is all sorts of contradictory advice on what to capture, how often, when, why, and what it's used for. Hopefully these references will be useful if you're interested in reading more.

References:
http://www.dumpanalysis.org
http://blogs.msdn.com/tess/default.aspx
http://msdn.microsoft.com/en-us/magazine/cc163760....
http://blogs.msdn.com/ntdebugging/

Comments

Maverik, 05 Aug 2009

This is great thanks Rob.
