How to Determine if AClient is Dead? 

Dec 07, 2011 03:34 AM

When imaging machines over and over in testing, we occasionally find that after a reboot (initiated by an agent power control task) the AClient just stops responding. We are using DS 6.9 SP4. We see:

  1. The AClient process is still memory resident
  2. netstat shows TCP 402 still ESTABLISHED
  3. service does not respond to stop requests
  4. Log file is not written to

So, the AClient is pretty much dead. The process needs to be killed, and the service started.

This is a pain though, as when deploying large numbers of machines the unresponsive agent eats into our success statistics and causes frustration. So... how to see if the AClient is running?

Been puzzling this over last night, and came up with the following script with the ever helpful Darren...

 

'####################################################
'# VBScript to get some AClient process and log
'# data so that we can see if aclient has died...
'#
'# For illustration only. This does not actually
'# work well. Although log file is written to, reported
'# size is not seen to increase. Processor cycles also
'# does not seem to be helpful.
'####################################################

set fso=CreateObject("Scripting.FileSystemObject")

StrLogReg="HKLM\SOFTWARE\Altiris\Client Service\LogFilename"
StrPingReg="HKLM\SOFTWARE\Altiris\Client Service\PingTimeOut"

StrLogFile=fRegValGet(StrLogReg)
StrPingTimeout=fRegValGet(StrPingReg)

'####################################################
'# Leave Now if we can't get the reg values we need
'# or the log file does not exist
'####################################################

if Len(StrLogFile)=0 or Len(StrPingTimeout)=0 or not fso.fileexists(StrLogFile) then wscript.quit

'####################################################
'# Get process data and log file size
'# Loop ten times
'####################################################

For i=1 to 10
WSCript.StdOut.WriteLine vbcrlf & "Time: " & Now()
WSCript.StdOut.WriteLine fGetProcessData("ACLIENT.EXE")
set ofile=fso.getfile(StrLogFile)
intLogSize=ofile.size
WSCript.StdOut.WriteLine "File Size: " & intLogSize

wscript.sleep((StrPingTimeout+1)*60*1000)
next


Function fGetProcessData(StrProc)

strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" _
& strComputer & "\root\cimv2")

Set colProcess = objWMIService.ExecQuery _
("Select * from Win32_Process where name='" & StrProc & "'")

For Each objProcess in colProcess
strList = strList & vbCr & _
"Process: " & objProcess.Caption & vbcrlf & _
"ReadOperationCount: " & objProcess.ReadOperationCount & vbcrlf & _
"ReadTransferCount: " & objProcess.ReadTransferCount & vbcrlf & _
"UserModeTime: " & objProcess.UserModeTime & vbcrlf & _
"KernelModeTime: " & objProcess.KernelModeTime
Next

fGetProcessData=strList
End Function



Function fRegValGet(sRegVal)
Dim wshShell
Set wshShell = CreateObject("WScript.Shell")
On Error Resume Next
fRegValGet = wshShell.RegRead(sRegVal)
On Error Goto 0
End Function

 

This script reads the log file location and the PingTimeOut value from the registry. I've assumed PingTimeOut is the minimum time frame during which the log should see some action. When run with cscript, we can see the stats flow, but they are pretty much static. They don't change, even though we can see that the log is gaining content and Process Monitor shows the AClient regularly querying the registry and writing to the log.

Very frustrating.

Now resorting to performance IO counters....
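
One thing that might be worth trying before abandoning the log-size idea is sketched below. It is only a guess: it assumes the static size comes from cached directory metadata for a file that another process still holds open, which the post does not confirm. It measures the length by opening the file and seeking to the end. The sketch is in Python rather than VBScript purely for illustration. The helper names size_via_handle and log_has_grown are made up, and it assumes the agent does not hold the log with an exclusive lock.

```python
import os
import tempfile


def size_via_handle(path):
    """Return the file length found by opening the file and seeking to its end."""
    with open(path, "rb") as fh:
        # In binary mode, seek() returns the new absolute position.
        return fh.seek(0, os.SEEK_END)


def log_has_grown(previous_size, current_size):
    """True when the later sample is larger than the earlier one."""
    return current_size > previous_size


if __name__ == "__main__":
    # Demonstrated on a throwaway file, not on the real AClient log.
    with tempfile.TemporaryDirectory() as tmp:
        demo_log = os.path.join(tmp, "demo.log")
        with open(demo_log, "wb") as fh:
            fh.write(b"first entry\n")
        before = size_via_handle(demo_log)
        with open(demo_log, "ab") as fh:
            fh.write(b"second entry\n")
        after = size_via_handle(demo_log)
        print("grown:", log_has_grown(before, after))
```

In a real monitor the two samples would be taken one PingTimeOut interval apart, as in the VBScript loop above.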

 

Comments

Jan 10, 2012 10:29 AM

 

Gibson99 -

We did develop an automatic way of getting the Inactive Computers list from the DS Console into our text file. 

The manual way (if it is a lot of servers) is to highlight the list in the DS Console, do an Export List and import into Excel and copy that back into the servers.txt.  Or you can just type in the hostnames into the servers.txt file.

Here is the automatic way:

1. Install SQLCMD.exe on your DS server.

Download here for SQL2005:  http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=15748
"Microsoft SQL Server 2005 Command Line Query Utility"
X86 Package (SQLServer2005_SQLCMD.msi) - 2528 KB
And
"Microsoft SQL Server Native Client"
X86 Package (sqlncli.msi) - 3511 KB

2. Put this single [modified] line in a BAT file:

Note 1: If you use a port other than the Microsoft default of 1433 for SQL, you must change it in the command.

Note 2: Replace the square brackets and the example values inside them with your own values to make the command work.

Note 3: You can install this anywhere; we put it under Express\Deployment Server\PowerTools.

Note 4: You should be able to run the SQL Select statement in an NS report or through Query Analyzer.

Note 5: (Lots of caveats, huh?) You can run the BAT file via DS or a Scheduled Task.

SQLCMD.EXE -S [DATABASENAME\INSTANCENAME],1433 -d[DS DATABASENAME] -h-1 -E -m 1 -W -Q "set nocount on; set ansi_warnings off; USE [DS DATABASENAME] select C.Name FROM inactive_computers IC JOIN computer C ON IC.computer_id = C.computer_id" -o "c:\Program Files\Altiris\eXpress\Deployment Server\PowerTools\InactiveComputerAUTO\servers.txt"

Another note to make here:  We pushed a PingTimeOut value of 15 (hex) to help the AClient restart itself within 30 minutes if the agent has not checked into the console.   If you don't have this implemented, I would highly suggest it.

http://www.symantec.com/docs/HOWTO10570

We found this to be a good 80/20 rule, but we still have some occasional clients that need a 'tickle'. The AclientReconnect.vbs in my first post, together with the above line to get the Inactive Computers list into the servers.txt file, quickly offers a way to get the rogue agents healthy again. There will always be one-offs, but this resolves 99.5% of our DS agent connection issues. Yes, in a perfect world we wouldn't/shouldn't have to do this, but it is just more motivation to get to the SMA and SMP 7.2!!

Jan 10, 2012 02:37 AM

Hi Jason. I haven't found anything for the DS agent client. It does not appear to be as nicely supported as the Altiris Agent for that kind of thing unfortunately.

Would be delighted if someone pointed out that I am wrong!

Jan 09, 2012 08:39 PM

is it possible to send a command directly to the aclient/dagent and then watch for some sort of response?  perhaps send it something that is known to cause the log file to grow in a way we can see.  then if the log file doesn't grow, we can then kill the service and restart it (or send an email so we can manually verify that it's really dead).  

Jan 09, 2012 06:35 PM

The difference is that NET STOP/START will wait for the action to complete before returning control, whereas SC STOP/START will return immediately even if the action never completes.
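
Because SC returns straight away, a script that uses it has to do its own waiting. The following is a minimal sketch, in Python for illustration only, of polling `sc query` style output until a wanted state appears or a timeout expires. The function names, the timeout values and the injected `query` callable are assumptions, and the sample text only mimics the usual `STATE : <number> <NAME>` line.

```python
import re
import time

# Matches lines such as "        STATE              : 4  RUNNING"
_STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def parse_sc_state(sc_output):
    """Extract the state name (e.g. 'RUNNING') from sc query text, or None."""
    match = _STATE_RE.search(sc_output)
    return match.group(1) if match else None


def wait_for_state(query, wanted, timeout_s=30.0, interval_s=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll query() until it reports `wanted`; return False on timeout.

    `query` is any callable returning sc query style text. In real use it
    would run the command; here it is injected so the logic can be tried
    without touching a service.
    """
    deadline = clock() + timeout_s
    while True:
        if parse_sc_state(query()) == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated outputs: the service is stopping, then stopped.
    fake = iter([
        "SERVICE_NAME: aclient\n        STATE              : 3  STOP_PENDING\n",
        "SERVICE_NAME: aclient\n        STATE              : 1  STOPPED\n",
    ])
    print(wait_for_state(lambda: next(fake), "STOPPED", interval_s=0))
```

A False return after a stop request would match symptom 3 in the original post (the service does not respond to stop requests), at which point a kill of the process is the remaining option.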

Jan 07, 2012 04:41 AM

sc query still says it's running when it's hung (at least in the way I've been looking at).

I think the approach used by Mark is simpler, as when you are working at the console you can just target unresponsive machines for an agent kill and service start. It would have been nice if we could have altered the client context menu on DS to right click and have our own "Kill/Restart Agent" option, huh?

But... not uber useful for us, as we've tried to implement Altiris here in a proactive manner. Mark's solution is great as a stop gap, but it requires that the problem is observed and then resolved. I like to resolve them before they are observed!

The idea behind putting this script on each client is that we can actually monitor, audit and automatically resolve these issues we've been seeing with agent connectivity. I'm trying to get some work done on the DS 7.2 beta, and will return to the guinea pig offer, Jason, when I've put in a few hours there.

Thanks for reading the blog, and for sharing your thoughts...

Jan 04, 2012 11:13 PM

One more thing...
If the service is hung and won't stop, does sc query give you a different result which you could then act on? Perhaps by using pskill to end task on aclient/dagent.exe and then restarting the service after 30 seconds? My worry would be that sc query just tells you that it's started or running, and nothing else and is therefore useless. I dont know, since i've only ever used sc query in an old batch script that would stop a service, then check to make sure it really stopped before waiting, then restarting it, then making sure it really started (it was for the ds 6.9 sp2 pxe config helper since it was so unreliable).

Jan 04, 2012 11:06 PM

Mark - what populates your text file? Do you run the vbs on demand or on a schedule? I'm guessing the former since you mention doing it "in a few clicks".
I like Ian's approach to it. Run a scheduled task, say, every 5-10 mins on each and every machine to keep itself alive automatically so we (admins) don't have to do anything to keep the agent talking to the server. This method is better for me because it truly would be self-maintaining, which is especially important for my clients on the other side of the globe. I don't like being woken up at 2am just because someone in china can't run a simple ds task.
I'm tired of the aclient/dagent just randomly throwing a fit, crossing its arms, stamping its foot and sticking out its bottom lip, refusing to listen to its parents, especially during a system build job. So if someone comes up with a reliable way to make the stubborn child slap itself when it's being rude and start to behave, I'd be thrilled.
Why yes, i did just write this after a 2 hour battle to get my 2 year old daughter to go to bed... why do you ask? ;)
Ian - as usual, i am more than willing to be a guinea pig for you with this project! You know how to reach me.

Jan 04, 2012 11:11 AM

Ian and everyone -

We too struggled for months with lots of Inactive Computers in our DS Console and no central way to 'mass tickle' those computers before we found PowerTools' PSKILL.  You can always login and restart the agent on each server, but that is a pain.

We installed PowerTools on one server and copied the folder to all the other DS servers (the program doesn't really install, it just uses the EXEs to run). We saved all the files in the \eXpress\Deployment Server\PowerTools folder.

We put the below code in a file called "AclientReconnect.vbs" in the same folder as PSKILL.exe and a text file of the inactive Computers called "servers.txt".  We found that just the computer name is necessary, no domain information.

The code below cycles through the list of computers in the servers.txt file, PSKILLs Aclient.exe and Aclntusr.exe on each one, starts the aclient service again and closes the Command Prompt box.

Option Explicit

Const conForReading = 1

'Declare variables
Dim objFSO, objReadFile, objShell
Dim servers(10000)
Dim newServers(10000)
Dim newServers1(10000)
Dim start(10000)
Dim i, j
Dim contents, temp, ping

Set objShell = CreateObject("WScript.Shell")
i = 0

'Short pause between killing the processes and starting the service
ping = "ping -n 5 127.0.0.1 >nul"

'Set Objects
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objReadFile = objFSO.OpenTextFile("servers.txt", conForReading, False)

'Read file contents, skipping blank lines
Do Until objReadFile.AtEndOfStream
contents = Trim(objReadFile.ReadLine())
If Len(contents) > 0 Then
servers(i) = contents
newServers(i) = "pskill -t \\" & servers(i) & " aclient.exe"
newServers1(i) = "pskill -t \\" & servers(i) & " aclntusr.exe"
start(i) = "sc \\" & servers(i) & " start aclient"
i = i + 1
End If
Loop

'Close file
objReadFile.Close

'Kill both agent processes on each computer, pause, then start the service.
'A For loop is used so that an empty servers.txt runs nothing at all.
For j = 0 To i - 1
temp = "%comspec% /c c: & " & newServers(j) & " & " & newServers1(j) & " & " & ping & " & " & start(j)
objShell.Run temp
WScript.Sleep 9000
Next

'Cleanup objects
Set objReadFile = Nothing
Set objFSO = Nothing
Set objShell = Nothing

'Quit script
WScript.Quit

To keep the Command Prompt box up and in 'debug' mode, change the "temp=" line value /c to /k

We can now mass tickle centrally from our DS server and within seconds, wake up all the inactive computers.

Hope this helps someone else with the same problems. It has made our life MUCH easier being able to resolve everyone's issues with a couple of clicks.

Dec 21, 2011 06:48 AM

Hi Andy,

In our case we can't stop the service (as the client is really, really dead). I have a script now which monitors IO and should the agent die we can then kill the process and restart the aclient with a net start. Didn't actually know you could use sc for the same thing -good tip.

What I currently have is a monitor script which executes as part of a scheduled task on machine startup. If it sees no IO (through the perfmon counters) it sends me an email to say "Oi, agent dead". In a few weeks, if I'm confident that the agents are dead when reported in this way, I'll update the script to kill the process and restart the service.

A work in process.....
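
For anyone wanting to try the same idea, the decision step of such a monitor could look roughly like the sketch below. It is written in Python purely for illustration and is not the script described above. It assumes each sample is a snapshot of cumulative I/O counts for the agent process, keyed by the Win32_Process property names ReadOperationCount and WriteOperationCount. The rule of three consecutive idle intervals is an arbitrary choice.

```python
def io_total(sample):
    """Sum the cumulative read and write operation counts in one sample."""
    return sample["ReadOperationCount"] + sample["WriteOperationCount"]


def idle_intervals(samples):
    """Count the trailing run of intervals in which the I/O total did not move."""
    idle = 0
    for earlier, later in zip(samples, samples[1:]):
        if io_total(later) == io_total(earlier):
            idle += 1
        else:
            # Any change (including a counter reset after a restart) ends the run.
            idle = 0
    return idle


def agent_looks_dead(samples, required_idle=3):
    """True when the last `required_idle` intervals showed no I/O at all."""
    return idle_intervals(samples) >= required_idle


if __name__ == "__main__":
    history = [
        {"ReadOperationCount": 100, "WriteOperationCount": 40},
        {"ReadOperationCount": 120, "WriteOperationCount": 45},
        {"ReadOperationCount": 120, "WriteOperationCount": 45},
        {"ReadOperationCount": 120, "WriteOperationCount": 45},
        {"ReadOperationCount": 120, "WriteOperationCount": 45},
    ]
    print("dead:", agent_looks_dead(history))
```

Requiring several idle intervals in a row, rather than one, is meant to avoid false alarms from a healthy agent that simply had nothing to do for a while.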

Dec 21, 2011 04:40 AM

Add script task > Run on DS (from memory):

 

  sc \\%COMPUTERNAME% stop aclient

REM Wait 15 seconds
ping -n 15 127.0.0.1

sc \\%COMPUTERNAME% start aclient

ping -n 15 127.0.0.1   

We use the AClient over MPLS links and I suspect a firewall somewhere cuts off the agent TCP connection after a period of inactivity.

Dec 14, 2011 09:39 AM

Progress slow. Tried resorting to Sysinternals' Process Monitor (procmon) with the following script:

cd /d "%~dp0"
del agent.pml
del agent.csv

start procmon.exe /quiet /backingfile agent.pml /loadconfig DeploymentAgent.pmc

ping -n 60 127.0.0.1

start /wait Procmon.exe /quiet /terminate
procmon.exe /quiet /openlog agent.pml /saveas agent.csv 

This starts procmon using the DeploymentAgent.pmc configuration (which drops filtered events, and looks for dagent.exe and aclient.exe processes) and terminates after a minute. It then processes the log file into a CSV so I can parse the output.

This looked quite promising: run procmon in batch mode and collect filtered logs. At the end of the collection period, simply process into CSV and then scan for entries.
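
The "scan for entries" step could be sketched as follows, in Python for illustration only. It assumes the CSV keeps Process Monitor's default "Process Name" column header. The function name and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Process names to look for, compared in lower case.
AGENT_PROCESSES = frozenset({"aclient.exe", "dagent.exe"})


def count_agent_events(csv_text, processes=AGENT_PROCESSES):
    """Count the rows logged for each agent process in procmon CSV text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("Process Name") or "").lower()
        if name in processes:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = (
        '"Time of Day","Process Name","PID","Operation","Path","Result","Detail"\n'
        '"10:00:01 AM","AClient.exe","1234","WriteFile","aclient.log","SUCCESS",""\n'
        '"10:00:02 AM","explorer.exe","400","RegQueryValue","HKLM","SUCCESS",""\n'
    )
    counts = count_agent_events(sample)
    # Zero events for the agent over the capture window would suggest it is dead.
    print(dict(counts))
```

When reading a real agent.csv, opening it with encoding="utf-8-sig" should cope with a leading byte order mark if the export includes one.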

The only problem is that I am getting corrupted logs, which technically should not happen if procmon is closed nicely with the /terminate switch.

Other people have reported this issue when using the drop filtered events option too so this method isn't going to be a valid route.

It's also annoying that the /quiet switch still allows an error dialog to appear, which means I can't just ignore this datum when it happens.

Off now to try the Win32_PerfFormattedData_PerfProc_Process class:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa394277(v=vs.85).aspx

If that fails, it looks like the only option left is to write my own virtual device driver. Not sure if I'm up to that....
