When imaging machines over and over in testing, we occasionally find that after a reboot (initiated by an agent power control task) the AClient just stops responding. We are using DS 6.9 SP4 here.
So, the AClient is pretty much dead. The process needs to be killed, and the service started.
This is a pain though, as when deploying large numbers of machines the unresponsive agent eats into our success statistics and causes frustration. So... how to see if the AClient is running?
Been puzzling this over last night, and came up with the following script with the ever helpful Darren...
'####################################################
'# VBScript to get some AClient process and log
'# data so that we can see if aclient has died...
'#
'# For illustration only. This does not actually
'# work well. Although log file is written to, reported
'# size is not seen to increase. Processor cycles also
'# does not seem to be helpful.
'####################################################

set fso=CreateObject("Scripting.FileSystemObject")

StrLogReg="HKLM\SOFTWARE\Altiris\Client Service\LogFilename"
StrPingReg="HKLM\SOFTWARE\Altiris\Client Service\PingTimeOut"

StrLogFile=fRegValGet(StrLogReg)
StrPingTimeout=fRegValGet(StrPingReg)

'####################################################
'# Leave Now if we can't get the reg values we need
'# or the log file does not exist
'####################################################
if Len(StrLogFile)=0 or Len(StrPingTimeout)=0 or not fso.fileexists(StrLogFile) then wscript.quit

'####################################################
'# Get process data and log file size
'# Loop ten times
'####################################################
For i=1 to 10
    WSCript.StdOut.WriteLine vbcrlf & "Time: " & Now()
    WSCript.StdOut.WriteLine fGetProcessData("ACLIENT.EXE")
    set ofile=fso.getfile(StrLogFile)
    intLogSize=ofile.size
    WSCript.StdOut.WriteLine "File Size: " & intLogSize
    wscript.sleep((StrPingTimeout+1)*60*1000)
next

Function fGetProcessData(StrProc)
    strComputer = "."
    Set objWMIService = GetObject("winmgmts:" _
        & "{impersonationLevel=impersonate}!\\" _
        & strComputer & "\root\cimv2")
    Set colProcess = objWMIService.ExecQuery _
        ("Select * from Win32_Process where name='" & StrProc & "'")
    For Each objProcess in colProcess
        strList = strList & vbCr & _
            "Process: " & objProcess.Caption & vbcrlf & _
            "ReadOperationCount: " & objProcess.ReadOperationCount & vbcrlf & _
            "ReadTransferCount: " & objProcess.ReadTransferCount & vbcrlf & _
            "UserModeTime: " & objProcess.UserModeTime & vbcrlf & _
            "KernelModeTime: " & objProcess.KernelModeTime
    Next
    fGetProcessData=strList
End Function

Function fRegValGet(sRegVal)
    Dim wshShell
    Set wshShell = CreateObject("WScript.Shell")
    On Error Resume Next
    fRegValGet = wshShell.RegRead(sRegVal)
    On Error Goto 0
End Function
This script locates the log file from the registry, along with the PingTimeOut, which I've assumed is the minimum time frame during which the log should see some action. When run with cscript, we can see the stats flow, but they are pretty much static: they don't change, even though we can see that the log is gaining content and Process Monitor sees the AClient regularly querying the registry and writing to the log.
Very frustrating.
Now resorting to performance IO counters....
Gibson99 -
We did develop an automatic way of getting the Inactive Computers list from the DS Console into our text file.
The manual way (if it is a lot of servers) is to highlight the list in the DS Console, do an Export List and import into Excel and copy that back into the servers.txt. Or you can just type in the hostnames into the servers.txt file.
Here is the automatic way:
1. Install SQLCMD.exe on your DS server.
Download here for SQL2005: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=15748 "Microsoft SQL Server 2005 Command Line Query Utility" X86 Package (SQLServer2005_SQLCMD.msi) - 2528 KB And "Microsoft SQL Server Native Client" X86 Package (sqlncli.msi) - 3511 KB
2. Put this single [modified] line in a BAT file:
Note 1: If you use a port other than the Microsoft default of 1433 for SQL, you must change it in the command.
Note 2: Please replace square brackets and examples to make the code work.
Note 3: You can install this anywhere; we put it under Express\Deployment Server\PowerTools
Note 4: You should be able to run the SQL Select statement in an NS report or thru Query Analyzer.
Note 5: (lots of caveats huh?) you can run the BAT file via DS or Scheduled Task.
SQLCMD.EXE -S [SERVERNAME\INSTANCENAME],1433 -d[DS DATABASENAME] -h-1 -E -m 1 -W -Q "set nocount on; set ansi_warnings off; USE [DS DATABASENAME] select C.Name FROM inactive_computers IC JOIN computer C ON IC.computer_id = C.computer_id" -o "c:\Program Files\Altiris\eXpress\Deployment Server\PowerTools\InactiveComputerAUTO\servers.txt"
Another note to make here: We pushed a PingTimeOut value of 15 hex (21 decimal) to help the AClient restart itself within 30 minutes if the agent has not checked into the console. If you don't have this implemented, I would highly suggest it.
http://www.symantec.com/docs/HOWTO10570
We found this to be a good 80/20 rule, but we still have some occasional clients that need a 'tickle'. The AclientReconnect.vbs in my first post and the above line to get the Inactive Computers list into the servers.txt file quickly offer a solution to get the rogue agents healthy again. There will always be one-offs, but this resolves 99.5% of our DS agent connection issues. Yes, in a perfect world, we wouldn't/shouldn't have to do this, but it is just more motivation to get to the SMA and SMP 7.2!
Hi Jason. I haven't found anything for the DS agent client. It does not appear to be as nicely supported as the Altiris Agent for that kind of thing unfortunately.
Would be delighted if someone pointed out that I am wrong!
Is it possible to send a command directly to the aclient/dagent and then watch for some sort of response? Perhaps send it something that is known to cause the log file to grow in a way we can see. Then if the log file doesn't grow, we can kill the service and restart it (or send an email so we can manually verify that it's really dead).
The difference is that NET STOP/START will wait for the action to complete before returning control, whereas SC STOP/START will return immediately even if the action never completes.
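Because SC hands control back straight away, a script that wants NET-style blocking has to poll the service state itself. The following is only a rough Python sketch of that idea, not something taken from this thread: `parse_state` and `wait_for_state` are made-up helper names, and the regular expression assumes the usual English `sc query` layout with a `STATE : 4  RUNNING` line.

```python
import re
import time

# Assumption: English-language "sc query" output containing a line such as
#         STATE              : 4  RUNNING
STATE_RE = re.compile(r"STATE\s*:\s*\d+\s+(\w+)")


def parse_state(sc_output):
    """Return the state word (e.g. 'RUNNING') from sc query text, or None."""
    match = STATE_RE.search(sc_output)
    return match.group(1) if match else None


def wait_for_state(query, wanted, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll query() until it reports `wanted`, or give up after `timeout` seconds.

    `query` is any callable returning the text of an sc query call, which
    keeps the polling logic separate from how the command is actually run.
    """
    waited = 0.0
    while True:
        if parse_state(query()) == wanted:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

On a real machine `query` could wrap something like `subprocess.run(["sc", r"\\HOSTNAME", "query", "aclient"], capture_output=True, text=True).stdout`, where HOSTNAME is a placeholder. This only confirms that a stop or start has completed; as noted in this thread, a hung agent can still report RUNNING, so it does not detect the hang itself.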
sc query still says it's running when it's hung (at least in the way I've been looking at).
I think the approach used by Mark is simpler, as when you are working at the console you can just target unresponsive machines for an agent kill and service start. It would have been nice if we could have altered the client context menu on DS to right click and have our own 'Kill/Restart Agent' option, huh?
But... not uber useful for us, as we've tried to implement Altiris here in a proactive manner. Mark's solution is great as a stop gap, but it requires that the problem is observed, and then resolved. I like to resolve them before they are observed!
The idea behind putting this script on each client is that we can actually monitor, audit and automatically resolve these issues we've been seeing with agent connectivity. I'm trying to get some work done on the DS 7.2 beta, and will return to the guinea pig offer, Jason, when I've put in a few hours there.
Thanks for reading the blog, and for sharing your thoughts...
Ian and everyone -
We too struggled for months with lots of Inactive Computers in our DS Console and no central way to 'mass tickle' those computers before we found PowerTools' PSKILL. You can always log in and restart the agent on each server, but that is a pain.
We installed PowerTools on one server and copied the folder to all the other DS servers (the program doesn't really install, it just uses the EXEs to run). We saved all the files in the \eXpress\Deployment Server\PowerTools folder.
We put the below code in a file called "AclientReconnect.vbs" in the same folder as PSKILL.exe and a text file of the inactive computers called "servers.txt". We found that just the computer name is necessary, no domain information.
The code below cycles thru the list of computers in the servers.txt file, PSKILLs the Aclient.exe and the Aclntusr.exe, restarts the aclient service and closes the Command Prompt box.
Option Explicit

Const conForReading = 1

'Declare variables
Dim objFSO, objReadFile, objShell
Dim servers(10000)
Dim newServers(10000)
Dim newServers1(10000)
Dim start(10000)
Dim i, j
Dim contents, temp, ping

Set objShell = CreateObject("WScript.Shell")
j = 0
i = 0
'Short pause between the kill and the service start
ping = "ping -n 5 127.0.0.1 >nul"

'Set objects
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objReadFile = objFSO.OpenTextFile("servers.txt", conForReading, False)

'Read file contents and build the commands for each computer
Do Until objReadFile.AtEndOfStream
    contents = Trim(objReadFile.ReadLine())
    If Len(contents) > 0 Then 'Skip blank lines
        servers(i) = contents
        newServers(i) = "pskill -t \\" + servers(i) + " aclient.exe"
        newServers1(i) = "pskill -t \\" + servers(i) + " aclntusr.exe"
        start(i) = "sc \\" + servers(i) + " start aclient"
        i = i + 1
    End If
Loop

'Close file
objReadFile.Close

'Kill both processes, pause, then start the service on each computer
Do While j < i
    temp = "%comspec% /c c: & " + newServers(j) + " & " + newServers1(j) + " & " + ping + " & " + start(j)
    objShell.Run temp
    j = j + 1
    WScript.Sleep 9000
Loop

'Cleanup objects
Set objReadFile = Nothing
Set objFSO = Nothing
Set objShell = Nothing

'Quit script
WScript.Quit
To keep the Command Prompt box up and in 'debug' mode, change the "temp=" line value /c to /k
We can now mass tickle centrally from our DS server and within seconds, wake up all the inactive computers.
Hope this helps someone else with the same problems. It has made our life MUCH easier being able to resolve everyone's issues with a couple of clicks.
Hi Andy,
In our case we can't stop the service (as the client is really, really dead). I have a script now which monitors IO, and should the agent die we can then kill the process and restart the aclient with a net start. Didn't actually know you could use sc for the same thing, good tip.
What I currently have is a monitor script which executes as part of a scheduled task on machine startup. If it sees no IO (through the perfmon counters) it sends me an email to say "Oi, agent dead". In a few weeks, if I'm confident that the agents are dead when reported in this way, I'll update the script to kill the process and restart the service.
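The decision logic described here is simple enough to sketch. The following Python fragment is an illustration only, not the monitor script itself (which is not shown in this thread): `agent_looks_dead` and `next_action` are hypothetical names, and `io_samples` is assumed to be a list of cumulative IO operation counts collected from the perfmon counters at regular intervals.

```python
def agent_looks_dead(io_samples, min_samples=3):
    """True when the cumulative IO counter did not move across the window.

    With fewer than `min_samples` readings there is not enough evidence,
    so the agent is given the benefit of the doubt.
    """
    if len(io_samples) < min_samples:
        return False
    return max(io_samples) == min(io_samples)


def next_action(io_samples, auto_fix=False):
    """Map the samples to an action: 'ok', 'email' or 'restart'.

    auto_fix=False mirrors the cautious first phase (only send an email);
    auto_fix=True is the later phase (kill the process, restart the service).
    """
    if not agent_looks_dead(io_samples):
        return "ok"
    return "restart" if auto_fix else "email"
```

Keeping the decision separate from the action means the same check can start life as an email alert and be switched to an automatic kill and restart once the reports have proven trustworthy.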
A work in progress...
Add script task > Run on DS (from memory):
sc \\%COMPUTERNAME% stop aclient
REM Wait 15 seconds
ping -n 15 127.0.0.1
sc \\%COMPUTERNAME% start aclient
ping -n 15 127.0.0.1
We use the AClient over MPLS links and I suspect a firewall somewhere cuts off the agent TCP connection after a period of inactivity.
Progress slow. Tried resorting to Sysinternals' Process Monitor (procmon) with the following script,
cd %~dp0
del agent.pml
del agent.csv
start procmon.exe /quiet /backingfile agent.pml /loadconfig DeploymentAgent.pmc
ping -n 60 127.0.0.1
start /wait Procmon.exe /quiet /terminate
procmon.exe /quiet /openlog agent.pml /saveas agent.csv
This starts procmon using the DeploymentAgent.pmc configuration (which drops filtered events, and looks for dagent.exe and aclient.exe processes) and terminates after a minute. It then processes the log file into a CSV so I can parse the output.
This looked quite promising: run procmon in batch mode and collect filtered logs. At the end of the collection period, simply process into CSV and then scan for entries.
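The "scan for entries" step could look something like the Python sketch below. It is a guess at the approach rather than anything actually used in this thread: `count_events` and `agent_active` are hypothetical names, and the "Process Name" column header is assumed from procmon's default CSV export.

```python
import csv
from collections import Counter


def count_events(lines, column="Process Name"):
    """Count events per process (lower-cased) in procmon CSV text.

    `lines` is any iterable of CSV lines, for example an open file.
    Assumption: the export has a header row containing `column`.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        name = (row.get(column) or "").lower()
        if name:
            counts[name] += 1
    return counts


def agent_active(counts, agents=("aclient.exe", "dagent.exe")):
    """True if any of the agent processes logged at least one event."""
    return any(counts.get(agent, 0) > 0 for agent in agents)
```

To run it against the real capture, something like `with open("agent.csv", newline="", encoding="utf-8-sig") as f: counts = count_events(f)` should do; the `utf-8-sig` encoding is an assumption that tolerates a byte order mark at the start of the file.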
Only problem is that I am getting corrupted logs, which should technically not happen if procmon is closed nicely with the /terminate switch.
Other people have reported this issue when using the drop filtered events option too so this method isn't going to be a valid route.
It's also annoying that the /quiet switch still allows an error dialog to appear, which means I can't just ignore this datum when it happens.
Off now to try the Win32_PerfFormattedData_PerfProc_Process class, http://msdn.microsoft.com/en-us/library/windows/desktop/aa394277(v=vs.85).aspx
If that fails, it looks like the only option left is to write my own virtual device driver. Not sure if I'm up to that....