{CWoC 2010} aila 0.5.0 Samples: Using the Query Shell, Part II
So, we are reaching the final part of the aila 0.5.0 documentation. Using the query shell part II.
To start this second round of detailed look at the query-shell we'll use a real life scenario here on Symantec Connect: the xmlbasehttpcallback problems [https://www-secure.symantec.com/connect/forums/seeing-lot-basexmlhttpcallback-exceptions].
In short, the problem is that some Altiris Agent are sending empty xml string to the Task Server in infinite loop.
Here's what we can see from the IIS log files, and how we can drill into the root cause to find the affected clients by IP address or ResourceGuid.
ludovic@smallisbeautiful:~/PROD_003/SW-SRC/altiris-ns-tooling/aila/payload$ aila -f iis_6_logs/ex110418.log -qs -n0 Current line #: 1180000 Program read 206128068 bytes from 1183545 lines LOG FILE ANALYSIS:: SUMMARY Mime type results: << data truncated here >> Altiris Agent requests results: << data truncated here >> IIS Web-applications results: Webapp= /Altiris/NS/Agent/, dir hits = 129775 Webapp= /Altiris/NS/ , dir hits = 2007 Webapp= /Altiris/IRA[1]/ , dir hits = 4410 Webapp= /Altiris/Packages/, dir hits = 19743 Webapp= /Altiris/CTS[4]/ , dir hits = 1008779 Webapp= /Altiris/TaskMgmt/, dir hits = 13882 Webapp= /Altiris/ , dir hits = 1942 Webapp= Other , dir hits = 2999
We can see on the web-app statistics that the ClientTaskServer is causing most of the workload, with a whooping 1 million hits (for 1.1M requests on the log file).
At this rate we are almost assured that the detailed statistics will show us the top hitters ip address and computer guid:
LOG FILE ANALYSIS:: DETAILED
IIS status code results:
IIS Status code= Success (1xx,2xx), hits count = 1180327
IIS Status code= Redirected (3xx), hits count = 162
IIS Status code= Client error (4xx), hits count = 1041
IIS Status code= Server error (5xx), hits count = 2007
IIS sub-status code results:
Sub Status code= 0, hits count = 1182562
<< data truncated here >>
IIS Win32 status code results:
Win32 Status code= Win32 Success , hits count = 1174061
Win32 Status code= Win32 Failure , hits count = 9476
24 hour hit counters:
Hits counted during hour 7 to 8 was 31947
Hits counted during hour 8 to 9 was 106662
Hits counted during hour 9 to 10 was 126182
Hits counted during hour 10 to 11 was 129480
Hits counted during hour 11 to 12 was 144856
<< data truncated here >>
Ip cache top 20 entries:
14901, 192.168.10.51
12661, 169.254.3421
10866, 192.168.20.58
10451, 192.168.2.134
10233, 192.168.72.154
<< data truncated here >>
Guid cache top 20 entries:
14762, 965c234c-9e45-4d19-8f83-bb9970f401d5
10845, 62bd92f5-b1bb-41eb-899d-90e08ee2dbcb
10350, 6a52a043-1594-41d5-bd17-669533ff6e83
10162, 83cbe1f5-7c2d-4071-976f-fb522f699f23
<< data truncated here >>
But for the purpose of drilling into the tool functionalities and to share some troubleshooting tips we'll go onto the query-shell to verify what we are seeing (confirm a hunch or double checking that what we think is backed by hard evidences).
So we'll drill into the CTS stats to see how bad the basexmlhttpcallback has an impact on the server.
So first let's get the filter help, to see what options are available.
Enter a command (help or ?) or exit to quit:
aila-qs/:filter help
Currently supported commands are:
print: display the current filter specifications
clear: reset the filter to default value (c-ip, no string)
not: switch (false -> true, true -> false) the exclude entries matching
the current filter from exec commands
not <field name> <value>: set an exclude filter explicitly
help <field name>: to get help on the filter field specified, for exam-
ple use 'filter help uri-webapp' to get the help message for the
uri-webapp filter.
<field name> <value>
Global filter:
top n
from n
Standard filter:
c-ip <client_ip>
uri-agent [0..6]
uri-webapp [0..10]
uri-mime [0..3]
guid <guid>
iis-status n
win32-status n
time-taken n
The 2 global filters can be combined with the standard filter, so you
can specify where to start from, how much data to look at and other keys
keys.
Given we're interested in the ClientTaskServer details we can get the syntax using 'filter help uri-webapp' from the shell root:
aila-qs/:filter help uri-webapp
Field name: uri-webapp
Values:
0: Altiris Agent 1: NSCAP
2: NS Console 3: Resource Manager
4: Patch Inventory Rule 5: SWD Packages
6: SWD Portal 7: Client Task Agent
8: Client Task Server 9: Task Management
10: Other Altiris apps 11: Other directories
Sample: 'filter uri-agent 6' to match Software Portal requests only.
The identifier for the CTS web-app is 8, so we can craft the filter and verify it using the following commands:
aila-qs/:filter uri-webapp 8 aila-qs/:filter print uri-webapp 8
Now that our filter is defined as we need, we can stat the store with the filter applied, so we can see information specific to this web-app only:
aila-qs/:exec stat store Processed 1008779 requests from the store. LOG FILE ANALYSIS:: SUMMARY Mime type results: File type= aspx , page hits= 1008779 Altiris Agent requests results: IIS Web-applications results: Webapp= /Altiris/CTS[4]/ , dir hits = 1008779 [1] IRA is an abbreviation of InventoryRuleManagement/Agent [2] SWD is an abbreviation of SWD/SWDWin32/SWPortal [3] CTA is an abbreviation of ClientTaskAgent [4] CTS is an abbreviation of ClientTaskServer LOG FILE ANALYSIS:: DETAILED << data truncated here >> 24 hour hit counters: Hits counted during hour 7 to 8 was 20263 Hits counted during hour 8 to 9 was 59116 Hits counted during hour 9 to 10 was 116957 Hits counted during hour 10 to 11 was 113101 Hits counted during hour 11 to 12 was 130859 << data truncated here >>
This shows quite a lot of incoming requests from the task agent.
Next then, we can take a look at a sample set of requests. To do this we use the "from" keyword, which as the name indicates allow the exec commands to start at a specific location.
Here we will start at ~10:00, using the start position at request 300,000 (using the 24 hour hit counter data above). Given we'll print the output to the console we also use the "top" keyword to avoid flooding the console.
aila-qs/:
aila-qs/:filter top 50
aila-qs/:filter from 300000
aila-qs/:exec print store
300000, 192.168.17.196, 166, /Altiris/CTS[4]/, (null), aspx, 20110328, 101545, Success (1xx,2xx), 0, 0, 5d1d341b-86fc-4d45-956b-2a466fee0f1f
300001, 192.168.15.102, 166, /Altiris/CTS[4]/, (null), aspx, 20110328, 101545, Success (1xx,2xx), 0, 0, 325b9b88-8ab3-4fb2-a083-c23e8c7427c1
300002, 192.168.18.186, 166, /Altiris/CTS[4]/, (null), aspx, 20110328, 101545, Success (1xx,2xx), 0, 0, d888e779-a0b7-40ac-b265-37d6b43a7934
300007, 192.168.67.68, 165, /Altiris/CTS[4]/, (null), aspx, 20110328, 101545, Success (1xx,2xx), 0, 0, e10f5d20-d8fd-47f2-86a3-8660e7aef929
<< data truncated here >>
300075, 192.168.72.154, 166, /Altiris/CTS[4]/, (null), aspx, 20110328, 101547, Success (1xx,2xx), 0, 0, 5377e478-49ee-40b3-8c42-e744c5873fc3
aila-qs/:quit
Brought to you by {Connect Winter of Code}
Now to be 100% accurate we could filter the input file to ensure it only contains TaskManagement requests. This is done with the Unix tool grep (available by default on cygwin).
But to parse the resulting file with aila we need to make sure the resulting file contains the header (which contains the w3c schema details needed by aila).
So we start creating the file with the 'head' command and then we append to it the matches from grep:
ludovic@smallisbeautiful:~/PROD_003/SW-SRC/altiris-ns-tooling/aila/payload$ head -n 4 iis_6_logs/ex110418.log > cts_only ludovic@smallisbeautiful:~/PROD_003/SW-SRC/altiris-ns-tooling/aila/payload$ grep -i "ClientTaskServer" iis_6_logs/ex110418.log >> cts_only
And off we go again with aila.
ludovic@smallisbeautiful:~/PROD_003/SW-SRC/altiris-ns-tooling/aila/payload$ aila -f cts_only -n0 -qs Current line #: 1010000 Program read 167569311 bytes from 1010226 lines LOG FILE ANALYSIS:: SUMMARY Mime type results: File type= aspx , page hits= 1010222 Altiris Agent requests results: IIS Web-applications results: Webapp= /Altiris/CTS[4]/ , dir hits = 1008779 Webapp= /Altiris/TaskMgmt/, dir hits = 1443 << data truncated here >> Ip cache top 20 entries: 14763, 192.168.10.51 10845, 192.168.20.58 10350, 192.168.2.134 10162, 192.168.14.154 10158, 192.168.72.154 <<data truncated here>> Guid cache top 20 entries: 14762, 965c234c-9e45-4d19-8f83-bb9970f401d5 10845, 62bd92f5-b1bb-41eb-899d-90e08ee2dbcb 10350, 6a52a043-1594-41d5-bd17-669533ff6e83 10162, 83cbe1f5-7c2d-4071-976f-fb522f699f23 10158, 5377e478-49ee-40b3-8c42-e744c5873fc3 <<data truncated here>> Enter a command (help or ?) or exit to quit:
As a sanity check I normally filter out the data I want to work with, to verify if there's anything else (false positive or any other kind of pollution). That we use the same filter as before, with the addition of the 'not' keyword:
Enter a command (help or ?) or exit to quit: aila-qs/:filter not uri-webapp 8 aila-qs/:filter print uri-webapp not 8 aila-qs/:exec stat store Processed 1443 requests from the store. LOG FILE ANALYSIS:: SUMMARY Mime type results: File type= aspx , page hits= 1443 Altiris Agent requests results: IIS Web-applications results: Webapp= /Altiris/TaskMgmt/, dir hits = 1443 [1] IRA is an abbreviation of InventoryRuleManagement/Agent [2] SWD is an abbreviation of SWD/SWDWin32/SWPortal [3] CTA is an abbreviation of ClientTaskAgent [4] CTS is an abbreviation of ClientTaskServer LOG FILE ANALYSIS:: DETAILED IIS status code results: IIS Status code= Success (1xx,2xx), hits count = 1443 IIS sub-status code results: Sub Status code= 0, hits count = 1443 IIS Win32 status code results: Win32 Status code= Win32 Success , hits count = 1426 Win32 Status code= Win32 Failure , hits count = 17 24 hour hit counters: Hits counted during hour 7 to 8 was 119 Hits counted during hour 8 to 9 was 344 << data truncated here >>
This shows 1,443 entries that are not related to the Client Task Server. This should be explained by the grep command use, as it matches any requests with "TaskManagement", which is valid for the TaskManagement webapp.
A quick print out to the console allows us to verify this (so we can safely discard the false positive as a user fault, rather than a tool or server problem):
aila-qs/:filter top 20 aila-qs/:exec print store 669, 169.254.3421, 185, /Altiris/TaskMgmt/, (null), aspx, 20110328, 74421, Success (1xx,2xx), 0, 0, 738, 169.254.3421, 185, /Altiris/TaskMgmt/, (null), aspx, 20110328, 74422, Success (1xx,2xx), 0, 0, << data truncated here >> 1057, 169.254.3421, 185, /Altiris/TaskMgmt/, (null), aspx, 20110328, 74424, Success (1xx,2xx), 0, 0, 1123, 169.254.3421, 185, /Altiris/TaskMgmt/, (null), aspx, 20110328, 74424, Success (1xx,2xx), 0, 0, aila-qs/:quit
We can satisfy ourselves that the errors are so marginal they shouldn't impact the ip or guid cache in any way. So we can go back to the shell and use the "print" commands to list the first 10,50 or 500 ip address or guid, as you find suitable.
Finally, should you need to get all the ip or guid from the aila analysis remember that the cache_dump.csv file is created by default, so you can use that directly too ;).
Should you need more samples or details on the tool and usage, please come back to me here on Connect!