ACCLib is too slow on big systems
Created: 06 Feb 2012 | Updated: 23 Mar 2012 | 1 comment
Early versions of our cluster used our own OeBS monitor scripts. The scripts sometimes did not work properly, which is why the customer wants a solution that is fully supported by the vendor, causes no problems on upgrade, and tracks the resource status correctly.
At the moment we see that the agents supplied by Symantec are not working very well.
The monitoring procedure has two levels. The first monitor level works by scanning the process table, while the second uses the application API. We can't use the second monitor level without going through the first.
The first monitor level of many agents uses the ACCLib framework, which runs the BSD ps to scan the process table. Each OeBS resource spawns its own copy of BSD ps. The BSD ps sequentially reads the address space of all processes in the system (a huge amount of work).
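One way to picture a cheaper first level is to take a single process-table snapshot and let every resource check against it, instead of each resource spawning its own ps. The sketch below only illustrates that idea; it is not something ACCLib or the Symantec agents provide, and the function name `check_from_snapshot` and the snapshot path in the usage note are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: report which patterns appear in a saved ps listing.
# Usage: check_from_snapshot SNAPSHOT_FILE PATTERN [PATTERN...]
check_from_snapshot() {
    snapshot=$1
    shift
    for pattern in "$@"; do
        # Plain grep with output discarded; avoids options that older greps lack.
        if grep "$pattern" "$snapshot" >/dev/null 2>&1; then
            echo "$pattern online"
        else
            echo "$pattern offline"
        fi
    done
}
```

For instance, running `/usr/ucb/ps axwwl > /tmp/ps.snapshot` once per monitor interval and then `check_from_snapshot /tmp/ps.snapshot f60runm ora_rw20_run` would walk the process table once instead of once per resource.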
The cluster environment enabled:
root@mn-nfsap # pgrep -lf /usr/ucb/ps
25994 /usr/ucb/ps axwwl
29610 /usr/ucb/ps axwwl
28455 /usr/ucb/ps axwwl
26538 /usr/ucb/ps axwwl
26971 /usr/ucb/ps axwwl
26500 /usr/ucb/ps axwwl
28624 /usr/ucb/ps axwwl
26382 /usr/ucb/ps axwwl
27159 /usr/ucb/ps axwwl
27698 /usr/ucb/ps axwwl
27625 /usr/ucb/ps axwwl
25249 /usr/ucb/ps axwwl
25268 /usr/ucb/ps axwwl
root@mn-nfsap # uptime
10:26am up 93 day(s), 13:51, 3 users, load average: 34.33, 32.44, 31.25
The cluster environment disabled:
root@mn-nfsap # hastop -local -force
root@mn-nfsap # pgrep -lf /usr/ucb/ps
root@mn-nfsap # uptime
10:39am up 93 day(s), 14:04, 4 users, load average: 13.82, 17.93, 24.42
As you can see, the monitoring has a large overhead.
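To put a rough number on that overhead, the two 1-minute load averages can be compared directly. This is just a small arithmetic helper (the name `la_drop` is made up, and it assumes an awk that accepts `-v`, such as nawk on Solaris); keep in mind that the second reading was taken shortly after the cluster was stopped, so the longer averages had not settled yet.

```shell
#!/bin/sh
# Hypothetical helper: percentage drop between two load-average readings.
# Usage: la_drop BEFORE AFTER
la_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f%%\n", (before - after) / before * 100 }'
}
```

With the readings from the two uptime outputs, `la_drop 34.33 13.82` prints 59.7%.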
If we truss the /usr/ucb/ps command, we see the following (sorted by time):
<..snip..>
27582: 0.0086 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0088 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0088 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0089 pread(5, " r 2 5 r u n\0\0\0\0\0\0".., 2097083, 4057196) = 1152916
27582: 0.0090 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0097 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0103 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0109 pread(5, " o r a _ r w 2 0 _ r u n".., 2097083, 4114128) = 1153328
27582: 0.0151 pread(5, " A P P S / Z G D 6 4 B 5".., 2097076, 1681616) = 1988400
27582: 0.0211 pread(5, " - a x w w l\080\0\0\0\0".., 2097071, 4296047472) = 2097071
27582: 0.0574 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
27582: 0.0574 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
27582: 0.0576 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
27582: 0.0577 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
<..snip..>
27582: 0.2152 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
27582: 0.2285 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
27582: 0.2447 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
27582: 0.2929 pread(5, " f 6 0 r u n m\0\0\0\0\0".., 2097083, 10297584) = 2097083
The preads on the address space file take a significantly long time to complete, sometimes on the order of seconds. Generating one process table may take more than 10 minutes on our system, which has more than 4000 processes.
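To see how much of a single ps run goes into these reads, the per-call times in a saved truss log can be totalled. The helper below is a rough sketch: `sum_pread_time` is a made-up name, and it assumes the log has the same layout as the excerpt above (PID, time in seconds, then the call).

```shell
#!/bin/sh
# Hypothetical helper: count the pread calls in a truss log and add up
# the time column (second field) for those lines only.
# Usage: sum_pread_time TRUSS_LOG
sum_pread_time() {
    awk '$3 ~ /^pread\(/ { n++; total += $2 }
         END { printf "%d preads, %.4f s total\n", n, total }' "$1"
}
```

Dividing the total by the number of processes scanned gives an average cost per process, which can then be compared against the monitor interval.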
Setting MonitorTimeout to 10 minutes is a small workaround, but such a timeout breaks the SLA. Switching Apps between nodes now takes 1 hour. Moreover, this monitor type has a huge impact on system performance. We can't use the MonitorProgram option because the first monitor level is always run before proceeding to the SecondLevel monitoring, so it would just extend the monitor time.
At this point we need a monitor that does not have such a big impact on the system (see the load averages above) and runs faster.
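For anyone wanting to reproduce the MonitorTimeout workaround, it can be applied at the type level with the standard VCS commands. The wrapper below is only a sketch: `set_monitor_timeout` is a made-up name, and the agent type has to be passed in as an argument.

```shell
#!/bin/sh
# Hypothetical wrapper around the VCS CLI.
# Usage: set_monitor_timeout TYPE SECONDS
set_monitor_timeout() {
    # Open the cluster configuration for writing.
    haconf -makerw || return 1
    # Change the type-level monitor timeout (value in seconds).
    hatype -modify "$1" MonitorTimeout "$2" || return 1
    # Save the configuration and make it read-only again.
    haconf -dump -makero
}
```

A 10-minute timeout would be `set_monitor_timeout <type> 600`. This only hides the slow scan from the agent framework; it does not make the scan itself any cheaper.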
Comments
Alexander,
By OeBS, do you mean Oracle eBusiness Suite? If so, what particular component(s) are you looking to monitor and make highly-available?
Or is OeBS something else?
Which version(s) of VCS are you using? Which platform(s)?