
Best Practice - Monitor Solution 7.0/7.1

Created: 26 Feb 2010 • Updated: 20 Sep 2011 | 9 comments
Joseph_Carson's picture

Hello All,

For a long time I have wanted to put together a Monitor Solution Best Practice. I want as many people as possible to contribute, so I thought it best to start a discussion here where everyone can add their part.

I will start off with the following but please everyone feel free to add to this as much as possible.

  • Agent/Plug-in Configuration - On the Data Collection tab, I recommend changing "record metric values every" from 300 to 600.
  • Agent/Plug-in Configuration - On the Data Collection tab, I recommend changing "record process values every" from 300 to 600.
  • Event Console Purging Maintenance - I recommend reducing the maximum alerts from 150k to 50k.
  • By default, all informational alerts are included in the default policies. I recommend removing them from the default policies unless you really need them.
  • I recommend using the Windows Server Baseline Policy as the default policy and then adding further policies as needed. These are available on the Monitor Pack Community Group and will be included by default in a future release.

These are just a few simple changes that I recommend. All of them will eventually be the defaults in the out-of-box install in a future release.

If anyone else has additional recommendations, please feel free to add them here. I really appreciate all the feedback, and hopefully this thread will eventually gather enough valuable information to create a Best Practice document for Monitor Solution.

9 Comments

scott.hall's picture

Change these settings under Settings -> All Settings -> Monitoring and Alerting -> Monitor -> Monitor Server Settings -> Heartbeat tab.

Changed heartbeat detection from:

Every minute, retry every 10 seconds, 3 retries

to:

Every 12 Minutes, retry every 230 seconds, 3 retries

Reason for change: better scaling and fewer false positives caused by network or NS load. These numbers were found to produce the fewest false positives after much trial and error. For reference, we're monitoring right around 1000 servers and no workstations. Heartbeats alone are not a good indication of whether a server is up or down, because a server that is properly shut down will not generate heartbeat alerts. To ensure complete coverage, heartbeat alerts need to be used in conjunction with pings from the RMS (Remote Monitoring Server). This covers you if someone mistakenly shuts down a server.
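The point about combining heartbeat alerts with pings can be sketched as a small decision table. This is only an illustration in Python, not Monitor Solution code: the function name `classify_server` and the status labels are hypothetical.

```python
def classify_server(heartbeat_alert: bool, ping_ok: bool) -> str:
    """Combine two independent signals into one status (hypothetical labels).

    heartbeat_alert: True if a heartbeat alert is currently open for the server.
    ping_ok: True if the most recent ping from the monitoring server succeeded.
    """
    if ping_ok and not heartbeat_alert:
        return "up"
    if ping_ok and heartbeat_alert:
        # The host answers on the network but the plug-in has gone quiet.
        return "agent problem"
    if heartbeat_alert:
        # No ping reply and the plug-in has stopped reporting.
        return "down"
    # No ping reply and no heartbeat alert: the case heartbeats alone miss,
    # for example a server that was shut down cleanly.
    return "unreachable, no heartbeat alert"


for alert, ping in [(False, True), (True, True), (True, False), (False, False)]:
    print(alert, ping, "->", classify_server(alert, ping))
```

The last branch is the one that matters for the argument above: without the ping, that server would look healthy.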
 

+++
Scott Hall
Enterprise Monitoring Engineer
Great American Insurance

Joseph_Carson's picture

Hi Scott,

Many thanks for your helpful suggestion.  I am always looking at ways to improve the functionality of the heartbeat.  

  1. Add Heartbeat Retry/Resend from Agent
  2. Add Ping Rule if Heartbeat fails
  3. Also add the ability to alert when a server is correctly shut down. The current behavior is by design, but we want to make it a configurable option, with the current behavior as the default.

If you have other good suggestions to improve this, please let me know. Also, do you recommend making the above settings the out-of-the-box default?

Everyone please continue to provide as many useful suggestions as possible.

Many Thanks,


Ickram's picture

Hi, Joseph

We have had incidents in the past where a server had a memory leak, which left the Altiris Monitor agent unable to alert about a failing Windows service.

To overcome this issue I created a SQL query that runs on a Microsoft SQL Server. This database server is used to store data sent by our application servers. The application servers send a server-alive heartbeat using Microsoft Message Queue, and the SQL Server has a heartbeat table in our custom database. The SQL query compares the current time on the server against each server's last heartbeat time. If the difference is more than 5 minutes, a critical alert is triggered.
This method allows the monitoring of a collection of servers that depend on each other.
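The stale-heartbeat check described above can be sketched as follows. This is a minimal illustration, not the actual query: it uses Python's built-in `sqlite3` as a stand-in for Microsoft SQL Server, and the table name `server_heartbeat`, its columns, and the server names are assumptions. Only the 5-minute threshold comes from the post.

```python
import sqlite3
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=5)  # threshold taken from the post


def stale_servers(conn, now):
    """Return names of servers whose last heartbeat is older than STALE_AFTER."""
    cutoff = (now - STALE_AFTER).isoformat()
    rows = conn.execute(
        "SELECT server_name FROM server_heartbeat "
        "WHERE last_heartbeat < ? ORDER BY server_name",
        (cutoff,),
    )
    return [name for (name,) in rows]


# Hypothetical heartbeat table; timestamps are stored as ISO-8601 text,
# which sorts correctly as plain strings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE server_heartbeat ("
    "server_name TEXT PRIMARY KEY, last_heartbeat TEXT NOT NULL)"
)
now = datetime(2010, 3, 1, 12, 0, 0)
conn.executemany(
    "INSERT INTO server_heartbeat VALUES (?, ?)",
    [
        ("app01", (now - timedelta(minutes=1)).isoformat()),
        ("app02", (now - timedelta(minutes=7)).isoformat()),
    ],
)
print(stale_servers(conn, now))  # app02 is more than 5 minutes behind
```

Each name returned would correspond to one critical alert; a fresh heartbeat row for that server clears it on the next check.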
 
My suggestion would be:

  1. Create a Server Heartbeat table on the Notification Server.
  2. Enable Server Heartbeat Alive by default, with a polling interval of 180 seconds.
  3. When a server does not respond within 180 seconds, trigger the alert. The critical alert should warn that the server may not be online. The rule should reset to normal when a server heartbeat is received.

Joseph_Carson's picture

Hi Ickram,

Many thanks for the feedback on this. In version 7.0 of Monitor Solution we do have a Monitor Plug-in heartbeat that is sent to the server periodically. The server then checks which plug-ins have reported and which have not, and for those that have not it puts a heartbeat alert into the Event Console.

The problem is that if a network glitch occurs and a single heartbeat fails to reach the server, a heartbeat alert is displayed in the Event Console. I am looking for feedback on whether we should add a retry ability: for example, if one heartbeat fails, either try to ping the resource or have the Monitor Plug-in try a second heartbeat. That could mean sending a Warning on 1 heartbeat failure and a Critical on 2 heartbeat failures. Something along these lines, combining configurable severity with retry ability.
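The Warning-then-Critical idea can be sketched as a counter of consecutive missed heartbeats. This is a hypothetical illustration in Python, not an existing Monitor Solution feature; the function names and the "normal" label are assumptions.

```python
def update_misses(consecutive_misses: int, heartbeat_received: bool) -> int:
    """Reset the counter on a received heartbeat, otherwise add one miss."""
    return 0 if heartbeat_received else consecutive_misses + 1


def heartbeat_severity(consecutive_misses: int) -> str:
    """Map consecutive missed heartbeats to the severities proposed above."""
    if consecutive_misses <= 0:
        return "normal"
    if consecutive_misses == 1:
        return "warning"   # one miss may be no more than a network glitch
    return "critical"      # two or more misses in a row


misses = 0
for received in [True, False, False, True]:
    misses = update_misses(misses, received)
    print(received, heartbeat_severity(misses))
```

A ping of the resource after the first miss could feed the same counter in place of a second heartbeat.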

Any thoughts?

Kind Regards, 

Michael_S's picture

Hi Joseph. We'll meet in a few days, but in the meantime I wanted to comment on this thread. We monitor retail outlets across the WAN, and probably from the very first day I rolled this out I was looking for the retry feature. For one reason or another, it's important to have a built-in second try so we eliminate as many false positives as possible and have accurate reports.
We had to revert to agentless ICMP monitoring across the WAN because of the missing second try. This didn't work out so well for us either. That being said, Symantec, and in particular the monitoring team, has been top notch in working through this issue. Last week I think we finally saw the light at the end of a long tunnel, so our agentless metrics will soon be recording correctly and the reports will be accurate... no more %UNKNOWN. :)
Best regards and see you on Monday.

Joseph_Carson's picture

Hi All,

Just checking whether anyone has new updates on best practices for Monitor Solution: anything that you normally change after an out-of-box install, etc. I would also like feedback on changes to the default settings and suggestions for rolling out policies.

Michael, we are researching the heartbeat request and hopefully sometime in the near future we will add this.  I will keep you posted here.

Many Thanks,

scott.hall's picture


I don't know if this is a best practice or not, but this is something that's made our lives a little easier.

We have several port checks that we want to apply globally across our environment, but we want the checks to target only servers that have the Altiris Agent and a specific service installed and set to 'Automatic' or 'Manual'. We whipped up a quick SQL-based filter that uses our inventory data to identify only those machines.

Here's an example of the SQL used in the filter that is needed to identify our active Citrix servers so we can run port 1494 checks against them:


SELECT DISTINCT
  cid.Guid [Guid],
  cid.[Name]
FROM
  vComputer cid
  JOIN Inv_AeX_AC_Client_Agent cia ON cid.Guid = cia._ResourceGuid
  JOIN Inv_AeX_AC_NT_Services nt ON cid.Guid = nt._ResourceGuid
WHERE 1 = 1
  AND nt.[Name] = 'IMAService'
  AND nt.StartupType IN ('Automatic', 'Manual')
  AND UPPER(cid.[OS Name]) LIKE UPPER('%Windows%')
  AND UPPER(cia.[Agent Name]) IN (
    UPPER('Altiris Monitor Agent'),
    UPPER('Altiris Monitor Agent POP'),
    UPPER('Altiris Monitor Agent RMS'))
 

As you can see, you can easily change the name of the service to make the filter apply to whatever you need.

Anyway, that's helped us move away from static lists that need constant updating.
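If you keep several of these filters, the service name can be swapped in from a small helper. The sketch below is a hypothetical Python function (`service_filter_sql` is not part of any Altiris tooling) that only builds the query text from the example above; pasting the result into a filter is still a manual step.

```python
def service_filter_sql(service_name: str) -> str:
    """Build the filter query from the example above for one Windows service."""
    # Double any single quotes so the name stays inside the SQL string literal.
    safe_name = service_name.replace("'", "''")
    return f"""SELECT DISTINCT
  cid.Guid [Guid],
  cid.[Name]
FROM
  vComputer cid
  JOIN Inv_AeX_AC_Client_Agent cia ON cid.Guid = cia._ResourceGuid
  JOIN Inv_AeX_AC_NT_Services nt ON cid.Guid = nt._ResourceGuid
WHERE nt.[Name] = '{safe_name}'
  AND nt.StartupType IN ('Automatic', 'Manual')
  AND UPPER(cid.[OS Name]) LIKE UPPER('%Windows%')
  AND UPPER(cia.[Agent Name]) IN (
    UPPER('Altiris Monitor Agent'),
    UPPER('Altiris Monitor Agent POP'),
    UPPER('Altiris Monitor Agent RMS'))"""


# IMAService is the service name used in the example above.
print(service_filter_sql("IMAService"))
```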

+++
Scott Hall
Enterprise Monitoring Engineer
Great American Insurance

yabru's picture

Hi Joseph,

You mentioned in your first post that there will be some improvements included in a future release of Monitor Solution. When do you expect this to be? Is it part of the upcoming 7.1?

Steve

Joseph_Carson's picture

Hi Scott,

Many thanks for sharing, and yes, this is exactly the type of thing we are looking for here.

All, just to make sure you are also familiar with the Monitor Community on Connect for sharing packs, reports, tasks, scripts, videos and anything else that everyone will find useful: please join the group. I have added some very useful reports recently.

https://www-secure.symantec.com/connect/groups/monitor-pack-factor-challenge-altiris-server-management-suite-70 

Yabru,

Correct, there will be Monitor improvements in the upcoming 7.1. A brief summary is below.

  1. x64 Altiris Server Support
  2. Multiple Remote Monitoring Server support (integrated into Site Service)
  3. Command Line Metric Parameters for column and row delimiters
  4. Moving informational rules to library only
  5. Windows and Linux Server Health Packs (note these are small packs that include rules for CPU, Memory, Disk and Network only) - they populate the new Server Resource Manager Home

These are just some of the highlights that will be included. The target for these is approximately the end of the year. Keep an eye on the CMS/SMS Beta for early access to these features.

Please continue to submit suggestions, ideas and best practice here.  Everyone is welcome.