Video Screencast Help

What is frequency of Monitor Script run in VCS ?

Created: 02 Apr 2014 | 8 comments

Hello,

I came across one issue in VCS where application is configured using Application agent.  We found VCS iniated failover of application after Application agent found Application went offline outside cluster ( not initiated by vcs ), however from Application logs we found everything was fine on application.

Then we tested Monitor Program ( script ) provided by Application. It was well defined to monitor a PID and provide exit code based on PID available to VCS agent. i.e exit 110 if running, 110 if not running.

To test if Monitor script is giving correct exit code to VCS agent, we introduced echo messages in script directed to a log file locally. And we found some strange input in our created log file.

Note :  MonitorInverval  = 60

           MonitorTimeout = 60

 

We noticed, when Monitor script run after exacly 60 seconds, it gives exit code 110 which is fine.  However, we found Monitor Scripts executes by VCS before 60 seconds sometimes, i.e arund 7-8 seconds, and at that time it gives exist code 100 which is problem.

I want to know why VCS agent is calling Monigor Programm before 60 second ( within 7-8 seconds ) after last run.

# cat is_server_running.log | egrep 'exit'
 [Tue Apr 1 13:12:34 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:23:52 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:27:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:27:32 BST 2014] exit 100
 [Tue Apr 1 13:28:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:29:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:30:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:31:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:32:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:32:31 BST 2014] exit 100
 [Tue Apr 1 13:33:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:34:01 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:34:26 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:35:26 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:36:26 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:37:26 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:37:32 BST 2014] exit 100
 [Tue Apr 1 13:38:26 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:39:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:39:30 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:40:03 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:40:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:41:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:42:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:42:32 BST 2014] exit 100
 [Tue Apr 1 13:43:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:44:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:45:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:46:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:47:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:47:32 BST 2014] exit 100
 [Tue Apr 1 13:48:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:49:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:50:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:51:25 BST 2014] Server is already running, exit 110

Operating Systems:
Discussion Filed Under:

Comments 8 CommentsJump to latest comment

mikebounds's picture

VCS runs, by default, every 5 mins, for offline resources, so you will see that the above logs shows the "exit 100" running every 5 mins.  So for a single service group then the Monitor should run every minute on the system the service group is online and every 5 mins on the system(s) it is offline. 

If you have more that one service group containing your application and one service group is online and another service group is offline on a particular system, then that system will have the Monitor running at both 1 minute and 5 minute intervals, but each monitor should be monitoring its own specific resource.

So your logs would indicate one application is online which VCS checks every minute and another application is offline which VCS checks every 5 mins.

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

Marianne's picture

Is this a typo? Or really in the script?

exit 110 if running, 110 if not running.

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

sajith_cr's picture

Hi Anoop,

As Mike mentioned, verify on which cluster node you are getting the return values differently.

On the online node, monitor program run every 60 sec by default and on offline node it run every 5 min. Another attribute to check is LevelTwoMonitorFreq. It is possible that level one monitoring is finding that resource is online but a indepth monitoring -Level Two-(set to run at a frequency of 5 min) is returning offline.

Regards,

Sajith

If this post has helped you, please vote or mark as solution.

Setu Gupta's picture

Hi Anoop,

I agree with Mike and Sajith that there could be 2 different monitors logging at 2 different frequencies.

But there are some logs that are intermediate:

 [Tue Apr 1 13:39:30 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:40:03 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:40:25 BST 2014] Server is already running, exit 110
 [Tue Apr 1 13:41:25 BST 2014] Server is already running, exit 110

Was the MonitorProgram executed manually at 13:39:30 and 13:40:03?

 

Venkata Reddy Chappavarapu's picture

Printing resource name from ENV - VCS_LOG_RESOURCE_NAME along with exit code will help in identifying for which resource the exit code was printed.

Venkata Reddy Chappavarapu,

Sr. Manager,

Information Availability Group (VCS),

Symantec Corporation

===========================

PS: If you are happy with the answer provided, please mark the post as solution. You

kjbss's picture

I just did a couple of quick tests and unfortunately it seems that the VCS_LOG_RESOURCE_NAME Environment variable is only being set for the online and offline entry points!:

[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[online] Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[clean] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[online] Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[offline] Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....

Maybe this has changed in more up-to-date versions of SFHA?

Here's the test environment -- SFHA 5.0MP3RP2 on Solaris 10 (SunOS agbc81 5.10 Generic_141414-10 sun4u sparc):

-$ modinfo | grep -i vx
 39  1388350  3eea0 322   1  vxdmp (VxVM 5.0MP3RP2: DMP Driver)
 41 7be00000 20bbf0 323   1  vxio (VxVM 5.0MP3RP2 I/O driver)
 43 7bfeb508    c78 324   1  vxspec (VxVM 5.0MP3RP2 control/status d)
223 7afd1180    cb0 320   1  vxportal (VxFS 5.0_REV-5.0MP3A25_sol port)
224 7aa00000 1d89e0  21   1  vxfs (VxFS 5.0_REV-5.0MP3A25_sol SunO)
227 7b75c000  521d8 327   1  vxfen (VRTS Fence 5.0MP3RP2)
242 7a6b4000   a9e0 321   1  fdd (VxQIO 5.0_REV-5.0MP3A25_sol Qui)

-$ uname -a
SunOS xxxxx 5.10 Generic_141414-10 sun4u sparc SUNW,SPARC-Enterprise

 

Please tell us this is fixed in later versions....  -;)

 

kjbss's picture

...just to remove any doublts about the resource type used for the testing:

-$ hares -value kjb_testLogging_app Type
Application

 

kjbss's picture

I've tested on a "Red Hat Enterprise Linux Server release 6.4 (Santiago)" cluster with SFHA 6.0.30.0.

It seems that things have improved as the clean entry point now gets passed the VCS_LOG_RESOURCE_NAME Environment variable.

However, the monitor entry point is still not getting this environment variable passed to it, as you can see from the test custom-script-log below:

[online]  Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[offline] Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[online]  Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....
[clean]   Environment variable "VCS_LOG_RESOURCE_NAME" [kjb_testLogging_app] must have been inherited from the caller!
[monitor] Environment variable "VCS_LOG_RESOURCE_NAME" not set or length null....

The point here is that it would have been great to be able to depend upon this environment variable being set for all entry points.  Another environment variable  that would be very useful for these Application-Type scripts is the VCS_LOG_SCRIPT_NAME environment variable.  Unfortunately, this also cannot be relied upon, as it is not set for EVERY entry point.

That is all too bad, because instead of being able to define Application-Type entry-points scripts as:

hares -modify kjb_testLogging_app StartProgram   "/mnt/bin/testAppLogging"
hares -modify kjb_testLogging_app StopProgram    "/mnt/bin/testAppLogging"
hares -modify kjb_testLogging_app CleanProgram   "/mnt/bin/testAppLogging"
hares -modify kjb_testLogging_app MonitorProgram "/mnt/bin/testAppLogging"

...and then having the script internally "do the right thing" by determining the entry point and resource via the VCS_LOG_SCRIPT_NAME and VCS_LOG_RESOURCE_NAME Environment variables, we are stuck forcing the Cluster Administrator to pass this information into the script by hard-coding these into the attribute definitions:

hares -modify kjb_testLogging_app StartProgram   "/mnt/bin/testAppLogging online  kjb_testLogging_app"
hares -modify kjb_testLogging_app StopProgram    "/mnt/bin/testAppLogging offline kjb_testLogging_app"
hares -modify kjb_testLogging_app CleanProgram   "/mnt/bin/testAppLogging clean   kjb_testLogging_app"
hares -modify kjb_testLogging_app MonitorProgram "/mnt/bin/testAppLogging monitor kjb_testLogging_app"

Which just isn't anywhere near as good (but is 100% reliable, and therefore of course the only way to do it for now).