
IntentionalOffline feature seems to cause offline service groups to fail-over and restart (Veritas 5.0 MP3 RP2 on Solaris 10 x86-64)

Created: 16 Apr 2010 • Updated: 13 Jun 2010 | 4 comments
This issue has been solved. See solution.

Hi all,

THE BACKGROUND:

We have implemented a custom Veritas Agent for monitoring our application, and we are using the IntentionalOffline feature of V51 agents (VCS 5.0 MP3 or later) to signal to VCS that the application has been brought "Intentionally Offline". From the Veritas Agent Developer's Guide:

About intentional offline of applications
 
Certain agents can identify when an application has been intentionally shut down outside of VCS control. If an administrator intentionally shuts down an application outside of VCS control, VCS does not treat it as a fault. VCS sets the service group state as offline or partial, depending on the state of other resources in the service group.

This feature allows administrators to stop applications without causing failovers.
 

This is achieved by using a specific return code (RC 200 indicates intentional offline) from the custom agent monitoring script when the script detects that the application is offline outside of Veritas control.
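
To illustrate the idea, here is a minimal sketch of how a monitor script might choose its exit code. It is written in Python purely for illustration and is not our actual agent. The value 200 is the intentional-offline code mentioned above; 100 and 110 are assumed here as the usual offline/online exit codes for script-based agents and should be checked against the Agent Developer's Guide. The function name and both inputs are hypothetical.

```python
# Exit codes for a script-based monitor entry point.
# 200 comes from the description above; 100 and 110 are assumed conventions.
OFFLINE = 100
ONLINE = 110
INTENTIONAL_OFFLINE = 200


def monitor_exit_code(app_running: bool, stopped_by_admin: bool) -> int:
    """Map the observed application state to a monitor exit code.

    Both arguments are hypothetical inputs: how an agent detects a running
    application or an administrator-initiated stop is application specific.
    """
    if app_running:
        return ONLINE
    if stopped_by_admin:
        # The application was shut down deliberately outside VCS control.
        return INTENTIONAL_OFFLINE
    # The application is down and nobody asked for it: plain offline.
    return OFFLINE
```

A real monitor script would end by exiting with this value, for example `sys.exit(monitor_exit_code(...))`.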

Intentional Offline is supposed to set the resources offline in a similar way to manually taking a resource offline using 'hares -offline ...' or using the VCS GUI. It works as intended in general, and we are very happy with it. We also set the ExternalStateChange attribute on our resources which support Intentional Offline to 'OnlineGroup & OfflineGroup', meaning that VCS takes the service group online or offline as appropriate in response to an external state change. Our service groups are very simple NIC->IP->Application resource dependencies, with no service group to service group dependencies defined.

THE PROBLEM:

Consider this scenario:

The "custom_app_grp" service group is online on "nodeA" of our cluster. Our cluster contains three nodes: nodeA, nodeB and nodeC. The "custom_app_grp" group contains NIC ("custom_app_nic_res"), IP ("custom_app_ip_res") and Application ("custom_app_srv_res") resources, and the monitoring script for the Application resource supports "Intentional Offline". The service group is allowed to start on any node in the cluster, but is a failover service group, and can only run on one node at once.

We take our application offline using the application controls outside of Veritas. The "custom_app_srv_res" resource goes OFFLINE in response, and since ExternalStateChange is set, this brings the "custom_app_grp" service group offline too. The "custom_app_grp" is now OFFLINE on "nodeA". All nodes, "nodeA", "nodeB" and "nodeC", are online, but are now not running any other service groups apart from the cluster service group.

We now reboot the "nodeA" server. At this point the "custom_app_grp" invokes its fail-over behaviour, and VCS attempts to restart the service group on "nodeB".

Note: if, instead of taking the "custom_app_grp" offline outside of VCS control, we simply offline the group using 'hagrp' or the VCS GUI and then reboot "nodeA", the "custom_app_grp" does NOT begin to fail over.

...

So: would people expect a service group which is OFFLINE in the cluster to suddenly be marked as FAULTED, triggering VCS to begin restarting the service group on another node?

We wouldn't.

Has anyone else experienced this behaviour with Intentional Offline? Can Symantec support representatives reading the forum comment on whether they would consider this behaviour a bug?

Many thanks.

Kind regards,

Dave Hassett
 


Leigh Brown's picture

Hello Dave,

I have developed a custom agent that also made use of the new intentional offline functionality.  I encountered a similar issue that may or may not be related to your own, depending on how you have coded your own agent.

I thought I'd share the details with you in case it helped (it may not).

In my case, I was developing a resource that acted as a service group proxy, in other words the state of the resource mirrored the state of another service group. This is somewhat similar to the remote service group but was for other service groups in the same cluster.  Anyway, one of the features I wanted for the agent was to avoid faulting the resource if the other service group went offline.

I implemented this functionality and my initial testing worked very well.  However, when I was testing stopping and starting VCS (using hastop -local -force then hastart) I encountered a similar issue - when VCS probed the resources in the service group that included the proxy resource, it took the parent resources of the proxy resource offline.  This was definitely not what I was expecting.

I eventually determined that this was because my agent was quite stupid.  In a naive attempt to "ensure" that the resource would never fault, I made it always return 200 (intentional offline) when it was offline.  Unfortunately, this did not work as expected when initially probing the resources.

In order to fix the problem I added additional code to my agent to ensure that it only returned 200 (intentional offline) when it was transitioning from an online to offline state, and in particular never to return 200 when the resource was first probed.
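
Roughly, the logic I ended up with looks like the sketch below. It is Python for illustration only, not my agent's actual code. The 200 value is the intentional-offline code; 100 and 110 are assumed here as the usual offline/online exit codes for script-based agents. The function name and the way the previous state is passed in are hypothetical: a real agent would have to remember the last reported state itself.

```python
from typing import Optional

# 200 is the intentional-offline code; 100 and 110 are assumed conventions.
OFFLINE = 100
ONLINE = 110
INTENTIONAL_OFFLINE = 200


def probe_exit_code(previously_online: Optional[bool], online_now: bool) -> int:
    """Return 200 only on an online -> offline transition.

    previously_online is None on the first probe, when the agent has no
    earlier state to compare against (hypothetical representation).
    """
    if online_now:
        return ONLINE
    if previously_online is True:
        # The resource was online at the last monitor and is offline now.
        return INTENTIONAL_OFFLINE
    # First probe, or already offline at the last monitor: plain offline.
    return OFFLINE
```

With this shape, a first probe of a resource that is down yields 100 rather than 200, and a resource that stays down reports 200 once and then 100 on later monitor cycles.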

After I made this change I was able to stop and start VCS (leaving everything running) without issue.

So, to cut a long story short, if your agent is returning 200 on the initial probe when VCS starts, this may be the cause of the issue you are seeing.

Regards,

Leigh.

rationalbytes's picture

Hi Leigh,

Thanks for your reply.

In my case, the situation is slightly different. The agent is not returning Intentional Offline during agent start-up: it is only returning it when the application notifies the agent it is stopping.

The rest of the time, the agent returns Offline when the application is detected as not running.

It is a puzzler. I'm going to try logging the question with Symantec support, but given their statement to the effect of "we don't support custom agents", I'm not holding my breath...

Thanks for your feedback.

Kind Regards,

Dave.

Leigh Brown's picture

Hi Dave,

I should have suggested this before, but it might be worth posting your engine logs from when VCS starts up to see if anyone can see why it is exhibiting the unexpected behaviour.

Regards,

Leigh.

rationalbytes's picture

Looks like this is a bug fixed in 5.0 MP3 Rollup Patch 3. We logged the call with Symantec, and they could not reproduce the issue. On investigation, it turned out that they were testing on RP3, rather than RP2a. We noticed in the Rollup Patch 3 release notes this line:

  1744255 [AGFW] Agfw should not convert IntentionalOffline to Offline, (1) in first probe, (2) when probe is requested in Offline state

 
We have retested on MP3 RP3, and we cannot reproduce the issue. Looks like this is the solution.

Hope this helps somebody else.

Thanks for the replies.

Dave.

SOLUTION