How does SFWHA handle Split-Brain without a Fencing Function?

Created: 07 Nov 2013 • Updated: 10 Nov 2013 | 3 comments
This issue has been solved. See solution.

Hi 

As we know, fencing is not available in SFWHA 6.0.1 on Windows.

So, if all heartbeat links are lost, how does SFWHA 6.0.1 handle split-brain to keep the cluster working properly?

Thanks :) 


Wally_Heim

Hi Tonado,

SFW-HA follows the Microsoft Clustering techniques when it comes to disk reservations in a split brain situation.  The technique is called Challenge/Defense.  Here are the basics of it.

* The active node maintains a SCSI reservation on all of the disks in the disk group.  The reservation is monitored and updated every few seconds (the exact timing varies slightly with the version of Windows).  For our example, let's say that the active node does its SCSI reservation updates every 3 seconds.

After all heartbeats are lost at the same time the following happens.

1. The remaining nodes in the cluster decide which node will attempt to online the service groups that were on the Active node.

2. This node clears the SCSI reservations on the disks in the disk group that it is going to try to take over.

3. It waits for 7 seconds (2 SCSI reservation cycles + 1 second) then it checks if the SCSI reservation is still cleared or if the SCSI reservation was updated by the prior active node.

3a. If the SCSI reservation is still cleared on the disks in the disk group, then the node brings the disk group online and starts the rest of the service group.

3b. If the SCSI reservation has been restored by the active node, then the node faults its attempt to online the disk group and faults the service group on this node.

4. If there are other passive nodes, this same challenge/defense process is tried on all the remaining nodes one at a time until they all fault or one of them is able to online the service group.
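The steps above can be sketched as a small simulation. This is only a toy model under stated assumptions: the names (Disk, Defender, challenge, failover) are made up for illustration, no real SCSI commands are issued, the wait is simulated rather than slept, and the 3-second interval is the example value from this post, not a documented SFW-HA setting.

```python
from dataclasses import dataclass
from typing import List, Optional

RESERVATION_INTERVAL_S = 3                          # example value from the post
CHALLENGE_WAIT_S = 2 * RESERVATION_INTERVAL_S + 1   # 2 cycles + 1 second = 7


@dataclass
class Disk:
    """Hypothetical stand-in for a shared disk with at most one reservation."""
    holder: Optional[str] = None


class Defender:
    """Hypothetical prior active node; re-reserves its disks only while alive."""

    def __init__(self, name: str, alive: bool) -> None:
        self.name = name
        self.alive = alive

    def defend(self, disks: List[Disk], elapsed_s: int) -> None:
        # A live owner gets at least one refresh cycle inside the wait window.
        if self.alive and elapsed_s >= RESERVATION_INTERVAL_S:
            for disk in disks:
                disk.holder = self.name


def challenge(challenger: str, disks: List[Disk], defender: Defender) -> bool:
    """Return True if the challenger may bring the disk group online."""
    for disk in disks:                        # step 2: clear the reservations
        disk.holder = None
    defender.defend(disks, CHALLENGE_WAIT_S)  # step 3: simulated 7-second wait
    if all(disk.holder is None for disk in disks):
        for disk in disks:                    # step 3a: take over the disks
            disk.holder = challenger
        return True
    return False                              # step 3b: owner defended, fault


def failover(passive_nodes: List[str], disks: List[Disk],
             defender: Defender) -> Optional[str]:
    """Step 4: try each passive node in turn; return the winner or None."""
    for node in passive_nodes:
        if challenge(node, disks, defender):
            return node
    return None
```

For example, with a Defender("A", alive=True) every challenge is defended and failover returns None (all passive nodes fault), while with alive=False the first passive node in the list wins and ends up holding the reservations.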

 

SFW does this same challenge/defense process for VCS and for Windows clusters where SFW is involved.

 

The IP resource is another resource that can stop a split-brain cluster from onlining the service groups.  The IP resource pings the network to check whether the IP is already online on another system in the environment.  If the IP responds, then the service group is faulted and the online attempt on that node is stopped.
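The IP check described above boils down to a simple decision, sketched here. The function name and the injected probe are assumptions for illustration only; this is not the actual VCS IP agent code, and a real probe would send a ping to the address.

```python
from typing import Callable


def ip_online_decision(address: str, responds: Callable[[str], bool]) -> str:
    """Decide what to do before bringing a virtual IP online on this node.

    `responds` is a hypothetical probe that returns True when something on
    the network already answers for `address`.
    """
    if responds(address):
        # Another system already has the IP up, so fault instead of onlining.
        return "FAULTED"
    return "ONLINE"
```

With a probe that always answers (lambda addr: True) the result is "FAULTED"; with one that never answers, the result is "ONLINE".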

 

Thank you,

Wally

SOLUTION
Tonado.wang

Hi Wally

Thanks for your detailed reply.

It really helped me a lot!

Marianne

In addition to Wally's excellent post - Please follow Symantec's recommendation to ensure that heartbeats are using physically different NICs, different switches/network paths.

In short, ensure that there is no single point of failure as far as heartbeats are concerned.

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows