shared mount point down after one node down?

Updated: 21 May 2010 | 4 comments
lkthomas

Hey folks, I am doing a POC in our lab and ran into this problem:

Two-node Red Hat SFCFS 5.0 MP3 cluster.

In testing, when node A's heartbeat link goes down, the mount point /storage on node B also unmounts.

Is it supposed to fail over automatically? If so, how do I set that up?

I can't imagine that in a 20-node cluster, one node failing would bring down the other 19 healthy nodes. That doesn't make sense to me, please help.

Comments

ScottK, 22 Oct 2009

I have a hypothesis about the configuration of your cluster. When node A fails and /storage unmounts on node B...
(a) do you have only one active heartbeat link at that point in time?
(b) do you have I/O fencing configured?

If the answers are (a)=yes and (b)=no, then what you are seeing is the software attempting to avoid data corruption. To node B, when only one heartbeat link is active, the failure of node A is not distinguishable from someone pulling the heartbeat link. If there is no heartbeating between the nodes, then node A and node B could not coordinate writes, locks, etc. to the disk and data would, in many cases, become corrupted. Configuring I/O fencing provides an alternate mechanism to determine whether the communication stoppage is due to node failure or to network failure.

If you have multiple links and this happens, or if you have I/O fencing configured, then there must be something else happening.
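
The reasoning above can be written out as a small decision function. This is only a toy model of the argument in this comment, not the actual LLT/GAB/CFS logic; the function name, parameters, and return strings are hypothetical.

```python
def survivor_action(links_up_before_loss: int, fencing_configured: bool) -> str:
    """Toy model: what a surviving node does when it stops hearing its peer.

    Assumption: this mirrors the reasoning in the comment above, not the
    product's real implementation.
    """
    if fencing_configured:
        # Fencing gives an independent way to decide who keeps the storage.
        return "arbitrate via fencing, winner keeps the mount"
    if links_up_before_loss >= 2:
        # Several independent links going silent together points to a dead peer.
        return "treat peer as failed, keep the mount"
    # One link and no fencing: a dead peer and a pulled cable look identical,
    # so the only safe choice is to stop touching the shared storage.
    return "stop access to shared storage"


# The situation described in the original question: one link, no fencing.
print(survivor_action(links_up_before_loss=1, fencing_configured=False))
```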

lkthomas, 22 Oct 2009

OK, so the shared storage stays mounted if:
1. I/O fencing is configured
or
2. more than one heartbeat link is available? Does the second heartbeat link count if it is a low-priority link, or does it have to be a regular second heartbeat link?

One question:
do I need to configure anything to keep shared storage mounted on the other nodes when one node fails?

thanks.

ScottK, 23 Oct 2009

Sorry, I didn't quite follow the question...

Sharing storage among nodes has a lot of benefits (failover, parallel access for parallel apps, flexibility...). But the nodes have to coordinate their access. If the nodes aren't coordinated and talking to each other, the data can get corrupted. One approach is to stop all access when inter-node communication stops. But that approach means that all applications on all nodes have to stop. Another approach is to limit access, via some method, to only some of the nodes and deny access to all the others. That way, at least some nodes (and their applications) keep running. Storage Foundation + VCS implement this latter approach through their I/O Fencing feature. I/O Fencing is extremely robust protection against uncoordinated access to shared storage, while still allowing at least one node access to that storage. I/O Fencing does have to be explicitly set up and configured, and your storage has to support it (most, but not quite all, storage does).

Hope that helps...

Gaurav Sangamnerkar, 26 Oct 2009

Hello,

To answer your queries....

a) It is always recommended to use I/O fencing to prevent data corruption in the event of split brain (loss of all heartbeats).
b) If you had only one link and you configure a second link, it should be configured as a high-priority link (unless you specifically configure it as low priority).

Regarding your last query...

If you have configured I/O fencing correctly, that is, all nodes are registered with the coordinator disks, you do not need to do any manual configuration (or preserve anything) when a node fails.

Let's take an example that might help. Say you have a three-node cluster (nodes A, B and C) with I/O fencing configured correctly. Also assume this cluster runs three applications, AppsA, AppsB and AppsC, one per node, and each service group contains one disk group, a couple of volumes and a couple of mount points. So AppsA runs on A, AppsB on B and AppsC on C.

Say node C fails (node crash/panic or hardware fault). The other two active nodes detect the failure of node C (assuming heartbeats are OK), and AppsC is failed over to A or B depending on the SystemList priority you set. Ideally, if a node goes down gracefully, the fencing module also goes down gracefully and the I/O fencing keys are removed from the disks it owned. If the node crashes, the keys might not be removed, though that makes no difference in the scenario we are considering. To conclude, there is nothing to preserve manually: VCS takes all the actions, and AppsC is brought online on one of the surviving nodes.
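
As a rough illustration of the SystemList priority choice in this example, here is a minimal sketch. It assumes the default priority-based failover, where the surviving system with the lowest priority number is chosen; the function name and the priority values are made up for the example.

```python
from typing import Dict, Optional, Set


def pick_failover_target(system_list: Dict[str, int], alive: Set[str]) -> Optional[str]:
    """Pick the surviving system with the lowest priority number.

    Assumption: priority-based failover, lower number = preferred system.
    The priorities passed in below are illustrative only.
    """
    candidates = [(prio, name) for name, prio in system_list.items() if name in alive]
    return min(candidates)[1] if candidates else None


# Hypothetical SystemList for the AppsC service group: C preferred, then A, then B.
appsc_system_list = {"C": 0, "A": 1, "B": 2}

# Node C has crashed, so only A and B are candidates.
print(pick_failover_target(appsc_system_list, alive={"A", "B"}))
```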

Now take the case of heartbeat failure. Say node C loses all working heartbeats, so nodes A and B think node C is lost and node C thinks A and B are lost. This condition is called split brain. Without protection, A and B would try to take over AppsC while node C would try to take over AppsA and AppsB, causing data corruption. Since you have I/O fencing enabled, a race for the coordinator disks happens; the mini-cluster (nodes A and B) wins the race and node C panics, so node C can no longer access the storage and corrupt data. Once node C comes back up after the panic, if the heartbeats have not been repaired, it won't start and will wait in GAB to be seeded manually.

So I/O fencing is foolproof protection for your data.
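
The outcome of the race in this example can be sketched as follows. This is a toy model: the real race is decided on the coordinator disks, and the rule used here (the larger sub-cluster wins, ties broken by lowest node name) is an assumption taken from the example, with a hypothetical function name.

```python
from typing import List, Set, Tuple


def fencing_race(subclusters: List[Set[str]]) -> Tuple[Set[str], List[str]]:
    """Return (winning sub-cluster, nodes that panic).

    Assumption: the larger sub-cluster wins; the tie-break by lowest node
    name is hypothetical and only keeps the sketch deterministic.
    """
    ranked = sorted(subclusters, key=lambda s: (-len(s), min(s)))
    winner, losers = ranked[0], ranked[1:]
    panicked = [node for sub in losers for node in sorted(sub)]
    return winner, panicked


# Node C loses all heartbeats: the cluster splits into {A, B} and {C}.
winner, panicked = fencing_race([{"A", "B"}, {"C"}])
print("keeps storage access:", sorted(winner))
print("panics:", panicked)
```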

Hope this helps...

Gaurav

PS: If you are happy with the answer provided, please mark the post as solution. You can do so by clicking link "Mark as Solution" below the answer provided.