MultiNICA failed issue
We have a problem with MultiNICA resources failing on one node when the NICs on the other node are disconnected or have failed. For example, if we disconnect the NIC e1000g1 from the switch on node1, then e1000g1 on node2 is also reported as failed.
We have a 1+1 symmetric cluster setup with two T5220 nodes.
These are connected to the LAN with 3 NICs: 1 for operation and maintenance and 2 for traffic. The 2 traffic NICs (e1000g1, nxge1) are handled by Veritas, and only one of them is active at a time; the other is redundant.
The Network resource looks like this in main.cf:
group Network (
    SystemList = { node1 = 0, node2 = 1 }
    Parallel = 1
    AutoStartList = { node1, node2 }
    )

    MultiNICA Multi-NIC (
        Device @node1 = { e1000g1 = "10.240.204.197",
            nxge1 = "10.240.204.197" }
        Device @node2 = { e1000g1 = "10.240.204.198",
            nxge1 = "10.240.204.198" }
        NetMask = "255.255.255.240"
        RetestInterval = 2
        RouteOptions = "default 10.240.204.193"
        IfconfigTwice = 1
        )

    Phantom Network_Phantom (
        )

    // resource dependency tree
    //
    //  group Network
    //  {
    //  MultiNICA Multi-NIC
    //  Phantom Network_Phantom
    //  }
e1000g1 and nxge1 are connected to 2 separate switches for both nodes.
A typical service group using this interface looks like this:
group SentinelLM (
    SystemList = { node1 = 0, node2 = 1 }
    AutoStartList = { node1, node2 }
    FailOverPolicy = Load
    Load = 5
    )

    IPMultiNIC SentinelLM_ip (
        Address = "10.240.204.200"
        NetMask = "255.255.255.240"
        MultiNICResName = Multi-NIC
        IfconfigTwice = 1
        )

    Process SentinelLM (
        PathName = "/application/sentinel/bin/lserv"
        Arguments = "-s /application/sentinel/bin/lservrc"
        )

    Proxy SentinelLM_nic (
        TargetResName = Multi-NIC
        )

    requires group ApplicationDG online local firm
    SentinelLM requires SentinelLM_ip
    SentinelLM_ip requires SentinelLM_nic

    // resource dependency tree
    //
    //  group SentinelLM
    //  {
    //  Process SentinelLM
    //      {
    //      IPMultiNIC SentinelLM_ip
    //          {
    //          Proxy SentinelLM_nic
    //          }
    //      }
    //  }
We are running tests in which we disconnect the cables from both traffic interfaces on one node to verify that the services fail over to the other node.
But when we disconnect both cables from node1, the same interfaces are also reported as failed on node2. This is what we're getting in engine_A.log:
2009/10/09 13:47:03 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:03 VCS WARNING V-16-10001-6005 (node2) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:03 VCS WARNING V-16-10001-6006 (node2) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:03 VCS ERROR V-16-10001-6018 (node2) MultiNICA:Multi-NIC:monitor:Error in 'ifconfig' command execution:
ifconfig: SIOCSLIFNAME for ip: nxge1: already exists
2009/10/09 13:47:04 VCS WARNING V-16-10001-6019 (node2) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:04 VCS ERROR V-16-10001-6014 (node2) MultiNICA:Multi-NIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE
2009/10/09 13:47:05 VCS WARNING V-16-10001-6004 (node2) MultiNICA:Multi-NIC:monitor:Device FAILED
2009/10/09 13:47:05 VCS WARNING V-16-10001-6005 (node2) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:05 VCS WARNING V-16-10001-6006 (node2) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:05 VCS WARNING V-16-10001-6007 (node2) MultiNICA:Multi-NIC:monitor:Trying to online Device e1000g1
2009/10/09 13:47:07 VCS INFO V-16-10001-6008 (node2) MultiNICA:Multi-NIC:monitor:Sleeping 2 seconds
2009/10/09 13:47:09 VCS WARNING V-16-10001-6009 (node2) MultiNICA:Multi-NIC:monitor:Pinging Broadcast address 10.240.204.207 on Device e1000g1, iteration 1
2009/10/09 13:47:10 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device e1000g1 FAILED
2009/10/09 13:47:10 VCS WARNING V-16-10001-6005 (node1) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:10 VCS WARNING V-16-10001-6006 (node1) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:10 VCS ERROR V-16-10001-6018 (node1) MultiNICA:Multi-NIC:monitor:Error in 'ifconfig' command execution:
ifconfig: SIOCSLIFNAME for ip: nxge1: already exists
2009/10/09 13:47:11 VCS WARNING V-16-10001-6019 (node1) MultiNICA:Multi-NIC:monitor:Device nxge1 could not be brought up
2009/10/09 13:47:11 VCS ERROR V-16-10001-6014 (node1) MultiNICA:Multi-NIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE
2009/10/09 13:47:12 VCS WARNING V-16-10001-6004 (node1) MultiNICA:Multi-NIC:monitor:Device FAILED
2009/10/09 13:47:12 VCS WARNING V-16-10001-6005 (node1) MultiNICA:Multi-NIC:monitor:Acquired a WRITE Lock
2009/10/09 13:47:12 VCS WARNING V-16-10001-6006 (node1) MultiNICA:Multi-NIC:monitor:Bringing down IP addresses
2009/10/09 13:47:12 VCS WARNING V-16-10001-6007 (node1) MultiNICA:Multi-NIC:monitor:Trying to online Device e1000g1
As you can see: even though we disconnected node1, the first node to have a failed network resource is node2.
We are also seeing a situation where all cables are connected and the MultiNICA resource fails on the active node when we stop/start the passive node. I believe this could be related.
Any theory of what is happening here?
Comments
Hi,
Could it be that the network you are using contains no other hosts besides the cluster nodes and the router? Routers are usually configured not to reply to broadcast pings, which is what the MultiNICA agent uses by default. If the other node's NICs are disconnected, nothing is left on the network to answer the broadcast pings, so the agent on the healthy node also concludes that its NIC has failed.
I would try adding the NetworkHosts attribute. It should point to the router or any other host in the network. Here is the excerpt from the VCS Bundled Agents Guide:
NetworkHosts
The list of hosts on the network that are pinged to determine if the network connection is alive. Enter the IP address of the host, instead of the host name, to prevent the monitor from timing out—DNS causes the ping to hang. If this attribute is unspecified, the monitor tests the NIC by pinging the broadcast address on the NIC. If more than one network host is listed, the monitor returns online if at least one of the hosts is alive. Type and dimension: string-vector Example: "128.93.2.1", "128.97.1.2"
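Applied to the MultiNICA resource posted above, this could look like the sketch below. It assumes that the default router 10.240.204.193 (taken from your RouteOptions) answers ordinary ICMP echo requests, which I have not verified for your network; any other always-on host in the same subnet should do as well. Only the NetworkHosts line is new, everything else is copied from your main.cf:

    MultiNICA Multi-NIC (
        Device @node1 = { e1000g1 = "10.240.204.197",
            nxge1 = "10.240.204.197" }
        Device @node2 = { e1000g1 = "10.240.204.198",
            nxge1 = "10.240.204.198" }
        NetMask = "255.255.255.240"
        // assumption: the router replies to normal (non-broadcast) pings
        NetworkHosts = { "10.240.204.193" }
        RetestInterval = 2
        RouteOptions = "default 10.240.204.193"
        IfconfigTwice = 1
        )

With NetworkHosts set, the monitor pings the listed hosts rather than the broadcast address, so the result should no longer depend on the other cluster node being there to answer.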
I hope this helps.
Regards
Manuel
Please don't forget to mark your thread solved with whatever answer helped you : )