
MultiNICA faulted and then switched the base + virtual IP over to another interface

Created: 11 Sep 2012 • Updated: 18 Oct 2012 | 21 comments
Zahid.Haseeb:
This issue has been solved. See solution.

Expected Environment

Solaris 10

HA 6.0 (Single Node Cluster)

one built-in interface on the Solaris machine (bge0)

four port interface card installed (qfe0, qfe1, qfe2, qfe3)

 

I configured a MultiNICA resource and added qfe0 and qfe1, each with base IP 192.168.253.140, and also configured an IPMultiNIC resource with virtual IP 192.168.253.250.

I unplugged the network cable from qfe0. The MultiNICA + IPMultiNIC resources faulted, and after a while they came online again and then the base + virtual IPs switched over.

My question is: why does the resource fault and come back on a new interface? Why don't the IPs switch over without the resource faulting? Is there anything I am missing?


mikebounds:

Make sure you have other IPs on the 192.168.253 subnet on other servers and routers, as VCS will ping out to check the network.

A few other points:

  1. You should use interfaces from different physical cards so that the quad card is not a single point of failure; for example, use bge0 and qfe0
  2. It is much better to use MultiNICB with Solaris IPMP - this is considerably faster, a few seconds compared to over a minute.  If you don't want to use Solaris IPMP, then MultiNICB is still quicker (it just requires more IP addresses)
  3. I would recommend using the NetworkHosts attribute, as pinging the IPs specified in this attribute is much more efficient than doing a broadcast, which is what happens if you leave this attribute blank
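As a hedged sketch, the NetworkHosts attribute from point 3 might be set with the VCS CLI along these lines (the resource name MultiNIC and the 192.168.253.1 host are taken from later in this thread; this is a configuration fragment for a live cluster, not something to run standalone):

```shell
# Open the configuration, add a known-good IP for the agent to ping
# instead of broadcasting, then save and close the configuration.
haconf -makerw
hares -modify MultiNIC NetworkHosts 192.168.253.1
haconf -dump -makero
```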

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

Zahid.Haseeb:

Yes, I have one IP assigned on a server: 192.168.253.1

1.) That was the testing phase, but I have now configured bge0 and qfe0

2.) Right now I want to use MultiNICA

3.) Yes, this attribute is set

But still the same problem: when I unplugged the active NIC's (bge0) cable, the MultiNICA resource faulted, and after a while it came up again with the IPs assigned on the backup NIC (qfe0).

I don't know why the MultiNICA resource is faulting.

Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb

zahidhaseeb.wordpress.com

Zahid.Haseeb:

I might have done something wrong. I created a new service group, added the MultiNICA and IPMultiNIC resources, and linked them. See the main.cf file below for reference. But now when I unplug the LAN cable, no service group faults and no IP (base + virtual) switches to the backup NIC/interface. Solaris says the IP is alive, but that IP is not pingable from outside.

                                                     """"base IP and virtual IP""""

# ping 192.168.253.254
192.168.253.254 is alive

# ping 192.168.253.250
192.168.253.254 is alive

                                                               """"main.cf file""""

bash-3.00# cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"

cluster CLUSTERA (
        UserNames = { admin = bIJbIDiFJeJJhRJdIG }
        Administrators = { admin }
        )

system solaris (
        )

group CLUSTERA-SG (
        SystemList = { solaris = 0 }
        )

        IPMultiNIC IPMnic (
                Address = "192.168.253.250"
                NetMask = "255.255.0.0"
                MultiNICResName = MultiNIC
                )

        MultiNICA MultiNIC (
                Device @solaris = { bge0 = "192.168.253.254",
                         qfe0 = "192.168.253.254" }
                NetMask = "255.255.0.0"
                NetworkHosts = { "192.168.253.1" }
                )

        IPMnic requires MultiNIC

        // resource dependency tree
        //
        //      group CLUSTERA-SG
        //      {
        //      IPMultiNIC IPMnic
        //          {
        //          MultiNICA MultiNIC
        //          }
        //      }


                                                           """"ifconfig -a""""
 

# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 52
        inet 0.0.0.0 netmask ff000000
        ether 8:0:20:ea:2:2c
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 47
        inet 192.168.253.240 netmask ffff0000 broadcast 192.168.255.255
        ether 8:0:20:ea:2:2d
bge0: flags=1000803<UP,BROADCAST,MULTICAST,IPv4> mtu 1500 index 51
        inet 192.168.253.254 netmask ffff0000 broadcast 192.168.255.255
        ether 0:3:ba:78:61:75
bge0:1: flags=1000803<UP,BROADCAST,MULTICAST,IPv4> mtu 1500 index 51
        inet 192.168.253.250 netmask ffff0000 broadcast 192.168.255.255
 

The base and virtual IPs stayed stuck on bge0 and did not fail over.


g_lee:

From the VCS 6.0 Bundled Agents reference guide

http://www.symantec.com/business/support/resources...

----------
MultiNICA notes
[...]
The MultiNICA agent supports only one active interface on one IP subnet; the agent does not work with multiple active interfaces on the same subnet.

On Solaris, for example, you have two active NICs, hme0 (10.128.2.5) and qfe0 (10.128.2.8). You configure a third NIC, qfe1, as the backup NIC to hme0. The agent does not fail over from hme0 to qfe1 because all ping tests are redirected through qfe0 on the same subnet. The redirect makes the MultiNICA monitor return an online status. Note that using ping -i does not enable the use of multiple active NICs.
[...]
----------

In your ifconfig output, you still have qfe1 configured on the same subnet, so the ping tests succeed through this interface and the agent doesn't think the MultiNICA group (bge0/qfe0) is offline:

qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 47
        inet 192.168.253.240 netmask ffff0000 broadcast 192.168.255.255
        ether 8:0:20:ea:2:2d
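A quick way to audit a box for this condition is to list every UP interface carrying an address on the monitored subnet; more than one hit outside the MultiNICA group means the ping test can be satisfied through the wrong NIC. This sketch parses a saved `ifconfig -a` capture (sample data below is taken from the output earlier in this thread):

```shell
# Save a capture of ifconfig -a (here a sample from this thread).
cap=$(mktemp)
cat > "$cap" <<'EOF'
qfe0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 52
        inet 0.0.0.0 netmask ff000000
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 47
        inet 192.168.253.240 netmask ffff0000 broadcast 192.168.255.255
bge0: flags=1000803<UP,BROADCAST,MULTICAST,IPv4> mtu 1500 index 51
        inet 192.168.253.254 netmask ffff0000 broadcast 192.168.255.255
EOF

# Remember the interface name from each UP header line, and print it
# whenever the following inet line is on the 192.168.x.x subnet.
same_subnet=$(awk '/^[a-z]/ && /UP/ { ifc = $1 }
                   /inet 192\.168\./ { print ifc }' "$cap")
echo "$same_subnet"
```

Here both qfe1 and bge0 show up, which is exactly the condition the Bundled Agents guide warns about.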

If this post has helped you, please vote or mark as solution

Zahid.Haseeb:

What would be the suggested subnet in this scenario?


g_lee:

???????????

If you actually read the document, it's saying don't put other interfaces (not in the MultiNIC group) on the same subnet.

So, take qfe1 down, or give it an IP address on another subnet (i.e. not 192.168.253.x), so that MultiNICA can manage qfe0 and bge0 properly.
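For reference, either option might look like this on Solaris 10 (a sketch run as root; 10.0.0.1 is just an example address outside 192.168.253.x, and it matches what was eventually used later in this thread):

```shell
# Option 1: take qfe1 down and unplumb it entirely.
ifconfig qfe1 down
ifconfig qfe1 unplumb

# Option 2: re-address qfe1 on a different subnet instead.
ifconfig qfe1 plumb 10.0.0.1 netmask 255.0.0.0 up
```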

 


Zahid.Haseeb:

The resource faulted. After a while the base IP + virtual IP failed over to qfe0.

 

bash-3.00# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 54
        inet 192.168.253.254 netmask ffff0000 broadcast 192.168.255.255
        ether 8:0:20:ea:2:2c
qfe0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 54
        inet 192.168.253.250 netmask ffff0000 broadcast 192.168.255.255
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 47
        inet 10.0.0.1 netmask ff000000 broadcast 10.255.255.255
        ether 8:0:20:ea:2:2d

 

==========

I have to plumb bge0 manually; it does not appear in the ifconfig -a output after the failover.


mikebounds:

As in my earlier post, "VCS will ping out to check network", and as Grace says, VCS thinks MultiNICA is up because it can ping 192.168.253.1 (via qfe1). I am not sure what you mean by "What would be the suggested subnet in this scenario?" Why do you need two interfaces on the same subnet?

Another thing is that you may want to configure "IfconfigTwice", as without this many routers will not learn the new MAC address of the IP when the IP fails over, which stops you being able to ping it from outside.
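As a sketch, IfconfigTwice might be enabled like this (the resource name IPMnic comes from the main.cf earlier in this thread; this is a cluster configuration fragment, not a standalone script):

```shell
# IfconfigTwice makes the agent bring the IP up, down, and up again
# after a failover, so an extra gratuitous ARP is sent and routers
# refresh their ARP caches.
haconf -makerw
hares -modify IPMnic IfconfigTwice 1
haconf -dump -makero
```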

Mike


mikebounds:

So is it all working now?

From what I can remember (it has been a long time since I used MultiNICA, as most customers use MultiNICB), it is normal for the unused interface to be unplumbed.

Mike


Zahid.Haseeb:

No Mike. Please see my last post above.


mikebounds:

I'm not clear what's wrong: the base IP + virtual IP are failing over and the unused base interface is being unplumbed. This is what is supposed to happen.

Mike


Zahid.Haseeb:

That is indeed what is supposed to happen, Mike, but the resource faults while doing it.


mikebounds:

Sorry Zahid, you are right, the resource should not fault. It may have faulted because qfe0 was plumbed, as I think it is not supposed to be plumbed. Now that bge0 is not plumbed, can you repeat the test, pulling bge0, and if the resource faults, can you give extracts from engine_A.log and MultiNICA_A.log from when the failure occurs?

Mike


Paresh Bafna:

Debug logs to help debug issue

 

Hi Zahid,

You can also enable debug logs for the MultiNICA and IPMultiNIC agents while performing the test. This will help debug the issue and narrow down the root cause.

Debug logs can be enabled for the MultiNICA and IPMultiNIC agents using the following commands:

 

# hatype -modify MultiNICA LogDbg 1 2 3 4 5

# hatype -modify IPMultiNIC LogDbg 1 2 3 4 5
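Once enabled, the debug output goes to the standard VCS log directory (the paths below are the usual defaults; engine_A.log and MultiNICA_A.log are the files Mike asked for above):

```shell
# Watch the engine and MultiNICA agent logs while repeating the
# cable-pull test, so the failure sequence is captured live.
tail -f /var/VRTSvcs/log/engine_A.log /var/VRTSvcs/log/MultiNICA_A.log
```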

 

Thanks and Regards,

Paresh Bafna


Zahid.Haseeb:

@ mike and Paresh

Thanks for the kind replies and the quick follow-up, I appreciate it. Let me do this and I will share the result.
 


Zahid.Haseeb:

@ Paresh and mike: As suggested

  • Debug logs enabled for the MultiNICA and IPMultiNIC agents using the commands above
  • Service group is UP with the MultiNICA and IPMultiNIC resources
  • ifconfig command result before unplugging the LAN cable

# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe0: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        ether 8:0:20:ea:2:2c
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 10.0.0.1 netmask ff000000 broadcast 10.255.255.255
        ether 8:0:20:ea:2:2d
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 7
        inet 192.168.253.254 netmask ffff0000 broadcast 192.168.255.255
        ether 0:3:ba:78:61:75
bge0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 7
        inet 192.168.253.140 netmask ffff0000 broadcast 192.168.255.255
 

ifconfig result after unplugging the LAN cable

bash-3.00# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 9
        inet 192.168.253.254 netmask ffff0000 broadcast 192.168.255.255
        ether 8:0:20:ea:2:2c
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 10.0.0.1 netmask ff000000 broadcast 10.255.255.255
        ether 8:0:20:ea:2:2d
 

The engine and IPMultiNIC logs are attached; the MultiNICA log does not contain anything.

Attachment: logs.tar.gz (2.18 KB)


mikebounds:

The attachment is just an empty logs directory; it should be bigger than 109 bytes.

Mike


Zahid.Haseeb:

Kindly check attachment again.


Paresh Bafna:

Hi Zahid,

Thanks for the logs.

As per my understanding, below is the sequence of events that may have taken place during your testing:

#1
MultiNICA resource is online with bge0 as the active interface.
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe0: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        ether 8:0:20:ea:2:2c
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 10.0.0.1 netmask ff000000 broadcast 10.255.255.255
        ether 8:0:20:ea:2:2d
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 7
        inet 192.168.253.254 netmask ffff0000 broadcast 192.168.255.255
        ether 0:3:ba:78:61:75
bge0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 7
        inet 192.168.253.140 netmask ffff0000 broadcast 192.168.255.255

Note that even though bge0 is the active interface, the other interface (qfe0) is also plumbed, with address 0.0.0.0.

#2
Cable pull performed on bge0.
MultiNICA agent detects cable pull and marks bge0 as failed interface.

2012/09/14 14:17:27 VCS WARNING V-16-10001-6004 (solaris) MultiNICA:MultiNIC:monitor:Device bge0 FAILED

#3
Now the MultiNICA agent tries to bring up the qfe0 interface by plumbing it and assigning the base IP address ("192.168.253.254") to it. The agent does this by executing a command similar to:
# ifconfig qfe0 inet plumb 192.168.253.254 netmask 255.255.0.0 up

#4
When the MultiNICA agent tries to execute the above command, execution fails because qfe0 is already plumbed. This can be seen in the following log:

2012/09/14 14:17:28 VCS ERROR V-16-10001-6018 (solaris) MultiNICA:MultiNIC:monitor:Error in 'ifconfig' command execution:ifconfig: SIOCSLIFNAME for ip: qfe0: already exist
2012/09/14 14:17:28 VCS WARNING V-16-10001-6019 (solaris) MultiNICA:MultiNIC:monitor:Device qfe0 could not be brought up

This error causes the agent to clear (unplumb) the qfe0 interface.

#5
As the attempt to bring up qfe0 failed and bge0 is not connected, the agent reports all interfaces as down.
2012/09/14 14:17:28 VCS ERROR V-16-10001-6014 (solaris) MultiNICA:MultiNIC:monitor:No more Devices configured. All devices are down. Returning OFFLINE

The resource faults at this point.

#6
The agent retries onlining the interfaces in the next monitor cycle.
2012/09/14 14:18:30 VCS WARNING V-16-10001-6004 (solaris) MultiNICA:MultiNIC:monitor:Device FAILED
2012/09/14 14:18:30 VCS WARNING V-16-10001-6005 (solaris) MultiNICA:MultiNIC:monitor:Acquired a WRITE Lock
2012/09/14 14:18:30 VCS WARNING V-16-10001-6006 (solaris) MultiNICA:MultiNIC:monitor:Bringing down IP addresses
2012/09/14 14:18:31 VCS WARNING V-16-10001-6007 (solaris) MultiNICA:MultiNIC:monitor:Trying to online Device bge0
2012/09/14 14:18:32 VCS INFO V-16-10001-6008 (solaris) MultiNICA:MultiNIC:monitor:Sleeping 5 seconds
2012/09/14 14:18:37 VCS WARNING V-16-10001-6010 (solaris) MultiNICA:MultiNIC:monitor:Pinging 192.168.253.1 with Device bge0 configured: iteration 1
2012/09/14 14:18:42 VCS INFO V-16-10001-6008 (solaris) MultiNICA:MultiNIC:monitor:Sleeping 5 seconds
2012/09/14 14:18:47 VCS WARNING V-16-10001-6009 (solaris) MultiNICA:MultiNIC:monitor:Pinging Broadcast address 192.168.255.255 on Device bge0, iteration 2
2012/09/14 14:18:57 VCS WARNING V-16-10001-6011 (solaris) MultiNICA:MultiNIC:monitor:Tried the PingTest 2 times
2012/09/14 14:18:57 VCS WARNING V-16-10001-6012 (solaris) MultiNICA:MultiNIC:monitor:The network did not respond
2012/09/14 14:18:57 VCS WARNING V-16-10001-6013 (solaris) MultiNICA:MultiNIC:monitor:Giving up on NIC bge0
2012/09/14 14:18:57 VCS WARNING V-16-10001-6015 (solaris) MultiNICA:MultiNIC:monitor:Trying the next NIC qfe0 in the list
2012/09/14 14:18:57 VCS WARNING V-16-10001-6007 (solaris) MultiNICA:MultiNIC:monitor:Trying to online Device qfe0
2012/09/14 14:18:58 VCS INFO V-16-10001-6008 (solaris) MultiNICA:MultiNIC:monitor:Sleeping 5 seconds
2012/09/14 14:19:03 VCS WARNING V-16-10001-6010 (solaris) MultiNICA:MultiNIC:monitor:Pinging 192.168.253.1 with Device qfe0 configured: iteration 1
2012/09/14 14:19:04 VCS WARNING V-16-10001-6016 (solaris) MultiNICA:MultiNIC:monitor:Migrated to Device qfe0
2012/09/14 14:19:04 VCS WARNING V-16-10001-6017 (solaris) MultiNICA:MultiNIC:monitor:Releasing Lock
2012/09/14 14:19:05 VCS INFO V-16-1-10299 Resource MultiNIC (Owner: Unspecified, Group: CLUSTERA-SG) is online on solaris (Not initiated by VCS)

The retry in the subsequent monitor cycle succeeds because the agent had cleared (unplumbed) the qfe0 interface during the first failed attempt.

My understanding is that if you keep only one interface plumbed at any moment, the agent should take care of the failover in the cable-pull scenario.
Please test the following scenario:
- Connect cables to all the interfaces (bge0 and qfe0)
- Wait until the MultiNICA agent detects/brings up the base IP address on one of the interfaces, say bge0
- Make sure the other interface (qfe0) is not plumbed on the system
- Perform the cable-pull test on bge0
- The MultiNICA agent should fail over the base IP, as well as any virtual IP addresses on bge0, to the qfe0 interface
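The checks around that test procedure might look like this on Solaris (a sketch run as root; interface names are the ones from this thread):

```shell
# Before the pull: the backup NIC should not be plumbed at all.
# An unplumbed device makes ifconfig report a "no such interface" error.
ifconfig qfe0

# If it turns out to be plumbed, clear it so the agent can plumb it
# itself during failover:
ifconfig qfe0 down
ifconfig qfe0 unplumb

# After pulling the bge0 cable, confirm the base + virtual IPs moved:
ifconfig -a
```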

Please let us know if this solves your issue.

Thanks and Regards,
Paresh Bafna


mikebounds:

As I said in an earlier post:

It may have faulted as qfe0 was plumbed in as I think this is not supposed to be plumbed in.  Now that bge0 is not plumbed in, can you repeat test

and Paresh has confirmed this, so can you test with just one interface plumbed in. Paresh suggests not plumbing any in, but the normal sequence of events is that you plumb in the preferred interface at boot (i.e. if you want qfe0 to be plumbed in, you create an /etc/hostname.qfe0 file) and you leave the other interface unplumbed.
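As a sketch of that boot-time setup (using a scratch directory instead of /etc so the snippet is safe to run anywhere; on the real system the files live in /etc, and here bge0 is taken as the preferred interface):

```shell
# Scratch directory standing in for /etc (assumption for a safe demo).
root=$(mktemp -d)

# At boot, Solaris plumbs every interface that has an
# /etc/hostname.<interface> file, so create one only for the
# preferred interface with its base IP...
echo "192.168.253.254" > "$root/hostname.bge0"

# ...and make sure the backup interface has none, leaving it
# unplumbed for the MultiNICA agent to manage.
rm -f "$root/hostname.qfe0"

ls "$root"
```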

Mike

SOLUTION
Zahid.Haseeb:

@ Mike and Paresh

Fantastic efforts by both of you, I am so thankful. I am giving you both a thumbs up, but I am confused about whose post to mark as the solution. Admins, please advise.

Thanks g_lee too
