link between primary site and dr site disconnected

Created: 01 Oct 2012 • Updated: 18 Oct 2012 | 29 comments
Zahid.Haseeb's picture
This issue has been solved. See solution.

SFHA/DR = 5.1

rhel = uname -a
Linux xxxxxxxxxxxxxxxxx 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
 

Hello all

I hope you are all well. I want to investigate what happened at 12:30 am on 30 September 2012. One of my clients manually switched the service group over from the primary site to the DR site. While the switchover was executing, his hagui session to the DR site (which he had opened from the primary site) was disconnected, so he could not see what was happening at the DR site. Since he could not see anything at the DR site, he brought the service group online at the primary site again so that his critical application would be up, and the service group came up successfully at the primary site. After that, he told me he could again open a hagui session to the DR site via the public IP, but replication between the primary and DR sites had stopped. (Please note: before the switchover from primary to DR, repstatus showed replication as connected and up-to-date, not behind.)

The above is case one.

Case two:

In the evening of the same day, between 5:00 pm and 6:00 pm, when I reached my client's primary site, I saw that the replication service group at the DR site was offline, which made me think this was why replication had stopped. When I brought the DR site's replication service group online, a red exclamation mark appeared on the primary node. I checked the replication status with the repstatus command, which reported a "primary - primary" configuration, so I ran the fbsync command from the DR site command prompt, which also failed with an error. I stopped the replication service group again and rebooted node1 at the primary site, and the exclamation mark disappeared. I then brought the service group online on node2 at the primary site.

0.) Kindly share your expert opinion on what actually happened, and please see the questions below too.
1.) Why did all sessions get disconnected when my client switched over to the DR site? (case one)
2.) Why did the primary - primary situation occur? (case two)

I have uploaded the engine logs of the DR site node for review.

mikebounds's picture

Attachment seems corrupt.

Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

Zahid.Haseeb's picture

It's a tar file. Or you may remove the .tar extension with the rename command and then untar the file.

Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb

zahidhaseeb.wordpress.com

mikebounds's picture

I know it's a tar file, but the tar file contains 1 incomplete line with no ASCII characters.

Mike

Zahid.Haseeb's picture

check this

AttachmentSize
node3logs.txt 38.74 KB

arangari's picture

I would suggest raising an escalation if there is a possible issue with the configuration and/or product behaviour.

Thanks and Warm Regards,

Amit Rangari

If this post helped you resolving the issue, please mark it as solution. _____________________________________________________________________________

Zahid.Haseeb's picture

I have already opened a case, but I have not had any positive response yet.

Zahid.Haseeb's picture

arangari, did you get my mail?

Zahid.Haseeb's picture

???

mikebounds's picture

When the service was switched to the DR site, VVR at DR was not connected to the prod site, so VVR did a takeover as opposed to a migrate. Prod was therefore not changed from primary to secondary, so you would have been in a primary-primary configuration at this stage, which requires a resync - see this entry in the log:

 

2012/09/30 00:26:54 VCS INFO V-16-20012-80 (node3) RVGPrimary:PRI-RVG:online:RVG RVG takeover successful; use vxrvg resync command to resynchronize the original primary upon its restoration

As you say you thought VVR was connected, you should check the system logs (which should tell you when VVR became disconnected and connected again), and you can check the prod VCS logs to confirm the VVR group was online.
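A minimal sketch of how the engine log could be scanned for that takeover message, assuming the log lines follow the layout of the entry quoted above. The function name `find_takeovers` and the regular expression are illustrative, not part of VCS.

```python
import re

# Sample line copied from the engine log quoted in this thread.
SAMPLE_LOG = """\
2012/09/30 00:26:54 VCS INFO V-16-20012-80 (node3) RVGPrimary:PRI-RVG:online:RVG RVG takeover successful; use vxrvg resync command to resynchronize the original primary upon its restoration
"""

# Assumed layout: "<date> <time> VCS <severity> <message-id> (<node>) <text>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) VCS "
    r"(?P<severity>\w+) (?P<msgid>V-[\d-]+) \((?P<node>[^)]+)\) (?P<text>.*)$"
)


def find_takeovers(log_text):
    """Return (date, time, node) for every line reporting an RVG takeover."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and "takeover successful" in match.group("text"):
            hits.append((match.group("date"), match.group("time"), match.group("node")))
    return hits


takeovers = find_takeovers(SAMPLE_LOG)
print(takeovers)
```

An empty result for the switchover window would suggest a migrate rather than a takeover, but only if the log layout assumption holds.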

The replication group was manually taken offline:

 

2012/09/30 01:31:58 VCS INFO V-16-1-50133 User admin has logged in from ::ffff:10.200.3.15
2012/09/30 01:37:32 VCS INFO V-16-1-50135 User admin fired command: hagrp -offline HomeApp-REPLICATION  node3  localclus  from ::ffff:10.200.3.15
2012/09/30 01:37:32 VCS NOTICE V-16-1-10167 Initiating manual offline of group HomeApp-REPLICATION on system node3
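To audit who ran what, the "fired command" entries can be pulled out of the engine log. This is a rough sketch that assumes the line layout shown in the three entries above; `fired_commands` is a hypothetical helper name.

```python
import re

# Lines copied from the engine log excerpt in this thread.
SAMPLE_LOG = """\
2012/09/30 01:31:58 VCS INFO V-16-1-50133 User admin has logged in from ::ffff:10.200.3.15
2012/09/30 01:37:32 VCS INFO V-16-1-50135 User admin fired command: hagrp -offline HomeApp-REPLICATION  node3  localclus  from ::ffff:10.200.3.15
2012/09/30 01:37:32 VCS NOTICE V-16-1-10167 Initiating manual offline of group HomeApp-REPLICATION on system node3
"""

# Assumed layout: "<timestamp> VCS <sev> <id> User <user> fired command: <cmd> from <addr>"
FIRED_RE = re.compile(
    r"^(?P<stamp>\S+ \S+) VCS \w+ V-[\d-]+ User (?P<user>\S+) "
    r"fired command: (?P<cmd>.*?)\s+from (?P<addr>\S+)$"
)


def fired_commands(log_text):
    """Return a list of dicts describing each manually fired command."""
    found = []
    for line in log_text.splitlines():
        match = FIRED_RE.match(line)
        if match:
            found.append({
                "when": match.group("stamp"),
                "user": match.group("user"),
                # Collapse the repeated spaces the log puts between arguments.
                "command": " ".join(match.group("cmd").split()),
                "from": match.group("addr"),
            })
    return found


commands = fired_commands(SAMPLE_LOG)
for entry in commands:
    print(entry["when"], entry["user"], entry["command"])
```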

I can never remember which node to run fbsync from, so if it fails on one node, I just try the other side. Rebooting should not have resolved your problem unless you have the AutoResync attribute set to 1 in the RVGPrimary agent, which I would not recommend.

The GUI may have become disconnected if it was connected to the application virtual IP instead of the IP in the ClusterService group or a physical IP.

Mike

 

 

Zahid.Haseeb's picture

 VVR at DR was not connected to prod site so VVR did a takeover,

After the takeover, who becomes primary?

 Prod was not changed from primary to secondary

I did not understand this. Please elaborate.

 

Zahid.Haseeb's picture

One more question Mike:

If I run fbsync, which site will be the primary and which the secondary? How is this decided?

(My fear is that the node with the up-to-date data, the real primary, must not be made secondary, otherwise the secondary would replicate all its old data to the primary :(   )

mikebounds's picture

 

VVR at DR was not connected to prod site so VVR did a takeover,

After takeover, DR becomes the new primary, but the prod site is still a primary too - see next bit:

 Prod was not changed from primary to secondary

As VVR is not connected, Prod cannot be changed to a secondary (if VVR is connected then VVR does a migrate and roles are reversed, so you have a primary and secondary), and so you have a primary-primary config.

In a primary-primary config, when the 2 nodes next connect, the oldest primary changes to an acting secondary (although some commands still show Primary-Primary), and so the newly promoted primary stays as the primary.  When you do fbsync, the acting secondary becomes a full secondary.  So in your scenario, after fbsync, the Prod site would have become secondary, as the takeover at DR promoted DR to the newest primary.  You cannot get VVR to sync the other way - i.e. suppose you take over at the DR site but don't make any changes, and then the old primary comes back: you cannot make the DR site a secondary again using fbsync - you have to do a resync, or run some low-level force commands if you KNOW VVR was not behind when you ran takeover and no writes have been made since.
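The role rules described above can be written down as a toy model. This is only a sketch of the behaviour as explained in this post, not of VVR internals; the `Site` class, the `promotion_order` field and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    role: str             # "primary", "acting_secondary" or "secondary"
    promotion_order: int  # higher number = promoted to primary more recently


def reconnect(a, b):
    """On reconnect in primary-primary, the older primary becomes acting secondary."""
    if a.role == "primary" and b.role == "primary":
        older = a if a.promotion_order < b.promotion_order else b
        older.role = "acting_secondary"


def fbsync(a, b):
    """fbsync turns the acting secondary into a full secondary."""
    for site in (a, b):
        if site.role == "acting_secondary":
            site.role = "secondary"


# Scenario from this thread: prod was the original primary, DR did a takeover later.
prod = Site("prod", "primary", promotion_order=1)
dr = Site("dr", "primary", promotion_order=2)

reconnect(prod, dr)
fbsync(prod, dr)
print(prod.role, dr.role)
```

In this model the direction is fixed by which site was promoted last, not by which site holds the newest data, which is why fbsync is the wrong tool when writes continued at prod after the takeover.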

Mike

 

Zahid.Haseeb's picture

1.) Do I now have to sync the volumes from primary to secondary, meaning replication from the very start again?

If yes, then kindly share what I should do at the DR site, for example: should the volumes be unmounted?

 

mikebounds's picture

You should NOT use fbsync as this would sync the wrong way for you, and what has occurred in your scenario may be why fbsync failed when you tried it, as usually this happens:

Suppose Prod is primary, DR is secondary and when you do takeover, VVR is behind, so then:

  1. Prod has some outstanding writes that were written BEFORE takeover (in SRL)
  2. DR has outstanding writes that are written AFTER takeover  (tracked in DCM)

But you have writes on Prod, written after takeover, as you went back to using Prod, which is the big difference (and you have no writes at DR written after takeover).  So normally writes are NOT written at the old primary after takeover is run, which is probably why fbsync did not work for you.

So you will need to do a full resync, as I believe DCM cannot replay the other way - see https://www-secure.symantec.com/connect/ideas/allow-vradmin-fbsync-sync-both-ways

If Prod is not an acting secondary, then you should just be able to run "vxrvg makesecondary" at DR (may need to detach rlinks first), and then do a full resync (vradmin -a startrep).  If Prod is an acting secondary, then you will probably have to remove rlinks and RVG and this can be difficult, as "vradmin delsec" will not work in a primary-primary (acting secondary) setup, so you will have to delete objects manually - see post https://www-secure.symantec.com/connect/forums/vvr-solaris-logs-synchronization-issue-failback-original-primary.  

I have used low-level commands to delete rlinks and RVGs quite a few times, but it is quite involved and so as you have a call logged, I would ask support to talk you through running commands to sort this out.

Mike

mikebounds's picture

The Application service group should be offline at DR (so volumes are not mounted) if you are using Prod, and while sorting VVR out you should freeze the replication service groups on both sites so they do not fault.

Mike

mikebounds's picture

Sorry, I thought post https://www-secure.symantec.com/connect/forums/vvr-solaris-logs-synchronization-issue-failback-original-primary had low-level commands to delete VVR objects, but this post contains low-level commands to do a migrate.  Low-level commands to delete VVR objects are:

To remove secondary (run on both sides):

vxrlink det

vxrlink dis

vxedit rm rlink_name

You may need "-f" on some of above

To delete RVG at DR, as this is a primary, you should be able to use "vradmin delpri", but if this doesn't work then you need to run:

"vxvol dis" on all volumes, including SRL (may need to run "vxrvg stop" first)

vxedit rm rvg_name

You will probably need to run above to delete RVG at prod too, which you can do while application is still running.

Then you just need to run:

vradmin createpri

vradmin addsec

vradmin -a startrep

As I said in an earlier post, you may want Support to talk you through all this.
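The order of the steps above can be captured as a dry-run checklist. The sketch below only builds and prints the plan; it does not execute anything. The command strings are the abbreviated forms from this post (real invocations need disk group, RVG, rlink and volume arguments that are not shown here), and the assumption that the rebuild commands are run from prod is mine.

```python
# Abbreviated commands exactly as listed in the post; arguments are omitted.
REMOVE_RLINK = ["vxrlink det", "vxrlink dis", "vxedit rm rlink_name"]
REMOVE_RVG = ["vxrvg stop", "vxvol dis", "vxedit rm rvg_name"]
REBUILD = ["vradmin createpri", "vradmin addsec", "vradmin -a startrep"]


def build_plan(sites=("prod", "dr"), rebuild_from="prod"):
    """Return an ordered list of (site, command) steps; nothing is executed."""
    plan = []
    # 1. Remove the rlink on both sides: stop, disassociate, remove.
    for site in sites:
        plan.extend((site, cmd) for cmd in REMOVE_RLINK)
    # 2. Remove the RVG on both sides (only needed where delpri does not work).
    for site in sites:
        plan.extend((site, cmd) for cmd in REMOVE_RVG)
    # 3. Recreate the replication objects and start a full sync.
    plan.extend((rebuild_from, cmd) for cmd in REBUILD)
    return plan


plan = build_plan()
for number, (site, cmd) in enumerate(plan, start=1):
    print(f"{number:2d}. [{site}] {cmd}")
```

Keeping the plan as data makes it easy to review with Support before any command is typed on a live node.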

Mike

Zahid.Haseeb's picture

Unfortunately support responded to me very late, so this is what I did:

- Mark replication resources non-critical at both sites (primary and DR)

- freeze application and replication service groups at both sites

- I tried delsec and delpri, which I thought would remove all the replication configuration so that I would only need to run the createpri, addsec and startrep commands, but delsec failed with a configuration error, which was the (primary - primary) state shown by repstatus

"Plan B "

- Destroy DG at DR site.

- create DG at DR site

- create volumes at DR site

- detach and disassociate rlink at primary site

- delpri with -f option at primary site

- createpri at primary site

- addsec at primary site

- start replication from primary site

- unfreeze service groups at both sites

- mark resources critical at both sites

=============

 

Zahid.Haseeb's picture

This is what I feel happened with us:

- We switched over to the DR site at 2012/09/30 00:25:52, as shown in the engine log.

- 44 seconds after the switchover command was executed, at 00:26:36 (as per the dmesg log), the connectivity/replication was disconnected.

- The switchover command was executed before replication disconnected, so the switchover signal was passed to the DR site, as replication disconnected 44 seconds after the switchover.

- DR site resources started at 00:26:54; the first resource was RVGPrimary. When RVGPrimary started, it saw that the replication/connectivity links were disconnected and did a takeover instead of a migrate.

- The last resource started successfully at the DR site at 00:27:30.

- The client did not know the status of the DR site.

- The client did not understand the situation, because his hagui management console session to the DR site (opened from the primary site) was also disconnected. He opened hagui for the primary site cluster and clicked to online the Application service group at 00:46:26 (00:46:26 VCS INFO V-16-1-50859 Attempting to switch group APPLICATION from system node3 to system node1).

- Replication was still disconnected.

- The DR site started to take its resources down. The last resource went down at the DR site at 00:47:00 (00:47:00 VCS INFO V-16-1-10305 Resource PRI-RVG).

- The first resource came up on primary site node1 at 00:47:03 (00:47:03 VCS NOTICE V-16-1-50982 Resource PRI-RVG is online on system node1).
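The timeline above can be sanity-checked with a little date arithmetic. This sketch uses only the timestamps quoted in the list; the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d %H:%M:%S"

# Timestamps taken from the timeline in this post.
switch_started = datetime.strptime("2012/09/30 00:25:52", FMT)
takeover_done = datetime.strptime("2012/09/30 00:26:54", FMT)
switch_back = datetime.strptime("2012/09/30 00:46:26", FMT)

# The link dropped 44 seconds after the switchover command.
link_lost = switch_started + timedelta(seconds=44)

print(link_lost.strftime("%H:%M:%S"))               # 00:26:36
print((takeover_done - link_lost).total_seconds())  # 18.0
print(switch_back - takeover_done)                  # 0:19:32
```

So RVGPrimary came online 18 seconds after the link was lost, which is consistent with it choosing a takeover, and DR then ran as a primary for roughly 19 and a half minutes before the switch back.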

 

THIS IS HOW I FEEL THE SITUATION OCCURRED. PLEASE CORRECT ME IF I AM WRONG.

...............

One question after analysing the situation (the engine log says "2012/09/30 00:46:26 VCS INFO V-16-1-50859 Attempting to switch group APPLICATION--service groupxxxx from system node3 to system node1"):

If replication/connectivity was still disconnected, how did DR get the information about the switchover at 00:46:26 and start switching group APPLICATION from node3 to node1, with the last resource going down at 00:47:00 at the DR site?

 

 

mikebounds's picture

This looks ok, except you have missed the step to delete the rlink, and delpri may fail.  A few additional comments:

createpri creates RVG at primary (1 object)

addsec creates RVG at secondary and rlink at BOTH sites (3 objects)

delpri and delsec do the reverse.

At the point where you have no RVG and rlink at the DR site and you remove the rlink at the prod site, you MAY be able to just run addsec, rather than re-creating the primary, but I think this will only work if prod has not become an acting secondary. If addsec does NOT work at this stage, then delpri probably won't either, and you would need to go through the steps in my previous post.

Rather than destroying the DG at the secondary, I would try to delete the replication objects - I guess this is harder, but you may get a better understanding of VVR objects, and if it doesn't work, then you can run "destroy diskgroup" at any point (although I have never run a destroy diskgroup containing VVR objects).

To delete replication objects the process is basically "stop, disassociate, remove" (for rlink first and then RVG):

STOP: ("vxrlink det" stops replication  |  "vxrvg stop" stops RVG)

DISASSOCIATE: ("vxrlink dis" disassociates rlink from RVG  |  "vxvol dis" disassociates volumes from RVG)

REMOVE: ("vxedit rm object" removes disassociated object)

Mike

Zahid.Haseeb's picture

To delete replication objects the process is basically "stop, disassociate, remove" (for rlink first and then RVG):

STOP: ("vxrlink det" stops replication  |  "vxrvg stop" stops RVG)

DISASSOCIATE: ("vxrlink dis" disassociates rlink from RVG  |  "vxvol dis" disassociates volumes from RVG)

REMOVE: ("vxedit rm object" removes disassociated object)

 

.....

Hmmm

- So you mean: before running delsec, first det and dis the rlinks and then remove the rlink

- Second, run delsec, which removes the RVG at the secondary site

- Now I only need to run addsec

Correct? :)

 

===============================

===============================

My question is still pending:

One question after analysing the situation (the engine log says "2012/09/30 00:46:26 VCS INFO V-16-1-50859 Attempting to switch group APPLICATION--service groupxxxx from system node3 to system node1"):

If replication/connectivity was still disconnected, how did DR get the information about the switchover at 00:46:26 and start switching group APPLICATION from node3 to node1, with the last resource going down at 00:47:00 at the DR site?

mikebounds's picture

Deleting replication objects using the low-level commands ("stop, disassociate, remove") is an alternative to delsec and delpri, as delsec and delpri will only work for clean states.

If replication is disconnected then this means there is a problem between the replication IPs (or a problem on the node, such as the replication service group being offline or a problem with the VVR daemons).  The cluster gets its information using the cluster IPs, which should be different IPs to replication, so replication being disconnected should not affect communication between clusters.

I suspect you have incorrectly configured the network somewhere, or your script that removes the Application IP is breaking things; for example, it may be messing up the routing table so that communication between the replication IPs stops working.

Mike

SOLUTION
Zahid.Haseeb's picture

Is there any command with which we can see whether a migration or a takeover occurred at the DR site?

(As in our case, where a takeover occurred at the DR site instead of a migration, due to the replication disconnection.)

mikebounds's picture

Yes as per log shown in first reply:

 

2012/09/30 00:26:54 VCS INFO V-16-20012-80 (node3) RVGPrimary:PRI-RVG:online:RVG RVG takeover successful; use vxrvg resync command to resynchronize the original primary upon its restoration
 
Mike

Zahid.Haseeb's picture

Ahh! You misunderstood this point, sorry for that. I meant any command which can show us whether a takeover was done after a switchover, for example something like "show takeover" (from an administration point of view).

mikebounds's picture

If a takeover was done then DCM should be in use (as long as you did NOT select "disaster" in response to a GCO cluster-down event, which does a takeover without DCM tracking, since by choosing disaster you are saying the cluster is never coming back - it is destroyed), whereas if a migrate is done, then SRL will be in use. I think vradmin repstatus shows you whether DCM or SRL is being used (you can also try vxrlink status and vxprint -Pl).
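A rough sketch of turning that rule of thumb into a check on saved repstatus output. It assumes the output contains a line of the form "Logging to: SRL" or "Logging to: DCM"; that field name is an assumption here and should be verified against real output on your version. The function names are hypothetical.

```python
def log_in_use(repstatus_text):
    """Return "DCM", "SRL" or None, based on an ASSUMED "Logging to:" line."""
    for line in repstatus_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key.strip().lower() == "logging to":
            words = value.split()
            if words and words[0].upper() in ("DCM", "SRL"):
                return words[0].upper()
    return None


def likely_role_change(repstatus_text):
    """Apply the rule of thumb from this post: DCM -> takeover, SRL -> migrate."""
    return {"DCM": "takeover", "SRL": "migrate"}.get(log_in_use(repstatus_text), "unknown")


# Hypothetical one-line excerpts, not captured from a real system.
print(likely_role_change("Logging to: DCM"))
print(likely_role_change("Logging to: SRL"))
```

SRL is also what normal replication uses, so "migrate" here only means "no sign of a takeover"; the engine log entry remains the direct evidence.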

Mike

mikebounds's picture

Replication should not disconnect when you do a switch.  What you should have with GCO is at least 3 separate virtual IPs at each site:

  1. Cluster IP - this IP should be in the ClusterService group and set in ClusterAddress Cluster attribute and used in GCO Icmp heartbeats
  2. Application IP for your application clients that should be in your Application service group
  3. Replication IP that should be the one shown in vxprint -Pl and be in your replication service group

When you are using the VCS GUI, you should connect to the Cluster IP or the hostname of any of the cluster nodes - you should not connect to the Application or Replication IP. If you are not sure your config is correct, then please post your main.cf from both sites and the output from vxprint -Pl.
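The layout rule can be expressed as a small consistency check. This is a sketch only: the dictionary keys, the function name `check_site` and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical and do not come from any real main.cf.

```python
def check_site(site):
    """Return a list of problems with one site's IP layout (empty list = OK)."""
    problems = []
    ips = [site["cluster_ip"], site["application_ip"], site["replication_ip"]]
    if len(set(ips)) != 3:
        problems.append("cluster, application and replication IPs must be distinct")
    if site["gui_target"] in (site["application_ip"], site["replication_ip"]):
        problems.append("GUI should connect to the cluster IP or a node hostname")
    return problems


# Hypothetical site where the GUI was pointed at the application IP.
bad_site = {
    "cluster_ip": "192.0.2.10",
    "application_ip": "192.0.2.20",
    "replication_ip": "192.0.2.30",
    "gui_target": "192.0.2.20",
}
good_site = dict(bad_site, gui_target="192.0.2.10")

print(check_site(bad_site))
print(check_site(good_site))
```

A GUI session pointed at the application IP would drop whenever the application group moves, which matches the disconnection seen during the switchover.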

Mike

Zahid.Haseeb's picture

"Suppose"

1.) A VIRTUAL IP IS ASSIGNED HERE FOR CLUSTER IP

- Physical IP

Node1 (eth0  10.0.0.1)    Node2 (eth0  10.0.0.2)    NodeDR (eth0  10.0.0.3)

I usually connect hagui using this physical IP.

The cluster IP is assigned as a virtual IP, for example (Node1  eth0:1  IP1)    (NodeDR  eth0:1  IP2).

2.) APPLICATION IP

- Physical IP

Node1 (eth3  sameIP)    Node2 (eth3  sameIP but not assigned)    NodeDR (eth3  sameIP but not assigned)

We have a script configured as an Application resource under the service group. This script assigns an IP to a particular interface, say eth3. The IP is assigned only on eth3 of the node where this resource is online (say Node1).

Users connect to the application via this IP.

3.) REPLICATION IP

- We have a separate replication IP.

mikebounds's picture

You can connect to hagui with physical IP or cluster IP.

If you connect via a physical IP, then if that node goes down, then as long as the client you are running hagui on can resolve the name of the other cluster node to get the IP, you will automatically be connected via the other cluster node and this is quicker than the cluster IP as you have to wait for cluster IP to failover, but this is quite quick.  Downside of using physical IP, is that hagui doesn't know name of other cluster node until it connects, so if physical node configured in hagui is down when you try to connect, it will not work and you will have to manually change "connect information" in hagui to the other node.

Mike

Zahid.Haseeb's picture

Thanks all for the kind replies, especially Mike. I restored my DR site with all the kind support provided above. I still have some concerns but no time to discuss them now. Thanks all again :) Giving thumbs up to the helpful posts.
