
Allow vradmin fbsync to sync both ways

Created: 02 Oct 2012 • Updated: 14 Jan 2013 | 5 comments
mikebounds's picture
Status: Reviewed

Currently if you use vradmin fbsync then:

  1. Writes in the SRL on the old Primary (not the node the takeover was run on) are converted into DCM
  2. Writes made on the new Primary (where the takeover was run), which are tracked in the DCM, are merged with the DCM at the old Primary
  3. The old Primary is made a secondary and the writes from the merged DCM are replicated from the new Primary to the old Primary (now a secondary)
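
The three steps above can be modelled with a toy bitmap. This is only a rough sketch under assumptions: it treats the DCM as a list of 0/1 flags with one flag per block, and the function names (`srl_to_dcm`, `merge_dcm`, `blocks_to_replicate`) are made up for illustration and are not VVR commands or APIs.

```python
def srl_to_dcm(dcm, srl_block_indexes):
    """Step 1: mark every block with a pending SRL write as dirty."""
    out = list(dcm)
    for i in srl_block_indexes:
        out[i] = 1
    return out


def merge_dcm(dcm_a, dcm_b):
    """Step 2: a block is dirty if either site changed it (bitwise OR)."""
    return [a | b for a, b in zip(dcm_a, dcm_b)]


def blocks_to_replicate(merged):
    """Step 3: only the dirty blocks need to be copied between sites."""
    return [i for i, bit in enumerate(merged) if bit]


# Hypothetical state: the old Primary has SRL writes pending for blocks 0 and 2,
# and the new Primary has tracked writes to blocks 2 and 4 in its DCM.
old_primary_dcm = srl_to_dcm([0, 0, 0, 0, 0, 0], srl_block_indexes=[0, 2])
new_primary_dcm = [0, 0, 1, 0, 1, 0]

merged = merge_dcm(old_primary_dcm, new_primary_dcm)
dirty = blocks_to_replicate(merged)
print(merged, dirty)
```

Nothing in the merged bitmap records which site made a change, which is why the same bitmap could in principle drive a copy in either direction.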

What I would like to see in vradmin fbsync is a flag to replicate the other way, so that writes are replicated from the old Primary to the new Primary and the new Primary is made a secondary.

Let's look at an example:

SRL has outstanding writes at Prod site and a vradmin takeover is run at DR as Prod can't be recovered.  So we have:

      P R O D                   D R
      =======                   ===
Blocks  DCM     SRL     Blocks  DCM     SRL

Z B Y   0 0 0   YZ      A B C   0 0 0   
D E F   0 0 0           D E F   0 0 0

A few writes are then made at the new primary in DR, and then Prod is recovered.  So we have:

      P R O D                   D R
      =======                   === 
Blocks  DCM     SRL     Blocks  DCM     SRL

Z B Y   0 0 0   YZ      A B X   0 0 1   
D E F   0 0 0           D V F   0 1 0

fbsync is then initiated, so the SRL is converted into DCM at Prod and the DCMs are merged, giving:

      P R O D                   D R
      =======                   === 
Blocks  DCM     SRL     Blocks  DCM     SRL

Z B Y   1 0 1           A B X   1 0 1   
D E F   0 1 0           D V F   0 1 0

What I would like to see at this point is an option to replicate from Prod to DR, so that the changes at DR are lost and we get:

      P R O D                                       D R
      =======                                       === 
Blocks  DCM     SRL    Blocks replicated    Blocks  DCM     SRL

Z B Y   1 0 1            >>> Z Y E >>>      Z B Y   1 0 1   
D E F   0 1 0                --->>          D E F   0 1 0

But the only option is the other way, so that changes at Prod are lost, so we get:

      P R O D                                       D R
      =======                                       === 
Blocks  DCM     SRL    Blocks replicated    Blocks  DCM     SRL

A B X   1 0 1            <<< A X V <<<      A B X   1 0 1   
D V F   0 1 0                <<---          D V F   0 1 0
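
The two outcomes above can be reproduced with a small simulation. This is a sketch only: `fbsync` here is a hypothetical Python function standing in for the DCM replay, with blocks as single letters and the merged DCM as a flat list of flags. It is not how VVR is implemented.

```python
def fbsync(src_blocks, dst_blocks, merged_dcm):
    """Copy only the blocks flagged in the merged DCM from src to dst.

    Returns the destination contents after the sync and the blocks sent.
    """
    result = list(dst_blocks)
    sent = []
    for i, is_dirty in enumerate(merged_dcm):
        if is_dirty:
            result[i] = src_blocks[i]
            sent.append(src_blocks[i])
    return result, sent


# State after the SRL has been converted and the DCMs merged (from the example).
prod = list("ZBYDEF")
dr = list("ABXDVF")
merged_dcm = [1, 0, 1, 0, 1, 0]

# The requested option: replicate Prod -> DR, so the DR changes are lost.
dr_after, sent_to_dr = fbsync(prod, dr, merged_dcm)

# What fbsync does today: replicate DR -> Prod, so the Prod changes are lost.
prod_after, sent_to_prod = fbsync(dr, prod, merged_dcm)

print("".join(dr_after), sent_to_dr)
print("".join(prod_after), sent_to_prod)
```

In this model the only difference between the two cases is which side is passed as the source; the merged DCM is identical for both.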

As shown above, I can't see any technical issue with doing this; the option just needs to be made available.  In 10 years as a consultant, I saw this being required a few times.  The workaround is to do a full resync, but this means you first have to manually make DR a secondary, and if Prod has become an acting Secondary, then it is difficult, if not impossible, to make Prod a primary; I have had to remove the RLink and RVG and recreate them to do this.  And of course a full resync is a lot slower than DCM replay.

This option could be used for the discussion at https://www-secure.symantec.com/connect/forums/link-between-primary-site-and-dr-site-disconnected , but as it is not available, they will have to use the workaround and do a full resync. Also note that in this scenario nothing was written at DR, so the DCM bitmap at DR will be empty, and so rather than converting the SRL to DCM at Prod, you could just drain the SRL at Prod.  Similarly, if VVR wrote to the SRL at DR as well as marking the DCM when you take over at DR, then if fbsync was run the normal way AND the DCM at Prod was empty, the SRL could be drained at DR rather than using the DCM.
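
The choice between draining the SRL and replaying the DCM could be expressed as a simple check. This is a hypothetical sketch of the decision only: `choose_sync_method` and its return values are invented for illustration, and the DCM is again modelled as a flat list of 0/1 flags.

```python
def choose_sync_method(source_has_srl_writes, target_dcm):
    """Pick how to bring the target (the future secondary) up to date.

    If the target has no tracked changes of its own, the source's SRL can
    simply be drained. Otherwise the SRL has to be converted to DCM, merged
    with the target's DCM, and the merged DCM replayed.
    """
    target_is_clean = not any(target_dcm)
    if source_has_srl_writes and target_is_clean:
        return "drain_srl"
    return "dcm_replay"


# Nothing was written at DR after the takeover: drain the SRL at Prod.
print(choose_sync_method(True, [0, 0, 0, 0, 0, 0]))

# DR has its own changes: fall back to the merged DCM replay.
print(choose_sync_method(True, [0, 0, 1, 0, 1, 0]))
```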

Mike

5 Comments

RyanJancaitis's picture

Mike,

This should be taken care of by the primary-elect feature introduced in 5.1SP1, and detailed in the Replication admin guide on SORT.

Primary-elect allows the user to control the direction of a sync by electing a new primary after a network disruption or outage.

This is available only with a GCO license.

Can you review this feature and see if it fits your needs?

-ryan

mikebounds's picture

Initially I couldn't find any information as to what this does as the VVR admin guide says:

"For detailed information on configuring and using the primary-elect feature, see Veritas Cluster Server Agents for Veritas Volume Replicator Configuration Guide."

and the VCS Agents for VVR guide says:

"For a detailed description of the primary-elect feature, see Veritas Volume Replicator Administrator's Guide."

But on reading further through both guides, I found some more info. I have a few questions on the info in the VCS Agents for VVR guide:
  1. In the description of the AutoSync attribute it says for value "2" that "The RVGPrimary agent also creates space-optimized snapshots for all the data volumes in the RVG resource", but then I can't find any more info on this. In the section "Configuring and using the primary-elect feature" I would expect to see something like "you need to prepare volumes in the diskgroup" for snapshots, otherwise the snapshot will fail.  The guide seems to allude to this in the troubleshooting section, but perhaps you could clarify whether a snapshot is mandatory if you use the Primary-Elect feature (and therefore you must prepare all volumes for snapshots), and also whether a snapshot failure will cause the RVGPrimary resource to fail or whether it will still online.

  2. The troubleshooting section says "Did not restore the data volumes of the RVG from the space-optimized snapshots". Does this mean the primary-elect feature uses the snapshot to restore the state of the elected secondary rather than using the DCM, as is done with "vradmin fbsync"?

  3. In the troubleshooting section it says the command "vxrvg -E -F -r makeprimary rvg" is run.  What is the "-E" flag, as this is not documented in the vxrvg manual?

  4. The manual for vxrvg says use the "-r" option "to perform an automatic resync of the RVG after a failover". How does this do the resync: from scratch, or using just the changes in the DCM since the takeover (or does it use the snapshot)?
Mike

UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows

If this post has answered your question then please click on "Mark as solution" link below

RyanJancaitis's picture

Mike

1a. Yes, snapshots are required for the Primary-Elect feature. The guide intentionally does not mention snapshots, as it's meant to be a transparent operation.  Snapshots are created during takeover and exist until a primary has been elected.  Preparing the volumes for snapshots is done automatically by the RVGPrimary agent.

1b. It will fail.  The steps to recover from snapshot creation failure are already detailed in the troubleshooting section.

2. Once a network communication failure occurs on the primary (Node A), the secondary (Node B) will automatically take over as the primary and will create space-optimized snapshots. VVR will enter Primary-Elect mode with both nodes running as primary.  Should Node B be elected as the new primary, the application on Node A will be taken offline, and an entry point for Node B will be created by 1) deleting all SO snapshots, 2) restoring heartbeat communications, and 3) performing a fast failback sync from B to A.  Node B is now the primary and Node A is the secondary. Primary-elect only uses the snapshot to restore data when the original primary is re-elected as primary going forward.

3. The "-E" option is intentionally not documented, as only the RVGPrimary agent is expected to use this command.  I understand the confusion and we will clarify the document.

4. It performs a fast failback sync; it does not use the snapshot for the sync.

mikebounds's picture

Thanks Ryan,

In terms of the feature I was looking for, this is not it.  I already used this workaround manually a few years ago in 5.0, i.e. the steps I carried out manually were:

  1. Take an SO snapshot before takeover
  2. If the original primary is required to be the new primary going forward, then instead of doing fbsync, use the SO snapshot to do a resync from replica and run vxrvg makesecondary to make the new primary a secondary

One problem I had with this procedure is that you had to do the above before the old primary communicates with the new primary. If that happens, the old primary becomes an acting secondary, and in 5.0 you couldn't use vxrvg makeprimary to make an "acting secondary" a primary. So I am curious how you get round this issue with PrimaryElect: is vxrvg makeprimary now able to make an "acting secondary" a primary, or do you somehow prevent the "acting secondary" state occurring?

So it seems the PrimaryElect feature is just an automation of the manual tasks for the workaround of NOT having the ability for fbsync to sync both ways using the DCM.  So I am looking for the feature to be implemented properly in VVR. This would be a lot simpler than the PrimaryElect feature, as it would just be a flag to vradmin fbsync to say "sync the other way", and unless I am missing something this would be technically very quick and easy to code, as you are simply coding to copy the merged DCM in the other direction.

I don't like the way the PrimaryElect feature has been implemented, because a vxsnap prepare will only work if there is free space in the diskgroup, and I find many customers have no free space in their diskgroups as they create volumes of maximum size. You will also need space for the cache object, and I don't know how the agent decides how big to create this.  This is made much worse by the RVGPrimary agent failing if there is no space available, and worse still, the agent does not explain what it is doing or tell the customer they need free space in the diskgroup, so by enabling this feature they could cause a DR failover to fail.

Also, I have had problems with SO snapshots on RVGs with a lot of volumes.  I have a customer with 500+ volumes and a restore from snapshot takes several hours, even though there are next to no changes.  I replicated this issue on my VMware setup with 500 small 10MB volumes with no changes in the SO snapshot. From what I can remember, if you have 10 or 20 volumes it resyncs in seconds, but the more volumes you have, the more the time increases, exponentially for some reason.  I logged a call for this (about 2 years ago) and Support said they didn't see the same thing, but the customer didn't follow it through, as I found a workaround using VVR checkpoints to avoid having to use SO snapshots.

Mike

RyanJancaitis's picture

Thanks for the feedback.  We'll take this item under review as a roadmap item for a post-6.1 release.
