Can fbsync do a complete sync if the Primary Site disks fail?
Environment
OS = RHEL 5
SFHA/DR version = 5.0 MP3 RP3
Primary Site = Two Nodes
DR Site = Single Node
Disk Group = one (with four SAN Disks)
Work Performed
Two of the four SAN disks at my Primary Site, which were shared between both nodes, failed. My application went down at the Primary Site, so I brought it up at the DR Site successfully. At the Primary Site I then replaced both bad disks with two new disks using options 4 and 5 of the vxdiskadm command. Finally, from the DR Site I ran the fbsync command, which started successfully.
My question:
Was I right or wrong to run the fbsync command? This is not a partial sync: everything has to be synced because my Primary Site has fresh disks, and I think fbsync does an incremental sync. Or can fbsync also sync the complete data from the DR Site to the Primary in my case?
Comments required, please.
Comments
I am surprised the fbsync command worked, but I would think there is a good chance your primary data is corrupt - I can see 2 possibilities:
I would run a space optimised snapshot on the primary site and mount the data to check, which probably won't mount if the data is corrupt. You could also try to mount the volumes read-only ("-o ro" I think) on the Primary - this is not supported, because if you try to read the files, they could be changing with VVR and these changes won't be in the primary node's cache, but they should mount, and if they don't mount, then your data is almost certainly corrupt.
If your data is corrupt, then just run "vradmin -f stoprep" and "vradmin -a startrep" to resync the data.
Mike
UK Symantec Consultant in VCS, GCO, SF, VVR, VxAT on Solaris, AIX, HP-ux, Linux & Windows
If this post has helped you, please vote or mark as solution
First, thanks for your words, Mike.
One question is coming to mind. As I said, I brought my application up at the DR Site (I did not do a switchover), and I believe my client also restarted the Primary Site machine while I was doing so. That means it might have been a takeover, because before executing the fbsync command I checked repstatus and saw a Primary - Primary configuration.
So let me think about your words, in which you said:
If your data is corrupt, then just run "vradmin -f stoprep" and "vradmin -a startrep" to resync the data.
Wouldn't the above commands fail, as the configuration was Primary - Primary?
Any comment will be appreciated. Mark as Solution if your query is resolved
__________________
Thanks in Advance
Zahid Haseeb
zahidhaseeb.wordpress.com
Supposed environment:
SFHA/DR setup
Primary Site = Two Cluster Nodes
DR Site = One Cluster Node
What would the steps be if both Primary Site servers shut down abnormally due to a power outage and we start the application at the DR Site (via the DR Site Cluster Java Console)?
When power is restored and the Primary Site servers are up, we find that the Primary Site central storage/SAN LUNs shared by both nodes are dead/corrupted/faulty. So we replace the SAN LUNs using options 4 and 5 of vxdiskadm. Our next activity is to synchronize our Primary Site.
When we check the replication status via the repstatus command, we see a Primary - Primary configuration. In this situation, what is the roadmap?
Everyone's comments will be appreciated, please.
Zahid Haseeb
Zahid,
Suggested reading:
Veritas Volume Replicator 5.0MP3 Linux Administrator's guide -> Transferring the Primary
https://sort.symantec.com/public/documents/sf/5.0M...
or https://d1mj3xqaoh14j0.cloudfront.net/public/docum... (p239)
Particularly the sections:
Taking over from an original Primary (p247)
https://sort.symantec.com/public/documents/sf/5.0M...
--------------------
The takeover procedure involves transferring the Primary role from an original Primary to a Secondary. When the original Primary fails or is destroyed because of a disaster, the takeover procedure enables you to convert a consistent Secondary to a Primary. The takeover of a Primary role by a Secondary is useful when the Primary experiences unscheduled downtimes or is destroyed because of a disaster.
--------------------
Failing back to the original Primary (p255)
https://sort.symantec.com/public/documents/sf/5.0M...
--------------------
After an unexpected failure, a failed Primary host might start up to find that one of its Secondaries has been promoted to a Primary by a takeover. This happens when a Secondary of this Primary has taken over the Primary role because of the unexpected outage on this Primary. The process of transferring the role of the Primary back to this original Primary is called failback.
--------------------
Also: Veritas Cluster Server Agents for Veritas Volume Replicator Configuration Guide (5.0MP3 Linux)
https://sort.symantec.com/public/documents/sf/5.0M...
https://d1mj3xqaoh14j0.cloudfront.net/public/docum...
(probably start at Overview of how to configure VVR in a VCS environment, and work from there)
For other versions/platforms, as always, look for the relevant documents on http://sort.symantec.com/documents - the concepts / overall procedures are basically the same though.
If this post has helped you, please vote or mark as solution
It is ok to run fbsync, if vradmin lets you run fbsync after you have replaced some LUNs, because as you say you need to sort out the Primary-Primary config, but as I said earlier, I think there is a good chance that fbsync won't sync all the data on the replaced LUNs.
Before you run fbsync, I believe if you run vxrlink status at the old primary (where you replaced the LUNs), it should tell you how many bytes are outstanding on the DCM. So if fbsync is going to work, the DCM would have to contain at least the size of the replaced LUNs, and if it doesn't, then vradmin fbsync is not that intelligent, as in point 2 of my first post. I guess vradmin MIGHT mark all data dirty on the DCM for the replaced LUNs as part of running fbsync, so you could also run vxrlink status on DR before running fbsync. After fbsync is run, these 2 DCMs are merged, so you can then check whether the dirty data from the added LUNs is included, but I very much doubt it. If you verify fbsync has not taken the replaced volumes into consideration, then there is no point letting fbsync finish; just run "vradmin -f stoprep" and "vradmin -a startrep" to sync from scratch.
Mike
@ Mike
Do you see any complication if the Mount resource is already online at the DR Site and we run the mount command at the Primary Site with the read-only option?
At both the Primary and Secondary Sites the Mount resource is the parent of the RVG-PRI resource. So will the Mount resource be probed automatically? And if yes, won't the child resource (RVG-PRI in our case) be brought online automatically once the parent (Mount resource) is online?
I just want to make sure that the RVG-PRI stays offline at the Primary Site in every case, as it is already online at the DR Site.
Zahid Haseeb
You could mount using a temporary mount point so VCS shouldn't recognise the resource as online. In theory the worst that could happen is that you get a concurrency violation and VCS will umount the read-only mount (it won't online dependent resources), but I would freeze application service groups on both sides for a "belts and braces" approach.
Mike
Hmm, let me share this with my OS team (my client can't afford extra downtime from a concurrency violation, as it has around 100,000 users). We will do it this way and I will share the result:
- Freeze the Application Service Group on both sites (Primary and DR).
- Try to mount the volume at a temporary place like /mnt, for example:
#mount -t vxfs -o ro /dev/vx/dsk/DG/Volume /mnt
Zahid Haseeb
I just did stoprep and then startrep, and when the sync completed I tried to run the mount command read-only, but there seems to be some issue. See below for reference:
[root@xxxxxx ~]# mount -o ro /dev/vx/dsk/DG/Volume /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/vx/dsk/DG/Volume,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
#tail -f /var/log/messages
Nov 28 15:43:06 xxxxxx kernel: lost page write due to I/O error on VxVM65533
Nov 28 15:43:06 xxxxxx kernel: JBD: recovery failed
Nov 28 15:43:06 xxxxxx kernel: EXT3-fs: error loading journal.
I think we need to run fsck, but I have an ext3 filesystem. Any suggestion what the fsck syntax would be in my situation?
like
fsck -o full -y /dev/vx/dsk/DG/volume
OR
like fsck -f -y /dev/vx/dsk/DG/volume
Zahid Haseeb
You can't run fsck on a readonly filesystem. It could be you just need to specify the filesystem type, or it may be that the inode table was being updated on the primary while you were trying to mount on the secondary (this is why this method is not officially supported). If you have an Enterprise license and a little free space in the diskgroup, then the better method is to run a space optimised snapshot and mount this.
Mike
So you mean the filesystem itself may not be the reason it is not mounting (there could be other reasons). Thanks.
Would you kindly share the command for a space-optimized snapshot, and how much free space should I allow?
I tried pauserep (so the inode table is definitely not being updated at the Secondary/real Primary) and then ran the mount command, but faced the same messages.
Thanks
Zahid Haseeb
The amount of space you need free is:
About the same size of inode table for pointers to real or COW (copy on write) data so about 2%
+
Amount of changes that occur on the primary while snapshot is mounted
+
Any changes you MAY make to filesystem when it is mounted as a space optimized snapshot.
If this is just a test to check the filesystem is ok, then 5% should be plenty
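As a rough worked example of the sizing above, the following shell sketch computes a cache size as a percentage of the volume size, rounded up to a whole gigabyte. The function name cache_size_gb and the 150 GB figure are illustrative assumptions, not part of any Veritas tool; substitute the real size of the volume being snapshotted.

```shell
#!/bin/sh
# Hypothetical helper: cache size in whole GB for a given volume size and percentage.
# Usage: cache_size_gb <volume_size_gb> <percent>
cache_size_gb() {
    volume_gb=$1
    percent=$2
    # Integer ceiling of volume_gb * percent / 100.
    echo $(( (volume_gb * percent + 99) / 100 ))
}

# Example: a 150 GB data volume with the 5% rule of thumb.
echo "Cache size: $(cache_size_gb 150 5) GB"
```

For a 150 GB volume this gives 8 GB (7.5 GB rounded up). If many changes are expected on the primary while the snapshot is mounted, a larger percentage would be the safer input.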
The commands to do snapshot are under section "Space-optimized instant snapshots" in Vxvm admin guide - examples from guide shown below:
Make cache object - this is where space above is stored so make this 5% size of filesystem:
Mike
A bundle of thanks first for your long, detailed reply, Mike.
I have two disks/SAN LUNs of 100 GB each.
I have two volumes on these two disks/LUNs: one is 150 GB, which is the data volume being replicated, and the other, approx. 50 GB, is the SRL volume.
(I don't have free space available in the disk group, so I think I need to add a disk with approx. 10 GB of space to make the volume named cachevol, as per your suggestion above.) Correct?
======
Is this only one thing: "source=myvol/newvol=snap3myvol/cache=cobmydb"? I am not able to understand this, please.
Zahid Haseeb
Yes, you need to add a disk to the diskgroup, as the free space needs to be in the diskgroup.
This is what Symantec calls a tuple - it specifies 3 things, which you separate with "/" as shown, with no spaces, so you have:
source=myvol - The volume you are taking a snapshot of
newvol=snap3myvol - Name of new volume that is the snapshot of your volume
cache=cobmydb - The name of your cache object
Note you can use the same cache object for many volumes, so if you had more than one volume, you don't need a separate cache object and so you would just repeat the "vxsnap prepare" and "vxsnap make" for a second volume if you had one.
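To make the tuple format concrete, here is a small shell sketch that assembles the three parts into the single argument and prints the resulting vxsnap command without running it. The helper name snap_tuple and the disk group name mydg are assumptions for illustration; the volume, snapshot and cache object names are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper: join the three attributes with "/" and no spaces.
# Usage: snap_tuple <source_volume> <snapshot_volume> <cache_object>
snap_tuple() {
    printf 'source=%s/newvol=%s/cache=%s\n' "$1" "$2" "$3"
}

tuple=$(snap_tuple myvol snap3myvol cobmydb)

# Only print the command; "mydg" is a placeholder disk group name.
echo "vxsnap -g mydg make $tuple"
```

Because the whole tuple is one shell word, a stray space inside it would split it into separate arguments, which is why the parts are joined with "/" and nothing else.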
Mike
Regarding the messages mentioned above (see below again for reference), what factors could be causing them? I am used to mounting the filesystem read-only at the DR Site and I have never faced this problem.
[root@xxxxxx ~]# mount -o ro /dev/vx/dsk/DG/Volume /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/vx/dsk/DG/Volume,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
#tail -f /var/log/messages
Nov 28 15:43:06 xxxxxx kernel: lost page write due to I/O error on VxVM65533
Nov 28 15:43:06 xxxxxx kernel: JBD: recovery failed
Nov 28 15:43:06 xxxxxx kernel: EXT3-fs: error loading journal.
Zahid Haseeb
Is this an ext3 filesystem? If it is not ext3, then you need to specify the mount type when you mount it.
Mike
Yes, this is an ext3 filesystem, but I am still facing the same error.
One more thing I have found out:
At the Primary Site (new Secondary) we stopped replication and then stopped VCS, then tried to mount the volume again. I was not able to mount it and it gave the same error. I created a new filesystem on this volume (the one that would not mount), but I still faced the same problem. I then dissociated the volume from the RVG, created a filesystem, and mounted it. It mounted successfully, but as soon as I associated the volume back to the RVG I faced the same error when mounting the volume.
Zahid Haseeb
The original point of doing a space-optimised snapshot or mounting read-only was to see if fbsync synced volumes on the replaced disks, but if you have stopped and started replication, then the fact that you replaced disks is irrelevant as you have synced everything from scratch. I have mounted VVR volume on secondary in Solaris vxfs fine, as Solaris just marks a mounted flag, but I have never tried on Linux ext3, so Linux MAY do something different, like it might try to write to inode table as error says " lost page WRITE" and this won't work as the whole volume is readonly. But if you stop replication so that secondary volumes are writable, then this should work and if this doesn't then you probably did something wrong earlier.
If you try all this again, I would use a space-optimised snapshot, and if it doesn't work, then send all the commands you are running (in the order you run them), including associating volumes, stopping and starting replication, snapshot commands and the mount of the snapshot.
Mike
Thanks to all participants, especially Mike, for the kind words on this post.
All the volumes shown as associated in the vxprint output are started.
An example of starting a disabled volume:
(vxvol -g DG start VolumeName OR vxvol -g DG startall)
An example of starting and stopping replication:
(vradmin -g DG -a startrep RVGname and vradmin -g DG -f stoprep RVGname)
===================
Even at the Primary Site, where the application was not live, I created new volumes/filesystems. They can be mounted if they are dissociated from the RVG, but once I associate the volumes with the RVG I am not able to mount them, even with replication stopped. If a freshly formatted volume with a new filesystem cannot be mounted under the RVG, how can I expect a snapshot of this volume to mount? And even if it does mount, what is my benefit if the real volume cannot be mounted? In this situation, what happens during a switchover/failover, i.e. how can the volume be mounted? I am so depressed with Support at this time.
Now my final plan
- Finally I broke the GCO, which isolated both clusters (Primary Site and DR Site), and removed the links between the service groups (Application and Replication ServiceGroups). The application was still running at the DR Site.
- Then I created a new ServiceGroup, DiskGroup and new volumes with new names, only for the application, and just ran the createpri command at the Primary Site.
After doing all this I copied my application from the DR Site to the Primary Site this morning and brought the application up at the Primary Site.
Plan for today
Phase-I
At the DR Site I will now create a new ServiceGroup, DiskGroup and new volumes with the same names and sizes as the ones we newly created at the Primary Site.
Phase-II
Will execute the addsec command from Primary Site which will start the Replication between Primary and DR Site.
Phase-III
We need to create the RVG Resource at Replication ServiceGroup.
1.) At the Primary Site we remove the DiskGroup resource from the Application ServiceGroup and create the DiskGroup resource under the Replication ServiceGroup. Can we do that without any downtime? Removing the DiskGroup resource from the Application ServiceGroup may deport the disk group at the Primary Site and bring the application down. How can we move the DiskGroup from the Application ServiceGroup to the Replication ServiceGroup seamlessly, without any impact on the live application?
2.) Create the RVG resource in the Replication ServiceGroup (at the Primary Site).
3.) Create the RVG-Primary resource in the Application ServiceGroup (at the Primary Site).
4.) Create an online local hard link between the Application and Replication ServiceGroups (select the Application ServiceGroup and then the Replication ServiceGroup) (at the Primary Site).
We also need to perform the above four activities at the DR Site, which I don't think is a real worry; the point I am concerned about is point # 1. (These four steps put the replication under Veritas Cluster control.)
5.) Add the remote cluster.
6.) Make the Application ServiceGroup a Global ServiceGroup.
If there is any concern about the above activity under Plan for today, please share your comments. I need an urgent and quick response, as I have lost all Symantec credibility at my client site: the Severity-1 case has taken a week and is still not complete. I would really appreciate a proper resolution for this, please.
Zahid Haseeb
I have found a TN regarding my concern in point # 1. It makes me think we can copy the DiskGroup resource from the Application ServiceGroup to the Replication ServiceGroup and, once it is copied and the resource dependencies are created in the Replication ServiceGroup, delete the DiskGroup resource from the Application ServiceGroup.
https://sort.symantec.com/public/documents/sf/5.1/...
Zahid Haseeb
I tried mounting a volume in secondary RVG on my VMWare 5.1GA RHEL5 setup and it mounted an ext3 filesystem read-only without any issues when I had a consistent rlink. When my rlink was inconsistent, then it would not mount and I got a vxio error "Failing reads on rvg", followed by the errors you get. So as long as your rlink is consistent, which it should be if you have just run a stoprep and startrep and autosync has finished, then you should be able to mount filesystem readonly. If rlink is inconsistent, then you won't be able to do a snapshot either as the RVG disables the reads, BUT if you detach the rlink (by vxrlink det or vradmin stoprep), then you can read AND write to your volumes so you should be able to mount normally after stopping replication. Maybe something works different in 5.0MP3, but you can check if you can read from your volume using dd:
dd if=/dev/vx/dsk/DG/vol count=1 of=/dev/null
This should always work on volumes in your secondary RVG unless the rlink is inconsistent and attached. If your rlink is consistent OR detached, then reads should work.
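The dd check above can be wrapped in a small shell function so the result is easy to act on. This is only a sketch: the function name can_read_first_block is made up here, and /dev/vx/dsk/DG/vol is the placeholder path from this post, so set VOLUME_DEVICE to the real volume device before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if one block can be read from the given device.
can_read_first_block() {
    dd if="$1" of=/dev/null count=1 2>/dev/null
}

# Placeholder path from the post; override with the real volume device.
VOLUME_DEVICE=${VOLUME_DEVICE:-/dev/vx/dsk/DG/vol}

if can_read_first_block "$VOLUME_DEVICE"; then
    echo "Reads from $VOLUME_DEVICE work"
else
    echo "Reads from $VOLUME_DEVICE fail"
fi
```

A failure here only shows that reads are being refused or that the path is wrong; it does not by itself say why, so the rlink state still needs to be checked separately.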
I am not sure what the point is in recreating VCS objects as your issue is with VVR, not VCS. Removing resources from VCS does not offline the resource and you can move a resource in GUI by copy,delete,paste.
Mike
First, thanks a lot to all.
Ahh, as for recreating the VCS resources: I actually renamed the DG, volumes and resources.
I recreated everything as per the shared plan. Now everything is fine. What I came to know in the last day, which both Support and I had forgotten: I saw that when the RVG was in the recovery state I was not able to mount the volume. Simple. But I don't remember what the state of the RVG was initially, when both sites were fine and replication was consistent and up to date (at that time it should definitely have been enabled).
Zahid Haseeb
So can you mount volume read-only at the secondary now?
Mike
I left the premises when the replication started (it was expected to take 2 hours). I will check by Monday and share the result with you.
Zahid Haseeb
No not yet :(
See the below :
Primary Site (Real Primary)
# mount
/dev/sdg3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sdg2 on /var type ext3 (rw)
/dev/sdg1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
tmpfs on /dev/vx type tmpfs (rw,size=4k,nr_inodes=2097152,mode=0755)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/vx/dsk/DG/Volume on /u type vxfs (rw,delaylog,largefiles,ioerror=mwdisable)
DR Site (Real Secondary)
# mount -o ro -t vxfs /dev/vx/dsk/DG/Volume /mnt
UX:vxfs mount.vxfs: ERROR: V-3-21252: not super user
#
Symantec reference on mounting a volume at the Secondary Site in a VVR scenario:
https://sort.symantec.com/ecls/umi/V-3-21252
Zahid Haseeb
This link you provided shows you should only get this error if you try to mount rw, so maybe readonly option is not working for some reason - you could try reverse the options so use "-t vxfs -o ro", rather than "-o ro -t vxfs".
Mike
You could also try "-t vxfs -r", but this shouldn't make any difference as "-r" is the same as "-o ro".
Mike
Let me check and I will let you know. But in my local environment (not the client side where the problem occurred) I ran the command below and it succeeded:
# mount -o ro -t vxfs /dev/vx/dsk/DG/Volume /mnt
Client Side OS
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
My own environment OS
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
So as the environment is the same on both sides and the command runs fine in my environment, it should run perfectly at the client side as well :)
==================
(There may be no logic in this.) The only difference is that at the client the replication/VVR is under VCS control (SFHA/DR or GCO), while in my environment only SF+VVR is running. Did you also mount successfully in a VCS-controlled VVR environment, or in an environment where just VVR exists?
Zahid Haseeb
VCS should have no effect on mount commands in UNIX - it is only Windows where SFW knows if storage is under VCS control where manual SFW commands are blocked for VCS resources.
Mike
Other difference could be which mount binary is being used which will depend on path:
There are 2 mount binaries:
If both are in your path, then the one listed first will be used.
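One way to see which mount binary a shell would pick is to walk PATH in order and list every executable with that name. The sketch below does this in plain sh; the function name find_in_path is an assumption for illustration, and the directories it reports depend entirely on the local PATH.

```shell
#!/bin/sh
# Hypothetical helper: list every executable called "$1" along PATH, in search order.
# The first line printed is the one the shell would run.
find_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

find_in_path mount
```

Running this on both the client machine and the test machine and comparing the first line of output would show whether the two environments resolve mount to the same binary. In bash, "type -a mount" gives similar information.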
Mike
I tried to mount the volume in my test environment (SFHA/DR) on a VM with both commands you mentioned above, and I was able to mount it read-only at the DR Site while copying some files from the Primary to the DR Site volume.
Zahid Haseeb