Very slow Exchange GRT backups
Hi !
We run NBU 6.5.4 and are trying to run GRT backups of a Windows 2003 R2 SP2 server with all the required NFS patches.
We can see two phases in the backup:
1- a very fast backup of the store via SAN client
2- a very slow indexing phase, when the NFS client/server is involved. It takes 1 to 4 hours to index the 60 GB store, which contains 5 mailboxes.
The exchange server and the media server are connected to the same switch.
If we snoop the dialog on the media server, we can see that NFS packets are sent very slowly (about 4-8 KB/s). We see that the NBU nfs server waits 1 second before sending the data (see snoop below).
We also found this 1 second delay in the bpdm log (see below).
Happens with basic disk and advanced disk.
Any idea how to solve this problem ?
TIA,
Ludo.
------- nfs snoop ---------
2.24509 exchserv -> nbumedia026 NFS C READ3 FH=2283 at 1175568384 for 4096
==> read request from the Exchange server
2.29524 nbumedia026 -> exchserv TCP D=18624 S=7394 Ack=4153401168 Seq=45132715 Len=0 Win=64240
==> small ack sent by the media server (len=0)
3.28856 nbumedia026 -> exchserv NFS R READ3 OK (4096 bytes)
==> but data is sent 1 second after the request !
3.28862 nbumedia026 -> exchserv TCP D=18624 S=7394 Ack=4153401168 Seq=45134175 Len=1460 Win=64240
3.28866 nbumedia026 -> exchserv TCP D=18624 S=7394 Push Ack=4153401168 Seq=45135635 Len=1308 Win=64240
------- bpdm --------
14:18:52.572 [13851] <2> read_data: waited for empty buffer 3 times, delayed 71 times
14:18:52.572 [13851] <2> set_restore_cntl: dmcommon.c.6963: firstblk = 0, blocks_to_skip = 0, bytes_to_skip = 0, fragnum = 0 (input parameters)
14:18:52.572 [13851] <2> read_backup: seeking to image relative block number 130967312 frag relative block number 130967312 to start read-blockmap
===> here is another 1 second delay
14:18:53.557 [13852] <2> send_bptm_req: [13851] bptm parent answered 0, 0, 0
14:18:53.557 [13852] <2> write_blocks: [13851] writing 2048 data blocks of 512
14:18:53.648 [13852] <2> filter_image_ifr: [13851] sending bp*m position request, curr_frag = 1, new_frag = 1, curr_blknum = 130969360, new_blknum = 130967216, firstblk = 130967216
14:18:53.651 [13851] <2> check_positioning: CINDEX 0 wants to skip to frag 1, firstblk 130967216, ACTIVE_GC = 1
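As a rough cross-check of the 4-8 KB/s figure, the snoop timestamps can be turned into a throughput number. This is only a sketch: the function names are made up for illustration, and it assumes that the 4096-byte READ3 requests are served strictly one after the other, with the delay measured from the request (2.24509) to the reply (3.28856).

```c
#include <stdio.h>

/* Hypothetical helper (not part of NetBackup): throughput in KB/s when
 * each NFS READ3 of `bytes` is answered (reply_ts - request_ts) seconds
 * after it was sent and reads are strictly serialized (an assumption). */
double read3_throughput_kbs(double request_ts, double reply_ts, double bytes)
{
    double latency = reply_ts - request_ts;   /* seconds per read */

    if (latency <= 0.0)
        return 0.0;                           /* guard against bad input */
    return (bytes / latency) / 1024.0;
}

/* Print the estimate for the READ3 shown in the snoop output. */
void print_snoop_estimate(void)
{
    double kbs = read3_throughput_kbs(2.24509, 3.28856, 4096.0);

    printf("approx. %.2f KB/s per serialized 4 KB read\n", kbs);
}
```

Calling print_snoop_estimate() from a main() gives a bit under 4 KB/s, which is in line with the low end of the rate seen on the wire.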
Comments
Hi Ludo, we've the same
Hi Ludo,
we've the same problem
ciao bernes
Same delay seen in the bpdm
Same delay seen in the bpdm logs ?
We have about the same problem
I opened a couple of cases with Symantec support several months ago and haven't received any meaningful explanation so far.
It seems that GRT technology either was not tested well or simply doesn't work as advertised.
Awesome ! With a truss you
Awesome !
With a truss you can see that they added a sleep(1) in the bpdm code !
Please Symantec, fix this code !
...
3.5539 read(0, " X F E R B L O C K 4 2".., 21) = 21
3.5541 alarm(0) = 600
3.5542 sigaction(SIGALRM, 0xFFBFD0C0, 0x00000000) = 0
new: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000
3.5545 getpid() = 22360 [22355]
3.5546 llseek(4, 0, SEEK_END) = 1964349
3.5548 write(4, " 1 7 : 0 0 : 5 2 . 2 8 5".., 172) = 172
3.5550 nanosleep(0xFFBFE240, 0xFFBFE238) = 0
tmout: 0.000000000 sec
resid: 0.000000000 sec
4.5553 nanosleep(0xFFBFE240, 0xFFBFE238) = 0
tmout: 1.000000000 sec
resid: 0.000000000 sec
4.5559 kill(22355, SIG#0) = 0
4.5560 getpid() = 22360 [22355]
4.5562 llseek(4, 0, SEEK_END) = 1965001
4.5563 write(4, " 1 7 : 0 0 : 5 3 . 2 8 6".., 77) = 77
4.5566 getpid() = 22360 [22355]
4.5567 llseek(4, 0, SEEK_END) = 1965078
...
Hi Lu, That's a good
Hi Lu,
That's a good find.
Right now Symantec is still working on another EEB. Today I'm going to test extending the SoftMountPingtimeout to 60 (with regard to NFS). I'll let you know how that goes.
Matt
Yes setting
Yes setting SoftMountPingtimeout to 60 may avoid an error 1 during the indexing phase. But it does not speed up the indexing.
True, and for the record it
True, and for the record it did not help our issue.
The good news is that in the last test we performed, Symantec had me add a touch file (it only works with the provided EEB), and the GRT backup only took a few minutes longer than the non-GRT backups. It still ended in a status 1, but it seems to be a step in the right direction. At least my test jobs won't last 24 hours...
Is it
Is it /usr/openv/netbackup/db/config/nbfsd_enableDirect ?
That's the one. Support get
That's the one. Did Support get you that EEB? If so, how's it working for you?
It seems to work :-) The
It seems to work :-) The indexing is very fast because the nfs server directly opens the image on disk instead of using bpdm with its 1 second delays everywhere... No error 1 so far.
Warning !!! With this EEB
Warning !!! With this EEB, duplications of Exchange backups are broken ! Avoid duplications or your NBU server may be stuck running "image cleanups" for hours !
Does GRT work for Exchange
Does GRT work for Exchange 2007 ?
Only "alpha quality" for Exchange 2003 ?
We are using Exchange 2007
We are using Exchange 2007 and are experiencing this issue. See the post in the other thread for our setup details.
https://www-secure.symantec.com/connect/forums/tot...
So to sum up, the problem is
So to sum up, the problem is found on NBU 6.5.3 or 6.5.4, with a Win 2k3 or Solaris 10 media server.
Unofficial Workaround...
I wanted to remove the 1 second delay in bpdm, so I tried to do this with LD_PRELOAD_32 (Solaris 10).
1- Compile the code below with "cc -G libnanosleep.c -o libnanosleep.so"
2- Rename the original bpdm binary to bpdm.orig and create a wrapper named bpdm using this script:
#!/bin/ksh
LD_PRELOAD_32=/tmp/libnanosleep.so
export LD_PRELOAD_32
/usr/openv/netbackup/bin/bpdm.orig "$@"
It will replace the 1 sec sleep with a 20ms one.
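Before swapping in the wrapper, it may be worth checking that the preloaded library is really picked up. The helper below is only a sketch (the function names are made up, nothing here comes from NetBackup): it times a sleep(1) call, which should take about 1000 ms normally and roughly 20 ms when LD_PRELOAD_32 points at libnanosleep.so and the test program is built as a 32-bit binary.

```c
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wall-clock milliseconds spent in sleep(seconds).
 * sleep() is resolved at run time, so if a preloaded library interposes
 * on sleep(), that replacement is what gets measured here. */
double measure_sleep_ms(unsigned int seconds)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    sleep(seconds);
    gettimeofday(&end, NULL);

    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_usec - start.tv_usec) / 1000.0;
}

/* Print how long a nominal 1 second sleep really took. */
void report_sleep(void)
{
    printf("sleep(1) took %.0f ms\n", measure_sleep_ms(1));
}
```

Call report_sleep() from a small main() and run the program once without and once with LD_PRELOAD_32 set, the same way the wrapper script sets it.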
Here are our results:
- indexing time=3-4 minutes instead of 1-4 hours
- but much higher I/O load on the media server, because 'bpdm' reads data in 256 KB blocks when NFS only needs 4-8 KB.
So there are 2 bugs in the 6.5.4 bpdm:
1- the 1 second sleep before each seek
2- it reads 256kB when NBU NFS only asks for 4kB
This "patch" only addresses the 1st one.
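To put the second bug into numbers, here is a back-of-the-envelope sketch. The helper names are made up, and it assumes one 256 KB backend read per NFS request and that the sleep is the only thing limiting the request rate. Neither assumption is confirmed by the logs.

```c
#include <stdio.h>

/* Hypothetical helper: KB read by bpdm per KB the NFS client asked for,
 * assuming one backend read per NFS request. */
double read_amplification(double backend_read_kb, double nfs_request_kb)
{
    if (nfs_request_kb <= 0.0)
        return 0.0;
    return backend_read_kb / nfs_request_kb;
}

/* Hypothetical helper: upper bound on the backend read rate in KB/s if
 * every request costs one backend read plus one sleep of sleep_s seconds. */
double backend_rate_kbs(double backend_read_kb, double sleep_s)
{
    if (sleep_s <= 0.0)
        return 0.0;
    return backend_read_kb / sleep_s;
}

/* Print the estimates for the block sizes and sleep times in this post. */
void print_io_estimate(void)
{
    printf("amplification: %.0fx (4 KB requests), %.0fx (8 KB requests)\n",
           read_amplification(256.0, 4.0), read_amplification(256.0, 8.0));
    printf("backend rate: up to %.0f KB/s with 1 s sleeps, %.0f KB/s with 20 ms sleeps\n",
           backend_rate_kbs(256.0, 1.0), backend_rate_kbs(256.0, 0.02));
}
```

Under these assumptions bpdm reads 64 times (or 32 times) more data than NFS asked for, and shortening the sleep from 1 s to 20 ms raises the possible backend read rate by a factor of 50, which would fit the higher I/O load on the media server.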
===libnanosleep.c===
#include <stdio.h>
#include <errno.h>
#include <dlfcn.h>
#include <time.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Pointer to the real sleep(), resolved on first use. */
static unsigned int (*func)(unsigned int);

/* Interposed sleep(): logs every call and shortens 1 second sleeps to 20 ms. */
unsigned int sleep(unsigned int seconds)
{
    unsigned int retval;
    int fd;
    char buf[256];
    time_t cur_time;

    if (!func) {
        func = (unsigned int (*)(unsigned int)) dlsym(RTLD_NEXT, "sleep");
    }

    /* Append a trace line to /tmp/sleep.<pid>.txt */
    snprintf(buf, sizeof(buf), "/tmp/sleep.%d.txt", (int)getpid());
    fd = open(buf, O_WRONLY|O_CREAT|O_APPEND, 0644);  /* O_CREAT needs a mode */
    if (fd >= 0) {
        time(&cur_time);
        snprintf(buf, sizeof(buf), "%.19s: sleep=%d\n", ctime(&cur_time), (int)seconds);
        write(fd, (const void *)buf, strlen(buf));
        close(fd);
    }

    if (seconds == 1) {
        usleep(20000);  /* 20 ms instead of 1 s */
        return 0;
    }

    retval = func(seconds);
    return retval;
}
Hi ludo, I've got exactly the
Hi ludo,
I've got exactly the same problem!
Can you give more details ?
Can you give more details ? (NBU version/platform, Exchange version, stores size, etc)
Hi Lu, Thanks for the tip to
Hi Lu,
Thanks for the tip to reduce the nanosleep! It's crazy!
Solaris 10 with Exchange 2003.
We are trying to back up a 60 GB mail store with 5 mailboxes.
The Windows 2003 server is on R2 SP2 with the NFS patch.
I think I will test your code to reduce the nanosleep time, but I would need confirmation from a Veritas or Symantec guy that it works fine...
Did anyone experiencing this
Did anyone experiencing this issue have older versions of NetBackup or Backup Exec previously installed on their Exchange servers? I was looking through the registry on our Exchange server and I see lots of entries for the previous install of Backup Exec.
The reason I mention this is that NetBackup uses Backup Exec code to back up Exchange.
I guess a better question would be: are any of you working with a fresh OS, Exchange install and NetBackup install?
We have a fresh install of
We have a fresh install of the media servers on Solaris 10, so there's no registry. We see that we have the same problem with two completely different OSes, so the problem is really in bpdm.
Just tried with the 6.5.5
Just tried with the 6.5.5 bpdm binary. Still this 1 second delay...
(I just installed the bpdm binary and have not tried a full 6.5.5 upgrade on the media server. Can anybody try with a full 6.5.5 upgrade ?)
Lu, I was referring to the
Lu, I was referring to the client, the Exchange server.
By the way, I have experienced this with a RHEL4 media server on a separate Exchange cluster.
> lu i was refering to the
> lu i was refering to the client, the exchange server.
Ok ! Our exchange server was recently installed from scratch, for this test. So I think the registry is clean...
Good to know, thanks.
Good to know, thanks.
More info: this bug also
More info: this bug also slows down restores... :-(
I have started a granular restore, and it took 10 minutes for 25 MB. I 'trussed' bpdm during the restore and also saw the 1 second sleeps. With the LD_PRELOAD_32 hack, the restore took 2 minutes.
Please, Symantec, when will the final release of GRT be available ? :-)
Have you opened a case with Support?
This definitely looks like the sort of thing that should get escalated to the developers for a real answer (or fix)!
I have a hunch there's a "really good reason" that delay was coded in, but obviously I have no clue what it is. I'd hate for your modifications to come back and bite you, though, so I *strongly* recommend opening a case to get an officially supported EEB replacement... if that's what it ends up being.
Yes... still waiting :-(
Yes... still waiting :-(
Still waiting...The latest
Still waiting...
The latest EEB created problems with Duplications...
Lu - could you post the EEB
Lu - could you post the EEB file name or case number to reference, so I can at least fix part one of the problem? I'll worry about the duplications later. I'd still rather have my GRT, and I can't seem to get hold of the EEB for this.
EEB 1712608 + touch
EEB 1712608 + touch /usr/openv/netbackup/db/config/nbfsd_enableDirect => faster indexing but BIG problems with duplications (catalog corruption).
Looks like we are using the same
Looks like we are using the same EEB, Lu.
Reckless, if you want the EEB for 6.5.5 try and ask for 1928803.1 (that should be a recompiled version of 1712608.9)
Matt
ET1712608.8 AND nbfsd_enableDirect
This fix has been in effect and it does resolve the backup performance issue.
Engineering/Tech Support are fully aware of the side effect with SLP. However, the details from the original site that reported it are so generic that it would be foolish to suggest anything such as corruption at this stage. I will update this thread in a few days' time to connect the link with SLP, as there will be some development on this SLP v ET1712608.8 issue.
RogC
Reckless, I forgot to mention
Reckless,
I forgot to mention that this EEB does not solve my particular issue with GRT backups ending in a status 1 when the mailstore is larger than 60-90GB (could never narrow down at what size the problem shows up).
Though it has cut our testing time down to just a few minutes longer than a normal non-grt backup. Previously the job would run for days!
Matt
We have tried the EEB
We have tried the EEB 1712608.10 and the backup is fast but the duplication to tape is still awful (catalog corruption).
My tests with the latest EEB
My tests with the latest EEB have been successful for single large mail stores. But when I try to back up more than one at a time (Microsoft Information Store:\*), it usually ends with a status 1 again. I even changed the policy to only allow 2 jobs to run at the same time, but I still ended up with random status 1s and backup jobs running for 20+ hours.
So, semi solved here.
> But when try to backup more
> But when try to backup more than one at a time (Microsoft Information Store:\*) it usually ends with a status 1 again.
Thanks ! Good to know ! Maybe the media server can only serve one image at a time over NFS for indexing ?
ET1712608.10 + GRANULAR_DUP_RECURSION = 0 *if dup'ing GRT images
Apologies for the delay in updating you all, but it has been confirmed
that ET1712608.10 can be safely applied to environments without causing any
side effects.
The only caveat/precaution is that when running GRT image duplications you should
set "GRANULAR_DUP_RECURSION = 0" in bp.conf. This disables the cataloging of
GRT images.
Full coverage of this flag is documented in this tech note.
http://support.veritas.com/docs/317302
In summary: ET1712608.10 + GRANULAR_DUP_RECURSION = 0 (if dup'ing GRT images) = effectively quicker backups.
It has also been proven that the same ET1712608.10 can resolve SharePoint GRT performance
related issues.
We're currently documenting all of this and will provide an official resolution to this problem.
Rog C
Yes, I can confirm that with
Yes, I can confirm that with ET1712608.10 + GRANULAR_DUP_RECURSION=0, we started to have GRT backups working and duplication to tape was ok.
Of course, the image duplicated to tape needs to be copied back to disk to allow GRT browsing.
Is there a fix for "errors 1" when the Exchange stores are backed-up in parallel (multiplexing enabled) ? Is it related to the fact that only one NFS mount is allowed on the Exchange client at one time ?
Lu, we tried this on (7.0GA)
Lu, we tried this on 7.0 GA, backing up both stores in parallel (using GRT) with multiple data streams. Both jobs ended with a status 0. If you are encountering this issue, I would suggest you open a case with Symantec Support.
Inform me of the case id and I will get this looked into,
Rog C
Roger, Though Lu's issue and
Roger,
Though Lu's issue and mine are different we have been using the same EEBs. Would you be able to test this with 2 mail stores over 100GB, multi-streaming?
My issue is that I'm getting status 1s for large databases. The latest EEB solved the issue for me when only allowing one mail store to back up at a time. Once I allowed multi-streaming, it failed with a status 1 again.
The reason I ask you to test is that I have not been able to perform another test in our production environment for a couple of weeks, and I'm not sure when I'll be able to get another test in.
Thanks ahead of time if you are able to help.
Matt