Very slow Exchange GRT backups

Updated: 05 Sep 2010 | 41 comments
Posted by: lu

Hi!
We run NBU 6.5.4 and are trying to run GRT backups of a Windows 2003 R2 SP2 server that has all the required NFS patches.

We can see two phases in the backup:
1- a very fast backup of the store via SAN client
2- very slow indexing once the NFS client/server is involved. It takes 1 to 4 hours to index the 60 GB store, which contains 5 mailboxes.

The Exchange server and the media server are connected to the same switch.

If we snoop the traffic on the media server, we can see that NFS data is sent very slowly (about 4-8 KB/s). The NBU NFS server waits 1 second before sending the data (see the snoop below).

We also found this 1-second delay in the bpdm log (see below).
It happens with both basic disk and advanced disk.

Any idea how to solve this problem?
TIA,
Ludo.

------- nfs snoop ---------

2.24509 exchserv -> nbumedia026 NFS C READ3 FH=2283 at 1175568384 for 4096
==> read request from the Exchange server
2.29524 nbumedia026 -> exchserv TCP D=18624 S=7394 Ack=4153401168 Seq=45132715 Len=0 Win=64240
==> small ack sent by the media server (len=0)
3.28856 nbumedia026 -> exchserv NFS R READ3 OK (4096 bytes)
==> but data is sent 1 second after the request !
3.28862 nbumedia026 -> exchserv TCP D=18624 S=7394 Ack=4153401168 Seq=45134175 Len=1460 Win=64240
3.28866 nbumedia026 -> exchserv TCP D=18624 S=7394 Push Ack=4153401168 Seq=45135635 Len=1308 Win=64240
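
As a rough check of the trace above, the gap between the READ3 call (timestamp 2.24509) and its reply (3.28856) can be turned into an effective read rate. The helper below is only an illustrative sketch: `read_throughput` and `print_trace_example` are hypothetical names, not part of NetBackup or snoop, and the calculation assumes a single outstanding 4096-byte read at a time, as the trace suggests.

```c
#include <stdio.h>

/* Effective throughput in bytes per second for one NFS READ3 exchange.
 * t_request and t_reply are the snoop timestamps (in seconds) of the
 * READ3 call and of its reply; bytes is the size of the read. */
double read_throughput(double t_request, double t_reply, unsigned bytes)
{
    double elapsed = t_reply - t_request;
    if (elapsed <= 0.0) {
        return 0.0; /* timestamps out of order or identical: no rate */
    }
    return bytes / elapsed;
}

/* Prints the rate for the exchange captured in the trace above. */
void print_trace_example(void)
{
    double bps = read_throughput(2.24509, 3.28856, 4096);
    printf("%.0f bytes/s (%.1f KB/s)\n", bps, bps / 1024.0);
}
```

With one 4 KB read answered roughly every second, the rate comes out just under 4 KB/s, which is in line with the 4-8 KB/s seen on the wire.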

------- bpdm --------
14:18:52.572 [13851] <2> read_data: waited for empty buffer 3 times, delayed 71 times
14:18:52.572 [13851] <2> set_restore_cntl: dmcommon.c.6963: firstblk = 0, blocks_to_skip = 0, bytes_to_skip = 0, fragnum = 0 (input parameters)
14:18:52.572 [13851] <2> read_backup: seeking to image relative block number 130967312 frag relative block number 130967312 to start read-blockmap

===> here is another 1 second delay

14:18:53.557 [13852] <2> send_bptm_req: [13851] bptm parent answered 0, 0, 0
14:18:53.557 [13852] <2> write_blocks: [13851] writing 2048 data blocks of 512
14:18:53.648 [13852] <2> filter_image_ifr: [13851] sending bp*m position request, curr_frag = 1, new_frag = 1, curr_blknum = 130969360, new_blknum = 130967216, firstblk = 130967216
14:18:53.651 [13851] <2> check_positioning: CINDEX 0 wants to skip to frag 1, firstblk 130967216, ACTIVE_GC = 1

Comments

bernes_stainz | 26 Nov 2009

Hi Ludo,

we have the same problem.

ciao bernes

lu | 27 Nov 2009

Do you see the same delay in the bpdm logs?

Mouse | 28 Nov 2009

We have about the same problem

I opened a couple of cases with Symantec support several months ago and have not received any meaningful explanation so far.
It seems that GRT technology either was not tested well or simply doesn't work as advertised.

lu | 30 Nov 2009

Awesome!
With truss you can see that they added a sleep(1) in the bpdm code!
Symantec, please fix this code!

...
3.5539 read(0, " X F E R B L O C K 4 2".., 21) = 21
3.5541 alarm(0) = 600
3.5542 sigaction(SIGALRM, 0xFFBFD0C0, 0x00000000) = 0
new: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000
3.5545 getpid() = 22360 [22355]
3.5546 llseek(4, 0, SEEK_END) = 1964349
3.5548 write(4, " 1 7 : 0 0 : 5 2 . 2 8 5".., 172) = 172
3.5550 nanosleep(0xFFBFE240, 0xFFBFE238) = 0
tmout: 0.000000000 sec
resid: 0.000000000 sec
4.5553 nanosleep(0xFFBFE240, 0xFFBFE238) = 0
tmout: 1.000000000 sec
resid: 0.000000000 sec
4.5559 kill(22355, SIG#0) = 0
4.5560 getpid() = 22360 [22355]
4.5562 llseek(4, 0, SEEK_END) = 1965001
4.5563 write(4, " 1 7 : 0 0 : 5 3 . 2 8 6".., 77) = 77
4.5566 getpid() = 22360 [22355]
4.5567 llseek(4, 0, SEEK_END) = 1965078
...

MattS | 30 Nov 2009

Hi Lu,
That's a good find.

Right now Symantec is still working on another EEB. Today I'm going to test extending the SoftMountPingtimeout to 60 (in regards to NFS). I'll let you know how that goes.

Matt

lu | 15 Dec 2009

Yes, setting SoftMountPingtimeout to 60 may avoid an error 1 during the indexing phase, but it does not speed up the indexing.

MattS | 15 Dec 2009

True, and for the record it did not help our issue.

The good news is that in the last test we performed, Symantec had me add a touch file (it only works with the provided EEB), and the GRT backup took only a few minutes longer than the non-GRT backups. It still ended in a status 1, but it seems to be a step in the right direction. At least my test jobs won't last 24 hours...

lu | 16 Dec 2009

Is it /usr/openv/netbackup/db/config/nbfsd_enableDirect?

MattS | 16 Dec 2009

That's the one. Did Support get you that EEB? If so, how is it working for you?

lu | 16 Dec 2009

It seems to work :-) The indexing is very fast because the NFS server opens the image on disk directly instead of going through bpdm with its 1-second delays everywhere... No error 1 so far.

lu | 18 Dec 2009

Warning!!! With this EEB, duplications of Exchange backups are broken! Avoid duplications, or your NBU server may be stuck running "image cleanups" for hours!

lu | 30 Nov 2009

Does GRT work for Exchange 2007?
Is it only "alpha quality" for Exchange 2003?

MattS | 30 Nov 2009

We are using Exchange 2007 and are experiencing this issue. See the post in the other thread for our setup details.
https://www-secure.symantec.com/connect/forums/tot...

lu | 01 Dec 2009

So, to sum up: the problem is seen on NBU 6.5.3 and 6.5.4, with Win 2k3 or Solaris 10 media servers.

lu | 01 Dec 2009

Unofficial Workaround...

I wanted to remove the 1-second delay in bpdm, so I tried to do this with LD_PRELOAD_32 (Solaris 10).

1- Compile the code below with "cc -G libnanosleep.c -o libnanosleep.so" and copy the library to /tmp
2- Rename the original bpdm binary to bpdm.orig and create a wrapper named bpdm using this script:

#!/bin/ksh
LD_PRELOAD_32=/tmp/libnanosleep.so
export LD_PRELOAD_32
/usr/openv/netbackup/bin/bpdm.orig "$@"

It will replace the 1-second sleep with a 20 ms one.
Here are our results:
- indexing time = 3-4 minutes instead of 1-4 hours
- but a much higher I/O load on the media server, because bpdm reads data in 256 KB blocks when NFS needs only 4 or 8 KB.

So there are 2 bugs in the 6.5.4 bpdm:
1- the 1-second sleep before each seek
2- it reads 256 KB when the NBU NFS server only asks for 4 KB

This "patch" only addresses the first one.

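===libnanosleep.c===
/* Interposer for sleep(): turns the 1-second sleep in bpdm into a 20 ms one. */
#include <stdio.h>
#include <dlfcn.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>

/* Pointer to the next sleep() in the link chain (libc), resolved on first use. */
static unsigned int (*real_sleep)(unsigned int);

unsigned int sleep(unsigned int seconds)
{
  int fd;
  char buf[256];
  time_t cur_time;

  if (!real_sleep) {
    real_sleep = (unsigned int (*)(unsigned int)) dlsym(RTLD_NEXT, "sleep");
  }

  /* Debug trace: append every sleep() call to /tmp/sleep.<pid>.txt. */
  snprintf(buf, sizeof(buf), "/tmp/sleep.%d.txt", (int)getpid());
  fd = open(buf, O_WRONLY|O_CREAT|O_APPEND, 0644); /* O_CREAT needs a mode */
  if (fd >= 0) {
    time(&cur_time);
    snprintf(buf, sizeof(buf), "%.19s: sleep=%u\n", ctime(&cur_time), seconds);
    write(fd, buf, strlen(buf));
    close(fd);
  }

  /* Shorten only the 1-second sleep; 20000 us = 20 ms. */
  if (seconds == 1) { usleep(20000); return 0; }

  /* Other durations go to the real sleep(); if it could not be resolved,
     report the whole interval as unslept. */
  return real_sleep ? real_sleep(seconds) : seconds;
}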
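
As a rough consistency check on the figures in this post, the two bpdm issues can be put into numbers. This is only a sketch under a stated assumption: that indexing time is dominated by one fixed sleep per seek. The thread gives no seek count, so the counts used below are derived from the reported 1-4 hours, and the function names are hypothetical.

```c
/* Total seconds spent sleeping if every seek costs one fixed delay. */
double sleep_overhead_seconds(unsigned long seeks, double delay_seconds)
{
    return (double)seeks * delay_seconds;
}

/* How many bytes bpdm reads for each byte the NFS client asked for. */
double read_amplification(unsigned long block_bytes, unsigned long request_bytes)
{
    if (request_bytes == 0) {
        return 0.0; /* avoid dividing by zero on a bad input */
    }
    return (double)block_bytes / (double)request_bytes;
}
```

Under that assumption, 1 to 4 hours at 1 second per seek corresponds to 3600 to 14400 seeks. The same number of seeks at 20 ms each would take about 72 to 288 seconds, which is consistent with the 3-4 minutes reported. Reading 256 KB to serve a 4 KB request is a 64x read amplification, which fits the higher I/O load seen on the media server.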

Vapulaflo | 01 Dec 2009

Hi ludo,

I've got exactly the same problem!

lu | 01 Dec 2009

Can you give more details? (NBU version/platform, Exchange version, store sizes, etc.)

Vapulaflo | 01 Dec 2009

Hi Lu,

Thanks for the tip on reducing the nanosleep!... It's crazy!

Vapulaflo | 01 Dec 2009

Solaris 10 with Exchange 2003

We are trying to back up a 60 GB mail store with 5 mailboxes.

The Windows 2003 server is on R2 SP2 with the NFS patch.

I think I will test your code to reduce the nanosleep time, but I would need confirmation from a Veritas or Symantec guy that it works fine...

MattS | 01 Dec 2009

Did anyone experiencing this issue have older versions of NetBackup or Backup Exec previously installed on their Exchange servers? I was looking through the registry on our Exchange server and I see lots of entries from the previous install of Backup Exec.

The reason I mention this is that NetBackup uses Backup Exec code to back up Exchange.

I guess a better question would be: are any of you working with a fresh OS, Exchange install and NetBackup install?

lu | 02 Dec 2009

We have a fresh install of the media servers on Solaris 10, so there's no registry. We see the same problem with two completely different OSes, so the problem is really in bpdm.

lu | 02 Dec 2009

Just tried with the 6.5.5 bpdm binary. Still this 1-second delay...
(I only installed the bpdm binary and did not try a full 6.5.5 upgrade on the media server. Can anybody try with a full 6.5.5 upgrade?)

MattS | 02 Dec 2009

Lu, I was referring to the client, the Exchange server.

By the way, I have experienced this with a RHEL4 media server on a separate Exchange cluster.

lu | 02 Dec 2009

> Lu, I was referring to the client, the Exchange server.
OK! Our Exchange server was recently installed from scratch for this test, so I think the registry is clean...

MattS | 02 Dec 2009

Good to know, thanks.

lu | 03 Dec 2009

More info: this bug also slows down restores... :-(

I started a granular restore, and it took 10 minutes for 25 MB. I 'trussed' bpdm during the restore and also saw the 1-second sleeps. With the LD_PRELOAD_32 hack, the restore took 2 minutes.

Please, Symantec, when will the final release of GRT be available? :-)

CRZ | 03 Dec 2009

Have you opened a case with Support?

This definitely looks like the sort of thing that should get escalated to the developers for a real answer (or fix)!

I have a hunch there's a "really good reason" that delay was coded in, but obviously I have no clue what it is.  I'd hate for your modifications to come back and bite you, though, so I *strongly* recommend opening a case to get an officially supported EEB replacement... if that's what it ends up being.

lu | 04 Dec 2009

Yes... still waiting :-(

lu | 11 Jan 2010

Still waiting...
The latest EEB created problems with duplications...

RecklessTrippy | 12 Jan 2010

Lu - could you post the EEB file name or case number to reference, so I can at least fix part one of the problem? I'll worry about the duplications later. I would still rather have my GRT, and I can't seem to get hold of the EEB for this.

lu | 13 Jan 2010

EEB 1712608 + touch /usr/openv/netbackup/db/config/nbfsd_enableDirect => faster indexing but BIG problems with duplications (catalog corruption).

MattS | 13 Jan 2010

Looks like we are using the same EEB, Lu.
Reckless, if you want the EEB for 6.5.5, try asking for 1928803.1 (that should be a recompiled version of 1712608.9).

Matt

Roger C | 15 Jan 2010

ET1712608.8 AND nbfsd_enableDirect

This fix has been in effect, and it does resolve the backup performance issue.

Engineering/Tech Support are fully aware of the side effect with SLP. However, the details from the original site that reported it are so generic that it would be foolish to suggest anything such as corruption at this stage. I will update this thread in a few days' time to connect the link with SLP, as there will be some development on this SLP v ET1712608.8 issue.

RogC

MattS | 15 Jan 2010

Reckless,

I forgot to mention that this EEB does not solve my particular issue with GRT backups ending in a status 1 when the mail store is larger than 60-90 GB (I could never narrow down at what size the problem shows up).
It has, though, cut our testing time down to just a few minutes longer than a normal non-GRT backup. Previously the job would run for days!

Matt

lu | 09 Feb 2010

We have tried EEB 1712608.10: the backup is fast, but the duplication to tape is still awful (catalog corruption).

MattS | 16 Feb 2010

My tests with the latest EEB have been successful for single large mail stores. But when I try to back up more than one at a time (Microsoft Information Store:\*) it usually ends with a status 1 again. I even changed the policy to allow only 2 jobs to run at the same time, but I still ended up with random status 1s and backup jobs running for 20+ hours.

So, semi-solved here.

lu | 17 Feb 2010

> But when I try to back up more than one at a time (Microsoft Information Store:\*) it usually ends with a status 1 again.

Thanks! Good to know! Maybe the media server can only serve one image at a time over NFS for indexing?

Roger C | 03 Mar 2010

ET1712608.10 + GRANULAR_DUP_RECURSION = 0 *if dup'ing GRT images

Apologies for the delay in updating you all, but it has been confirmed that ET1712608.10 can be safely applied to environments without causing any side effects.

The only caveat/precaution is that when running GRT image duplications you should set "GRANULAR_DUP_RECURSION = 0" in bp.conf. This disables the cataloging of GRT images.

Full coverage of this flag is documented in this tech note:
http://support.veritas.com/docs/317302

In summary, ET1712608.10 + GRANULAR_DUP_RECURSION (if dup'ing GRT images) = effective, quicker backups.

It has also been proven that the same ET1712608.10 can resolve SharePoint GRT performance-related issues.

We're currently documenting all of this and will provide an official resolution to this problem.

Rog C

lu | 03 Mar 2010

Yes, I can confirm that with ET1712608.10 + GRANULAR_DUP_RECURSION=0, our GRT backups started working and duplication to tape was OK.
Of course, the image duplicated to tape needs to be copied back to disk for GRT browsing.

Is there a fix for the "error 1" results when the Exchange stores are backed up in parallel (multiplexing enabled)? Is it related to the fact that only one NFS mount is allowed on the Exchange client at a time?

Roger C | 03 Mar 2010

Lu, we tried this on 7.0 GA, backing up both stores in parallel (using GRT) with multiple data streams. Both jobs ended with a status 0. If you are encountering this issue, I would suggest you open a case with Symantec Support.
Let me know the case ID and I will get this looked into.

Rog C

MattS | 03 Mar 2010

Roger,

Though Lu's issue and mine are different, we have been using the same EEBs. Would you be able to test this with 2 mail stores over 100 GB, multi-streaming?

My issue is that I'm getting status 1s for large databases. The latest EEB solved the issue for me when only one mail store is allowed to back up at a time. Once I allowed multi-streaming, it failed with a status 1 again.

The reason I ask you to test is that I have not been able to run another test in our production environment for a couple of weeks, and I'm not sure when I'll be able to get another test in.

Thanks ahead of time if you are able to help.

Matt