Backups/Restores work, Duplications fail

Created: 19 Apr 2013 • Updated: 17 May 2013 | 17 comments
This issue has been solved. See solution.

Master and Media Server:

NB 7.5.0.4

RHEL 6.3

Data Domain DD880 (VTL)

STK L700e w/ IBM LTO3 drives

Backups and restores to VTL or physical tape both work fine. When we try a duplication, the dupe hangs or may take 3-4 days to dupe what was a 30-minute backup. This occurs with dupes from VTL->VTL, VTL->Physical Tape, and Physical->Physical. It also doesn't matter whether we use Vault, SLP, or dupe manually from the catalog image; the same thing happens.

Any ideas? We have been working with a Symantec tech for several months on this, so I thought I would finally put the question out here.

 

Thanks in advance for any replies

dukbtr's picture

Hmmm, I guess nobody has run into this before.

I got a bad feeling about this !

Marianne's picture

Nope...

I have worked with many customers over the years using different make/model VTLs on different OS's.

Never seen duplication issues.

I guess Symantec Support engineers have been requesting loads of level 5 logs?

Supporting Storage Foundation and VCS on Unix and Windows as well as NetBackup on Unix and Windows
Handy NBU Links

dukbtr's picture

Yes, a ton of high level logs have been sent :)

mph999's picture

Firstly, thank you for the clear explanation of the issue.  I wish EVERYBODY (yes guys, that's a big hint) would provide sufficient details in their opening post so we at least have an idea of what's going on.

I think the best thing I can do, is ask you for the case number.  I will see what I can then do to get the case escalated to BL.  If it is already with BL then it must be a difficult one.

A few questions:

1.  When did the problem start, any known changes at this time

2.  Has it ever worked

3.  Whereabouts in the process does it hang

4.  Is the system busy when the duplication(s) are running

5.  Is it duplicating over the LAN, or is it handled by just one media server

 

Martin

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
dukbtr's picture

 

I sent an email with the case number.

 

1.  When did the problem start, any known changes at this time

The problem started about a week after we added the 2 drives from the STK L700e library. The physical drives are used for offsite duplication purposes only. Over this past weekend, though, I ran a few more tests, such as multiple copies, and the issue showed up there too. A backup that normally takes about an hour had been running for almost 9 hours doing nothing. Once I killed it and changed it back to only going to the VTL, it was fine.

 

2.  Has it ever worked

Yes, it is random. We might have 2-3 duplications run perfectly, then the next one or more stall. It is hit or miss when we see it, though I would say that it is more miss lately.

 

3.  Whereabouts in the process does it hang

It varies. It may stall on the mount, or the tape gets mounted, does nothing, then writes about 600 MB every hour or so.

 

4.  Is the system busy when the duplication(s) are running

No, NetBackup is the only thing on the Media Server and no other jobs are running 

 

5.  Is it duplicating over the LAN, or is it handled by just one media server

One media server. It is connected to both the DD880 VTL and the STK L700 through 2 four-port fiber cards: 2 fiber channels to the VTL and 2 to the physical tape drives, all connected into a switch for the TAN.
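
To put the "600 MB every hour or so" figure from answer 3 in perspective, here is a rough back-of-the-envelope calculation in Python. The 80 MB/s value is the nominal native rate for LTO-3 drives, not a number measured in this environment; it is only there as a yardstick.

```python
# Rough throughput arithmetic for the stalled duplication described above.
# Assumption: 80 MB/s (nominal LTO-3 native rate) is used only as a reference point.

def mb_per_second(megabytes: float, hours: float) -> float:
    """Convert an amount written over a number of hours to MB/s."""
    return megabytes / (hours * 3600)

observed = mb_per_second(600, 1)   # "600 MB every hour or so"
lto3_native = 80.0                 # MB/s, nominal native rate for LTO-3

print(f"observed: {observed:.3f} MB/s")
print(f"fraction of LTO-3 native rate: {observed / lto3_native:.4%}")
```

A rate this far below what the drive can do points more towards something stalling or retrying than towards ordinary slowness.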

mph999's picture

Got it, thanks. Will take a look.

martin

 

dukbtr's picture

So I have been looking at compatibility matrices and found this.

This is what we are currently using as the changer:

Robot driver = D_DOMAINRESTORER_L180 5110, EMULATING ULTRIUM_3 TAPES

According to the NetBackup HW compatibility matrix for a DD880 we can use:

D_DOMAINRESTORER_L180,
IBM^^^^^03584L32,
ADIC^^^^Scalar^i2000

According to the EMC DD880 compatibility matrix for dd_firmware 5.1 and RHEL 6, only the following is listed under changers:
IBM System Storage TS3500

On EMC's compatibility matrix for D_DOMAINRESTORER_L180, only Windows servers are listed.

Could this be contributing to the issues we are seeing? Thoughts?

Mark_Solutions's picture

Not sure on the L180 question... but does this affect all backups?

I am wondering about two things that can have a real effect on duplication, especially when de-dupe is involved:

1. Do any of these use GRT? If so, it processes all of the GRT information by mounting the image before it actually starts running the duplication, so this could be a part of it.

2. (And this really comes into play with 1. above if the answer is yes.) What fragment size do you use on your disk storage unit for the DD? This can have a big effect on duplication performance when it starts to read a large image file that gets re-hydrated via de-dupe. I try to use a 5000MB fragment size when writing to de-dupe, and the same size for all types of disk when using GRT in any backups.

Hope this helps - may be worth a try anyway to see if a smaller fragment size sorts things out for you?

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someone's advice has solved your issue - and please bring back the Thumbs Up!!.

dukbtr's picture

No, we don't use GRT. Yes, this affects all backups. We even see the issue when doing multiple copies for a backup.
 

Mark_Solutions's picture

OK - if possible try some backups using a smaller fragment size for the DD storage unit (though you didn't actually say what it was set to) and see if that improves things

dukbtr's picture

I will try that to see if it makes any difference.  We are currently using the default 1TB fragment size. 

dukbtr's picture

So I have figured out a common link between the backup images/tape numbers that get hung during a duplication. All the tapes have a backup that has "fragment=2" if I look at Images on tape. The tapes/images that duplicate fine all have "fragment=1".

So now, any ideas on what this means or what to look for next? I'm stumped as far as that goes.

I don't think I want to decrease the fragment size. Wouldn't decreasing the size just create more fragments for the backups, compounding the problem even further?

 

mph999's picture

Nice one. So to confirm: if the tape has one fragment in the image, it works; if it has 2 fragments in a given image, it fails.

So, if you decrease the frag size, you will get more fragments. It is worth trying because:

1.  If it is a number-of-fragments issue, you will compound the problem, correct.

2.  If it is a backup image size issue, you won't.

It is worth, if not vital, finding out which of these two is 'true'.
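
To make the trade-off concrete, here is a minimal Python sketch of how the maximum fragment size drives the fragment count for one image. The 300,000 MB image size is a made-up example, not a figure from this thread, and the function only models splitting by maximum fragment size; an image can also end up in more than one fragment for other reasons, for example when it spans onto another tape.

```python
import math

def fragment_count(image_size_mb: int, max_fragment_mb: int) -> int:
    """Number of fragments if the image is split only by the maximum fragment size."""
    return max(1, math.ceil(image_size_mb / max_fragment_mb))

IMAGE_SIZE_MB = 300_000  # hypothetical backup image size, for illustration only

# 1 TB default (as 1,048,576 MB), the 5000 MB suggested earlier, and a smaller value.
for max_fragment_mb in (1_048_576, 5_000, 2_000):
    count = fragment_count(IMAGE_SIZE_MB, max_fragment_mb)
    print(f"max fragment {max_fragment_mb:>9} MB -> {count} fragment(s)")
```

If the hang tracks the number of fragments, the smaller settings should make it show up more often; if it tracks image size, changing the setting should make no difference.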

Unfortunately, with some issues, playing about and trying things out can be the most powerful troubleshooting technique there is. The logs don't always give answers, and sometimes even when they do, the issue can be very time-consuming and difficult to find, and testing can help reduce this.

martin

 

watsons's picture

Just to add a few more points..

A DebugLevel=6 unified log (vxul) and a verbose 5 bptm log of those duplication jobs should provide more insight into where the bottleneck or slowness occurs: whether it's the mount/unmount of the tape resources, the allocation/deallocation of resources (nbrb process), or the duplication write simply taking longer. Did it compete against backup jobs for resources, given a lower job priority?

The configuration of the duplication needs to be checked first, along with how it interacts with the backup.

dukbtr's picture

Tried varying fragment sizes; all resulted in the same issue.

Today, though, we lost all the VTL drives for a few minutes. We're not sure why, and the sys admins and SAN admin can't tell us either. Here is what /var/log/messages had in it:

 

 

May  7 09:30:23 rhesutil02 tldcd[26537]: TLD(0) mode_sense ioctl() failed: Success

May  7 09:30:23 rhesutil02 tldd[5544]: TLD(0) going to DOWN state, status: Unable to sense robotic device

May  7 09:30:24 rhesutil02 ltid[5326]: Request for media ID LT3291 is being rejected because mount requests are disabled (reason = robotic daemon going to DOWN state)

May  7 09:32:29 rhesutil02 tldd[5544]: TLD(0) going to UP state

 

I'm thinking it may be some sort of communications issue between the media server and the VTL?
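
For anyone wanting to check how often and for how long the robot drops out, here is a small Python sketch that pairs the tldd DOWN/UP lines quoted above and reports the gap. The log text is copied from this post; the year is an assumption (syslog stamps don't carry one), and the regular expression is only written against the line format shown here.

```python
import re
from datetime import datetime

LOG = """\
May  7 09:30:23 rhesutil02 tldcd[26537]: TLD(0) mode_sense ioctl() failed: Success
May  7 09:30:23 rhesutil02 tldd[5544]: TLD(0) going to DOWN state, status: Unable to sense robotic device
May  7 09:30:24 rhesutil02 ltid[5326]: Request for media ID LT3291 is being rejected because mount requests are disabled (reason = robotic daemon going to DOWN state)
May  7 09:32:29 rhesutil02 tldd[5544]: TLD(0) going to UP state
"""

# Matches only the tldd state-change lines, not the tldcd or ltid messages.
STATE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ tldd\[\d+\]: "
    r"(?P<robot>TLD\(\d+\)) going to (?P<state>DOWN|UP) state"
)

def robot_outages(text, year=2013):
    """Return (robot, down_time, up_time, seconds_down) for each DOWN/UP pair."""
    down_since = {}
    outages = []
    for line in text.splitlines():
        match = STATE_RE.match(line)
        if match is None:
            continue
        # Syslog pads single-digit days with a space, so collapse whitespace first.
        stamp = " ".join(match["stamp"].split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")  # year is assumed
        robot = match["robot"]
        if match["state"] == "DOWN":
            down_since.setdefault(robot, when)
        elif robot in down_since:
            start = down_since.pop(robot)
            outages.append((robot, start, when, (when - start).total_seconds()))
    return outages

for robot, start, end, seconds in robot_outages(LOG):
    print(f"{robot} was DOWN from {start:%H:%M:%S} to {end:%H:%M:%S} ({seconds:.0f} s)")
```

For the lines above, that is a single outage of 126 seconds on TLD(0). Run against a longer stretch of /var/log/messages, the same approach would show whether these drops line up with the stalled duplications.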

dukbtr's picture

We have solved the issue. It was one of the ports on the fiber patch panel above the server. Only 1 of the 2 fibers in the pair was bad: the one that receives. This is why all backups (writes) worked perfectly and we only saw issues during duplications. It was also only on 1 port, which affected port 4a on the DD880. That is why the problem seemed random. A duplication or restore would run fine as long as it was using a drive on port 4b on the VTL. If it switched tapes and used a 4a drive, then it would fail.

I narrowed it down to 4a and we switched the port on the patch panel; now everything is running perfectly.
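
The narrowing-down step can be sketched in Python as a simple tally of job outcomes per VTL port. Everything in the data below is hypothetical (drive names, the drive-to-port mapping, and the job list are invented for illustration); the point is only that grouping failures by the port behind each drive makes a "random" problem stop looking random.

```python
from collections import defaultdict

# Hypothetical mapping of VTL drives to DD880 ports (not taken from the real config).
DRIVE_PORT = {
    "vtl_drive_1": "4a",
    "vtl_drive_2": "4a",
    "vtl_drive_3": "4b",
    "vtl_drive_4": "4b",
}

# Hypothetical duplication jobs: (job id, source drive used for the read, succeeded?)
JOBS = [
    (101, "vtl_drive_3", True),
    (102, "vtl_drive_1", False),
    (103, "vtl_drive_4", True),
    (104, "vtl_drive_2", False),
    (105, "vtl_drive_3", True),
    (106, "vtl_drive_1", False),
]

def failure_rate_by_port(jobs, drive_port):
    """Return {port: fraction of jobs that failed} for the drives each job read from."""
    totals = defaultdict(lambda: [0, 0])  # port -> [failed, total]
    for _job_id, drive, succeeded in jobs:
        port = drive_port[drive]
        totals[port][1] += 1
        if not succeeded:
            totals[port][0] += 1
    return {port: failed / total for port, (failed, total) in totals.items()}

for port, rate in sorted(failure_rate_by_port(JOBS, DRIVE_PORT).items()):
    print(f"port {port}: {rate:.0%} of duplications failed")
```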

Jeez, this was a rough one!!!

Thanks for the help

SOLUTION
Marianne's picture

Thanks for sharing!
