Netbackup 7.5 - Duplication / Replication Performance

Created: 22 Apr 2013 | 23 comments
SYMAJ's picture

I have a site with a 'strange' requirement (considering multiple 5220 appliances are in place) whereby data is required to be backed up initially to an advanced disk 'landing zone' on the appliance - and subsequently duplicated to de-dup disk on the same appliance and also to SAN attached tape.  In addition the de-dup copy is then replicated to a remote 5220 appliance - and further onto tape from there.  All very complicated, and some may say unnecessary - but that is the requirement.

After initially 'seeding' the remote appliance on site it was relocated to the DR site, and connected via an 800Mb link.  All appeared well.

SLP's are being utilised to manage the backup/duplication/replication of the data.

Approximately 160 servers are being backed up totalling around 28TB of data for a full backup.  Full backups are running mainly at weekends, with incrementals running during the week.

Using the standard LIFECYCLE parameters initially there were a high number of duplication and replication jobs running in order to satisfy the requirements.  There is a 3 x LTO5 tape library attached via SAN to both of the local appliances - with two paths from each appliance to the 'tape SAN'.  One major issue I came across was the fact that each image being duplicated 'single streams' to a tape drive, and multiplexing is not provided even when configured within storage units.

This resulted in many duplication jobs queueing for tape drives, some for many hours, and jobs failing with various errors including 83/84/190/191.
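For a sense of scale, here is a rough best-case estimate of how long the tape copy of a full backup would take. It is only a sketch: it assumes the LTO-5 native rate of 140 MB/s sustained on every drive, decimal units, no compression, and no mount or positioning time. None of those figures were measured at this site, and the function name is made up for illustration.

```python
# Back-of-envelope estimate only; the rate is an assumption, not a measurement.
LTO5_NATIVE_MB_PER_S = 140   # LTO-5 native (uncompressed) rate, assumed sustained
FULL_BACKUP_TB = 28          # from the post: around 28TB for a full backup
DRIVES = 3                   # from the post: 3 x LTO5

def hours_to_write(total_tb: float, drives: int, mb_per_s: float) -> float:
    """Hours to stream total_tb (decimal TB) across `drives` drives in parallel."""
    total_mb = total_tb * 1_000_000
    return total_mb / (drives * mb_per_s) / 3600

best_case_hours = hours_to_write(FULL_BACKUP_TB, DRIVES, LTO5_NATIVE_MB_PER_S)
print(f"{best_case_hours:.1f} hours")
```

Even with every drive streaming flat out, the tape copy of a full backup occupies most of a day, so any time a drive sits idle between single-stream jobs adds directly to the window.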

I have a support case open at present and they recommended changing LIFECYCLE parameters to 'batch' the jobs.  This was done by setting the following parameters:

MIN_GB_SIZE_PER_DUPLICATION_JOB  512

MAX_GB_SIZE_PER_DUPLICATION_JOB 1024

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

This appeared to result in duplications / replications overrunning in a big way, so I subsequently reduced these to 128GB / 512GB / 60mins.
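To make the effect of these three values easier to reason about, here is a simplified model of size-based batching. It is a hypothetical sketch and not NetBackup's actual algorithm: the `Image` class, the `next_batch` function and the oldest-first ordering are all assumptions for illustration. The idea is that images accumulate until the minimum size is reached, a batch is capped at the maximum size, and a small batch is forced out once its oldest image has waited long enough.

```python
from dataclasses import dataclass

@dataclass
class Image:
    size_gb: float
    age_minutes: float  # time since the backup image completed (assumed input)

def next_batch(pending, min_gb, max_gb, force_minutes):
    """Return the images to submit as one duplication job, or [] to keep waiting.

    Hypothetical model: not taken from NetBackup source or documentation.
    """
    batch, total = [], 0.0
    # Consider the oldest images first.
    for img in sorted(pending, key=lambda i: -i.age_minutes):
        # Stop once adding another image would exceed the maximum batch size;
        # a single oversized image is still allowed through on its own.
        if batch and total + img.size_gb > max_gb:
            break
        batch.append(img)
        total += img.size_gb
    if not batch:
        return []
    oldest_age = batch[0].age_minutes
    # Submit when the batch is big enough, or when the oldest image has
    # waited past the force threshold.
    if total >= min_gb or oldest_age >= force_minutes:
        return batch
    return []

# With the reduced values from the post (128GB / 512GB / 60mins):
waiting = [Image(50, 10), Image(40, 5)]
print(len(next_batch(waiting, 128, 512, 60)))   # still below 128GB and under 60 mins
```

In this model, a large minimum size combined with a long force timer means small incrementals can sit for hours before any job is submitted, which would be one way to end up with duplications overrunning.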

When I look into the failing duplication logs I see that the jobs have been waiting for a long time for the 'logical' tape drive resources, and when they get the 'logical' resource they then wait for the 'physical' tape resource.  When they eventually get a drive, it appears not to be allocated correctly and NBU is unable to mount the resource to the tpreq path and therefore when it comes to write to the tpreq location it gets the 83 error.  My feeling is that this is a device allocation / SSO issue.

Based on the above, is there any option for 'streamlining' the duplication process ?  The duplication step from 'landing zone' to 'de-dup' disk works fine every time - it is the 'landing zone' to tape which has all of the issues.  Failing jobs do retry and eventually succeed, but this could take days....

I understand that the data flow is out of the norm, with the advanced disk landing zone (copy 1) duplicating to de-dup disk (copy 2) and tape (copy 3) - then replication to a remote appliance (copy 1 remote) and duplicating to remote tape (copy 2 remote) - but as above this is the specific customer requirement.

Both Netbackup Masters are at 7.5.0.4, with all 3 appliances at 2.5.1B.

Any thoughts / input appreciated......

AJ.

 


Umair Hussain's picture

What are the possibilities on upgrading the appliance to 2.5.2 ??

SYMAJ's picture

Will 2.5.2 help, or just a statement ?

I have held off doing this 2.5.2 upgrade until things settle down a little more (people seem to be bombarded with alerts following the upgrade etc.).  I am always wary of jumping in too early unless there is a good reason to do so !

If this will help me then I will plan for the upgrade (Masters x 2 to 7505 then appliances x 3 to 2.5.2).

AJ

RonCaplinger's picture

Excuse me if I ask basic questions you've already tried:

  1. In your SLP's, are you specifying the correct Alternate Read Server for every duplication step?  I have seen similar behavior in my systems if I leave that blank; NBU will choose a media server that doesn't have tape drive connectivity for a duplication to tape.
  2. I've never seen duplications to tape use multiplexing.  I was under the impression that only backups from the clients would ever multiplex.  Have you used multiplexing with your duplications before?
  3. Have you considered spacing your full backups during the week?  Maybe create separate policies and schedules to run full backups on 20 servers on Sunday and incremental for the others, then another policy for 20 more servers to be backed up in full on Monday, etc. 
SYMAJ's picture

Ron,

1. Appliance A duplicates from its own advanced disk partition to its de-dup disk partition, and to its own tape drives - so no Alternate read server is specified.  The same appliance is the media server for all of its operations.

2. Neither have I, unless you are duplicating a tape which is already multiplexed - this functionality would be in my view a good enhancement request.

3. A consideration, but I only want to produce tapes (and have to move one set of tapes) once per week - not every day.

AJ

Mark_Solutions's picture

Ok - first off you cannot multiplex a tape duplication - it has never been possible but I have heard that it may be coming.

I could understand if de-dupe to tape was slower due to re-hydration but if your advanced disk to tape is slow then the first place I would look is your data buffer settings on the appliance.

You can do this in the O/S (Support - Maintenance) or via the CLISH under the Settings section for NetBackup (Settings - NetBackup DataBuffers Number Tape 64   etc..)

What you will find is that you need to set both disk and tape buffer sizes, as if you just set tape sizes then the disk will also use them (don't know why!)

So I use 64 for the number of disk and tape, 262144 for the size of tape and 1048576 for the size of disk.

If that still doesn't help then you may simply need more tape drives, but 3 x LTO5 on a single media server is more than I would normally recommend anyway, so you may need a second appliance.

Hope this helps

Authorised Symantec Consultant

Don't forget to "Mark as Solution" if someones advice has solved your issue - and please bring back the Thumbs Up!!.

SYMAJ's picture

Mark,

Thanks for the input.  The buffers on all appliances were set to the values below from installation time:

SETTINGS/NETBACKUP/DATABUFFERS SIZE TAPE 262144 – done all 3 appliances

SETTINGS/NETBACKUP/DATABUFFERS NUMBER TAPE 128 – done all 3 appliances

SETTINGS/NETBACKUP/DATABUFFERS SIZE DISK 1048576 – done all 3 appliances

SETTINGS/NETBACKUP/DATABUFFERS NUMBER DISK 64 – done all 3 appliances
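As a quick check of what these values add up to, the sketch below multiplies the number of buffers by the buffer size for each type. It assumes the sizes are in bytes and that the product is the buffer space for one stream; the total on an appliance would also depend on how many drives and streams are active at once, which is not modelled here. The function name is made up for illustration.

```python
def buffer_mib(number: int, size_bytes: int) -> float:
    """Buffer space in MiB for one stream: number of buffers x size of each."""
    return number * size_bytes / (1024 * 1024)

tape_mib = buffer_mib(128, 262144)    # NUMBER TAPE x SIZE TAPE from the settings above
disk_mib = buffer_mib(64, 1048576)    # NUMBER DISK x SIZE DISK from the settings above

print(f"tape: {tape_mib:.0f} MiB per stream, disk: {disk_mib:.0f} MiB per stream")
```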

On primary site we have two appliances, both configured the same way with 18TB of advanced disk and 22TB of de-dup disk.  Each appliance has access to all 3 tape drives.

The three LTO5 drives are not on a single media server, as they are shared between the two appliances (which are media servers) and the master server (which doesn't use tape but has access to them and also controls the robot).

As we end up with literally a hundred or more duplications to tape queueing, the addition of a couple of tape drives would not help us here.  I don't mind jobs queueing if they did not fail when they do come to run.

As you correctly say, there is no re-hydration going on here so that should not be slowing things down.

I am prepared to live with the queuing of jobs (to some extent) knowing that multiplexing is not an option when duplicating disk images to tape, but the jobs failing should not be happening.

We have the de-dup pools on the two local appliances running at approx 74% and 80% of capacity - is this an issue ?

As a point of interest concerning Global De-dup - not related to this issue - we have two local appliances running at 74% and 80% of the 22TB pool size, and when both of these replicate ALL their content to the single remote 5220 appliance which has a 39TB de-dup pool the capacity of the remote appliance pool is approx 84% !!  Shows the savings of Global De-Dup.......
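For anyone wanting to put numbers on that, the percentages quoted can be turned into terabytes as below. This is only arithmetic on the approximate figures in this post, so the result is indicative at best.

```python
LOCAL_POOL_TB = 22
REMOTE_POOL_TB = 39

# Approximate usage figures quoted in the post.
local_used_tb = 0.74 * LOCAL_POOL_TB + 0.80 * LOCAL_POOL_TB
remote_used_tb = 0.84 * REMOTE_POOL_TB

saved_tb = local_used_tb - remote_used_tb
print(f"local total {local_used_tb:.2f}TB, remote {remote_used_tb:.2f}TB, "
      f"difference {saved_tb:.2f}TB")
```

On these rounded percentages the remote pool holds a little over 1TB less than the two local pools combined.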

Any other ideas.....

AJ 

Mark_Solutions's picture

OK - so you have 2 appliances sharing three tape drives and I am guessing that each storage unit shows that there are three drives available.

So think about this scenario ....

Appliance1 has lots of duplications to do and puts them all in the queue and can have three active.

At this point it is trying to write to 3 x LTO5 drives which may be too much for it (but it may cope?) .... Appliance2 meantime has no drives to use and is doing nothing!

So you add one more drive to the tape library giving you 4 shared drives - then reduce the number of drives in each storage unit from 4 to 2 - so that each appliance can only use 2 drives at a time.

Now you will always have both appliances able to be duplicating to 2 tape drives (which is optimal with LTO5)

That has to be more efficient - by up to double the duplication speed by adding one drive (so halves your duplication window)

Now if they can cope with writing to 3 drives you could add an extra 3 drives to the library and set the storage units to 3 - so each appliance constantly writes its duplications to 3 tape drives and never has to wait for the other one - that should really shorten your duplication window

The other issue you may be having is that the allocations are constantly swapping from one server to the other - appliance1 writes a duplication, finishes, dismounts tape, gives appliance2 the drive, mounts a tape etc. This would be even worse if you have not set up a media sharing group.

So if you had 6 drives in the library then just give 3 to each media server and stop the drive sharing to remove all contention - do still use media sharing though!
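The scenarios above can be compared with a small calculation of how many drives an appliance is guaranteed in the worst case, i.e. when the other appliance has already taken as many drives as its storage unit allows. This is just a sketch of the reasoning: the function name and the worst-case assumption are illustrative, and it ignores the master server, which has access to the drives but does not use them.

```python
def guaranteed_drives(total_drives: int, per_server_limit: int, servers: int = 2) -> int:
    """Drives one server can always get when every other server is at its limit."""
    taken_by_others = per_server_limit * (servers - 1)
    return max(0, min(per_server_limit, total_drives - taken_by_others))

# (total drives in library, max concurrent drives per storage unit)
for total, limit in [(3, 3), (4, 2), (6, 3)]:
    print(f"{total} drives, limit {limit}: "
          f"each appliance is guaranteed {guaranteed_drives(total, limit)}")
```

With 3 shared drives and a limit of 3 an appliance can be left with nothing, whereas 4 drives with a limit of 2, or 6 drives with a limit of 3, guarantee each appliance its full allowance.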

Hope this helps


lmosla's picture

Hi SYMAJ,

I'd like to assist with this. DM me and send me your contact information.

Thanks,

 

 

smakovits's picture

Curious, in this document it states that we need to use "=" when defining a parameter.  However, you seem to be on 7.5 and say support said to use  MIN_GB_SIZE_PER_DUPLICATION_JOB 512 or is it MIN_GB_SIZE_PER_DUPLICATION_JOB = 512?
 

SYMAJ's picture

I will double check this for you and come back tomorrow.

AJ

SYMAJ's picture

Just Checked - there is no '=' symbol required in the LIFECYCLE_PARAMETERS file.

AJ

Mark_Solutions's picture

Prior to V7.5 the LIFECYCLE_PARAMETERS file did not have an equals sign

From 7.5 onwards it must have an equals sign to be used

http://www.symantec.com/docs/HOWTO68315


SYMAJ's picture

Mark - I went through this confusion previously, but I am using it without the = sign and changes are being effected when I change the values......... (7505).

AJ

Mark_Solutions's picture

OK - Thanks for that - best tell Symantec!


smakovits's picture

Thanks guys, now the second part of my question would be, do I create the file on my master server only or do I also create it on my media server (5220)?  My assumption is that SLPs are executed from the master, so that is where it belongs, but I want to be sure.  Thanks

Umair Hussain's picture

Guys,

 

If you are making changes from the CLISH on an appliance there is no "=" sign, but in the actual LIFECYCLE_PARAMETERS file you need to put "=".

smakovits - you only need to create the file on the Master server. If your master server is a 5220, change your SLP settings through the CLISH, because on the appliance the parameters file is linked to another location.

Umair Hussain's picture

Just saw this new update in the NetBackup 7.5 admin guide on page 584 (configuring SLP): the new format is with "=", the old format was without the equals sign.

Tanmoy's picture

Please update if you got any solution already. I am also having similar kind of issue. Thanks.

smakovits's picture

I worked with support and they confirmed that the "=" is not needed.  Here is what I have now

 

DUPLICATION_SESSION_INTERVAL_MINUTES 20
MIN_GB_SIZE_PER_DUPLICATION_JOB 256
MAX_GB_SIZE_PER_DUPLICATION_JOB 512
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

 

They told me the session interval is needed to change the default 5 minute interval, because otherwise it will still kick off every 5 minutes.  They also noted that changing the settings only means the system will "try" to group systems per SLP, but I still see lots of SLPs with one system and outside my size requirements.

 

In the end, I am not sure what these settings are actually doing for me, if anything at all.

SYMAJ's picture

I have a support case open at present and they recommended changing LIFECYCLE parameters to 'batch' the jobs. This was done by setting the following parameters:

MIN_GB_SIZE_PER_DUPLICATION_JOB 512

MAX_GB_SIZE_PER_DUPLICATION_JOB 1024

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 180

This appeared to result in duplications / replications overrunning in a big way, so I subsequently reduced these to 128GB / 512GB / 60mins.

 

When I was doing this using the above it did indeed have an effect, too much of an effect hence me reducing the numbers provided by support.

I don't recall having to perform any restart of services etc to effect these changes - I think they are read every time the interval is triggered.

Did the changes have any effect ?  If not, Just to be pedantic - did you try with the = sign in ??  Mine are working without but hey - worth a try if no effect at present ! 

AJ

smakovits's picture

What exactly do you mean by overrunning, like running way longer than they ever used to before?

watsons's picture

At 7.5.0.x you definitely need the "=" sign in all the entries of LIFECYCLE_PARAMETERS, not sure why support told you it's not needed. If you ever upgrade from a previous version to 7.5.x you will notice that file is modified to include "=" sign for all entries.

It's easy to test if they're working just by running a small backup and checking if the duplication starts after 5 minutes with the following setting:

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 5

smakovits's picture

OK, I think I can confirm that the "=" is needed after all.

Essentially I saw no changes in the way my SLPs were running when I had the rule in place or without it, so I disabled it.  However, the real reason I disabled it was because my SLPs were staying queued for long periods and the number of jobs grew out of control and I wanted to make sure that my parameters file was not the issue so I renamed it.

Well, after that, things cleared up a bit and I thought maybe I don't need it after all.  Until I moved some more jobs around.  Still, without the parameters file, I moved some jobs and suddenly some new SLPs started queueing again.  Essentially I had 400+ SLPs queued and waiting for tape.  I have 20 tape drives for those SLPs and they were running, but whenever they are set to start, they sit for hours waiting for a tape, so this is what probably leads to my large queue.

Regardless, here is the important part...

As a test to reduce the 400+ SLP jobs being queued I returned to the parameters file, but this time I added the "=":

DUPLICATION_SESSION_INTERVAL_MINUTES = 20
MIN_GB_SIZE_PER_DUPLICATION_JOB = 256
MAX_GB_SIZE_PER_DUPLICATION_JOB = 512
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 180

This morning my development VMs started and all but 4 finished before I finally saw an SLP kick off.  It has essentially all of my development VMs, size was 296GB, so this tells me that it was definitely working with the "=".  Before, when the "=" was not in there I saw no change, dev vms would start and SLPs would contain only a few machines, telling me that when I originally had the parameters file in place it was doing nothing and had no impact on the way my SLPs were being queued.

My master is on UNIX and is running 7.5.0.6
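Since the "=" question caused so much confusion in this thread, a small script along these lines could flag entries written in the old style before the file is put in place. It is a sketch only: the function name and the regular expression are made up for illustration, it assumes every entry is an upper-case name followed by an integer, and it only inspects text you hand to it. It says nothing about how NetBackup itself parses the file.

```python
import re

# NAME, optional "=", integer value (assumed entry shape, for illustration only)
LINE_RE = re.compile(r"^\s*([A-Z_]+)\s*(=)?\s*(\d+)\s*$")

def check_parameters(text: str):
    """Return (parsed values, warnings) for LIFECYCLE_PARAMETERS-style text."""
    parsed, warnings = {}, []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        m = LINE_RE.match(line)
        if not m:
            warnings.append(f"line {n}: unrecognised entry: {line.strip()!r}")
            continue
        name, equals, value = m.groups()
        parsed[name] = int(value)
        if equals is None:
            warnings.append(f"line {n}: {name} has no '=' sign")
    return parsed, warnings

sample = """\
DUPLICATION_SESSION_INTERVAL_MINUTES = 20
MIN_GB_SIZE_PER_DUPLICATION_JOB 256
MAX_GB_SIZE_PER_DUPLICATION_JOB = 512
"""
values, problems = check_parameters(sample)
for p in problems:
    print(p)
```

Run against the sample above, it reports the second line as missing its "=" sign and leaves the other two alone.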