
What's the industry standard for large enterprise backup success rate?

Created: 22 Jan 2013 | 9 comments
perojr:

Hello,

Does anyone know what the industry standard is for backup success rates in large enterprises?

Does anyone know where I can find such info? I understand that it's not easy to achieve a 100% success rate all the time, and I was wondering if there is any standard I can refer to.


Comments (9)

Nagalla:

I have not seen any such thing...

The success rate always depends on how strong your environment (infrastructure/design) is and how strong your support team is.

Generally the success rate is a commitment/contract with the customer that comes out of a discussion about the environment.

I have seen success rates between 80% and 99%.

I have seen 100% a few times, but it is not constant (because there are a lot of dependencies: network/OS/hardware/resources, etc.).

mph999:

It depends on who is running the backups and how well the environment is designed.

At my previous job we maintained about a 98.6% success rate on roughly 70 TB a night across about 2,500 servers.

I have seen 100% a couple of times; these were in very slick and well-designed environments.

The Symantec Managed Backup service has achieved a success rate of over 99% for every customer over the past 12 months.

Before moving to this service, those customers averaged a 70% backup success rate.

The service is a little more complex than just running the backups; it involves an overhaul of the system, putting right what is wrong, and numerous other steps. It does show, however, what is possible.

My personal view is that backup success starts with correct system design; without this, it will fail at some point. Correct system design also includes 'capacity planning' for future growth.

For example, one common problem is that SLPs can't complete their duplications before the backups start running again.

Was the system designed to run SLPs, or have you just started using them?

Usually, the answer is 'just started to use it'... and there is the problem: effectively incorrect design (in this case, usually an insufficient number of tape drives), and no amount of screaming and shouting at me is going to fix that.

You will notice I mentioned a 98.6% backup success rate. Did we really achieve that every day? Well, yes... but...

I never said that was on one backup server.

There were multiple backup servers (the product wasn't NBU, as it happens), about 20 in total. No backup server was busy for more than about 16 hours in a day.

Therefore, if there was a failure, there was sufficient time within a 24-hour period to rerun that backup. So yes, we got 98.6%; some of the backups had to be rerun, but within each 24-hour period we did get 98.6%.

Also, because we had multiple servers, if one went down we only broke about 5% of the backups, not 100%.
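
To make the arithmetic concrete, here is a minimal sketch in Python (the job records and client names are hypothetical, not real NetBackup output) of a daily rate where a rerun inside the 24-hour window still counts as a success:

```python
# Hypothetical daily job outcomes: a backup that fails but is rerun
# successfully inside the same 24-hour window counts as a success.
jobs = [
    ("db01", True),   # first attempt succeeded
    ("web01", True),  # failed at 01:00, rerun at 09:00 succeeded
    ("web02", False), # failed and could not be rerun in the window
    ("app01", True),
    ("app02", True),
]

successes = sum(1 for _, ok in jobs if ok)
print(f"daily success rate: {100.0 * successes / len(jobs):.1f}%")  # 80.0%

# With ~20 independent backup servers, losing one whole server only
# affects 1/20 = 5% of the jobs, not 100%.
```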

Compare this to what I see now.

A single backup server running 24/7, or nearly so.

A small issue automatically becomes a major one, as there is no spare time capacity to rerun a backup if it fails, and if the server goes down, 100% is lost.

SLPs are a great example again: if the backups and duplications take almost 24 hours, then if there is some issue there is no possibility of catching up before the next set of backups and duplications needs to start, so a fault in this type of environment immediately has a massive impact on the average success rate.
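
As a rough illustration with made-up window durations (this is not a NetBackup calculation, just arithmetic), you can sanity-check how much catch-up headroom an SLP cycle leaves:

```python
# Hypothetical durations, in hours, for one daily cycle.
backup_window = 10       # nightly backups
duplication_window = 13  # SLP duplications

headroom = 24 - (backup_window + duplication_window)
print(f"headroom: {headroom} h")
# 1 h of slack: a single hiccup and the duplication backlog can never
# clear before the next night's backups start.
```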

Regards, Martin

Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Nicolai:

Great post by Martin, +1 from me.

I see more and more customers requiring a 99-100% success rate. The key factors for reaching this target are:

  • Well-defined procedures/processes to handle failed backups. Hardware alone will not do it.
  • Control of the IT stack (e.g. patched servers, NIC drivers that are not too old, etc.).
  • Control of the network.
  • A good "operation requirement specification": what levels and combinations do you support? Do down/unstable servers count in the KPI? How slow can a server go before it is excluded from the success rate? Just to give some examples (see the sketch after this list).
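
For instance, here is a minimal sketch (with an invented record format; the exclusion rule is just one example of what such a specification might define) of a KPI that excludes hosts that were down for the whole window from the denominator:

```python
# Hypothetical KPI rule: hosts flagged as down for the whole window are
# excluded from the denominator rather than counted as failures.
jobs = [
    {"client": "db01",  "ok": True,  "host_down": False},
    {"client": "web01", "ok": False, "host_down": True},   # excluded
    {"client": "app01", "ok": False, "host_down": False},  # real failure
]

counted = [j for j in jobs if not j["host_down"]]
rate = 100.0 * sum(j["ok"] for j in counted) / len(counted)
print(f"KPI success rate: {rate:.0f}%")  # 50% of the two counted jobs
```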

 

Assumption is the mother of all mess ups.

If this post answered your question, please mark it as a solution.

perojr:

Thanks for the reply guys.

And yes, Martin's post also gets +1 from me.

From what I understand overall, it really does depend on the environment design and how strong the support team is, especially when we are talking about a large enterprise environment.

Martin does have a good point there.

However, customers are quite demanding at times and expect no backup failures on any day (a 100% success rate). That is why I'm looking for the standard requirements currently practiced in large enterprise environments.

Really appreciate your feedback there. 

watsons:

Customers always want more, but that does not mean they know what they want.

Again, it comes down to how important your backup data is. Data can be categorized into different criticality levels, but a lot of people out there do not understand or care about this, and a lot of people don't do it. In NetBackup, if you are using SLPs, you can do this with the "Classification Type" (platinum, gold, silver & bronze); it is a very good idea, but not many who deploy SLPs really look into it.

In a role I worked in the past, although the goal was to achieve a minimum 90% success rate, we could live with 80+% for weeks (of course, you need to fix those failures eventually) given a good reason. For instance, status code 1 (partially successful) can be considered good or bad depending on how you look at it. Some data, such as the music/video files in user folders, although backed up daily, is not considered important at all. That means if the backup of such a folder fails, we don't treat troubleshooting it as a priority.

My opinion: if you can communicate with your customer effectively and get them to understand that their data can be categorized into different criticality levels, each with its own KPI, this will make your life easier and stop you focusing blindly on achieving a 99 or 100% success rate.
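
A minimal sketch of what per-tier KPIs could look like, borrowing the SLP classification names (all targets and measured rates below are made up for illustration):

```python
# Hypothetical per-tier targets and measured success rates.
targets  = {"platinum": 99.5, "gold": 98.0, "silver": 95.0, "bronze": 90.0}
measured = {"platinum": 99.7, "gold": 97.2, "silver": 96.1, "bronze": 88.0}

for tier, target in targets.items():
    status = "OK" if measured[tier] >= target else "MISS"
    print(f"{tier:8} target {target:5.1f}%  actual {measured[tier]:5.1f}%  {status}")
```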

mph999:

Thank you for your kind words.

Other top tips ...

As Nicolai mentions, have control of the network and make sure it works. Many NBU issues are network-related.

If it's not in the manuals, don't do it. Things are unsupported for a reason.

Keep things simple - the more complex an environment, the harder and longer it takes to troubleshoot.

Backups fail; it's not the end of the world. If the backup is not essential (e.g. an OS backup), then the previous day's could probably be used, meaning you don't need to prod and poke NBU to try to make it work. (...I know OS backups are important, but probably not quite as important as DB archive logs...)

If you change something, write it down ...

Don't change things unless you have a very good reason and know what the change will do.

If you don't know what the change will do, ask ... (or look up in the manuals)

NBU is a good troubleshooting tool: if there is a robot, drive, OS, or network issue, NBU will find it.

Just because NBU reports an error does not mean it caused it.

M

 

Regards,  Martin
 
Setting Logs in NetBackup:
http://www.symantec.com/docs/TECH75805
 
Chrisb2k:

Just to add my 2p worth: I've seen one environment that had a 100% success rate almost every night. It was all Unix (HP-UX), mounting NFS shares to the master/media backup server, and it only ever failed when an NFS handle became stale. This is the only 100% environment I've ever seen, and it is DEFINITELY the exception.

The problems I see most are network-related, as Martin and Nicolai state above, where NBU admins have little or no control over the network or are not involved in the change control process in any way. In SAN-heavy environments, such as VMware and FT-type backups, the same is true, i.e. the NBU guys have no control or input outside of their own area.

If failures are regularly high, it's usually due to the above, possibly combined with an over-complex design and/or trying to do too much with too little (or the all-too-familiar "budget constraints"). In my experience, backups can even these days be an afterthought that very suddenly becomes the focus of attention when a critical recovery is required.

So I'd say that in a well-scoped, well-invested, well-specced environment that is capacity-managed and scalable, there's no reason not to see 95%+ success regularly. Any one of these lacking can see the numbers drop fast.

RonCaplinger:

I'll also mention something I found that makes this more difficult: proper tools to report your success rate. You may have to calculate this manually yourself, unless you are VERY good at scripting.

For instance, if your backup is multi-streamed, some reporting software will report a success if any one of those streams returns status code zero. Some may only report it successful if the last stream to complete returns a zero. And a new version of that reporting software may change how this is calculated, with nothing in the release notes calling out the change.

For those jobs that automatically restart, does the reporting software properly take that into consideration? For example, if the original backup started before midnight was a full backup and it was restarted after midnight, did it restart as a full or as an incremental? For manual restarts, how will it know, if the stream for the C: drive failed, that the restarted stream isn't the E: drive instead?

And what happens if the first backup is mostly successful, but you rerun the entire backup and it is still running when the report is created: is that first one still counted against that day's total? Or if you have a hardware problem that causes all the initial backups to fail, but you restart hours later and they are still running? I have at least one issue like this every quarter that totally skews my report; it looks like half of the backups for a certain day failed, but because I restarted them afterwards, everything was actually successful, just late.
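
To show the shape of the problem, here is a minimal sketch (the attempt records and status codes are invented for illustration, not parsed bpdbjobs output) that counts a multi-streamed, restarted job fairly: a client/policy/day succeeds only if every stream has at least one successful attempt in that day's window:

```python
from collections import defaultdict

# Hypothetical attempt records: (client, policy, stream, status, day).
# Status 0 = success; a failed first try followed by a good rerun in
# the same day's window should not count as a failure.
attempts = [
    ("web01", "FS", "C:", 58, "2013-01-21"),  # first try failed
    ("web01", "FS", "C:", 0,  "2013-01-21"),  # rerun succeeded
    ("web01", "FS", "E:", 0,  "2013-01-21"),
]

# Best result per stream per day: True once any attempt succeeds.
stream_ok = defaultdict(bool)
for client, policy, stream, status, day in attempts:
    stream_ok[(client, policy, stream, day)] |= (status == 0)

# A job succeeds for the day only if all of its streams succeeded.
job_streams = defaultdict(list)
for (client, policy, stream, day), ok in stream_ok.items():
    job_streams[(client, policy, day)].append(ok)

for job, oks in sorted(job_streams.items()):
    print(job, "SUCCESS" if all(oks) else "FAIL")
```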

FYI, I've had to resolve many of these same problems myself, and I have yet to find a reporting package that handles all of these issues, including OpsCenter Analytics.

Stumpr2:

EMC gave a presentation on Avamar, and the slides showed a 100% success rate. It got a chuckle from a few of us. The others were busy signing up for the night's activities: a snipe hunt.

VERITAS ain't it the truth?