
SSR 2011 offsite copy hangs at 2%, 5%

Created: 15 Aug 2012 • Updated: 27 Sep 2012 | 14 comments
This issue has been solved. See solution.

Hi there,

 

We are currently running SSR 2011, backing up 11 HP ProLiant ML350, IBM eSeries, and Dell PowerEdge servers to a local LaCie 10TB 5big Network 2 NAS device, with an offsite copy over a 10Mb Metro-E fiber link to a duplicate NAS device at a remote branch. Six of the 11 servers back up both locally and offsite fine; the remaining five never complete the offsite copy, but do complete their incremental recovery points to the local NAS successfully each day.

The partitions vary from multiple logical drives using RAID 5 to a single mirrored partition per server. The failures are not consistent across similar ML generations with similar partition/RAID layouts.

There are multiple discussions that dance around this issue. Before we resort to bringing the offsite array on-site, performing the backups during the day, and removing the offsite device each night, we want to try one more thread and see if any progress has been made on resolving this.

Thanks!

14 Comments

Markus Koestler:

Have you created a support call for this issue? Are you running SSR 2011 SP2?

*** Please mark thread as solved if you consider this to have answered your question(s) ***

mdwhome:

Hi Markus,

Yeah, I tried to create a case for this, but apparently we are not covered by a service agreement; the licenses were purchased through a VAR.

We are running version 10.0.2.44074. It does not say SP2. But have you heard of the conditions I mentioned?

Markus Koestler:

Yes, that is SP2. If you have a look in these forums, there are a couple of threads related to offsite copy, yes.


TRaj:

Hi mdwhome,

Do you receive any errors while backing up to the NAS on the servers whose offsite copies fail?

Thanks

We request that you mark the thread as Solved, so that it is easier for viewers to search for and refer to posts with solutions.

Markus Koestler:

Were you able to resolve the issue?


mdwhome:

Sorry it's taken so long to comment; I missed the comment from Tripti.

No errors are reported backing up to the offsite NAS, and the local NAS works fine for all local servers. It affects just a few servers, all running at least W2K3 SP2 (some, but not all, are R2), and all but one are 4th- and 5th-generation HP ProLiant ML350s. The last one is an IBM eSeries.

I think at this point there are two ways to try to resolve the issue. The first is to bring the "offsite" NAS device on-site and run the backups during the day, removing the offsite array each evening. The more radical option is to remove and reinstall SSR on the problem servers, re-create the jobs, and rebuild the recovery points.

 

Markus Koestler:

Please let us know the outcome.


TRaj:

Hi mdwhome,

I would suggest going with the first option. Also, you can refer to: http://www.symantec.com/docs/HOWTO48460


mdwhome:

Here is something. I moved the offsite array on-site, and on the servers where the data partition fails to copy offsite, the copy runs extremely slowly, taking hours where the better-behaved servers finish their offsite copies in seconds or minutes. I'm talking gigabytes in minutes. The particular server taking so long on the offsite copy backs up over 100GB to the on-site array in minutes.
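As a sanity check on those numbers, here is a rough back-of-envelope sketch (my own arithmetic, not from the thread) of why the same 100GB recovery point that finishes in minutes on the LAN can take most of a day over a 10Mb Metro-E link, even before any competing traffic:

```python
# Back-of-envelope transfer times for a 100 GB recovery point.
# Assumes ideal throughput: no protocol overhead, no competing traffic.

def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours to move size_gb gigabytes over a link of link_mbps megabits/s."""
    bits = size_gb * 8 * 1e9           # decimal GB -> bits
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600

lan_minutes = transfer_hours(100, 1000) * 60   # gigabit LAN to the local NAS
wan_hours = transfer_hours(100, 10)            # 10 Mb/s Metro-E to the branch

print(f"100 GB over GbE:     ~{lan_minutes:.0f} minutes")
print(f"100 GB over 10 Mb/s: ~{wan_hours:.0f} hours")
```

So a single large base recovery point saturates a 10Mb pipe for roughly a full day, which is consistent with the "hours versus minutes" gap described above.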

Any thoughts?

Markus Koestler:

Hm, have you tried the SSR 2011 performance registry keys?


TRaj:

Also, you can check the ports.

 


mdwhome:

Thanks for sticking with us on this issue!

I'll look for the performance registry keys, apply them, and we'll see. We are charting each server on a spreadsheet to gauge start-to-completion times and which ones succeed and fail with both arrays on-site, while taking one offsite each day.

Tripti, can you elaborate on which ports we can check?

 

Thanks again.

Markus Koestler:

These ports: http://www.symantec.com/business/support/index?pag...
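The linked article lists the exact ports SSR needs. As a quick sketch (my own, not from the article), a plain TCP reachability test against the standard SMB/CIFS ports, 139 and 445, would at least show whether the path from a problem server to the NAS is open; the hostname "offsite-nas" is a placeholder for the real NAS name or IP:

```python
# Quick TCP reachability check from a problem server to the NAS.
# Ports 139 and 445 are the standard SMB/CIFS ports; the full list
# SSR requires is in the Symantec document linked above.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "offsite-nas" is a placeholder for the real NAS hostname or IP.
for port in (139, 445):
    state = "open" if port_open("offsite-nas", port) else "blocked/closed"
    print(f"port {port}: {state}")
```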


mdwhome:

Resolved!

It's all in the schedule... the bottom line is: don't try to take too many base and incremental recovery points. We were in a constant loop of offsite "catch-up". Each morning we would come into work and have to restart the Symantec System Recovery service because our 10M - 100M Metro-E pipe was saturated, bogging down application access for our branches, which are all connected via T's but need to reach the Citrix apps and other databases over the 10M pipe.

When we moved the offsite array to the same location as the primary array, we also adjusted the base recovery points and stopped the incrementals completely. When that failed with the same 3% hang, we analyzed the logs and found there was too much traffic on the arrays; the offsite copies could never complete, so we could not remove the offsite array.

We backed the base recovery points off to two per week, staggered the servers so that no more than two were running each night, and scheduled the non-base days' incremental points at staggered times. The result was hundreds of incremental recovery points being removed from both arrays, and THEN the remaining recovery points created their corresponding offsite copies. Once we noticed it was "self-healing", we left the arrays together in the same location overnight. Now all incremental recovery points are again completing their backups and offsite copies within minutes.
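The staggering described above can be sketched as a simple round-robin with a per-night cap; the server names and the cap of two are illustrative, not the actual job configuration:

```python
# Sketch: spread backup jobs across the week so no more than
# max_per_night servers run a base recovery point on any one night.

def stagger(servers: list[str], max_per_night: int = 2,
            nights: int = 7) -> dict[int, list[str]]:
    """Round-robin servers across nights, capping each night's load."""
    schedule: dict[int, list[str]] = {n: [] for n in range(nights)}
    night = 0
    for server in servers:
        # Advance to the next night that still has a free slot.
        while len(schedule[night]) >= max_per_night:
            night = (night + 1) % nights
        schedule[night].append(server)
        night = (night + 1) % nights
    return schedule

servers = [f"server-{i:02d}" for i in range(1, 12)]   # the 11 servers
for night, batch in stagger(servers).items():
    print(f"night {night}: {', '.join(batch) or '-'}")
```

With 11 servers, a cap of two, and seven nights, every server gets a slot and no night carries more than two jobs, which keeps the nightly traffic on the pipe bounded.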

The service was not really hanging; it was trying to reconcile recovery points with the offsite copies. But because we kept stopping and restarting the service, so many copies failed that the offsites could never finish the earlier copies, let alone the previous night's.

The next step is to move the offsite array permanently to the 100M location and monitor. Worst case, we move the array back to the primary location daily and remove it at COB.

Thanks for the suggestions and links; I hope this helps others in similar situations.

SOLUTION