GRT backup makes Hyper-V VM unresponsive...

Created: 30 Jul 2013 • Updated: 05 Aug 2013 | 30 comments

Hi all!

 

So I was all excited when SP2 was released because my migration from my current backup solution to BE was on hold due to us moving towards Hyper-V 2012. So the day it was released, I installed it and thought "Alright! Let's start backing up Hyper-V VMs!" and then BAM! All VMs on a particular CSV became unresponsive because I was backing up one VM on that CSV. The only way out was to shut the host down completely and bring those VMs back up on the other node. Not the best way to start.

 

Here's my environment and some more of my experience:

  • Hyper-V Environment
      • 2x Hyper-V 2012 hosts w/ BE 2012 SP2 RAWS agent installed.
      • Hyper-V hosts are clustered and using CSVs.
      • Storage provided by a 6 node P4500 cluster.
      • HP Lefthand MPIO used for multipathing.
      • Lefthand VSS provider not installed.
      • SCVMM 2012 SP1 is used to manage the cluster.
  • Backup Exec Environment
      • Backup Exec 2012 SP2 installed.
      • Deduplication Disk Storage configured and is the current destination for Hyper-V backups.
  • Guest VM Environment
      • Server 2012 OS w/ latest Integration Services installed.
      • BE 2012 SP2 RAWS agent installed.

I started by trying to back up just one VM to see how the whole process worked before rolling this out to other VMs. The scenario above was what happened after performing the first GRT-enabled backup of the VM. I've since moved the VM to its own CSV so I can test just this server without potentially bringing down the environment.

What I'm seeing is that backing up the VM's VHDXs is fine, every time. If BE just backs up the VHDXs, the job runs successfully and the VM stays responsive the whole time. I can make BE back up just the VHDXs by unchecking "Exclude virtual machines that must be put in a saved state to back up". If I check that box, BE first performs the VHDX backup successfully and then attempts to perform the GRT pass, but the VM becomes unresponsive, more than likely from being put into a saved state.

I've checked everything I can possibly think of to allow for live backups of the guest VM, to no avail. All disks on the VM are basic NTFS drives, each with its shadow copy storage pointed to its own drive. The correct Hyper-V Integration Services are installed and "backups" are enabled in the VM's Integration Services configuration.

I just for the life of me can't figure this out. It seems like it should be pretty straightforward. Any help would be greatly appreciated!

 

v/r,

Louis

lmosla's picture

Hello Louis C, please post an image of how you are selecting your machines.

LouisC's picture

LMosla,

 

Thanks for the prompt response! I'll attempt to post a picture here soon. I can say that I am selecting them via the "[Cluster's Virtual Name]"\"Microsoft Hyper-V HA Virtual Machines"\"[TestVM]".

Oddly enough, I'm now backing up this particular VM successfully w/ GRT. I have no idea why it started working (or even what made the CSV unresponsive yesterday) but it's good at this moment.

 

I think I'm going to expand the backup selection to contain another VM and see where it takes me.

 

v/r,

Louis

MusSeth's picture

Hello Louis,

Please check the Event Viewer and see if there are any events for the time when the system was frozen. Please let us know about any errors, warnings, or informational events you see there.

LouisC's picture

- This was an interesting error from the application log as I was trying to shut the host down gracefully:

Event ID: 31 Source: VSS

Volume Shadow Copy Service Warning: A writer with name ASR Writer and ID {be000cbe-11fe-4426-9c58-531aa6355fc4} waited 4294967 seconds for in-progress calls to complete before shutting down.

 

- This was in the system log after the backup began and right about when everything stopped working:

Event ID: 1146 Source: FailoverClustering

The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource. Please determine which resource and resource DLL is causing the issue and verify it is functioning properly.

 

- And I get these (from the Microsoft\Windows\Hyper-V-VMMS\Admin log) when backing up a VM now w/ BE:

 Event ID: 19050 Source: Hyper-V-VMMS

'TestVM' failed to perform the operation. The virtual machine is not in a valid state to perform the operation. (Virtual machine ID 0A3A39F9-B9B3-4F19-BDB0-ABB2A0076D87)

 

- This was on HyperVNode1 around the time of the incident:

Event ID: 10028 Source: DistributedCOM

DCOM was unable to communicate with the computer HyperVNode2 using any of the configured protocols; requested by PID c44 (C:\Program Files\Symantec\Backup Exec\RAWS\beremote.exe).

 

- These were scary events during the outage. This was on HyperVNode1:

Event ID: 5120 Source: FailoverClustering

Cluster Shared Volume 'Volume1' ('HQ-P4000-VOL-2') is no longer available on this node because of 'STATUS_NETWORK_NAME_DELETED(c00000c9)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Event ID: 5142 Source: FailoverClustering

Cluster Shared Volume 'Volume1' ('HQ-P4000-VOL-2') is no longer accessible from this cluster node because of error 'ERROR_TIMEOUT(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

 

 

Other than that, nothing really caught my eye.
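A side note on two of the numbers above, as a rough sketch rather than anything confirmed by the logs. The "4294967 seconds" in the Event ID 31 warning matches the largest unsigned 32-bit value read as milliseconds, so it may be a maxed-out or "infinite" timeout rather than a real wait (that reading is my assumption). The PID in the DCOM event is printed in hex, so it needs converting before you look for it in Task Manager.

```python
# Assumption: the VSS wait was tracked as an unsigned 32-bit millisecond value.
MAX_UINT32_MS = 2**32 - 1            # 4294967295 ms
wait_seconds = MAX_UINT32_MS // 1000 # whole seconds, as the event would print
wait_days = wait_seconds / 86400     # 86400 seconds per day

print(wait_seconds)         # 4294967
print(round(wait_days, 1))  # 49.7

# The DCOM event reports the requesting process as "PID c44" (hexadecimal).
beremote_pid = int("c44", 16)
print(beremote_pid)         # 3140
```

If that reading is right, the ASR Writer never actually waited ~50 days; the number is just what a saturated counter looks like.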

 

v/r,

Louis

LouisC's picture

One more that is interesting... ever since installing RAWS (the only backup agent ever installed on these Hyper-V hosts) I've been getting the following event repeatedly:

 

Event ID: 8194 Source: VSS

Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.

Operation:

Gathering Writer Data

Context:

Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220}

Writer Name: System Writer

Writer Instance ID: {1486d557-3b49-4314-8e12-db0d4b9c7d98}
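For anyone unsure what the hr value in that event means, here is a minimal sketch that splits a 32-bit HRESULT into its standard fields. The function name `decode_hresult` is just a name I made up for illustration. For 0x80070005 it yields facility 7 (the Win32 facility) and code 5 (the Win32 "access denied" error), which agrees with the message text.

```python
def decode_hresult(hr):
    """Split a 32-bit HRESULT into its standard bit fields."""
    return {
        "failure": bool((hr >> 31) & 1),  # top bit set means failure
        "facility": (hr >> 16) & 0x7FF,   # 11-bit facility code
        "code": hr & 0xFFFF,              # low 16 bits: the status code
    }

fields = decode_hresult(0x80070005)
print(fields)  # {'failure': True, 'facility': 7, 'code': 5}
```

This only tells you the error is a plain access-denied wrapped in an HRESULT; it does not say which account or permission is involved.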

v/r,

Louis

lmosla's picture

Louis, what version of Windows is the media server running? Backup Exec 2012 SP2 supports backing up Windows Server 2012, but it does not yet support Windows Server 2012 on the media server itself.

LouisC's picture

LMosla,

 

The Backup Exec Media Server is Server 2008 R2 Standard Service Pack 1. Only the Hyper-V hosts and several guest VMs are Server 2012.

 

v/r,

Louis

LouisC's picture

Branching out a bit further to back up a few more VMs appears to have been a bad idea. I have now lost all guests on the CSV that is being backed up again...

MusSeth's picture

Hello Louis,

 
 

I found this solution on one of the forums:

"I found out the problem (error 8194...IVssWriterCallback...) on Hyper-V host when backing up VMs on a CSV :
Go to DCOM setup : dcomcnfg --> Expand Component Services, Computers --> Right-click My Computer --> Properties --> COM Security tab.
Under Access Permission click Edit Default. --> add the "Network Service" account with Local Access allowed.
Restart the computer !
no more 8194 error."

http://forums.veeam.com/viewtopic.php?f=25&t=9486

As you are also seeing DCOM errors in Event Viewer, you could try this; however, I would suggest checking TechNet as well before you implement this solution.

I see lots of Windows 2012 users are getting this error with different backup applications.

LouisC's picture

This will be a bit challenging due to the hosts being Hyper-V Core boxes but I'll see what I can do.

LouisC's picture

I saw that but wasn't sure if that was going to be a "recommended" solution.

LouisC's picture

Oddly enough, all the VMs came back online. Looking at it, HyperVHost1 and HyperVHost2 never complained; it's the VMs that suffered. Apparently they lost their ability to write to their disks. One of the guest VMs logged this during the backup of the CSV its VHDXs reside on:

 

Event ID: 129 Source: storvsc

Reset to device, \Device\RaidPort0, was issued.

 

 

EDIT: I suppose I'll clarify the statement about the hosts not complaining. Looking through their event logs, there are no events hinting at the CSVs ever falling offline. Furthermore, looking at the cluster logs, not a single CSV resource failed.

 

I do see the following 252 times in a row (every two seconds) in the event logs on the host that had only the 3 VMs I was trying to back up:

Event ID: 10014 Source: DistributedCOM

The activation for CLSID {ECABAFB9-7F19-11D2-978E-0000F8757E2A} failed because remote activations for COM+ are disabled. To enable this functionality use Server Manager to install the COM+ Network Access feature in the Application Server role.

 

What's interesting is the guest VMs that had the problem were on a different node but on the same CSV as the one being backed up during the outage.

 

MusSeth's picture

I would suggest opening a support case for this, as we might have to enable logging in order to isolate the issue.

LouisC's picture

Well that was an interesting night last night. 

I created a support case as recommended. A technician got back to me last night. With his guidance, we started the backup job that was causing the outages. This particular job had been starting to bring down the Hyper-V cluster about an hour into it. About 15 minutes into the job, the tech said he'd call me back in a few minutes. He did call me back around 40-50 minutes into the job and told me it was the end of his shift and another technician would call me back in roughly 30-40 minutes. I explained that this job would start to cause an outage in approximately 10-20 minutes. He told me he would escalate it and someone would get with me in 30-40 minutes. Lo and behold, the job started to crash the Hyper-V cluster 10-20 minutes after I got off the phone with him, and no one had called me back.

So now I had a severe outage happening. I started an online chat via the support site and explained I had an open case, that I had a production outage happening, and I was "in-between" technicians. They told me they couldn't escalate it to a priority 1 via chat so I should call in. So that's what I did, I called back in.

After explaining my situation to another person, they got me in touch with another technician. While the outage was happening, we looked around. It seemed like no real troubleshooting was happening, just waiting for a complete failure (of either the backup job or the Hyper-V cluster). We ended up killing the backup engine on the media server but that didn't help bring the VMs back up. Eventually, the cluster considered one node dead and started to bring VMs up on the second node. At this point, I had one Linux box complaining about a possible corrupt volume and prompting me to attempt a repair. I got lucky and the repair was successful but nonetheless, it was a bit scary.

Once I had the VMs back up and the criticality of the outage had subsided, he had me change some settings on the backup job (VSS provider and storage destination), enable debug logging, and start the job back up. After the backup job was started and logging began, he told me to let the job fully complete and attach the logs to the case, said someone would call me tomorrow (which would be today), and got off the phone with me again, knowing that this backup job could potentially cause another severe outage!

I was completely shocked! I was let go twice when facing a potential outage. Luckily, the second job failed immediately after starting and right after the tech got off the phone with me (that ought to show how quickly I was let go). I'm still working with them but I'll be reaching out to my regional sales rep to see if I can get a different path to engineers for this case.

LouisC's picture

I wonder if I'm experiencing this....

http://support.microsoft.com/kb/2813630

 

Not sure if anyone else w/ a Hyper-V cluster has this installed or if they have any thoughts.

LouisC's picture

I think I may be on the right path... I found this KB that includes KB 2813630, and it's titled "Update that improves cloud service provider resiliency in Windows Server 2012".

http://support.microsoft.com/kb/2870270

 

This blog post seems to further point towards KB 2813630 being a resolution.

http://blog.aaronmarks.com/?p=154

MusSeth's picture

Hello Louis,

 

I apologize for what happened with support last night. Could you please provide me the case number? I will try to explain the situation to the engineer assigned to this case and get it expedited. The article you posted does seem to refer to the same issue. I would suggest applying the fix and then checking whether that resolves the issue; however, it would be best to first try a backup using the Windows utility, to confirm it's the same issue before you apply the patch.

LouisC's picture

MusSeth,

The case number is 04837204.

 

I'll take a look at attempting a backup using a Windows Utility first.

 

v/r,

Louis

LouisC's picture

Turns out my case was escalated and I'm currently working with an engineer that specializes in Hyper-V. We are pushing forward with KB 2838669 and we'll see what the results are.

 

http://support.microsoft.com/kb/2838669/EN-US

Jaydeep S's picture

I have reviewed the case as you requested. If that MS KB does not fix the issue, please email me at the address I sent you by message.

MusSeth's picture

I have informed the assigned engineer about the issue and the discussion we were having on the forums, and have suggested he go to this thread for more info. Please keep us posted here on what happens with the Windows utility (whether the backup fails or succeeds) and whether the issue is resolved with the hotfix; this might help others who are facing a similar issue.

BackupBjoern's picture

Fix in MS KB 2813630 should correct this!

LouisC's picture

Thanks everyone! So far so good!

I installed KB 2838669 (http://support.microsoft.com/kb/2838669/EN-US) yesterday, which contains KB 2813630 and a few other KBs and is specifically geared towards this kind of situation.

I was able to perform a backup of one VM, then two VMs, then three VMs successfully. These were the same 3 VMs that I was backing up when things were crashing, so I might be out of the woods.

Since these symptoms would be present regardless of backup solution, I'm thinking that during beta the tier 1 and 2 customers were probably using DPM or something of the sort and already had the patch installed? Either way, if this does resolve the issue, I'd recommend putting this into the Admin Guide or at least creating a tech article for it, because this seems like it'll impact anyone using a clustered Hyper-V 2012 environment with CSVs.

 

So, I'll slowly keep pressing forward and I'll report back any more findings.

 

v/r,

Louis

BackupBjoern's picture

We will create a Tech article for this when we know for sure the full impact of the issue and the MS fixes have got some mileage.

So for all reading this post please update with fix in http://support.microsoft.com/kb/2838669/ and advise of results.
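To confirm the hotfix actually landed on each node, one option is to dump the installed hotfix list to text and search it for the KB number. Below is a rough sketch, assuming you have saved that list as plain text with one KB ID per line (for example from `wmic qfe get HotFixID`); the sample text and function names are made up for illustration.

```python
import re

def installed_kbs(hotfix_listing):
    """Return the set of KB numbers (digits only) found in listing text."""
    return set(re.findall(r"\bKB(\d+)\b", hotfix_listing, flags=re.IGNORECASE))

def has_hotfix(hotfix_listing, kb_number):
    """True if the given KB number appears in the listing."""
    return kb_number in installed_kbs(hotfix_listing)

# Hypothetical sample listing; real output will contain many more entries.
sample_listing = """HotFixID
KB2770917
KB2838669
"""

print(has_hotfix(sample_listing, "2838669"))  # True
# KB 2813630 is contained in the 2838669 package per this thread, so it
# may not be listed under its own number.
print(has_hotfix(sample_listing, "2813630"))  # False
```

Run the check against a listing from every node in the cluster, since the fix needs to be on all of them.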

LouisC's picture

Awesome! Thanks for the feedback BackupBjoern!

I will continue to post here with results over the next few weeks.

BackupBjoern's picture

We are continuing to see positive results with this Microsoft hotfix (2838669), so a Technote is in progress. It should be done and public by this evening or tomorrow morning.

This will be the URL:

http://www.symantec.com/business/support/index?page=content&id=TECH209358

LouisC's picture

Great to hear BackupBjoern!!

 

Just an update from me: I've been successfully backing up well over 2TB the past few days without incident since the update.

 

v/r,

Louis

SymGuy-IT's picture

Hello everyone. Looks like we too are having a similar issue, but in our case we have NetBackup 7.5.0.6. The Hyper-V server is 2012 with CSVs, though.

I will see if applying the hotfix will resolve the issue.

MusSeth's picture

Hello sym guy

This is actually a Microsoft issue, and the patch was released because similar issues were observed with Windows backup. You can install the patch mentioned in this thread; hopefully it should resolve your issue.

VFJeff's picture

I have been having the same issue as described above. I found Microsoft's KB 2813630 and its successor KB 2870270 and was in the process of installing it when I came across this article; so it is good to know that it should solve our problem.

I have a question for LouisC:

You said that you were getting an error in the application log with Event ID: 8194 Source: VSS for System Writer.

We are also getting this error repeatedly, even without a backup running. I also found the same websites that say to grant the Network Service account permissions in DCOM, but we are also using Windows Server 2012 Core and DCOM is not installed.

Did you find a solution to this error?

Thanks