Troubleshooting Windows Server 2012 Host-Based CommVault Backups of Hyper-V Guests with the DELL Compellent Hardware VSS Provider: ‘Microsoft Hyper-V VSS Writer’ State: [5] Waiting for completion

We have been running CommVault Simpana 9.0 R2 SP7 in combination with the DELL Compellent Hardware VSS provider to do host-based backups of the virtual machines on our Windows Server 2012 Hyper-V cluster hosts, with great success and speed.

We’ve run into two issues so far. One, which I blogged about in “DELL Compellent Hardware VSS Provider & Commvault on Windows Server 2012 Hyper-V nodes – Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied”, was due to some missing permissions for the domain account we configured the Compellent Replay Manager service to run under. The solution for that issue can be found in that same blog post.

The other one was that sometimes, during the backup of a Hyper-V host, we got an error from CommVault that put the job in a “pending” status, after which it kept trying and failing. The error is:

Error Code: [91:9], Description: Volume Shadow Copy Service (VSS) error. VSS service or writers may be in a bad state. Please check vsbkp.log and Windows Event Viewer for VSS related messages. Or run vssadmin list writers from command prompt to check state of the VSS writers.

When we look at the Compellent controller we see the following things happen:

  • The snapshots get made
  • They are mounted briefly and then dismounted.
  • They are deleted

The result at the CommVault end is that the job goes into a pending state with the above error. When we look at the state of the Microsoft Hyper-V VSS Writer by running “vssadmin list writers” from an elevated command prompt we see:

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Retryable error
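
The writer check above lends itself to scripting when you have many hosts to look at. Below is a minimal Python sketch, not something we ran as part of this troubleshooting, that parses the text output of vssadmin list writers and lists the writers that are not healthy. It assumes English-language output in which a healthy writer reports state [1] Stable and last error “No error”. The function names and the second writer in SAMPLE are made up for illustration.

```python
import re

# Patterns for the three lines we care about in `vssadmin list writers` output.
NAME_RE = re.compile(r"Writer name:\s*['‘](.+?)['’]")
STATE_RE = re.compile(r"State:\s*\[(\d+)\]\s*(.+)")
ERROR_RE = re.compile(r"Last error:\s*(.+)")


def parse_writers(text):
    """Turn the text output of `vssadmin list writers` into a list of dicts."""
    writers = []
    for line in text.splitlines():
        match = NAME_RE.search(line)
        if match:
            # A "Writer name" line starts a new writer record.
            writers.append({"name": match.group(1), "state_code": None,
                            "state": None, "last_error": None})
            continue
        if not writers:
            continue  # skip the banner lines before the first writer
        match = STATE_RE.search(line)
        if match:
            writers[-1]["state_code"] = int(match.group(1))
            writers[-1]["state"] = match.group(2).strip()
            continue
        match = ERROR_RE.search(line)
        if match:
            writers[-1]["last_error"] = match.group(1).strip()
    return writers


def unhealthy(writers):
    """Assumption: healthy means state [1] and last error 'No error'."""
    return [w for w in writers
            if w["state_code"] != 1 or w["last_error"] != "No error"]


# Sample input: the first writer is the one from this post,
# the second is an invented healthy writer for contrast.
SAMPLE = """\
Writer name: 'Microsoft Hyper-V VSS Writer'
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
   State: [5] Waiting for completion
   Last error: Retryable error

Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error
"""

for writer in unhealthy(parse_writers(SAMPLE)):
    print(writer["name"], writer["state_code"], writer["last_error"])
```

On a real host you would replace SAMPLE with the captured output of the command, run from an elevated prompt.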

Note at this stage:

  1. Resuming the job doesn’t help (it actually keeps trying by itself, but no joy).
  2. Killing the job and restarting it brings no joy either. On top of that, our friendly error “Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied.” is back, but this time it is related to the error state of the ‘Microsoft Hyper-V VSS Writer’. The error has now changed a little and has become:

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Unexpected error

To get rid of this one we can restart the host or, less drastically, restart the Hyper-V Virtual Machine Management service (vmms.exe), which does the trick as well. Before you do this, pause the node and drain it; afterwards, resume it with the option to fail the roles back. Windows Server 2012 makes it a breeze to do this without service interruption :-)
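
For those who prefer scripting this over clicking through Failover Cluster Manager, here is a rough Python sketch of the same sequence. It only builds the PowerShell commands and prints them unless you turn dry_run off. The assumptions: it runs elevated on the affected node itself, the FailoverClusters PowerShell module is available, and the node name HV-NODE1 is a placeholder. We did this via the GUI, so treat it as an illustration rather than a tested procedure.

```python
import subprocess


def vmms_restart_plan(node):
    """Return the PowerShell commands, in order: drain, restart VMMS, resume."""
    return [
        # Pause the node and live migrate its roles away; wait until done.
        f"Suspend-ClusterNode -Name '{node}' -Drain -Wait",
        # Restart the Hyper-V Virtual Machine Management service.
        "Restart-Service -Name vmms",
        # Resume the node and fail the roles back to it.
        f"Resume-ClusterNode -Name '{node}' -Failback Immediate",
    ]


def run_plan(commands, dry_run=True):
    """Print the commands, or run them one by one when dry_run is False."""
    for command in commands:
        if dry_run:
            print(command)
        else:
            # check=True stops the sequence as soon as one step fails.
            subprocess.run(
                ["powershell.exe", "-NoProfile", "-Command", command],
                check=True,
            )


# HV-NODE1 is a hypothetical node name.
run_plan(vmms_restart_plan("HV-NODE1"))
```

Keeping the dry run as the default means nothing touches the cluster until you have read the three commands and agree with them.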

The Cause: Almost or completely full partitions inside the virtual machines

Looking for solutions when CommVault is involved can be tedious, as their consultancy-driven sales model isn’t focused on making information widely available. Troubleshooting VSS issues can also be considered a form of black art at times. Since this is Windows Server 2012 RTM and the date at the moment of writing is September 20th 2012, there are not yet any hotfixes related to host-level backups of virtual machines and such. CommVault Simpana 9.0 R2 SP7 is also fully patched.

This, combined with the fact that we did not see anything like this during testing (and we did a fair amount), makes us look at the guests. That’s the big difference on a large production cluster: all those unique guests with their own history. We also know from the past years with VSS snapshots in Windows 2008 (R2) that these tend to fail due to issues in the guests. Take a peek at “Troubleshoot VSS issues that occur with Windows Server Backup (WBADMIN) in Windows Server 2008 and Windows Server 2008 R2” just for starters. As an example, we had already seen one guest (a dev/test server) that had 5 users logged in doing all kinds of reconfigurations and installs go into a saved state during a backup, so it could be due to something rotten in certain guests. There is very much to consider when doing these kinds of backups.

By comparing successful and failed backups it really looked as if the failures were related to certain virtual machines. A lot of issues are caused by the VSS service not running or not being able to take snapshots because of a lack of space, so perhaps this was the case here as well?

We poked around a bit. First let’s see what we can find in the Hyper-V specific logs like the Microsoft-Windows-Hyper-V-VMMS-Admin event log. Ah, lots of errors relating to a number of guests!

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          19/09/2012 22:14:37
Event ID:      10102
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      undisclosed server
Description:
Failed to create the volume shadow copy inside of virtual machine ‘undisclosedserver’. (Virtual machine ID 84521EG0G-8B7A-54ED-2F24-392A1761ED11)

Well people, that is called a clue ;-) So we did some Live Migrations to isolate the suspect VMs on a single node, ran backups and saw them fail, then did the same with a new and clean VM, and it all worked. And indeed, looking at the guests involved when the CommVault backup fails, we see that the VSS service is running and healthy, but we do see all kinds of badness related to disk space:

  • Large SQL Server backup files put aside on the system partition or other disks
  • Application & service pack installers left behind
  • Log and tempdb volumes running out of space
  • Application logs running out of control

That last one left 0MB of disk space on the system (Test Controller TFS shitting itself), but we managed to clear just enough to get to just over 1GB of free space, which was enough to make the backup succeed.

Servers, virtual or physical ones, should be locked down to prevent such abuse. I know, I know. Did I already tell you I do not reside in a perfect world? We cannot protect against dev and test server admins who act without much care on their servers. We’ll just keep hammering at it to raise their awareness, I guess. End-user and production servers we monitor well enough to proactively avoid issues. With dev & test servers we don’t do so, or the response team would have a full day’s work reacting to all the alerts that daily dev & test usage on those servers generates.

The fix

  • Clear at least 1GB or a bit more inside each partition in the guest running on the host that has a failing backup. I prefer to have at least a couple of GB free  (10% to 15% => give yourself some head room people).
  • Then you can resume the backup job manually or let CommVault do that for you if it’s still in a pending state.
  • If you’ve killed the job, make sure you restore the Microsoft Hyper-V VSS Writer to a healthy state as described above. Thanks to Live Migration this can be achieved without any down time.
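
To catch the offending guests before the backup does, a small free-space check inside each guest can help. The Python sketch below is only an illustration of the thresholds mentioned in the list above: under 1GB free is treated as “critical” (roughly where our backups failed), under 10% free as “low”, anything else as “ok”. The labels, function names and the default path are my own choices, not part of any product.

```python
import shutil

GIB = 1024 ** 3


def headroom(total, free, min_bytes=GIB, min_fraction=0.10):
    """Classify free space: 'critical' under 1GB, 'low' under 10%, else 'ok'."""
    if free < min_bytes:
        return "critical"
    if free < total * min_fraction:
        return "low"
    return "ok"


def check_volumes(paths):
    """Return {path: classification} for each volume root or mount point."""
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        report[path] = headroom(usage.total, usage.free)
    return report


# "." checks the volume holding the current directory; inside a Windows
# guest you would pass roots such as "C:\\" and "D:\\" instead.
for volume, status in check_volumes(["."]).items():
    print(volume, status)
```

Run as a scheduled task inside the dev & test guests, something like this gives a cheap early warning without the alert flood that full monitoring would cause.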

Conclusion

There is experimenting, testing, production testing, production and finally real-life environments where not everything is done as it should be. Yes, really, the world isn’t perfect. Managers sometimes think it’s click, click, Next, click and voilà, we’ve got a complex multisite system running. Well, it isn’t like that, and you need some time and skills to make it all work. Yes, even in today’s “cheap, fast, easy, run your business from your smartphone” ecosystem of the private, hybrid and public cloud, where all is bliss and world peace reigns.

The DELL Compellent Hardware VSS provider & Replay Manager service handle all this without missing a beat, which is very comforting. Previous experiences with hardware VSS providers from other vendors make us think that those would probably have blown up by now.

I’m Presenting at the Technical Experts Conference 2012 Europe

I’ll be speaking at the Technical Experts Conference 2012 Europe in Barcelona on Windows Server 2012 Hyper-V and its storage and network related improvements and promising new features. Some of you might know that I’m a Microsoft MVP in the Virtual Machine expertise (i.e. Hyper-V), but these sessions are not marketing or vaporware. Being an MVP is about sharing knowledge and experiences with you. I’ve been an early adopter in production from the day the RTM bits became available and we’re already reaping the benefits of those features, so it’s more than just lab work and theory.

I won’t be there alone, as my friends, colleagues and fellow MVPs Aidan Finn (@joe_elway), Carsten Rachfahl (@hypervserver) and Hans Vredevoort (@hvredevoort) will be there as well to present and share their knowledge, which is extensive, I assure you. It’s great to have the chance to come together again and talk about our technology passions.

You can find an overview of the session agenda here

So I hope you can join us for an interesting conference and interactive event where we can discuss your challenges and ways to address them. Trust me when I say that talking to other customers and technologists is a great way to learn, understand the needs and find opportunities. We learn a lot from presenting and talking to you. I’ve attended a lot of conferences in my career and I still find them valuable. The return on investment for my employers has been great. Motivated and skilled employees can save a business tenfold the cost spent to keep them that way.

Haven’t heard of TEC before?

Neither had I until a couple of years ago, but by good fortune I had the opportunity to attend as a delegate and found it very worthwhile in both content and networking opportunities. As it turns out, The Experts Conference (TEC) has been running for over a decade now and it delivers level 400 sessions on core Microsoft technologies. It focuses on Active Directory and Identity, Exchange, Virtualization and User Workspace Management.

TEC Europe is held at the Hotel Rey Carlos in Barcelona from 22-24 October 2012. Quest, as an alliance partner of Microsoft, welcomes program management, product management and development staff from Redmond, as well as a number of field team members, to the event every year to support the training requirements of its users. This means two things: it’s a valuable event and, I admit, I’m honored to be invited to speak at it.

Budgets are tight

A great tip. Quest is offering a discount rate of 850 Euro to delegates who register by 21 September! You can get a discount code for registration by sending an email to [email protected]

Altaro Hyper-V Backup 3.5 Supports Windows 2012

Altaro is one of the first backup vendors to support Windows Server 2012, with the release of Hyper-V Backup v3.5. Few can match that speed to market, and then we’re talking about the likes of CommVault, whom Altaro can teach some lessons left and right. (I should know: I’m a long-time CommVault customer, and whilst it is a great product they should really address some issues. Hiring a GUI developer is one, providing decent information and accessible support is another, and we won’t even mention pricing ;-))

With Altaro Hyper-V Backup v3.5 we get full support for Windows Server 2012. That is CSV 2.0, VSS backups of SMB 3.0, etc. As an early adopter I can appreciate the speed and time to market of a backup product. I do not like 3rd party vendors holding me back from getting the most out of Volume License Software Assurance, so these things matter to me when selecting products.

Check out http://www.altaro.com/hyper-v-backup/ for more information. Some of their customers are enough to make you look at their solution. At least they made me do so => Harvard University, Max-Planck Institute, Los Alamos National Laboratory, Princeton University, US Geological Survey, etc. (I’m a scientist by training, so yes, these customers appeal to me :-))

Disclaimer: No, I have never accepted any offer of sponsorship from any vendor, even when asked, because I want to make sure you know whose story I’m telling.

Intel X520 Series NIC on Windows 2012 With Hyper-V Enabled Port Flapping Issue

When you install Windows Server 2012 RTM on a server with X520 series NICs you’ll notice that there is a native driver available and that the performance of that driver is fantastic. It’s really impressive to see.

That’s great news, but I’ve noticed an issue in RTM that I already had to deal with in the release candidate.

The moment you install Hyper-V, some of the X520 NIC ports can start flapping (connected/disconnected). You’ll see the connect/disconnect sequence repeat endlessly on one port, sometimes more.

As you can imagine, this ruins the party in Hyper-V networking a bit too much for comfort. But it can be fixed. I do not know the root cause, but it is driver related. The same thing happened in the release candidate, but now things are easier to fix. Navigate to the Intel site, download their freshly released driver for the X520 series on Windows Server 2012 and install it (you don’t need to install the extra software with Advanced Network Services => native Windows NIC teaming has arrived). After that the flapping will be gone.

Hope this helps some folks out!