Introduction
With Windows Server 2016 came the hope and promise of improved backups for Hyper-V environments. And indeed Microsoft delivered on that and has given us faster, more scalable and more reliable backups. With VHD sets also came the promise of host based backups for guest clusters.
The problem is that this promise or, as it is perhaps better to be mild and careful, that expectation has not been met. Decent, robust host based backups of guest clusters in Windows Server 2016 are still not a reality. For me this means it blocked a few scenarios and we’re working on alternatives. This is a missed opportunity I think for MSFT to excel at virtualization.
The problem
Doing host based backup of guest clusters with VHD Set disks is supported in Windows Server 2016 under certain conditions.
At RTM it became clear that CSV inside the guest cluster was not supported.
You need a healthy cluster with all disks online
These requirements are reflected in Errors discovered during backup of VHDS in guest clusters
Error code: ‘32768’. Failed to create checkpoint on collection ‘Hyper-V Collection’
Reason: We failed to query the cluster service inside the Guest VM. Check that cluster feature is installed and running.
Error code: ‘32770’. Active-active access is not supported for the shared VHDX in VM group
Reason: The VHD Set disk is used as a Cluster Shared Volume. This cannot be checkpointed
Error code: ‘32775’. More than one VM claimed to be the owner of shared VHDX in VM group ‘Hyper-V Collection’
Reason: Actually we test if the VHDS is used by exactly one owner. So having 0 owner also creates this error. The reason was that the shared drive was offline in the guest cluster
Unfortunately, these are not the only problems people are facing. Quite often the backup software doesn’t support backing up VHD Sets, or when it does, the backups fail. Some of those failings, like being unable to checkpoint the VHD Set, have been addressed via Windows Updates. But there are other issues.
Let’s look at the two most common ones.
Issue 1
You can make one backup and all subsequent backups fail. This is due to the avhdx files being in use and locked. This means that as long as the cluster is up and running the recovery checkpoint chain keeps growing. This can be “cleaned” or merged but only by taking down the cluster.
At the first backup life seems good.
The recovery checkpoint as a collection is indeed working.
All attempts at another backup fail.
Shutting down all cluster VMs and starting them up again does merge the recovery checkpoints.
Issue 2
You can make backups successfully, but the recovery checkpoints never get merged.
This sounds “better” but it isn’t. There is no way to merge the checkpoint. Manually merging the checkpoints of a VHD Set is bad voodoo.
Both situations get you into problems and I have found no solution so far. At the time of writing I’m back at the “never ending” recovery checkpoint chain situation. But that can change back to the 1st issue I guess. Sigh.
I have found no solution so far
For now I have been unable to solve these problems. There is no fix or even a workaround. The only way to get out of this stalemate is to shut down every node of the guest cluster and then start them all again. Just restarting the guest nodes of the cluster doesn’t do the trick of releasing the checkpoint files and merging them. While this allows you to take one backup successfully again, the problem returns immediately. For your reference, that was my issue with the October 2017 CU (KB)
The other scenario we run into is that the backups do work but the recovery checkpoints never ever merge. Not even when you shut down all the guest VM cluster nodes and start them again. With frequent backups that turns into a disaster of a never ending chain of recovery checkpoints. This is actually the situation I was in again after the November 2017 updates on both guests & hosts (KB4049065: Update for Windows Server 2016 for x64-based Systems and KB4048953: 2017-11 Cumulative Update for Windows Server 2016 for x64-based Systems).
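Since the recovery checkpoint chain shows up as avhdx files on disk, one way to keep an eye on it between backups is to count those files. The following is only a minimal sketch, not a fix: Get-AvhdxSummary is a hypothetical helper name, and the example path is an assumption that you would replace with the folder holding your own VHD Set files.

```powershell
# Hypothetical helper (not part of the Hyper-V module): summarises the
# .avhdx files found under a folder so chain growth can be tracked over time.
function Get-AvhdxSummary {
    param(
        [Parameter(Mandatory)]
        [string]$Path   # folder that holds the VHD Set files (assumption: yours will differ)
    )

    # @() forces an array, so .Count is reliable for zero or one matches.
    $files = @(Get-ChildItem -Path $Path -Filter '*.avhdx' -File -Recurse)

    [pscustomobject]@{
        Path       = $Path
        AvhdxCount = $files.Count
        Files      = $files.Name
    }
}

# Example call, with a made-up path:
# Get-AvhdxSummary -Path 'C:\ClusterStorage\Volume1\GuestCluster'
```

Running it before and after each backup, and again after a full shutdown and start of all guest nodes, shows whether the count keeps climbing or drops back once a merge has happened.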
To me this situation is blocking the use of guest clustering with VHD Sets where a backup is required. For many reasons we do not wish to go the route of iSCSI or vFC to the guest. That doesn’t cut it for us.
Conclusion
Host level backups of guest clusters in Windows Server 2016 are still a no go. This despite the good hopes we had with VHD Sets to address this limitation and which we were eagerly awaiting. For many of us this is a show stopper for the successful virtualization of guest clusters. Every month we try again and we’re not getting anywhere. Hence the frustration and the disappointment.
More than 1 year after Windows Server 2016 RTM we still cannot do consistent host level backups of a Hyper-V guest cluster, not even of those that avoid CSV and use only standard clustered disks. Trust me on the fact that many of us have given this feedback to Microsoft. They know, and I suggest you keep voicing your concerns to them in order to keep it on their radar screen and higher on the priority list. You can do this by opening support calls and by asking for it on user voice. Please Microsoft, we need these workloads to be first class citizens. I’m clearly not the only unhappy camper out there, as is noticeable in various support forums: Cannot create checkpoint when shared vhdset (.vhds) is used by VM – ‘not part of a checkpoint collection’ error and Backing up a Windows Failover Cluster with Shared vhdx?
Thanks for voicing this out, exactly same situation here as well.
I know it’s far from ideal, but I’m thinking to work around it using Storage Replica of the shared guest-volume to another single VM, which I then should be able to simply backup, leaving the guest-cluster alone backup-wise… Still, not what you’d want though… Have not had the time yet to try and see if this would work though…
Move away from Windows Server 2016. It is still an immature product that will never mature anyway. With Microsoft releasing cumulative updates like they do with their Windows 10 joke, you will never find a solution and will always bang your head against the wall
While I have some valid grievances I do run and support Windows Server 2016 with great success. That’s all about careful planning & testing. It would be unwise to throw all that good away due to a couple of real issues.
Sadly, Storage Replica is a no go. Can only do Single-Single or Clustered-Clustered. Cannot do Clustered-Single. Bummer again 🙂
I guess it’s my own fault but I’ve got my VM in multiple groups with the same name.
”
PS U:\> Get-VM | ft Name, Groups -AutoSize
Name Groups
---- ------
…..QL1 {Hyper-V Collection, Hyper-V Collection, Hyper-V Collection}
”
I’m unable to delete the vm-groups and I can’t remove the VM as a member..
Also the vhds and vhdx are “checkpointed” but no checkpoint shown in GUI……
https://i.imgur.com/T9L83kf.png
It was possible by ID
PS U:\> Remove-VMGroupMember -id ddc55c96-51e6-4195-b0ab-ba65538eadb1 -VM $VM
That does not solve my nonexistent checkpoint issue though…
Hello there,
I’m searching for a good reason to use Guest Cluster.
Initially we used cluster for High Availability. We had two or more physical servers and one (or more) storage LUN(s) presented to those servers.
Now we have the same scenario with a Hyper-V cluster or vSphere. Do we really need a guest cluster at the cost of backups and of giving up the snapshot/checkpoint facility?
Even now we have Veeam like backup software (this is what I’m using) as a Virtualization Backup Solution that provides many features that were not the part of old backup solutions like very short RPO & RTO, Instant On, as many restore points, easy granular restore, etc.
We are in the same pickle with Guest Clusters and VHDS. We have a new Server 2016 Hyper-V cluster using S2D that we have to tear down and rebuild. We were inventing the wheel when we built it, and it’s not very stable. Several design mistakes have come to light, and we now have a much better design to implement, but we have to move all the running VMs from the crippled cluster to a temporary cluster so we can rebuild the original. Moving the standalone VMs was easy using export/import, but we have an SQL Guest Cluster using VHDS. I’ve been searching high and low for a way to move that Guest Cluster to our temporary cluster. Hyper-V Replica sounded likely, but there doesn’t seem to be a working procedure for that yet from Microsoft. It’s “supported” but no way to implement it as of today. Veeam also can’t back up those SQL VMs either, so that’s strike 2!
This leaves me trying to convince my boss that Guest Clusters with VHDS just aren’t viable today when you are implementing a highly available infrastructure with cross-site DR capability.
Does anybody have hope that this will be resolved soon?
Dear Scott,
For your problem of backing up of guest cluster, you can use:
1- Veeam Agent that is available for Windows as well as for Linux. Once you backup, you can then restore as vhdx or vmdk. https://helpcenter.veeam.com/docs/agentforwindows/userguide/integration_disk_restore_launch.html?ver=21
2- You can install vmware converter inside your VM and can convert your VM to VM.
Hope this solves your problem. 🙂
Iqbal
I use Veeam with a Hyper-V 2016 cluster and run into this time and time again with “Production Checkpoints”. Sometimes they never end up merging and I have to manually edit disks and manually merge them all into the parents, but the view still shows the checkpoints. Only way I found to get rid of them is creating a new VM with the existing disks.
It really would be nice to be able to get rid of them because this is a huge pain.
This blog is about the issues with VHD Set guest clusters. If you have the problems that you describe with non clustered guest VMs, this is often caused by a backup infrastructure related issue somewhere that leads to backup hiccups. I’ve blogged on dealing with this here https://blog.workinghardinit.work/2015/10/15/remove-lingering-backup-checkpoints-from-a-hyper-v-virtual-machine/ Most often it doesn’t require manual cleanups, but if it does I have some blogs on that https://blog.workinghardinit.work/?s=merge+checkpoints&submit=Search . If that happens a lot, it’s something you need to address by finding the root cause.
With VHD Set the issues are, unfortunately, less easy to fix.
Have you ever tried to manually merge those avhdx files?
Yes, but that’s not sufficient, the link between the vhds and the backing file is broken
This is now maybe fixed?
https://support.microsoft.com/en-us/help/4230569/error-messages-when-you-try-to-back-up-vms
https://www.veeam.com/kb2709
Yes, they have fixed many issues but there are still outstanding ones, some I haven’t even blogged about yet. On the whole, guest clustering with host level backups is lacking in robustness for now, so it is still lab only, where the testing continues. Some extra blogs you can look at: https://blog.workinghardinit.work/2018/09/10/correcting-the-permissions-on-the-folder-with-vhds-files-checkpoints-for-host-level-hyper-v-guest-cluster-backups/ and https://blog.workinghardinit.work/2018/09/27/live-migration-fails-due-to-non-existent-sharedstoragepath-or-configstorerootpath/
Still at Feb 2019 level (both host & guest), with latest Veeam, attempt at VM Cluster backup leaves recovery checkpoint chain (visible in filesystem, but not in Powershell or GUI)
Basically UNUSABLE!
Please read https://blog.workinghardinit.work/2018/09/10/correcting-the-permissions-on-the-folder-with-vhds-files-checkpoints-for-host-level-hyper-v-guest-cluster-backups/ – handy script included – and references. Things have changed.
Maybe things moved on, but the end result I see is exactly the same, with permissions applied long time ago – https://www.veeam.com/kb2709
Unless it is random
It is not good enough yet afaik. So you are seeing the same issue. For my curiosity, do you also see lingering checkpoints on the OS disk of one of the guest nodes after a backup?
Exactly, and the only way to get rid of these snapshots is to deal with them by hand (vm off, rename a hex to vhds, merge to parent in HyperV Manager ). Not for faint hearted, especially when dealing with disks few Tb in size
Last case with Veeam was “Oh, MS still do not have it right, but nobody knows why”
OS is never a problem, only vhds set
Hello!
I had a problem backing up my guest cluster: it was unable to take an application consistent checkpoint (or production checkpoint, in Microsoft dialect) because of the error
Error code: ‘32775’. More than one VM claimed to be the owner of shared VHDX in VM group ‘Hyper-V Collection’
Obviously all was set up correctly with permissions and so on, but still the backup failed.
After 5 weeks of research with Microsoft Support and tons of diagnostic logging, it turned out that some functions of Hyper-V are case sensitive, and in my VMs’ configuration the shared disk paths were written with different case.
for example: c:\ClusterStorage\CSV01\SharedDisk.vhds
and
c:\ClusterStorage\Csv01\sharedDisk.vhds
Well, this config setting leads Hyper-V to the wrong assumption that the disk is not owned by any node, triggering the error.
So final suggestion: double check the case of the paths and make them case consistent on all the VMs of the guest cluster.
Bonus tip: the engineer told me that there are a few functions in Hyper-V that are case sensitive, so I would take care to be case consistent throughout the whole system.
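The case check described above can be sketched in a few lines of PowerShell. This is only an illustration under stated assumptions: Find-PathCaseMismatch is a hypothetical helper name, and it only compares the strings you hand it. It does not query Hyper-V itself.

```powershell
# Hypothetical helper: reports paths that are the same when case is ignored
# but are written with different casing.
function Find-PathCaseMismatch {
    param(
        [Parameter(Mandatory)]
        [string[]]$Path
    )

    # Group-Object ignores case by default, so 'CSV01' and 'Csv01' end up
    # in the same group.
    $Path | Group-Object | ForEach-Object {
        # Select-Object -Unique IS case sensitive, so more than one survivor
        # means the same path was written with different casing.
        $variants = @($_.Group | Select-Object -Unique)
        if ($variants.Count -gt 1) {
            [pscustomobject]@{ Variants = $variants }
        }
    }
}

# Example with the two spellings from the comment above:
# Find-PathCaseMismatch -Path 'c:\ClusterStorage\CSV01\SharedDisk.vhds',
#                             'c:\ClusterStorage\Csv01\sharedDisk.vhds'
```

On a Hyper-V host the input could, for instance, come from (Get-VM | Get-VMHardDiskDrive).Path, collected for every node of the guest cluster. Treat that as an assumption about where the paths live in your setup.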
Thank you for sharing this great find, one more thing to add to the checklist. Are things now predictably stable for your setup?
Well, it seems so, I’m still working on Veeam backups of the SQL that is hosted on this guest cluster, but we don’t see any more errors or differencing disk multiplication during or after the backups, so I would say that is predictably stable
Just built a brand new guest cluster on Server 2022 and having the issue of a new avhdx file generated for every backup, and not cleaned up. Have applied the steps in kb2709 but makes no difference. Interestingly, even if I just specify the single .vhds disk for backup (and exclude all others), every single virtual disk on the VM still gets a new .avhdx anyway!
Am now looking at using Veeam Agent to do the guest cluster backup. Wish MS would sort this problem out once and for all!