Frustrations about host level backups of Hyper-V guest clusters with Windows Server 2016

Introduction

With Windows Server 2016 came the hope and promise of improved backups for Hyper-V environments. And indeed Microsoft delivered on that and has given us faster, more scalable and more reliable backups. With VHD sets also came the promise of host based backups for guest clusters.

The problem is that this promise or, as it is perhaps better to be mild and careful, that expectation has not been met. Decent, robust host based backups of guest clusters in Windows Server 2016 are still not a reality. For me this means it blocked a few scenarios and we’re working on alternatives. This is a missed opportunity I think for MSFT to excel at virtualization.

The problem

Doing host based backup of guest clusters with VHD Set disks is supported in Windows Server 2016 under certain conditions.

At RTM it became clear that CSV inside the guest cluster was not supported.

You need a healthy cluster with all disks one line

These requirements are reflected in Errors discovered during backup of VHDS in guest clusters

Error code: ‘32768’. Failed to create checkpoint on collection ‘Hyper-V Collection’

Reason: We failed to query the cluster service inside the Guest VM. Check that cluster feature is installed and running.

Error code: ‘32770’. Active-active access is not supported for the shared VHDX in VM group

Reason: The VHD Set disk is used as a Cluster Shared Volume. This cannot be checkpointed

Error code: ‘32775’. More than one VM claimed to be the owner of shared VHDX in VM group ‘Hyper-V Collection’

Reason: Actually we test if the VHDS is used by exactly one owner. So having 0 owner also creates this error. The reason was that the shared drive was offline in the guest cluster

Unfortunately, this is not the only problems people are facing. Quite often the backup software doesn’t support backing up VHD Sets or when it does they fail. Some of those failings like being unable to checkpoint the VHD Set have been addressed via Windows Updates. But there are others issues.

Let’s look at the two most common ones.

Issue 1

You can make one backup an all subsequent backups fail. This is due to the avhdx files being in used and locked. This means that as long as the cluster is up and running the recovery checkpoint chain keeps growing. This can be “cleaned” or merged but only by taking down the cluster.

At the first backup live seems good.

image

The recovery checkpoint as a collection is indeed working.

clip_image003

All attempts at another backup fail.

clip_image005

Shutting down all cluster VMs and starting them up again does merge the recovery checkpoints.

Issue 2

You can make backups, successfully but the recovery checkpoints never get merged. clip_image007

This sounds “better” but it isn’t. There is no way to merge the checkpoint. Manually merging the checkpoints of a VHD Set is bad voodoo.

Both situations get you into problems and I have found no solution so far. At the time or writing I’m back at the “never ending” recovery checkpoint chain situation. But that can change back to the 1st issue I guess. Sigh.

I have found no solution so far

For now I have been unable to solve these problem. There is no fix or even a workaround. The only to get out if this stale mate is to shut down every node of the guest clusters and then restart them all. Just a restart of the guest nodes of the cluster doesn’t do the trick of releasing the checkpoints files and merging them. While this allows you to take one backup successfully again, the problem returns immediately. For you reference that was my issue with the October 2017 CU (KB)

The other scenario we run into is that the backups do work but the recovery checkpoints never ever merge. Not even when you shut down the all the guest VM cluster nodes and start them. With frequent backup that turns into a disaster of a never ending chain of recovery checkpoints. This is actually the situation I was in again after the November 2017 updates on both guests & hosts (KB4049065: Update for Windows Server 2016 for x64-based Systems and KB4048953: 2017-11 Cumulative Update for Windows Server 2016 for x64-based Systems).

To me this situation is blocking the use of guest clustering with VHD Sets where a backup is required. For many reasons we do not wish to go the route of iSCSI or vFC to the guest. That doesn’t cut it for us.

Conclusion

Host level backups of guest clusters in Windows Server 2016 are still a no go. This despite the good hopes we had with VHD Sets to address this limitation and which we were eagerly awaiting. For many of us this is a show stopper for the successful virtualization guest clusters. Every month we try again and we’re not getting anywhere. Hence the frustration and the disappointment.

More than 1 year after Windows Server 2016 RTM we still cannot do consistent host level backup a Hyper-V guest cluster, not even those without CSV, but also not those with standard clustered disks. Trust me on the fact that many of us have given this feedback to Microsoft. They know and I suggest you keep voicing your concerns to them in order to keep it on their radar screen and higher on the priority list. You can do this by opening support calls and by asking for it on user voice. Please Microsoft, we need these workloads to be first class citizens. I’m clearly not the only unhappy camper out there as noticeable in various support forums: Cannot create checkpoint when shared vhdset (.vhds) is used by VM – ‘not part of a checkpoint collection’ error and Backing up a Windows Failover Cluster with Shared vhdx?

Shared VHDX In Windows 2016: VHDS and the backing storage file

Introduction into the VHD Set

I have talked about the VHD Set with a VHDS file and a AVHDX backing storage file in Windows Server 2016 in a previous blog post A first look at shared virtual disks in Windows Server 2016. One of the questions I saw pass by a couple of times is whether this is still a “normal VHDX” or a new type of virtual disk. Well the VHDS files is northing but a small file containing some metadata to coordinate disk actions amongst the guest cluster nodes accessing the shared virtual disk. The avhdx file associated with that VHDS file is an automatically managed dynamically expanding or fixed virtual disk. How do I know this? Well I tested it.

There is nothing that preventing you from copying or moving the avhdx file of a VHD Set that not in use. You can rename the extension from avhdx to vhdx. You can attach it to another VM or mount it in the host and get to the data. In essence this is a vhdx file. The “a” in avhdx stands for automatic. The meaning of this is that an vhdx is under control of the hypervisor and you’re not supposed to be manipulating it but let the hypervisor handle this for you. But as you can see for yourself if you try the above you can get to the data if that’s the only option left. Normally you should just leave it alone. It does however serve as proof that the VHD Set uses an standard virtuak disk (VHDX) file.

I’ll demonstrate this with an example below.

Fun with a backing storage file in a VHD Set

Shut down all the nodes of the guest cluster so that the VHD Set files are not in use. We then rename the virtual disk’s extension avhdx to vhdx.

image

You can then mount it on the host.

image

And after mounting the VHDX we can see the content of the virtual disk we put there when it was a CSV in that guest cluster.

image

We add some files while this vhdx is mounted on the host

image

Rename the virtual disk back to a avhdx extension.

image

We boot the nodes of the guest cluster and have a look at the data on the CSV. Bingo!

image

I’m NOT advocating you do this as a standard operation procedure. This is a demo to show you that the backing storage files are normal VHDX files that are managed by the hypervisor and as such get the avhdx extension (automatic vhdx) to indicate that you should not manipulate it under normal circumstances. But in a pinch, it a normal virtual disk so you can get to it with all options and tools at your disposal if needed.