Frustrations about host level backups of Hyper-V guest clusters with Windows Server 2016

Introduction

With Windows Server 2016 came the hope and promise of improved backups for Hyper-V environments. And indeed, Microsoft delivered: backups are faster, more scalable and more reliable. With VHD Sets also came the promise of host-based backups for guest clusters.

The problem is that this promise, or, to be mild and careful, this expectation, has not been met. Decent, robust host-based backups of guest clusters in Windows Server 2016 are still not a reality. For me this blocks a few scenarios, and we are working on alternatives. I think it is a missed opportunity for Microsoft to excel at virtualization.

The problem

Doing host-based backups of guest clusters with VHD Set disks is supported in Windows Server 2016, but only under certain conditions:

At RTM it became clear that CSV inside the guest cluster was not supported.

You need a healthy guest cluster with all disks online.

These requirements are reflected in the errors discovered during backups of VHD Sets in guest clusters. (A quick way to verify these conditions from inside the guest cluster is sketched after the error list below.)

Error code: ‘32768’. Failed to create checkpoint on collection ‘Hyper-V Collection’

Reason: We failed to query the cluster service inside the Guest VM. Check that cluster feature is installed and running.

Error code: ‘32770’. Active-active access is not supported for the shared VHDX in VM group

Reason: The VHD Set disk is used as a Cluster Shared Volume, which cannot be checkpointed.

Error code: ‘32775’. More than one VM claimed to be the owner of shared VHDX in VM group ‘Hyper-V Collection’

Reason: The check is actually whether the VHD Set is used by exactly one owner, so having zero owners also creates this error. In this case the shared drive was offline in the guest cluster.
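If you want to check those conditions up front, from inside the guest cluster, something like this little sketch will do (it assumes the FailoverClusters PowerShell module is available in the guest; names will differ in your environment):

# Run inside a guest cluster node.
Get-Service -Name ClusSvc | Select-Object Status, StartType       # the cluster service must be running
Get-ClusterSharedVolume                                           # must return nothing: CSV inside the guest is not supported
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" } |
    Select-Object Name, State, OwnerNode                          # every shared disk should be Online with one owner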

Unfortunately, these are not the only problems people are facing. Quite often the backup software doesn’t support backing up VHD Sets, or when it does, the backups fail. Some of those failings, like being unable to checkpoint the VHD Set, have been addressed via Windows Updates. But there are other issues.

Let’s look at the two most common ones.

Issue 1

You can make one backup, but all subsequent backups fail. This is due to the avhdx files being in use and locked. It means that as long as the guest cluster is up and running, the recovery checkpoint chain keeps growing. It can be “cleaned up” or merged, but only by taking down the cluster.
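You can watch that chain grow from the host with a quick sketch like this one (the VM names are just placeholders for your guest cluster nodes):

foreach ($vm in "DEMO-GCNODE1", "DEMO-GCNODE2") {
    # recovery checkpoints created by the host level backup
    Get-VMSnapshot -VMName $vm -SnapshotType Recovery |
        Select-Object VMName, Name, CreationTime
    # the disk paths show the avhdx files the VMs are currently locked onto
    Get-VMHardDiskDrive -VMName $vm | Select-Object VMName, Path
}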

At the first backup, life seems good.


The recovery checkpoint as a collection is indeed working.


All attempts at another backup fail.


Shutting down all cluster VMs and starting them up again does merge the recovery checkpoints.
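In PowerShell terms that comes down to something like the sketch below, run on the host (again with placeholder VM names); only after the full stop/start does the list of recovery checkpoints come back empty.

$guestClusterVMs = "DEMO-GCNODE1", "DEMO-GCNODE2"
Stop-VM -Name $guestClusterVMs                                  # graceful shutdown of all guest cluster nodes
Start-VM -Name $guestClusterVMs
Get-VMSnapshot -VMName $guestClusterVMs -SnapshotType Recovery  # should now return nothing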

Issue 2

You can make backups successfully, but the recovery checkpoints never get merged.

This sounds “better” but it isn’t. There is no way to merge the checkpoints. Manually merging the checkpoints of a VHD Set is bad voodoo.

Both situations get you into trouble and I have found no solution so far. At the time of writing I’m back in the “never ending” recovery checkpoint chain situation. But that can change back to the first issue again, I guess. Sigh.

I have found no solution so far

For now I have been unable to solve these problems. There is no fix or even a workaround. The only way to get out of this stalemate is to shut down every node of the guest cluster and then restart them all. Just restarting the guest nodes of the cluster doesn’t do the trick of releasing the checkpoint files and merging them. While a full shutdown allows you to take one backup successfully again, the problem returns immediately. For your reference, that was my issue with the October 2017 CU (KB).

The other scenario we run into is that the backups do work, but the recovery checkpoints never ever merge, not even when you shut down all the guest cluster nodes and start them again. With frequent backups that turns into a disaster of a never ending chain of recovery checkpoints. This is actually the situation I was in again after the November 2017 updates on both guests & hosts (KB4049065: Update for Windows Server 2016 for x64-based Systems and KB4048953: 2017-11 Cumulative Update for Windows Server 2016 for x64-based Systems).

To me this situation is blocking the use of guest clustering with VHD Sets where a backup is required. For many reasons we do not wish to go the route of iSCSI or vFC to the guest. That doesn’t cut it for us.

Conclusion

Host level backups of guest clusters in Windows Server 2016 are still a no go, despite the high hopes we had that VHD Sets, which we were eagerly awaiting, would address this limitation. For many of us this is a show stopper for the successful virtualization of guest clusters. Every month we try again and we’re not getting anywhere. Hence the frustration and the disappointment.

More than a year after Windows Server 2016 RTM we still cannot do consistent host level backups of a Hyper-V guest cluster, not even of those without CSV that use standard clustered disks. Trust me, many of us have given this feedback to Microsoft. They know, and I suggest you keep voicing your concerns to them in order to keep it on their radar screen and higher on the priority list. You can do this by opening support calls and by asking for it on User Voice. Please Microsoft, we need these workloads to be first class citizens. I’m clearly not the only unhappy camper out there, as is noticeable in various support forums: Cannot create checkpoint when shared vhdset (.vhds) is used by VM – ‘not part of a checkpoint collection’ error and Backing up a Windows Failover Cluster with Shared vhdx?

Quick Fix Publish : VM won’t boot after October 2017 Updates for Windows Server 2016 and Windows 10 (KB4041691)

If you had WSUS (or SCCM) running tonight with auto approval on, you might have woken up this morning to virtual machines that cannot boot anymore.


Great, another update gone wrong. Time to restore from backup, as that can be the fastest way to restore services when you are in a pickle, if you have a good solution for that in place. For the others, you can do what I did below. Actually, a couple of us MVPs were on this issue at a number of sites as our first task this morning. But first the root cause.

Well, read this link, Express update delivery ISV support, and you have all you need. Basically, both the delta and the full cumulative update of October (KB4041691 – https://support.microsoft.com/en-us/help/4041691) ended up in WSUS without you explicitly putting them there. That should not happen; normally the delta is not published for download, let alone auto approved. You could also have manually approved everything without really knowing what and why. Not a great idea at all.
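If the delta updates did slip into your WSUS server, a sketch along these lines, run on the WSUS server, helps you find and decline them. The title filter is an assumption on my part, so check what it actually matches before you deny anything.

# List what the filter catches first; only then pipe to Deny-WsusUpdate.
Get-WsusUpdate -Approval AnyExceptDeclined |
    Where-Object { $_.Update.Title -like "*Delta*" } |
    Select-Object @{ n = "Title"; e = { $_.Update.Title } }
# Get-WsusUpdate -Approval AnyExceptDeclined | Where-Object { $_.Update.Title -like "*Delta*" } | Deny-WsusUpdate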


So your VM gets offered both of them and that is BAD!


Normally you only get into this pickle if you somehow managed to install both of these yourself or via other tools (see the link above), which you shouldn’t do.

Now, if you don’t have decent restore capabilities from backups or snapshots, there is another way out: removing the updates.

Boot the problematic VM and select Troubleshoot.


Select to open the command prompt and stay away from any other auto repair options.


Microsoft advises getting rid of the SessionsPending reg key. To do so, load the software registry hive as follows:

reg load hklm\temp c:\windows\system32\config\software

Delete the SessionsPending registry key, if it exists, by running:

reg delete "HKLM\temp\Microsoft\Windows\CurrentVersion\Component Based Servicing\SessionsPending" /v Exclusive

Unload the software registry hive:

reg unload HKLM\temp

Run dism /image:c:\ /get-packages to find the installed updates that caused the issue.


The three packages from last night are the ones of interest; you can see the first one never even got an install time.

We now use DISM to remove these updates. First, create the C:\Temp folder with md temp if it doesn’t exist yet!

dism /image:c:\ /remove-package /packagename:myproblematicpackagetoremove /scratchdir:c:\temp
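If PowerShell is available in your recovery environment, you can loop over the offending packages instead of retyping the command. A minimal sketch, where the package names are placeholders you replace with the exact identities from the /get-packages output:

# Placeholders: replace with the package identities dism /get-packages listed for last night.
$packages = @(
    "PACKAGE-IDENTITY-1",
    "PACKAGE-IDENTITY-2"
)
foreach ($p in $packages) {
    dism /image:c:\ /remove-package /packagename:$p /scratchdir:c:\temp
}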


When done, close the command prompt, shut down the VM and then start it.


It will take a while, but it will succeed and you’ll be greeted by a logon screen. Good luck!

Important: Do not try any other repair options, or removing the updates with DISM might fail. We chose to remove all 3 updates from tonight to make sure. It might suffice to remove only the delta one, but we wanted to have the VM back as it was last night, so more testing can be done before it is deployed again.

So, basically, don’t auto approve updates blindly, but test, validate & roll out in phases. Have great backups and TESTED restores. All in all we were only bitten in the lab, on a couple of test/dev VMs and some of our infra VMs. Most of these are redundant and are patched staggered, so our services were never badly affected. That gave us time to troubleshoot, investigate and warn our colleagues. As you can see here, the issue was a delta update that made it into WSUS and was installed together with the full CU. Just manually downloading the CU and testing it would not have given you a heads-up about this issue. This is a reminder that you need to test your real-life situation and processes as realistically as possible. When you’re done with testing and cleaning up any fallout of this issue, make sure to patch your systems again!

Update: this also goes for Windows 10 Updates

Also see fellow MVP Mikael Nystrom’s blog post https://deploymentbunny.com/2017/10/11/the-october-2017-update-inaccessible-boot-device/

Update: we now also have the official MSFT response & fix for each and every scenario right here https://support.microsoft.com/en-us/help/4049094/windows-devices-may-fail-to-boot-after-installing-october-10-version-o

An error occurred connecting to the cluster


This morning I woke up to a bunch of failed backup notifications from our trusted Veeam Backup & Replication v9.5 Update 2 solution. After 3:30 AM the backups of one particular cluster started failing.

I went to have a look but I could not connect to the 3 node cluster.


I logged on to the cluster nodes themselves and did a quick verification of network connectivity, DNS, etc. That was all fine. The WMI service was running on all nodes, but on nodes 2 and 3 it was not functional.

Clearly we had a WMI issue. And sure enough, Hyper-V Manager was not available on those 2 nodes, but it was on the one properly functioning node.

We tested some PowerShell WMI queries (Get-WmiObject MSCluster_ResourceGroup -ComputerName NodeToTest -Namespace "ROOT\MSCluster") against the cluster and this confirmed that WMI was toast on those two nodes.
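Wrapped in a quick loop, that test looks roughly like this (just a sketch; the node names are placeholders):

foreach ($node in "NODE1", "NODE2", "NODE3") {
    try {
        # any result at all means the cluster WMI namespace answers on that node
        Get-WmiObject -Class MSCluster_ResourceGroup -ComputerName $node -Namespace "ROOT\MSCluster" -ErrorAction Stop | Out-Null
        "$node : cluster WMI namespace responds"
    }
    catch {
        "$node : WMI query failed - $($_.Exception.Message)"
    }
}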

Fixing the issue

The good news was that all the VMs were up and running. A few had RHS.exe issues but were still alive, purely Hyper-V wise. That explains why no support calls had come in. So if we could fix this without causing downtime, that would be great. To try this, we decided to restart the WMI service.

On problematic node 2 this worked. It restarted the dependent services as well, such as Hyper-V Virtual Machine Management, User Access Logging Service, IP Helper, the Veeam Installer Service and the Veeam Hyper-V Integration Service. We got connectivity back via Hyper-V Manager, but the Failover Cluster Manager GUI remained an issue, now only complaining about node 3.


We wanted to avoid rebooting node 3 to avoid downtime for the VMs. So what we did there was stop the dependent services that we could stop. It was vmms.exe that was stuck in shutdown, so we just killed that process manually with Stop-Process -Name "vmms" -Force.
That allowed the WMI service to be restarted. We then started the dependent services manually and got connectivity to Hyper-V Manager back on node 3.
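Put together, the sequence on node 3 came down to something like this sketch (run locally and elevated; which dependent services you need to start again manually, such as the Veeam ones, will vary):

Stop-Process -Name "vmms" -Force        # vmms.exe was stuck in shutdown, so kill it
Restart-Service -Name Winmgmt -Force    # WMI can now restart cleanly (restarts its dependents too)
Start-Service -Name vmms                # bring Hyper-V Virtual Machine Management back up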

The Failover Cluster Manager GUI could also connect to the cluster again. We checked the cluster for other issues and, once everything was found OK, we live migrated the VMs away node per node and rebooted every node one by one. This gave us cleanly started nodes and let us see whether any troublesome events were logged during startup. Normal operations were resumed.

Do note that there is a blog post on TechNet about a similar issue but with a different error message. That one was caused by a missing cluswmi.mof file due to an ill-advised run of mofcomp.exe *.mof. This was not the case here. A reboot of the misbehaving nodes would have done the trick as well (as blogged here: Trouble Connecting to Cluster Nodes? Check WMI!), but we avoided as much downtime as possible by going the route we did.

Microsoft Active Directory Replication Status Tool won’t upgrade

For getting quick insight into the AD replication health of an environment, the Microsoft Active Directory Replication Status Tool is a very handy instrument. The only annoyance is the expiration of the license, which forces you to download a new version and upgrade. A bit of a convoluted way to update free software, but hey, it is handy and free.


And then again …


OK, I’ll download the new one. But the Microsoft Active Directory Replication Status Tool won’t upgrade. That’s because the currently installed version is supposedly newer than the one you just downloaded from the Microsoft site. That’s annoying. Did they post the wrong version?


Let’s quickly install the new version in a VM. Looking at the executable in the current install and in the new one, they are the same … so the license is the only thing causing an issue here, not an actual version difference.
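You can check that for yourself by comparing the file versions of the installed executables. A small sketch, assuming the default install path (the wildcard avoids guessing the exact executable name):

Get-ChildItem "C:\Program Files (x86)\Microsoft Active Directory Replication Status Tool\*.exe" |
    Select-Object Name, @{ n = "FileVersion"; e = { $_.VersionInfo.FileVersion } }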

Old version and new version: the file properties show an identical file version.

Let’s look at the license.xml file in C:\Program Files (x86)\Microsoft Active Directory Replication Status Tool\Licensing


The only difference between the old and the new install is the license file. You can see it has expiration dates in the future.
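You can peek at those dates without opening the file in an editor. A small sketch that just greps the raw XML, so it makes no assumptions about the exact element names:

Get-Content "C:\Program Files (x86)\Microsoft Active Directory Replication Status Tool\Licensing\license.xml" |
    Select-String -Pattern "expir"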


So the fix is easy: just uninstall the currently installed version of the AD Replication Status Tool wherever it is installed and reinstall the one you downloaded. It seems to be exactly the same version, but that’s how you get it working again with a fresh license.xml file. Note that you cannot copy the license file between machines; the generated signature would be wrong.

Hope this helps someone.