Frustrations about host level backups of Hyper-V guest clusters with Windows Server 2016

Introduction

With Windows Server 2016 came the hope and promise of improved backups for Hyper-V environments. And indeed, Microsoft delivered on that and has given us faster, more scalable and more reliable backups. With VHD Sets also came the promise of host-based backups for guest clusters.

The problem is that this promise, or, to be mild and careful, this expectation, has not been met. Decent, robust host-based backups of guest clusters in Windows Server 2016 are still not a reality. For me this blocks a few scenarios, and we're working on alternatives. I think this is a missed opportunity for MSFT to excel at virtualization.

The problem

Host-based backup of guest clusters with VHD Set disks is supported in Windows Server 2016, but only under certain conditions:

  • CSV inside the guest cluster is not supported. This became clear at RTM.
  • You need a healthy cluster with all disks online.
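
To make that concrete, here is a minimal sketch of such a supported setup in PowerShell, assuming two guest cluster nodes named DemoVM01 and DemoVM02 and a folder on a host CSV (all names are hypothetical):

# Create a VHD Set on a Cluster Shared Volume of the Hyper-V host cluster
New-VHD -Path "C:\ClusterStorage\Volume1\GuestCluster\Shared01.vhds" -SizeBytes 500GB -Dynamic

# Attach it to both guest cluster nodes with persistent reservations enabled
Add-VMHardDiskDrive -VMName "DemoVM01" -Path "C:\ClusterStorage\Volume1\GuestCluster\Shared01.vhds" -SupportPersistentReservations
Add-VMHardDiskDrive -VMName "DemoVM02" -Path "C:\ClusterStorage\Volume1\GuestCluster\Shared01.vhds" -SupportPersistentReservations

Inside the guests, the disk is then clustered as a standard clustered disk; per the requirements above, do not convert it to CSV.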

These requirements are reflected in the errors discovered during backups of VHD Sets in guest clusters.

Error code: ‘32768’. Failed to create checkpoint on collection ‘Hyper-V Collection’

Reason: We failed to query the cluster service inside the Guest VM. Check that cluster feature is installed and running.

Error code: ‘32770’. Active-active access is not supported for the shared VHDX in VM group

Reason: The VHD Set disk is used as a Cluster Shared Volume. This cannot be checkpointed.

Error code: ‘32775’. More than one VM claimed to be the owner of shared VHDX in VM group ‘Hyper-V Collection’

Reason: The check tests whether the VHDS is used by exactly one owner, so having zero owners also triggers this error. In this case the shared drive was offline in the guest cluster.
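
You can sanity-check these three conditions from within a guest cluster node before kicking off a backup. A minimal sketch that maps to the reasons above (not an official pre-flight check, just what I would verify):

# Error 32768: is the cluster service installed and running in the guest?
Get-Service -Name ClusSvc

# Error 32770: the shared disk must not be used as a Cluster Shared Volume
Get-ClusterSharedVolume   # should return nothing in a supported configuration

# Error 32775: the shared disk must be online and owned by exactly one node
Get-ClusterResource | Where-Object { $_.ResourceType -like "Physical Disk" } | Select-Object Name, State, OwnerNode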

Unfortunately, these are not the only problems people are facing. Quite often the backup software doesn't support backing up VHD Sets or, when it does, the backups fail. Some of those failings, like being unable to checkpoint the VHD Set, have been addressed via Windows Updates. But there are other issues.

Let’s look at the two most common ones.

Issue 1

You can make one backup, but all subsequent backups fail. This is due to the avhdx files being in use and locked. This means that as long as the cluster is up and running, the recovery checkpoint chain keeps growing. It can be "cleaned up", that is merged, but only by taking down the cluster.
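
You can watch that chain grow from the host. A quick sketch, assuming a guest cluster node named DemoVM01 and the VHD Set folder used above (hypothetical names):

# List the recovery checkpoints left behind by host level backups
Get-VMSnapshot -VMName "DemoVM01" -SnapshotType Recovery

# A VHD Set keeps its data in avhdx files next to the vhds file;
# a growing number of these shows the chain is not being merged
Get-ChildItem "C:\ClusterStorage\Volume1\GuestCluster" -Filter *.avhdx | Sort-Object LastWriteTime | Select-Object Name, Length, LastWriteTime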

At the first backup, life seems good.


The recovery checkpoint as a collection is indeed working.


All attempts at another backup fail.


Shutting down all cluster VMs and starting them up again does merge the recovery checkpoints.

Issue 2

You can make backups successfully, but the recovery checkpoints never get merged.

This sounds "better", but it isn't. There is no way to merge the checkpoints. Manually merging the checkpoints of a VHD Set is bad voodoo.

Both situations get you into problems and I have found no solution so far. At the time of writing I'm back in the "never ending" recovery checkpoint chain situation. But that can change back to the first issue, I guess. Sigh.

I have found no solution so far

For now I have been unable to solve these problems. There is no fix or even a workaround. The only way to get out of this stalemate is to shut down every node of the guest cluster and then restart them all. Just a restart of the guest nodes of the cluster doesn't do the trick of releasing the checkpoint files and merging them. While this allows you to take one backup successfully again, the problem returns immediately. For your reference, that was my issue with the October 2017 CU (KB).
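
For completeness, a sketch of that stalemate-breaking procedure, again assuming guest cluster nodes DemoVM01 and DemoVM02 (hypothetical names). Remember that a mere guest reboot is not enough; every node must be fully powered off at the same time:

$nodes = "DemoVM01", "DemoVM02"

# Shut down every node of the guest cluster; only a full power-off of all
# nodes releases the avhdx files so the recovery checkpoints can merge
$nodes | ForEach-Object { Stop-VM -Name $_ }

# Watch Hyper-V Manager (or Get-VM) until the merging has finished,
# then bring the guest cluster back up
$nodes | ForEach-Object { Start-VM -Name $_ }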

The other scenario we run into is that the backups do work, but the recovery checkpoints never, ever merge. Not even when you shut down all the guest VM cluster nodes and start them. With frequent backups that turns into a disaster of a never-ending chain of recovery checkpoints. This is actually the situation I was in again after the November 2017 updates on both guests & hosts (KB4049065: Update for Windows Server 2016 for x64-based Systems and KB4048953: 2017-11 Cumulative Update for Windows Server 2016 for x64-based Systems).

To me this situation is blocking the use of guest clustering with VHD Sets where a backup is required. For many reasons we do not wish to go the route of iSCSI or vFC to the guest. That doesn’t cut it for us.

Conclusion

Host level backups of guest clusters in Windows Server 2016 are still a no-go, despite the high hopes we had for VHD Sets, which we were eagerly awaiting to address this limitation. For many of us this is a show stopper for the successful virtualization of guest clusters. Every month we try again and we're not getting anywhere. Hence the frustration and the disappointment.

More than one year after Windows Server 2016 RTM we still cannot do a consistent host level backup of a Hyper-V guest cluster, not even of those without CSV, i.e. guest clusters using just standard clustered disks. Trust me on the fact that many of us have given this feedback to Microsoft. They know, and I suggest you keep voicing your concerns to them in order to keep it on their radar screen and higher on the priority list. You can do this by opening support calls and by asking for it on UserVoice. Please Microsoft, we need these workloads to be first class citizens. I'm clearly not the only unhappy camper out there, as is noticeable in various support forums: Cannot create checkpoint when shared vhdset (.vhds) is used by VM – 'not part of a checkpoint collection' error and Backing up a Windows Failover Cluster with Shared vhdx?

Veeam Vanguard nominations are now open for 2018!

Just a quick blog post on the Veeam Vanguard program. The nominations for 2018 are open! That means that if you know people who would make a good Veeam Vanguard, you can nominate them. You can even nominate yourself; that's perfectly fine. It's not frowned upon, but it also doesn't change anything in terms of evaluation for the program.


Rick blogged about this yesterday on the Veeam blog in "Veeam Vanguard nominations are now open for 2018!" and gave some more insight into what the program is, tries to achieve and does. He also discusses the selection. The key takeaway is that you cannot study for this and that it is not some kind of certification or such. Some of the current Vanguards were quoted on how they look at the program, and one thing is constant in that: the people in these programs are contributors to the global tech community, and it's about sharing and helping others get the best out of their environment and their investment in Veeam. It also helps Veeam, as they get a very communicative group of people to give them feedback on their offerings, both products and services. It's just one more tool that helps them get things right or fix things when they got it wrong. Likewise, understanding Veeam and their products better helps us make better decisions on the design, implementation and operation of them.

You can have a look at the current lineup of Veeam Vanguards over here.


You'll find a short video on the program on that page as well. So go meet the Vanguards, find their blogs and communities, and follow @VeeamVanguard and the hashtag #VeeamVanguard to see what's going on.


So, people, this is the moment if you want to nominate someone, or yourself, to join the Veeam Vanguards in 2018. You have until December 29th 2017 to do so. I have always felt honored to be selected, I have fond memories of the events I was able to go to, and to this day I'm happy to be active in the Veeam Vanguard ecosystem. It's a fine group of professionals in a program of a great company.

Testing Azure File Sync

Introduction

Since Azure File Sync is now out in public preview, I can chat a bit about why I am interested in it. My interest is big enough for me to have tested it in private previews, even when managers weren't really interested in doing so yet. It was too early for them, and it certainly didn't help that, in many scenarios, it involves "hardware" in some way or another. In a cloud-first world that's bad by default, even if it's only a hypervisor host and storage.

But part of my value is providing solutions that serve the clear and present needs of the business today and tomorrow. All this without burdening them with a mortgage on their future capabilities to address the needs they'll have then. In that aspect of my role I can and have to disagree with managers every now and then. Call it a perk or a burden, but I do tell them what they need to hear, not what they want to hear. So yes, I was and I am testing Azure File Sync. You have to remember that what's too early in technology today is tomorrow's old news. To paraphrase Ferris Bueller: technology moves pretty fast. If you don't stop and look around once in a while, you could miss it.

If you need to get up to speed on Azure File Sync fast, take a look at https://azure.microsoft.com/nl-nl/resources/videos/azure-friday-hybrid-storage-with-azure-file-sync-langhout/.


I'm not going to discuss all of the current or announced capabilities of Azure File Sync here. There is plenty of info out there, and the Program Managers are very responsive and passionate when it comes to answering your detailed questions. There's plenty more info in the Microsoft Ignite 2017 sessions: https://myignite.microsoft.com/sessions/54963 and https://myignite.microsoft.com/sessions/54931.
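
To give you a feel for the moving parts (sync service, sync group, cloud endpoint, server endpoint with cloud tiering), here is a rough sketch. Note that I'm using the current Az.StorageSync cmdlets purely as an illustration; the preview tooling differs, and all resource names here are hypothetical:

# Sync service and sync group in Azure
$sync  = New-AzStorageSyncService -ResourceGroupName "rg-afs" -Name "afs-demo" -Location "westeurope"
$group = New-AzStorageSyncGroup -ParentObject $sync -Name "FileShares"

# Cloud endpoint: the Azure file share that is the hub of the sync group
$storageAccount = Get-AzStorageAccount -ResourceGroupName "rg-afs" -Name "afsdemostorage"
New-AzStorageSyncCloudEndpoint -ParentObject $group -Name "share01" -StorageAccountResourceId $storageAccount.Id -AzureFileShareName "share01"

# Server endpoint: a path on a registered on-premises file server, with
# cloud tiering keeping only hot data local (run the registration on the
# file server itself, after installing the Azure File Sync agent)
$server = Register-AzStorageSyncServer -ParentObject $sync
New-AzStorageSyncServerEndpoint -ParentObject $group -Name "fs01-data" -ServerResourceId $server.ResourceId -ServerLocalPath "D:\Data" -CloudTiering -VolumeFreeSpacePercent 20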

What drove me to testing Azure File Sync

I was in general unhappy with the StorSimple offering and other 3rd party offerings. There, I've said it. It was too much of a hit-or-miss point solution, and some of them are really odd ones out in an otherwise streamlined IT environment. I'll keep it at that. While there are still other players in the market, we have another concern: at the cloud side of things we needed integration and capabilities that a 3rd party just wouldn't have the leverage with Microsoft to achieve. The reality is that unless you are talking about cloud in combination with huge amounts of money, you will not get their attention. It's business, not personal. Let's look at the major reasons for me to investigate Azure File Sync.

The need to access data from different locations in different ways

First of all, there is the need to access data from different locations in different ways, both on-premises and from Azure. In the cloud we access file shares via REST APIs, and that works fine for well controlled and secured application environments and services written in and for the modern workplace.

When we start involving human beings, that falls apart. The security model with roles and responsibilities doesn't map well onto an access token you paste into a mapped drive of an Azure file share. Then there is the humongous number of applications that cannot speak REST. They most often speak NTFS and SMB, which are ruled by ACLs. They are children of a client/server world and as such, when designed and written well, they shine on premises in a well-connected, low latency world. Yes, that very long tail of classical applications …


Windows Applications and their long, long tail by TeamRGE.com as frequently found in RDS and Citrix VDI articles.

In cloud-based RDS solutions and IaaS we also use SMB and NTFS/ACLs to consume data, so don't assume everything in the cloud is pure REST and claims-based security.

I won't even go into the need to have ACLs on Azure file shares, the options and need for extending AD to Azure over a VPN, ADFS, ADFS proxies, Azure AD Pass-through, Azure AD Sync, … the whole nine yards to cover all kinds of scenarios. Sure, I hear you, AD will go away, it doesn't matter anymore. On a long enough time scale such statements are always true. But looking at the success of Office 365, I'm not howling with the crowd that e-mail is totally obsolete yet either. Long tails …

Anyway, when you need to consume the same data from different locations in different ways, it often leads to partial or complete copies of data that might change in different places and need to be protected in different places. That's a lot of overhead to deal with, and consistency can become a serious issue. With that, we have touched on our second driver.

Backups, data protection & recovery

Yup, good old backups. You know, the thing too many of the mediocre IT managers out there think nothing of anymore, as they solved that problem over a decade ago. But time doesn't stand still, plain and simple. I sometimes deal with largish file server clusters leveraging SMB3 (and ODX when available). The LUNs for those file shares are generally anything between 8 TB and 20 TB, totaling 100 TB and upwards per cluster. There are dozens of LUNs on any cluster, with tens of thousands of folders and hundreds of millions of files. A DFS namespace helps us keep a unified namespace for the consumers. Since Windows Server 2012, that's ages ago, we have the technology to make this a breeze: SMB 3 with continuous availability, leveraging ODX wherever we can and where it makes sense. Thank you, Microsoft!
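
To give an idea, a minimal sketch of that pattern, with hypothetical share, namespace and cluster names:

# A continuously available SMB 3 share on the file server cluster
New-SmbShare -Name "Data01" -Path "D:\Shares\Data01" -ContinuouslyAvailable $true -FullAccess "CORP\FileAdmins"

# Publish it in the unified DFS namespace the consumers see
New-DfsnFolder -Path "\\corp.local\Files\Data01" -TargetPath "\\FSCLUSTER01\Data01"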

One of the selling points for Azure File Sync is dealing with storage capacity issues. That’s the least of my worries actually. We have never struggled with a lack of storage on file servers. For crying out loud, we solved that problem well over a decade ago. Yes, even on premises. So that’s not our main motivation here.

What is still not covered well enough for us is backups. That number of files, not just the data volume, is a challenge. Not just for the backup target storage and the infrastructure; basically, that can be built performant enough and reasonably cost effective. The big challenge with those numbers is the time it takes to deal with backups.

We have had to revert to application-consistent storage snapshots, and replicas of those on campus and between cities, to help deal with the fact that "traditional backup" doesn't really scale well to those volumes. That's great, but when data is valuable, the 3-2-1 rule is set in stone. So, you need more. A lot more actually, as today you might want to supplement it with extra rules to include air gapping and encryption. Got ransomware, anyone?

Traditional backups with an agent on the physical host or in the guest are not cutting it. When you virtualize, you can leverage host-based backups of entire VMs, or split it up into host-based backups of the OS, the data virtual disks, etc. It's a lot more effective to back up a number of 5/15/25 TB sized VHDX files than it is to do so with in-guest or in-server backups. Even more so when you combine this with modern changed block tracking, synthetic full backups and health check/repair capabilities in backup products.

The challenge was that with true virtualized clusters, meaning shared VHDX, the backup story wasn't a first class citizen. This made people either not do it, or use iSCSI or vFC to the guest and deal with the backups as if they were physical machines, losing any benefits associated with host-based backups of VMs. Things have gotten better with Windows Server 2016, so we had two paths to investigate to deal with these challenges.

There are 2 approaches to data protection we are investigating

There are 2 approaches to data protection we are investigating, and solutions might be a combination of them.

Virtualize it all and then back up the VMs. In other words, leverage the power of host-level backups with virtual machines. Things have become a whole lot better for guest cluster backups with Windows Server 2016, but we are not quite there yet in order to call them first class citizens. The other concern is that the growth in data will still beat any progress we can make in speed and reliability of backups. Another concern is that people don't want to deal with storage and are looking for storage as a service. The easiest and smartest way to achieve this is to look at public cloud offerings that deliver what's needed and not take on the burden of managing 3rd parties to deliver this.

The first problem might be covered by Veeam Backup & Replication v10, which will allow for backing up file shares. As long as shared VHDX or VHD Sets are not full-blown, equal citizens in Hyper-V when it comes to backups, we'll look at another solution.

That's fine, but what about pure volume and that 3-2-1 rule (multiple copies, different technologies, different locations) or better? The cost and overhead of a second site can be a challenge for some. Can Azure File Sync help there? It sure can. But the big bonus is that it is also a big step forward in helping with the need to access data from different locations in different ways.

The problem that we might still have is that we are dependent on the same type of solution if it's geo-replicated. Application-consistent snapshots, storage array or Windows VSS based, that are geo-replicated are a very important part of our protection against data loss. That has become even more obvious with the ever-growing threat of ransomware. In this threat model you try to separate access to data and backups, both logically and physically as well as technology and operations wise. Full "air gapping" is the Valhalla here. Lacking that, you need at least serious delaying factors to prevent an attacker from gaining access to both the source data, the snapshots on Windows or on the storage arrays, the backup targets and the replicas of all the above.

We could actually use Azure File Sync as a primary or secondary backup target and combine the best of both worlds. Interesting, huh!

The fact that we can have snapshots of Azure file shares is important to us for data recovery. It's also important to have those replicated in case the Azure West Europe datacenter goes south, so to speak. The next question is whether our needs and wants are covered today in the preview.

Are all our needs and wants covered today in the preview?

The question we need to answer is whether it covers the needs we have well enough or better. Today Azure File Sync isn't generally available, and when it is, it will be v1. So, things can and will improve.

Let's forget about the LUN size limitation (5 TB) for the moment, or some of our other future capability and functionality requests on the wish list. I don't need 64 TB or bigger per se, but I'd like to see at least 20 to 30 TB LUNs supported. I'm not sure if Azure File Sync snapshots will be limited by VSS to 64 TB, or whether they break through that barrier like we can with certain hardware VSS providers.

Today we limit the size of LUNs for repair as well as data protection and recovery speeds, and to minimize the impact of a downed LUN. When the data is in the cloud and "only" a cache on premises, these concerns might subside more and more. At Ignite 2017 Microsoft mentioned +/- 100 TB sizes. So, there is an indication of that 64 TB barrier being broken somehow.

The one thing we are not getting yet with Azure File Sync in preview is multiple copies on two different technologies in two different locations (remember that 3-2-1 rule). You are and remain dependent on the Microsoft solution; maybe adherence to that rule is built into their technological design, maybe it isn't. I don't know right now. We have LRS for now, but not GRS. Do note that things change rather fast in the cloud, and what isn't here today might be there tomorrow. GRS or ZRS is on the roadmap to be available when Azure File Sync goes into production.


Also, Microsoft realized that having snapshots in the same storage account was a risk, so the backups will go to a Recovery Services vault in the future. That will also provide for long-term data retention. This is important for archiving and for authentic data that cannot be reproduced and as such has to be secured against alteration.

Maybe you don't trust MSFT or the cloud 100%. If that's the case and it's a hard policy requirement of the business to have a protected data copy that is not dependent on the cloud provider, we could propose a solution, at a certain cost. You can use a file share cluster or, cheaper, a single file server in the datacenter (one location only) where everything is pinned locally (no cloud tiering) and which is used as a backup source to a backup target on different storage. The challenge there, again, is that you still have to deal with that massive volume of files yourself. Whether that's worth the price/benefit versus the risks is not for me to decide. I deal in technological solutions, and I buy and build technological results for the needs of the organization. I do this in the context and in the landscape in which that business operates. I do not buy or build services, let alone journeys. That's the part I do, and I do it very well. In my opinion Azure File Sync will play a major part in all this.

Other things on the wish list are:

  • A global file lock mechanism and orchestration, with an error handling model (Try, Catch, Finally) so automation in different sites can deal with locked files in the application logic, helping with the challenges of distributed access (read/write/modify) to data. I'm not asking to reinvent SharePoint here, don't get me wrong, but this area remains a challenge, both for the users of the data as well as for those trying to solve those challenges in code. There are just so many applications working in different ways that perfection in this arena will not be of this world, and any solution will require some tweaking and flexibility to cater to different needs. But I think that an effort here could make a real value proposition (see the sketch after this list).
  • Bringing ACLs to Azure File Sync might enable new scenarios as well. It's been on the wish list for a while, and while MSFT is looking into it, we don't have it yet.
  • Support for data deduplication. Kind of speaks for itself, and both customers & Microsoft would benefit.
  • I'd love to see support for ReFSv3 as well in time, because I'm still hoping Microsoft will bring the benefits of ReFSv3 to many more use cases than today. I think they got my feedback on that one loud and clear already.
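
To illustrate that first wish list item: today, every consumer of the data has to hand-roll something like the retry logic below around files that another site may have locked. A sketch of the kind of Try/Catch pattern we end up writing ourselves and would love to see handled by an orchestration layer (the path and retry values are made up):

$attempts = 0
do {
    try {
        # Open the file exclusively; this throws if it is locked elsewhere
        $stream = [System.IO.File]::Open("D:\Data\Report.xlsx", "Open", "ReadWrite", "None")
        try {
            # ... work with the file ...
        }
        finally {
            $stream.Dispose()
        }
        break
    }
    catch [System.IO.IOException] {
        # Locked by another user or site: back off and try again
        $attempts++
        Start-Sleep -Seconds 5
    }
} while ($attempts -lt 10)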

So far for some of the musings that led us to testing Azure File Sync. We'll share more and become a bit more technical as we progress in our journey.

Quick Fix Publish : VM won’t boot after October 2017 Updates for Windows Server 2016 and Windows 10 (KB4041691)

If you had WSUS (or SCCM) running tonight with auto approval on, you might have woken up this morning to virtual machines that can't boot anymore.


Great, another update gone wrong. Time to restore from backup, as that can be the fastest way to restore services when in a pickle, if you have a good solution for that in place. For the others, you can do what I did below. Actually, a couple of us MVPs were on this issue at a number of sites as our first task this morning. But first, the root cause.

Well, read this link, Express update delivery ISV support, and you have all you need. Basically, both the delta and the full cumulative update of October (KB4041691 – https://support.microsoft.com/en-us/help/4041691) ended up in WSUS without you explicitly putting them there. That should not happen; normally the delta is not published for download, and heaven forbid auto approved. You could also have manually approved everything without really knowing what and why. Not a great idea at all.


So your VM gets offered both of them, and that is BAD!


Normally you only get into this pickle if you somehow managed to install both of these yourself or via other tools (see the link above), which you shouldn't do.
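
If you run WSUS, you can find and decline the delta packages from PowerShell on the WSUS server. A hedged sketch using the UpdateServices module; do review what the title filter catches before declining anything:

# Find the delta updates that slipped into WSUS
$deltas = Get-WsusUpdate -Approval AnyExceptDeclined | Where-Object { $_.Update.Title -like "*Delta*" }

# Review what matched
$deltas | ForEach-Object { $_.Update.Title }

# Decline them so clients are no longer offered both the delta and the full CU
$deltas | Deny-WsusUpdate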

Now, if you don't have decent restore capabilities from backups or snapshots, there is another way out: removing the updates.

Boot into the problematic VM and select Troubleshoot.


Select to open the command prompt and stay away from any other auto repair options.


Microsoft advises getting rid of the SessionsPending reg key. To do so, load the software registry hive as follows:

reg load hklm\temp c:\windows\system32\config\software

Delete the SessionsPending registry key, if it exists, by running:

reg delete "HKLM\temp\Microsoft\Windows\CurrentVersion\Component Based Servicing\SessionsPending" /v Exclusive

Unload the software registry hive:

reg unload HKLM\temp

Run dism /image:c:\ /get-packages to find the updates installed that caused the issue


The yellow ones are the ones of interest, and you can see the first one never even got an install time.

We now use DISM to remove these updates. Do first create the C:\Temp folder with md c:\temp if it doesn't exist yet!

dism /image:c:\ /remove-package /packagename:myproblematicpackagetoremove /scratchdir:c:\temp


When done, close the command prompt, shut down the VM and then start it.


It will take a while, but it will succeed and you'll be greeted by a logon screen. Good luck!

Important: Do not try any other repair options, or removing the updates with DISM might fail. We chose to remove all 3 updates from tonight to make sure. It might suffice to remove the delta one alone, but we wanted to have the VM back as it was last night so more testing can be done before it is deployed again.

So, basically, don't auto approve updates blindly, but test, validate & roll out in phases. Have great backups and TESTED restores. All in all, we were only bitten in the lab, a couple of test/dev VMs and some of our infra VMs. Most of these are redundant and are patched staggered, so our services were never badly affected. That gave us time to troubleshoot and investigate and warn our colleagues. As you can see here, the issue was a delta update that made it into WSUS and was installed together with the full CU. Just manually downloading the CU and testing it would not have given you a heads-up about the issue. This is a reminder that you need to test your real-life situation and processes as realistically as possible. When you're done with testing and cleaning up any fallout of this issue, make sure to patch your systems again!

Update: this also goes for Windows 10 Updates

Also see fellow MVP Mikael Nystrom's blog post: https://deploymentbunny.com/2017/10/11/the-october-2017-update-inaccessible-boot-device/

Update: we now also have the official MSFT response & fix for each and every scenario right here https://support.microsoft.com/en-us/help/4049094/windows-devices-may-fail-to-boot-after-installing-october-10-version-o