A 1st look at Discrete Device Assignment in Hyper-V

Let’s take a 1st look at Discrete Device Assignment in Hyper-V

Discrete Device Assignment (DDA) is the ability to take suitable PCI Express devices in a server and pass them through directly to a virtual machine.

This sounds a lot like SR-IOV that we got in Windows 2012  It might also make you think of virtual Fibre Channel in a VM where you get access to the FC HBA on the host. The concept is similar as in that it giving the physical capabilities to a virtual machine.

But this is about Windows Server 2016 new capability which takes all this a lot further. The biggest use case for this seems to be GPUs and NVMe disks. In environment where absolute speed matters the gains by doing DDA are worth the effort. Especially when this works well with live migrations (not yet as far as I can see right now, and it’s probably quite a challenge).

There a great series of blog post on DDA

Microsoft also has a survey script available to find potential DDA devices on your hosts. It will reporting on which of them meet the criteria for passing them through to a VM.

https://github.com/Microsoft/Virtualization-Documentation/tree/master/hyperv-samples/benarm-powershell/DDA

That script is a very nice educational tool by the way. When I look at one of my servers I find that I have a couple of devices that are potential candidates:

So I’m looking at the Mellanox NICs. I wondered if this is the first step to RDMA in the guest. Probably not. The way I’m seeing the network stack evolve in Windows 2016 that’s the place where that will be handled.

image

And the Nvidia GRID K1 –  one of the poster childs of DDA and needed to compete in the high end VDI market

image

And yes the Emulex FC HBAs in the server. This is interesting and I’m curious about the vFC/DDA with FC story. But I have no further info on this. vFC is still there in W2K16 and I don’t think this DDA will replace it.

image

And finally it showed me the PERC controller as a potential candidate. Now trying to get a PERC 730 exposed to a VM with DDA sounds like and experiment I might just spend some evening hours one just to see where it leads. But only NVMe is supported. But hey a boy can play right?

image

So we have some lab work to do and we’ll see where we end up when we get TPv5 in our hands. What I also really need to get my hands on are some NVMe disks Smile But soon I’ll publish my findings on configuring this with an NVIDIA GRID K1 GPU.

I am a Veeam Vanguard 2016

I am a Veeam Vanguard 2016

I have the distinct pleasure to share with you that I am a Veeam Vanguard 2016

VeeamVanguardAward

I received this news earlier in my mailbox. This means I can add a token to my inaugural member trophy soon.

Personally I have always enjoyed working with the Veeam products, Veeam as a company and the employees. Why? Quality. The fact they truly support their products and they acknowledge they exist because they help customers deal with their needs is what makes it work. When a product works and helps us get the job done I’m not shy about sharing my experiences and opinions to help out others.

That doesn’t mean I, as a customer, am the emperor at Veeam but it does mean I’m valued and they listen to the feedback I have. Be honest, don’t hold back, but never be “vicious” and a good company will welcome what you have to say. The have high standards when it comes to supporting their customers without failing their support engineers. It’s a balance you need to find, and they have found it. It only augmented my opinion of them. They are very professional and I enjoy the interactions I’ve had with their managers, strategists, evangelists and developers.

veeam_vanguard

Those are the discussions and interactions that are both very educational, interesting and fun to do. So I’m looking for more of that now that I have been renewed as a Veeam Vanguard for 2016!

The Hyper-V Amigos Episode 10

It’s with great pride that the Hyper-V Amigos ride again and for The Hyper-V Amigos Episode 10 they dive into what’s new and improved in Windows Server 2016 Failover Clustering.

image

Well OK we only discuss a few subjects in this web cast as there is only a limited amount of time. I’ll present an overview of during my session at the German Cloud and Datacenter conference on May 12th in Germany. An hour is not enough for a deep dive into everything but we will build on our session we did at the Technical Summit (November 2014) in Germany on Improvements in Failover Cluster 2012 R2 ad get you up to speed so you can select what to investigate further.

Until then, enjoy the webcast and I hope it helps prepare you for what’s coming and entices you to join us at the Cloud and Datacenter Summit in Germany on May 12th! And if clustering alone is not enough to bring you over check out the agenda and you might realize what great gathering of experts is happing at the conference. Just look at the content, the breath and depth of the cloud and datacenter technologies being discussed is vast!

The Cluster and 0.Cluster Registry Hives

The cluster database

In a Windows Server Cluster the cluster database is where the cluster configuration gets stored. It’s a file called CLUSDB with some assisting files (CLUSDB.1.container, CLUSDB.2.container, CLUSDB.blf) and you’ll find those in C:\Windows\Cluster (%systemroot%\Cluster).

clip_image002

But the cluster database also lives in a registry hive that gets loaded when the cluster service gets started. You’ll find under HKEY_LOCAL_MACHINE and it’s called Cluster. You might also find a 0.Cluster hive on one of the nodes of the cluster.

clip_image004

The 0.Cluster hive gets loaded on a node that is the owner of the disk witness. So if you have a cloud share or a file share witness this will not be found on any cluster node. Needless to say if there is no witness at all it won’t be found either.

On a lab cluster you can shut down the cluster service and see that the registry hive or hives go away. When you restart the cluster service the Cluster hive will reappear. 0.Cluster won’t as some other node is now owner of the disk witness and even when restarting the cluster service gets a vote back for the witness the 0.Cluster hive will be on that owner node.

If you don’t close the Cluster or 0.Cluster registry hive and navigate to another key when you test this you’ll get an error message thrown that the key cannot be opened. It won’t prevent the cluster service from being stopped but you’ll see an error as the key has gone. If you navigate away, refresh (F5) you’ll see they have indeed gone.

So far the introduction about the Cluster and 0.Cluster Registry Hives.

How is the cluster database kept in sync and consistent?

Good so now we know the registry lives in multiple places and gets replicated between nodes. That replication is paramount to a healthy cluster and it should not be messed with. You can see an DWORD value under the Cluster Key called PaxosTag (see https://support.microsoft.com/en-us/kb/947713 for more information). That’s here the version number lives that keep track of any changes and which is important in maintaining the cluster DB consistency between the nodes and the disk witness – if present – as it’s responsible for replicating changes.

clip_image006

You might know that certain operations require all the nodes to be on line and some do not. When it’s require you can be pretty sure it’s a change that’s paramount to the health of the cluster.

To demonstrate the PaxosTag edit the Cluster Networks Live Migration settings by enabling or disable some networks.

clip_image008

Hit F5 on the registry Cluster/0.cluster Hive and notice the tag has increased. That will be the case on all nodes!

As said when you have a disk witness the owner node of the witness disk also has 0.Cluster hive which gets loaded from the copy of the cluster DB that resides on the cluster witness disk.

clip_image010

As you can see you find 0.hive for the CLUSTERDB and the equivalent supporting files (.container, .blf) like you see under C:\Windows\Cluster on the cluster disk un the Cluster folder. Note that there is no reason to have a drive letter assigned to the witness disk. You don’t need to go there and I only did so to easily show you the content.

Is there a functional difference between a disk witness and a file share or cloud witness?

Yes, a small one you’ll notice under certain conditions. Remember a file share of cloud witness does not hold a copy of the registry database. That also means there’s so no 0.Cluster hive to be found in the registry of the owner node. In the case of a file share you’ll find a folder with a GUID for its name and some files and with a cloud witness you see a file with the GUID of the ClusterInstanceID for its name in the storage blob. It’s bit differently organized but the functionality of these two is exactly the same. This information is used to determine what node holds the latest change and in combination with the PaxosTag what should be replicated.

The reason I mention this difference is that the disk witness copy of the Cluster DB is important because it gives a disk witness a small edge over the other witness types under certain scenarios.

Before Windows 2008 there was no witness disk but a “quorum drive”. It always had the latest copy of the database. It acted as the master copy and was the source for replicating any changes to all nodes to keep them up to date. When a cluster is shut down and has to come up, the first node would download the copy from the quorum drive and then the cluster is formed. The reliance on that quorum copy was a single point of failure actually. So that’s has changed. The PaxosTag is paramount here. All nodes and the disk witness hold an up to date copy, which would mean the PaxosTag is the same everywhere. Any change as you just tested above updates that PaxosTag on the node you’re working on and is replicated to every other node and to the disk witness.

So now when a cluster is brought up the first node you start compares it’s PaxosTag with the one on the disk witness. The higher one (more recent one) “wins” and that copy is used. So either the local clusterDB is used and updates the version on the disk witness or vice versa. No more single point of failure!

There’s a great article on this subject called Failover Cluster Node Startup Order in Windows Server 2012 R2. When you read this you’ll notice that the disk witness has an advantage in some scenarios when it comes to the capability to keep a cluster running and started. With a file share or cloud witness you might have to use -forcequorum to get the cluster up if the last node to be shut down can’t be started the first. Sure these are perhaps less common or “edge” scenarios but still. There’s a very good reason why the dynamic vote and dynamic witness have been introduced and it makes the cluster a lot more resilient. A disk witness can go just a little further under certain conditions. But as it’s not suited for all scenarios (stretched cluster) we have the other options.

Heed my warnings!

The cluster DB resides in multiple places on each node in both files and in the registry. It is an extremely bad idea to mess round in the Cluster and 0.Cluster registry hives to clean out “cluster objects”. You’re not touching the CLUSDB file that way or the PaxosTag used for replicating changes and things go bad rather quickly. It’s a bad situation to be in and for a VM you tried to remove that way you might see:

  • You cannot live or quick migrate that VM. You cannot start that VM. You cannot remove that VM from the cluster. It’s a phantom.
  • Even worse, you cannot add a node to the cluster anymore.
  • To make it totally scary, a server restart ends up with a node where the cluster service won’t start and you’ve just lost a node that you have to evict from the cluster.

I have luckily only seen a few situations where people had registry corruption or “cleaned out” the registry of cluster objects they wanted to get rid of. This is a nightmare scenario and it’s hard, if even possible at all, to recover from without backups. So whatever pickle you get into, cleaning out objects in the Cluster and/or 0.Cluster registry hive is NOT a good idea and will only get you into more trouble.

Heed the warnings in the aging but still very relevant TechNet blog Deleting a Cluster resource? Do it the supported way!

I have been in very few situations where I managed to get out of such a mess this but it’s a tedious nightmare and it only worked because I had some information that I really needed to fix it. Once I succeeded with almost no down time, which was pure luck. The other time cluster was brought down, the cluster service on multiple nodes didn’t even start anymore and it was a restore of the cluster registry hives that saved the day. Without a system state backup of the cluster node you’re out of luck and you have to destroy that cluster and recreate it. Not exactly a great moment for high availability.

If you decide to do muck around in the registry anyway and you ask me for help I’ll only do so if it pays 2000 € per hour, without any promise or guarantee of results and where I bill a minimum of 24 hours. Just to make sure you never ever do that again.