Client Access & Windows Server 2016 Site Aware Stretched Clusters

Introduction

There’s more to business continuity than having multiple locations. When it comes to high availability, or perhaps more accurately disaster recovery and business continuity people tend to focus on the good news. Some managers don’t want to be bothered by the details of our incompetency (i.e. reality and laws of physics) and vendors only like to focus on what they can sell with the biggest profit margin. Anything raining on that party falls under annoying details. When such a manager and such a sales man find each other it’s a match made in heaven. You’re the one who’s bringing the rain. It comes under the form of a simple question. How are we going to expose the failed over services internally and externally to the users and customers? What you mean that million-dollar investment in multiple SANs, clusters and consultants isn’t sufficient? Nope!

clip_image002

One piece of very good news is that in Windows Server 2016 Failover Clustering we can now leverage a cloud witness as well, next to a file share witness. This has the benefits we do not need a 3rd site for the file share witness. Which was not always feasible, sometimes a bit convoluted to achieve in the cloud via IAAS or depended on a rather less dependable server or PC somewhere in a branch office.

What’s the problem?

The problem is that failing over the workload with the services (VMs, SQL, File Servers, …) in a healthy, consistent state is only part of the challenge. The other part is to make sure that your clients (human or machines) can actually access those failed over services. If required or possible without noticing or with the smallest possible interruption possible. Even when you can achieve failover with only seconds of service interruption, some applications just can handle this gracefully or not at all.

The thing is when you have multiple sites that often means distinct separate subnets / networks. So when that VM with IP address of 10.10.100.124 on default untagged VLAN 100 fails over to the other site how will the clients in the various branch offices or on the internet access it services? DNS point to 10.10.100.124 under normal conditions.

Well when the IP address can be updated for the DNS record thanks to “Multi-Subnet Resource Configuration” (SQL Server, File Share) thing will work again, eventually, given enough time.

clip_image004

Multi-Subnet Resource Configuration works as follows. We have a single network name resource which we make dependent on multiple IP Address resources. In cluster terms that’s a “OR” decency when looking at the validation report. The secret sauce is that only one of the IP address resources of the network name resource is online at any given time. This gets registered in DNS and that’s what the clients use to access the service.

This works but the DNS record need to be upgraded, DNS replication needs to happen, client their DNS cache needs to expire and update etc. You can be looking at half an hour of down time actually.

But what if Multi-Subnet Resource Configuration isn’t an option or we’re in a hurry? What are options and how well and fast do these work? That’s the point at which the storage vendor is already counting the profits, the PM states the job’s done and the boss has already decided the project is a success and the network guys have some questions about YOUR problems. Let’s discuss some of options to deal with accessing services after a site activation.

Note: Hyper-V replica has the ability inject an alternate IP address on failover but we’re talking about a stretched cluster here, where replication happens at the storage level, not at the application level (Hyper-V) for the virtual machines.

Software Defined Networking Aka Network Virtualization.

Using Hyper-V Network Virtualization (HNV) abstract VMs logical subnet boundaries. This gives each virtual network the illusion it is running as a physical network. The typical example for this is multiple tenants that have the same IP space. The fact that it overlays physical network is also very handy when it comes to one and the same tenants in multi-site scenarios. Virtual networks allow VMs to move across different physical networks without re-configuring IP address in guest OS.

clip_image005

This totally abstracts the networks and it works great for virtual machines (Hyper-V). It doesn’t have to be limited to a single DC or site. Do note that there’s things to discuss around CSVs, Live Migration cluster wise and routing, gateways, DNS, geo load balancing access wise but you get the idea. When it comes to different subnets, different sites in regards to clustering things are not as easy as it seems. For this discussion we’re limiting ourselves to client connectivity to resources that move to another site and don’t dive into the details of network virtualization either.

Network Name Properties

There’s two cluster network name resource property setting you can configure to help reducing downtime after a failover.

RegisterAllProvidersIP cluster network name resource property

Remember our first story of “Multi-Subnet Resource Configuration” with the DNS updates and cache that has to expire? Well this can be enhanced as long as the applications can hand handle it. We can configure the DNS registration behavior via the RegisterAllProvidersIP property of a cluster networks name resource.

Get-ClusterResource MySQLServer |

Set-ClusterParameter RegisterAllProvidersIP 1

By setting this to TRUE all the IP address resources, on line and off line, are registered in DNS. If you have a “enlightened” application that can check for and handle multiple IP addresses and determine which one to use it allows for faster client reconnects. This works great with SQL Server.

HOstRecordTTL cluster network name resource property

This is great but has limited scope as the application has to have the logic to handle multiple registered IP addresses for the same resource and figure out when to use which one. SQL Server can do this, so can Exchange. What about a file server? RegisterAllProvidersIP won’t work but we can reduce the time to live of the DNS record for a cluster network resource IP address on the client from 20 minutes to 5 minutes or lower.

Get-ClusterResource MyFileShare |

Set-ClusterParameter HostRecordTTL 300

This is not an option for Hyper-V, there network virtualization works better or we use other options. Read on!

Stretch your VLANs

Here the VLAN(s) stretch across the sites. This means that the IP address of the service (VMs, SQL Servers, File Shares, …) never changes making it very easy to have the clients reconnect very fast.

clip_image007

Easy for the apps and the system administrators. Well sort of, chances are that the networks admins will chip in and put a kill contract out on you with some assassins. Just saying. In a perfect world this would be a good idea. In reality layer 2 and spanning tree are making sure you’ll sort of regret it or at least deal with the drawback and fall out. Choose wisely.

Abstract the network devices

This is a network vendors provided solution and I don’t see it very much in the wild. In this approach the network devices use a 3rd IP address that get registered in DNS for use by the clients. The fact that the workload switches between subnets when failing over between sites is irrelevant to the clients.

clip_image018

Cisco has this in a couple of solutions where NAT or a VIP is used to achieve this. As this is network appliance/ hardware based it works with any workload.

SLA your way out

Some people “mitigate” the prolonged down time by having a separate SLA for local failover versus site failover. Cool, but if I was cynical I could state that this is just lawyer behavior. You create fine print and “cover your ass” for that scenario. It’s not really solving anything but accepting longer down time and having all involved parties recognize and accept that fact. This is a valid approach.

Be creative & drive towards maximum portability

In an ideal world you can provision apps & services so fast you only need to protect and failover the persistent data. A world of micro services, containers where servers and virtual machines are cattle. But many of us will have to deal with servers being holey cows for now.

The above approaches are the most common options. There are more variations to these. One of those could be bases around the use of a dedicated management domain on both sites. It’s a concept I’ve used a couple of time where and when allowed.

It has some drawbacks or at least some complexities to deal with and one such example might be when configuring host based backups that need access to the guest VMs. This requires some extra firewall configuration. Nothing that would prevent you from doing so with good backup products like Veeam and it’s something you’re probably used to doing already for monitoring and backups across domains anyway.

But it also has serious benefits as the actual business domains are completely separate from the management domain and potentially 100% virtualized but that’s not a hard requirement as long as you keep the remaining physical servers in their own site dependent subnet which routes, these don’t move anyway, and they should have workloads that are distributed anyway like AD, Exchange DAGs, etc. The big benefits compared to a stretched cluster is that you can have the same subnet(s) on both sides of the stretched cluster for your virtual machines and you change the routing and endpoints for your public and private access to the services. Instead of making the changes to the cluster resources you do so higher up at the stack. It’s a bit like moving your data center to new location “as is” and directing the clients to the new location. This removes the need for stretched VLAN, or implementing network virtualization, at the cost of a bit more down time & work to “switch”. It’s worth considering.

It helps to leveraging DNS and geo load balancing technologies in this but the core infrastructure (the site ware stretched cluster) can run in a fully routed / Layer 3 fashion.

Sure you’ll still need to make sure the traffic from the offices goes to the correct data center now and it really rocks if you have your internet presence geo-load balanced in some way but let’s face it. But you needed to have that in order for any approach anyway.

Closing thoughts

There is a lot more detail and complexity to all of this than I covered in this short article. This is meant an eye opener, a point from where to start the discussion with the business demanding 24/7, 99.999% a zero cost and effort. Like Amazon or Azure but then better, cheaper and on premises. Ouch! As you might expect, this can’t be dealt with in just a few pages. Getting a solid, working disaster avoidance, recovery and business continuity plan & process is going to take some effort to create and maintain.

Fully failing over without any work or a second of downtime is a very expensive illusion and you might be better off with 15 to 20 minutes of down time for 90% of the workload and 30 to 60 minutes for the remaining 10% that trying to chase the ultimate perfection of 100% zero downtime ever for all services. Chances are you’ll go broke trying and pretending, which means failing. Remember that when your primary data center was just taken down or worse, burnt down, dealing with a couple of hours of down time to get you secondary site up and running 100 % isn’t actually as bad as it seems when discussing 2 or 3 hours of down time in a management meeting. Somehow it always seems a bigger deal when not faced with the alternative of the business being wiped out.

One final note, don’t forget to tell your bosses you’re going to have to practices this a couple of times per year. Doing it for real count’s a practice only if it’s the 3rd time you do it. Good luck!

Setting up Discrete Device Assignment with a GPU

Introduction

Let’s take a look at setting up Discrete Device Assignment with a GPU. Windows Server 2016 introduces Discrete Device Assignment (DDA). This allows a PCI Express connected device, that supports this, to be connected directly through to a virtual machine.

The idea behind this is to gain extra performance. In our case we’ll use one of the four display adapters in our NVIDIA GROD K1 to assign to a VM via DDA. The 3 other can remain for use with RemoteFX. Perhaps we could even leverage DDA for GPUs that do not support RemoteFX to be used directly by a VM, we’ll see.

As we directly assign the hardware to VM we need to install the drivers for that hardware inside of that VM just like you need to do with real hardware.

I refer you to the starting blog of a series on DDA in Windows 2016:

Here you can get a wealth of extra information. My experimentations with this feature relied heavily on these blogs and MSFT provide GitHub script to query a host for DDA capable devices. That was very educational in regards to finding out the PowerShell we needed to get DDA to work! Please see A 1st look at Discrete Device Assignment in Hyper-V to see the output of this script and how we identified that our NVIDIA GRID K1 card was a DDA capable candidate.

Requirements

There are some conditions the host system needs to meet to even be able to use DDA. The host needs to support Access Control services which enables pass through of PCI Express devices in a secure manner. The host also need to support SLAT and Intel VT-d2 or AMD I/O MMU. This is dependent on UEFI, which is not a big issue. All my W2K12R2 cluster nodes & member servers run UEFI already anyway. All in all, these requirements are covered by modern hardware. The hardware you buy today for Windows Server 2012 R2 meets those requirements when you buy decent enterprise grade hardware such as the DELL PowerEdge R730 series. That’s the model I had available to test with. Nothing about these requirements is shocking or unusual.

A PCI express device that is used for DDA cannot be used by the host in any way. You’ll see we actually dismount it form the host. It also cannot be shared amongst VMs. It’s used exclusively by the VM it’s assigned to. As you can imagine this is not a scenario for live migration and VM mobility. This is a major difference between DDA and SR-IOV or virtual fibre channel where live migration is supported in very creative, different ways. Now I’m not saying Microsoft will never be able to combine DDA with live migration, but to the best of my knowledge it’s not available today.

The host requirements are also listed here: https://technet.microsoft.com/en-us/library/mt608570.aspx

  • The processor must have either Intel’s Extended Page Table (EPT) or AMD’s Nested Page Table (NPT).

The chipset must have:

  • Interrupt remapping: Intel’s VT-d with the Interrupt Remapping capability (VT-d2) or any version of AMD I/O Memory Management Unit (I/O MMU).
  • DMA remapping: Intel’s VT-d with Queued Invalidations or any AMD I/O MMU.
  • Access control services (ACS) on PCI Express root ports.
  • The firmware tables must expose the I/O MMU to the Windows hypervisor. Note that this feature might be turned off in the UEFI or BIOS. For instructions, see the hardware documentation or contact your hardware manufacturer.

You get this technology both on premises with Windows Server 2016 as and with virtual machines running Windows Server 2016; Windows 10 (1511 or higher) and Linux distros that support it. It’s also an offering on high end Azure VMs (IAAS). It supports both Generation 1 and generation 2 virtual machines. All be it that generation 2 is X64 bit only, this might be important for certain client VMs. We’ve dumped 32 bit Operating systems over decade ago so to me this is a non-issue.

For this article I used a DELL PowerEdge R730, a NVIIA GRID K1 GPU. Windows Server 2016 TPv4 with CU of March 2016 and Windows 10 Insider Build 14295.

Microsoft supports 2 devices at the moment:

  • GPUs and coprocessors
  • NVMe (Non-Volatile Memory express) SSD controllers

Other devices might work but you’re dependent on the hardware vendor for support. Maybe that’s OK for you, maybe it’s not.

Below I describe the steps to get DDA working. There’s also a rough video out on my Vimeo channel: Discrete Device Assignment with a GPU in Windows 2016 TPv4.

Setting up Discrete Device Assignment with a GPU

Preparing a Hyper-V host with a GPU for Discrete Device Assignment

First of all, you need a Windows Server 2016 Host running Hyper-V. It needs to meet the hardware specifications discussed above, boot form EUFI with VT-d enabled and you need a PCI Express GPU to work with that can be used for discrete device assignment.

It pays to get the most recent GPU driver installed and for our NVIDIA GRID K1 which was 362.13 at the time of writing.

clip_image001

On the host when your installation of the GPU and drivers is OK you’ll see 4 NIVIDIA GRID K1 Display Adapters in device manager.

clip_image002

We create a generation 2 VM for this demo. In case you recuperate a VM that already has a RemoteFX adapter in use, remove it. You want a VM that only has a Microsoft Hyper-V Video Adapter.

clip_image003

In Hyper-V manager I also exclude the NVDIA GRID K1 GPU I’ll configure for DDA from being used by RemoteFX. In this show case that we’ll use the first one.

clip_image005

OK, we’re all set to start with our DDA setup for an NVIDIA GRID K1 GPU!

Assign the PCI Express GPU to the VM

Prepping the GPU and host

As stated above to have a GPU assigned to a VM we must make sure that the host no longer has use of it. We do this by dismounting the display adapter which renders it unavailable to the host. Once that is don we can assign that device to a VM.

Let’s walk through this. Tip: run PoSh or the ISE as an administrator.

We run Get-VMHostAssignableDevice. This return nothing as no devices yet have been made available for DDA.

I now want to find my display adapters

#Grab all the GPUs in the Hyper-V Host

$MyDisplays = Get-PnpDevice | Where-Object {$_.Class -eq “Display”}

$MyDisplays | ft -AutoSize

This returns

clip_image007

As you can see it list all adapters. Let’s limit this to the NVIDIA ones alone.

#We can get all NVIDIA cards in the host by querying for the nvlddmkm

#service which is a NVIDIA kernel mode driver

$MyNVIDIA = Get-PnpDevice | Where-Object {$_.Class -eq “Display”} |

Where-Object {$_.Service -eq “nvlddmkm”}

$MyNVIDIA | ft -AutoSize

clip_image009

If you have multiple type of NVIDIA cared you might also want to filter those out based on the friendly name. In our case with only one GPU this doesn’t filter anything. What we really want to do is excluded any display adapter that has already been dismounted. For that we use the -PresentOnly parameter.

#We actually only need the NVIDIA GRID K1 cards, let’s filter some #more,there might be other NVDIA GPUs.We might already have dismounted #some of those GPU before. For this exercise we want to work with the #ones that are mountedt he paramater -PresentOnly will do just that.

$MyNVidiaGRIDK1 = Get-PnpDevice -PresentOnly| Where-Object {$_.Class -eq “Display”} |

Where-Object {$_.Service -eq “nvlddmkm”} |

Where-Object {$_.FriendlyName -eq “NVIDIA Grid K1”}

$MyNVidiaGRIDK1 | ft -AutoSize

Extra info: When you have already used one of the display adapters for DDA (Status “UnKnown”). Like in the screenshot below.

clip_image011

We can filter out any already unmounted device by using the -PresentOnly parameter. As we could have more NVIDIA adaptors in the host, potentially different models, we’ll filter that out with the FriendlyName so we only get the NVIDIA GRID K1 display adapters.

clip_image013

In the example above you see 3 display adapters as 1 of the 4 on the GPU is already dismounted. The “Unkown” one isn’t returned anymore.

Anyway, when we run

$MyNVidiaGRIDK1 = Get-PnpDevice -PresentOnly| Where-Object {$_.Class -eq “Display”} |

Where-Object {$_.Service -eq “nvlddmkm”} |

Where-Object {$_.FriendlyName -eq “NVIDIA Grid K1”}

$MyNVidiaGRIDK1 | ft -AutoSize

We get an array with the display adapters relevant to us. I’ll use the first (which I excluded form use with RemoteFX). In a zero based array this means I disable that display adapter as follows:

Disable-PnpDevice -InstanceId $MyNVidiaGRIDK1[0].InstanceId -Confirm:$false

Your screen might flicker when you do this. This is actually like disabling it in device manager as you can see when you take a peek at it.clip_image015

When you now run

$MyNVidiaGRIDK1 = Get-PnpDevice -PresentOnly| Where-Object {$_.Class -eq “Display”} |

Where-Object {$_.Service -eq “nvlddmkm”} |

Where-Object {$_.FriendlyName -eq “NVIDIA Grid K1”}

$MyNVidiaGRIDK1 | ft -AutoSize

Again you’ll see

clip_image017

The disabled adapter has error as a status. This is the one we will dismount so that the host no longer has access to it. The array is zero based we grab the data about that display adapter.

#Grab the data (multi string value) for the display adapater

$DataOfGPUToDDismount = Get-PnpDeviceProperty DEVPKEY_Device_LocationPaths -InstanceId $MyNVidiaGRIDK1[0].InstanceId

$DataOfGPUToDDismount | ft -AutoSize

clip_image019

We grab the location path out of that data (it’s the first value, zero based, in the multi string value).

#Grab the location path out of the data (it’s the first value, zero based)

#How do I know: read the MSFT blogs and read the script by MSFT I mentioned earlier.

$locationpath = ($DataOfGPUToDDismount).data[0]

$locationpath | ft -AutoSize

clip_image021

This locationpath is what we need to dismount the display adapter.

#Use this location path to dismount the display adapter

Dismount-VmHostAssignableDevice -locationpath $locationpath -force

Once you dismount a display adapter it becomes available for DDA. When we now run

$MyNVidiaGRIDK1 = Get-PnpDevice -PresentOnly| Where-Object {$_.Class -eq “Display”} |

Where-Object {$_.Service -eq “nvlddmkm”} |

Where-Object {$_.FriendlyName -eq “NVIDIA Grid K1”}

$MyNVidiaGRIDK1 | ft -AutoSize

We get:

clip_image023

As you can see the dismounted display adapter is no longer present in display adapters when filtering with -presentonly. It’s also gone in device manager.

clip_image024

Yes, it’s gone in device manager. There’s only 3 NVIDIA GRID K1 adaptors left. Do note that the device is unmounted and as such unavailable to the host but it is still functional and can be assigned to a VM.That device is still fully functional. The remaining NVIDIA GRID K1 adapters can still be used with RemoteFX for VMs.

It’s not “lost” however. When we adapt our query to find the system devices that have dismounted I the Friendly name we can still get to it (needed to restore the GPU to the host when needed). This means that -PresentOnly for system has a different outcome depending on the class. It’s no longer available in the display class, but it is in the system class.

clip_image026

And we can also see it in System devices node in Device Manager where is labeled as “PCI Express Graphics Processing Unit – Dismounted”.

We now run Get-VMHostAssignableDevice again see that our dismounted adapter has become available to be assigned via DDA.

clip_image029

This means we are ready to assign the display adapter exclusively to our Windows 10 VM.

Assigning a GPU to a VM via DDA

You need to shut down the VM

Change the automatic stop action for the VM to “turn off”

clip_image031

This is mandatory our you can’t assign hardware via DDA. It will throw an error if you forget this.

I also set my VM configuration as described in https://blogs.technet.microsoft.com/virtualization/2015/11/23/discrete-device-assignment-gpus/

I give it up to 4GB of memory as that’s what this NVIDIA model seems to support. According to the blog the GPUs work better (or only work) if you set -GuestControlledCacheTypes to true.

“GPUs tend to work a lot faster if the processor can run in a mode where bits in video memory can be held in the processor’s cache for a while before they are written to memory, waiting for other writes to the same memory. This is called “write-combining.” In general, this isn’t enabled in Hyper-V VMs. If you want your GPU to work, you’ll probably need to enable it”

#Let’s set the memory resources on our generation 2 VM for the GPU

Set-VM RFX-WIN10ENT -GuestControlledCacheTypes $True -LowMemoryMappedIoSpace 2000MB -HighMemoryMappedIoSpace 4000MB

You can query these values with Get-VM RFX-WIN10ENT | fl *

We now assign the display adapter to the VM using that same $locationpath

Add-VMAssignableDevice -LocationPath $locationpath -VMName RFX-WIN10ENT

Boot the VM, login and go to device manager.

clip_image033

We now need to install the device driver for our NVIDIA GRID K1 GPU, basically the one we used on the host.

clip_image035

Once that’s done we can see our NVIDIA GRID K1 in the guest VM. Cool!

clip_image037

You’ll need a restart of the VM in relation to the hardware change. And the result after all that hard work is very nice graphical experience compared to RemoteFX

clip_image039

What you don’t believe it’s using an NVIDIA GPU inside of a VM? Open up perfmon in the guest VM and add counters, you’ll find the NVIDIA GPU one and see you have a GRID K1 in there.

clip_image041

Start some GP intensive process and see those counters rise.

image

Remove a GPU from the VM & return it to the host.

When you no longer need a GPU for DDA to a VM you can reverse the process to remove it from the VM and return it to the host.

Shut down the VM guest OS that’s currently using the NVIDIA GPU graphics adapter.

In an elevated PowerShell prompt or ISE we grab the locationpath for the dismounted display adapter as follows

$DisMountedDevice = Get-PnpDevice -PresentOnly |

Where-Object {$_.Class -eq “System” -AND $_.FriendlyName -like “PCI Express Graphics Processing Unit – Dismounted”}

$DisMountedDevice | ft -AutoSize

clip_image045

We only have one GPU that’s dismounted so that’s easy. When there are more display adapters unmounted this can be a bit more confusing. Some documentation might be in order to make sure you use the correct one.

We then grab the locationpath for this device, which is at location 0 as is an array with one entry (zero based). So in this case we could even leave out the index.

$LocationPathOfDismountedDA = ($DisMountedDevice[0] | Get-PnpDeviceProperty DEVPKEY_Device_LocationPaths).data[0]

$LocationPathOfDismountedDA

clip_image047

Using that locationpath we remove the DDA GPU from the VM

#Remove the display adapter from the VM.

Remove-VMAssignableDevice -LocationPath $LocationPathOfDismountedDA -VMName RFX-WIN10ENT

We now mount the display adapter on the host using that same locationpath

#Mount the display adapter again.

Mount-VmHostAssignableDevice -locationpath $LocationPathOfDismountedDA

We grab the display adapter that’s now back as disabled under device manager of in an “error” status in the display class of the pnpdevices.

#It will now show up in our query for -presentonly NVIDIA GRIDK1 display adapters

#It status will be “Error” (not “Unknown”)

$MyNVidiaGRIDK1 = Get-PnpDevice -PresentOnly| Where-Object {$_.Class -eq “Display”} |

Where-Object {$_.Service -eq “nvlddmkm”} |

Where-Object {$_.FriendlyName -eq “NVIDIA Grid K1”}

$MyNVidiaGRIDK1 | ft -AutoSize

clip_image049

clip_image051

We grab that first entry to enable the display adapter (or do it in device manager)

#Enable the display adapater

Enable-PnpDevice -InstanceId $MyNVidiaGRIDK1[0].InstanceId -Confirm:$false

The GPU is now back and available to the host. When your run you Get-VMHostAssignableDevice it won’t return this display adapter anymore.

We’ve enabled the display adapter and it’s ready for use by the host or RemoteFX again. Finally, we set the memory resources & configuration for the VM back to its defaults before I start it again (PS: these defaults are what the values are on standard VM without ever having any DDA GPU installed. That’s where I got ‘m)

#Let’s set the memory resources on our VM for the GPU to the defaults

Set-VM RFX-WIN10ENT -GuestControlledCacheTypes $False -LowMemoryMappedIoSpace 256MB -HighMemoryMappedIoSpace 512MB

clip_image053

Now tell me all this wasn’t pure fun!

The Hyper-V Amigos Episode 10

It’s with great pride that the Hyper-V Amigos ride again and for The Hyper-V Amigos Episode 10 they dive into what’s new and improved in Windows Server 2016 Failover Clustering.

image

Well OK we only discuss a few subjects in this web cast as there is only a limited amount of time. I’ll present an overview of during my session at the German Cloud and Datacenter conference on May 12th in Germany. An hour is not enough for a deep dive into everything but we will build on our session we did at the Technical Summit (November 2014) in Germany on Improvements in Failover Cluster 2012 R2 ad get you up to speed so you can select what to investigate further.

Until then, enjoy the webcast and I hope it helps prepare you for what’s coming and entices you to join us at the Cloud and Datacenter Summit in Germany on May 12th! And if clustering alone is not enough to bring you over check out the agenda and you might realize what great gathering of experts is happing at the conference. Just look at the content, the breath and depth of the cloud and datacenter technologies being discussed is vast!

The Cluster and 0.Cluster Registry Hives

The cluster database

In a Windows Server Cluster the cluster database is where the cluster configuration gets stored. It’s a file called CLUSDB with some assisting files (CLUSDB.1.container, CLUSDB.2.container, CLUSDB.blf) and you’ll find those in C:\Windows\Cluster (%systemroot%\Cluster).

clip_image002

But the cluster database also lives in a registry hive that gets loaded when the cluster service gets started. You’ll find under HKEY_LOCAL_MACHINE and it’s called Cluster. You might also find a 0.Cluster hive on one of the nodes of the cluster.

clip_image004

The 0.Cluster hive gets loaded on a node that is the owner of the disk witness. So if you have a cloud share or a file share witness this will not be found on any cluster node. Needless to say if there is no witness at all it won’t be found either.

On a lab cluster you can shut down the cluster service and see that the registry hive or hives go away. When you restart the cluster service the Cluster hive will reappear. 0.Cluster won’t as some other node is now owner of the disk witness and even when restarting the cluster service gets a vote back for the witness the 0.Cluster hive will be on that owner node.

If you don’t close the Cluster or 0.Cluster registry hive and navigate to another key when you test this you’ll get an error message thrown that the key cannot be opened. It won’t prevent the cluster service from being stopped but you’ll see an error as the key has gone. If you navigate away, refresh (F5) you’ll see they have indeed gone.

So far the introduction about the Cluster and 0.Cluster Registry Hives.

How is the cluster database kept in sync and consistent?

Good so now we know the registry lives in multiple places and gets replicated between nodes. That replication is paramount to a healthy cluster and it should not be messed with. You can see an DWORD value under the Cluster Key called PaxosTag (see https://support.microsoft.com/en-us/kb/947713 for more information). That’s here the version number lives that keep track of any changes and which is important in maintaining the cluster DB consistency between the nodes and the disk witness – if present – as it’s responsible for replicating changes.

clip_image006

You might know that certain operations require all the nodes to be on line and some do not. When it’s require you can be pretty sure it’s a change that’s paramount to the health of the cluster.

To demonstrate the PaxosTag edit the Cluster Networks Live Migration settings by enabling or disable some networks.

clip_image008

Hit F5 on the registry Cluster/0.cluster Hive and notice the tag has increased. That will be the case on all nodes!

As said when you have a disk witness the owner node of the witness disk also has 0.Cluster hive which gets loaded from the copy of the cluster DB that resides on the cluster witness disk.

clip_image010

As you can see you find 0.hive for the CLUSTERDB and the equivalent supporting files (.container, .blf) like you see under C:\Windows\Cluster on the cluster disk un the Cluster folder. Note that there is no reason to have a drive letter assigned to the witness disk. You don’t need to go there and I only did so to easily show you the content.

Is there a functional difference between a disk witness and a file share or cloud witness?

Yes, a small one you’ll notice under certain conditions. Remember a file share of cloud witness does not hold a copy of the registry database. That also means there’s so no 0.Cluster hive to be found in the registry of the owner node. In the case of a file share you’ll find a folder with a GUID for its name and some files and with a cloud witness you see a file with the GUID of the ClusterInstanceID for its name in the storage blob. It’s bit differently organized but the functionality of these two is exactly the same. This information is used to determine what node holds the latest change and in combination with the PaxosTag what should be replicated.

The reason I mention this difference is that the disk witness copy of the Cluster DB is important because it gives a disk witness a small edge over the other witness types under certain scenarios.

Before Windows 2008 there was no witness disk but a “quorum drive”. It always had the latest copy of the database. It acted as the master copy and was the source for replicating any changes to all nodes to keep them up to date. When a cluster is shut down and has to come up, the first node would download the copy from the quorum drive and then the cluster is formed. The reliance on that quorum copy was a single point of failure actually. So that’s has changed. The PaxosTag is paramount here. All nodes and the disk witness hold an up to date copy, which would mean the PaxosTag is the same everywhere. Any change as you just tested above updates that PaxosTag on the node you’re working on and is replicated to every other node and to the disk witness.

So now when a cluster is brought up the first node you start compares it’s PaxosTag with the one on the disk witness. The higher one (more recent one) “wins” and that copy is used. So either the local clusterDB is used and updates the version on the disk witness or vice versa. No more single point of failure!

There’s a great article on this subject called Failover Cluster Node Startup Order in Windows Server 2012 R2. When you read this you’ll notice that the disk witness has an advantage in some scenarios when it comes to the capability to keep a cluster running and started. With a file share or cloud witness you might have to use -forcequorum to get the cluster up if the last node to be shut down can’t be started the first. Sure these are perhaps less common or “edge” scenarios but still. There’s a very good reason why the dynamic vote and dynamic witness have been introduced and it makes the cluster a lot more resilient. A disk witness can go just a little further under certain conditions. But as it’s not suited for all scenarios (stretched cluster) we have the other options.

Heed my warnings!

The cluster DB resides in multiple places on each node in both files and in the registry. It is an extremely bad idea to mess round in the Cluster and 0.Cluster registry hives to clean out “cluster objects”. You’re not touching the CLUSDB file that way or the PaxosTag used for replicating changes and things go bad rather quickly. It’s a bad situation to be in and for a VM you tried to remove that way you might see:

  • You cannot live or quick migrate that VM. You cannot start that VM. You cannot remove that VM from the cluster. It’s a phantom.
  • Even worse, you cannot add a node to the cluster anymore.
  • To make it totally scary, a server restart ends up with a node where the cluster service won’t start and you’ve just lost a node that you have to evict from the cluster.

I have luckily only seen a few situations where people had registry corruption or “cleaned out” the registry of cluster objects they wanted to get rid of. This is a nightmare scenario and it’s hard, if even possible at all, to recover from without backups. So whatever pickle you get into, cleaning out objects in the Cluster and/or 0.Cluster registry hive is NOT a good idea and will only get you into more trouble.

Heed the warnings in the aging but still very relevant TechNet blog Deleting a Cluster resource? Do it the supported way!

I have been in very few situations where I managed to get out of such a mess this but it’s a tedious nightmare and it only worked because I had some information that I really needed to fix it. Once I succeeded with almost no down time, which was pure luck. The other time cluster was brought down, the cluster service on multiple nodes didn’t even start anymore and it was a restore of the cluster registry hives that saved the day. Without a system state backup of the cluster node you’re out of luck and you have to destroy that cluster and recreate it. Not exactly a great moment for high availability.

If you decide to do muck around in the registry anyway and you ask me for help I’ll only do so if it pays 2000 € per hour, without any promise or guarantee of results and where I bill a minimum of 24 hours. Just to make sure you never ever do that again.