Latency kills


I was investigating a very problematic Windows Server 2016 Hyper-V cluster. That cluster was just performing horribly. “Everything” was hanging, stalling or crashing, and RHS.exe errors were flying around while WER dumps got created by the dozen. Things were extremely slow, up to the point where functionality was just failing. The “fun” thing was that the cluster validation wizard, while slow, gave that cluster a big thumbs up and a supported status, as if all was well.

Prying around

Time to pry around a bit and see if we could find something wrong. We saw live migrations stall, fail, stay pending forever or get stuck at a certain percentage, sometimes finally succeeding with ridiculous blackout times. We could not open virtual machine properties, or only very slowly. The Failover Cluster Manager GUI was highly unresponsive, but so were the Hyper-V Manager GUI and even PowerShell. Those were hanging while merely loading the virtual machines or enumerating them with Get-VM. Everything was slow to the point that it timed out or crashed. Restarting the services (Cluster, Hyper-V) didn’t do anything, and restarting VMMS was super slow or just got stuck. It was a depressing sight for which people tended to blame Hyper-V / Microsoft.

As the title gives away, it was latency. Not just ordinary high latency. Real bad latency. That kind of latency kills. Extreme latency produces symptoms that are similar to those of bugs or corrupt components of roles and features. We have a tendency to look for those first in the event logs, and then we look at the network and its usual suspects (VMQ, SET, DCB). But nothing I could find pointed to an issue there.

So, storage maybe? Well, we did find one Hyper-V host in the cluster with one HBA port producing too many errors, so we disabled that FC port for testing. No joy: the Hyper-V cluster, after a clean reboot of all nodes, remained problematic. So on to the storage array itself.

Well, holy smoke! On the two CSV volumes in that cluster we saw latencies that were so bad I could not even believe a single VM would boot. It actually made my appreciation for Hyper-V and clustering grow, as it managed to do at least a couple of things. With such latencies I would expect the services to just crash & call it a day.


The horrific latency on one of the CSV LUNs.

Looking at the logs we saw that the latencies occurred on the FC HBAs of the controllers. Each one was above 50ms, peaking at 150-250ms, with one huge peak at almost 500ms. We saw this on all four HBAs.


The latency on one of the 4 FC HBAs on one of the controllers. Not a good day. All HBAs had high latencies like this.

The issues were not at the host level (host HBAs), nor even at the IOPS/bandwidth level of the storage itself. The latency for some reason was spiking. Further investigation led to the conclusion that the issue was related to synchronous replication going totally wrong. Moving the replication mode to asynchronous fixed that. We’re now investigating why this happened and how to prevent it from happening again. But that’s another story.


Latency on one of the 4 FC HBAs on one of the controllers after we fixed the issue.
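
For a quick host-side look at disk latency before heading to the array, a minimal PowerShell sketch along these lines can help. It samples the standard PhysicalDisk latency counters with Get-Counter. The function names are hypothetical and the 50 ms threshold is only an assumption (it is simply the lowest value seen on the controllers in this case, not an official limit). Host-side counters will not necessarily match what the array reports on its own HBAs.

```powershell
# Hypothetical helper: converts a latency sample in seconds to milliseconds
# and flags it against a threshold. The 50 ms default is an assumption.
function Test-LatencySample {
    param(
        [Parameter(Mandatory)][double]$Seconds,
        [double]$ThresholdMs = 50
    )
    $ms = [math]::Round($Seconds * 1000, 1)
    [pscustomobject]@{
        LatencyMs = $ms
        IsHigh    = $ms -ge $ThresholdMs
    }
}

# Hypothetical helper: samples read/write latency for all physical disks on
# the local node. Counter names are the English ones and are localized on
# non-English Windows installations.
function Get-DiskLatencySnapshot {
    param(
        [int]$SampleIntervalSeconds = 5,
        [int]$MaxSamples = 12
    )
    $counters = '\PhysicalDisk(*)\Avg. Disk sec/Read',
                '\PhysicalDisk(*)\Avg. Disk sec/Write'
    Get-Counter -Counter $counters -SampleInterval $SampleIntervalSeconds -MaxSamples $MaxSamples |
        ForEach-Object { $_.CounterSamples } |
        ForEach-Object {
            $result = Test-LatencySample -Seconds $_.CookedValue
            [pscustomobject]@{
                Path      = $_.Path
                Timestamp = $_.Timestamp
                LatencyMs = $result.LatencyMs
                IsHigh    = $result.IsHigh
            }
        }
}

# Usage on a Hyper-V node (not run automatically):
# Get-DiskLatencySnapshot | Where-Object IsHigh | Sort-Object LatencyMs -Descending
```

If this shows nothing alarming on the hosts while the workload is crawling, that is one more hint to go and look at the array and the fabric.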

Do not assume anything

So, there you go. Everything depends on everything in some direct or indirect way. It’s all connected and that, my friends, is why I’m a proponent of “service resilience engineering” where the responsible team owns the entire stack. That is how you can act fast.

Set the preferred site for a CSV in a site-aware stretched failover cluster


I have presented many times over the past years on the new and enhanced capabilities of Microsoft Failover Clustering in Windows Server 2016 (Experts Live, Cloud & Datacenter Conference Germany, MicroWarehouse’s Windows Server 2016 Launch Event etc.). Feedback has shown me that there is still a lot of need for good failover cluster design and implementation guidance.

One area of enhancement is that you now have site-aware failover clusters in Windows Server 2016. That helps optimize the availability, behavior and performance of the workload. It leverages cluster fault domains, and in this case those fault domains are the sites where the cluster nodes reside.



You can leverage the site awareness to do all kinds of configuration optimizations. You can set a preferred site, creating a primary and a DRC site. The cluster behavior will optimize for that scenario. It will also help handle situations like a quorum split more easily and elegantly. You can create an “Active-Active” site configuration because cluster groups, such as virtual machines, can have their own preferred site.

As you can see in the picture above, there is a thing called Storage Affinity. That means that VMs follow storage and are placed in the same site where their associated storage resides. As such, VMs will begin live migrating to the same site as their associated CSV after 1 minute. The CSV load balancer will distribute within the preferred site. That’s cool. But when you set a preferred site at the cluster group level, as you do for virtual machines, how does one do this for a CSV?

It’s actually quite simple. A CSV is a cluster group, just like a VM is. So, for every CSV you can set that preferred site. You just grab the cluster group a bit differently. Let’s look at an example.

For a VM you’d do this:

(Get-ClusterGroup -Name DidierTest01).PreferredSite = 'Dublin'

Now for a CSV we go about it as follows:

Get-ClusterSharedVolume "Cluster Disk 1" | Get-ClusterGroup | Format-List *


The preferred site has not been set yet. To set the preferred site for a CSV you can do the following:

$NTFS01 = Get-ClusterSharedVolume "Cluster Disk 1" | Get-ClusterGroup
$NTFS01.PreferredSite = "Dublin"
$NTFS01.PreferredSite


You can remove a preferred site by setting it to $Null:

$NTFS01.PreferredSite = $Null
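
To apply the same preferred site to several CSVs in one go, the pattern above can be wrapped in a small pipeline helper. This is only a sketch: Set-GroupPreferredSite is a hypothetical function name, not a built-in cmdlet, and it assumes every object it receives exposes a writable PreferredSite property, as the cluster groups shown above do.

```powershell
# Hypothetical helper: sets PreferredSite on every cluster group object that
# comes down the pipeline and passes the object on for inspection.
function Set-GroupPreferredSite {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$Group,
        [Parameter(Mandatory)][string]$Site
    )
    process {
        $Group.PreferredSite = $Site
        $Group
    }
}

# Usage on a cluster node (not run automatically):
# Get-ClusterSharedVolume | Get-ClusterGroup |
#     Set-GroupPreferredSite -Site 'Dublin' |
#     Format-Table Name, PreferredSite
```

In an “Active-Active” design you would filter the CSVs first (for example by name) rather than sending them all to one site.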

That was not too hard, was it? There is one other thing to keep in mind. Do not forget to set up your site fault domains first and set the site for your cluster nodes before you start configuring preferred sites at the cluster group level, or it will throw an error. That’s the minimal setup of a site-aware cluster you need to have in place before you can do more fine-grained configurations.

New-ClusterFaultDomain -Name Dublin -Type Site -Description "Primary" -Location "Dublin DC1"
New-ClusterFaultDomain -Name Cork -Type Site -Description "Secondary" -Location "Cork DC2"
Set-ClusterFaultDomain -Name Node-A -Parent Dublin
Set-ClusterFaultDomain -Name Node-B -Parent Dublin
Set-ClusterFaultDomain -Name Node-C -Parent Cork
Set-ClusterFaultDomain -Name Node-D -Parent Cork


If you don’t do this and try to set preferred sites at the cluster group level you’ll get an error like:

Exception setting “PreferredSite”: “Unable to save property changes for ‘e95ad724-97d3-4848-91db-198ab8312737’.
The parameter is incorrect”
At line:1 char:1
+ $NTFS01.PreferredSite = “Dublin”
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], SetValueInvocationException
+ FullyQualifiedErrorId : ExceptionWhenSetting
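
A small guard can help avoid running into this error. The sketch below checks whether a site fault domain with the given name exists before you set it as a preferred site. Test-SiteDefined is a hypothetical helper, and it assumes the objects returned by Get-ClusterFaultDomain expose Name and Type properties.

```powershell
# Hypothetical helper: returns $true when a fault domain of type Site with
# the given name is present in the supplied list of fault domains.
function Test-SiteDefined {
    param(
        [Parameter(Mandatory)][string]$Site,
        [Parameter(Mandatory)][AllowEmptyCollection()][object[]]$FaultDomain
    )
    # Type is compared as a string so this works for enum and string values.
    $match = $FaultDomain | Where-Object { "$($_.Type)" -eq 'Site' -and $_.Name -eq $Site }
    [bool]$match
}

# Usage on a cluster node (not run automatically):
# if (Test-SiteDefined -Site 'Dublin' -FaultDomain (Get-ClusterFaultDomain)) {
#     $NTFS01.PreferredSite = 'Dublin'
# } else {
#     Write-Warning "Site 'Dublin' is not defined yet; create the fault domains first."
# }
```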

There is a lot more to say about site-aware stretched clusters but how to deal with setting a preferred site for a CSV must be the most common question I get on this subject. Well, now it’s published to help you all out. I hope it helps.

Set-AdfsSslCertificate: PS0159: the operation is not supported at the current Farm Behavior Level ‘1’. Raise the farm to at least version ‘2’ before retrying.


A Windows Server 2016 ADFS farm had its service communication certificate about to expire, so it was time to renew it. Easy, you think: get a new cert, get it up and running on all farm members and configure your ADFS farm to use it. Easy enough, running Set-AdfsSslCertificate, until you get an error.


Set-AdfsSslCertificate: PS0159: the operation is not supported at the current Farm Behavior Level ‘1’. Raise the farm to at least version ‘2’ before retrying.

The cause

At first I was a bit surprised. This is by design and it is mentioned in Managing SSL Certificates in AD FS and WAP in Windows Server 2016. This is typically one of those statements you don’t pay too much attention to until you have the issue.

It only occurs with upgraded ADFS farms (Windows Server 2012 R2 to Windows Server 2016) that have not been raised to Farm Behavior Level 3. This was the case here, as the domain was still running Windows Server 2012 R2 DCs and the forest and domain schema updates had not been run yet at the time the ADFS farm was upgraded from Windows Server 2012 R2 to Windows Server 2016. See Migrate a Windows Server 2012 R2 AD FS farm to a Windows Server 2016 AD FS farm. Hence the level was not raised: without the schema updates you can’t do this, and the new functionality it exposes was not available yet anyway. This didn’t cause any issue, as the certificate was valid and all operations worked.

Now, when you install an ADFS farm from scratch on Windows Server 2016, the Farm Behavior Level will read as “3” even if the domain does not have the forest and domain schema updates yet. Basically, it sort of lies. But in such an event you won’t have the issue renewing the service communication certificate.

The fix

By the time the service communication certificate came up for renewal, the Windows Server 2016 Active Directory schema updates had been run and 80% of all domain controllers were already running Windows Server 2016. To be able to replace the certificate we need to do as the error message says: raise the Farm Behavior Level, which is now possible as the schema updates are in place.

We check that it is indeed still at “1”.

We raise the level by running Invoke-AdfsFarmBehaviorLevelRaise.


As you can see, it ran successfully. We can check the Farm Behavior Level again.


Running Set-AdfsSslCertificate now does work and all is well again.
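
The steps above can be sketched as a small pre-flight check around the certificate replacement. Treat this as an illustration only: the function names are hypothetical, the thumbprint is a placeholder, and the minimum level of 2 is taken straight from the PS0159 error text. It assumes Get-AdfsFarmInformation returns an object with a CurrentFarmBehavior property.

```powershell
# Hypothetical helper: $true when the farm level is high enough for
# Set-AdfsSslCertificate (minimum of 2, per the PS0159 error text).
function Test-FarmLevelSufficient {
    param(
        [Parameter(Mandatory)][int]$CurrentFarmBehavior,
        [int]$MinimumLevel = 2
    )
    $CurrentFarmBehavior -ge $MinimumLevel
}

# Hypothetical wrapper: refuses to touch the certificate while the farm is
# still at a too low Farm Behavior Level, instead of failing with PS0159.
function Update-AdfsSslCertificateSafely {
    param([Parameter(Mandatory)][string]$Thumbprint)

    $level = (Get-AdfsFarmInformation).CurrentFarmBehavior
    if (-not (Test-FarmLevelSufficient -CurrentFarmBehavior $level)) {
        throw "Farm Behavior Level is $level. Run Test-AdfsFarmBehaviorLevelRaise and then Invoke-AdfsFarmBehaviorLevelRaise first."
    }
    Set-AdfsSslCertificate -Thumbprint $Thumbprint
}

# Usage on the primary ADFS server (not run automatically):
# Test-AdfsFarmBehaviorLevelRaise      # pre-flight check before raising
# Invoke-AdfsFarmBehaviorLevelRaise    # raise the level
# Update-AdfsSslCertificateSafely -Thumbprint '<thumbprint of the new certificate>'
```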


There you go, no more errors.