Latency kills

Introduction

I was investigating a very problematic Windows Server 2016 Hyper-V cluster. That cluster was performing horribly. “Everything” was hanging, stalling or crashing; RHS.exe errors were flying around while WER dumps got created by the dozen. Things were extremely slow, to the point that functionality was simply failing. The “fun” thing was that the cluster validation wizard, while slow, gave that cluster a big thumbs up and a supported status, as if all was well.

Prying around

Time to pry around a bit and see if we could find something wrong. We saw live migrations stall, fail, last forever in pending or get stuck at a certain percentage, sometimes finally succeeding with ridiculous blackout times. We could not open virtual machine properties, or only very slowly. The FCI GUI was highly unresponsive, but so were the Hyper-V Manager GUI and even PowerShell. Those hung even at loading the virtual machines or enumerating them with Get-VM. Everything was slow to the point it timed out or crashed. Restarting the services (Cluster, Hyper-V) didn’t do anything and restarting VMMS was super slow or just got stuck. It was a depressing sight, for which people tended to blame Hyper-V / Microsoft.

As the title gives away, it was latency. Not just ordinary high latency. Really bad latency. That kind of latency kills. Extreme latency produces symptoms that are similar to bugs or corrupt components of roles and features. We have a tendency to look at those first in the event logs, and then we look at the network and its usual suspects (VMQ, SET, DCB). But nothing pointed to an issue that I could find.
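A quick first pass over those usual suspects can even be scripted (a sketch; this only covers the host side, not the physical switches):

#Check VMQ state on the physical NICs
Get-NetAdapterVmq

#Check the switch embedded teaming (SET) configuration, if any
Get-VMSwitchTeam

#Check DCB priority flow control settings
Get-NetQosFlowControl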

So, storage maybe? Well, we did find one Hyper-V host in the cluster with one HBA port producing too many errors, so we disabled that FC port for testing. No joy; after a clean reboot of all nodes the Hyper-V cluster remained problematic. So, on to the storage array itself.

Well, holy smoke! On the two volumes for CSV in that cluster we saw latencies that were so bad I could not even believe a single VM would boot. It actually made my appreciation for Hyper-V and clustering grow, as it managed to do at least a couple of things. With such latencies I would expect the services to just crash & call it a day.
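You can quantify this from the host side by sampling the disk latency performance counters (a quick sketch; on a CSV node the Cluster CSVFS counter set gives a similar per-volume view):

#Sample average disk latency (in seconds) for all physical disks over 10 seconds
Get-Counter -Counter "\PhysicalDisk(*)\Avg. Disk sec/Read", "\PhysicalDisk(*)\Avg. Disk sec/Write" -SampleInterval 2 -MaxSamples 5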

[Figure: the horrific latency on one of the CSV LUNs]

Looking at the logs we saw that the latencies occurred on the FC HBAs of the controllers. Each one was above 50ms, peaking at 150-250ms, with one huge peak of almost 500ms. We saw this on all four HBAs.

[Figure: the latency on one of the 4 FC HBAs on one of the controllers. Not a good day. All HBAs had high latencies like this.]

The issues were not at the host level (host HBAs), nor even at the IOPS/bandwidth level of the storage itself. The latency, for some reason, was spiking. Further investigation led to the conclusion that the issue was related to synchronous replication going totally wrong. Moving the replication mode to asynchronous fixed that. We’re now investigating why this happened and how to prevent it from happening again. But that’s another story.

[Figure: latency on one of the 4 FC HBAs on one of the controllers after we fixed the issue]

Do not assume anything

So, there you go. Everything depends on everything in some direct or indirect way. It’s all connected and that, my friends, is why I’m a proponent of “service resilience engineering”, where the responsible team owns the entire stack. That is how you can act fast.

Windows Server 2016 RDMA and the Hyper-V vSwitch – Part I

Introduction

With Windows Server 2012 R2, using both RDMA and the Hyper-V vSwitch on the same host required separate physical network adapters (pNICs). There are 2 reasons for this.

  • First, a vSwitch is generally created on top of a native Windows NIC team, and such a NIC team does not expose RDMA capabilities.
  • Second, in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed, RDMA-capable NIC.

As a result, the need for RDMA required more NICs on the Hyper-V hosts, and/or a fully converged design had some serious drawbacks. As servers have been quite capable and our VMs serve ever more intensive workloads, this was not dramatic. Leveraging 2*10Gbps for a vSwitch and 2*10Gbps for redundant RDMA / SMB Direct traffic has long been one of my favorite designs. It leaves room for other traffic, such as backups, and it allows for high VM density. But with 40Gbps NICs that is overkill and a tad expensive in many scenarios, even when connecting to a SOFS share for Hyper-V storage, so 4*40Gbps on a Hyper-V host is not something I ever saw in real life.

Windows Server 2016 can expose RDMA capabilities via a vSwitch even without SET

What many people seem to have missed is that reason 2 is gone in Windows Server 2016 Hyper-V. Reason 1 still holds true, but that has been solved by Switch Embedded Teaming (SET). This means that you do not actually need SET to leverage RDMA with a vSwitch in Windows Server 2016 Hyper-V. You can do this as follows:

#Create a vSwitch
New-VMSwitch -Name RDMACapable-vSwitch -NetAdapterName "NODE-A-S4P1-SW12P05-SMB1"

#Now add a host vNIC for the SMB Direct Traffic
Add-VMNetworkAdapter -SwitchName RDMACapable-vSwitch -Name SMB1 -ManagementOS

#Enable RDMA on it
Enable-NetAdapterRDMA "vEthernet (SMB1)"

#Grab that vNIC on the management OS and set the VLAN - PFC requires tagged VLANs
$NicSMB1 = Get-VMNetworkAdapter -Name SMB1 -ManagementOS
Set-VMNetworkAdapterVLAN -VMNetworkAdapter $NicSMB1 -Access -VlanID 110


Below is what this looks like. We have one vNIC on the management OS leveraging RDMA/SMB Direct, consuming all 10Gbps of the NIC we connected to the vSwitch. This is a nice lab demo, but you can see this perhaps isn’t the best idea in real life.

[Screenshot: the vNIC on the management OS consuming the full 10Gbps via RDMA/SMB Direct]

Other things to note

Do realize this still requires the pNIC to be RDMA capable. This is not some sort of soft RoCE or other software RDMA magic, as of today. The pNIC also has to have RDMA enabled, or the virtual NIC won’t be able to leverage RDMA and will fall back to SMB Multichannel only instead of SMB Direct. Likewise, RDMA has to be enabled on the vNIC as well. So don’t forget: RDMA must be enabled on both the pNIC and the vNIC for this to work.

[Screenshot: RDMA enabled on both the pNIC and the vNIC]
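A quick way to verify this end to end looks like this (a minimal sketch; the adapter names are taken from the example above):

#Enable RDMA on the pNIC as well as on the vNIC
Enable-NetAdapterRdma -Name "NODE-A-S4P1-SW12P05-SMB1"
Enable-NetAdapterRdma -Name "vEthernet (SMB1)"

#Verify both report RDMA as enabled
Get-NetAdapterRdma | Format-Table Name, Enabled

#Verify SMB sees an RDMA capable interface
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable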

DCB’s PFC/ETS requires a tagged VLAN to carry the priority, so don’t forget to tag the vNIC. There is actually no need to tag the pNIC as long as the switch port has the tagged VLAN set – most likely as a trunk or in general mode. If you don’t tag consistently across the entire network stack you’ll have network issues anyway, and RDMA performance will be bad, if it works at all.
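For completeness, the host-side DCB configuration typically looks something like this (a sketch only, assuming priority 3 for SMB Direct and a 50% ETS reservation; the priority and bandwidth values must match your switch configuration, and the pNIC name is taken from the example above):

#Don't accept DCB configuration pushed from the switch
Set-NetQosDcbxSetting -Willing $false

#Tag SMB Direct traffic with priority 3
New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

#Enable PFC for priority 3 only
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

#Reserve bandwidth for SMB Direct with ETS and enable QoS on the pNIC
New-NetQosTrafficClass -Name "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
Enable-NetAdapterQos -Name "NODE-A-S4P1-SW12P05-SMB1"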

Finally, don’t forget that this example is not using VMM / Network Controller and as such is using Set-VMNetworkAdapterVlan and not Set-VMNetworkAdapterIsolation.

In real life, we need better and more than a single NIC vSwitch

The caveat here is that, while you have a converged setup, you have no redundancy for the vSwitch (there is no team). This also means that you’re limited to a single NIC in regards to throughput for that vSwitch. Depending on the needs of the solution that might be perfectly fine. If it’s not – and in most real-world scenarios you’ll need redundancy – you have to use SET in a converged scenario. That’s what we’ll take a look at in part 2. Then there is the question of QoS, as you don’t want SMB Direct traffic to consume too much bandwidth at will. That’s yet another issue to discuss and address.

Unable to correctly configure Time Service on non PDC Domain Controller

Introduction

Around New Year, between the 31st of December 2016 and the 1st of January 2017, some ISPs had issues with their time service. It jumped 24 hours ahead. This caused all kinds of online service issues, ranging from non-working digital TV to problems with the time service within companies. That caused some intervention time and temporarily switching the external reliable NTP time server sources to another provider that didn’t show this behavior. Some services required a server reboot to sort things out, but things were operational again. It became clear we still had a lingering issue afterwards, though, as we were unable to correctly configure the Time Service on a non-PDC domain controller.

Unable to correctly configure Time Service on non PDC Domain Controller

A few days later we still had one domain, which happened to be 100% virtualized, with issues. As it turned out, the second domain controller, which did not hold the PDC role, wasn’t syncing with the PDC, no matter what we tried to make it do so. If you want to find out how to do this properly for a virtualized environment, I refer you to a blog post by Ben Armstrong, Time Synchronization in Hyper-V, and one by fellow MVP Kevin Green, Hyper V Time Synchronization on a Windows Based Network.

But no matter what I did, the DC kept getting the wrong date. I could configure it to refer to the PDC as much as I wanted, nothing helped. It also kept saying the source for the time was the local CMOS clock (w32tm /query /source). I kept getting an error that we’re normally able to fix by configuring the time service correctly.

[Screenshot: the time service error]
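For reference, this is the configuration that normally does the trick on a non-PDC domain controller (a minimal sketch, run from an elevated prompt):

#Point the DC back at the domain hierarchy for time synchronization
w32tm /config /syncfromflags:domhier /update

#Restart the time service and force it to rediscover its time source
Restart-Service w32time
w32tm /resync /rediscover

#Check the time source again; it should no longer be the local CMOS clock
w32tm /query /source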

Another troubleshooting path

The IT universe was not aligned to let me succeed. So that’s when you quit … for a coffee break. You relax a bit, look out of the window whilst sipping your coffee. After that you dive back in.

I dove into the registry settings for the Windows Time service under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time on a functional DC in my lab and on the problematic DC in the production domain. I started comparing the settings and it all seemed to be in order, but for one serious issue with the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Security key on the problematic DC.

Trying to open that key greeted me with the following error:

Error Opening key

Security cannot be opened. An error is preventing this key from being opened. Details: The system cannot find the file specified.


That key was empty. Not good!


I exported the entire W32Time registry key and the Security key as a backup for good measure on the problematic DC. I then grabbed an export of the Security key from the working DC (any functional domain-joined server will do) and imported it into the problematic DC. The next step was to restart the time service, but that wasn’t enough, or I was too impatient. So finally I restarted the DC and after 10 minutes I got the result I needed …
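The export and import can be done with reg.exe (a sketch; the file paths are just examples):

#On the problematic DC: back up the entire W32Time key first
reg export HKLM\SYSTEM\CurrentControlSet\Services\W32Time C:\Temp\W32Time-Backup.reg

#On a healthy domain-joined machine: export just the Security subkey
reg export HKLM\SYSTEM\CurrentControlSet\Services\W32Time\Security C:\Temp\W32Time-Security.reg

#Back on the problematic DC: import the healthy Security key and restart the service
reg import C:\Temp\W32Time-Security.reg
Restart-Service w32time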


Problem solved!

The Hyper-V Processor Relative Weight

Introduction

Hyper-V offers 3 ways of managing or tweaking the CPU scheduler to provide the best possible configuration for certain scenarios and use cases. The defaults normally work fine, but under certain conditions you might want to tweak them for the best possible outcome. The CPU resource controls at your disposal are:

  • Virtual machine reserve – Think of this as the minimum CPU “QoS”.
  • Virtual machine limit – Think of this as the maximum CPU “QoS”.
  • Relative weight – Think of this as the scale defining which VM is more important.

Note that you should understand what these settings are and what they can do. Treat them like spices: select the ones you need and don’t overdo it. They’re there to help you, and if needed you can leverage all three, but it’s highly unlikely you’ll need to do so. Using one or two will serve you best if and when you need them.

In this blog post we’ll look at the relative weight.

Relative weight

Relative weight is a relative number between 1 and 10000 that you can assign to a virtual machine. This determines the relative importance of a virtual machine’s CPU resources in regards to other virtual machines. So it’s not a percentage or a number of cycles, it’s just an arbitrary weight. By default this is set to 100.

[Screenshot: the relative weight setting in the virtual machine’s processor settings]

You need to come up with a scale and stick to it. 100, 200 and 300 for low, medium and high importance virtual machines is a good example. You could also create 10 “classes”: 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500. This leaves room to create even more (lower, in between and higher).
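Assigning the weight is a one-liner per virtual machine with PowerShell (a quick sketch using the 100/200/300 scale above; the VM names are made up):

#Assign weights using the 100/200/300 scale
Set-VMProcessor -VMName "DEV-VM01" -RelativeWeight 100
Set-VMProcessor -VMName "APP-VM01" -RelativeWeight 200
Set-VMProcessor -VMName "SQL-VM01" -RelativeWeight 300

#Verify the configured weights
Get-VMProcessor -VMName "DEV-VM01", "APP-VM01", "SQL-VM01" | Format-Table VMName, RelativeWeight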

Note that as long as there are sufficient CPU resources on a host, the relative weight does not come into play. It really doesn’t matter whether a virtual machine has a relative weight of 1000 versus 5000 at that time. They both get whatever they need, as there’s plenty to go around.

Relative weight kicks in when demand is higher than what the physical host has available. When you have left all the virtual machines at the default of 100, they will all get an equal share. But when you’ve set some virtual machines to a higher relative weight, those will get a bigger share of the available CPU cycles.

Use Cases

Not all virtual machines are created equal. In reality some workloads are more important than others. This might be development and test versus production, or high-priority workloads versus lower-priority workloads. The lower-priority workloads are the ones that you care about less when there is contention for CPU cycles, or workloads where fewer CPU cycles and slower response times don’t make a real difference.

Another use case might be a developer or lab host where you give one CPU-sensitive workload a much higher weight and leave the others at the default of 100.

To make sure the high-priority workloads, or those that really depend on CPU cycles being delivered fast, don’t have to play second fiddle to those that don’t have such needs, we use relative weight. It’s very flexible and only kicks in when needed, so there is no waste or inefficiency there.

Limitations

The biggest limitation is in the name: it’s all relative. Whereas reserve and limit give you a minimum and a maximum respectively, the relative weight only defines which virtual machine is more important than another in regards to CPU cycles. So some virtual machines get more than others, but that might not be enough. It’s all about balance between virtual machines, not guaranteed minima or maxima.

You need to agree on a standard within the company to define weight. If everyone starts using a different scale you’re in trouble.

Let’s take one admin who uses 100 for less important virtual machines, 200 for standard virtual machines and 300 for the most important ones. That’s all great when he’s the only one defining the settings and when he does so consistently on all nodes/clusters for all VMs. In that case all is well, even when VMs move around between hosts or between clusters. But what happens when many admins use different “scales”? Well, it’s a mess and the behavior won’t be what you want when your colleague used 1000, 2000 and 3000 respectively for the same definitions. It’s also smart not to use 100, 101 and 102; leave some margin for adding a category when needed.

Conclusion

This is one handy tool to have at your disposal and I tend to use it to proactively set a higher weight for very important VMs. Even in an environment where there are no predefined categories or known minima, this allows me to tell Hyper-V that, if there ever is contention for CPU cycles, the virtual machines with a higher weight are the ones to get a bigger share of the limited resources.