Windows Server 2016 RDMA and the Hyper-V vSwitch – Part II

Introduction

In part I of this article series I demonstrated that some of the rules regarding SMB Direct and the Hyper-V vSwitch as we knew them in Windows Server 2012 R2 have changed with Windows Server 2016. We focused on the fact that you can now expose RDMA on a vNIC exposed to the management OS and created on a vSwitch. While in Windows Server 2012 R2 you could not expose RDMA capabilities via a vSwitch, even when using a non-teamed RDMA capable NIC, this is no longer the case with Windows Server 2016.

While a demo with a vSwitch on a single NIC as we did in part I is nice, it’s unlikely you’ll use this often, if at all, in the real world. There we require redundancy, and that means NIC teaming. To achieve that we normally create the vSwitch on a native Windows NIC team. But a native NIC team does not expose RDMA capabilities, and as such a vSwitch created on a Windows native NIC team cannot leverage RDMA either. This was one of the reasons why a fully converged scenario in Windows Server 2012 R2 was too limited for many deployments. Loss of RSS on the vNICs exposed to the management OS was another. The solution to this in Windows Server 2016 Hyper-V comes with Switch Embedded Teaming (SET). Now, using SET in each and every situation might not be a good idea. It depends. But we do need to know how to configure it. So let’s dive in.

Switch Embedded Teaming (SET) exposes RDMA to the vSwitch

Switch Embedded Teaming (SET) in Windows Server 2016 allows multiple identical NICs (same make, model, firmware and drivers to be supported) to be used or “teamed” within the vSwitch itself. The important thing to note here is that this does not use Windows NIC teaming or LBFO (Load Balancing and Fail Over).

SET is the future and is needed for use with the Network Controller and Software Defined Networking in Windows. SET can also be used without these technologies. While today it supports a good deal of the capabilities of native Windows NIC teaming, it also lacks some of them. In general SET is meant for full or partial converged scenarios with 10Gbps or better NICs, not for 1Gbps networking in a (hyper)converged Hyper-V scenario.

Please see New Windows Server 2016 NIC and Switch Embedded Teaming User Guide for Download for more information, as there is just too much to tell about it.

Setting it up

We start out with a 2-node cluster where each node has 2 RDMA NICs (Mellanox ConnectX-3) with RDMA enabled and DCB configured. Live migration of VMs between those nodes works over SMB Direct. All NICs are on the same subnet, 172.16.0.0/16 (thanks to Windows Server 2016 same-subnet multichannel), and are on VLAN 110. In Failover Cluster Manager (FCM) that looks like below.

clip_image002
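
For reference, the DCB configuration on those hosts was along these lines. This is a minimal sketch: the policy name, the priority (4, the priority you’ll see tagged in the traffic later in this article) and the adapter names are lab assumptions.

```powershell
# Minimal DCB/PFC sketch for SMB Direct (policy name, priority 4 and
# adapter names are lab assumptions)
Install-WindowsFeature -Name Data-Center-Bridging

# Classify SMB Direct (port 445) traffic with 802.1p priority 4
New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 4

# Make priority 4 lossless, leave all other priorities lossy
Enable-NetQosFlowControl -Priority 4
Disable-NetQosFlowControl -Priority 0,1,2,3,5,6,7

# Don't accept DCB configuration from the switch
Set-NetQosDcbxSetting -Willing $false

# Apply DCB on the physical rNICs
Enable-NetAdapterQos -Name "NODE-A-rNIC-1","NODE-A-rNIC-2"
```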

We’ll now use the rNICs to create a Switch Embedded Team.
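
A sketch of doing that with PowerShell; the switch and adapter names are placeholders for whatever your environment uses.

```powershell
# Create a vSwitch with Switch Embedded Teaming (SET) over both rNICs
New-VMSwitch -Name "RDMA-SET-vSwitch" `
    -NetAdapterName "NODE-A-rNIC-1","NODE-A-rNIC-2" `
    -EnableEmbeddedTeaming $true -AllowManagementOS $true

# Inspect the embedded team
Get-VMSwitchTeam -Name "RDMA-SET-vSwitch" | Format-List
```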

clip_image004

Note that the teaming mode is switch independent, the only option supported with SET in Windows Server 2016.

clip_image006

This also gives us a vNIC exposed to the management OS (the default).

clip_image008

This is also visible as a vNIC in the management OS called “vEthernet (RDMA-SET-vSwitch)”.

clip_image010

This will be used to manage the host and to make its purpose clear we’ll rename it.

We’ll create 2 separate management OS vNICs for the RDMA traffic later. For now, we want the HOST-MGNT vNIC to have connectivity to the LAN and for that we need to tag it with VLAN 10.
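
A sketch of both steps; it assumes the default behavior where the management OS vNIC is named after the vSwitch.

```powershell
# Rename the default management OS vNIC to reflect its purpose
Rename-VMNetworkAdapter -ManagementOS -Name "RDMA-SET-vSwitch" -NewName "HOST-MGNT"

# Tag the management vNIC with the LAN VLAN (VLAN 10 in this lab)
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "HOST-MGNT" -Access -VlanId 10
```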

image

The vNIC actually “inherited” the IP configuration of one of our physical NICs and we need to change that to either DHCP or a correct LAN IP address and settings.

clip_image014

You can use the code below to set the HOST-MGNT vNIC to DHCP:
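
A minimal sketch, assuming the default “vEthernet (HOST-MGNT)” interface alias:

```powershell
# Switch the HOST-MGNT vNIC to DHCP
$if = Get-NetAdapter -Name "vEthernet (HOST-MGNT)"
Set-NetIPInterface -InterfaceIndex $if.ifIndex -Dhcp Enabled

# Clear any statically configured DNS servers as well
Set-DnsClientServerAddress -InterfaceIndex $if.ifIndex -ResetServerAddresses
```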

To finalize the HOST-MGNT vNIC configuration we enable priority tagging on it. If we don’t, no traffic other than SMB Direct will be tagged at all!
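
That’s a single cmdlet call; a sketch for the HOST-MGNT vNIC:

```powershell
# Enable IEEE 802.1p priority tagging on the management vNIC
Set-VMNetworkAdapter -ManagementOS -Name "HOST-MGNT" -IeeePriorityTag On
```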

clip_image016

Before we go any further we’ll remove the VLAN tag from the rNICs, as we don’t want it interfering with egress traffic being tagged by them, or ingress traffic being filtered because it doesn’t match the VLAN ID on the rNICs.
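
A sketch of clearing the VLAN ID on the rNICs; the adapter names are placeholders.

```powershell
# Remove the VLAN tag from the physical rNICs
Set-NetAdapter -Name "NODE-A-rNIC-1" -VlanID 0
Set-NetAdapter -Name "NODE-A-rNIC-2" -VlanID 0
```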

From here on we’ll focus on the RDMA capable vNICs we’ll create and use for SMB traffic.

We create 2 vNICs on the management OS for SMB Direct traffic.
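
Sketched below; the vNIC names SMB-1 and SMB-2 are assumptions used throughout the rest of this lab.

```powershell
# Add two management OS vNICs dedicated to SMB Direct traffic
Add-VMNetworkAdapter -ManagementOS -SwitchName "RDMA-SET-vSwitch" -Name "SMB-1"
Add-VMNetworkAdapter -ManagementOS -SwitchName "RDMA-SET-vSwitch" -Name "SMB-2"
```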

clip_image018

Now these vNICs need an IP address; this can be in the same subnet because we have Windows Server 2016 same-subnet SMB multichannel.
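
For example (the addresses are placeholders in the lab’s 172.16.0.0/16 subnet):

```powershell
# Assign same-subnet IP addresses to both SMB vNICs
New-NetIPAddress -InterfaceAlias "vEthernet (SMB-1)" -IPAddress 172.16.0.11 -PrefixLength 16
New-NetIPAddress -InterfaceAlias "vEthernet (SMB-2)" -IPAddress 172.16.0.12 -PrefixLength 16
```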

We then also need to put the vNICs in the correct VLAN. Remember that DCB / PFC priority tagging needs a tagged VLAN to carry that priority. Right now, we can see that these are untagged.

clip_image020

So we tag them with VLAN ID 110.
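
Sketched for both vNICs:

```powershell
# Put both SMB vNICs in VLAN 110 so the PFC priority can be carried
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "SMB-1" -Access -VlanId 110
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "SMB-2" -Access -VlanId 110
```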

clip_image022

We enable jumbo frames on the vNICs. Remember that the physical NICs in the SET have jumbo frames enabled as well.
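
A sketch using the standardized *JumboPacket keyword; the value your driver accepts (9014 here) may differ.

```powershell
# Enable jumbo frames on the SMB vNICs
Set-NetAdapterAdvancedProperty -Name "vEthernet (SMB-1)" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
Set-NetAdapterAdvancedProperty -Name "vEthernet (SMB-2)" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
```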

clip_image024

Normally, all traffic originating from vNICs gets its QoS values reset to zero. There is one exception to this and that’s SMB Direct traffic. SMB Direct traffic gets tagged with its QoS priority and that is not reset to 0, as it bypasses the vSwitch completely. But if we set priorities on other types of traffic for DCB PFC and/or ETS that passes over these vNICs, we must enable priority tagging on these vNICs as well or the tags will be stripped away.
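
Same cmdlet as before, now sketched for the SMB vNICs:

```powershell
# Enable IEEE 802.1p priority tagging on the SMB vNICs as well
Set-VMNetworkAdapter -ManagementOS -Name "SMB-1" -IeeePriorityTag On
Set-VMNetworkAdapter -ManagementOS -Name "SMB-2" -IeeePriorityTag On
```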

clip_image026

The association of the vNICs to pNICs is random. It can also change during creation and destruction (disabling NICs, restarting the OS). We can map a vNIC to a particular pNIC. This prevents suboptimal use of the available pNICs and provides for a well-known, predictable path for the traffic. We do this with the below PowerShell commands.
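
A sketch of that mapping; the pNIC names are placeholders.

```powershell
# Pin each SMB vNIC to one pNIC for a predictable traffic path
Set-VMNetworkAdapterTeamMapping -ManagementOS -VMNetworkAdapterName "SMB-1" -PhysicalNetAdapterName "NODE-A-rNIC-1"
Set-VMNetworkAdapterTeamMapping -ManagementOS -VMNetworkAdapterName "SMB-2" -PhysicalNetAdapterName "NODE-A-rNIC-2"
```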

Finally, we must enable RDMA on our two vNICs or SMB Direct will not kick in at all.
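
For example:

```powershell
# Enable RDMA on the SMB vNICs; without this SMB Direct won't be used
Enable-NetAdapterRdma -Name "vEthernet (SMB-1)","vEthernet (SMB-2)"

# Verify that the vNICs now report RDMA as enabled
Get-NetAdapterRdma | Format-Table Name, Enabled
```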

Right now, we have it all configured correctly on one node of our 2-node cluster. The SMB networks look like this now:

clip_image028

The cluster now looks like below.

clip_image030

We can live migrate VMs over SMB Direct in this mixed scenario, where one node has physical RDMA NICs and the other node has SET with vNICs for RDMA.

clip_image032

When looking at this in report mode we clearly see Node-A sending SMB Direct traffic (tagged with priority 4, green) over its RDMA enabled SET vNICs to Node-B, which still has a completely physical rNIC setup (blue).

clip_image034

As you can see in the screenshots above, we now have RDMA / SMB Direct working with SET / RDMA vNICs on one node (Node-A) and pure physical RDMA NICs on the other (Node-B). This gives us bandwidth aggregation and redundancy. To complete the exercise, we configure SET on the other node as well. But it’s clear SET and RDMA will also work in a mixed environment.

We’ll discuss some details about certain aspects of the vNIC configuration in future articles. Things like the why and how of Set-VMNetworkAdapterTeamMapping and the use of -IeeePriorityTag. But for now, this is it. Go try it out! It’s the basis for anything you’ll do with SDNv2 in W2K16 and beyond.

Windows Server 2016 RDMA and the Hyper-V vSwitch – Part I

Introduction

With Windows Server 2012 R2, using both RDMA and the Hyper-V vSwitch on the same host required separate physical network adapters (pNICs). There are 2 reasons for this.

  • First, a vSwitch is generally created on a native Windows NIC team. Such a NIC team does not expose RDMA capabilities.
  • Second, in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed RDMA capable NIC.

As a result, the need for RDMA required more NICs on the Hyper-V hosts, and/or a fully converged design had some serious drawbacks. As servers have been quite capable and our VMs serve ever more intensive workloads, this was not dramatic. Leveraging 2*10Gbps for a vSwitch and 2*10Gbps for redundant RDMA / SMB Direct traffic has long been one of my favorite designs. It leaves room for other traffic, such as backups, and it allows for high VM density. But with 40Gbps NICs that is overkill and a tad expensive in many scenarios, even when connecting to a SOFS share for Hyper-V storage, so 4*40Gbps on a Hyper-V host is not something I ever saw in real life.

Windows Server 2016 can expose RDMA capabilities via a vSwitch even without SET

What many people seem to have missed is that reason 2 is gone in Windows Server 2016 Hyper-V. Reason 1 still holds true, but that has been solved by Switch Embedded Teaming (SET). This means that you actually do not need SET to leverage RDMA with a vSwitch in Windows Server 2016 Hyper-V. You can do this as follows:
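
A minimal sketch, assuming a single RDMA capable pNIC (the switch and adapter names are placeholders):

```powershell
# Create a vSwitch on a single, non-teamed RDMA capable pNIC
New-VMSwitch -Name "RDMA-vSwitch" -NetAdapterName "NODE-rNIC-1" -AllowManagementOS $true

# Make sure RDMA is enabled on the pNIC itself as well
Enable-NetAdapterRdma -Name "NODE-rNIC-1"

# Enable RDMA on the vNIC exposed to the management OS
Enable-NetAdapterRdma -Name "vEthernet (RDMA-vSwitch)"

# Check that the vNIC reports RDMA capabilities
Get-NetAdapterRdma -Name "vEthernet (RDMA-vSwitch)"
```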


Below is what this looks like. We have one vNIC on the management OS leveraging RDMA/SMB Direct, consuming all 10Gbps of the NIC we connected to the vSwitch. This is a nice lab demo, but you can see this isn’t perhaps the best idea in real life.

clip_image002

Other things to note

Do realize this still requires the pNIC to be RDMA capable. This is not some sort of soft RoCE or other software RDMA magic as of today. The pNIC also has to have RDMA enabled or the virtual NIC won’t be able to leverage RDMA and will fall back to SMB (Multichannel only) instead of SMB Direct. Likewise, RDMA has to be enabled on the vNIC as well. So don’t forget: RDMA must be enabled on both the pNIC and the vNIC for this to work.

clip_image004

DCB’s PFC/ETS requires a tagged VLAN to carry the priority, so don’t forget to tag the vNIC. There is actually no need to tag the pNIC as long as the switch port has the tagged VLAN set, most likely as a trunk or in general mode. If you don’t tag consistently across the entire network stack you’ll have network issues anyway, and RDMA performance will be bad if it works at all.

Finally, don’t forget that this example is not using VMM / Network Controller and as such is using Set-VMNetworkAdapterVLAN and not Set-VMNetworkAdapterIsolation.

In real life, we need better and more than a single NIC vSwitch

The caveat here is that, while you have a converged setup, you have no redundancy for the vSwitch (there is no team). This also means that you’re limited to a single NIC in regards to throughput for that vSwitch. Depending on the needs of the solution that might be perfectly fine. If it’s not, and in most real-world scenarios you’ll need redundancy, you have to use SET in a converged scenario. That’s what we’ll take a look at in part II. Then there is the question of QoS, as you don’t want SMB Direct traffic to consume too much bandwidth at will. That’s still another issue to discuss and address.

Simplified SMB Multichannel and Multi-NIC Cluster Networks

One of the seemingly small feature enhancements in Windows Server 2016 failover clustering is simplified SMB multichannel and multi-NIC cluster networks. In Windows Server 2016, failover clustering now recognizes and uses multiple NICs on the same subnet for cluster networking (cluster & client access).

image
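
A quick way to see this for yourself, sketched with generic cmdlets:

```powershell
# Both NICs on the same subnet now show up as usable cluster networks
Get-ClusterNetwork | Format-Table Name, Role, Address

# SMB Multichannel spreads connections across the NICs in that subnet
Get-SmbMultichannelConnection | Format-Table ServerName, ClientIpAddress, ServerIpAddress
```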

Why was this introduced?

The growth in the capabilities of the hardware (compute, memory, storage & networking) meant that failover clustering had to leverage these capabilities more easily and for more use cases than before. Talking about SMB: it is now used not “only” for CSV and live migration but also for Storage Spaces Direct and Storage Replica.

  • It gives us better utilization of the network capabilities and throughput with Storage Spaces Direct, CSV, SQL, Storage Replica etc.
  • Failover clustering now works with multichannel as any other workload, without the extra requirement of needing multiple subnets. This is more important than it might seem at first. In many environments getting another VLAN and/or an extra subnet is a hurdle. Well, that hurdle is gone.
  • For IPv6 link-local subnets it just works; these are auto-configured as cluster-only networks.
  • The cluster validation wizard won’t nag about it anymore and knows it’s a valid failover cluster configuration.

See it in action!

You can find a quick demo of simplified SMB multichannel and multi-NIC cluster networks on my Vimeo channel here.

image

In this video I demo 2 features. One is new, and that is virtual machine compute resiliency. The other is an improved feature, simplified SMB multichannel and multi-NIC cluster networks. The multichannel demo is the first part of the video. Yes, it’s with RDMA RoCEv2, you know I just have to do SMB Direct when I can!

You can read more about simplified SMB multichannel and multi-NIC cluster networks on TechNet here. Happy reading!

RDMA/RoCE & Windows Server 2016 TPv4 Testing

Introduction

My good buddy and fellow MVP Aidan Finn has long promoted disabling advanced features by default in order to avoid downtime, and with good reason. I agree with the fact that too many implementations of features such as UNMAP, ODX and VMQ are causing us issues. This has to improve, and meanwhile something has to be done to avoid the blast & fallout of such issues. In this trend, Windows Server 2016 is taking steps towards disabling capabilities by default.

Windows 2016 & RDMA/RoCE

In Windows 2012 (R2), RDMA was enabled by default on ConnectX-3 adapters. This is great when you’ve provisioned a lossless fabric for them to use and configured the hosts correctly. As you know by now, RoCE requires DCB Priority Flow Control (PFC) and optionally ETS to deliver stellar performance.

If SMB 3 detects that RDMA cannot work properly it will fall back to TCP (that’s what that little TCP “standby” connection is for in those SMB Multichannel/Direct drawings). Not working properly can mean that a RoCE/RDMA connection cannot be established or that it fails under load.

To make sure that people get the behavior they desire and don’t run into issues, the idea is to have RDMA disabled by default in Windows Server 2016 when DCB/PFC is disabled, at least for the inbox drivers. This is mentioned in RDMA/RoCE Considerations for Windows 2016 on ConnectX-3 Adapters. When you want it, you’ll have to enable it on purpose, meaning that they can assume you’ve also set up a lossless Ethernet fabric and configured your hosts.

If you don’t want this, there is a way to return to the old behavior, and that’s a registry key called “NDKWithGlobalPause”. When this key is set to “1” you are basically forcing the NIC to work with Global Pause. Nice for a lab, but not for real-world production with RDMA. We want it to be 0, which is the default.
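
As a sketch, assuming the key is exposed as a per-adapter registry keyword by the driver (check the Mellanox documentation for the exact location in your driver version):

```powershell
# Force RDMA to rely on Global Pause instead of PFC - lab use only!
# Assumption: NDKWithGlobalPause is a per-adapter registry keyword.
Set-NetAdapterAdvancedProperty -Name "NODE-A-rNIC-1" -AllProperties `
    -RegistryKeyword "NDKWithGlobalPause" -RegistryValue 1

# Revert to the default (0) for production on a lossless fabric
Set-NetAdapterAdvancedProperty -Name "NODE-A-rNIC-1" -AllProperties `
    -RegistryKeyword "NDKWithGlobalPause" -RegistryValue 0
```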

image

My Lab Experience

Now, setting this parameter to 1 for testing with the Mellanox drivers (NOT inbox) on a running lab server caused a very nice blue screen. Granted, I’m playing with the 5.10 drivers, which are normally not meant for Windows Server 2016 TPv4. I’m still trying to understand the use & effects of this setting, but for now I’m not getting far.

image

This does not happen on Windows 2012 R2, by the way. I should be using the beta driver from http://www.mellanox.com/page/products_dyn?product_family=129 for Windows Server 2016 TPv4 testing, even if that one still says TPv3. This beta driver enables:

  • NDKPI 2.0
  • RoCE over SR-IOV
  • Virtualization and RoCE/SMB-Direct on the same port
  • VXLAN Hardware offload
  • PacketDirect

Conclusion

In an ideal world all these advanced features would be enabled out of the box, to be used when available and beneficial. Unfortunately, this idea has not worked out well in the real world. Bugs in operating systems, features, NIC firmware and drivers, in storage array firmware and software, as well as in switch firmware, have made for too many issues for too many people.

There is a push to disable them all by default. The reason for this is clear: avoid downtime, data corruption, etc. While I can understand this and agree with the practice to avoid issues and downtime, it’s also a bit sad. I do hope that work is being done to make sure that these performance and scalability features become truly reliable, and that we don’t end up disabling them all, never to turn them on again. That would mean we’re back to banking on pure raw power or growth for performance & efficiency gains.

What does that mean for the big convergence push? If we can’t get these capabilities to be reliable enough for use “as is” now, how much riskier will it become when they are all stacked on top of each other in a converged setup? I’m not too stoked about having ODX, UNMAP, VMQ, RDMA etc. turned off as a “solution” however. I want them to work well and not lose them as a “fix”; that’s unacceptable. When I do turn them on and configure the environment correctly, I want them to work well. This industry has some serious work to do in getting there.

All this talk of software defined anything will not go far outside of cloud providers if we remain at the mercy of firmware and drivers. In that respect I have seen many software defined solutions get a reality check, as early implementations are often a step back in functionality, reliability and capability. It’s very much still a journey. The vision is great, the promise is tempting, but in a production environment I tend to be conservative until I have proven to myself that it works for us.