Windows Server 2016 RDMA and the Hyper-V vSwitch – Part I

Introduction

With Windows Server 2012 R2, using both RDMA and the Hyper-V vSwitch on the same host required separate physical network adapters (pNICs). There are 2 reasons for this.

  • First a vSwitch is generally created with a native Windows NIC team. Such a NIC team does not expose RDMA capabilities.
  • Second is that in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed RDMA capable NIC.

As a result, the need for RDMA required more NICs on the Hyper-V hosts and/or a fully converged had some serious drawbacks. As servers have been quite capable and our VMs serve ever more intensive workloads this was not dramatic. Leveraging 2*10Gbps for a vSwitch and 2*10Gbps for redundant RDMA / SMB Direct traffic have long been one of my favorite designs. It leaves room for other traffic, such as backups, and it allows for high VM density. But with 40Gbps NICs that is overkill and a tad expensive in many scenarios, even when connecting to a SOFS share for Hyper-V storage, so 4*40Gbps on a Hyper-V host is not something I ever saw in real life.

Windows Server 2016 can expose RDMA capabilities via a vSwitch even without SET

What many people seem to have missed is that reason 2 has gone in Windows Server 2016 Hyper-V. Reason 1 still holds true. But that has been solved by Switch Embedded Teaming (SET). This means that you actually do not need SET to leverage RDMA with an vSwitch in Windows Server 2016 Hyper-V. You can do this as follows:

#Create a vSwith
New-VMSwitch -Name RDMACapable-vSwitch -NetAdapterName "NODE-A-S4P1-SW12P05-SMB1"

#Now add a host vNIC for the SMB Direct Traffic
Add-VMNetworkAdapter -SwitchName RDMACapable-vSwitch -Name SMB1 -ManagementOS

#Enable RDMA on it
Enable-NetAdapterRDMA "vEthernet (SMB1)"

#Grab that vNIC on the management OS and set the VLAN - PFC requires tagged VLANs
$NicSMB1 = Get-VMNetworkAdapter -Name SMB1 -ManagementOS
Set-VMNetworkAdapterVLAN -VMNetworkAdapter $NicSMB1 -Access -VlanID 110


Below is what this looks like. We have one vNIC on the management OS leveraging RDMA/SMB Direct consuming all 10Gbps if the NIC we connected to the vSwith. This is a nice lab demo but you can see this isn’t perhaps the best idea in real life.

clip_image002

Other things to note

Do realize this still requires the pNIC to be RDMA capable. This is not some sort of soft RoCE or other software RDMA magic as of today. The pNIC also has to have RDMA enabled or virtual NIC won’t be able to leverage RDMA but fall back to SMB (Multichannel only) instead of SMB Direct. Likewise, RDMA has to be enabled on the vNIC as well. So don’t forget, RDMA must be enabled on both the pNIC and the vNIC for this to work.clip_image004

DCB’s PFC/ETS requires a tagged VLAN to carry the priority, do don’t forget to tag the vNIC. There is actually no need to tag the pNIC as long as the switch port has the tagged VLAN set – most likely as a trunk or in general mode. If you don’t tag consistently across the entire network stack you’ll have network issues anyway and RDMA performance will be bad if it works at all.

Finally, don’t forget this is example is not using VMM /Network Controller and as such is using Set-VMNetworkAdapterVLAN and not Set-VMNetworkAdapterIsolation.

In real life, we need better and more than a single NIC vSwitch

The caveat here is that, while you have a converged setup, you have no redundancy for the vSwitch (there is no team). This also means that you’re are limited to a single NIC in regards to throughput for that vSwith. Depending on the needs of the solutions that might be perfectly fine. It it’s not – in most real-world scenarios you’ll need redundancy – you have to use SET in a converged scenario. That’s what we’ll take a look at in part 2. Then there is the question about QoS as you don’t want SMB Direct traffic to consume to much bandwidth at will. That’s still another issue to discuss and address.

You cannot connect multiple NICs to a single Hyper-V vSwitch without teaming on the host

Can you connect multiple NICs to a single Hyper-V vSwitch without teaming on the host

Recently I got a question on whether a Hyper-V virtual switch can be connected to multiple NICs without teaming. The answer is no. You cannot connect multiple NICs to a single Hyper-V vSwitch without teaming on the host.

This question makes sense as many people are interested in the ease of use and the great results of SMB Multichannel when it comes to aggregation and redundancy. But the answer lies in the name “SMB”. It’s only available for SMB traffic. Believe it or not but there is still a massive amount of network traffic that is not SMB and all that traffic has to pass through the Hyper-v vSwitch.

What can we do?

Which means that any redundant scenario that requires other traffic to be supported than SMB 3 will need to use a different solution than SMB Multichannel. Basically, this means using NIC teaming on a server. In the pre Windows Server 2012 era that meant 3rd party products. Since Windows Server 2012 it means native LBFO (switch independent, static or LACP). In Windows Server 2016 Switch Embedded Teaming (SET) was added to your choice op options. SET only supports switch independent teaming (for now?).

If redundancy on the vSwitch is not an option you can use multiple vSwitches connected to separate NIC and physical switches with Windows native LBFO inside the guests. That works but it’s a lot of extra work and overhead so you only do this when it makes sense. One such an example is SR-IOV which isn’t exposed on top of  a LBFO team.

vNIC Speed in guests on Windows Server 2016 Hyper-V

Prior to Windows Server 2016 Hyper-V the speed a vNIC reported was an arbitrary fixed value. In Windows 2012 R2 Hyper-V that was 10Gbps.

This is a screen shot of a Windows Server 2012 R2 VM attached to a vSwitch on an LBFO of 2*1Gbps running on Windows Server 2012 R2 Hyper-V.

image

This is a screen shot of a Windows Server 2016 VM attached to a vSwitch on an LBFO of 2*10Gbp running on Windows Server 2012 R2 Hyper-V.

image

As you can see the fixed speed of 10Gbps meant that even when the switch was attached to a LBFO with 2 1Gbps NIC it would show 10Gbps etc. Obviously that would not happen unless the 2 VMs are on the same host and the limitations of the NIC don’t come into play as these are bypassed.Do note that the version of Windows in the guest doesn’t matter here as demonstrated above.

The reported speed behavior has changed in Windows Server 2016 Hyper-V. You’ll get a more realistic view of the network capabilities of the VM in some scenarios and configurations.

Here’s a screenshot of a VM attached to a vSwitch on a 1Gbps NIC.

image

As you can see it reports the speed the vSwitch can achieve, 1Gbps. Now let’s look at a VM who’s vNIC is attached to a LFBO of two 10Gbps NICs.

image

This NIC reports 20Gbps inside of the VM, so that’s 2 * 10Gbps.

You get the idea. the vNIC reports the aggregated maximum bandwidth of the NICs used for the  vSwitch. If we had four 10Gbps NICs in the LBFO used for the vSwitch we could see 40Gbps.

You do have to realize some things:

  • Whether a VM has access to the the entire aggregated bandwidth depends on the model of the aggregation. Just consider Switch independent teaming versus LACP teaming modes.
  • The reported bandwidth has no knowledge of any type of QoS. Not hardware based, or virtual via Hyper-V itself.
  • The bandwidth also depends on the capabilities of the other components involved (CPU, PCIe, VMQ, uplink speed, potentially disk speed etc.)
  • Traffic within a host bypasses the physical NIC and as such isn’t constraint  by the NIC capabilities it self.
  • As before the BIOS power configuration has an impact on the speed of your 10Gbps or higher NICs.

In Defense of Switch Independent Teaming With Hyper-V

For many old timers (heck, that includes me) NIC teaming with LACP mode was the best of the best, at least when it comes to teaming options. Other modes often led to passive/active, less than optimal receiving network traffic aggregation. Basically, and perhaps over simplified, I could say the other options were only used if you had no other choice to get things to work. Which we did a lot … I used Intel’s different teaming modes for various reasons in the past (before we had MLAG, VLT, VPC, …). Trying to use LACP where possible was a good approach in the past in physical deployments and early virtualized environments when 1Gbps networking dominated the datacenter realm and Windows did not have native support for LBFO.

But even LACP, even in those days, had some drawbacks. It’s the most demanding form of teaming. For one it required switch stacking. This demands the same brand and type of switches and that means you have no redundancy during firmware upgrades. That’s bad, as the only way to work around that is to move all workload to another rack unit … if you even had the capability to do that! So even in days past we chose different models if teaming out of need or because of the above limitations for high availability. But the superiority of NIC teaming with LACP still stands for many and as modern switches support MLAG, VLT, etc. the drawback of stacking can be avoided. So does that mean LACP for NIC teaming is always the superior choice today?

Some argue it is and now they have found support in the documentation about Microsoft CPS system documentation about Microsoft CPS system. Look, even if Microsoft chose to use LACP in their solutions it’s based on their particular design and the needs of that design I do not concur that this is the best overall. It is however a valid & probably the choice for their specific setup. While I applaud the use of MLAG (when available to you a no or very low cost) to have all bases covered but it does not mean that LACP is the best choice for the majority of use cases with Hyper-V deployments. Microsoft actually agrees with me on this in their Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management guide. They state that Switch Independent configuration / Dynamic distribution (or Hyper-V Port if on Hyper-V and if not on W2K12R2)  is the best possible default choice is for teaming in both native and Hyper-V environments. I concur, even if perhaps not that strong for native workloads (it depends). Exceptions to this:

  • Teaming is being performed in a VM (which should be rare),
  • Switch dependent teaming (e.g., LACP) is required by policy, or
  • Operation of a two-member Active/Standby team is required by policy.

In other words in 2 out of 3 cases the reason is a policy, not a technical superior solution …

Note that there are differences between Address Hash, Hyper-V Port mode & the new dynamic distribution modes and the latter has made things better in W2K12R2 in regards to bandwidth but you’ll need the read the white papers. Use dynamic as default, it is the best. Also note that LACP/Switch Dependent doesn’t mean you can send & receive to and from a VM over the aggregated bandwidth of all team members. Life is more complicated than that. So if that’s you’re main reason for switch dependent, and think you’re done => be ware Winking smile.

Switch Independent is also way better for optimization of VMQ. You have more queues available (sum-of-queues) and the IO path is very predictable & optimized.

If you don’t control the switches there’s a lot more cross team communication involved to set up teaming for your hosts. There’s more complexity in these configurations so more possibilities for errors or bugs. Operational ease is also a factor.

The biggest draw back could be that for receiving traffic you cannot get more than the bandwidth a single team member can deliver. That’s true but optimizing receiving traffic has it’s own demands and might not always be that great if the switch configuration isn’t that smart & capable. Do I ever miss the potential ability to aggregate incoming traffic. In real life I do not (yet) but in some configurations it could do a great job to optimize that when needed.

When using 10Gbps or higher you’ll rarely be in a situation where receiving traffic is higher than 10Gbps or higher and if you want to get that amount of traffic you really need to leverage DVMQ. And a as said switch independent teaming with port of dynamic mode gives you the most bang for the buck. as you have more queues available. This drawback is mitigated a bit by the fact that modern NICs have way larger number of queues available than they used to have. But if you have more than one VM that is eating close to 10Gbps in a non lab environment and you planning to have more than 2 of those on a host you need to start thinking about 40Gbps instead of aggregating a fistful of 10Gbps cables. Remember the golden rules a single bigger pipe is always better than a bunch of small pipes.

When using 1Gbps you’ll be at that point sooner and as 1Gbps isn’t a great fit for (Dynamic) VMQ anyway I’d say, sure give LACP a spin to try and get a bit more bandwidth but will it really matter? In native workloads it might but with a vSwith?  Modern CPUs eat 1Gbps NICs for breakfast, so I would not bother with VMQ. But when you’re tied to 1Gbps it’s probably due to budget constraints and you might not even have stackable, MLAG, VLT or other capable switches. But the arguments can be made, it depends (see Don’t tell me “It depends”! But it does!). But in any case I start saving for 10Gbps Smile

Today as the PC8100 series and the N4000 Series (budget 10Gbps switches, yes I know “budget” is relative but in the 10Gbps world, but they offer outstanding value for money), I tend to set up MLAG with two of these per rack. This means we have all options and needs covered at no extra cost and without sacrificing redundancy under any condition. However look at the needs of your VMs and the capability of your NICs before using LACP for teaming by default. The fact that switch independent works with any combination of budget switches to get redundancy doesn’t mean it’s only to be used in such scenarios. That’s a perk for those without more advanced gear, not a consolation price.

My best advise: do not over engineer it. Engineer it for the best possible solution for the environment at hand. When choosing a default it’s not about the best possible redundancy and bandwidth under certain conditions. It’s about the best possible redundancy and bandwidth under most conditions. It’s there that switch independent comes into it’s own, today more than ever!

There is one other very good, but luckily also a very rare case where LACP/Switch dependent will save you and switch independent won’t: dead switch ports, where the port becomes dysfunctional. So while switch independent protects against NIC, Switch, cable failures, here it doesn’t help you as it doesn’t know (it’s about link failures, not logical issues on a port).

For the majority of my Hyper-V deployments I do not use switch dependent / LACP. The situation where I did had to do with Windows NLB in combination with ICMP Multicast.

Note: You can do VLT, MLAG, stacking and still leverage switch independent teaming, LACP or static switch dependent is NOT mandatory even when possible.