Windows Server 2016 RDMA and the Hyper-V vSwitch – Part I

Introduction

With Windows Server 2012 R2, using both RDMA and the Hyper-V vSwitch on the same host required separate physical network adapters (pNICs). There are two reasons for this.

  • First, a vSwitch is generally created on top of a native Windows NIC team, and such a NIC team does not expose RDMA capabilities.
  • Second, in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed, RDMA capable NIC.

As a result, the need for RDMA required more NICs on the Hyper-V hosts, while a fully converged setup had some serious drawbacks. As servers have been quite capable and our VMs serve ever more intensive workloads, this was not dramatic. Leveraging 2*10Gbps for a vSwitch and 2*10Gbps for redundant RDMA / SMB Direct traffic has long been one of my favorite designs. It leaves room for other traffic, such as backups, and it allows for high VM density. But with 40Gbps NICs that is overkill and a tad expensive in many scenarios, even when connecting to a SOFS share for Hyper-V storage. 4*40Gbps on a Hyper-V host is not something I ever saw in real life.

Windows Server 2016 can expose RDMA capabilities via a vSwitch even without SET

What many people seem to have missed is that reason 2 is gone in Windows Server 2016 Hyper-V. Reason 1 still holds true, but that has been solved by Switch Embedded Teaming (SET). This means that you do not actually need SET to leverage RDMA with a vSwitch in Windows Server 2016 Hyper-V, as long as you can live with a single, non-teamed pNIC. You can do this as follows:
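A minimal PowerShell sketch of those steps, assuming a single RDMA capable pNIC. The adapter name `NIC1`, the switch name `RDMA-vSwitch` and the vNIC name `SMB1` are placeholders chosen for illustration, not the names used in the original lab.

```powershell
# Assumption: "NIC1" is an RDMA capable pNIC; all names are placeholders.
# Create a vSwitch on a single, non-teamed pNIC (no SET, no native NIC team).
New-VMSwitch -Name "RDMA-vSwitch" -NetAdapterName "NIC1" -AllowManagementOS $false

# Add a vNIC to the management OS on that vSwitch.
Add-VMNetworkAdapter -ManagementOS -SwitchName "RDMA-vSwitch" -Name "SMB1"

# Enable RDMA on the vNIC; the host sees it as "vEthernet (SMB1)".
Enable-NetAdapterRdma -Name "vEthernet (SMB1)"
```

The vNIC still needs an IP configuration before SMB Direct traffic can flow over it.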


Below is what this looks like. We have one vNIC on the management OS leveraging RDMA/SMB Direct and consuming all 10Gbps of the NIC we connected to the vSwitch. This is a nice lab demo, but you can see this perhaps isn’t the best idea in real life.

[Screenshot: a management OS vNIC using RDMA/SMB Direct and consuming all 10Gbps of the pNIC behind the vSwitch]

Other things to note

Do realize this still requires the pNIC to be RDMA capable. This is not some sort of soft RoCE or other software RDMA magic as of today. The pNIC also has to have RDMA enabled, or the vNIC won’t be able to leverage RDMA and will fall back to plain SMB (Multichannel only) instead of SMB Direct. Likewise, RDMA has to be enabled on the vNIC as well. So don’t forget, RDMA must be enabled on both the pNIC and the vNIC for this to work.
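A quick way to check both ends could look like the sketch below. The names `NIC1` (pNIC) and `SMB1` (management OS vNIC) are assumed placeholders; substitute your own.

```powershell
# Assumption: "NIC1" is the pNIC and "SMB1" the management OS vNIC (placeholder names).
# Enable RDMA on the pNIC in case it is disabled.
Enable-NetAdapterRdma -Name "NIC1"

# Check that RDMA is enabled on both the pNIC and the vNIC.
Get-NetAdapterRdma -Name "NIC1", "vEthernet (SMB1)" | Format-Table Name, Enabled

# Check that SMB sees the vNIC as RDMA capable.
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable, RssCapable

# While copying data over SMB, list the active multichannel connections.
Get-SmbMultichannelConnection
```

If either adapter reports RDMA as disabled, expect the traffic to run over regular SMB Multichannel rather than SMB Direct.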

DCB’s PFC/ETS requires a tagged VLAN to carry the priority, so don’t forget to tag the vNIC. There is actually no need to tag the pNIC as long as the switch port has the tagged VLAN set (most likely as a trunk or in general mode). If you don’t tag consistently across the entire network stack you’ll have network issues anyway, and RDMA performance will be bad if it works at all.

Finally, don’t forget this example is not using VMM / Network Controller and as such is using Set-VMNetworkAdapterVLAN and not Set-VMNetworkAdapterIsolation.
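As a sketch of the tagging step, assuming a management OS vNIC named `SMB1` and VLAN ID 110 (both values are made up for illustration):

```powershell
# Assumption: vNIC "SMB1" and VLAN ID 110 are placeholders for illustration.
# Tag the management OS vNIC so the 802.1p priority can travel in the VLAN tag.
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "SMB1" -Access -VlanId 110

# Verify the VLAN setting on the management OS vNICs.
Get-VMNetworkAdapterVlan -ManagementOS
```

The switch port the pNIC connects to has to carry that same VLAN as tagged.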

In real life, we need better and more than a single NIC vSwitch

The caveat here is that, while you have a converged setup, you have no redundancy for the vSwitch (there is no team). This also means that you are limited to a single NIC in regards to throughput for that vSwitch. Depending on the needs of the solution that might be perfectly fine. If it’s not, and in most real-world scenarios you’ll need redundancy, you have to use SET in a converged scenario. That’s what we’ll take a look at in part 2. Then there is the question of QoS, as you don’t want SMB Direct traffic to consume too much bandwidth at will. That’s still another issue to discuss and address.

NVM Express over Fabrics

Any technologist who’s read about, let alone used, NVM Express (NVMe) is pretty enthusiastic about its capabilities, and if it was not for availability and financial restrictions we’d all have at least a couple of NVMe drives in our home systems and labs. Let Windows Server 2016 nested virtualization come, woohoo!

NVMe seems to succeed very well in making sure the host can keep up with the performance (low latencies, high throughput) delivered by SSDs, better than our current interfaces. Today you can drop them into your workstation or server and get going. They’ll give your home lab stellar storage performance, and at Ignite 2015 Microsoft publicly talked about them being supported in Storage Spaces Direct in Windows Server 2016. But there is more to come.

Many of us are very happy with future visions on how PCIe will dislodge SAS/SATA as the preferred SSD interface. This seems feasible for local storage right now, and it can be leveraged for caching or with local PCIe RAID controllers which, if shared, enable Cluster in a Box (CiB) scenarios. But how do we deal with this in an actual storage array? What if we want to scale this out? There are no “PCIe JBODs”. So what does one do? Well, how did we do it in the past with FC? We created a fabric. Below we see several local & remote NVMe architectures, even hybrid ones with traditional SAS.

[Figure: local and remote NVMe architectures, including hybrid ones with traditional SAS]

That’s exactly what NVM Express Inc. is doing, creating the specs for a fabric. This holds the promise to achieve superior results due to the elimination of SCSI translation which reduces latency significantly by delivering NVMe end to end. Not only that but we also see the following efforts in the NVM Express Specification 1.2 to give it enterprise grade capabilities beyond pure performance.

  • Enhanced status reporting
  • Expanded capabilities including live firmware updates

There have been some early demos of NVMe over Fabrics, mainly focusing on the “remote” performance. While local NVMe SSDs have the edge on absolute IOPS, the difference with NVMe over a fabric is not significant. The difference in latency is measured at < 10 µs, so that’s good news. The fabric leverages RDMA (yes, yet another reason that the time I spent on this technology has been a useful investment). This can be InfiniBand, RoCE or iWARP. There’s also the new kid on the block, “Intel Omni Scale” (even if their early demo used iWARP). There’s also a Mellanox RoCE demo.

Now with NVMe SSD speeds, the writing is on the wall that ever better fabric performance will be needed to support the tremendous throughput this evolution of storage can deliver. RDMA seems poised for success in this regard. Now, yes, strictly speaking the NVMe traffic does not require RDMA, but let’s just say I don’t see anyone building it without. I also think this means even iWARP fabrics will use DCB (PFC) to make sure we have a lossless network. The amount of traffic will be immense, so why not optimize for the best possible performance? I hold the opinion this is beneficial for east-west traffic today in larger environments, especially in converged networks. Unless the Intel® Omni-Path Architecture blows everyone else away, that is. Too early to tell.

Now does this dictate the total and absolute obsolescence of iSCSI and FC? No. There is no reason why an NVMe Fabrics storage solution cannot offer storage to hosts via FC, iSCSI, SMB 3, NFS, FCoE, … They could potentially even offer all RDMA flavors like iWARP, RoCE or InfiniBand to the hosts, so you won’t lose your prior investments or get locked into one flavor of RDMA. I have no crystal ball so I cannot tell you if this will happen. What I do know is that when it comes to MPIO versus multichannel for load balancing, and even failover and recovery, multichannel does a (far?) superior job in my honest opinion, even when the hypervisor uses separate sessions per virtual machine to achieve better load balancing over iSCSI or the like. So perhaps storage vendors will finally deliver full SMB 3 support in their stacks … if not, well, we’ll just abstract your storage away with SOFS. Your loss. Anyway, I digress. One thing I do know is that I’ll keep a keen eye on what Microsoft is doing in this space, especially in regards to Windows Server 2016 capabilities & scalability. It’s time to up the level on scalability & support for newer state of the art technologies once again. It will ensure we get to run our stack on the very best hardware for years to come.

Presenting at ITProceed 2015 & E2EVC 2015 Berlin on SMB Direct

You cannot afford to ignore SMB3 and its capabilities related to storage traffic such as multichannel, RDMA and encryption. SMB Direct over RoCE seems to have a bright future as it continues to evolve and improve in Windows Server 2016. The need for DCB (PFC and optionally ETS) intimidates some people. But it should not.

I’ll be putting SMB Direct & RoCE into perspective at ITProceed 2015 and at E2EVC 2015 Berlin (June 12-14, 2015, Berlin, Germany), sharing experiences, tips and demos! Come see PFC & ETS in action and learn what they can do for you. Storage vendors should most certainly consider supporting all features of SMB 3 natively as a competitive advantage. So join me for the talk “SMB Direct – The Secret Decoder Ring”.

All these talks are at extremely affordable, community driven events to make sure you can attend. The sessions are given by speakers who do this for the community (speakers and attendees do this in their own time and pay for their own travel/expenses), who work with these technologies in real life and who provide feedback to vendors on the issues or opportunities we see. This makes the sessions very interesting and anything but marketing, slideware or sales pitches. See you there!

Hyper-V Amigos Showcast Episode 9 – RDMA, RoCE, PFC and ETS

Just before Carsten Rachfahl and I left for Microsoft Ignite we recorded episode 9 of the Hyper-V Amigos Showcast. In this episode we discuss SMB Direct over RoCE (RDMA over Converged Ethernet), which requires lossless Ethernet.

Data Center Bridging (DCB) is the way to achieve this. It has four standards, PFC (802.1Qbb), ETS (802.1Qaz), CN (802.1Qau) and DCBx, but only two are important to us now. Priority Flow Control (PFC) is mandatory and Enhanced Transmission Selection (ETS) is optional (but very handy depending on your environment).
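To make this a bit more tangible, here is a hedged sketch of what the host side of PFC and ETS could look like in PowerShell. Priority 3 for SMB Direct, the 50% bandwidth share and the pNIC name `NIC1` are example values I picked, not settings from the showcast, and the physical switches need a matching DCB configuration.

```powershell
# Assumptions: priority 3 for SMB Direct, a 50% ETS share and pNIC "NIC1" are example values.
# The DCB feature must be present on the host.
Install-WindowsFeature -Name Data-Center-Bridging

# Tag SMB Direct traffic (NetworkDirect on port 445) with priority 3.
New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# PFC: make priority 3 lossless, leave the other priorities lossy.
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# ETS (optional): reserve a minimum bandwidth share for priority 3.
New-NetQosTrafficClass -Name "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

# Apply the DCB settings on the pNIC.
Enable-NetAdapterQos -Name "NIC1"
```

PFC is what keeps the chosen priority lossless. The ETS traffic class only guarantees a minimum share under contention, so SMB Direct can still use idle bandwidth.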

If you need more information on this, start with these blogs on the subject. But without further delay, here’s Hyper-V Amigos Showcast Episode 9 – RDMA, RoCE, PFC and ETS.