Windows Server 2016 RDMA and the Hyper-V vSwitch – Part I

Introduction

With Windows Server 2012 R2, using both RDMA and the Hyper-V vSwitch on the same host required separate physical network adapters (pNICs). There are 2 reasons for this.

  • First a vSwitch is generally created with a native Windows NIC team. Such a NIC team does not expose RDMA capabilities.
  • Second is that in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed RDMA capable NIC.

As a result, the need for RDMA required more NICs on the Hyper-V hosts and/or a fully converged had some serious drawbacks. As servers have been quite capable and our VMs serve ever more intensive workloads this was not dramatic. Leveraging 2*10Gbps for a vSwitch and 2*10Gbps for redundant RDMA / SMB Direct traffic have long been one of my favorite designs. It leaves room for other traffic, such as backups, and it allows for high VM density. But with 40Gbps NICs that is overkill and a tad expensive in many scenarios, even when connecting to a SOFS share for Hyper-V storage, so 4*40Gbps on a Hyper-V host is not something I ever saw in real life.

Windows Server 2016 can expose RDMA capabilities via a vSwitch even without SET

What many people seem to have missed is that reason 2 has gone in Windows Server 2016 Hyper-V. Reason 1 still holds true. But that has been solved by Switch Embedded Teaming (SET). This means that you actually do not need SET to leverage RDMA with an vSwitch in Windows Server 2016 Hyper-V. You can do this as follows:


Below is what this looks like. We have one vNIC on the management OS leveraging RDMA/SMB Direct consuming all 10Gbps if the NIC we connected to the vSwith. This is a nice lab demo but you can see this isn’t perhaps the best idea in real life.

clip_image002

Other things to note

Do realize this still requires the pNIC to be RDMA capable. This is not some sort of soft RoCE or other software RDMA magic as of today. The pNIC also has to have RDMA enabled or virtual NIC won’t be able to leverage RDMA but fall back to SMB (Multichannel only) instead of SMB Direct. Likewise, RDMA has to be enabled on the vNIC as well. So don’t forget, RDMA must be enabled on both the pNIC and the vNIC for this to work.clip_image004

DCB’s PFC/ETS requires a tagged VLAN to carry the priority, do don’t forget to tag the vNIC. There is actually no need to tag the pNIC as long as the switch port has the tagged VLAN set – most likely as a trunk or in general mode. If you don’t tag consistently across the entire network stack you’ll have network issues anyway and RDMA performance will be bad if it works at all.

Finally, don’t forget this is example is not using VMM /Network Controller and as such is using Set-VMNetworkAdapterVLAN and not Set-VMNetworkAdapterIsolation.

In real life, we need better and more than a single NIC vSwitch

The caveat here is that, while you have a converged setup, you have no redundancy for the vSwitch (there is no team). This also means that you’re are limited to a single NIC in regards to throughput for that vSwith. Depending on the needs of the solutions that might be perfectly fine. It it’s not – in most real-world scenarios you’ll need redundancy – you have to use SET in a converged scenario. That’s what we’ll take a look at in part 2. Then there is the question about QoS as you don’t want SMB Direct traffic to consume to much bandwidth at will. That’s still another issue to discuss and address.

RDMA/RoCE & Windows Server 2016 TPv4 Testing

Introduction

My good buddy and fellow MVP Aidan Finn has promoted disabling advanced features by default in order to avoid downtime for a long time and with good reason. I agree with the fact that too many implementations of features such as UNMAP, ODX, VMQ are causing us issues. This has to improve and meanwhile something has to be done to avoid the blast & fallout of such issues. In this trend Windows Server 2016 is taking steps towards disabling capabilities by default.

Windows 2016 & RDMA/RoCE

In Windows 2012 (R2), RDMA was enabled by default on ConnectX-3 adapters. This is great when you’ve provisioned a lossless fabric for them to use and configure the hosts correctly. As you know by now RoCE requires DCB Priority flow control and optionally ETS to deliver stellar performance.

If SMB 3 detects that RDMA cannot work properly it will fall back top TCP (that’s what that little TCP “standby” connection is for in those SMB Multichannel/Direct drawings. Not working correctly can mean that and RoCE/RDMA connection cannot be establish or fails under load.

To make sure that people get the behavior they desire and not run into issues the idea is to move to have RDMA disabled by default in Windows Server 2016 when DCB/PFC is disabled. At least for the inbox drivers. This is mentioned in RDMARoCE Considerations for Windows 2016 on ConnectX-3 Adapters. When you want it you’ll have to enable it on purpose meaning that they assume you’ve also set up a lossless Ethernet fabric and configured your hosts.

If you don’t want this there is a way to return to the old behavior and that’s a registry key called “NDKWithGlobalPause”. When this key is set to “1” you are basically forcing the NIC to work with Global Pause. Nice for a lab, but not for real world production with RDMA. We want it to be 0, which is the default.

image

My Lab Experience

Now setting this parameter to 1 for testing on the Mellanox drivers (NOT inbox) in a running lab server it caused a very nice blue screen. Now granted, I’m playing with the 5.10 drivers which are normally not meant for Windows 2016 TPv4. I’m still trying to understand the use & effects of this setting but for now I’m not getting far.

image

This does not happen on Windows 2012 R2 by the way. I should be using the http://www.mellanox.com/page/products_dyn?product_family=129 for Windows 2016 TPv4 testing, even if that one still says TPv3. This beta driver enables:

  • NDKPI 2.0
  • RoCE over SR-IOV
  • Virtualization and RoCE/SMB-Direct on the same port
  • VXLAN Hardware offload
  • PacketDirect

Conclusion

In an ideal world all these advanced features would be enabled out of the box to be used when available and beneficial. Unfortunately, this idea has not worked out well in the real world. Bugs in operating systems, features, NIC firmware and drivers, in storage array firmware and software as well as in switch firmware have made for too many issues for too many people.

There is a push to disable them all by default. The reason for this clear. Avoid down time, data corruption etc. While I can understand this and agree with the practice to avoid issues and downtime it’s also a bit sad.  I do hope that work is being done to make sure that these performance and scalability features become truly reliable and that we don’t end up disabling them all, never to turn them on again. That would mean we’re back to banking on pure raw power or growth for performance & efficiency gains. What does that mean for the big convergence push? If we can’t get these capabilities to be reliable enough for use “as is” now, how much riskier will it become when they are all stacked on top of each other in a converged setup? I’m not to stoked about having ODX, UNMAP, VMQ, RMDA etc. turned off as a “solution” however. I want them to work well and not lose them as a “fix”, that’s unacceptable. When I do turn them on and configure the environment correctly I want them to work well. This industry has some serious work to do in getting there. All this talk of software defined anything will not go far outside of cloud providers if we remain at the mercy of firmware and drivers. In that respect I have seen many software defined solutions get a reality check as early implementations are often a step back in functionality, reliability and capability. It’s very much still a journey. The vision is great, the promise is tempting, but in a production environment I tend to be conservative until I have proven myself it works for us.

Musings On Switch Embedded Teaming, SMB Direct and QoS in Windows Server 2016 Hyper-V

When you have been reading up on what’s new in Windows Server 2016 Hyper-V networking you probably read about Switch Embedded Teaming (SET). Basically this takes the concept of teaming and has this done by the vSwitch. Which means you don’t have to team at the host level. The big benefit that this opens up is the RDMA can be leveraged on vNICs. With host based teaming the RDMA capabilities of your NICs are no longer exposed, i.e. you can’t leverage RDMA. Now this has become possible and that’s pretty big.

clip_image001

With the rise of 10, 25, 40, 50 and 100 Gbps NICs and switches the lure to go fully converged becomes even louder. Given the fact that we now don’t lose RDMA capabilities to the vNICs exposed to the host that call sounds only louder to many.  But wait, there’s even more to lure us to a fully converged solution, the fact that we now do no longer lose RSS on those vNICs! All good news.

I have written an entire whitepaper on convergence and it benefits, drawback, risks & rewards. I will not repeat all that here. One point I need to make that lossless traffic and QoS are paramount to the success of fully converged networking. After all we don’t want lossy storage traffic and we need to assure adequate bandwidth for all our types of traffic. For now, in Technical Preview 3 we have support for Software Defined Networking (SDN) QoS.

What does that mean in regards to what we already use today? There is no support for native QoS  and vSwitch QoS in Windows Server 2016 TPv3. There is however the  mention of DCB (PFC/ETS ), which is hardware QoS in the TechNet docs on Remote Direct Memory Access (RDMA) and Switch Embedded Teaming (SET). Cool!

But wait a minute. When we look at all kinds of traffic in a converged Hyper-V environment we see CSV (storage traffic), live migration (all variations), backups over SMB3 all potentially leveraging SMB Direct. Due to the features and capabilities in SMB3 I like that. Don’t get me wrong about that. But it also worries me a bit when it comes to handling QoS on the hardware side of things.

In DCB Priority Flow Control (PFC) is the lossless part, Enhanced Transmission Selection (ETS) is the minimum bandwidth QoS part. But how do we leverage ETS when all types of traffic use SMB Direct. On the host it all gets tagged with the same priority. ETS works by tagging different priorities to different workloads and assuring minimal bandwidths out of a total of 100% without reserving it for a workload if it doesn’t need it. Here’s a blog post on ETS with a demo video DCB ETS Demo with SMB Direct over RoCE (RDMA .

Does this mean a SDN QoS only approach to deal with the various type of SMB Direct traffic or do they have some aces up their sleeves?

This isn’t a new “concern” I have but with SET and the sustained push for convergence it does has the potential to become an issue. We already have the SMB bandwidth limitation feature for live migration. That what is used to prevent LM starving CSV traffic when needed. See Preventing Live Migration Over SMB Starving CSV Traffic in Windows Server 2012 R2 with Set-SmbBandwidthLimit.

Now in real life I have rarely, if ever, seen a hard need for this. But it’s there to make sure you have something when needed. It hasn’t caused me issues yet, but I’m a performance & scale first, in “a non-economies of scale” world compared to hosters. As such convergence is a tool I use with moderation. My testing when traffic competes without ETS is that they all get part of the cake but not super predictable/ consistent. SMB bandwidth limitation is a bit of a “bolted on” solution => you can see the perf counters push down the bandwidth in an epic struggle to contain it, but as said it’s a struggle, not a nice flat line.

Also Set-SmbBandwidthLimit is not a percentage, but hard max bandwidth limit, so when you lose a SET member the math is off and you could be in trouble fast. Perhaps it’s these categories that could or will be used but it doesn’t seem like the most elegant solution/approach. That with ever more traffic leveraging SMB Direct make me ever more curious. Some switches offer up to 4 lossless queues now so perhaps that’s the way to go leveraging more priorities … Interesting stuff! My preferred and easiest QoS tool, get even bigger pipes, is an approach convergence and evolution of network needs keeps pushing over. Anyway, I’ll be very interested to see how this is dealt with. For now I’ll conclude my musings On Switch Embedded Teaming, SMB Direct and QoS in Windows Server 2016 Hyper-V

E2EVC 2015 Berlin SMB Direct Slide Deck

I attended and presented at E2EVC 2015 in Berlin from June 12th to June 14th. The networking was a blast. No “marchitecure” bull shit or vendor fairy tales what so ever and lots of very open discussions on the realities we’re seeing and facing in virtualization and cloud. Most account managers and esoteric presales would die a painful (but fast) death in this environment.

image

One session was with my Hyper-V Amigo buddy Carsten Rachfahl and was pure demo extravaganza, so no slides. My own session was “SMB Direct – The Secret Decoder Ring” and was an attempt to position this technology what by looking at the why and where followed by the how by who and when.

image

I hope a lot of people had at least a better understanding of SMB Direct, RDMA and DCB. The second aim was to take away the fear many people have of this tech by showcasing it in short demos. Time constraints where a challenge so it was not a 200 level session.

Please download the presentation here if interested.

Enjoy. If you have any concerns or questions, ask, and I’ll try to answer.