Windows Server 2012 Supports Data Center TCP (DCTCP)

In the grand effort to make Windows Server 2012 scale above and beyond the call of duty, Microsoft has been addressing (potential) bottlenecks all over the stack: CPU, NUMA, memory, storage and networking.

Data Center TCP (DCTCP) is one of the many improvements with which Microsoft aims to deliver much better network throughput on affordable switches. Switches that can manage large amounts of network traffic tend to have large buffers, and those push up the price a lot. The idea is that a large buffer lets the switch absorb bursts and prevents congestion. Call it over-provisioning if you want. While this helps, it is far from ideal. Let's call it a blunt instrument.

To mitigate this issue Windows Server 2012 is now capable of dealing with network congestion in a more intelligent way. Using DCTCP it reacts to the degree, not merely the presence, of congestion. The goals are:

  • Achieve low latency, high burst tolerance and high throughput with small-buffer (read: cheaper) switches.
  • It requires Explicit Congestion Notification (ECN, RFC 3168) capable switches. You'd think this would be no showstopper, as ECN is probably pretty common on most data center / rack switches, but that doesn't seem to be the case for the really cheap ones where this would shine.
  • The algorithm only kicks in when it makes sense to do so (low round-trip times, i.e. it will be used inside the data center where it makes sense, not over a worldwide WAN or the internet).

To see if it is applied run Get-NetTcpConnection:

[Screenshot: Get-NetTcpConnection output for the CSV and LM networks]

As you can see, this is applied here on a DELL PC8024F switch for the CSV and LM networks. The internet-connected NIC (the connection of the RDP session) shows:

[Screenshot: Get-NetTcpConnection output for the internet-connected NIC]

Yup, it’s East-West traffic only, not North-South where it makes no sense.
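If you'd rather filter that output than eyeball the full list, something along these lines should do it. This is a minimal sketch: the AppliedSetting and CongestionProvider property names are what I see on my Windows Server 2012 lab boxes, so treat them as assumptions and check Get-Member on your build.

# Show which TCP setting template each connection picked up (Datacenter = DCTCP candidate)
Get-NetTCPConnection |
    Sort-Object AppliedSetting |
    Format-Table LocalAddress, LocalPort, RemoteAddress, RemotePort, State, AppliedSetting -AutoSize

# Only the connections that got the Datacenter template
Get-NetTCPConnection | Where-Object { $_.AppliedSetting -eq "Datacenter" }

# And the congestion provider behind each template (DCTCP for Datacenter)
Get-NetTCPSetting | Format-Table SettingName, CongestionProvider -AutoSize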

When I was prepping a slide deck for a presentation on what this is, does and means, I compared it to green wave traffic light control. The space between consecutive traffic lights is the buffer and the red lights are the stops the traffic has to deal with due to congestion. This leaves room for a lot of improvement, and the way to achieve it is traffic control that intelligently manages the incoming flow so that at every hop there is a green light and the buffer isn't saturated.

[Illustration: traffic hitting red lights, the analogy for buffer congestion]

Windows Server 2012 in combination with Explicit Congestion Notification (ECN) provides the intelligent traffic control to realize the green wave.

[Illustration: the green wave realized with intelligent traffic control]

The result is very smooth, low latency traffic with high burst tolerance and high throughput on cheaper, small-buffer switches. To see the difference, look at the picture below (from Microsoft BUILD) of what this achieves. Pretty impressive. There's also a paper by Microsoft Research on the subject.

[Slide from Microsoft BUILD showing the difference DCTCP makes]

NIC Teaming in Windows 8 & Hyper-V

One of the many new features in Windows 8 is native NIC teaming, also known as Load Balancing and Fail Over (LBFO). This is, amongst many others, a most welcome and long awaited improvement. Now that Microsoft has published a great whitepaper on this (see the link at the end), it's time to publish this post that has been simmering in my drafts for too long. Most of us dealing with NIC teaming in Windows have a lot of stories to tell about incompatible modes depending on the type of teaming, the vendors and what other advanced networking features you use. Combine that with the fact that this is a moving target, due to a constant trickle of driver & firmware updates to rid us of bugs or add support for features, and what works and what doesn't changes over time. So you have to keep an eye on this. And then we haven't even mentioned whether it is supported or not and the hassle & risk involved with updating a driver.

When it works it rocks and provides great benefits (if it didn't, it would have been dead long ago). But it has not always been a nice story. Not for Microsoft, not for the NIC vendors and not for us IT pros. Everyone wants things to be better and finally it has happened!

Windows 8 NIC Teaming

Windows 8 brings in-box NIC teaming, also known as Load Balancing and Fail Over (LBFO), with full Microsoft support. This makes me happy as a user. It makes the NIC vendors happy to get out of needing to supply & support LBFO. And it makes Microsoft happy, because it was a long missing feature in Windows that made things more complex and error prone than they needed to be.

So what do we get from Windows NIC teaming?

  • It works both in the parent & in the guest. This comes in handy, read on!

[Diagram: NIC teaming in the parent partition and in the guest]

  • No need for anything else but NICs and Windows 8, that's it. No 3rd party drivers or software needed.
  • A nice and simple GUI to configure & manage it.
  • Full PowerShell support for all of the above as well, so you can automate it for rapid & consistent deployment (see the sketch after this list).
  • Different NIC vendors are supported in the same team. You can create teams with NICs from different vendors in the same host, and you can also use different NICs across hosts. This is important for Hyper-V clustering, where you don't want to be forced to use the same NICs everywhere. On top of that you can live migrate transparently between servers that have different NIC vendor setups. The fact that Windows 8 abstracts this all for you is just great and gives us a lot more options & flexibility.
  • Depending on the switches you have it supports a number of teaming modes:
    • Switch Independent: This uses algorithms that do not require the switch to participate in the teaming. The switch doesn't know or care which NICs are part of a team, so the teamed NICs can be connected to different switches. The benefit is that you can use multiple switches for fault tolerance without any special requirements like stacking.
    • Switch Dependent: Here the switch is involved in the teaming. As a result this requires all the NICs in the team to be connected to the same switch, unless you have stackable switches. In this mode network traffic travels at the combined bandwidth of the team members, which act as a single pipeline. There are two variations supported:
      1. Static (IEEE 802.3ad), or generic teaming: The configuration on the switch and on the server identifies which links make up the team. This is a static configuration with no extra intelligence in the form of protocols assisting in the detection of problems (a port down, a bad cable or misconfigurations).
      2. LACP (IEEE 802.1ax), also known as dynamic teaming: This leverages the Link Aggregation Control Protocol on the switch to dynamically identify links between the computer and a specific switch. This can be useful to automatically reconfigure a team when issues arise with a port, cable or team member.
  • There are 2 load balancing options:
    1. Hyper-V Port: Virtual machines have independent MAC addresses which can be used to load balance traffic. The switch sees a specific source MAC address connected to only one network adapter, so it can and will balance the egress traffic (from the switch) to the computer over multiple links, based on the destination MAC address of the virtual machine. This is very useful when using Dynamic Virtual Machine Queues. However, this mode might not be granular enough to get a well-balanced distribution if you don't have many virtual machines. It also limits a single virtual machine to the bandwidth that is available on a single network adapter. Windows Server 8 Beta uses the Hyper-V switch port as the identifier rather than the source MAC address, because a virtual machine might be using more than one MAC address on a switch port.
    2. Address Hash: A hash (there are different types, see the whitepaper mentioned at the end for details) is created based on components of the packet. All packets with that hash value are assigned to one of the available network adapters. The result is that all traffic from the same TCP stream stays on the same network adapter; another stream will go to another team member, and so on. That's how you get load balancing. As of yet there is no smart or adaptive load balancing that optimizes the distribution by monitoring traffic and reassigning streams when beneficial.
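Here's what that PowerShell support looks like in practice: a minimal sketch of creating a switch independent team that uses the Hyper-V Port algorithm. The team and NIC names ("Team1", "NIC1", "NIC2") are made up, and the parameter values are what the beta exposes on my systems, so double check Get-Help New-NetLbfoTeam on yours.

# Create a two-member, switch independent team with Hyper-V Port load balancing
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort

# Inspect the team and its members
Get-NetLbfoTeam -Name "Team1"
Get-NetLbfoTeamMember -Team "Team1"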

Here is a nice overview table from the whitepaper:

[Table from the whitepaper: teaming modes versus load balancing options]

Microsoft stated that this covers the most requested types of NIC teaming, but that vendors are still capable of & allowed to offer their own versions, like they have for many years, when they feel those add value.

Side Note

I wonder how all this relates to / works with Windows NLB, not just on a host but also in a virtual machine in combination with Windows NIC teaming in the host (let alone the guest). I already noticed that Windows NLB doesn't seem to work if you use Network Virtualization in Windows 8. Combine that with the fact that there is not much news on any improvements in WNLB (it sure could use some extra features and service monitoring intelligence) and I can't really advise customers to use it any more if they want to future proof their solutions. The Exchange team already went that path 2 years ago. Luckily there are some very affordable & quality solutions out there; Kemp Technologies comes to mind.

  • Scalability. You can have up to 32 NICs in a single team. Yes, those monster setups do exist, and it provides a nice margin to deal with future needs.
  • There is no theoretical limit on how many virtual interfaces you can create on a team. This sounds reasonable, as otherwise having an 8 or 16 member NIC team would make no sense. But let's keep it real: there are other limits across the stack in Windows, but you should be able to get up to at least 64 interfaces generally. Use your common sense. If you wouldn't put 100 virtual machines on just two 1Gbps NICs due to bandwidth & performance concerns, you shouldn't do that on two teamed 1Gbps NICs either.
  • You can mix NICs of different speeds in the same team. Mind you, this is not necessarily a good idea. The best option is to use NICs of the same speed, due to failover and load balancing needs and the fact that you'd like some predictability in a production environment. In the lab this can be handy when you need to test things out or when you'd rather have this than no redundancy. A quick way to check what a team ended up with is sketched below.
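A small, hedged sketch of such a check: it joins Get-NetLbfoTeamMember with Get-NetAdapter to show the operational status and link speed of every member. The team name "Team1" is again just an example.

# List the members of a team with their status and link speed
Get-NetLbfoTeamMember -Team "Team1" | ForEach-Object {
    $nic = Get-NetAdapter -Name $_.Name
    [PSCustomObject]@{
        Member = $_.Name
        Status = $_.OperationalStatus
        Speed  = $nic.LinkSpeed
    }
}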

Things to keep in mind

SR-IOV & NIC teaming

Once you team NICs, they do not expose SR-IOV on top of that. Meaning that if you want to use SR-IOV and need resilience for your network, you'll need to do the teaming in the guest. See the drawing higher up. This is fully supported and works fine. It's not the easiest option to manage, as it's done per guest instead of just on the host, but the tip here is to use the NIC Teaming UI on a host to manage the VM teams at the same time. Just add the virtual machines to the list of managed servers.

[Screenshot: the NIC Teaming UI with virtual machines added to the list of managed servers]

Do note that teams created in a virtual machine can only run in the Switch Independent configuration with Address Hash distribution mode, and only teams where each of the team members is connected to a different Hyper-V switch are supported. Which is very logical because, as the picture below demonstrates, otherwise you wouldn't have a redundant solution.

[Diagram: a guest team whose members connect to the same Hyper-V switch is not redundant]
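Inside the guest the commands look just like on the host. A minimal sketch, assuming two vNICs named "Ethernet" and "Ethernet 2" that are each hooked up to a different Hyper-V switch (TransportPorts being one of the address hash variants):

# Run inside the virtual machine: switch independent + address hash is the supported combination
New-NetLbfoTeam -Name "GuestTeam" -TeamMembers "Ethernet","Ethernet 2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts

# On the host you may also need to allow teaming on those vNICs; the parameter below is what
# I've seen referenced for the beta, so treat it as an assumption and verify it on your build
Set-VMNetworkAdapter -VMName "MyVM" -AllowTeaming On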

Security Features & Policies Break SR-IOV

Also note that any advanced feature like security policies on the (virtual) switch will disable SR-IOV. It has to, or SR-IOV could be used as an effective security bypass mechanism. So be aware of this when you notice that SR-IOV doesn't seem to be working.

RDMA & NIC Teaming Do Not Mix

Now you also need to be aware of the fact that RDMA requires that each NIC has a unique IP address. This excludes NIC teaming from being used with RDMA. So in order to get more bandwidth than one RDMA NIC can provide you'll need to rely on SMB Multichannel. But that's not bad news.
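A hedged sketch of how you could verify that your RDMA NICs are recognized and that SMB Multichannel is actually spreading the traffic over them; check Get-Command first to confirm these cmdlets exist on your build.

# RDMA capable / enabled adapters on this host
Get-NetAdapterRdma

# Once SMB traffic is flowing, see which NICs Multichannel is using and whether RDMA is in play
Get-SmbMultichannelConnection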

TCP Chimney

TCP Chimney is not supported with network adapter teaming in Windows Server "8" Beta. This might change, but I don't know of any plans.
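If you want to see whether TCP Chimney offload is even turned on, something like this should tell you; Get-NetOffloadGlobalSetting is what I'd reach for, with the trusty netsh command as a fallback.

# Global offload settings, including the TCP Chimney state
Get-NetOffloadGlobalSetting | Select-Object Chimney

# The classic way
netsh int tcp show global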

Don’t Go Overboard

Note that you can't team teamed NICs, whether it is in the host/parent or in the virtual machines themselves. There is also no support for using Windows NIC teaming to team two teams created with 3rd party (Intel or Broadcom) solutions. So don't stack teams on top of each other.

Overview of Supported / Not Supported Features With Windows NIC Teaming

[Table: features supported / not supported in combination with Windows NIC teaming]

Conclusion

There is a lot more to talk about and a lot more to be tested and learned. I hope to get some more labs going and run some tests to see how it all fits together. The aim of my tests is to be ready for prime time when Windows 8 goes RTM. But buyer beware, this is still "just" beta material.

For more information please download the excellent whitepaper NIC Teaming (LBFO) in Windows Server "8" Beta

Windows 8 introduces SR-IOV to Hyper-V

We dive a bit deeper into SR-IOV today. I'm not a hardware or software network engineer, but this is my perspective on what it is and why it's a valuable addition to the toolbox of Hyper-V in Windows 8.

What is SR-IOV?

SR-IOV stands for Single Root I/O Virtualization. The "Single Root" part means that the PCIe device can only be shared with one system. Multi Root I/O Virtualization (MR-IOV) is a specification where it can be shared by multiple systems. That is beyond the scope of this blog, but you can imagine it being used in future high density blade server topologies and such to share connectivity among systems.

What does SR-IOV do?

Basically SR-IOV allows a single PCIe device to present multiple instances of itself on the PCI bus, so it's a sort of PCIe virtualization. SR-IOV achieves this with NICs that support it (a hardware dependency) by using physical functions (PFs) and virtual functions (VFs). The physical device (think of this as a port on a NIC) is known as a Physical Function (PF). The virtualized instances of that physical device (that port on our NIC which gets emulated x times) are the Virtual Functions (VFs). A PF acts like a full-blown PCIe device: it is configurable and it acts and functions like a physical device. There is only one PF per port on a physical NIC. VFs are only capable of data transfers in and out of the device and can't be configured or act like real PCIe devices. However you can have many of them tied to one PF, and they share the configuration of the PF.

It's up to the hypervisor (the software dependency) to assign one or more of these VFs to a virtual machine (VM) directly. The guest can then use the VF NIC ports via a VF driver (so there need to be VF drivers in the integration components) and traffic is sent directly (via DMA) in and out of the guest to the physical NIC, bypassing the virtual switch of the hypervisor completely. This reduces CPU load, increases performance of the host and as such also helps with network I/O to and from the guests; it's as if the virtual machine uses the physical NIC in the host directly. The hypervisor needs to support SR-IOV because it needs to know what PFs and VFs are and how they work.

[Diagram: SR-IOV physical and virtual functions bypassing the Hyper-V virtual switch]

So SR-IOV depends on both hardware (the NIC) and software (the hypervisor) that support it. It's not just the NIC by the way: SR-IOV also needs a modern BIOS with virtualization support. Most decent to high end server CPUs today support it, so that's not an issue, and likewise for the NIC; a modern quality NIC targeted at the virtualization market supports this. And of course SR-IOV also needs to be supported by the hypervisor. Until Windows 8, Hyper-V did not support SR-IOV, but now it does.
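As a hedged sketch of what turning this on looks like in PowerShell on a Windows 8 host: SR-IOV is enabled when you create the virtual switch and then weighted per virtual NIC. The switch, NIC and VM names are made up and the parameter names may still shift between beta and RTM, so verify them with Get-Help.

# SR-IOV has to be enabled when the virtual switch is created, it can't be switched on afterwards
New-VMSwitch -Name "SRIOV-Switch" -NetAdapterName "NIC1" -EnableIov $true

# Give the VM's network adapter an IOV weight above 0 so it gets a virtual function when one is available
Set-VMNetworkAdapter -VMName "MyVM" -IovWeight 50

# Check whether the host, NIC and switch actually support and use SR-IOV
Get-NetAdapterSriov
Get-VMSwitch -Name "SRIOV-Switch" | Format-List Iov*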

I've read in an HP document that you can have 1 to 6 PFs per device (NIC port) and up to 256 "virtual devices" or VFs per NIC today. In reality that might not be viable due to the overhead in hardware resources associated with it, so 64 or 32 VFs might be about the maximum. But still, 64 x 2 = 128 virtual devices from a dual port 10Gbps NIC is already pretty impressive to me. I don't know what the limits are for Hyper-V 3.0, but there will be limits on the number of SR-IOV NICs in a server and on the number of VFs per core and per host. I think they won't matter too much for most of us in reality, and as technology advances we'll only see these limits go up, as the SR-IOV standard itself allows for more VFs.

So where does SR-IOV fit in when compared to VMQ?

Well, it does away with some overhead that still remains with VMQ. VMQ took away the overload of a single core in the host having to handle all the incoming traffic, but the hypervisor still has to touch every packet coming in and out. With SR-IOV that issue is addressed, as it allows moving data in and out of a virtual machine to the physical NIC via Direct Memory Access (DMA). With this the CPU bottleneck is removed entirely from the process of moving data in and out of virtual machines; the virtual switch never touches it. For a nice explanation of SR-IOV take a look at the Intel SR-IOV Explanation video on YouTube.

Intel SR-IOV Explanation

VMQ Coalescing tried to address some of the pain of the next bottleneck of using VMQ, which is the large number of interrupts needed to handle the traffic if you have a lot of queues. But as we discussed already, this functionality is highly under-documented and a bit of a black art, especially when NIC teaming and some advanced NIC software issues come into play. Dynamic VMQ is supposed to take care of that black art and make it more reliable and easier.

Now, in contrast to VMQ & RSS, which don't mix in a Hyper-V environment, you can combine SR-IOV with RSS; they work together.
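If you want to see that combination at work, a hedged sketch: on the host check the SR-IOV side, and inside the guest check whether the adapter presented by the VF driver exposes RSS (whether it does depends entirely on the vendor's VF driver; the adapter name "Ethernet" is just an example).

# On the host: SR-IOV capable / enabled adapters
Get-NetAdapterSriov

# Inside the guest: is RSS offered on the adapter and is it enabled?
Get-NetAdapterRss
Enable-NetAdapterRss -Name "Ethernet"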

Benefits Versus The Competition

One of the benefits that Hyper-V 3.0 in Windows 8 has over the competition is that you can live migrate to a node that's not using SR-IOV. That's quite impressive.

Potential Drawback Of Using SR-IOV

A drawback is that by bypassing the Extensible Virtual Switch you might lose some features and extensions. Whether this is very important to you depends on your environment and needs. It would take me too far for this blog post, but Cisco seems to have enough aces up its sleeve with an integrated management & configuration interface that covers both the networking done in the extensible virtual switch and the SR-IOV NICs. You can read more on this over here: Cisco Virtual Networking: Extend Advanced Networking for Microsoft Hyper-V Environments. Basically they:

  1. Extend enterprise-class networking functions to the hypervisor layer with Cisco Nexus 1000V Series Switches.
  2. Extend physical network to the virtual machine with Cisco UCS VM-FEX.

Interesting times are indeed ahead. Only time will tell what the many vendors have to offer in those areas & for which customer profiles (needs/budgets).

A Possible Usage Scenario

You can send data traffic over SR-IOV if that suits your needs, but perhaps you'll want to keep that data traffic flowing over the extensible Hyper-V virtual switch. However, if you're using iSCSI to the guest, why not send that over the SR-IOV virtual function to reduce the load on the host? There is still a lot to learn and investigate on this subject. As a little side note: how are the HBAs in Hyper-V 3.0 made available to the virtual machines? SR-IOV, but the PCIe device here is a Fibre Channel HBA, not a NIC. I don't know any details but I think it's similar.

DVMQ In Windows 8 Hyper-V

VMQ or VMDq

To discuss Dynamic VMQ (DVMQ) we first need to talk about VMQ, or VMDq in Intel speak. VMQ lets the physical NIC create unique virtual network queues for each virtual machine (VM) on the host. These are used to pass network packets directly from the hypervisor to the VM. This reduces a lot of the CPU overhead on the host associated with network traffic, as it spreads the load over multiple cores. It's the same sort of CPU overhead you might see with 10Gbps networking on a server that isn't using RSS (see my previous blog post Know What Receive Side Scaling (RSS) Is For Better Decisions With Windows 8). Under high network traffic one core will hit 100% while the others are idle, which means you'll never get more than 3Gbps to 4Gbps of bandwidth out of your 10Gbps card as the CPU is the bottleneck.

VMQ leverages the NIC hardware for sorting, routing & packet filtering of the network packets from the external network directly to the virtual machines, and enables you to use 8Gbps or more of your 10Gbps bandwidth.

Now, the number of queues isn't unlimited and they are allocated to virtual machines on a first-come, first-served basis. So don't enable this for machines without heavy network traffic; you'll just waste queues. It is advised to use it only on those virtual machines with heavy inbound traffic, because VMQ is most effective at improving receive-side performance. So use your queues where they make a difference.
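On the PowerShell side this is roughly how you look at and steer VMQ today. A hedged sketch with made-up adapter and VM names; the VmqWeight trick (0 means the vNIC doesn't ask for a queue) is how I understand it works, so verify it against the documentation for your build.

# Which physical NICs support VMQ, whether it's enabled and how many queues they offer
Get-NetAdapterVmq

# Enable VMQ on a specific physical NIC
Enable-NetAdapterVmq -Name "NIC1"

# Let only the network-heavy VM ask for a queue, set 0 on the quiet ones
Set-VMNetworkAdapter -VMName "BusyVM"  -VmqWeight 100
Set-VMNetworkAdapter -VMName "QuietVM" -VmqWeight 0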

If you want to see what VMQ is all about take a look at this video by Intel.

Intel VMDq Explanation


The video gives you a nice animated explanation of the benefits. You can think of it as providing the same benefits to the host as RSS does. VMQ also prevents one core from being overloaded with interrupts due to high network I/O and as such becoming the bottleneck blocking performance. This is important, as you might end up buying more servers to handle certain loads because of it. Sure, with 1Gbps networking modern CPUs can handle a very big load, but with 10Gbps becoming ever more common this issue is a lot more acute again than it used to be. That's why you see RSS being enabled by default in Windows 2008 R2.

VMQ Coalescing – The Good & The Bad

There is a little dark side to VMQ. You've successfully relieved the bottleneck on the host for network filtering and sorting, but you now have a potential bottleneck where you need a CPU interrupt for every queue. The documentation states the following:

The network adapter delivers interrupts to the management operating system for each VMQ on the processor, based on the processor VMQ affinity. If the interrupts are spread across many processors, the number of interrupts delivered can grow substantially, until the overhead of interrupt handling can outweigh the benefit of using VMQ. To reduce the number of interrupts used, Microsoft has encouraged network adapter manufacturers to design for interrupt coalescing, also called shared interrupts. Using shared interrupts, the network adapter processes multiple queues with the same processor affinity in the same interrupt. This reduces the overall number of interrupts. At the time of this publication, all network adapters that support VMQ also support interrupt coalescing.

Now, coalescing works, but the configuration and the possible headaches it can give you are material for another blog post. It can be very tedious and you have to watch every action on your NIC and virtual switch configuration like a hawk or you'll get registry values overwritten, value types changed and so on. This leads to all kinds of issues, ranging from VMQ coalescing not working to your cluster going down the drain (worst case). The management of VMQ coalescing seems rather tedious and as such error prone. This is not good. Combine that with the sometimes problematic NIC teaming and you get a lot of possible and confusing permutations where things can go wrong. Use it when you can handle the complexity or it might bite you.
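For what it's worth, the knobs for spreading those queues and their interrupts over specific cores live on Set-NetAdapterVmq. A hedged sketch (the adapter name and core numbers are just examples; what plays nicely with coalescing on your particular NIC and driver is exactly the black art described above):

# Pin the VMQ processing for this NIC to 4 cores starting at core 2, keeping core 0 free for default queue work
Set-NetAdapterVmq -Name "NIC1" -BaseProcessorNumber 2 -MaxProcessors 4

# Verify what the queues ended up with
Get-NetAdapterVmq -Name "NIC1" | Format-List *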

Dynamic VMQ For A Better Future

Now bring on Dynamic VMQ (DVMQ). All I know about this is from the BUILD sessions and I'll revisit it once I get to test it for real with the beta or release candidate. I really hope this is better documented and doesn't come with the issues we've had with VMQ Coalescing. It brings the promise of easy and trouble free VMQ, where the load is evenly balanced among the cores and the burden of too many interrupts is avoided. A sort of auto scaling, if you like, that optimizes queue handling & interrupts.

[Slide: Dynamic VMQ rebalancing queues across cores]

That means it can replace VMQ Coalescing, and DVMQ will deal with this potential bottleneck on its own. Due to the issues I've had with coalescing I'm looking forward to that one. Take note that you should be able to live migrate virtual machines from a host with VMQ capabilities to a host that hasn't got them. You do lose the performance benefit, but I have no confirmation on this yet and as mentioned I'm waiting for the beta bits to try it out. It's like live migration between an SR-IOV enabled host and a non SR-IOV enabled host, which is confirmed as possible. On that front Microsoft seems to be doing a really good job, better than the competition.