NIC Firmware/Driver Updates Reset Your RSS/VMQ Optimizations

When optimizing your RSS/VMQ settings for maximum performance you’ll normally (I hope) do this in PowerShell. Save that script, with comments on why you configure it that way, and make it part of your Hyper-V host deployment scripts.

Why? Automation is king, and you’ll need that script again for sure: NIC firmware/driver updates have a “tendency” to reset your RSS/VMQ optimizations back to their defaults. That’s a bit of a bummer if you have to redo all the work instead of having a script ready to go. I have seen many a deployment where the configuration was missing after firmware/driver upgrades, so please, check!
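Something like this minimal sketch does the trick, with hypothetical NIC names and processor numbers; the point is that the settings, the reasoning and a quick verification all live in one script you can re-run after every update:

```powershell
# RSS/VMQ optimization for Hyper-V hosts - re-run and verify after every
# NIC firmware/driver update. NIC name and processor numbers are placeholders.

# Why: keep VMQ interrupts for this vSwitch uplink off core 0 (the default)
# and spread them over a dedicated set of processors starting at LP 2.
Set-NetAdapterVmq -Name "Team-NIC1" -BaseProcessorNumber 2 -MaxProcessors 5

# Quick check after an update: did the defaults sneak back in?
Get-NetAdapterVmq | Format-Table -AutoSize
Get-NetAdapterRss | Format-Table -AutoSize
```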

image

Figure: Where has my optimized configuration gone after a driver/firmware upgrade?

The good news is this isn’t a showstopper: things will keep working, just without your optimizations. With VMQ, however, issues might occur depending on how the NIC team for your vSwitch is set up, so it’s important to get the teaming configuration right. With switch dependent teaming (LACP/Static) the NICs in the team need to use overlapping processor sets (Min Queues). With switch independent teaming the NICs in the team need to use non-overlapping processor sets, so you configure each NIC in the team to use different processors (Sum of Queues).
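As a sketch, on a hypothetical host with 2 x 8-core CPUs and hyperthreading enabled (32 logical processors, of which VMQ uses only the even-numbered ones), and with hypothetical team member names:

```powershell
# Switch INDEPENDENT teaming => Sum of Queues:
# each vSwitch team member gets its own, non-overlapping processor range.
Set-NetAdapterVmq -Name "Team-NIC1" -BaseProcessorNumber 2  -MaxProcessors 5   # LPs 2-10
Set-NetAdapterVmq -Name "Team-NIC2" -BaseProcessorNumber 12 -MaxProcessors 5   # LPs 12-20

# Switch DEPENDENT teaming (LACP/Static) => Min Queues:
# both members would instead share the SAME (overlapping) processor range, e.g.
#   Set-NetAdapterVmq -Name "Team-NIC1" -BaseProcessorNumber 2 -MaxProcessors 8
#   Set-NetAdapterVmq -Name "Team-NIC2" -BaseProcessorNumber 2 -MaxProcessors 8
```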

On top of that you might want to, or even should, keep the RSS cores and the VMQ cores separated from each other. SMB Direct for CSV/Live Migration traffic also helps here, as it offloads that CPU work to the NIC.
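Continuing the same hypothetical layout, the NICs carrying SMB Direct (CSV/Live Migration) traffic get their own RSS processor range, kept out of the VMQ ranges assigned to the vSwitch team above:

```powershell
# RSS for the SMB Direct NICs on their own cores, away from the VMQ cores.
Set-NetAdapterRss -Name "SMB-NIC1" -BaseProcessorNumber 22 -MaxProcessors 3   # LPs 22-26
Set-NetAdapterRss -Name "SMB-NIC2" -BaseProcessorNumber 28 -MaxProcessors 2   # LPs 28-30
```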

In Defense of Switch Independent Teaming With Hyper-V

For many old timers (heck, that includes me) NIC teaming with LACP mode was the best of the best, at least when it comes to teaming options. Other modes often led to active/passive setups or less than optimal aggregation of received network traffic. Basically, and perhaps oversimplified, the other options were only used if you had no other choice to get things working. Which we did a lot … I used Intel’s different teaming modes for various reasons in the past (before we had MLAG, VLT, vPC, …). Trying to use LACP where possible was a good approach in physical deployments and early virtualized environments, when 1Gbps networking dominated the datacenter and Windows did not have native support for LBFO.

But even LACP, even in those days, had some drawbacks. It’s the most demanding form of teaming. For one, it required switch stacking, which demands the same brand and type of switches and means you have no redundancy during firmware upgrades. That’s bad, as the only way to work around it is to move all workloads to another rack unit … if you even had the capability to do that! So even in days past we chose different modes of teaming out of need or because of the above limitations for high availability. But the superiority of NIC teaming with LACP still stands for many, and as modern switches support MLAG, VLT, etc., the drawback of stacking can be avoided. So does that mean LACP for NIC teaming is always the superior choice today?

Some argue it is, and they have now found support for that in Microsoft’s documentation about the CPS system. Look, even if Microsoft chose LACP in their solution, that choice is based on their particular design and the needs of that design; I do not concur that it is the best overall. It is, however, a valid and probably the right choice for their specific setup. I applaud the use of MLAG (when available to you at no or very low cost) to have all bases covered, but that does not mean LACP is the best choice for the majority of Hyper-V deployments. Microsoft actually agrees with me on this in their Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management guide. They state that the Switch Independent configuration with Dynamic distribution (or Hyper-V Port mode if on Hyper-V but not yet on W2K12R2) is the best default choice for teaming in both native and Hyper-V environments. I concur, even if perhaps not as strongly for native workloads (it depends). Exceptions to this:

  • Teaming is being performed in a VM (which should be rare),
  • Switch dependent teaming (e.g., LACP) is required by policy, or
  • Operation of a two-member Active/Standby team is required by policy.

In other words, in 2 out of 3 cases the reason is policy, not a technically superior solution …

Note that there are differences between Address Hash, Hyper-V Port and the new Dynamic distribution modes; the latter has improved bandwidth usage in W2K12R2, but you’ll need to read the white papers for the details. Use Dynamic as the default, it is the best. Also note that LACP/switch dependent teaming doesn’t mean you can send and receive to and from a VM over the aggregated bandwidth of all team members. Life is more complicated than that. So if that’s your main reason for going switch dependent and you think you’re done, beware.

Switch independent teaming is also way better for optimizing VMQ: you have more queues available (Sum of Queues) and the IO path is very predictable and optimized.

If you don’t control the switches, there’s a lot more cross-team communication involved to set up teaming for your hosts. There’s more complexity in these configurations, so more possibilities for errors or bugs. Operational ease is also a factor.

The biggest drawback could be that for receiving traffic you cannot get more than the bandwidth a single team member can deliver. That’s true, but optimizing received traffic has its own demands and might not always work out that well if the switch configuration isn’t that smart and capable. Do I ever miss the potential ability to aggregate incoming traffic? In real life I do not (yet), but in some configurations it could do a great job of optimizing that when needed.

When using 10Gbps or higher you’ll rarely be in a situation where the traffic received by a single VM exceeds 10Gbps, and if you want to handle that amount of traffic you really need to leverage DVMQ. And as said, switch independent teaming with Hyper-V Port or Dynamic mode gives you the most bang for the buck, as you have more queues available. This drawback is also mitigated a bit by the fact that modern NICs have far more queues available than they used to. But if you have more than one VM eating close to 10Gbps in a non-lab environment, and you’re planning to run more than two of those on a host, you need to start thinking about 40Gbps instead of aggregating a fistful of 10Gbps cables. Remember the golden rule: a single bigger pipe is always better than a bunch of small pipes.

When using 1Gbps you’ll hit that point sooner, and as 1Gbps isn’t a great fit for (Dynamic) VMQ anyway I’d say: sure, give LACP a spin to try and get a bit more bandwidth, but will it really matter? In native workloads it might, but with a vSwitch? Modern CPUs eat 1Gbps NICs for breakfast, so I would not bother with VMQ there. But when you’re tied to 1Gbps it’s probably due to budget constraints, and you might not even have stackable, MLAG, VLT or otherwise capable switches. The arguments can be made, it depends (see Don’t tell me “It depends”! But it does!). In any case, I’d start saving for 10Gbps.

Today, with the PC8100 series and the N4000 series (budget 10Gbps switches; yes, I know “budget” is relative, but in the 10Gbps world they offer outstanding value for money), I tend to set up MLAG with two of these per rack. This means we have all options and needs covered at no extra cost and without sacrificing redundancy under any condition. However, look at the needs of your VMs and the capabilities of your NICs before using LACP for teaming by default. The fact that switch independent teaming works with any combination of budget switches to get redundancy doesn’t mean it’s only to be used in such scenarios. That’s a perk for those without more advanced gear, not a consolation prize.

My best advice: do not over-engineer it. Engineer the best possible solution for the environment at hand. When choosing a default, it’s not about the best possible redundancy and bandwidth under certain conditions; it’s about the best possible redundancy and bandwidth under most conditions. That’s where switch independent teaming comes into its own, today more than ever!

There is one other very good, but luckily also very rare, case where LACP/switch dependent teaming will save you and switch independent won’t: dead switch ports, where the link stays up but the port no longer passes traffic. Switch independent teaming protects against NIC, switch and cable failures, but here it doesn’t help you, because failover is driven by link failures and the team doesn’t know about logical issues on a port.

For the majority of my Hyper-V deployments I do not use switch dependent / LACP teaming. The one situation where I did had to do with Windows NLB in combination with IGMP multicast.

Note: you can use VLT, MLAG or stacking and still leverage switch independent teaming; LACP or static switch dependent teaming is NOT mandatory even when it is possible.

Windows Server 2012 R2 Virtual RSS (vRSS) In Action

The Need For Speed

My clients or employers are normally into lots of data, large files and number crunching. That means they like CPU power, memory, bandwidth and IOPS. That doesn’t mean they want to spend fortunes, but it does mean I have a standing order to get every last ounce of optimization and efficiency out of commodity hardware.

Today that means we prefer to work with DELL generation 12 servers like the R620 and R720, which are the best possible value for money you can get. They deliver all features in the box at no extra cost or licensing hassle, and they can be fine-tuned and optimized for top performance. Windows Server 2012 R2 can handle higher loads than the hardware of today can throw at it, so you want all you can get without breaking the bank. Add Intel X520/X540, Mellanox (RoCE) or Chelsio (iWARP) 10Gbps cards and you’re ready to rumble with some nice PowerConnect 81XX series or even Force10 S4810 10Gbps switches.

CPU power is not an issue, just bring money. You have 8-12 core CPUs, you can scale up to 4-socket systems and scale out as well. Licensing is probably your biggest concern here due to cost.

Memory? DDR3 is readily available and, compared to the other costs, cheap. DDR4 is on the way. For maximum performance you possibly won’t use Dynamic Memory, to avoid a NUMA hit, but otherwise it helps with efficiency.

Network. Well, you’ve seen/heard/read me on topics like 10Gbps, RDMA/SMB Direct and (Dynamic) VMQ, but today we’ll look at virtual RSS. In Windows Server 2008 R2, VMQ helped us get beyond the scalability issue of a single core on the host having to handle all the interrupts for traffic going to the virtual machines. In Windows Server 2012 that got optimized with Dynamic VMQ. For workloads that need the absolute best performance and lowest latency we got SR-IOV support. That works fine, but it has some potentially important drawbacks on both the manageability front (NIC teaming needs to be done inside the guest) and the security front (it bypasses the virtual switch, so no ACLs or NVGRE).

Today with Windows Server 2012 R2 we have a new optimization for network traffic in a virtual environment: “Virtual RSS” or vRSS. I tell you, it’s sweet, and you’d do well to investigate it and enable it in your environment, especially if, like us, you move lots of data around and virtualization is the default. We don’t do physical unless there are very strict reasons, i.e. it cannot be virtualized. Otherwise … no excuses, the economics and benefits are just too good not to. It’s not as super low latency as SR-IOV, but depending on your needs you might never notice, and it plays nicely with all the other network features in Hyper-V.

Show me vRSS already!

Inside the guest we see vRSS in action as multiple cores are put into action to handle the interrupts of incoming traffic. This takes away the single vCPU bottleneck.
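If you want to check or enable this yourself, here’s a quick sketch from inside a Windows Server 2012 R2 guest (the adapter name “Ethernet” is an assumption); remember that vRSS also needs VMQ enabled on the host’s physical NICs:

```powershell
# Inside the guest: vRSS is simply RSS on the virtual NIC.
Enable-NetAdapterRss -Name "Ethernet"
Get-NetAdapterRss -Name "Ethernet"   # shows the processors the vNIC may use
```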

image

And the result is a sweet 17.2Gbps from VM1 to VM2. For the sharp-eyed ones amongst you: they are on the same host, so yes, in this case we get top bandwidth as the traffic doesn’t have to go across the wire over the 2*10Gbps NIC team but stays within the same host, or better, the same vSwitch.

image

The GUI is very friendly and suggests I could go to 100Gbps networking hardware. Well, not yet I’m afraid, but I’ve taken note.

Here’s what it looks like when the sending and receiving VMs are on different hosts and the vSwitch is connected to a 2*10Gbps team in switch independent, dynamic mode. You “only” get 10Gbps, as the team can send on all members but receive on only one. But that’s still fine.

image

So what does this look like on the host?

image

Here you see Dynamic VMQ in action. To prevent one core from becoming overloaded, the host puts more of them into service. This depends on the load and it’s dynamic, hence Dynamic VMQ. While VMQ was great, it was still a limiting factor, as you used to be tied to one core per VM.
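To see this on your own host, a quick look at the queue assignments on the uplink will do (the adapter name is a placeholder):

```powershell
# Lists the VMQ queues on this adapter and the VM/processor each one serves.
Get-NetAdapterVmqQueue -Name "Team-NIC1"
```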

This means that our network traffic processing is no longer limited by the OS, or rather by the use of a single core, either inside the VM or on the host. Our bottleneck now is the maximum throughput the NICs can deliver. In our 2-member NIC team that’s 20Gbps max under the right circumstances. Yup, line speed. Need more? Throw 40Gbps NICs at the problem, or even 100Gbps pretty soon. Windows Server 2012 R2 is ready for this.

Sharing the wealth

Now, to make sure a bunch of these VMs on your cluster don’t starve the rest of the VM population with their greed for and lust after bandwidth, you have the option of using bandwidth management on the VM NIC.

image

This is one of the cases where this option can be very useful.
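If you prefer PowerShell over the GUI for this, a minimal sketch (the VM name is a placeholder; the parameter takes bits per second):

```powershell
# Cap the vNIC of a bandwidth-hungry VM at roughly 4 Gbps so it cannot
# starve the other VMs behind the same NIC team.
Set-VMNetworkAdapter -VMName "BigDataVM01" -MaximumBandwidth 4000000000
```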

VM mobility without boundaries

Also consider this capability and these types of heavy workloads combined with SMB Direct. SMB Direct offloads the processing of CSV traffic, Live Migration, Shared Nothing Live Migration and, under certain conditions, Storage Live Migration to the NIC by leveraging RDMA. In other words, it doesn’t tax your host CPUs. This means you can have these kinds of network traffic loads going on and still live migrate at will. Scalable VM mobility, anyone? You’ll understand what tremendous network loads the combination of Windows Server 2012 R2 Hyper-V hosts and guests can handle.
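A small sketch of the host side of that on Windows Server 2012 R2: tell Live Migration to use SMB so it can ride over SMB Direct when RDMA-capable NICs are present.

```powershell
# Let Live Migration use SMB, and thus SMB Direct (RDMA) where available.
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

# Verify which NICs SMB sees as RDMA capable.
Get-SmbClientNetworkInterface | Where-Object { $_.RdmaCapable }
```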

It’s virtualization, not magic

OK, time for a little reality check, just in case it’s needed. Virtualization is technology, not magic. For all of you “thinking” you can push 20Gbps into 20 VMs simultaneously on a single host over just a single team of 2*10Gbps … ah well, get real, OK?

Teaser

I can talk about the benefits of vRSS all day and show you some more screenshots, but I’m working on “vRSS, The Movie”. Perhaps even a sequel already.

DVMQ In Windows 8 Hyper-V

VMQ or VMDq

To discuss Dynamic VMQ (DVMQ) we first need to talk about VMQ, or VMDq in Intel speak. VMQ lets the physical NIC create unique virtual network queues for each virtual machine (VM) on the host. These are used to pass network packets directly from the hypervisor to the VM. This reduces a lot of the CPU overhead on the host associated with network traffic, as it spreads the load over multiple cores. It’s the same sort of CPU overhead you might see with 10Gbps networking on a server that isn’t using RSS (see my previous blog post Know What Receive Side Scaling (RSS) Is For Better Decisions With Windows 8). Under high network traffic one core will hit 100% while the others sit idle. This means you’ll never get more than 3Gbps to 4Gbps of bandwidth out of your 10Gbps card, as that CPU core is the bottleneck.

VMQ leverages the NIC hardware for sorting, routing and packet filtering of the network packets from an external virtual machine network directly to virtual machines, and enables you to use 8Gbps or more of your 10Gbps bandwidth.

Now, the number of queues isn’t unlimited, and they are allocated to virtual machines on a first-come, first-served basis. So don’t enable this for machines without heavy network traffic, you’ll just waste queues. It is advised to use it only on those virtual machines with heavy inbound traffic, because VMQ is most effective at improving receive-side performance. So use your queues where they make a difference.
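If you want to be explicit about it, a minimal sketch (the VM name is a placeholder): turning the VMQ weight down to 0 on a light-traffic VM keeps it from claiming a queue that a busier VM could use.

```powershell
# Don't hand a hardware queue to a VM that doesn't need one.
Set-VMNetworkAdapter -VMName "LightDutyVM01" -VmqWeight 0
# The default weight is 100: the vNIC gets a queue if one is available.
```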

If you want to see what VMQ is all about take a look at this video by Intel.

Intel VMDq Explanation

 

The video gives you a nice animated explanation of the benefits. You can think of it as providing the same benefits to the host as RSS does. VMQ also prevents one core from being overloaded with interrupts due to high network IO and, as such, becoming the bottleneck that blocks performance. This is important, as you might otherwise end up buying more servers to handle certain loads. Sure, with 1Gbps networking modern CPUs can handle a very big load, but with 10Gbps becoming ever more common this issue is a lot more acute again than it used to be. That’s why you see RSS being enabled by default in Windows 2008 R2.

VMQ Coalescing – The Good & The Bad

There is a little dark side to VMQ. You’ve successfully relieved the bottleneck on the host for network filtering and sorting, but you now have a potential bottleneck where you need a CPU interrupt for every queue. The documentation states the following:

The network adapter delivers interrupts to the management operating system for each VMQ on the processor, based on the VMQ processor affinity. If the interrupts are spread across many processors, the number of interrupts delivered can grow substantially, until the overhead of interrupt handling can outweigh the benefit of using VMQ. To reduce the number of interrupts used, Microsoft has encouraged network adapter manufacturers to design for interrupt coalescing, also called shared interrupts. Using shared interrupts, the network adapter processes multiple queues with the same processor affinity in the same interrupt. This reduces the overall number of interrupts. At the time of this publication, all network adapters that support VMQ also support interrupt coalescing.

Now, coalescing works, but the configuration and the possible headaches it can give you are material for another blog post. It can be very tedious: you have to watch every action on your NIC and virtual switch configuration like a hawk or you’ll get registry values overwritten, value types changed and such. This leads to all kinds of issues, ranging from VMQ coalescing not working to your cluster going down the drain (worst case). The management of VMQ coalescing seems rather tedious and, as such, error prone. This is not good. Combine that with the sometimes problematic NIC teaming and you get a lot of possible and confusing permutations where things can go wrong. Use it when you can handle the complexity, or it might bite you.

Dynamic VMQ For A Better Future

Now bring on Dynamic VMQ (DVMQ). All I know about this is from the Build sessions, and I’ll revisit it once I get to test it for real with the beta or release candidate. I really hope this is better documented and doesn’t come with the issues we’ve had with VMQ coalescing. It brings the promise of easy and trouble-free VMQ, where the load is evenly balanced among the cores and the burden of too many interrupts is avoided. A sort of auto-scaling, if you like, that optimizes queue handling and interrupts.

image

That means it can replace VMQ coalescing, as DVMQ will deal with this potential bottleneck on its own. Due to the issues I’ve had with coalescing, I’m looking forward to that one. Take note that you should be able to live migrate virtual machines from a host with VMQ capabilities to a host without them. You do lose the performance benefit, but I have no confirmation of this yet and, as mentioned, I’m waiting for the beta bits to try it out. It’s like Live Migration between an SR-IOV enabled host and a non-SR-IOV enabled host, which is confirmed to be possible. On that front Microsoft seems to be doing a really good job, better than the competition.