Upgrading Firmware Of Mellanox RoCE Cards for Final Windows Server 2012 RDMA Testing

Upgrading Mellanox Firmware

As we are preparing to roll out Windows Server 2012 R2 we are also updating the Mellanox cards we have. At the moment of writing the final driver & firmware for Windows Server 2012 R2 isn’t out yet, but let’s take a look at the process so you’re ready for prime time. If you need the latest public Mellanox driver for Windows Server 2012 R2 it’s here. Installing the driver is a straight forward process (upgrading servers with Mellanox drivers in place has been an issue however).

Mellanox provides good documentation on their site (http://www.mellanox.com/page/firmware_HCA_FW_identificationhttp://www.mellanox.com/page/firmware_NIC_FW_update) but for Mellanox newbies & many Windows server admins the process might be a bit more hands on than via a single installer they are used to.

What do you need?

The Windows Mellanox Firmware Tools (WinMFT). This gives you all the tools you need to get the job done.

It helps us with two things: find out Card ID and using that we can determine the PSID (Board ID) which tells us what firmware we need to down load.

The Win MFT tools are also used to burn the firmware.

Practical Tip 1: I have found that it pays to launch the installers Mellanox provides from an elevated command prompt as other wise UAC might trip up some clean finalization of a launched msi. The driver installer is more sensitive to this that the firmware installer.

Practical Tip 2: I you have OEM Mellanox cards from DELL/ HP/IBM … and they haven’t released the new firmware yet you can always burn your own. Please find the instructions here.

Walkthrough

I have a Windows Server 2012 R2 RTM running and I already installed the latest beta drivers I could find on the Mellanox site. But I’m a firmware version behind. So let’s fix this.

image

I put all the files I need in one handy spot

image

I launch an elevated command prompt

image

And from there I lauch the WinMFT installer

image

Just follow the instructions. image

image

image

image

image

Now you’re ready to determine the Device ID of your Mellanox card. From that same elevated command prompt navigate to C:Program FilesMellanoxWinMFT and run mst status

image

Grab the Device ID (marked in green) and execute following command:

flint -d /dev/mst/mt4099_pci_cr0 query

image

The Board ID (marked in yellow) is actually the PSID (more information here) will tell you what firmware to download from the Mellanox site). By the way, note this also tells you the current firmware.

You download the firmware from http://www.mellanox.com/page/firmware_download by selecting the card you have. In my case a ConnectX®-3 EN PCI-Ex Network Interface Cards (Ethernet Only NICs) and is use the Board ID to find my download.

image

All that’s left to do is burn the firmware image by executing the following command:

flint -d /dev/mst/mt4099_pci_cr0  -i C:SysAdminMellanoxFirmwarefw-ConnectX3-rel-2_30_3000-MCX312A-XCB_A2-A6-3.4.142_EN.bin burn

This requires you to confirm by typing in “y” and you can follow the process via a counter.image

When done you’ll need to reboot the server I order for the new firmware to actually get used. You can verify success by running the command again or by checking the information tab of you cards configuration settings. As you can see we’re running 2.30.3000 now.

image

So here you go. You might need to do this again after October 18th 2013 but you’re ready for now and all the testing you do is on the latest version of both the driver and the firmware. Happy testing!

Preventing Live Migration Over SMB Starving CSV Traffic in Windows Server 2012 R2 with Set-SmbBandwidthLimit

One of the big changes in Windows Server 2012 R2 is that all types of Live Migration can now leverage SMB 3.0 if the right conditions are met. That means that Multichannel & SMB Direct (RDMA) come in to play more often and simultaneously. Shared Nothing Live Migration & certain forms of Storage Live Migration are often a lot more planned due to their nature. So one can mitigate the risk by planning.  Good old standard Live Migration of virtual machines however is often less planned. It can be done via Cluster Aware Updating, to evacuate a host for hardware maintenance, via Dynamic optimization. This means it’s often automated as well. As we have demonstrated many times Live Migration can (easily) fill 20Gbps of bandwidth. If you are sharing 2*10Gbps NICs for multiple purposes like CSV, LM, etc. Quality of Service (QoS) comes in to play. There are many ways to achieve this but in our example here I’ll be using DCB  for SMB Direct with RoCE.

New-NetQosPolicy “CSV” –NetDirectPortMatchCondition 445 -PriorityValue8021Action 4
Enable-NetQosFlowControl –Priority 4
New-NetQoSTrafficClass "CSV" -Priority 4 -Algorithm ETS -Bandwidth 40
Enable-NetAdapterQos –InterfaceAlias SLOT41-CSV1+LM2
Enable-NetAdapterQos –InterfaceAlias SLOT42-LM1+CSV2
Set-NetQosDcbxSetting –willing $False

Now as you can see I leverage 2*10Gbps NIC, non teamed as I want RDMA. I have Failover/redundancy/bandwidth aggregation thanks to SMB 3.0. This works like a charm. But when leveraging Live Migration over SMB in Windows Server 2012 R2 we note that the LM traffic also goes over port 445 and as such is dealt with by the same QoS policy on the server & in the switches (DCB/PFC/ETS). So when both CSV & LM are going one how does one prevent LM form starving CSV traffic for example? Especially in Scale Out File Server Scenario’s this could be a real issue.

The Solution

To prevent LM traffic & CSV traffic from hogging all the SMB bandwidth ruining the SOFS party in R2 Microsoft introduced some new capabilities in Windows Server 2012 R2. In the SMBShare module you’ll find:

  • Set-SmbBandwidthLimit
  • Get-SmbBandwidthLimit
  • Remove-SmbBandwidthLimit

image

To use this you’ll need to install the Feature called SMB Bandwidth Limit via Server Manager or using PowerShell:  Add-WindowsFeature FS-SMBBW

You can limit SMB bandwidth for Virtual machine (Storage IO to a SOFS), Live Migration & Default (all the rest).  In the below example we set it to 8Gbps maximum.

Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1000MB

So there you go, we can prevent Live Migration from hogging all the bandwidth. Jose Baretto mentions this capability on his recent blog post on Windows Server 2012 R2 Storage: Step-by-step with Storage Spaces, SMB Scale-Out and Shared VHDX (Virtual). But what about Fibre Channel or iSCSI environments?  It might not be the total killer there as in SOFS scenario but still. As it turns out the Set-SmbBandwidthLimit also works in those scenarios. I was put on the wrong track by thinking it was only for SOFS scenarios but my fellow MVP Carsten Rachfahl kindly reminded me of my own mantra “Trust but verify” and as a result, I can confirm it even works to cap off Live Migration traffic over SMB that leverages RDMA (RoCE). So don’t let the PowerShell module name (SMBShare) fool you, it’s about all SMB traffic within the categories.

So without limit LM can use all bandwidth (2*10Gbps)

image

With Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1250MB you can see we max out at 10Gbps (2*5Gbps).
image

Some Remarks

I’d love to see a minimum bandwidth implementation of this (that could include safety buffer for spikes in CSV traffic with SOFS). The hard cap limit might lead to some wasted bandwidth. In other scenarios you could still get into trouble. What if you have 2*10Gbps available but one of those dies on you and you capped Live Migration Traffic at 16Gbps. With one NIC gone you’re potentially in trouble until the NIC has been replaced. OK, this is not a daily occurrence & depending on you environment & setup this is less or more of a potential issue.

Virtual Receive Side Scaling (vRSS) In Windows Server 2012 R2 Hyper-V

What is it?

One of the cool new features that takes scalability in Windows Server 2012 R2 Hyper-V to a new level is virtual Receive Side Scaling (vRSS). While since In Windows Server 2012, Receive Side Scaling (RSS) over SR-IOV is supported it’s best suited for some specialized environments that require the best possible speeds at the lowest possible latencies. While SR-IOV is great for performance it’s not as flexible as for example you can’t team them so if you need redundancy you’ll need to do guest NIC Teaming.

vRSS is supported on the VM network path (vNIC, vSwitch, pNIC) and allows VMs to scale better under heavier network loads. The lack of RSS support in the guest means that there is only one logical CPU (core 0) that has to deal with all the network interrupts.  vRSS avoid this bottleneck by spreading network traffic among multiple VM processors. Which is great news for data copy heavy environments.

What do you need?

Nothing special, it works with any NICs that supports VMQ and that’s about all 10Gbps NICs you can buy or posses. So no investment is needed. It’s basically the DVMQ capability on the host NIC that has VMQ capabilities that allows for vRSS to be exposed inside of the VM over the vSwitch. To take advantage of vRSS, VMs must be configured to use multiple cores, and they must support RSS => turn it on in the vNIC configuration in the guest OS and don’t try to use a home PC 1Gbps card Winking smile

image

vRSS is enabled automatically when the VM uses RSS on the VM network path. The other good news is that this works over NIC Teaming. So you don’t have to do in guest NIC Teaming.

What does it look like?

Now without SR-IOV it was a serious challenge to push that 10Gbps vNIC to it’s limit due to all the interrupt handling being dealt with by a single CPU core. Here’s what a VMs processor looks like under a sustained network load without vRSS. Not to shabby, but we want more Smile

image
As you can see the incoming network traffic has the be dealt with by good old vCore 0. While DVMQ allows for multiple processors on the host dealing with the interrupts for the VMs it still means that you have a single core per VM. That one core is possibly a limiting factor (if you can get the network throughput and storage IO, that is). vRSS deals with this limitation. Look at the throughput we got copying  lot of data to the VM below leveraging vRSS. Yeah that’s 8.5Gbps inside of a VM. Sweet Open-mouthed smile. I’m sure I can get to 10Gbps …

image

Teamed NIC Live Migrations Between Two Hosts In Windows Server 2012 Do Use All Members

Introduction

Between this blog NIC Teaming in Windows Server 2012 Brings Simple, Affordable Traffic Reliability and Load Balancing to your Cloud Workloads which states TCP/IP can recover from missing or out-of-order packets. However, out-of-order packets seriously impact the throughput of the connection. Therefore, teaming solutions make every effort to keep all the packets associated with a single TCP stream on a single NIC so as to minimize the possibility of out-of-order packet delivery. So, if your traffic load comprises of a single TCP stream (such as a Hyper-V live migration), then having four 1Gb/s NICs in an LACP team will still only deliver 1 Gb/s of bandwidth since all the traffic from that live migration will use one NIC in the team. However, if you do several simultaneous live migrations to multiple destinations, resulting in multiple TCP streams, then the streams will be distributed amongst the teamed NICsand other information out their such as support forum replies it is dictated that when you live migrate between two nodes in a cluster only one stream is active and you will never exceed the bandwidth of a single team member. When running some simple tests with a 10Gbps NIC team this seems true. We also know that you can consume near to all of the aggregated bandwidth of the members in a NIC Team for live migration if you these conditions are met:

1. The Live Migrations must not all be destined for the same remote machine. Live migration will only use one TCP stream between any pair of hosts. Since both Windows NIC Teaming and the adjacent switch will not spread traffic from a single stream across multiple interfaces live migration between host A and host B, no matter how many VMs you’re migrating, will only use one NIC’s bandwidth.

2. You must use Address Hash (TCP ports) for the NIC Teaming. Hyper-V Port mode will put all the outbound traffic, in this case, on a single NIC.

When we look at these conditions and compare them to the behavior we expect from the various forms of NIC teaming in Windows 2012 this is a bit surprising as one might expect all member to be involved. So let’s take a look at some of the different NIC Teaming setups.

Any form of NIC teaming with Hyper-V Port Mode

This one is easy as condition 2 above is very much true. In all my testing with any NIC team configuration in the Hyper-V Port mode traffic distribution algorithms I have not been able to exceed 10Gbps. I have seen no difference between dependent static of LACP mode or switch independent (active-active) for this condition. As you can see in the screenshot below, the traffic maxes out at 10Gbps.

clip_image002

clip_image004

This is also demonstrated in the following screenshots taking with the resource manager where you can see only half of the bandwidth of the Team is being used.

clip_image006

clip_image008

Exceeding a single NIC team member’s bandwidth when migrating between 2 nodes

The first condition of the previous heading doesn’t seem true. In some easy testing with a low number of virtual machines and not too much memory assigned you never exceed the bandwidth of one 10Gbps NIC team member. So on the surface, with some quick testing it might seem that way.

But during testing on a 2 node cluster with dual port 10Gbps cards and I have found the following

Switch Dependent LACP and Static

  1. Take a sufficient number of large memory virtual machines to exceed the capacity of a single 10Gbps pipe for a longer time (that way you’ll see it in the GUI).
  2. Live migrate them all from host A to host B (“Pause” with “Drain Roles” or “select all” + “Move”)
  3. Note that with a 2 node cluster there is no possibility to Live Migrate to multiple nodes simultaneous. It’s A to or B or B to A or both at the same time.

Basically it didn’t take long to see well over 10Gbpsbeing used. So the information out there seems to be wrong. Yes we can leverage the aggregated bandwidth when we migrate from host A to host B as long as we have enough memory assigned to the VMs and we migrate a sufficient number of them. Switch dependent teaming, whether it is static or LACP does its job as you would expect.

Let’s think about this. The number of VMs you need to lie migrate to see > 10Gbpss used is not fixed in stone. Could it be that there is some intelligence in the Live Migration algorithm where it decides to set up multiple streams when a certain number of virtual machines with sufficient memory are migrated as the sorting is mitigated by the amount of bandwidth that can leveraged? Perhaps he VMMS.EXE kicks off more streams when needed/beneficial? Further experimenting indicates that this is not the case. All you need is > 1 VM being live migrated. When looking at this in task manager you do need them to be of sufficient memory size and/or migrate enough of them to make it visible. I have also tried playing with the number of allowed simultaneous live migrations to see if this has an effect but I did not find one (i.e. 4, 6 or 12).

It looks like it is more like one TCP/IP connection per Live Migration that is indeed tied to one NIC member. So when you live migrate VMS between two hosts you see one VM live migration go over 1 member and the other the other as static/LACP switch dependent teaming did does its job. When you do enough live migrations of large VMs simultaneously you see this in Task Manager as shown below. In this case as each VM live migration stream sticks to a NIC team member you do not need to worry about out of order packets impacting performance.

clip_image010

But to make sure and to prevent falling victim to the fall victim to the limits of the task manger GUI during testing this behavior we also used performance monitor to see what’s going on. This confirms we are indeed using both 10Gbps NIC team member on both the target and the source host server. This is even the case with 2 virtual machines Live Migration. As long as it’s more than one and the memory assigned is enough to make the live migration last long enough you can see it in Task Manager; otherwise it might miss it. Performance Monitor however does not..clip_image012

clip_image002[4]

clip_image004[4]

This is interesting and frankly a bit unexpected as the documentation on this subject is not reflecting this. However it IS in agreement with the NIC teaming documented behavior for other tan Live Migration traffic. We took a closer look however and can reproduce this over and over again. Again we tested both switch dependent static and LACP modes and we found the behavior to be the same.

Switch Independent with Address Hash

Let’s test Live Migration over switch independent teaming with Address Hash. Here we see that the source server sends on the two member of the NIC team but that the target server receives on only one. This is normal behavior for switch independent teaming. But from the documentation we expect that one member on the source server would send and one member on the target server would receive. Not so.

Basically with Windows Server 2012 this doesn’t give you any benefit for throughput. You are limited to the bandwidth of one member, i.e. 10Gbps.

clip_image018

clip_image020

Red is Total Bytes received on the target host. It’s clear only one member is being used. Green is Bytes Sent/Sec on the source server. As you can see both team members are involved. In a switch independent scenario the receiving side limits the throughput. This is in agreement the documented behavior of switch independent NIC teaming with Address hash.

Helpful documentation on this is Windows Server 2012 NIC Teaming (LBFO) Deployment and Management (A Guide to Windows Server 2012 NIC Teaming for the novice and the expert).

Hope this helps sort out some of the confusion.