Hyper-V Amigo Chat Ignite 2015

Many MVP’s attended Microsoft Ignite 2015 in Chicago to see what our future will look like.

Hyper-V Amigo Chat Microsoft Ignite 2015 Thumb 1 (2)

Carsten published the “Hyper-V Amigo Chat” we did right after Ignite. The conference was a blast for us all. Tired but happy we chat about storage space direct, Nano Server, ReFS, Dedupe, Azure Stack, … Enjoy!

Here’s the link to the video Hyper V Amigos Chat – Microsoft Ignite 2015 on Carsten’s blog.

Hyper-V Amigos Showcast Episode 9 – RDMA, RoCE, PFC and ETS

Just before Carsten Rachfahl and I left for Microsoft Ignite we recorded episode 9 of the Hyper-V Amigo Showcast. In this episode we’ll discuss SMB Direct over RoCE (RDMA over Converged Ethernet) which requires lossless Ethernet.

image

Data Center Bridging is the way to achieve this. It has four standards, PFC (802.1Qbb), ETS (802.1Qaz), CN (802.1Qau) and DCBx, but only two are important to us now.Priority Flow Control (PFC) is mandatory

image

and Enhanced Transmission Selection is optional (but very handy depending on your environment).

image

If you need more information on this start with these blogs on the subject. But without further delay here’s Hyper-V Amigos Showcast Episode 9 – RDMA, RoCE, PFC and ETS

DCB PFC Demo with SMB Direct over RoCE (RDMA)

In this blog post we’ll demo Priority Flow control. We’re using the demo comfit as described in SMB Direct over RoCE Demo – Hosts & Switches Configuration Example

There is also a quick video to illustrate all this on Vimeo. It’s not training course grade I know, but my time to put into these is limited.

I’m using Mellanox ConnectX-3 ethernet cards, in 2 node DELL PowerEdge R720 Hyper- cluster lab. We’ve configured the two ports for SMB Direct & set live migration to leverage them both over SMB Direct. For that purpose we tagged SMB Direct traffic with priority 4 and all other traffic with priority 1. We only made priority lossless as that’s required for RoCE and the other traffic will deal with not being lossless by virtue of being TCP/IP.

Priority Flow Control is about making traffic lossless. Well some traffic. While we’d love to live by Queens lyrics “I want it all, I want it all and I want it now” we are limited. If not so by our budgets, than most certainly by the laws of physics. To make sure we all understand what PFC does here’s a quick reminder: It tells the sending party to stop sending packets, i.e. pause a moment (in our case SMB Direct traffic) to make sure we can handle the traffic without dropping packets. As RoCE is for all practical purposes Infiniband over Ethernet and is not TCP/IP, so you don’t have the benefits of your protocol dealing with dropped packets, retransmission … meaning the fabric has to be lossless*. So no it DOES NOT tell non priority traffic to slow down or stop. If you need to tell other traffic to take a hike, you’re in ETS country :-)

* If any switch vendor tells you to not bother with DCB and just build (read buy their switches = $$$$$) a lossless fabric (does that exist?) and rely on the brute force quality of their products to have a lossless experience … could be an interesting experiment Smile.

Note: To even be able to start SMB Direct SMB Multichannel must be enabled as this is the mechanism used to identify RDMA capabilities after which a RDMA connection is attempted. If this fails you’ll fall back to SMB Multichannel. So you will have ,network connectivity.

You want RDMA to work and be lossless. To visualize this we can turn to the switch where we leverage the counter statistics to see PFC frames being send or transmitted. A lab example from a DELL PowerConnect 8100/N4000 series below.

image

To verify that RDMA is working as it should we should also leverage the Mellanox Adapter Diagnostic and native Windows RDMA Activity counters. First of all make sure RDMA is working properly. Basically you want the error counters to be zero and stay that way.

Mellanox wise these must remain at zero (or not climb after you got it right):

  • Responder CQE Errors
  • Responder Duplicate Request Received
  • Responder Out-Of-Order Sequence Received
  • … there’s lots of them …

image

Windows RDMA Activity wise these should be zero (or not climb after you got it right):

  • RDMA completion Queue Errors
  • RDMA connection Errors
  • RDMA Failed connection attempts

image

The event logs are also your friend as issues will log entries to look out for like

PowerShell is your friend (adapt severity levels according to your need!)

Get-WinEvent -ListLog “*SMB*” | Get-WinEvent | ? { $_.Level -lt 4 -and $_. Message -like “*RDMA*” } | FL LogName, Id, TimeCreated, Level, Message

Entries like this are clear enough, it ain’t working!

The network connection failed.
Error: The I/O request was canceled.
Connection type: Rdma
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.
 
RDMA interfaces are available but the client failed to connect to the server over RDMA transport.
Guidance:
Both client and server have RDMA (SMB Direct) adaptors but there was a problem with the connection and the client had to fall back to using TCP/IP SMB (non-RDMA).

 

To view PFC action in Windows we rely on the Mellanox Adapter QoS Counters

image

Below you’ll see the number of  pause frames being sent & received on each port. Click on the image to enlarge.

image

An important note trying to make sense of it all: … pauze and receive frames are sent and received hop to hop. So if you see a pause frame being sent on a server NIC port you should see them being received on the switch port and not on it’s windows target you are live migrating from. The 4 pause frames sent in the screenshot above are received by the switchport as you can see from the PFC Stats for that port.

image

People, if you don’t see errors in the error counters and event viewer that’s good. If you see the PFC Pause frame counters move up a bit that’s (unless excessive) also good and normal, that PFC doing it’s job making sure the traffic is lossless. If they are zero and stay zero for ever you did not buy a lossless fabric that doesn’t need DCB, it’s more likely you DCB/PFC is not working Winking smile and you do not have a lossless fabric at all. The counters are cumulative over time so they don’t reset to zero bar resetting the NIC or a reboot.

image

When testing feel free to generate lots of traffic all over the place on the involved ports & switches this helps with seeing all this in action and verifying RDMA/PFC works as it should. I like to use ntttcp.exe to generate traffic, the most recent version will let you really put a load on 10GBps and higher NICs. Hammer that network as hard as you can Winking smile.

Again a simple video to illustrate this on Vimeo.

SMB Direct over RoCE Demo – Hosts & Switches Configuration Example

As mentioned in Where SMB Direct, RoCE, RDMA & DCB fit into the stack this post’s only function is to give you an overview of the configurations used in the demo blogs/videos. First we’ll configure one Windows Server 2012 R2 host. I hope it’s clear this needs to be done on ALL hosts involved. The NICs we’re configuring are the 2 RDMA capable 10GbE NICs we’ll use for CSV traffic, live migration and our simulated backup traffic. These are Mellanox ConnectX-3 RoCE cards we hook up to a DCB capable switch. The commands needed are below and the explanation is in the comments. Do note that the choice of the 2 policies, priorities and minimum bandwidths are for this demo. It will depend on your environment what’s needed.

#Install DCB on the hosts
Install-WindowsFeature Data-Center-Bridging
#Mellanox/Windows RoCE drivers don't support DCBx (yet?), disable it.
Set-NetQosDcbxSetting -Willing $False
#Make sure RDMA is enable on the NIC (should be by default)
Enable-NetAdapterRdma –Name RDMA-NIC1
Enable-NetAdapterRdma –Name RDMA-NIC2
#Start with a clean slate
Remove-NetQosTrafficClass -confirm:$False
Remove-NetQosPolicy -confirm:$False

#Tag the RDMA NIC with the VLAN chosen for PFC network
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-1" -RegistryKeyword "VlanID" -RegistryValue 110
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-2" -RegistryKeyword "VlanID" -RegistryValue 120

#SMB Direct traffic to port 445 is tagged with priority 4
New-NetQosPolicy "SMBDIRECT" -netDirectPortMatchCondition 445 -PriorityValue8021Action 4
#Anything else goes into the "default" bucket with priority tag 1 :-)
New-NetQosPolicy "DEFAULT" -default  -PriorityValue8021Action 1

#Enable PFC (lossless) on the priority of the SMB Direct traffic.
Enable-NetQosFlowControl -Priority 4
#Disable PFC on the other traffic (TCP/IP, we don't need that to be lossless)
Disable-NetQosFlowControl 0,1,2,3,5,6,7

#Enable QoS on the RDMA interface
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC1"
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC2"

#Set the minimum bandwidth for SMB Direct traffic to 90% (ETS, optional)
#No need to do this for the other priorities as all those not configured
#explicitly goes in to default with the remaining bandwith.
New-NetQoSTrafficClass "SMBDirect" -Priority 4 -Bandwidth 90 -Algorithm ETS

We also show you in general how to setup the switch. Don’t sweat the exact syntax and way of getting it done. It differs between switch vendors and models (we used DELL Force10 S4810 and PowerConnect 8100 / N4000 series switches), it’s all very alike and yet very specific. The important thing is that you see how what you do on the switches maps to what you did on the hosts.

!Disable 802.3x flow control (global pause)- doesn't mix with DCB/PFC
workinghardinit#configure
workinghardinit(conf)#interface range tengigabitethernet 0/0 -47 
workinghardinit(conf-if-range-te-0/0-47)#no flowcontrol rx on tx on
workinghardinit(conf-if-range-te-0/0-47)# exit
workinghardinit(conf)# interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48-52)#no flowcontrol rx on tx off
workinghardinit(conf-if-range-fo-0/48-52)#exit

!Enable DCB & Configure VLANs
workinghardinit(conf)#service-class dynamic dot1p
workinghardinit(conf)#dcb enable
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config
workinghardinit#reload

!We use a <> VLAN per subnet
workinghardinit#configure
workinghardinit(conf)#interface vlan 110
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit(conf)#interface vlan 120
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit (conf-if-vl-vlan-id*)#exit


!Create & configure DCB Map Policy
workinghardinit(conf)#dcb-map SMBDIRECT
workinghardinit(conf-dcbmap-profile-name*)#priority-group 0 bandwidth 90 pfc on 
workinghardinit(conf-dcbmap-profile-name*)#priority-group 1 bandwidth 10 pfc off 
workinghardinit(conf-dcbmap-profile-name*)#priority-pgid 1 1 1 1 0 1 1 1
workinghardinit(conf-dcb-profile-name*)#exit 

!Apply DCB map to the switch ports & uplinks
workinghardinit(conf)#interface range ten 0/047
workinghardinit(conf-if-range-te-0/0-47)# dcb-map SMBDIRECT 
workinghardinit(conf-if-range-te-0/0-47)#exit
workinghardinit(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48,fo-0/52)# dcb-map SMBDIRECT
workinghardinit(conf-if-range-fo-0/48,fo-0/52)#exit
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config 

With the hosts and the switches configured we’re ready for the demos in the next two blog posts. We’ll show Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) in action with some tips on how to test this yourselves.