DCB ETS Demo with SMB Direct over RoCE (RDMA)

It’s time to demonstrate ETS in action! There is a quick video on ETS on Vimeo to show what it look like.

I’m using Mellanox ConnectX-3 ethernet cards, in 2 node DELL PowerEdge R720 Hyper- cluster lab. We’ve configured the two ports for SMB Direct & set live migration to leverage them both over SMB Direct. For the purpose of this demo we’ll generate non RDMA over RoCE (TCP/IP) traffic over these two 10Gbps ports to simulate a problematic scenario where all bandwidth is already being used and to see how Enhanced Transmission Selection (ETS) will help in this scenario.  I have done this with DELL Force 10, PowerConnect 8100, N4000 series or a mix of both. This particular demo was leveraging PC8132Fs. I use what’s available to me in a lab at the time of writing.

To achieve the network load this we leverage ntttcp.exe to generate the non RDMA TCP/IP traffic. Using the Mellanox QoS counters we visualize this. In blue you see the sending traffic from node A, in red the receiving traffic on Node B. Note that this traffic is tagged with priority 1. We tag SMB Direct traffic with priority 4.

image

You can see that both Mellanox cards are running at full bandwidth, 2* 10Gbps from node A to node B and it’s all none RDMA traffic. Also note that I’m hitting all 16 physical cores (hyper threading is enabled). By doing so I avoid being bottlenecked by a singe core as in contrast to RDMA traffic there’s no huge CPU offload going on here.image

As these are the cards I have assigned to use for live migration (depending on the setup also  CSV or SOFS traffic) over SMB Direct you’ll see that the competition for bandwidth will be fierce if we don’t have a mechanism to guide this to a desired outcome. That’s exactly what we leverage DCB with PFC and ETS for.

So let’s kick off live migration of 4 virtual machines with 10GB of memory each. That should take about 20 seconds on 2 * 10Gbps cards. We first live migrate them form node B to Node A. That’s in the reverse direction of where we are sending TCP/IP traffic. You see 10Gbps being used all over and this is expected.

image

Remember that the network is full duplex. That means that you can send at 10Gbps (TCP/IP from node A to node B, RDMA from node B to A and vice versa) and receive at 10Gbps on a port. Actually if the backplane of the switch is powerful enough you can do so on all ports. So this is normal. Node A is sending TCP/IP traffic to node B at line speed and Node B is sending SMB Direct traffic to node A (the live migration) at line speed.

But what if we live migrate over SMB Direct in the same direction as the TCP/IP traffic is going, from node A to node B? Well have a look. To me this looks awesome.

image

ETS kicks in immediately. We configure the minimum bandwidth for SMB Direct Traffic to be 90%. Anything left after that (10%) is given to other traffic, in this demo the TCP/IP traffic we generated. As priority 4 tagged RoCE traffic is also configured to be lossless with PFC you don’t have to worry about dropping packets under contention. Now think about this and how you can steer your traffic behavior at times when the resources need to be divided amongst competing workloads.

I hope you now have a better idea on why QoS is useful, how it works and that it indeed does work. While I have taken the opportunity to demonstrate this with SMB Direct over RoCE I’d like to stress that QoS is not just about RoCE where it’s  “mandatory” due to the fact it requires at least PFC. It’s a very much a needed tool that’s very beneficial in any converged scenario and that the optional ETS might be a very good idea, depending on your environment.

Again, to get you a better idea, here’s a short, quick video on ETS on Vimeo.

SMB Direct over RoCE Demo – Hosts & Switches Configuration Example

As mentioned in Where SMB Direct, RoCE, RDMA & DCB fit into the stack this post’s only function is to give you an overview of the configurations used in the demo blogs/videos. First we’ll configure one Windows Server 2012 R2 host. I hope it’s clear this needs to be done on ALL hosts involved. The NICs we’re configuring are the 2 RDMA capable 10GbE NICs we’ll use for CSV traffic, live migration and our simulated backup traffic. These are Mellanox ConnectX-3 RoCE cards we hook up to a DCB capable switch. The commands needed are below and the explanation is in the comments. Do note that the choice of the 2 policies, priorities and minimum bandwidths are for this demo. It will depend on your environment what’s needed.

#Install DCB on the hosts
Install-WindowsFeature Data-Center-Bridging
#Mellanox/Windows RoCE drivers don't support DCBx (yet?), disable it.
Set-NetQosDcbxSetting -Willing $False
#Make sure RDMA is enable on the NIC (should be by default)
Enable-NetAdapterRdma –Name RDMA-NIC1
Enable-NetAdapterRdma –Name RDMA-NIC2
#Start with a clean slate
Remove-NetQosTrafficClass -confirm:$False
Remove-NetQosPolicy -confirm:$False

#Tag the RDMA NIC with the VLAN chosen for PFC network
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-1" -RegistryKeyword "VlanID" -RegistryValue 110
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-2" -RegistryKeyword "VlanID" -RegistryValue 120

#SMB Direct traffic to port 445 is tagged with priority 4
New-NetQosPolicy "SMBDIRECT" -netDirectPortMatchCondition 445 -PriorityValue8021Action 4
#Anything else goes into the "default" bucket with priority tag 1 :-)
New-NetQosPolicy "DEFAULT" -default  -PriorityValue8021Action 1

#Enable PFC (lossless) on the priority of the SMB Direct traffic.
Enable-NetQosFlowControl -Priority 4
#Disable PFC on the other traffic (TCP/IP, we don't need that to be lossless)
Disable-NetQosFlowControl 0,1,2,3,5,6,7

#Enable QoS on the RDMA interface
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC1"
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC2"

#Set the minimum bandwidth for SMB Direct traffic to 90% (ETS, optional)
#No need to do this for the other priorities as all those not configured
#explicitly goes in to default with the remaining bandwith.
New-NetQoSTrafficClass "SMBDirect" -Priority 4 -Bandwidth 90 -Algorithm ETS

We also show you in general how to setup the switch. Don’t sweat the exact syntax and way of getting it done. It differs between switch vendors and models (we used DELL Force10 S4810 and PowerConnect 8100 / N4000 series switches), it’s all very alike and yet very specific. The important thing is that you see how what you do on the switches maps to what you did on the hosts.

!Disable 802.3x flow control (global pause)- doesn't mix with DCB/PFC
workinghardinit#configure
workinghardinit(conf)#interface range tengigabitethernet 0/0 -47 
workinghardinit(conf-if-range-te-0/0-47)#no flowcontrol rx on tx on
workinghardinit(conf-if-range-te-0/0-47)# exit
workinghardinit(conf)# interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48-52)#no flowcontrol rx on tx off
workinghardinit(conf-if-range-fo-0/48-52)#exit

!Enable DCB & Configure VLANs
workinghardinit(conf)#service-class dynamic dot1p
workinghardinit(conf)#dcb enable
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config
workinghardinit#reload

!We use a <> VLAN per subnet
workinghardinit#configure
workinghardinit(conf)#interface vlan 110
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit(conf)#interface vlan 120
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit (conf-if-vl-vlan-id*)#exit


!Create & configure DCB Map Policy
workinghardinit(conf)#dcb-map SMBDIRECT
workinghardinit(conf-dcbmap-profile-name*)#priority-group 0 bandwidth 90 pfc on 
workinghardinit(conf-dcbmap-profile-name*)#priority-group 1 bandwidth 10 pfc off 
workinghardinit(conf-dcbmap-profile-name*)#priority-pgid 1 1 1 1 0 1 1 1
workinghardinit(conf-dcb-profile-name*)#exit 

!Apply DCB map to the switch ports & uplinks
workinghardinit(conf)#interface range ten 0/047
workinghardinit(conf-if-range-te-0/0-47)# dcb-map SMBDIRECT 
workinghardinit(conf-if-range-te-0/0-47)#exit
workinghardinit(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48,fo-0/52)# dcb-map SMBDIRECT
workinghardinit(conf-if-range-fo-0/48,fo-0/52)#exit
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config 

With the hosts and the switches configured we’re ready for the demos in the next two blog posts. We’ll show Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) in action with some tips on how to test this yourselves.

SMB Direct with DCB, PFC, ETS … How do I know it works?!

A question that comes up over time, again and again, is how do you know SMB Direct is working. The question stems from a nagging feeling that configuring DCB is a bit of playing wizard’s apprentice and we might not completely know what we’re doing, i.e. lack of experience.

image

Many have suspected me of brewing up DCB configurations in a dark corner of the data center where no one else dares venture. But those are unsubstantiated rumors. But in coming blog posts we’ll address how to configure it end to end and we’ll show how to find out if it’s really working and how to test that.

Finding out if it really works, testing and monitoring isn’t magic. It boils down to using tools you know. Performance counters for RDMA Activity and SMB direct are natively available in Windows. Use them!The NIC vendors also provide very detailed counters, those are excellent and of great value when testing and confirming things work as they should. The latter is very important. Because after people are satisfied SMB Direct works they want to know if DCB is configured correctly. Does PFC work, are pause frames being send and received? Is it really lossless?  Does ETS really kick in when needed, do I get the minimum bandwidth I configured? These are very valid questions people struggle with. But the answer eludes many, almost like the question if the refrigerator light really goes out when you close the door.

It’s hard to do deep down in the network packets … that often requires a very specialized skillset and experience with packet analyzers etc. Nothing most of you can’t learn but often this is not a priority. But with some creativity and the performance counters on windows provided by the NIC vendors and the statistics counters on the switches you can demonstrate that both PFC & ETS doe work and kick in.

So in upcoming blogs & videos I’ll demonstrate the configuring SMB Direct over RoCE leveraging 2 parts of DCB:

  • PFC (Priority Flow Control) – mandatory for SMB Direct over RoCE
  • ETS (Enhanced Transmission Selection) – optional but I advise you to leveraged it for SMB Direct over RoCE

Actually, when doing true converged, no matter what route you go, QoS is not really optional any more.

The biggest challenge is to get people to wrap their heads around the concepts and it’s behavior. Once you do that you’ll understand how and why to configure it. It took me time and effort, there’s no way around it, but it’s well worth the effort.

Look, DCB is not 100% fully matured or perfect especially in large scale environments over > 2 or 3 hops. Frak, while I love tinkering, testing and playing with this stuff I have never been a “QoS first person”. If I can I thrown resources at the problem (CPU cycles; memory, bandwidth, …). QoS is like a gun. You only draw it when you must use it and than you’d better do it right otherwise you don’t touch it, bar for practice/training/ education. While perfection is not of this world and improvements are being worked on (ECN) it does work and deliver. How many of you had a large scale > 2 hops , > 20 switches deployment with FC, FCoE or iSCSI to worry about? So can it deliver what you need today in most scenarios? Yes! Can I fix the short comings of any random technologies? No. Can I leverage current technologies with great success despite this? Yes! So can you. There is a reason I get hired and paid. Trust me it’s not my looks, my bed side manner or charismatic appearance Winking smile.

Side note 1: I’m cannot possibly provide a switch configuration guide in a step by step fashion as the details vary by vendor, they can also be switch model/type specific and it all depends on your environment & needs. So no I cannot and will not attempt to write a bunch of these. This would be way too much work and way too expensive (time, hardware etc.), so unless I’m paid very generously to do so, you’re out of luck. It might be cheaper to hire me or to come to the free community sessions, presentations, ATE evenings and study up.

Optimizing Backups: PowerShell Script To Move All Virtual Machines On A Cluster Shared Volume To The Node Owing That CSV

When you are optimizing the number of snapshots to be taken for backups or are dealing with storage vendors software that leveraged their hardware VSS provider you some times encounter some requirements that are at odds with virtual machine mobility and dynamic optimization.

For example when backing up multiple virtual machines leveraging a single CSV snapshot you’ll find that:

  • Some SAN vendor software requires that the virtual machines in that job are owned by the same host or the backup will fail.
  • Backup software can also require that all virtual machines are running on the same node when you want them to be be protected using a single CSV snap shot. The better ones don’t let the backup job fail, they just create multiple snapshots when needed but that’s less efficient and potentially makes you run into issues with your hardware VSS provider.

image

VEEAM B&R v8 in action … 8 SQL Server VMs with multiple disks on the same CSV being backed up by a single hardware VSS writer snapshot (DELL Compellent 6.5.20 / Replay Manager 7.5) and an off host proxy Organizing & orchestrating backups requires some effort, but can lead to great results.

Normally when designing your cluster you balance things out a well as you can. That helps out to reduce the needs for constant dynamic optimizations. You also make sure that if at all possible you keep all files related to a single VM together on the same CSV.

Naturally you’ll have drift. If not you have a very stable environment of are not leveraging the capabilities of your Hyper-V cluster. Mobility, dynamic optimization, high to continuous availability are what we want and we don’t block that to serve the backups. We try to help out to backups as much a possible however. A good design does this.

If you’re not running a backup every 15 minutes in a very dynamic environment you can deal with this by live migrating resources to where they need to be in order to optimize backups.

Here’s a little PowerShell snippet that will live migrate all virtual machines on the same CSV to the owner node of that CSV. You can run this as a script prior to the backups starting or you can run it as a weekly scheduled task to prevent the drift from the ideal situation for your backups becoming to huge requiring more VSS snapshots or even failing backups. The exact approach depends on the storage vendors and/or backup software you use in combination with the needs and capabilities of your environment.

cls

$Cluster = Get-Cluster
$AllCSV = Get-ClusterSharedVolume -Cluster $Cluster

ForEach ($CSV in $AllCSV)
{
    write-output "$($CSV.Name) is owned by $($CSV.OWnernode.Name)"
    
    #We grab the friendly name of the CSV
    $CSVVolumeInfo = $CSV | Select -Expand SharedVolumeInfo
    $CSVPath = ($CSVVolumeInfo).FriendlyVolumeName

    #We deal with the \ being and escape character for string parsing.
    $FixedCSVPath = $CSVPath -replace '\\', '\\'

    #We grab all VMs that who's owner node is different from the CSV we're working with
    #From those we grab the ones that are located on the CSV we're working with
      $VMsToMove = Get-ClusterGroup | ? {($_.GroupType –eq 'VirtualMachine') -and ( $_.OwnerNode -ne $CSV.OWnernode.Name)} | Get-VM | Where-object {($_.path -match $FixedCSVPath)} 
     
    ForEach ($VM in $VMsToMove)

    {
        write-output "`tThe VM $($VM.Name) located on $CSVPath is not running on host $($CSV.OwnerNode.Name) who owns that CSV"
        write-output "`tbut on $($VM.Computername). It will be live migrated."
        #Live migrate that VM off to the Node that owns the CSV it resides on
        Move-ClusterVirtualMachineRole -Name $VM.Name -MigrationType Live -Node $CSV.OWnernode.Name
    }

Now there is a lot more to discuss, i.e. what and how to optimize for virtual machines that are clustered. For optimal redundancy you’ll have those running on different nodes and CSVs. But even beyond that, you might have the clustered VMs running on different cluster, which is the failure domain.  But I get the remark my blogs are wordy and verbose so … that’s for another time Winking smile