DCB PFC Demo with SMB Direct over RoCE (RDMA)

In this blog post we’ll demo Priority Flow control. We’re using the demo comfit as described in SMB Direct over RoCE Demo – Hosts & Switches Configuration Example

There is also a quick video to illustrate all this on Vimeo. It’s not training course grade I know, but my time to put into these is limited.

I’m using Mellanox ConnectX-3 ethernet cards, in 2 node DELL PowerEdge R720 Hyper- cluster lab. We’ve configured the two ports for SMB Direct & set live migration to leverage them both over SMB Direct. For that purpose we tagged SMB Direct traffic with priority 4 and all other traffic with priority 1. We only made priority lossless as that’s required for RoCE and the other traffic will deal with not being lossless by virtue of being TCP/IP.

Priority Flow Control is about making traffic lossless. Well some traffic. While we’d love to live by Queens lyrics “I want it all, I want it all and I want it now” we are limited. If not so by our budgets, than most certainly by the laws of physics. To make sure we all understand what PFC does here’s a quick reminder: It tells the sending party to stop sending packets, i.e. pause a moment (in our case SMB Direct traffic) to make sure we can handle the traffic without dropping packets. As RoCE is for all practical purposes Infiniband over Ethernet and is not TCP/IP, so you don’t have the benefits of your protocol dealing with dropped packets, retransmission … meaning the fabric has to be lossless*. So no it DOES NOT tell non priority traffic to slow down or stop. If you need to tell other traffic to take a hike, you’re in ETS country 🙂

* If any switch vendor tells you to not bother with DCB and just build (read buy their switches = $$$$$) a lossless fabric (does that exist?) and rely on the brute force quality of their products to have a lossless experience … could be an interesting experiment Smile.

Note: To even be able to start SMB Direct SMB Multichannel must be enabled as this is the mechanism used to identify RDMA capabilities after which a RDMA connection is attempted. If this fails you’ll fall back to SMB Multichannel. So you will have ,network connectivity.

You want RDMA to work and be lossless. To visualize this we can turn to the switch where we leverage the counter statistics to see PFC frames being send or transmitted. A lab example from a DELL PowerConnect 8100/N4000 series below.


To verify that RDMA is working as it should we should also leverage the Mellanox Adapter Diagnostic and native Windows RDMA Activity counters. First of all make sure RDMA is working properly. Basically you want the error counters to be zero and stay that way.

Mellanox wise these must remain at zero (or not climb after you got it right):

  • Responder CQE Errors
  • Responder Duplicate Request Received
  • Responder Out-Of-Order Sequence Received
  • … there’s lots of them …


Windows RDMA Activity wise these should be zero (or not climb after you got it right):

  • RDMA completion Queue Errors
  • RDMA connection Errors
  • RDMA Failed connection attempts


The event logs are also your friend as issues will log entries to look out for like

PowerShell is your friend (adapt severity levels according to your need!)

Get-WinEvent -ListLog “*SMB*” | Get-WinEvent | ? { $_.Level -lt 4 -and $_. Message -like “*RDMA*” } | FL LogName, Id, TimeCreated, Level, Message

Entries like this are clear enough, it ain’t working!

The network connection failed.
Error: The I/O request was canceled.
Connection type: Rdma
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.
RDMA interfaces are available but the client failed to connect to the server over RDMA transport.
Both client and server have RDMA (SMB Direct) adaptors but there was a problem with the connection and the client had to fall back to using TCP/IP SMB (non-RDMA).


To view PFC action in Windows we rely on the Mellanox Adapter QoS Counters


Below you’ll see the number of  pause frames being sent & received on each port. Click on the image to enlarge.


An important note trying to make sense of it all: … pauze and receive frames are sent and received hop to hop. So if you see a pause frame being sent on a server NIC port you should see them being received on the switch port and not on it’s windows target you are live migrating from. The 4 pause frames sent in the screenshot above are received by the switchport as you can see from the PFC Stats for that port.


People, if you don’t see errors in the error counters and event viewer that’s good. If you see the PFC Pause frame counters move up a bit that’s (unless excessive) also good and normal, that PFC doing it’s job making sure the traffic is lossless. If they are zero and stay zero for ever you did not buy a lossless fabric that doesn’t need DCB, it’s more likely you DCB/PFC is not working Winking smile and you do not have a lossless fabric at all. The counters are cumulative over time so they don’t reset to zero bar resetting the NIC or a reboot.


When testing feel free to generate lots of traffic all over the place on the involved ports & switches this helps with seeing all this in action and verifying RDMA/PFC works as it should. I like to use ntttcp.exe to generate traffic, the most recent version will let you really put a load on 10GBps and higher NICs. Hammer that network as hard as you can Winking smile.

Again a simple video to illustrate this on Vimeo.

I Can’t Afford 10GBps For Hyper-V And Other Lies

You’re wrong

There, I said it. Sure you can. Don’t think you need to be a big data center to make this happen. You just need to think and work outside the box a bit and when you’re not a large enterprise, that’s a bit more easy to do. Don’t do it like a big name brand, traditionalist partner would do it (strip & refit the entire structural cabling in the server room, high end gear with big margins everywhere). You’re going for maximum results & value, not sales margins and bonuses.

I would even say you can’t afford to stay on 1Gbps much longer or you’ll be dealing with the fall out of being stuck in the past. Really some of us are already look at > 10Gbps connections to the servers, actually. You need to move from 1Gbps or you’ll be micro managing a way around issues sucking all the fun out of your work with ever diminishing results and rising costs for both you and the business.

Give your Windows Server 2012R2 Hyper-V environment the bandwidth it needs to shine and make the company some money. If all you want to do is to spent as little money as possible I’m not quite sure what your goal is? Either you need it or you don’t.  I’m convinced we need it. So we must get it. Do what it takes. Let me show you one way to get what you need.

Sounds great what do I do?

Take heart, be brave and of good courage! Combine it with skills, knowledge & experience to deliver a 10Gbps infrastructure as part of ongoing maintenance & projects. I just have to emphasize that some skills are indeed needed, pure guts alone won’t do it.

First of all you need to realize that you do not need to rip and replace your existing network infrastructure. That’s very hard to get approval for, takes too much time and rapidly becomes very expensive in both dollars and efforts. Also, to be honest, quiet often you don’t have that kind of pull. I for one certainly do not. And if I’d try to do that way it takes way too many meetings, diplomacy, politics, ITIL, ITML & Change Approval Board actions to make it happen. This adds to the cost even more, both in time and money. So leave what you have in place, for this exercise we assume it’s working fine but you can’t afford to have wait for many hours while all host drains in 6 node cluster and you need to drain all of them to add memory. So we have a need (OK you’ll need a better business case than this but don’t make to big a deal of it or you’ll draw unwanted attention) and we’ve taking away the fear factor of fork lift replacing the existing network which is a big risk & cost.

So how do I go about it?

Start out as part of regular upgrades, replacement or new deployments. The money is their for those projects. Make sure to add some networking budget and leverage other projects need to support the networking needs.

Get a starter budget for a POC of some sort, it will get your started to acquire some more essential missing  bits.

By reasonably cheap switches of reasonable port count that do all you need. If they’re readily available in a frame work contract, great. You can get it as part of the normal procedures. But if you want to nock another 6% to 8% of the cost order them directly from the vendor. Cut out the middle man.

Buy some gear as part of your normal refresh cycle. Adapt that cycle life time a bit to suit your needs where possible. Funding for operation maintenance & replacement should already be in place right?

Negotiate hard with your vendor. Listen, just like in the storage world, the network world has arrived at a point where they’re not going to be making tons of money just because they are essential. They have lots of competition and it’s only increasing. There are deals to be made and if you chose the right hardware it’s gear that won’t lock you into proprietary cabling, SPF+ modules and such. Or not to much anyway Smile.

Design options and choices

Small but effective

If you’re really on minimal budget just introduce redundant (independent) stand alone 10Gbps switches for the East-West traffic that only runs between the nodes in the data center. CSV, Live Migration, backup. You don’t even need to hook it up to the network for data traffic, you only need to be able to remotely manage it and that’s what they invented Out Off Band (OOB) ports for. See also an old post of mine Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4). In the smallest cheapest scenario I use just 2 independent switches. In the other scenario build a 2 node spine and the leaf. In my examples I use DELL network gear. But use whatever works best for your needs and your environment. Just don’t go the “nobody ever got fired for buying XXX” route, that’s fear, not courage! Use cheaper NetGear switches if that fits your needs. Your call, see my  recent blog post on this 10Gbps Cheap & Without Risk In Even The Smallest Environments.

Medium sized excellence

First of all a disclaimer: medium sized isn’t a standardized way of measuring businesses and their IT needs. There will be large differences depending on you neck of the woods Smile.

Build your 10Gbps infrastructure the way you want it and aim it to grow to where it might evolve. Keep it simple and shallow. Go wide where you need to. Use the Spine/Leaf design as a basis, even if what you’re building is smaller than what it’s normally used for. Borrow the concept. All 10Gbps traffic, will be moving within that Spine/Leaf setup. Only client server traffic will be going out side of it and it’s a small part of all traffic. This is how you get VM mobility, great network speeds in the server room avoiding the existing core to become a bandwidth bottleneck.

You might even consider doing Infiniband where the cost/Gbps is very attractive and it will serve you well for a long time. But it can be a hard sell as it’s “another technology”.

Don’t panic, you don’t need to buy a bunch of Nexus 7000’s  or Force10 Z9000 to do this in your moderately sized server room. In medium sized environment I try to follow the “Spine/Leaf” concept even if it’s not true ECMP/CLOSS, it’s the principle. For the spine choose the switches that fit your size, environment & growth. I’ve used the Force10 S4810 with great success and you can negotiate hard on the price. The reasons I went for the higher priced Force10 S4810 are:

  • It’s the spine so I need best performance in that layer so that’s where I spend my money.
  • I wanted VLT, stacking is a big no no here. With VLT I can do firmware upgrades without down time.
  • It scales out reasonably by leveraging eVLT if ever needed.

For the ToR switches I normally go with PowerConnect 81XX F series or the N40XXF series, which is the current model. These provide great value for money and I can negotiate hard on price here while still getting 10Gbps with the features I need. I don’t need VLT as we do switch independent NIC teaming with Windows. That gives me the best scalability wit DVMQ & vRSS and allows for firmware upgrades without any network down time in the rack. I do sacrifice true redundant LACP within the rack but for the few times I might really need to have that I could go cross racks & still maintain a rack a failure domain as the ToRs are redundant. I avoid stacking, it’s a single point of failure during firmware upgrades and I don’t like that. Sure I can could leverage the rack a domain of failure to work around that but that’s not very practical for ordinary routine maintenance. The N40XXF also give me the DCB capabilities I need for SMB Direct.

Hook it up to the normal core switch of the existing network, for just the client/server.(North/South) traffic. I make sure that any VLANs used for CSV, live migration, can’t even reach that part of the network.  Even data traffic (between virtual machines, physical servers) goes East-West within your Spine/Leave and never goes out anyway unless you did something really weird and bad.

As said, you can scale out VLT using eVLT that creates a port channel between 2 VLT domains. That’s nice. So in a medium sized business you’re pretty save in growth. If you grow beyond this, we’ll be talking about a way larger deployment anyway and true ECMP/CLOS and that’s not the scale I’m dealing with where. For most medium sized business or small ones with bigger needs this will do the job. ECMP/CLOS Spine/leaf actually requires layer 3 in the design and as you might have noticed I kind if avoid that. Again, to get to a good solution today instead of a real good solution next year which won’t happen because real good is risky and expensive. Words they don’t like to hear above your pay grade.

The picture below is just for illustration of the concept. Basically I normally have only one VLT domain and have two 10Gbps switches per rack. This gives me racks as failure domains and it allows me to forgo a lot of extra structural cabling work to neatly provide connectivity form the switches  to the server racks .image

You have a  scalable, capable & affordable 10Gbps or better infrastructure that will run any workload in style.. After testing you simply start new deployments in the Spine/Leaf and slowly mover over existing workloads. If you do all this as part of upgrades it won’t cause any downtime due to the network being renewed. Just by upgrading or replacing current workloads.

The layer 3 core in the picture above is the uplink to your existing network and you don’t touch that. Just let if run until there nothing left in there and you can clean it up or take it out. Easy transition. The core can be left in place or replaces when needed due to age or capabilities.

To keep things extra affordable

While today the issues with (structural) 10Gbps copper CAT6A and NICs/Switches seem solved, when I started doing 10Gbps fibre cabling of Copper Twinax Direct Attach was the only way to go. 10GBaseT wasn’t an option yet and I still love the flexibility of fibre, it consumes less space and weighs less then CAT6A. Fibre also fits easily in existing cable infrastructure. Less hassle. But CAT6A will work fine today, no worries.

If you decide to do fibre, buy OM3, you can get decent, affordable cabling on line. Order it as consumable supplies.

Spend some time on the internet and find the SFP+ that works with your switches to save a significant amount of money. Yup some vendor switches work with compatible non OEM branded SPF+ modules. Order them as consumable supplies, but buy some first to TEST! Save money but do it smart, don’t be silly.

For patch cabling 10Gbps Copper Twinax Direct Attach works great for short ranges and isn’t expensive, but the length is limited and they get thicker & more sturdy and thus unwieldy by length. It does have it’s place and I use them where appropriate.

Isn’t this dangerous?

Nope. Technology wise is perfectly sound and nothing new. Project wise it delivers results, fast, effective and without breaking the bank. Functionally you now have all the bandwidth you need to stop worrying and micromanaging stuff to work around those pesky bandwidth issues and focus on better ways of doing things. You’ve given yourself options & possibilities. Yay!

Perhaps the approach to achieve this isn’t very conventional. I disagree. Look, anyone who’s been running projects & delivering results knows the world isn’t that black and white. We’ve been doing 10Gbps for 4 years now this way and with (repeated) great success while others have to wait for the 1Gbps structural cabling to be replaced some day in the future … probably by 10Gbps copper in a 100Gbps world by the time it happens. You have to get the job done. Do you want results, improvements, progress and success or just avoid risk and cover your ass? Well then, choose & just make it happen. Remember the business demands everything at the speed of light, delivered yesterday at no cost with 99.999% uptime.  So this approach is what they want, albeit perhaps not what they say.

The Hyper V Amigos Showcast Episode 3: Live Migration

Here’s the 3rd episode of the Hyper-V Amigos show cast. As Carsten was overwhelmed with work (running your own business is very hard work) and had some issues with his storage spaces lab due to testing we’re discussing live migration optimizations in this installment.

 Carsten Rachfahl and I had a lot of fun again, even during the second take, yes we needed one. Apparently these software thingies require me to click on “record” Smile as there is no intelligent agent yet to act on my intention.

Carsten & I discussing & showing some live migration optimizations


I have written many blog posts on this subject already and I’m sure I’ll write more. Optimizing the use of the hypervisor (Hyper-V) across the entire storage, compute/memory & networking stack is one of my specialties and I enjoy this part of my job very much. I also like to share this information as real.

I’m sure you’ll agree that Hyper-V has come a long way in short period of time and I’m pretty sure we’re going to see Microsoft continue this pace for quite a while.

I have a blog post coming out (it’s in the queue) on my 4 top recommendations for optimal live migrations but here’s a search of relevant blog posts on this topic, and we referred to some of them during our show cast:


When you’re done reading al these posts on live migration you’ll have earned a nice refreshing beverage of your choice Mug.

One more thing, if you like these show casts let us know! Last but not least, I’m doing a demo heavy (only) session at ITProceed on June 12th 2014. Many local experts, community members  and I will be around afterwards to discuss these technologies.

Don’t Let Live Migration Settings Behavior Confuse You

Configuring live migration settings on a cluster

In the cluster under Networks, Live Migration Settings you can select what networks are available for live migration and their order of preference.image


Basically this setting determines what NICs can be used. It also determines and in what order of preference the available networks can be used by Live Migration. It does not determine bandwidth aggregation or failover. All it does is provide the order in which the redundant networks will be used. It’s up to the cluster service NETFT, Multichannel or SMB Direct to provide the bandwidth aggregation if possible As you can see we use LM/CSV over SMB and as our two NICS are RDMA capable 10Gbps NIC, multichannel will discover RDMA capabilities & leverage SMB Direct if it can be establish otherwise it will just stick with multichannel. If you would  team that NIC shows up as just one network. Also not that if you lose a NIC during live migration it might fail for some VMs under certain scenarios, but you cluster nodes will maintain the capability & recover. The  names of the network reflect this: LM1+CSV2 & CSV1+LM2 will be used both but if for some reason multichannel goes completely south the names reflect the metrics of these networks. The lowest is CSV1+LM2 and the second lowest is LM1+CSV2, reflect on how NETFT will select to use which automatically based on the metrics. I.e. It’s “self documentation” for human consumption.


Sometimes you might get surprising results. An example. If you’ve selected SMB for Live Migration and you have selected only one of the NICs here.image Still when you look at perform you might see both being used! What’s happening is that multichannel will kick in (and use two or more similar NICs when it finds them and if applicable move to RDMA.


So here we select SMB for the live migration type and the two equally capable 10GBps NICs available for live migration it will use them, even if you selected only one of them in the cluster network settings for live migration.

Why is that? Well, there is still another location where live migration settings are defined and that is in Hyper-V Manager. Take a look at the screenshot below.


The Incoming live migration settings is set to “Use any available network for live migration”. If you have this on it will still leverage both as when one is used multichannel drag the other one into action anyway, no matter what you set in network settings for live migration in the Cluster GUI (it set to use only one and dimmed out).

Do note that on Hyper-V Manager the settings for live migrations specify “Incoming live migrations”. That leads us to believe that it’s the target, the node where the VMs are migrated to that determines what NICs get used. But let’s put this to the test.

Testing A Hyper-V cluster with two nodes – Node A and Node B

On the cluster network settings you select only one network.


On  Hyper-V cluster Node A you have configured the following for live migration via Hyper-V Manager to “Use these IP addresses for live migration”. You cannot add or remove networks, the networks used are defined by the cluster.


On  Hyper-V cluster Node B you have configured the following for live migration via Hyper-V Manager to “Use any available network for live migration”.

As we now kick of a live migration from node A to node B we’ll see both NICs being used. Why well because Node B is the target and Node B has the setting “Use any available network for live migration”. So why then only these 2, why not pick up any other suitable NICs? Well we’ve configured the live migration on both nodes to use SMB. As this cluster is RDMA capable that means it will leverage multichannel/SMB direct. The auto configuration will select the best, equally capable NICs for this and that’s these two in our scenario. Remember the capabilities of the NICs have to match. So no mixtures of 1 * 1Gbps and 1 *10Gbps or 1 * multichannel and 1 SMB Direct.

The confusion really sets in that even if live migrate from Node B to A it will also use both NICs! Hum, that is “Incoming Live Migrations” is not always “correct” it seems, not when using SMB as a performance option at least. Apparently multichannel will kick in in both directions.

If you set both to Node A and Node B to “Use these IP addresses for live migration” and leave the cluster network setting with only one network it does only use one, even with SMB as a performance option for live migration. This is expected.

Note: I had one interesting hiccup while testing this configuration: when doing the latter one of the VMs failed in live migration of the entire host. I ran it again and that one VM still used both networks. All others went well during host migration with just one being used. That was a bit of a huh Confused smile moment and it sure tripped me up & kept me busy for a while. I blame RDMA and the position of the planets & constellations.

Things aren’t always what they seem at first and it’s good to keep that in mind. The moment you think you got if figured out, you’re wrong Winking smile. So look again & investigate.