Live Migration Can Benefit From Jumbo Frames

Does live migration benefit from Jumbo frames? This question always comes back so I’d just blog it hear again even if I have mentioned it as part of other blog posts. Yes it does! How do I know. Because I’ve tested and used it with Windows Server 2008 R2, 2012 & 2012 R2. Why? because I have a couple of mantra’s:

  • Assumption are the mother of all fuckups
  • Assume makes an ASS out of U and ME
  • Trust but verify

What can I say. I have been doing 10Gbps since for Live Migration with Hyper-V. And let me tell you my experiences with an otherwise completely optimized server (mainly BIOS performance settings): It will help you with up to 20% more bandwidth use.

And thanks to Windows Server 2012 R2 supporting SMB for live migration we can very nicely visualize this with 2*10Gbps NICS, not teamed, used by live migration leveraging SMB Multichannel. On one of the 10Gbps we enable Jumbo Frames on the other one we do not. We than live migrate a large memory VM back and forth. Now you tell me which one is which.

image

Now enable Jumbo frames on both 10Gbps NICs and again we live migrate the large memory VM back and forth. More bandwidth used, faster live migration.

image

I can’t make it any more clear. No jumbo frames will not kill your performance unless you have it messed up end to end. Don’t worry if you have a cheaper switch where you can only enable it switch wide instead op port per port. The switch is a pass through. So unless you set messed up sizes on sender/receiving host that the switch in between can’t handle, it will work even without jumbo frames and without heaven falling down on your head Smile. Configure it correctly, test it, and you’ll see.

Preventing Live Migration Over SMB Starving CSV Traffic in Windows Server 2012 R2 with Set-SmbBandwidthLimit

One of the big changes in Windows Server 2012 R2 is that all types of Live Migration can now leverage SMB 3.0 if the right conditions are met. That means that Multichannel & SMB Direct (RDMA) come in to play more often and simultaneously. Shared Nothing Live Migration & certain forms of Storage Live Migration are often a lot more planned due to their nature. So one can mitigate the risk by planning.  Good old standard Live Migration of virtual machines however is often less planned. It can be done via Cluster Aware Updating, to evacuate a host for hardware maintenance, via Dynamic optimization. This means it’s often automated as well. As we have demonstrated many times Live Migration can (easily) fill 20Gbps of bandwidth. If you are sharing 2*10Gbps NICs for multiple purposes like CSV, LM, etc. Quality of Service (QoS) comes in to play. There are many ways to achieve this but in our example here I’ll be using DCB  for SMB Direct with RoCE.

New-NetQosPolicy “CSV” –NetDirectPortMatchCondition 445 -PriorityValue8021Action 4
Enable-NetQosFlowControl –Priority 4
New-NetQoSTrafficClass "CSV" -Priority 4 -Algorithm ETS -Bandwidth 40
Enable-NetAdapterQos –InterfaceAlias SLOT41-CSV1+LM2
Enable-NetAdapterQos –InterfaceAlias SLOT42-LM1+CSV2
Set-NetQosDcbxSetting –willing $False

Now as you can see I leverage 2*10Gbps NIC, non teamed as I want RDMA. I have Failover/redundancy/bandwidth aggregation thanks to SMB 3.0. This works like a charm. But when leveraging Live Migration over SMB in Windows Server 2012 R2 we note that the LM traffic also goes over port 445 and as such is dealt with by the same QoS policy on the server & in the switches (DCB/PFC/ETS). So when both CSV & LM are going one how does one prevent LM form starving CSV traffic for example? Especially in Scale Out File Server Scenario’s this could be a real issue.

The Solution

To prevent LM traffic & CSV traffic from hogging all the SMB bandwidth ruining the SOFS party in R2 Microsoft introduced some new capabilities in Windows Server 2012 R2. In the SMBShare module you’ll find:

  • Set-SmbBandwidthLimit
  • Get-SmbBandwidthLimit
  • Remove-SmbBandwidthLimit

image

To use this you’ll need to install the Feature called SMB Bandwidth Limit via Server Manager or using PowerShell:  Add-WindowsFeature FS-SMBBW

You can limit SMB bandwidth for Virtual machine (Storage IO to a SOFS), Live Migration & Default (all the rest).  In the below example we set it to 8Gbps maximum.

Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1000MB

So there you go, we can prevent Live Migration from hogging all the bandwidth. Jose Baretto mentions this capability on his recent blog post on Windows Server 2012 R2 Storage: Step-by-step with Storage Spaces, SMB Scale-Out and Shared VHDX (Virtual). But what about Fibre Channel or iSCSI environments?  It might not be the total killer there as in SOFS scenario but still. As it turns out the Set-SmbBandwidthLimit also works in those scenarios. I was put on the wrong track by thinking it was only for SOFS scenarios but my fellow MVP Carsten Rachfahl kindly reminded me of my own mantra “Trust but verify” and as a result, I can confirm it even works to cap off Live Migration traffic over SMB that leverages RDMA (RoCE). So don’t let the PowerShell module name (SMBShare) fool you, it’s about all SMB traffic within the categories.

So without limit LM can use all bandwidth (2*10Gbps)

image

With Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1250MB you can see we max out at 10Gbps (2*5Gbps).
image

Some Remarks

I’d love to see a minimum bandwidth implementation of this (that could include safety buffer for spikes in CSV traffic with SOFS). The hard cap limit might lead to some wasted bandwidth. In other scenarios you could still get into trouble. What if you have 2*10Gbps available but one of those dies on you and you capped Live Migration Traffic at 16Gbps. With one NIC gone you’re potentially in trouble until the NIC has been replaced. OK, this is not a daily occurrence & depending on you environment & setup this is less or more of a potential issue.

Adventures In RDMA – The RoCE Path Over DCB To Windows Server 2012 R2 SMB 3.0 Glory

Prologue

On gloomy day, it was dark, grey and cold, we gave battle with RoCE & DCB (PFC/ETS). The fight was a long one, the battle field uncharted and we had only our veteran attitude towards adversity to guide us through the switch configurations. It seemed that no man had gone that far to the edges of the Windows Server 2012 empire. And when it came to RoCE & DCB meets Didier, I needed to show it that it had been conquered and was remembered of a quote in Gladiator:

Quintus: People RoCE/DCB configs should know when they are conquered.
Maximus: Would you, Quintus? Would I?

image

After many, many lonely & unsuccessful hours dealing with Performance monitor, switch configurations, reloads, firmware, drivers & Windows we got results:

… “it’s working” … “holey s* look at those numbers” …

On that dark day in a scarcely illuminated room, in the faint glare of the monitors even the CLI  of the switches in PUTTY felt like a grim cold place. But all that changed at as the impressive results brightened up the day and made all efforts seem worthwhile. “Didier Victor” I thought as I looked away from the screen, ‘”Once more”.image

But it has been a hard won victory. And should you fight this battle? We’ll let’s discuss this a bit now we’ve got your attention. RDMA is a learning process for many of us and neither Infiniband,  iWARP or RoCE are the one that need to win at this game. It’s you, via the knowledge you’ll gain working with RDMA technologies.

SMB Direct or SMB over RDMA comes in flavors

Infiniband (Mellanox)

That’s been here for a while. Has high cost associated (depends on where you come from) and also has a psychological barrier to it. Try discussing buying 10Gbps versus Infiniband with semi technical managerial types. You’ll know what I mean.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-2/ConnectX-3 using InfiniBand – Step by Step

iWARP (Chelsio / Intel)

RDMA but it’s TCP/IP offloaded to the card. It can leverage DCB but doesn’t require it.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Chelsio T4 cards using iWARP – Step by Step

RoCE (Mellanox)

“Infiniband over Ethernet” > so you “NEED” (no not a real hard requirement) DBC with PFC/ETS (DCBx can be handy) for it to work best. No need for Congestion Notification as it’s for TCP/IP but could be nice with iWARP (see above). Do note that you’ll need to configure your switches for DCB & that’s highly dependent on the vendor & even type of switches.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-3 using 10GbE/40GbE RoCE – Step by Step

Here’s an older overview of RDMA flavor’s pros & cons:image

Please see Jose Barreto’s excellent work on explaining SMB 3.0 over RDMA in his presentations at SNIA, TechEd and on his blog.

While I have heard of two people I have in my network working with Infiniband for SMB Direct and Windows Server 2012 (R2) most of us are doing 10Gbps. Pricing for Infiniband has a bad reputation. Not because Infiniband is super costly compared to 10/40Gbps (I’m told by most people who ask quotes are positively surprised) but when you can’t afford a Porsche you’re not shopping for a Ferrari either.  Especially not when a mid size sedan will serve al of your needs above and beyond the call of duty. On top of that you might have bought all that nice “converged network ready” 10Gbps gear some years ago. Some of us may be working towards 40Gbps but most are 10Gbps shops. My 40Gbps is “limited” to the inter links & uplinks. Meaning that we either go for iWarp or RoCE.

RoCE or iWARP

Which one is best of those two? Well, as the line is drawn between vendors. RoCE today equals Mellanox (yes the Infiniband vendor, RoCE is sometimes called “Infiniband on layer 4 over Ethernet layer 2”) and iWarp means Chelsio or Intel (their cards look a bit old in the teeth however).

You’ll find comparisons by both vendors claiming superiority for varied reasons. Here’s the Mellanox side http://www.mellanox.com/pdf/whitepapers/WP_RoCE_vs_iWARP.pdf & here’s Chelsio’s take http://www.chelsio.com/roce/ & http://www.moderntech.com.hk/sites/default/files/whitepaper/V09_iWAR_Summary_WP_0.pdf. It’s good to look at your needs and map them. But I cannot declare a winner. I did notice that at least one vendor of SOFS/CiB uses iWarp. Is that a statement? And if so about what? Price? Easy of use? Perfomance/Cost?

What I do find is that Chelsio is really hacking into RoCE as you can see here http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-The-Grand-Experiment1.pdf, http://www.chelsio.com/roce-whitepaper/, http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-FAQ-1204121.pdf So that begs the question are the right or are the scared of RoCE, as the Infiniband boys are out to eat their lunch?

My take on this for now

iWarp is way easier to get started. That’s for sure. RoCE  is firmware sensitive (NIC, Switches), driver sensitive (NIC). Configuring your switches (DCB) now is usually followed by a rebooting that switch (so you might not do that so easily in production and depending on where in the stack those switches live you really need to Force10 VLT or Cisco vPC, Arista MLAG  or a independent redundant switches to get away with it. RoCE loves green field. Stacking I hear you say? I don’t like stacking on that spot of the stack as firmware updates will get you to suffer through a single point of failure.

Disclaimer: RoCE in itself does not  DEMAND/REQUIRE DCB but the consensus is that it will work better, especially under heavy load. Weather SMB Direct over RoCE requires DCB is another question. For all practical purposes I’m working from the prerequisite it does for a production environment. But as you can do RoCE RDMA between to NIC with no DCB switch in between this indicates that the hard requirement for DCB is not there. Mind you not using DCB might not be smart in regards to QoS & error handling (no TCP/IP goodness handling this for you). But I’m no expert on this subject. Paul Grun however is and he’s involved with RoCE at  https://www.openfabrics.org/component/search/?searchword=Paul+grun&ordering=&searchphrase=all They tend to know their stuff. Read some of the comments below this article and you’ll know a lot http://www.hpcwire.com/hpcwire/2010-04-22/roce_an_ethernet-infiniband_love_story.html But PFC isn’t Walhalla either and some claim you can just forget about it and build non blocking networks. I guess you could if your pockets are deep enough Smile. And you might go a very long way without the need for RDMA. Many do … and when you talk to some network people & vendors they can’t agree either as everyone is on the same learning curve but from a different perspective. There is no one size fits all & it all depends.

iWarp doesn’t require DCB so you can get away with cheaper switches. Or, not so cheap switches that don’t support DCB (choose wisely). So cheaper switches is probably true on the low end. But, even very economically priced switches from DELL have good DCB support. Some other vendors who are more expensive don’t.

DCB is uncharted terrain for SMB Direct purposes & new to many for us. So if you want to do RDMA the easy way  … go iWARP. As said, the use of DCB for PFC/ETS is not mandatory in that case, you’ll get great results and it’s easy.  Mind you, you’ll still be dabbling with DCB if you want to do lossless magic in the switches Smile. Why you say? Well, that “converged network” story makes it kind of interesting to do so and PFC, DCBx/TLV is generic and can be leveraged for other things than iSCSI or FCoE.  And for all practical purposes SMB 3.0 with SMB Direct is a storage protocol since Windows Server 2012 made it so (CSV). Or do you do DCB for iSCSI/FCoE & iWarp for SMB Direct? After all there’s only 2 lossless queues to be had. But hey how many do you need? Choices, choices and no vast pool of experienced practitioners yet.

iWARP routes, it’s not bound by a single Ethernet broadcast domain. That could be useful info depending on your environment & needs. I’ll note that I leverage RDMA for East-West traffic, not north south & as such this could not be an issue. The time that I do “Shared Nothing Live Migration" from on premise to the cloud has not arrived yet.

The Mellanox cards in my neck of the woods were 35% cheaper than Chelsio (SFP+)

What about the scalability? “iWarp doesn’t scale that well” is stated left and right but I think that might often be based on older information. Chelsio makes a strong case for iWARP scalability. Especially when it comes to long distances, multiple hops & routing.

Again, your mileage may vary. But for “the smaller environments” who want to leverage RDMA with SMB 3.0 I’d say that iWarp is the easiest path to go & will do just fine. Now if you’re already into lossless Ethernet for iSCSI or working with FCoE you might have all the hardware you need & the experience to deal with DCB. The latter might not always be true however. Most people have Lossless Ethernet for iSCSI or FCoE set up by the vendor or consultants who’ll use well defined step by step guides. These do not exist for the RoCE variant of SMB 3.0 over RDMA.

The case for RoCE can be made as well.  Some claim that high volume of connections consumes memory when using iWARP and TCP’s flow and reliability controls are less suited for large-scale datacenters & cloud deployments due to performance issues. Where iWARP does not know multicast, RoCE does and that could be important to you.

So why did I or still do RoCE?

So why did I walk the walk? Basically because just talking the talk isn’t enough. We considered it an investment in our education. DCB is not going away (the abstraction isn’t their yet and won’t be for a while) and we need to gain knowledge of it to both handle it and make informed decisions. By the way once you go to lossless you might leverage DCB/PFC with iWarp as well just like you do for iSCSI to make it lossless (leveraging DCBx/TLV). Keep in mind that DCB is key in converged networking and as such deserves your attention. That’s why I chose not to avoid it but gave battle. DCB is all over the place when it comes to converged networking (iSCSI, FCoE), so we need to learn the good, the bad and the ugly. Until that day that perhaps, the hardware stack is that good, powerful & has so much bandwidth TCP/IP never needs it built in protection for packet loss. Hmmmmmm, I remember people saying that about 10Gbps, but then they wanted to send everything over 2*10Gbps pipes and it becomes an issue again?

It’s early days yet but you have to give credit to Microsoft for getting RDMA/DCB on the radar screen of the worlds virtualization & storage admins than ever before. It’s not a well established segment yet and it will be interesting to see how this all turns out. I do know that now that I’ve figured out a thing or two about RoCE, I won’t be intimidated & won’t make choices out of fear. And do remember that if you have plenty of idle CPU cycles & 10Gbps you might not even need RDMA. The value for me and my employers is the knowledge gained. DCB has it’s role to play but we’ll leverage iWARP or RoCE without a preference. Today you have 2 choices. RoCE is the newer one while iWarp has been around longer and both have avid proponents it seems.

I know one thing. If you need or want RDMA in any existing 10Gbps environment with minimal effort & no risk to existing switch infrastructure, you’ll use iWarp it seems.

Epilogue

You sit there staring at a truckload of VMs with 120GB of memory assigned in total being evacuated in +/- 70 seconds seconds, while doing a Shared Nothing Live Migration between the same hosts and without consuming CPU load …  and have DCB for SMB 3.0 running on your switches … Yes!

image

Remember, “What we do in life, echo’s in eternity” Winking smile You might think now that I’m a bit nutty, but I assure you that in my quest to find someone who had hands on experience configuring DCB on switches for SMB Direct with RoCE I had to turn to myself as no one seems to have done it.  I’ll be sharing more info on our setup and configurations in the future. Once you wrap your head around the concepts, you understand why things are done and how. There in lies the value for me.

Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & Compression in Windows Server 2012 R2

In a previous blog post Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & TCP/IP in Windows Server 2012 R2 we looked at what Dynamic load balancing mode in NIC teaming can do for us . Especially in a switch independent configuration as until now there was no possibility to leverage the complete bandwidth provided by the NIC team when migrating between only 2 nodes. I that blog we used TCP/IP. Now we’ll configure Compression and see what that does for us.

So we set up a NIC team in switch independent mode with Dynamic load balancing, it’s identical as that one used for the tests with TCP/IP.

Compression basically slashes the live migration times in half at a cost. CPU cycles.And again with Dynamic load balancing we can now also use all member of a NIC team for live migration even in switch independent mode. The speeds for live migrating 6 VMs  with 9GB of memory simultaneously were 12-14 seconds.

image

Take a look at the screen shot above. You see 6 VMs coming in to the host where these counters are collected and after that you see them being live migrated away from the host. As we have plenty of idle cycles I this test lab they get used, both when being the target and the source of the VMs being live migrated. You can also see that a lot less bandwidth is needed to achieve a faster live migration experience (compared to TCP/IP).

By the looks of it the extra bandwidth will help out when we have less CPU and vice versa. This is both the case for a single NIC or teamed NICs. Do note that you cannot combine compression with Multichannel. That means that the only scenario allowing for multiple NICs to be used with compression is NIC teaming. When you have a bunch  of free 1Gbps NICs in surplus this might get things moving for you!

Interesting stuff. I’m really looking forward to the moment we can run production loads on these configurations …