Adventures In RDMA – The RoCE Path Over DCB To Windows Server 2012 R2 SMB 3.0 Glory

Prologue

On gloomy day, it was dark, grey and cold, we gave battle with RoCE & DCB (PFC/ETS). The fight was a long one, the battle field uncharted and we had only our veteran attitude towards adversity to guide us through the switch configurations. It seemed that no man had gone that far to the edges of the Windows Server 2012 empire. And when it came to RoCE & DCB meets Didier, I needed to show it that it had been conquered and was remembered of a quote in Gladiator:

Quintus: People RoCE/DCB configs should know when they are conquered.
Maximus: Would you, Quintus? Would I?

image

After many, many lonely & unsuccessful hours dealing with Performance monitor, switch configurations, reloads, firmware, drivers & Windows we got results:

… “it’s working” … “holey s* look at those numbers” …

On that dark day in a scarcely illuminated room, in the faint glare of the monitors even the CLI  of the switches in PUTTY felt like a grim cold place. But all that changed at as the impressive results brightened up the day and made all efforts seem worthwhile. “Didier Victor” I thought as I looked away from the screen, ‘”Once more”.image

But it has been a hard won victory. And should you fight this battle? We’ll let’s discuss this a bit now we’ve got your attention. RDMA is a learning process for many of us and neither Infiniband,  iWARP or RoCE are the one that need to win at this game. It’s you, via the knowledge you’ll gain working with RDMA technologies.

SMB Direct or SMB over RDMA comes in flavors

Infiniband (Mellanox)

That’s been here for a while. Has high cost associated (depends on where you come from) and also has a psychological barrier to it. Try discussing buying 10Gbps versus Infiniband with semi technical managerial types. You’ll know what I mean.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-2/ConnectX-3 using InfiniBand – Step by Step

iWARP (Chelsio / Intel)

RDMA but it’s TCP/IP offloaded to the card. It can leverage DCB but doesn’t require it.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Chelsio T4 cards using iWARP – Step by Step

RoCE (Mellanox)

“Infiniband over Ethernet” > so you “NEED” (no not a real hard requirement) DBC with PFC/ETS (DCBx can be handy) for it to work best. No need for Congestion Notification as it’s for TCP/IP but could be nice with iWARP (see above). Do note that you’ll need to configure your switches for DCB & that’s highly dependent on the vendor & even type of switches.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-3 using 10GbE/40GbE RoCE – Step by Step

Here’s an older overview of RDMA flavor’s pros & cons:image

Please see Jose Barreto’s excellent work on explaining SMB 3.0 over RDMA in his presentations at SNIA, TechEd and on his blog.

While I have heard of two people I have in my network working with Infiniband for SMB Direct and Windows Server 2012 (R2) most of us are doing 10Gbps. Pricing for Infiniband has a bad reputation. Not because Infiniband is super costly compared to 10/40Gbps (I’m told by most people who ask quotes are positively surprised) but when you can’t afford a Porsche you’re not shopping for a Ferrari either.  Especially not when a mid size sedan will serve al of your needs above and beyond the call of duty. On top of that you might have bought all that nice “converged network ready” 10Gbps gear some years ago. Some of us may be working towards 40Gbps but most are 10Gbps shops. My 40Gbps is “limited” to the inter links & uplinks. Meaning that we either go for iWarp or RoCE.

RoCE or iWARP

Which one is best of those two? Well, as the line is drawn between vendors. RoCE today equals Mellanox (yes the Infiniband vendor, RoCE is sometimes called “Infiniband on layer 4 over Ethernet layer 2”) and iWarp means Chelsio or Intel (their cards look a bit old in the teeth however).

You’ll find comparisons by both vendors claiming superiority for varied reasons. Here’s the Mellanox side http://www.mellanox.com/pdf/whitepapers/WP_RoCE_vs_iWARP.pdf & here’s Chelsio’s take http://www.chelsio.com/roce/ & http://www.moderntech.com.hk/sites/default/files/whitepaper/V09_iWAR_Summary_WP_0.pdf. It’s good to look at your needs and map them. But I cannot declare a winner. I did notice that at least one vendor of SOFS/CiB uses iWarp. Is that a statement? And if so about what? Price? Easy of use? Perfomance/Cost?

What I do find is that Chelsio is really hacking into RoCE as you can see here http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-The-Grand-Experiment1.pdf, http://www.chelsio.com/roce-whitepaper/, http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-FAQ-1204121.pdf So that begs the question are the right or are the scared of RoCE, as the Infiniband boys are out to eat their lunch?

My take on this for now

iWarp is way easier to get started. That’s for sure. RoCE  is firmware sensitive (NIC, Switches), driver sensitive (NIC). Configuring your switches (DCB) now is usually followed by a rebooting that switch (so you might not do that so easily in production and depending on where in the stack those switches live you really need to Force10 VLT or Cisco vPC, Arista MLAG  or a independent redundant switches to get away with it. RoCE loves green field. Stacking I hear you say? I don’t like stacking on that spot of the stack as firmware updates will get you to suffer through a single point of failure.

Disclaimer: RoCE in itself does not  DEMAND/REQUIRE DCB but the consensus is that it will work better, especially under heavy load. Weather SMB Direct over RoCE requires DCB is another question. For all practical purposes I’m working from the prerequisite it does for a production environment. But as you can do RoCE RDMA between to NIC with no DCB switch in between this indicates that the hard requirement for DCB is not there. Mind you not using DCB might not be smart in regards to QoS & error handling (no TCP/IP goodness handling this for you). But I’m no expert on this subject. Paul Grun however is and he’s involved with RoCE at  https://www.openfabrics.org/component/search/?searchword=Paul+grun&ordering=&searchphrase=all They tend to know their stuff. Read some of the comments below this article and you’ll know a lot http://www.hpcwire.com/hpcwire/2010-04-22/roce_an_ethernet-infiniband_love_story.html But PFC isn’t Walhalla either and some claim you can just forget about it and build non blocking networks. I guess you could if your pockets are deep enough Smile. And you might go a very long way without the need for RDMA. Many do … and when you talk to some network people & vendors they can’t agree either as everyone is on the same learning curve but from a different perspective. There is no one size fits all & it all depends.

iWarp doesn’t require DCB so you can get away with cheaper switches. Or, not so cheap switches that don’t support DCB (choose wisely). So cheaper switches is probably true on the low end. But, even very economically priced switches from DELL have good DCB support. Some other vendors who are more expensive don’t.

DCB is uncharted terrain for SMB Direct purposes & new to many for us. So if you want to do RDMA the easy way  … go iWARP. As said, the use of DCB for PFC/ETS is not mandatory in that case, you’ll get great results and it’s easy.  Mind you, you’ll still be dabbling with DCB if you want to do lossless magic in the switches Smile. Why you say? Well, that “converged network” story makes it kind of interesting to do so and PFC, DCBx/TLV is generic and can be leveraged for other things than iSCSI or FCoE.  And for all practical purposes SMB 3.0 with SMB Direct is a storage protocol since Windows Server 2012 made it so (CSV). Or do you do DCB for iSCSI/FCoE & iWarp for SMB Direct? After all there’s only 2 lossless queues to be had. But hey how many do you need? Choices, choices and no vast pool of experienced practitioners yet.

iWARP routes, it’s not bound by a single Ethernet broadcast domain. That could be useful info depending on your environment & needs. I’ll note that I leverage RDMA for East-West traffic, not north south & as such this could not be an issue. The time that I do “Shared Nothing Live Migration" from on premise to the cloud has not arrived yet.

The Mellanox cards in my neck of the woods were 35% cheaper than Chelsio (SFP+)

What about the scalability? “iWarp doesn’t scale that well” is stated left and right but I think that might often be based on older information. Chelsio makes a strong case for iWARP scalability. Especially when it comes to long distances, multiple hops & routing.

Again, your mileage may vary. But for “the smaller environments” who want to leverage RDMA with SMB 3.0 I’d say that iWarp is the easiest path to go & will do just fine. Now if you’re already into lossless Ethernet for iSCSI or working with FCoE you might have all the hardware you need & the experience to deal with DCB. The latter might not always be true however. Most people have Lossless Ethernet for iSCSI or FCoE set up by the vendor or consultants who’ll use well defined step by step guides. These do not exist for the RoCE variant of SMB 3.0 over RDMA.

The case for RoCE can be made as well.  Some claim that high volume of connections consumes memory when using iWARP and TCP’s flow and reliability controls are less suited for large-scale datacenters & cloud deployments due to performance issues. Where iWARP does not know multicast, RoCE does and that could be important to you.

So why did I or still do RoCE?

So why did I walk the walk? Basically because just talking the talk isn’t enough. We considered it an investment in our education. DCB is not going away (the abstraction isn’t their yet and won’t be for a while) and we need to gain knowledge of it to both handle it and make informed decisions. By the way once you go to lossless you might leverage DCB/PFC with iWarp as well just like you do for iSCSI to make it lossless (leveraging DCBx/TLV). Keep in mind that DCB is key in converged networking and as such deserves your attention. That’s why I chose not to avoid it but gave battle. DCB is all over the place when it comes to converged networking (iSCSI, FCoE), so we need to learn the good, the bad and the ugly. Until that day that perhaps, the hardware stack is that good, powerful & has so much bandwidth TCP/IP never needs it built in protection for packet loss. Hmmmmmm, I remember people saying that about 10Gbps, but then they wanted to send everything over 2*10Gbps pipes and it becomes an issue again?

It’s early days yet but you have to give credit to Microsoft for getting RDMA/DCB on the radar screen of the worlds virtualization & storage admins than ever before. It’s not a well established segment yet and it will be interesting to see how this all turns out. I do know that now that I’ve figured out a thing or two about RoCE, I won’t be intimidated & won’t make choices out of fear. And do remember that if you have plenty of idle CPU cycles & 10Gbps you might not even need RDMA. The value for me and my employers is the knowledge gained. DCB has it’s role to play but we’ll leverage iWARP or RoCE without a preference. Today you have 2 choices. RoCE is the newer one while iWarp has been around longer and both have avid proponents it seems.

I know one thing. If you need or want RDMA in any existing 10Gbps environment with minimal effort & no risk to existing switch infrastructure, you’ll use iWarp it seems.

Epilogue

You sit there staring at a truckload of VMs with 120GB of memory assigned in total being evacuated in +/- 70 seconds seconds, while doing a Shared Nothing Live Migration between the same hosts and without consuming CPU load …  and have DCB for SMB 3.0 running on your switches … Yes!

image

Remember, “What we do in life, echo’s in eternity” Winking smile You might think now that I’m a bit nutty, but I assure you that in my quest to find someone who had hands on experience configuring DCB on switches for SMB Direct with RoCE I had to turn to myself as no one seems to have done it.  I’ll be sharing more info on our setup and configurations in the future. Once you wrap your head around the concepts, you understand why things are done and how. There in lies the value for me.

Future Proofing Storage Acquisitions Without A Crystal Ball

Dealing with an unknown future without a crystal ball

I’ve said it before and I’ll say it again. Storage Spaces in Windows Server 2012 (R2) is are the first steps of MSFT to really make a difference (or put a dent into) in the storage world. See TechEd 2013 Revelations for Storage Vendors as the Future of Storage lies With Windows 2012 R2 (that was a nice blog by the way to find out what resellers & vendors have no sense of humor & perspective). It’s not just Microsoft who’s doing so. There are many interesting initiatives at smaller companies to to the same. The question is not if these offerings can match the features sets, capabilities and scenario’s of the established storage vendors offerings. The real question is if the established vendors offer enough value for money to maintain themselves in a good enough is good enough world, which in itself is a moving target due to the speed at which technology & business needs evolve. The balance of cost versus value becomes critical for selecting storage. You need it now and you know you’ll run it for 3 to 5 years. Perhaps longer, which is fine if it serves your needs, but you just don’t know. Due to speed of change you can’t invest in a solution that will last you for the long term. You need a good fit now at reasonable cost with some headway for scale up / scale out. The ROI/TCO has to be good within 6 months or a year. If possible get a modular solution. One where you can replace the parts that are the bottle neck without having to to a fork lift upgrade. That allows for smaller, incremental, affordable improvements until you have either morphed into a new system all together over a period of time or have gotten out of the current solution what’s possible and the time has arrived to replace it. Never would I  invest in an expensive, long term, fork lift, ultra scalable solution. Why not. To expensive and as such to high risk. The risk is due to the fact I don’t have one of these:

https://i1.wp.com/trustbite.co.nz/wp-content/uploads/2010/01/Crystal-Ball.jpg?w=584

So storage vendors need to perform a delicate balancing act. It’s about price, value, technology evolution, rapid adoption, diversification, integration, assimilation & licensing models in a good enough is good enough world where the solution needs to deliver from day one.

I for one will be very interested if all storage vendors can deliver enough value to retain the mid market or if they’ll become top feeders only. The push to the cloud, the advancements in data replication & protection in the application and platform layer are shaking up the traditional storage world. Combine that with the fast pace at which SSD & Flash storage are evolving together with Windows Server 2012 that has morphed into a very capable storage platform and the landscape looks very volatile for the years to come. Think about  ever more solutions at the application (Exchange, SQL server) and platform layer (Hyper-V replica) with orchestration on premise and/or in the cloud and the pressure is really on.

So how do you choose a solution in this environment?

Whenever you are buying storage the following will happen. Vendors, resellers & sales people, are going to start pulling at you. Now, some are way better than others at this, some are even down right good at this whole process a proceed very intelligently.

Sometimes it involves FUD, doom & gloom combined with predictions of data loss & corruption by what seem to be prophets of disaster. Good thing is when you buy whatever they are selling that day, they can save you from that. The thing is this changes with the profit margin and kickbacks they are getting. Sometimes you can attribute this to the time limited value of technology, things evolve and todays best is not tomorrows best. But some of them are chasing the proverbial $ so hard they portray themselves as untrustworthy fools.

That’s why I’m not to fond of the real big $ projects. Too much politics & sales. Sure you can have people take care of but you are the only one there to look out for your own interests. To do that all you need to do is your own due diligence and be brave. Look, a lot of SAN resellers have never ever run a SAN, servers, Hyper-V clusters, virtualized SQL Server environments or VDI solutions in your real live production environments for a sustained period of time. You have. You are the one whose needs it’s all about as you will have to live and work with the solution for years to come.  We did this exercise and it was worth while. We got the best value for money looking out for our own interests.

Try this with a reseller or vendor. Ask them about how their hardware VSS providers & snapshot software deals with the intricacies of CSV 2.0 in a Hyper-V cluster. Ask them how it works and tell them you need to references to speak to who are running this in production. Also make sure you find your own references. You can, it’s a big world out there and it’s a fun exercise to watch their reactions Winking smile

As Aidan remarked in his blog on ODX–Not All SANs Are Created Equally

These comparisons reaffirm what you should probably know: don’t trust the whitepapers, brochures, or sales-speak from a manufacturer.  Evidently not all features are created equally.

You really have to do your own due diligence. Some companies can afford the time, expense & personnel to have the shortlisted vendors deliver a system for them to test. Costs & effort rise fast if you need to get a setup that’s comparable to the production environment. You need to device tests that mimic real life scenario’s in storage capacity, IOPS, read/write patterns and make sure you don’t have bottleneck outside of the storage system in the lab.

Even for those that can, this is a hard thing to do. Some vendors also offer labs at their Tech Centers or Solutions Centers where customers or potential customers can try out scenarios. No matter what options you have, you’ll realize that this takes a lot of effort. So what do I do? I always start early. You won’t have all the information, question & answers available with a few hours of browsing the internet & reading some brochures. You’ll also notice that’s there’s always something else to deal with or do, so give your self time, but don’t procrastinate. I did visit the Tech Centers & Solution Centers in Europe of short listed vendors. Next to that I did a lot of reading, asked questions and talked to a lot of people about their view and experiences with storage. Don’t just talk to the vendors or resellers. I talked a lot with people in my network, at conferences and in the community. I even tracked down owners of the shortlisted systems and asked to talk to them. All this was part of my litmus test of the offered storage solutions. While perfection is not of this world there is a significant difference between vendor’s claims and the reality in the field. Our goal was to find the best solution for our needs based on price/value and who’s capabilities & usability & support excellence materialized with the biggest possible majority of customers in the field.

Friendly Advice To Vendors

So while the entire marketing and sales process is important for a vendor I’d like to remind all of them of a simple fact. Delivering what you sell makes for very happy customers who’s simple stories of their experiences with the products will sell it by worth of mouth. Those people can afford to talk about the imperfections & some vNext wishes they have. That’s great as those might be important to you but you’ll be able to see if they are happy with their choice and they’ll tell you why.

Windows Server 2012 R2 Unmap, ODX On A Dell Compellent SAN Demo

UNMAP & ODX Video

Some things are easier to show using a video so have a look at a video on UNMAP/ODX used with Windows Server 2012 R2 and Compellent SAN:

You can also go directly to the Vimeo page by clicking on the below screen shotimage

We start out with a 10.5TB large thinly provisioned LUN that has about 203GB of space in use on the SAN. So the LUN on the SAN might be 10.5TB and windows sees a volume that is 10.5TB only the effective data stored consumes storage space on the SAN. That ought to demonstrate the principle of thin provisioning adequately Smile. The nice PowerShell counter is made possible via the Compellent PowerShell Command Set.

We then copy 42GB worth of ISO files inside a Windows Server 2012 virtual machine from a fixed VHD to a dynamically expanding VDHX. Those are nice speeds. And look at how the size of the VHDX file grows on the CSV volume and how the space used on the SAN is growing. That’s because the LUN is thinly provisioned.

Secondly we copy the same ISO files to a fixed size VHDX. Again, some really nice speeds. As the VHDX is fixed in size you do not see it grow. When looking at the little SAN counter however we do see that the thinly provisioned LUN is using more storage capacity.

Once that is done we see that the total space consumed on the SAN for that CSV LUN has risen to 284GB. We then delete the data from both dynamically expanding VHDX and are about to run the Optimize-Volume command when we notice that the SAN has already reclaimed the space. So we don’t run the optimize command. Keep that in mind. By the way, this process is done as part of standard maintenance (defrag) and some NTFS check pointing mechanism that’s run every 5 minutes and sends down the info from the virtual layer to the physical layer to the SAN. During demo’s it’s kind of boring to sit around and wait for it to happen Smile. Just remember that in real life it’s a zero touch feature, you don’t need to baby sit it.

We then also delete the ISO files from the fixed VHDX and run Optimize-Volume G –Retrim and as result you see the space reclaimed on the SAN. As this is a fixed disk the size of the VDHX will not change. But what about the dynamically expanding VHDX? Well you need to shut it down for that. But hey, nothing happens. So we fire it up again and do run Optimize-Volume H –Retrim before shutting it down again and voila.

So what do you need for this?

Rest assured. You don’t need the most high end, most expensive, complex and proprietary SAN hardware to get this done. What you need is good software (firmware) on quality commodity hardware and you’re golden. If any SAN vendor wants to charge you a license fee for ODX/UNMAP just throw them out. If they don’t even offer it walk away from them and just use storage spaces. There are better alternatives than overpriced SANs lacking features.

I’ve found that systems like Equalogic & Compellent are in the sweet point for 90 % of their markets based on price versus capabilities and features.  Let’s look at the a Compellent for example. For all practical intend this SAN runs on commodity hardware. It’s servers & disk bays. SAS to the storage & FC, iSCSI or SMB/NFS for access. With capable hardware the magic is in the software. Make no mistake about it, commodity hardware when done right, is very, very capable. You don’t need a special proprietary hardware & processors unless for some specialized nice markets. And if you think you do, what about buying commodity hardware anyway at 50% of the cost and replacing it with the latest of the greatest commodity hardware after 4 years and still come out on top cost wise whilst beating the crap out of that now 4 year old ASIC and reaping the benefits of a new capabilities the technology evolutions offers? Things move fast and you can’t predict the future anyway.

Hands on with Hyper-V Clustering Maintenance Mode & Cluster Aware Updating TechNet Screencast

I’ve blogged and given some presentations on Cluster Aware Updating before and I also did a web cast on this subject on Technet. You can find the video of that screencast right here Hands on with Hyper-V Clustering Maintenance Mode & Cluster Aware Updating.

image

I hope you get something out of it. Once I got my head wrapped around around the XML to make the BIOS, firmware & driver updates from DELL to work as well as the pre configured inbox functionality (DGR & QFE updates) it has proven equally valuable for those kinds of updates.