Verifying SMB 3.0 Multichannel/RDMA Is Working In Windows Server 2012 (R2)

So you have spent some money on RDMA cards (RoCE in this example), spent even more money on 10Gbps switches with DCB capabilities and, last but not least, you have struggled for many hours to get PFC, ETS, … configured. Now you’d like to see that your hard work has paid off, you want to see that RDMA power that SMB 3.0 leverages in action. How?

You could just copy files and look at the speed, but when you have sufficient bandwidth and the limiting factor is, for example, disk IO, how would you know RDMA is doing the work? Well, let’s have a look below.

You can take a look at Performance Monitor for RDMA specific counters like “RDMA Activity” and “SMB Direct Connection”.
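
If you prefer PowerShell over the Performance Monitor GUI, you can pull the same counters from the command line. A quick sketch, assuming the counter sets show up under the same names as in Performance Monitor (“RDMA Activity” and “SMB Direct Connection”):

    # List the RDMA related counters available on this host
    Get-Counter -ListSet "RDMA Activity", "SMB Direct Connection" |
        Select-Object -ExpandProperty Counter

    # Sample all of them every second for ten samples while the file copy runs
    Get-Counter -Counter "\RDMA Activity(*)\*", "\SMB Direct Connection(*)\*" `
        -SampleInterval 1 -MaxSamples 10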

Whilst copying six 3.4GB ISO files over the RDMA connection we see a speed of 1.05 GB/s. Not too shabby. But hey, that’s nothing a good 10Gbps NIC with plain TCP/IP couldn’t handle under the right conditions.

It’s the RDMA counters in Performance Monitor that show us the traffic that is going via SMB Direct.

Another giveaway that RDMA is in play comes from Task Manager, the performance view for the RDMA NIC: 1.3Mbps of send traffic can’t possibly give us 1.05GB/s in copy speed by magic.

[Screenshot: Task Manager showing minimal TCP traffic on the RDMA NIC during the copy]

When you run netstat -xan (instead of the usual -an) you get to see the RDMA connections. The mode is “Kernel” instead of the usual “TCP” or “UDP” that -an shows for TCP/IP connections/listeners.

[Screenshot: netstat -xan output showing the RDMA connections in Kernel mode]
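
PowerShell gives you a complementary view of the same thing. A minimal check, using the SMB and NetAdapter cmdlets that ship with Windows Server 2012:

    # Does the NIC expose RDMA and is it enabled?
    Get-NetAdapterRdma

    # Does the SMB client consider the interface RDMA capable?
    Get-SmbClientNetworkInterface

    # While a copy is running: are the SMB Multichannel connections RDMA capable on both ends?
    Get-SmbMultichannelConnection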

If you want to go all geeky there is an event log where you can look at RDMA events amongst others. Jose Barreto discusses this in Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-3 using 10GbE/40GbE RoCE – Step by Step, with instructions on how to use it. In short:

  • Open Event Viewer and on the menu select “View”, then “Show Analytic and Debug Logs”.
  • Expand the tree on the left: Applications and Services Logs, Microsoft, Windows, SMB Client, ObjectStateDiagnostic.
  • On the “Actions” pane on the right, select “Enable Log”.
  • Run your RDMA workload, then disable the log to view the events.

Some filtering & PowerShell might come in handy to comb through them.
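
A minimal sketch of that, assuming the log name as it appears on Windows Server 2012 (Microsoft-Windows-SMBClient/ObjectStateDiagnostic):

    # Enable the analytic log, run your RDMA workload, then disable it again
    wevtutil.exe set-log Microsoft-Windows-SMBClient/ObjectStateDiagnostic /enabled:true

    # ... run the file copies over SMB Direct here ...

    wevtutil.exe set-log Microsoft-Windows-SMBClient/ObjectStateDiagnostic /enabled:false

    # Comb through the events for RDMA related entries
    Get-WinEvent -LogName Microsoft-Windows-SMBClient/ObjectStateDiagnostic -Oldest |
        Where-Object { $_.Message -match "RDMA" } |
        Select-Object TimeCreated, Id, Message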

Complete VM Mobility Across The Data Center with SMB 3.0, RDMA, Multichannel & Windows Server 2012 (R2)

Introduction

The moment I figured out that Storage Live Migration (in certain scenarios) and Shared Nothing Live Migration leverage SMB 3.0, and as such Multichannel and RDMA, in Windows Server 2012 I was hooked. I just couldn’t let go of the concept of leveraging RDMA for those scenarios. Let me show you the value of my current favorite network design for some demanding Hyper-V environments. I was challenged a couple of times on the cost/port of this design, which is, when you really think of it, a very myopic way of calculating TCO/ROI. Really, it is. And this week at TechEd North America 2013 Microsoft announced that all types of Live Migration support Multichannel & RDMA (next to compression) in Windows Server 2012 R2. Watch that in action at minute 39 of Understanding the Hyper-V over SMB Scenario, Configurations, and End-to-End Performance. You should have seen the smile on my face when I heard that one! Yes, standard Live Migration now uses multiple NICs (no teaming) and RDMA for lightning fast VM mobility & storage traffic. People, you will hit the speed boundaries of DDR3 memory with this! The TCO/ROI of our plans just became even better, just watch the session.
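
As a quick illustration of where this is heading: on Windows Server 2012 R2 you can tell a Hyper-V host to use SMB as the Live Migration transport so it picks up Multichannel and RDMA. A minimal sketch, with placeholder subnets standing in for your RDMA capable networks:

    # Allow live migrations on this host
    Enable-VMMigration

    # Use SMB as the transport so Live Migration can leverage SMB Multichannel & SMB Direct (RDMA)
    Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

    # Keep live migration traffic on the networks carrying the RDMA capable NICs
    Set-VMHost -UseAnyNetworkForMigration $false
    Add-VMMigrationNetwork 10.10.180.0/24
    Add-VMMigrationNetwork 10.10.181.0/24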

So why might I use more than two 10Gbps NIC ports in a team with converged networking for Hyper-V in Windows Server 2012? It’s a great solution for sure; a combined bandwidth of 2*10Gbps is more than what a lot of people have right now and it can handle a serious workload. So don’t get me wrong, I like that solution. But sometimes more is asked for and warranted, depending on your environment.
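
For context, this is roughly what that standard converged team design looks like as a sketch; the names and bandwidth weights are purely illustrative. Keep in mind that SMB Direct traffic bypasses teams and virtual switches, which is exactly why dedicated RDMA capable ports enter the picture:

    # A converged 2*10Gbps team carrying all host traffic through a Hyper-V switch with bandwidth weights
    New-NetLbfoTeam -Name "Team10G" -TeamMembers "NIC1","NIC2" `
        -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort

    New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "Team10G" `
        -MinimumBandwidthMode Weight -AllowManagementOS $false

    # Virtual NICs for the host: management, cluster/CSV and live migration
    Add-VMNetworkAdapter -ManagementOS -Name "Management"    -SwitchName "ConvergedSwitch"
    Add-VMNetworkAdapter -ManagementOS -Name "CSV"           -SwitchName "ConvergedSwitch"
    Add-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -SwitchName "ConvergedSwitch"

    Set-VMNetworkAdapter -ManagementOS -Name "Management"    -MinimumBandwidthWeight 10
    Set-VMNetworkAdapter -ManagementOS -Name "CSV"           -MinimumBandwidthWeight 40
    Set-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -MinimumBandwidthWeight 30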

The reason for this is shown in the picture below. Today there is no more limit on the VM mobility within a data center. This will only become more common in the future.

[Diagram: unrestricted VM mobility within the data center, within and across clusters and standalone hosts]

This is not just a wet dream of virtualization engineers, it serves some very real needs. Of course it does. Otherwise I would not spend the money. It consumes extra 10Gbps ports on the network switches, which need to be redundant as well, and you need 10Gbps RDMA capable cards and DCB capable switches. So why this investment? Well, I’m designing for very flexible and dynamic environments that have certain demands laid down by the business. Let’s have a look at those.

The Road to Continuous Availability

All maintenance operations, troubleshooting and even upgrades/migrations should be done with minimal impact on the business. This means that we need to build for high to continuous availability where practical and make sure performance doesn’t suffer too much, not noticeably anyway. That’s where the capability to live migrate virtual machines off a host, clustered or not, rapidly and efficiently, with minimal impact on the workload on the hosts involved, comes into play.

Dynamic environments won’t tolerate downtime

We also want to leverage our resources where and when they are needed the most. And the infrastructure for the above can also be leveraged for that. Storage Live Migration and even Shared Nothing Live Migration can be used to place virtual machine workloads where they get the resources they need. You could see this as (dynamically) optimizing the workload both within and across clusters or amongst standalone Hyper-V nodes. This could be to an SSD-only storage array or a smaller but very powerful node or cluster in regards to CPU, memory and disk IO. This can be useful in those scenarios where scientific applications, number crunching or IOPS intensive software or the like needs them, but only for certain periods and not permanently.
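
A minimal sketch of what that looks like in practice; the VM, host and path names below are just placeholders:

    # Shared Nothing Live Migration: move a running VM, including its storage, to another host
    Move-VM -Name "CrunchVM01" -DestinationHost "FastNode01" `
        -IncludeStorage -DestinationStoragePath "D:\VMs\CrunchVM01"

    # Storage Live Migration: relocate just the virtual disks, e.g. to an SSD only volume
    Move-VMStorage -VMName "CrunchVM01" -DestinationStoragePath "S:\SSD\CrunchVM01"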

Future proofing for future storage designs

Maybe you’re an old time fibre channel user or iSCSI rules your current data center and Windows Server 2012 has not changed that. But that doesn’t mean it will not come. The option of using a Scale-Out File Server and leveraging SMB 3.0 file shares to provide storage for Hyper-V deployments is a very attractive one in many aspects. And if you build the network as I’m doing, you’re ready to switch to SMB 3.0 without missing a heartbeat. If you were to deplete the bandwidth that x number of 10Gbps ports can offer, no worries, you’ll either use 40Gbps and up or InfiniBand. If you don’t want to go there … well, since you just dumped iSCSI or FC you have room for some more 10Gbps ports.

Future proofing performance demands

Solutions tend to stay in place longer than envisioned, and if you need longevity and a stable, standard way of doing networking, here it is. It’s not the most economical way of doing things but it’s not as cost prohibitive as you think. Recently I was confronted again with some of the insanities of enterprise IT. A couple of network architects costing a hefty daily rate stated that 1Gbps is only for the data center and not the desktop, while even arguing about the cost of some fiber cable versus RJ45 (CAT5E). Well, let’s look beyond the North-South traffic and the cost of aggregating bandwidth all the way up the stack, shall we? Let me tell you that the money spent on such advisers can buy you 10Gbps capabilities in the server room or data center (and some 1Gbps for the desktops to go) if you shop around and negotiate well. The one size fits all and the ridiculous economies of scale “to make it affordable” arguments in big central IT are not always the best fit in helping the customers. Think a little bit outside of the box please and don’t say no out of habit or laziness!

Conclusion

In some future blog post(s) we’ll take a look at what such a network design might look like and why. There is no one size fits all, but there are not too many permutations either. In our latest efforts we have been specifically looking into making sure that a single rack failure would not bring down a cluster. So when thinking of the rack as a failure domain we need to spread the cluster nodes across multiple racks in different rows. That means we need the network to provide the connectivity & capability to support this, but more on that later.

SMB Direct RoCE Does Not Work Without DCB/PFC

Introduction

SMB Direct RoCE does not work without DCB/PFC. “Yes”, you say, “we know, this is well documented. Thank you.” But before you sign off, hear me out.

Recently I plugged two RoCE cards into some test servers and linked them to a couple of 10Gbps switches. I did some quick large file copy testing and to my big surprise RDMA kicked in with stellar performance even before I had installed the DCB feature, let alone configured it. So what’s the deal here? Does it work without DCB? Does the card fall back to iWarp? Highly unlikely. I was expecting it to fall back to plain vanilla 10Gbps with RDMA not being used at all, but it was. A short chat with Jose Barreto helped clarify this.

DCB/PFC is a requirement for RoCE

The busier the network gets, the faster performance will drop. Now, in our test scenario we had two servers for a total of 4 RoCE ports on a network consisting of a couple of beefy 48 port 10Gbps switches. So we didn’t see the negative results of this here.

DCB (Data Center Bridging) and Priority Flow Control are considered a requirement for any kind of RoCE deployment. RDMA with RoCE operates at the Ethernet layer. That means there is no overhead from TCP/IP, which is great for performance; it’s the reason you want to use RDMA in the first place. It also means RoCE is left on its own to deal with congestion and errors at the Ethernet level. For that it needs DCB/PFC, otherwise you’ll run into performance issues due to a ton of retries at the higher network layers.

The reason that iWarp doesn’t require DCB/PFC is that it works at the TCP/IP level, which is also offloaded by using a TCP/IP stack on the NIC instead of in the OS. So errors are handled by TCP/IP, at a cost: iWarp delivers the same benefits as RoCE but it doesn’t scale as well. Not that iWarp performance is lousy, far from it! Mind you, for bandwidth management reasons you’d be better off using DCB or some form of QoS there as well.

Conclusion

So no, not configuring DCB on your servers and the switches isn’t an option, but apparently it isn’t blocked either, so beware of this. It might appear to be working fine but it’s a bad idea. Also don’t think it falls back to iWarp mode, it doesn’t, as one card does one thing, not both. There is no shortcut. RoCE RDMA does not work error free out of the box, so you do have to install the DCB feature and configure it together with the switches.
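
For reference, a minimal sketch of the server side of that configuration, assuming SMB Direct traffic is tagged with priority 3 and the switches are configured to match; the priority, bandwidth percentage and NIC names are just examples:

    # Install the Data Center Bridging feature
    Install-WindowsFeature Data-Center-Bridging

    # Classify SMB Direct traffic (TCP port 445) with priority 3
    New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

    # Enable Priority Flow Control for the SMB priority only
    Enable-NetQosFlowControl -Priority 3
    Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

    # Reserve bandwidth for SMB with ETS and apply DCB to the RoCE NICs
    New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
    Enable-NetAdapterQos -Name "RoCE-NIC1","RoCE-NIC2"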

Using RAMDisk To Test Windows Server 2012 Network Performance

I’m testing & playing with different Windows Server 2012 & Hyper-V networking scenarios with 10Gbps, Multichannel, RDMA, converged networking etc. Partially this is to find out what works best for us in regards to speed, reliability, complexity, supportability and cost.

Basically you have four basic resources in IT around which the eternal struggle for the perfect balance takes place. These are:

  • CPU
  • Memory
  • Networking
  • Storage

We need the correct balance of capabilities, capacity and speed for these in a well designed system. For many years now, but especially in the last 2 years, it’s very safe to say that, while the sky is the limit, it’s become ever easier and cheaper to get what we need when it comes to CPU and memory. These have become very powerful, fast and affordable relative to the entire cost of a solution.

Networking in the 10Gbps era is also showing its potential in quantity (bandwidth), speed (latency) and cost (well, it’s getting there) without reducing the CPU or memory to trash, thanks to a bunch of modern offload technologies. And basically, in this case, it’s these qualities we want to put to the test.

The most troublesome resource has been storage, and it has been for quite a while. While SSDs do wonders for many applications, the balance between speed, capacity & cost isn’t as sweet as for our other resources.

In some environments where I’m active they have a need for both capacity and IOPS, and as such they are in luck: next to caching, a lot of spindles still equate to more IOPS. For testing the boundaries of one resource you need to make sure none of the others hit theirs. That’s not easy, as for performance testing you can’t always have a truckload of spindles on a modern high speed SAN available.

RAMDisk to ease the IOPS bottleneck

To see how well the 10Gbps cards with and without teaming, Multichannel and RDMA are behaving, and what these configurations are capable of, I wanted to take as much of the disk IOPS bottleneck out of the equation as possible. Apart from buying a Violin system capable of doing +1 million IOPS, which isn’t going to happen for some lab work, you can perhaps get the best possible IOPS by combining some local SSDs and a RAMDisk. A RAMDisk is spare memory used as a virtual disk. It’s very fast and cost effective per IOPS. But capacity wise it’s not the world’s best, let alone most cost effective, solution.


I’m using the free RAMDisk software provided by StarWind. I chose this as they allow for large RAMDisks. I’m using ones of 54GB right now to speed test copying fixed size VHDX files. It installs flawlessly on Windows Server 2012 and it hasn’t caused me any issues. Throw in some SSDs on the servers for where you need persistence and you’re in business for some very nice lab work.
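
As an illustration, once the RAMDisk surfaces as a volume you can prepare it for the copy tests. A quick sketch, assuming it shows up as drive R: (the drive letter, size and paths are placeholders):

    # Format the RAMDisk volume with a 64K NTFS allocation unit size
    Format-Volume -DriveLetter R -FileSystem NTFS -AllocationUnitSize 65536

    # Create a fixed size VHDX on the RAMDisk to use as the copy test payload
    New-VHD -Path "R:\CopyTest.vhdx" -SizeBytes 40GB -Fixed

    # Copy it over the SMB 3.0 / RDMA path and time it
    Measure-Command { Copy-Item "R:\CopyTest.vhdx" "\\TargetServer\RamDiskShare\" }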

[Screenshot: StarWind RAM Disk]

You also need to be aware that it doesn’t persist data when you reboot the system or lose power. This is not an issue if all we are doing is speed testing, as we don’t care. Otherwise you’ll need to find a workaround and realize that those “flush the data to persistent storage” options aren’t foolproof or super fast; the SSDs do help here.

You have to register, but the good news is that they don’t spam you to death at all, which I find cool. As said, the tool is free, works with Windows Server 2012 and allows for larger RAMDisks where other free ones are often way too limited in size.

It has allowed me to do some really nice testing. Perhaps you want to check this out as well. WARNING: The picture below is a lab setup … I’m not a magician and it’s not the kind of IOPS I have all over the data centers with 4 cheapo SATA disks I touched with my special magic pixie dust.

[Screenshot: lab setup IOPS results]

With #WinServ 2012 storage costs/performance/capacity are the only thing limiting you  http://twitter.yfrog.com/mnuo9fp #SMB3.0 #Multichannel

Some quick tests with a 52GB NTFS RAMDisk formatted with a 64K NTFS Allocation unit size.

[Screenshots: benchmark results on the 52GB RAMDisk]

I also tested with another free tool, SoftPerfect RAM Disk FREE. It performs well but I don’t get to see the RAMDisk in the Windows Disk Management GUI, at least not on Windows Server 2012. I have not tested with W2K8R2.

[Screenshots: results with a 4K versus a 64K NTFS allocation unit size]