RDMA Over RoCE With DCB Requires Tagged Non Default VLANs

It’s DCB That Requires This

For those of you who are experimenting with the RoCE variant of RDMA for SMB Direct in Windows Server 2012 (R2), make sure you have a VLAN tag in your configuration if this is more than a simple RDMA over two NICs. The moment you get DBC with PFC & ETS involved you’ll need non default tagged VLANs. Do note that PFC alone is good enough, ETS is strictly speaking not a requirement, but I’d consider doing it if you can.

With Enhanced Transmission Selection (ETS) the network traffic type is classified using the priority value in the VLAN tag of the Ethernet frame. The priority value is the Priority Code Point (PCP), which is described in the IEEE 802.1Q specification and uses a 3-bit field in the VLAN tag with eight possible priority values (0 to 7).

Priority-based Flow Control (PFC) allows to individually pause priorities of tagged traffic and helps to provide lossless or “no drop” behavior for a certain priority at the receiving port. As  above, each frame transmitted by a sending port is tagged with a priority value (0 to 7) in the VLAN tag. So for the traffic pause and resume functionality to work we need a VLAN tag to carry the priority value.

Does It Work Without?

But you’ll tell me that, as you may be lacking a DCB capable switch for lab purposes, you used a direct cable between your two RoCE NICs. And guess what RoCE, might have indeed worked for you without a VLAN tag. You can test & get a feel for what RoCE/RDMA can do for you with just the NICs. But as there is no switch involved you’re not using DCB for PFC/ETS and without that the need for the tagged VLAN isn’t there. Also see https://blog.workinghardinit.work/2013/05/03/smb-direct-roce-does-not-work-without-dcbpfc/.

So there you go. Design your RoCE/RDMA network based on DCB with PFC( and ETS) and not just on the tests with an direct cable or you might miss a few details that are quite important. Happy testing!

Learn & Evaluate Windows Server 2012 R2 Preview

If you are anything like a lot of people I know and myself you will be very eager to start testing the new features & capabilities of Windows Server 2012 R2 that is now available for testing purposes in preview. Now Microsoft has launched their IT Pro Summer Grand Prix campaign that might get you something extra next to the knowledge you will gain.


If you are going to do this why not surf to http://www.microsoft.com/nl-be/technet/summer-grandprix/#track1 and dive into Track 1 of the IT Pro Summer Grand Prix to download the public preview of Windows Server 2012 R2. image

When you do so, feel free to leave you contact information and be eligible to win a rather exclusive Windows Server headset.

Stay tuned because the next track will be all about System Center (June 15th) that holds even bigger benefits for being an early evaluator of the R2 wave.

Happy testing, learning and playing! One thing I found is that Windows 2012 R2 installations are that fast it feels like driving a race car Smile


Teamed NIC Live Migrations Between Two Hosts In Windows Server 2012 Do Use All Members


Between this blog NIC Teaming in Windows Server 2012 Brings Simple, Affordable Traffic Reliability and Load Balancing to your Cloud Workloads which states TCP/IP can recover from missing or out-of-order packets. However, out-of-order packets seriously impact the throughput of the connection. Therefore, teaming solutions make every effort to keep all the packets associated with a single TCP stream on a single NIC so as to minimize the possibility of out-of-order packet delivery. So, if your traffic load comprises of a single TCP stream (such as a Hyper-V live migration), then having four 1Gb/s NICs in an LACP team will still only deliver 1 Gb/s of bandwidth since all the traffic from that live migration will use one NIC in the team. However, if you do several simultaneous live migrations to multiple destinations, resulting in multiple TCP streams, then the streams will be distributed amongst the teamed NICsand other information out their such as support forum replies it is dictated that when you live migrate between two nodes in a cluster only one stream is active and you will never exceed the bandwidth of a single team member. When running some simple tests with a 10Gbps NIC team this seems true. We also know that you can consume near to all of the aggregated bandwidth of the members in a NIC Team for live migration if you these conditions are met:

1. The Live Migrations must not all be destined for the same remote machine. Live migration will only use one TCP stream between any pair of hosts. Since both Windows NIC Teaming and the adjacent switch will not spread traffic from a single stream across multiple interfaces live migration between host A and host B, no matter how many VMs you’re migrating, will only use one NIC’s bandwidth.

2. You must use Address Hash (TCP ports) for the NIC Teaming. Hyper-V Port mode will put all the outbound traffic, in this case, on a single NIC.

When we look at these conditions and compare them to the behavior we expect from the various forms of NIC teaming in Windows 2012 this is a bit surprising as one might expect all member to be involved. So let’s take a look at some of the different NIC Teaming setups.

Any form of NIC teaming with Hyper-V Port Mode

This one is easy as condition 2 above is very much true. In all my testing with any NIC team configuration in the Hyper-V Port mode traffic distribution algorithms I have not been able to exceed 10Gbps. I have seen no difference between dependent static of LACP mode or switch independent (active-active) for this condition. As you can see in the screenshot below, the traffic maxes out at 10Gbps.



This is also demonstrated in the following screenshots taking with the resource manager where you can see only half of the bandwidth of the Team is being used.



Exceeding a single NIC team member’s bandwidth when migrating between 2 nodes

The first condition of the previous heading doesn’t seem true. In some easy testing with a low number of virtual machines and not too much memory assigned you never exceed the bandwidth of one 10Gbps NIC team member. So on the surface, with some quick testing it might seem that way.

But during testing on a 2 node cluster with dual port 10Gbps cards and I have found the following

Switch Dependent LACP and Static

  1. Take a sufficient number of large memory virtual machines to exceed the capacity of a single 10Gbps pipe for a longer time (that way you’ll see it in the GUI).
  2. Live migrate them all from host A to host B (“Pause” with “Drain Roles” or “select all” + “Move”)
  3. Note that with a 2 node cluster there is no possibility to Live Migrate to multiple nodes simultaneous. It’s A to or B or B to A or both at the same time.

Basically it didn’t take long to see well over 10Gbpsbeing used. So the information out there seems to be wrong. Yes we can leverage the aggregated bandwidth when we migrate from host A to host B as long as we have enough memory assigned to the VMs and we migrate a sufficient number of them. Switch dependent teaming, whether it is static or LACP does its job as you would expect.

Let’s think about this. The number of VMs you need to lie migrate to see > 10Gbpss used is not fixed in stone. Could it be that there is some intelligence in the Live Migration algorithm where it decides to set up multiple streams when a certain number of virtual machines with sufficient memory are migrated as the sorting is mitigated by the amount of bandwidth that can leveraged? Perhaps he VMMS.EXE kicks off more streams when needed/beneficial? Further experimenting indicates that this is not the case. All you need is > 1 VM being live migrated. When looking at this in task manager you do need them to be of sufficient memory size and/or migrate enough of them to make it visible. I have also tried playing with the number of allowed simultaneous live migrations to see if this has an effect but I did not find one (i.e. 4, 6 or 12).

It looks like it is more like one TCP/IP connection per Live Migration that is indeed tied to one NIC member. So when you live migrate VMS between two hosts you see one VM live migration go over 1 member and the other the other as static/LACP switch dependent teaming did does its job. When you do enough live migrations of large VMs simultaneously you see this in Task Manager as shown below. In this case as each VM live migration stream sticks to a NIC team member you do not need to worry about out of order packets impacting performance.


But to make sure and to prevent falling victim to the fall victim to the limits of the task manger GUI during testing this behavior we also used performance monitor to see what’s going on. This confirms we are indeed using both 10Gbps NIC team member on both the target and the source host server. This is even the case with 2 virtual machines Live Migration. As long as it’s more than one and the memory assigned is enough to make the live migration last long enough you can see it in Task Manager; otherwise it might miss it. Performance Monitor however does not..clip_image012



This is interesting and frankly a bit unexpected as the documentation on this subject is not reflecting this. However it IS in agreement with the NIC teaming documented behavior for other tan Live Migration traffic. We took a closer look however and can reproduce this over and over again. Again we tested both switch dependent static and LACP modes and we found the behavior to be the same.

Switch Independent with Address Hash

Let’s test Live Migration over switch independent teaming with Address Hash. Here we see that the source server sends on the two member of the NIC team but that the target server receives on only one. This is normal behavior for switch independent teaming. But from the documentation we expect that one member on the source server would send and one member on the target server would receive. Not so.

Basically with Windows Server 2012 this doesn’t give you any benefit for throughput. You are limited to the bandwidth of one member, i.e. 10Gbps.



Red is Total Bytes received on the target host. It’s clear only one member is being used. Green is Bytes Sent/Sec on the source server. As you can see both team members are involved. In a switch independent scenario the receiving side limits the throughput. This is in agreement the documented behavior of switch independent NIC teaming with Address hash.

Helpful documentation on this is Windows Server 2012 NIC Teaming (LBFO) Deployment and Management (A Guide to Windows Server 2012 NIC Teaming for the novice and the expert).

Hope this helps sort out some of the confusion.

Using RAMDisk To Test Windows Server 2012 Network Performance

I’m testing & playing different Windows Server 2012 & Hyper-V networking scenarios with 10Gbps, Multichannel, RDAM, Converged networking etc. Partially this is to find out what works best for us in regards to speed, reliability, complexity, supportability and cost.

Basically you have for basic resources in IT around which the eternal struggle for the prefect balance finds place. These are:

  • CPU
  • Memory
  • Networking
  • Storage

We need both the correct balance in capabilities, capacities and speed for these in well designed system. For many years now, but especially the last 2 years it very save to say that, while the sky is the limit, it’s become ever easier and cheaper to get what we need when it comes to CPU, Memory. These have become very powerful, fast and affordable relative to the entire cost of a solution.

Networking in the 10Gbps era is also showing it’s potential in quantity (bandwidth), speed (latency) and cost (well it’s getting there) without reducing the CPU or memory to trash thanks to a bunch of modern off load technologies. And basically in this case it’s these qualities we want to put to the test.

The most trouble some resource has been storage and it has been for quite a while. While SSD do wonders for many applications the balance between speed, capacity & cost isn’t that sweet as for our other resources.

In some environments were I’m active they have a need for both capacity and IOPS and as such they are in luck as next to caching a lot of spindles still equate to more IOPS. For testing the boundaries of one resource one needs to make sure non of the others hit theirs. That’s not easy as for performance testing can’t always have a truck load of spindles on a modern high speed SAN available.

RAMDisk to ease the IOPS bottleneck

To see how well the 10Gbps cards with and without Teaming, Multichannel, RDMA are behaving and what these configuration are capable of I wanted to take as much of the disk IOPS bottle neck out of the equation as possible. Apart from buying a Violin system capable of doing +1 million IOPS, which isn’t going to happen for some lab work, you can perhaps get the best possible IOPS by combining some local SSD and RAMDisk. RAMDisk is spare memory used as a virtual disk. It’s very fast and cost effective per IOPS. But capacity wise it’s not the worlds best, let alone most cost effective solution.


I’m using free RAMDisk software provided by StarWind. I chose this as they allow for large sized RAMDisks. I’m using the ones of 54GB right now to speed test copying fixed sized VHDX files. It install flawlessly on Windows Server 2012 and it hasn’t caused me any issues. Throw in some SSDs on the servers for where you need persistence and you’re in business for some very nice lab work.


You also need to be aware it doesn’t persist data when you reboot the system or lose power. This is not an issue if all we are doing is speed testing as we don’t care. Otherwise you’ll need to find a workaround and realize those ‘”flush the data to persistent storage” isn’t full proof or super fast, the SSDs do help here.

You have to register but the good news is that they don’t spam you to death at all, which I find cool. As said the tool is free, works with Windows Server 2012 and allows for larger RAMDisks where other free ones are often way to limited in size.

It has allowed me to do some really nice testing. Perhaps you want to check this out as well. WARNING: The below picture is a lab setup … I’m not a magician and it’s not the kind of IOPS I have all over the datacenters with 4 Cheapo SATA disks I touched my special magic pixie dust.


With #WinServ 2012 storage costs/performance/capacity are the only thing limiting you  http://twitter.yfrog.com/mnuo9fp #SMB3.0 #Multichannel

Some quick tests with a 52GB NTFS RAMDisk formatted with a 64K NTFS Allocation unit size.

image image

I also tested with another free tool from SoftPerfect ® RAM Disk FREE. It performs well but I don’t get to see the RAMDisk in the Windows Disk Management GUI, at least not on Windows Server 2012. I have not tested with W2K8R2.

NTFS Allocation unit size 4K NTFS Allocation unit size 64K
image image