SMB Direct With RoCE in a Mixed Switches Environment

I’ve been setting up a number of Hyper-V clusters with Mellanox ConnectX-3 Pro dual port 10Gbps Ethernet cards. These Mellanox cards provide a nice number of queues (128) for DVMQ and also give us RDMA/SMB Direct capabilities for CSV & live migration traffic.

Mixed Switches Environments

Now RoCE and DCB are a learning curve for all of us and not for the faint of heart. DCB configuration is non-trivial, certainly not across multiple hops and different switches. Some say it’s to be avoided or can’t be done.

You can only get away with a single pair of (uniform) switches in smaller deployments. On top of that I’m seeing more and more different types of switches being used to optimize value, so doing this is not just a lab exercise. Combine this with the fact that DCB is an unavoidable technology in networking, unless it gets replaced with something better and easier, and you might as well try and learn. So I did.

Well, right now I’m successfully seeing RoCE traffic going across cluster nodes spread over different racks in different rows at excellent speeds. The core switches are DELL Force10 S4810s and the rack switches are PowerConnect 8132Fs. By borrowing an approach from spine/leaf designs, this setup delivers bandwidth where they need it at a price point they can afford. They don’t need more expensive switches for the rack or the core, as these do support DCB and give the port count needed at the best price point. This isn’t supposed to be the pinnacle of non-blocking network design. Nope, but what’s available & affordable today in your hands is better than perfection tomorrow. On top of that this is a functional learning experience for all involved.

We see some pause frames being sent once in a while and this doesn’t impact speed that much. It does guarantee lossless traffic, which is what we need for RoCE. When we live migrate 300GB worth of memory across the nodes in the different racks we get great results. It varies a bit depending on the load the switches & switch ports are under, but that’s to be expected.
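
For reference, the host-side part of such a DCB setup comes down to a handful of NetQoS cmdlets. This is only a minimal sketch: the priority (3), the 50% ETS share and the adapter name are assumptions for illustration, and the Force10/PowerConnect ports need matching PFC & ETS settings, which are vendor specific:

# Sketch only: priority 3, 50% ETS and the adapter name are example values, not gospel
Install-WindowsFeature Data-Center-Bridging
# Tag SMB Direct traffic with priority 3 and make that priority lossless (PFC)
New-NetQosPolicy "SMBDirect" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7
# Reserve bandwidth for that traffic class
New-NetQosTrafficClass "SMBDirect" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
# Enable DCB on the Mellanox port and keep the host settings instead of accepting DCBX from the switch
Enable-NetAdapterQos -Name "SLOT 2 Port 1"
Set-NetQosDcbxSetting -Willing $false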

Now tests have shown us that we can live migrate just as fast with non-RDMA 10Gbps as we can with RDMA, leveraging “only” Multichannel. So why even bother? The name of the game is low latency and preserving CPU cycles for SQL Server or storage traffic over SMB3. Why? We could just buy more CPUs/cores. Great, easy & fast, right? But then SQL licensing comes into play and it becomes very expensive. Also, storage scenarios under heavy load are not where you want to drop packets.

Will this matter in your environment? Great question! It depends on your environment. Sometimes RDMA is needed/warranted, sometimes it isn’t. But the Mellanox cards are price competitive, so why not test and learn, right? That’s time well spent and prepares you for the future.

But what if it goes wrong … ah well, if the nodes fail to connect over RDMA you still have Multichannel, and if the DCB stuff turns out not to be what you need or can handle, turn it off and you’ll be good.
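
Checking whether SMB is actually using RDMA (and falling back to plain Multichannel when it can’t) is easy enough from PowerShell; a quick sketch, with the adapter name as an example:

# Is the NIC RDMA capable & enabled, and is SMB actually using it?
Get-NetAdapterRdma
Get-SmbClientNetworkInterface | Select-Object FriendlyName, RdmaCapable, RssCapable
Get-SmbMultichannelConnection
# And if RDMA/DCB turns out not to be for you, switch it off per adapter
Disable-NetAdapterRdma -Name "SLOT 2 Port 1"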

RoCE stuff to test: Routing

Some claim it can’t be done reliably. But hey, they said that for non-uniform switch environments too Winking smile. So will it all fall apart and will we need to standardize on iWarp in the future? Maybe, but isn’t DCB the technology used for lossless, high performance environments (FCoE but also iSCSI), so why would iWarp not need it? Sure, it works without it quite well. So does iSCSI, right, up to a point? I see these comments a lot more from virtualization admins that have a hard time doing DCB (I’m one, so I do sympathize) than I see them from hard core network engineers. As I have RoCE cards and they have become routable with the latest firmware and drivers, I’d love to try and see if I can make RoCE v2 or Routable RoCE work over different types of switches, but unless someone is going to sponsor the hardware I can’t even start doing that. Anyway, lossless is the name of the game, whether it’s iWarp or RoCE. Who knows what we’ll be doing in 5 years? 100Gbps iWarp & iSCSI both covered by DCB vNext while FC, FCoE, InfiniBand & RoCE have fallen into oblivion? We’ll see.

Hyper-V Guest Protected Network Testing Tip

I’ve been pinged a few times over the years by people saying that the new protected network feature does not work for them. This setting is set per vNIC of the virtual machine.


The issue lies in how & what people test, bar any number of other reasons why a live migration might not start or complete. What people tend to do is disable a NIC to which the vSwitch is connected. But Protected Network is about media sense loss detection of network disconnects, and this requires the NIC to actually be there and enabled. Remember, we’re talking about the NIC on the host connected to the virtual switch. A physical link failure here, meaning that the virtual switch to which the protected virtual network adapter is connected no longer has network connectivity, will lead to all the VMs with the protected network setting enabled being live migrated to another node in the cluster that still has a connected virtual switch for the same network. The latter condition is there to avoid senseless virtual machine migrations to other nodes that might also have lost connectivity due to a failed physical switch.

So the point is that testing by disabling the NIC in the OS will not do. You need to unplug the cables to the virtual switch, disable the port on the switch, or even shut down the switch (a bit drastic).
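
You can also quickly check from the host which vNICs have the setting enabled; a one-liner sketch, assuming the setting is exposed as the ClusterMonitored property on your build:

# Assumption: the Protected network checkbox surfaces as the ClusterMonitored property; verify on your build
Get-VMNetworkAdapter -VMName * | Select-Object VMName, Name, SwitchName, ClusterMonitored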

Do note that it can take a little time for the live migration to kick in; it varies a bit, but it beats having to wait for the issue to be resolved. You’ll see event id 1255 logged when the VMs lose network connectivity.
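
If you’d rather pull those events out with PowerShell than hunt through the event viewer, a small sketch like this will do; the channel list is an assumption, so adjust it to wherever event 1255 lands on your build:

# The channel list below is an assumption; point it at the log where event 1255 shows up for you
$channels = Get-WinEvent -ListLog 'System', '*Hyper-V*', '*FailoverClustering*' -ErrorAction SilentlyContinue
foreach ($channel in $channels) {
    Get-WinEvent -FilterHashtable @{ LogName = $channel.LogName; Id = 1255 } -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, LogName, Message
}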

In this day and age, with NIC teaming to redundant switches & the fact that you might be using converged networking, these tests aren’t as simple as you might think. Also, don’t pull out all of the cables used for clustering if you want the cluster to be able to help you out here with a live migration. When the other cluster nodes can’t talk to the node you’re testing in any way, it will be kicked out of the cluster, and the VMs will go down, be moved to another node and started there. This might seem obvious, but if you are using a teamed 10Gbps solution in a converged setup, that is exactly what such a test can cause.
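
Before you start pulling cables it’s also worth double checking which networks the cluster actually relies on for its own communication; a quick look with the FailoverClusters module (assuming it’s installed on the node) keeps you from isolating the node completely:

# Role 1 = cluster only, 3 = cluster and client, 0 = none
Get-ClusterNetwork | Select-Object Name, Role, State
Get-ClusterNetworkInterface | Select-Object Node, Network, State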

Another thing to note is that if you have a virtual switch with a dedicated backup network exposed to hosts & VMs that can tolerate downtime, you might want to disable protected networks on that vNIC, as you don’t want to live migrate the VMs off when that network has an issue. It all depends on your needs & tastes.

Last but not least please behave, and don’t do anything silly in production when testing this. Be careful in your testing.

Hot add/remove of network adapters and enabling device naming in Windows Server Hyper-V

One of the cool new features in Windows Server vNext Hyper-V (in Technical Preview at the moment of writing) is that you gain the ability to hot add and remove NICs. That might not sound too important to the uninitiated in the fine art of virtualization & clouds. But it is. You see, anything you can do to a VM configuration wise that does not require downtime is good. That’s what helps shift the needle of high availability towards that holy grail of continuous availability.

On top of that, the names of the network adapters are now exposed to the guest, which is also great. It’s become a lot easier to automate the VM network configuration.

Hot adding NICs can be done via the GUI and PoSh.


But naming the network adapter seems a PowerShell only game for now (nothing hard, no sweat). This can be done during creation of the network adapter. Here I add a NIC to VM RAGNAR connected to the ISCSI-GUEST switch & named ISCSI.

Add-VMNetworkAdapter -VMName RAGNAR -SwitchName ISCSI-GUEST -Name ISCSI

Now I want this name to be reflected in the VM’s NIC configuration properties. This is done by enabling device naming. You can do this via the GUI or PoSh.

Set-VMNetworkAdapter -VMName RAGNAR -Name ISCSI -DeviceNaming On

That’s it.
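
Hot removing the adapter again, while the VM is running, is just as simple; a sketch using the same VM and adapter name as above:

Remove-VMNetworkAdapter -VMName RAGNAR -Name ISCSI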


So now let’s play with our existing network adapter “Network Adapter”, which connects our Hyper-V guests to the LAN via the HYPER-V-GUESTS switch. Can you rename it? Yes, you can. In PoSh run this:

Rename-VMNetworkAdapter -VMName RAGNAR -Name "Network Adapter" -NewName "LAN"

And that’s it. If you refresh the settings of your VM or reopen them, you’ll see the name change.


The one thing that I see in the Tech Preview is that I need to reboot the VM to see the name change reflected inside the VM in the NIC configuration under advanced properties, called “Hyper-V Network Adapter Name”. Existing ones show their old name and new ones are empty until then.
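
Once the VM has been rebooted you can read that value from inside the guest as well; a quick check, using the display name mentioned above (which may differ between builds):

# Run inside the guest; the display name is the one mentioned above and may vary per build
Get-NetAdapterAdvancedProperty -DisplayName "Hyper-V Network Adapter Name" | Select-Object Name, DisplayValue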


Two important characteristics to note about enabling device naming

You’ll notice that you can edit this field in the NIC configuration inside the VM, but the change doesn’t move up the stack into the settings of the VM. Security wise this seems logical to me, and it’s not intended to work. It’s a GUI limitation that the field cannot be disabled for editing, but no one can get “funny” by renaming the Ethernet adapter in the VM’s settings from inside the guest Winking smile

Do note that this is not exactly the same as Consistent Device Naming in Windows 2012 or later. It’s not reflected in the name of the NIC in the GUI; these are still Ethernet, Ethernet 2, … Device naming is mainly meant to let you identify the adapter assigned to the VM from inside the VM, mainly for automation. You can name the NIC in the guest whatever works best for you and you’ll never lose the correlation between the network adapter in your VM settings and the Hyper-V Network Adapter Name in the NIC configuration properties. In that respect it’s a bit more solid/permanent: even if someone found it funny to rename all vNICs to random names, you’re still OK with this feature.
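
If you want to put that correlation to work, here’s a small automation sketch, run inside the guest and assuming the same advanced property display name as above, that renames every guest NIC after the device name Hyper-V exposed for it:

# Sketch: align the guest NIC names with the adapter names set in the VM configuration
Get-NetAdapterAdvancedProperty -DisplayName "Hyper-V Network Adapter Name" |
    Where-Object { $_.DisplayValue } |
    ForEach-Object { Rename-NetAdapter -Name $_.Name -NewName $_.DisplayValue }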

That’s it, off you go! Download the Technical Preview bits from MSDN, start exploring and learning. Knowledge is seldom a bad thing Winking smile

Configuring timestamps in logs on DELL Force10 switches

When you get your Force10 switches up and running and are about to configure them, you might notice that, when looking at the logs, the default timestamp is the time passed since the switch booted. During configuration, looking at the logs can be very handy for seeing what’s going on as a result of your changes. When you’re purposely testing, it’s not too hard to see what events you need to look at. When you’re working on stuff or troubleshooting after the fact, things get tedious to match up. So one thing I like to do is set the timestamp to reflect the date and time.

This is done by setting the timestamps for the logs to datetime in configuration mode. By default it uses uptime, which logs the events as time passed since the switch started, in weeks, days and hours.

service timestamps [log | debug] [datetime [localtime] [msec] [show-timezone] | uptime]

I use: service timestamps log datetime localtime msec show-timezone

F10>en
Password:
F10#conf
F10(conf)#service timestamps log datetime localtime msec show-timezone
F10(conf)#exit

Don’t worry if you see a $ sign appear to the left or right of your line like this:

F10(conf)##$ timestamps log datetime localtime msec show-timezone

it’s just that the line is too long and your prompt is scrolling Winking smile.

This gives me the detailed information I want to see. Opting to display the time zone also helps me correlate the events to other events and times on different equipment that might not have the time zone set (you don’t always control this and perhaps it can’t be configured on some devices).


As you can see, the logging is now very detailed. The logs on this switch were last cleared before I switched the timestamps from uptime to datetime, which is evident from the entry “last logging buffer cleared: 3w6d12h”.

Voila, that’s how you get to see the times in your logs, which is a bit handier if you need to correlate them to other events.