Windows Server 2016 RDMA and the Hyper-V vSwitch – Part II

Introduction

In part I this article I demonstrated that some of the rules in regards to SMB Direct and the Hyper-V vSwitch as we know them for Windows Server 2012 R2 have changed with Windows Server 2016. We focused on the fact that you can expose RDMA to a vNIC exposed to the management OS created on a vSwitch. This means that while in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed RDMA capable NIC, this is no longer true with Windows Server 2016.

While a demo with a vSwitch on a single NIC as we did in part I is nice it’s unlikely you’ll use this often if at all in the real world? Here we require redundancy and that means NIC teaming. To do so we normally use a vSwitch created on a native Windows NIC team. But a native NIC teaming does not expose RDMA capabilities. And as such a vSwitch created against a Windows native NIC team cannot leverage RDMA either. Which was the one of the reasons why a fully converged scenario in Windows Server 2012 R2 was too limited for many scenarios. Loss of RSS on the vNIC exposes to the management OS was another. The solution to this in Windows Server 2016 Hyper-V comes with Switch Embedded Teaming (SET). Now using SET in each and every situation might not be a good idea. It depends. But we do need to know how to configure it. So let’s dive in.

Switch Embedded Teaming (SET) exposes RDMA to the vSwitch

Switch Embedded Teaming (SET in Windows Server 2016 allows multiple identical (make, model, firmware, drivers to be supported) NICs to be used or “teamed” within the vSwitch itself. The important thing to note here this does not use windows NIC teaming or LBFO (Load Balancing and Fail Over).

SET is the future and is needed or use with the Network Controller and Software Defined Networking in Windows. SET can also be used without these technologies. While today it supports a good deal of the capabilities of native Windows NIC teaming it also lacks some of them. In general SET is meant for full or partial converged scenarios with 10GBps or better NICs, not 1Gbps networking in a (hyper)converged Hyper-V scenario.

Please see New Windows Server 2016 NIC and Switch Embedded Teaming User Guide for Download for more information as there is just too much to tell about it.

Setting it up

We start out with a 2-node cluster where each node has 2 RDMA NICs (Mellanox ConnectX-3) with RDMA enabled and DCB configured. Live migration of VMs between those nodes works over SMB Direct works. All NIC are on the same subnet 172.16.0.0/16 (thanks to Window Server 2016 Same Subnet Multichannel) and are on VLAN 110. In Failover Cluster Manager (FCM) that looks like below.

clip_image002

We’ll now use the rNICs to create a Switch Embedded Team.

clip_image004

Note that the teaming mode is switch independent, the only option supported with SET in Window Server 2016.

clip_image006

This also gives us a vNIC exposed to the management OS (default)

clip_image008

This is also visible as a vNIC in the mamagement OS called “vEthernet (RDMA-SET-vSwitch)”

clip_image010

This will be used to manage the host and to make its purpose clear we’ll rename it.

We’ll create 2 separate management OS vNICs for the RDMA traffic later. For now, we want the HOST-MGNT vNIC to have connectivity to the LAN and for that we need to tag it with VLAN 10.

image

The vNIC actually “inherited” the IP configuration of one of our physical NICs and we need to change that to either DHCP or a correct LAN IP address and settings.

clip_image014

You can use the code below to set the HOST-MGNT vNIC to DHCP

To finalize the HOST-MGNT vNIC configuration we enable priority tagging on it. If we don’t we won’t see any traffic other than SMB Direct tagged at all!

clip_image016

Before we go any further we’ll remove the VLAN tag from the rNICS as we don’t want it interfering with egress traffic being tagged by them or ingress traffic being filtered because it doesn’t match the VLAN ID on the rNICs.

From here on we’ll focus on the RDMA capable vNICs well create and use for SMB traffic.

We create 2 vNIC on the management OS for SMB Direct traffic.

clip_image018

Now these vNIC need an IP address, this can be in the same subnet because we have Windows Server 2016 SMB multichannel.

We than also need to put the vNICs in the correct VLAN. Remember that DCB / PFC priority tagging needs tagged VLAN so carry that priority. Right now, we can see that these are untagged.

clip_image020

So we tag them with VLAN ID 110

clip_image022

We enable jumbo frames on the vNICs. Remember that the physical NICs in the SET have jumbo frames enabled as well.

clip_image024

Normally all traffic that is originated from vNICs gets any QOS values set to zero. There is one exception to this and that’s SMB Direct traffic. SMB Direct traffic gets tagged with its QoS priority and that is not reset to 0 as it bypasses the vSwitch completely. But if we set other priorities on other types of traffic for DCB PFC and or ETS that passes over these vNICs we must enable priority tagging on these NICs as well or they’ll be stripped away.

clip_image026

The association of the vNIC to pNICs is random. This also changes during creation and destruction (disabling NICs, restarting the OS). We can map a vNCI to a particular pNIC. This prevents suboptimal use of the available pNICs and provides for a well know predictable path of the traffic. We do this with the below PowerShell commands.

Finally, last but not least, we should enable RDMA on our two vNICs or SMB Direct will not kick in at all.

Right now, we have it all configured correctly on one node of our 2-node cluster. The SMB network look like this now:

clip_image028

The cluster now looks like below.

clip_image030

We can live migrate VMs over SMB Direct in this mixed scenario where one node has pNICs RDMA NICs, 1 node has SET with vNICs for RMDA.

clip_image032

When looking at this in report mode we clearly see Node-A send SMB Direct traffic (tagged with priority 4, green) over its RDMA enabled SET vNICs to Node-B which still has a complete physical rNIC set up (blue).

clip_image034

As you can see in the screen shots above we now have RDMA / SMB Direct working with SET / RDMA vNICs on one node (Node-A) and pure physical RDMA NICs on the other (Node-B). This gives us bandwidth aggregation and redundancy. To complete the exercise, we configure SET on the other node as well. But it’s clear SET and RDAM will also work in a mixed environment.

We’ll discuss some details about certain aspects of the vNIC configuration in future articles. Things like the why and how of Set-VMNetworkAdapterTeamMapping and the use of -IeeePriorityTag. But for now, this is it. Go try it out! It’s the basis for anything you’ll do with SDNv2 in W2K16 and beyond.

Import of RD Gateway configuration file with policies referencing local resources wipes all policies clean!

Introduction

When you have Windows Server 2016 RD Gateway server and you expect to be able to import a configuration XML file you’ll might find yourself in a pickle when you are also using local resources. Because the import of RD Gateway configuration file with policies referencing local resources wipes all policies clean! With local resources I mean local user accounts and groups. These are leveraged more than I imagined at first.

When does it happen?

In the past I have blogged about migrating RD Gateway servers that contain policies referencing local resources here: Fixing Event ID 2002 “The policy and configuration settings could not be imported to the RD Gateway server “%1” because they are associated with local computer groups on another RD Gateway server”.

We used to be able to use the trick of making sure the local resources exist on the new server (either by recreating them there via the server migration wizard or manually) and changing the server name in the exported configuration XML file  to successfully import the configuration. That no longer works. You get an error.Import of RD Gateway configuration file with policies referencing local resources wipes all policies clean!

As far as migrations go from older versions, they work fins as long as you don’t have policies with local resources. Otherwise you’d better do an in place upgrade or recreate the resources & policies on the new servers. The method described in my blog is not working any more. That’s to bad. But it gets worse.

Import of RD Gateway configuration file with policies referencing local resources wipes all policies clean!

As said,it doesn’t end there. The issue is there even when you try to import the configuration on to the same server you exported it from.That’s really bad as it a quick way to protect against any mistakes you might make, and allows to get back to the original configuration.

What’s even worse, when the import fails it wipes ALL the policies in the RD Gateway Server => dangerous! So yes, the import of RD Gateway configuration file with policies referencing local resources wipes all policies clean!

Precautions

Only a backup or a checkpoint can save your then (or recreate the all manually)! Again this is only when the exported configuration file references local resources! The fasted way to clean out an RD Gateway configuration on Windows Server 2016 is actually importing a configuration export which contains a policy referring to local resource. Ouch! I’m not aware of a fix up to this date.

For now you only protection is a checkpoint or a backup. Depending on where and how you source your virtual machines you might not have access to a checkpoint.

You have been warned, be careful.

Migrate a Windows Server 2012 R2 AD FS farm to a Windows Server 2016 AD FS farm

Introduction

I recently went through the effort to migrate a Windows Server 2012 R2 AD FS farm to a Windows Server 2016 AD FS farm. For this exercise the people in charge wanted to maintain the server names and IP addresses. By doing so there is no need for changes in the Kemp Technologies load balancer.

Farm Behavior Level Feature

In Windows Server 2016 ADFS we now have a thing called  the Farm Behavior Level (FBL)  feature (FBL). It  determines the features that the AD FS farm can use. Optimistically you can state that the FBL of a Windows Server 2012 R2 AD FS farm is at the Windows Server 2012 R2 FBL.

The FBL feature and mixed mode now makes a “trick” many used to upgrade a ADFS farm to AD FS Windows Server 2012 R2 organizations without the hassle of setting up a new farm and exporting / importing the configuration possible. looking to upgrade to Windows Server 2016 will not have to deploy an entirely new farm, export and import configuration data. Instead, they can add Windows Server 2016 nodes to an existing farm while it is online and only incur the relatively brief downtime involved in the FBL raise.

We can add a secondary Windows Server 2016 AD FS server to a Windows Server 2012 R2 farm. The farm will continue operating at the Windows Server 2012 R2 FBL. This is “mixed mode” so to speak. There is no need to move all the node to the same version immediately.

As long as you are in mixed mode you don’t get the benefits of the new capabilities and features in Windows Server 2016 ADFS. These are not available.

Administrators can add new, Windows Server 2016 federation servers to an existing Windows Server 2012 R2 farm. As a result, the farm is in “mixed mode” and operates the Windows Server 2012 R2 farm behavior level.

When all Windows Server 2012 R2 nodes have been removed form the farm and all nodes are Windows Server 2016 you can raise the FBL level. This results in the new Windows Server 2016 ADFS features being enabled and ready for configuration and use.

The Migration Path Notes

WARNING

You cannot in place upgrade a Windows Server 2012 R2 ADFS Farm node to Windows Server 2016. You will need to remove it from the farm and replace it with a new Windows Server 2016 ADFS node.

Note

In this migration we are preserving the node names and IP addresses. This means the load balancer needed no configuration changes. So in that respect this process is different from what is normally recommended.

This is a WID based deployment example. You can do the same for an SQL based deployment.

The FBL approach is only valid for a migration from Windows Server 2012 AD FS to Windows Server 2016 AD FS. A migration from AD FS 2.0 or 2.1 (Windows Server 2008 R2 or Windows Server 2012) requires the use of the Export-FederationConfiguration and Import FederationConfiguration as before.

Also see https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-fs/overview/ad-fs-2016-requirements

Step by Step

  • Start with the secondary nodes. For each of them make sure you have the server name and the IP configuration.
  • Make sure you have the Service Communications SSL cert for your AD FS and the domain or managed service account name and password.
  • Make sure you have an ADFS configuration backup and also that you have a backup or an export (cool thing about VMs) of the VMs for rapid recovery if needed.

Remove the ADFS role via Server management

image

  • Shut down the VM
  • Edit the setting and remove the OS VHDX. Delete the file (you have a backup/export)
  • Copy your completely patched and syprepped OS VHDX with Windows Server to the location for this VM. Rename that VHDX to something sensible like adfs2disk01.vhdx.
  • Edit the settings and add the new sysprepped OS VHDX to the VM. Make sure that the disk is 1st in the boot order.
  • Start the VM
  • Go through the mini wizard and log in to it.
  • Configure the NIC with the same setting as your old DNS Server
  • Rename the VM to the original VM name and join the domain.
  • Restart the VM
  • Login to the VM and Install ADFS using Add Roles and Features in Server Manager

image

  • When done configure ADFS

image

  • Select to add the node to an existing federation farm

image

  • Make sure you have an account with AD admin permissions

image

  • Tell the node what primary federation server is

image

  • Import your certificate

image

  • Specify the ADFS Service account and its password

image

  • You’re ready to go on

image

  • If any prerequisites don’t work out you’ll be notified, we’re good to go!

image

  • Let the wizard complete all it steps

image

  • When the configuration is done you need to restart the VM to complete adding the node to the ADFS farm.

image

  • Restart your VM and log back in. When you open up ADFS  you’ll see that this new Windows Server 2016 node is a secondary node in your ADFS Farm.

image

  • Note that from a load balancer perspective nothing has changed. They just saw the node go up and down a few times; if they were paying attention at all that is.
  • Now repeat the entire process for all you secondary ADFS Farm nodes. When done we’ll swap the primary node to one of the secondary nodes. This is needed so you can repeat the process for the last remaining node in the farm, which at that time needs to be a secondary node. In the example of our 2 node farm we swap the roles between ADFS1 and ADFS2.

image

  • Verify that ADFS2 is the primary node and if so, repeat the migration process for the last remaining node (ADFS1) in our case.
  • Once that’s been completed we swap them back to have exactly the same situation as before the migration.

image

  • On the primary node run Get-AdfsFarmInformation (a new cmdlet in Windows Server 2016 R2).

image

  • You’ll see that our current farm behavior is 1 and our 2 nodes (all of them Windows Server 2016) are listed. Note that any nodes still on Windows Server 2012 R2 would not be shown.

WARNING: to raise the FBL to Windows Server 2016 your AD Schema needs to be upgraded to at least Windows Server 2016 version 85 or higher. This is also the case form new AD FS farm installations which will be at the latest FBL by default. My environment is already 100% on Windows Server 2016 AD. So I’m good to go. If yours is not, don’t forget to upgrade you schema. You don’t need to upgrade your DC’s unless you want to leverage Microsoft Password authentication, then you need al least 1 Windows Server 2016 domain controller. See https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-fs/overview/ad-fs-2016-requirements

  • As we know all our nodes are on Windows Server 2016 we can raise our Farm Behavior Level  (FBL) by running Invoke-AdfsFarmBehaviorLevelRaise

image

  • Just let it run it has some work to do including creating a new database.

image

  • It will tell you when it’s done and point out changes in the configuration.

image

  • Now run Get-AdfsFarmInformation again

image

  • Note that the  current farm behavior is 3 and our 2 nodes (all of them Windows Server 2016) are listed. Note that  if any nodes had still been on Windows Server 2012 R2 they would have been kicked from the farm and should be removed form the load balancer.

image

PS: with some creativity and by having a look at my blog on https://blog.workinghardinit.work/2016/11/28/easily-migrating-non-ad-integrated-dns-servers-while-preserving-server-names-and-ip-addresses/ You can easily figure out how to add some extra steps to move to generation 2 VMs while you’re at it if you don’t use these yet.

In Place upgrades of cluster nodes to Windows Server 2016

You will all have heard about rolling cluster upgrades from Windows Server 202 R2 to Windows Server 2016 by now. The best and recommend practice is to do a clean install of any node you want to move to Windows Server 2016. However an in place upgrade does work. Actually it works better then ever before. I’m not recommending this for production but I did do a bunch just to see how the experience was and if that experience was consistent. I was actually pleasantly surprised and it saved me some time in the lab.

Today, if you want to you can upgrade your Windows Server 2012 R2 hosts in the cluster to Windows Server 2016.

The main things to watch out for are that all the VMs on that host have to be migrated to another node or be shut down.

You can not have teamed NICs on the host. Most often these will be used for a vSwitch, so it’s smart and prudent to note down the vSwitch (or vSwitches) name and remove them before removing the NIC team. After you’ve upgraded the node you can recreate the NIC team and the vSwitch(es).

Note that you don’t even have to evict the node from the cluster anymore to perform the upgrade.

image

I have successfully upgrade 4 cluster this way. One was based on PC hardware but the other ones where:

  • DELL R610 2 node cluster with shared SAS storage (MD3200).
  • Dell R720 2 node cluster with Compellent SAN (and ancient 4Gbps Emulex and QLogic FC HBAs)
  • Dell R730 3 node cluster with Compellent SAN (8Gbps Emulex HBAs)

Naturally all these servers were rocking the most current firmware and drives as possible. After the upgrades I upgraded the NIC drivers (Mellanox, Intel) and the FC drivers ‘(Emulex) to be at their supported vendors drivers. I also made sure they got all the available updates before moving on with these lab clusters.

Issues I noticed:

  • The most common issue I saw was that the Hyper-V manager GUI was not functional and I could not connect to the host. The fix was easy: uninstall Hyper-V and re-install it. This requires a few reboots. Other than that it went incredibly well.
  • Another issue I’ve seen with upgrade was that the netlogon service was set to manual which caused various issues with authentication but which is easily fixed. This has also been reported here. Microsoft is aware of this bug and a fixed is being worked on.

 

.