High Availability has a price

We’ll go back to basics today. Some times the obvious, no matter how evident it is to us technologists, is challenged. Recently we got the remark that we were wasting CPU cycles by assigning to many vCPU to certain virtual machines on our Hyper-V cluster. So we had to explain that high availability has a price. On top of that we had to explain that things are not as wasteful as they seem in a virtual environment.

The case

Here’s one of the “offending” virtual machines. They assumed that we are wasting at least 50% of 12 CPUs.

image

This is one node in a dual node  load balancing (active-active) and highly available solution. This provides for zero down time during scheduled maintenance and very little downtime during system failures.

And here’s the second node (yes the 1st node has been down for scheduled maintenance more recently that node 2).

image

In a 2 node HA solution you need to make sure that one node can handle the entire workload. This is the absolute border line of an N+1 solution.  This means you can lose 1 node. N determines the number of nodes needed to guarantee an agreed upon service level and the number defines how many nodes failures can be tolerated before affecting the service.

In the above example there’s a need to have the CPU resources on each node to run the entire workload on one node without having an effect on the service. Therefore, when both nodes are up this might seem like a waste to the uninitiated. It is however a required to achieve the high availability goal. A constant CPU usage over 75 % will lead to a reduction in service quality in this case and even compromise the usability of the that service.

I did not even dive into the dangers of designing purely based on averages during this “explanation”. That was one step to much for the level of the discussion.

It’s also important to note that Hyper-V CPU scheduling is highly intelligent and is far less susceptible to the waste of CPU cycles via over provisioning of vCPU than some other solutions are or used to be. Knowing the capabilities and inner working of the technology used is also important in all this. More nodes generally also make “over provisioning” less of an issue. When you have 10 nodes and you lose 1, you only have lost 10% of the capabilities, not 33% like in a 3 node cluster.

Ideally you have 3 node so that even during an issue with one node you still maintain redundancy. However if you want acceptable services during a 2 node failure you’ll need to go to N+2, meaning that you need 2 nodes to provide the services and handle losing 2 nodes gracefully. In that case you’ll need 4 node and so on.  The larger the node count  the wiser it is to go to a N+2 model and ideally you’ll provide separate failure domains over which the nodes are distributed. An example of this is having a redundant geo-load balanced web farm of 32 virtual machine nodes spread over 2 locations and running on separate hardware failover clusters in each location. As you can see the higher the stakes and demands the faster the cost and potential complexity rises. You can offload some of the complexity by leveraging a public cloud like Azure, but the costs will still be there. There is no such thing as a free lunch, some are quite easy and affordable for what you get.

Conclusion

High Availability has a price. I did mention that already, right? To be able to keep your services running at a level that is both workable and acceptable to your customers and stake holders you will need to over provision to a degree. There is no magic here. When your solutions are being scrutinized by people with no real background, experience and context in high availability you might need to explain this.

Load balancing UDP for a RD Gateway farm with a KEMP Loadmaster

When implementing load balancing for RD Gateway we must take care not to forget load balancing the UDP traffic. Now your RDP Connection will still work over HTTPS alone if you forget this, but you’ll miss out on the benefits.

  • Better experience of bad, unreliable network connections with high packet loss
  • Better experience with high end graphics and in general a better graphical experience over WAN links.

As many people have load balanced their gateways since Windows Server 2008 (R2) when UDP was not into play yet and as things work without people might forget. The most important thing you need to know is that when leveraging UDP for RDP 8/8.1 the UDP session traffic has to leverage Direct Server Return (DSR) for the real servers configuration when we configure load balancing for a RD gateway farm with a KEMP Loadmaster. I’m focusing on the UDP part here, not the HTTPS part. That’s been done enough and the Kemp info on that is sufficient. The UDP part could do with some extra info.

The reason for this is that when UDP is leveraged for high end graphics we want to avoid sending all that graphical network traffic the load balancer. There is no real added value being performed there in this UDP use case but the load might get quite high. This is where DSR is leveraged wen configuring the Loadmaster. That means we also need to configure our real servers to uses Direct Return as the forwarding method. When you forget this you’ll lose UDP with RDP 8.1 but you might not notice immediately. If you’re not looking for it as the HTTP connection alone will let you connect and work, albeit with a reduced experience.

To read more on why it’s done this way (even if it seems complex and has drawbacks) see http://kemptechnologies.com/ca/white-papers/direct-server-return-it-you/ you’ll notice that for graphics it is great idea. By selecting Direct Server Return as the  forwarding method (see later) changes the destination MAC address of the incoming packet on the fly (very fast) to the MAC address of one of the real servers. When the packet reaches the real server it must think it owns the VS IP address, which it doesn’t. So we use the loopback adapter to let the real server reply as if it does but we don’t respond to ARPs as that would cause issues with the load balancer who has the real IP of the virtual service. That’s where the 254 metric we configure in the demo below comes into play.  Note that  the real server responds over it normal NIC. Which is great and it helps with firewall rules not ruining the party. That’s why with DSR which leverages the the loopback adapter on the RD Gateway servers also requires you to configure the weak host / strong host behavior for the network configuration on those servers, it’s not answering itself! I’ll not go into details on this here but basically since Windows Vista and Windows Server 2008 the security model has change from weak host to strong host. This means that a system (that is not acting as a router) cannot send or receive any packets on a given interface unless the destination/source IP in the packet is assigned to the interface. In the “weak host” model, this restriction does not apply. Read more about this here. Let’s walk through this UDP/DSR/weak host setup & configuration.

On your Loadmaster you’ll create a virtual service for UDP traffic.

  • Select Virtual Services > Add New.
  • Enter the IP address of your RD Gateway Farm
  • Set 3391 as the Port.
  • Select udp for the Protocol.
  • Click Add this Virtual Service.

Open up the Standard Options to configure those

image

  • We don’t need layer 7 as the UDP connections are tied to the HTPP connection and they will spawn and die with that one.
  • We select Source IP Address as the Persistence Mode as the RD Gateway needs persistence to guarantee the connection stay together on the same RD Gateway server. Set the time out value no to high so it isn’t remembered to long.
  • We select least connections as that’s the best option in most cases, let the farm node with the least load take on new connections. This is handy after down time for example.

Now head over to the Real Servers section

image

  • Make sure the Real Server Check parameters is set to ICMP ping, which is what the LoadMaster uses to check if the RD Gateway servers are alive.
  • Click Add New to add an  RD Gateway server, you’ll do this for each farm member.

image

  • Enter the Real Server Address for each RD Gateway.
  • Enter 3391 as the Port.
  • Select Direct return as the Forwarding method.
  • Click Add This Real Server.

When you’re done it looks like this:

image

So now we need to check if the real servers are seen as on line and healthy …

image

If one RD Gateway server is down or has an issue you see this … no worries the LoadMaster sends all clients to the other farm member server.

image

Configure the  RD Gateway farm servers to work with DSR

We’re not done yet, we need to configure our RD Gateway servers in the farm to work with DSR.

Go to Device Manager, right-click on the computer name and select Add legacy hardware

image

Click next on the welcome part of the wizard …

image

Select “Install the hardware that I manually select from a list (Advanced)” and click Next …image

Scroll down to network adapters, select it and click Next …image

Under Manufacturer choose Microsoft and as Network Adapter scroll down to Microsoft KM-TEST Loopback Adapter, select it and click Next.

image

Click Next to install it …image

image

Click next to close the Wizard.image

 

Now go to  and change the name so you can easily identify the loopback adapter …imageimage

In the properties of the loopback adapter we disable everything we don’t need. In this case, we only need IPV4 and nothing else. We also need to configure the TCP/IP settings for the loopback adapter. So open up the TCP/IP v4 properties of that NIC …image

Enter the IP address of the Virtual Service for UDP on the load master and, very important enter a subnet mask of 255.255.255.255 for the loopback address. It’s a subnet of 1 host, the VIP IP address. Do not enter a gateway!

image

Now go to the advanced setting and deselect Automatic metric and fill out 254. This step prevents the server to respond to ARP requests for the MAC of the VIP with the MAC of the loop back adapter.

image

Also uncheck “Register this connection”s address in DNS” to avoid any name resolution problems for the real servers.

image

Finally disable NETBIOS over TCP/IP.

image

What we are doing with all the above is preventing any issues with normal network traffic to this real servers being affected by the loopback adapter who’s one and only function is to enable DSR and nothing else. It’s a bit “paranoid” but it pays to be and prevent problems.

Dealing with Strong Host / Weak Host setting in W2K8 and higher

We now still need to deal with the strong host security model and allow the LAN interface to receive traffic from the KEMP and allow the KEMP to receive and send traffic form/to the LAN interface. This is done by executing the following commands:

netsh interface ipv4 set interface LAN weakhostreceive=enabled
netsh interface ipv4 set interface KEMP-DSR-LOOPBACK weakhostreceive=enabled
netsh interface ipv4 set interface KEMP-DSR-LOOPBACK weakhostsend=enabled

That’s it. You should now have HTTP/UDP connections in your RD Gateway monitoring when using a load balancer and set it up correctly.  Remember if this isn’t configured correctly you’ll still connect but you lose the benefits the UDP connections offer.

Now another thing you need to be aware of in your RD Gateway configuration is that for UDP  to work with DSR is that the UDP Transport Settings need to be configured for “all unassigned” IP addresses. Other wise DRS won’t work and you’ll lose UDP. This make sense, you’ll receive traffic on the VIP on your real servers. It’s just like DSR with a web server where in IIS you’ll bind both the LAN and the loopback adapter to port 80 or 443 for the site.

image

We can see that one client is connected via RDSGW01 to two servers (Viking and Spartan) leveraging HTTP and UDP. The load balancing is done via the KEMP Loadmasters in  geo-redundant fashion.

image

Yes, my geo load balanced RD Gateway Server farms are providing UDP support for the servers and clients we  RDP in to.

image

Combined with those servers and clients being spread amongst the sites provides for enough business continuity to keep the shop running when a site fails, so it’s more than just connectivity!

Kemp LoadMaster OEM Servers and Dell Firmware Updates with Lifecycle Controller

When you buy a DELL OEM based Kemp Technologies LoadMaster you might wonder who will handle the hardware updates to the server. Well Dell handles all OEM updates via its usual options and as with all LoadMasters Kemp Technologies handles the firmware update of the LoadMaster image.

KempLM320

Hardware wise both DELL and Kemp have been two companies that excel in support. If you can find the solution that meets your needs it’s a great choice. Combine them and it make for a great experience.  Let me share a small issue I ran into updating Kemp Loadmaster OEM Servers and Dell Firmware Updates with Lifecycle Controller

For a set of DELL R320 loadmasters in HA is was upgrading ( I not only wanted to move to 7.1-Patch28b-BARE-METAL.bin but I also wanted to take the opportunity to bring the firmware of those servers to the latest versions as that had been a while (since they had been delivered on site).

There is no OS that runs in those server,s as they are OEM hardware based appliances for the Loadmaster image. No worries these DELL servers come with DRAC & Lifecycle controllers so you can leverage those to do the firmware updates from a Server Update Utility ISO locally, via virtual media, over over the network, via FTP or a network share. FTP is either the DELL FTP Site or an internal one.

image

image

Now as I had just downloaded the  latest SUU at the time (SUU-32_15.09.200.74.ISO – for now you need to use the 32 bit installers with the life cycle controller) I decided to just mount it via the virtual media, boot to the lifecycle controller and update using local media.

image

image

But I got stuck  …

It doesn’t throw an error but it just returns to the start point and nothing can fix it. Not even adding “/repository”  to the file path . You can type the name of an individual DUP (32 bit!) and that works. Scanning the entire repository however wouldn’t move beyond step 2 “Enter Access Details”.

Scanning for an individual DUP seemed to work but leaving the file path blank while trying to find all eligible updates seemed not to return any results so I could not advance. The way I was able to solve this was by leveraging the DRAC ability to update it own firmware using the firmware image file to the most recent version. I just got mine by extracting the DUP and taking the image file from the payload sub folder.

image

You can read on how to upgrade DRAC / Lifecycle Controller via the DRAC here.

image

When you’ve done that, I give the system a reboot for good measure, and try again. I have found in all my cases fixes the issue. My take on this is that older firmware can’t handle more recent SUU repositories. So give it a try if you run into this and you’ll be well on your way to get your firmware updated. If you need help with this process DELL has excellent documentation here in “Lifecycle Controller Platform Update/Firmware Update in Dell PowerEdge 12th Generation Servers”

image

image

image

The end result is a fully updated DELL Server / Kemp Loadmaster. Mission accomplished. All this can be done from the comfort of your home office. A win-win for both you and your customer/employer. Think about it, it would be a shame to miss out on all the benefits you get from working in the cloud when your on premises part of a hybrid infrastructure forces you to get in a car and drive to a data center 70 km away. Especially at 21:21 at night.

Unable to retrieve all data needed to run the wizard. Error details: “Cannot retrieve information from server “Node A”. Error occurred during enumeration of SMB shares: The WinRM protocol operation failed due to the following error: The WinRM client sent a request to an HTTP server and got a response saying the requested HTTP URL was not available. This is usually returned by a HTTP server that does not support the WS-Management protocol.

I was recently configuring a Windows Server 2012 File server cluster to provide SMB transparent failover with continuous available file shares for end users. So, we’re not talking about a Scale Out File Server here.

All seemed to go pretty smooth until we hit a problem. when the role is running on Node A and you are using the GUI on Node A this is what you see:

image

When you try to add a share you get this

"Unable to retrieve all data needed to run the wizard. Error details: "Cannot retrieve information from server "Node A". Error occurred during enumeration of SMB shares: The WinRM protocol operation failed due to the following error: The WinRM client sent a request to an HTTP server and got a response saying the requested HTTP URL was not available. This is usually returned by a HTTP server that does not support the WS-Management protocol.”

image

When you failover the file server role to the other node, things seem to work just fine. So this is where you run the GUI from Node A while the file server role resides on Node B.

image

You can add a share, it all works. You notice the exact same behavior on the other node. So as long as the role is running on another node than the one on which you use Failover Cluster Manager you’re fine. Once you’re on the same node you run into this issue. So what’s going on?

So what to do? It’s related to WinRM so let’s investigate that.

image

So the WinRM config comes via a GPO. The local GPO for this is not configured. So that’s not the one, it must come from the  domain.The IP addresses listed are the node IP and the two cluster networks. What’s not there is local host 127.0.0.1, the cluster IP address or any of the IPV6 addresses.

I experimented with a lot of settings. First we ended up creating an OU in the OU where the cluster nodes reside on which we blocked inheritance. We than ran gpupdate /target:computer /force on both nodes to make sure WinRM was no longer configure by the domain GPO. As the local GPO was not configured it reverted back to the defaults. The listener show up as listing to all IPv4 and IPv6 addresses. Nice but the GPO was now disabled.

image

This is interesting but, things still don’t work. For that we needed to disable/enable WinRM

Configure-SMRemoting -disable
Configure-SMRemoting –enable

or via server manager

image

That fixed it, and we it seems a necessity to to. Do note that to disable/enable remote management it should not be configured via a GPO or it throws an error like

image

or

image

Some more testing

We experimented by adding 127.0.0.0-172.0.0.1 an enabling the GPO again. We then saw the listener did show the local host, cluster & file role IP address but the issue was back. Using * in just IPv 4 did not do the trick either.

image

What did the trick was to use * in the filter for IPv 6 and keep our original filters on IPv4. The good news is that having removed the GPO and disabling/enabling WinRM  the cluster IP address & Filer Role IP address are now in the list. That could be good for other use cases.

This is not ideal, but it all works now.

What we settled for

So we ended up with still restricting the GPO settings for IPv4 to subnet ranges and allowing * for IPv6. This made sure that even when we run the Failover Cluster Manager GUI from the node that owns the file server role everything still works.

One workaround is to work from a remote host, not from a cluster member, which is a good practice anyway.

The key takeaway is that when Microsoft says they test with IPv6 enabled they literally mean for everything.

Note

There is a TechNet article on WinRM GPO Settings for SCVMM 2012 RC where they advice to set both IPv4 and  IPv6  to * to avoid issues with SCVMM operations. How to Add Trusted Hyper-V Hosts and Host Clusters in VMM 

However, we found that IPv6 is the key requirement here, * for just IP4 alone did not work.