Exchange 2010 DAG Issue: Cluster IP address resource ‘Cluster IP Address’ cannot be brought online

Today I was called in to investigate an issue with an Exchange 2010 Database Availability Group whose backups with Symantec Backup Exec were failing. As it turned out, while the DAG was still providing mail services and clients did not notice anything, the underlying Windows Cluster service had an issue. The cluster IP address resource could not be brought online; instead we got an error:

“Cluster IP address resource ‘Cluster IP Address’ cannot be brought online because the cluster network ‘Cluster Network 1’ is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.”

I have been dealing with Windows 2008 (R2) clusters since the betas and had seen some causes of this, so I started to check the cluster & Exchange DAG configuration. Nothing was wrong, not a single thing. Weird. I had seen such weird behavior once before with a Hyper-V R2 cluster. There I fixed it by disabling and enabling the NICs on the nodes that were having the issue, thus resetting the network. If you don’t have DRAC/ILO or KVM over IP access you can temporarily allow client access via another cluster network, or you’ll need physical access to the server console.
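If you do have out-of-band access and want to try that NIC reset, here is a minimal sketch from an elevated PowerShell prompt, assuming the affected adapter is named “Local Area Connection” (adjust the name to your environment):

# Disabling the NIC drops any remote session riding on it,
# hence the need for DRAC/ILO, KVM over IP or console access.
netsh interface set interface name="Local Area Connection" admin=disabled
Start-Sleep -Seconds 5
netsh interface set interface name="Local Area Connection" admin=enabled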

In the event viewer I found some more errors:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          6/18/2010 2:02:41 PM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      node1.company.com
Description: Cluster resource ‘IPv4 DHCP Address 1 (Cluster Group)’ in clustered service or application ‘Cluster Group’ failed.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          6/18/2010 1:54:47 PM
Event ID:      1223
Task Category: IP Address Resource
Level:         Error
Keywords:     
User:          SYSTEM
Computer:     node1.company.com
Description: Cluster IP address resource ‘Cluster IP Address’ cannot be brought online because the cluster network ‘Cluster Network 1’ is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          6/18/2010 1:54:47 PM
Event ID:      1223
Task Category: IP Address Resource
Level:         Error
Keywords:     
User:          SYSTEM
Computer:     node1.company.com
Description: Cluster IP address resource ‘IPv4 DHCP Address 1 (Cluster Group)’ cannot be brought online because the cluster network ‘Cluster Network 3’ is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.

So these cluster networks (it’s a geographically dispersed cluster with routed subnets) are indicating they do not have “Allow clients to connect through this network” set. Well, I checked and they did! Both “Allow cluster network communication on this network” and “Allow clients to connect through this network” are enabled.
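By the way, you don’t have to click through the GUI to check this; the FailoverClusters PowerShell module that ships with Windows Server 2008 R2 exposes the same settings. A quick sketch, where the Role value maps to those checkboxes:

Import-Module FailoverClusters

# Role 0 = not used for cluster communication
# Role 1 = “Allow cluster network communication on this network” only
# Role 3 = cluster communication plus “Allow clients to connect through this network”
Get-ClusterNetwork | Format-Table Name, Role, Address, State -AutoSize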

Weird, OK, but as mentioned I’ve encountered something similar before. In this case I did not want to just disable/enable those NICs. The DAG was functioning fine and providing services to clients, so I did not want to cause any interruption or failover just because the cluster core resource was having an issue.

So before going any further I did a search, and almost within a minute I found the following TechNet blog post: Cluster Core Resources fail to come online on some Exchange 2010 Database Availability Group (DAG) nodes (http://blogs.technet.com/b/timmcmic/archive/2010/05/12/cluster-core-resources-fail-to-come-online-on-some-exchange-2010-database-availability-group-dag-nodes.aspx)

Well, well, the issue is known to Microsoft and they offer three fixes, which are actually one and the same fix, done using either the Failover Cluster Manager GUI, cluster.exe or PowerShell. The fix is to simply disable and re-enable “Allow clients to connect through this network” on the affected cluster network. The long-term fix will be included in Exchange 2010 SP1. The workaround works immediately, and Backup Exec started functioning again. They’ll just have to keep an eye on this issue until the permanent fix arrives with SP1.
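For those who want to script the workaround rather than click through the GUI, here is a minimal PowerShell sketch, assuming “Cluster Network 1” is the affected network as in the errors above; toggling the Role property is the scripted equivalent of unticking and re-ticking that checkbox:

Import-Module FailoverClusters

# Toggle client access off and back on for the affected cluster network.
$net = Get-ClusterNetwork "Cluster Network 1"
$net.Role = 1   # cluster communication only
$net.Role = 3   # cluster and client communication again

# Then bring the failed IP address resource back online.
Start-ClusterResource "Cluster IP Address"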

Partially Native USB support coming to W2K8R2 with SP1!?

As you might recall from a previous blog post of mine (https://blog.workinghardinit.work/2010/03/29/perversions-of-it-license-dongles/), one of the show stoppers for virtualization can be USB dongles. Apart from my aversion to USB license dongles, which should never be mentioned in the same sentence as reliability and predictability, the push for VDI has now exposed another weakness: the need for end users to have USB access. Well, Microsoft seems to have heard us. Take a look at this blog post: http://blogs.technet.com/virtualization/archive/2010/04/25/Microsoft-RemoteFX_3A00_-Closing-the-User-Experience-Gap.aspx

What remains to be seen is if this will work with license dongles. Anyway, for desktop virtualization a much needed improvement is under way. I would like to thank Christophe Van Mollekot from Microsoft Belgium for bringing this to my attention. This, together with the VDI license improvements for Software Assurance customers, gives desktop virtualization a much better chance of being adopted. Sometimes stuff like this really makes the difference. You can’t explain to your end users that the great, super modern virtualized environment doesn’t support the ubiquitous USB drive. Trust me on that one.

Perversions of IT: License Dongles

The Sum of All Evil resulting in the Mother of All Fears

One of the most annoying things in the life of infrastructure support is the sometimes convoluted schemes software vendors come up with to protect their commercial interests. The “solution” we despise the most is the license dongle. It sounds easy and straightforward, but it never ever is. Murphy’s Law!

Vendors think they are unique

What’s one dongle? What are you complaining about? Huh … we have customers who have more than more than 10. Only one application is actually capable of finding multiple dongles and able to survive the loss of one. Do we need multiple machines to be on the safe side? Yes. How many? You need at least two. Why to allow “quick” redirection if possible at all (using scripts, GPO’s, etc.)

We have a USB hub attached to the rack, so we can stick in many dongles. It saves on hardware. Now once in a while the USB will get a hiccup, and unplugging it and plugging it back in usually does the trick. As mentioned before, this is not very handy when that dongle is in a secured room or data center. To add insult to injury, network issues might stop applications that would otherwise be fine from working, just because they can’t find a license server.

Today they are almost always USB dongles, but the parallel port dongle was quite popular as well. If you’re lucky, the vendor provides you with a USB dongle when your hardware no longer has a parallel port. But sometimes you are not that lucky. Today most laptops, and indeed most PCs, don’t come with a parallel port anymore. And no, to road warriors with laptops a USB to parallel converter isn’t really user friendly. Furthermore, a dongle sticking out is an accident waiting to happen (broken, lost), and finally some laptops, especially the smaller ones road warriors love, sometimes have only one USB port, which is taken up by their 3G broadband stick. Heaven forbid that some users actually have two applications that require a dongle and an internet connection. These are only the silly, practical issues you have to deal with when license dongles come into the picture, but it gets worse fast when things like uptime, redundancy, high availability etc. are added.

Some dongles are attached to a network license server; some are attached locally to the PC or server running the software. In all cases they need drivers/software to run. The server software is sometimes very basic, rather flimsy and error prone. Combine this with various vendors who all require license dongles with various brands/versions of software. USB ports themselves have been known to malfunction now and then. As you can imagine, you end up with a USB hub lit up like an X-Mas tree and lots of finger crossing.

Reliable, Predictable, Highly Available

Dongles and high availability are almost always by definition mutually exclusive. If they are not, then it’s a very expensive “we’ll work around it” type of solution trying to make a dongle highly available. This is only possible when the software that requires it supports redundant setups. I have only seen this possibility once. With some vendors you need to buy extra licenses to get an extra dongle … if the software package costs €50,000 that hurts. Some vendors will show leniency in such cases; some are plain arrogant and complacent. The fact that they are still in business proves that the world runs on bullshit, I guess. But even when you do have multiple dongles and multiple servers hosting the dongles, most dongle protected software is not written to deal with losing a dongle, so you can’t get high availability, only some form of hot standby. Supporting that is also a lot of work that requires cloning, copying, scripts, manual interventions etc.

Some vendors are even so paranoid they check five times within a minute whether they can still find a license, and if not the application fails. That means that even rebooting a dongle host for patching or another intervention takes down the application. Zero tolerance for failure … dongle-wise, power-wise, human-error-wise … pretty unrealistic. And even if the dongle is attached to a redundant server in a secure data center, you’ll see that the USB port will fail for some reason. The only reliable and predictable thing in this story is the fact that you will fail.

Security

This is a good one. Do you really want to hurt some companies I work/consult for? Walk around their offices or data center, unplug any dongle you can and flush ‘m down the toilet. That will take ‘m down for a while. Yes, I know, they should have better physical security. Either we have a license dongle on a network server, which makes it a bit more realistic, or we lock up all those PCs in a secure room. That is not always feasible, either due to cost or just practically. And by the way, that doesn’t protect you from a pissed-off contractor or employee that has access. Even when security cameras can identify them fast, the damage is already done.

Dongles sticking out of a 1U server prevent the use of bezels to help lock down access to the server, and the USB ports in the back are used for KVM over IP or keyboard and mouse.

In some models you can try to plug the dongle into a USB port inside the server chassis, but then the old trick of unplugging/inserting the dongle when it goes haywire isn’t that easy anymore. Let alone the fact that the dongle sits in a data center somewhere, so getting to it might not be feasible, and you need to grant someone access to the server to be able to get to the dongle.

Dongles and Virtualization

When you need to virtualize server applications that need a locally attached dongle you need to start looking for USB over Ethernet solutions that are reliable. When you find one you need to manage it very carefully and well. You need to manage the versions of the server software and the client software. We’ve seen network connectivity loss when the versions don’t match up, even if the software didn’t complain about different versions. You need to test its stability, have extra hardware and extra dongles for testing as not all dongles respond well to this type of setup. We can’t afford to bring down production environments with USB over Ethernet software “upgrades of faith”. The need for dongles adds an extra layer of complexity and management, one that is very error prone and hard to make redundant let alone highly available. It’s not a pretty picture.

We used to buy Fabulatech for such implementations. Version 4.1.1 was rock solid but ever since version 4.2/4.3 & 4.4 Beta they have brought us nothing but “Blue Screen of Death” hell. We now implement KernelPro (2.5.5) which seems to be functioning very well for the moment.

Dongles are a virtualization show stopper in some environments due to these issues and risks. Behold: dongle David brings down virtualization Goliath.

The Bottom Line

The biggest perversion, in what is essentially a big mess, is the fact that the only people affected by this are your paying customers. Software vendors should take note: paying clients despise your convoluted, error-prone, “accidents waiting to happen” dongle licensing schemes. You not only have no clue what it means to run reliable IT operations, you don’t even care about your customers’ needs. There is only one rule: software & hardware should work under all circumstances, without the need for dongles. That darn piece of 50 cent plastic & silicon could well bring an entire application down. Let us just hope that it isn’t the geo routing software for 911 or 112 services.

There are two possibilities when you sell software. One is that your application is very popular and as such is being “keygenned” and cracked all over the place, and the only ones you’re hurting are your paying customers. The other possibility is that your software is so unique and expensive it’s only bought by specialized firms and entities that couldn’t even operate it without being exposed as thieves. Stop fooling yourselves and stop making life hell for your customers. Protect your rights as well as you can, but not at the expense of paying customers. You might even sell more if you care about their needs. Go figure. Maybe I’m just too demanding? Nah!