An error occurred connecting to the cluster

Posted on September 8, 2017 by workinghardinit

An error occurred connecting to the cluster

This morning I woke up to a bunch of failed backup notifications of our trusted Veeam Backup & Replication v9.5 update 2 solution. After 3:30 AM the backups of one particular cluster started failing.

I went to have a look but I could not connect to the 3 node cluster.

I logged on to the cluster nodes themselves and did a quick verification of network connectivity, DNS etc. That was all fine. WMI services were running on all nodes but on node 2 and 3 they were not functional.

Cleary we have a WMI issue. And sure enough, no Hyper-V manager available on those 2 nodes but we did have it on the one properly functioning node.

We tested some PowerShell WMI queries (get-wmiobject mscluster_resourcegroup -computer NodeToTest -namespace “ROOT\MSCluster“) to the cluster and this confirmed that WMI was toast on those two nodes.

Fixing the issue

The good news was that all the VMs were all up and running – a few that had RHS.exe issues – but were still alive pure Hyper-V wise. That explains why they didn’t have any support calls come in. So if we can fix this without causing down time this would be great. To try this we decided to restart the WMI service.

On problematic node 2 this worked. It restarted depending services as well such as Hyper-V Virtual Machine Management, User Access Logging Service, IP Helper and the Veeam Installer Service and the Veeam Hyper-V Integration Service. We got connectivity back via Hyper-V manager but the Failover Cluster manager GUI remained an issue but now only complained about node 3.

We wanted to avoid rebooting node 3 to avoid downtime to the VMs. So what we did there is stop the depending services that we could stop. It was vmms.exe that was stuck in shutdown we just killed the process manually with stop-Process -name “vmms” -force
That allowed the WMI service to be restarted. We then started the depending services manually and we got back the connectivity to Hyper-V Manager on node 3.

The Failover Cluster manager GUI could also connect again to the cluster. We checked the cluster for other issues. When done and found OK we live migrated the VMs node per node and did a reboot of every node one by one. This to have cleanly started nodes and to see if any trouble some event were logged during the startup. Normal operations were resumed.

Do note that there is a blog on TechNet about a similar issue but with a different error message. That was caused by missing cluswmi.mof file due to an ill advised use of run mofcomp.exe *.mof. This was not the case here. A reboot of the misbehaving nodes would have done the trick as well (as blogged here Trouble Connecting to Cluster Nodes? Check WMI! ) but we avoided as much downtime as possible here by going the route we did.

Microsoft Active Directory Replication Status Tool won’t upgrade

Posted on August 9, 2017 by workinghardinit

For getting a quick insight into the AD replication health of an environment the Microsoft Active Directory Replication Status Tool is a very handy instrument. The only annoyance is the expiration of the license that forces you to download a new one and upgrade. A bit of a convoluted way to update free software but hey it is handy and free.

And then again …

OK, I’ll download the new one. But the Microsoft Active Directory Replication Status Tool won’t upgrade. That’s because the currently installed version is newer than the one you just downloaded form the Microsoft site. That’s annoying, did they post the wrong version?

Let’s install the new version quickly in a VM. Now looking at the executable in the current install and the new one they are the same … so the license is the only thing causing an issue here; not a version difference actually.

Old version

New version

Let’s look at the license.xml file in C:\Program Files (x86)\Microsoft Active Directory Replication Status Tool\Licensing

The only difference between the old and the new installed is the license file.You can see it has the expiration dates in the future.

So the fix is easy, just uninstall the currently installed version of AD Replication Status tool wherever it is installed and reinstall the one you downloaded. It seems to be exactly the same version but that’s how you get it working again with a fresh license.xml file. Note that you cannot copy the license file between machine, the generated signature is wrong.

Hope this helps someone.

Troubleshooting Veeam B&R Error code: ‘32768’. Failed to create VM recovery snapshot

Posted on August 7, 2017 by workinghardinit

I recently had to move a Windows Server 2016 VM over to another cluster (2012R2 to 2016 cluster) and to do so I uses shared nothing live migration. After the VM was happily running on the new cluster I kicked of a Veeam backup job to get a first restore point for that VM. Better safe than sorry right?

But the job and the retries failed for that VM. The error details are:

Failed to create snapshot Compellent Replay Manager VSS Provider on repository01.domain.com (mode: Veeam application-aware processing) Details: Job failed (‘Checkpoint operation for ‘FailedVM’ failed. (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD). ‘FailedVM’ could not initiate a checkpoint operation: %%2147754996 (0x800423F4). (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD)’). Error code: ‘32768’.
Failed to create VM recovery snapshot, VM ID ‘3459c3068-9ed4-427b-aaef-32a329b953ad’.

Also when the job fails over to the native Windows VSS approach when the HW VSS provider fails it still does not work. At first that made me think of a bug that sued to exist in Windows Server 2016 Hyper-V where a storage live migration of any kind would break RCT and new full was needed to fix it. That bug has long since been fixed and no a new full backup did not solve anything here. Now there are various reasons why creating a checkpoint will not succeed so we need to dive in deeper. As always the event viewer is your friend. What do we see? 3 events during a backup and they are SQL Server related.

On top of that the SQLServerWriter is in a non retryable error when checking with vssadmin list writers.

It’s very clear there is an issue with the SQL Server VSS Writer in this VM and that cause the checkpoint to fail. You can search for manual fixes but in the case of an otherwise functional SQL Server I chose to go for a repair install of SQL Server. The tooling for hat is pretty good and it’s probably the fastest way to resolve the issues and any underlying ones we might otherwise still encounter.

After running a successful repair install of SQL Server we get greeted by an all green result screen.

So now we check vssadmin list writers again to make sure they are all healthy if not restart the SQL s or other relevant service if possible. Sometime you can fix it by restarting a service, in that case reboot the server. We did not need to do that. We just ran a new retry in Veeam Backup & Replication and were successful.

There you go. The storage live migration before the backup of that VM made me think we were dealing with an early Windows Server 2016 Hyper-V bug but that was not the case. Trouble shooting is also about avoiding tunnel vision.

The Federation Service was unable to create the federation metadata document as a result of an error.Document Path: /FederationMetadata/2007-06/FederationMetadata.xml

Posted on July 4, 2017 by workinghardinit

While working on upgrading a Windows 2012 R2 ADFS Farm to Window Server 2016 I noticed the worried looks of the systems administrators while looking at a warning in the ADFS event log, which they wanted to trouble shoot. I knew they had a hardware load balancer in place which made me 99.999% sure it wasn’t a real issue. You see, early documentation on configuring load balancing for and ADFS farm was often configured with a health check for the following url: /FederationMetadata/2007-06/FederationMetadata.xml. This leads you to an XML file that should be available on a working ADFS node.

This works fine. The Kemp Loadmaster knows the ADFS nodes are functional or not and can do it’s job. There’s a nagging issue however. The ADFS log on the ADFS farm node keep logging every health check with a warning

Event ID 143 AD FS

The Federation Service was unable to create the federation metadata document as a result of an error.Document Path: /FederationMetadata/2007-06/FederationMetadata.xml

Additional Data
Exception details:
System.Net.HttpListenerException (0x80004005): The specified network name is no longer available at System.Net.HttpResponseStream.Write(Byte[] buffer, Int32 offset, Int32 size) at Microsoft.IdentityServer.Service.FederationMetadata.SamlMetadataListener.OnGetContext(IAsyncResult result)

As you cans see it just fills the logs every 9 seconds (the frequency of the health check).

This leads to hunting for a “ghost” issue that’s actually only an artefact of checking for .

Kemp has updated their documentation with 2 other values for the health check url to use. The good news is these don’t cause the above artefact of logging warning to the ADFS event log. These 2 options are:

/adfs/services/trust/mex

This leads to an XML file as well but it doesn’t cause the warning to be logged.

/adfs/ls/idpInitiatedSignon.aspx.

This leads to the ADFS login page which also doesn’t cause a warning to be logged.

So by changing your health check to any of the above you get a functional health check for your nodes and you don’t have the phantom warning entries in the ADFS event log. That’s a lot better and at least doesn’t cause any unneeded concerns by the initiated accidental ADFS administrator.

Working Hard In IT

My view on IT from the trenches

Category Archives: Trouble Shooting

An error occurred connecting to the cluster

An error occurred connecting to the cluster

Fixing the issue

Microsoft Active Directory Replication Status Tool won’t upgrade

Troubleshooting Veeam B&R Error code: ‘32768’. Failed to create VM recovery snapshot

The Federation Service was unable to create the federation metadata document as a result of an error.Document Path: /FederationMetadata/2007-06/FederationMetadata.xml