Missing Hyper-V Service Connection Point caused failed off-host backup proxy jobs

The issue

We have a largish Windows Server 2016 Hyper-V cluster (9 nodes) that is running a smooth as can be but for one issue. The off-host backups with Veeam Backup & Replication v9.5 (based on transportable hardware snapshots) are failing. They only fail for the LUNs that are currently residing on a few of the nodes on that cluster. So when a CSV is owned by node 1 it will work, when it owned by node 6 it will fail. In this case we had 3 node that had issues.

As said, everything else on these nodes, cluster wise or Hyper-V wise was working 100% perfectly. As a matter of fact, they were the perfect Hyper-V clusters we’d all sign for. Bar that one very annoying issue.

Finding the cause

When looking at the application log on the off-host backup proxy it’s quite clear that there is an issue with the hardware VVS provider snapshots.

We get event id 0 stating the snapshot is already mounted to different server.

clip_image002

Followed by event id 12293 stating the import of the snapshot has failed

clip_image004

When we check the SAN, and monitor a problematic host in the cluster we see that the snapshot was taken just fine. what was failing was the transport to the backup repository server. It also seemed like an attempt was made to mount the snapshot on the Hyper-V host itself, which also failed.

What was causing this? We dove into the Hyper-V and cluster logs and found nothing that could help us explain the above. We did find the old very cryptic and almost undocumented error:

Event ID 12660 — Storage Initialization

Updated: April 7, 2009

Applies To: Windows Server 2008

This is preliminary documentation and subject to change.

clip_image005

This aspect refers events relevant to the storage of the virtual machine that are caused by storage configuration.

Event Details

Product:

Windows Operating System

ID:

12660

Source:

Microsoft-Windows-Hyper-V-VMMS

Version:

6.0

Symbolic Name:

MSVM_VDEV_OPEN_STOR_VSP_FAILED

Message:

Cannot open handle to Hyper-V storage provider.

Resolve

Reinstall Hyper-V

A possible security compromise has been created. Completely reimage the server (sometimes called a bare metal restoration), install a new operating system, and enable the Hyper-V role.

Verify

The virtual machine with the storage attached is able to launch successfully.

This doesn’t sound good, does it? Now you can web search this one and find very little information or people having serious issues with normal Hyper-V functions like starting a VM etc. Really bad stuff. But we could start, stop, restart, live migrate, storage live migrate, create checkpoints etc. at will without any issues or even so much as a hint of issues in the logs.

On top of this event id Event ID 12660 did not occur during the backups. It happens when you opened up Hyper-V manager and looked at the setting of Hyper-V or a virtual machine. Everything else on these nodes, cluster wise or Hyper-V wise was working 100% perfectly Again, this is the perfectly behaving Hyper-V cluster we’d all sign for. If it didn’t have that very annoying issue with a transportable snapshot on some of the nodes.

We extended our search outside if of the Hyper-V cluster nodes and then we hit clue. On the nodes that owns the LUN that was being backup and that did show the problematic transportable backup behavior noticed that the Hyper-V Service Connection Point (SCP) was missing.

clip_image006

We immediately checked the other nodes in the cluster having a backup issue. BINGO! That was the one and only common factor. The missing Hyper-V SCP.

Fixing the issue

Now you can create one manually but that leaves you with missing security settings and you can’t set those manually. The Hyper-V SCP is created and attributes populates on the fly when the server boots. So, it’s normal not to see one when a server is shut down.

The fastest way to solve the issue was to evacuate the problematic hosts, evict them from the cluster and remove them from the domain. For good measure, we reset the computer account in AD for those hosts and if you want you can even remove the Hyper-V role. We then rejoined those node to the domain. If you removed the Hyper-V role, you now reinstall it. That already showed the SCP issue to be fixed in AD. We then added the hosts back to the cluster and they have been running smoothly ever since. The Event ID 12660 entries are gone as are the VSS errors. It’s a perfect Hyper-V cluster now.

Root Cause?

We’re think that somewhere during the life cycle of the hosts the servers have been renamed while still joined to the domain and with the Hyper-V role installed. This might have caused the issue. During a Cluster Operating System Rolling Upgrade, with an in-place upgrade, we also sometime see the need to remove and re-add the Hyper-V role. That might also have caused the issue. We are not 100% certain, but that’s the working theory and a point of attention for future operations.

Replay Manager 7.8 and cluster OS rolling upgrade Tips

Compellent Replay manager 7.8  Windows Server 2016 Clusters in mixed mode or at cluster functional lever 8

Consider this a a quick publish about tips for when you combine Replay Manager 7.8, Compellent and Windows Server 2016. Many of you will be doing cluster operating system rolling upgrade of your Windows Server 2012 R2 clusters to Windows Server 2016. If you have done your homework and made sure your hardware is supported you can still run into a surprise. As long as your in mixed mode (Wi2K12R2 mixed with W2K16 nodes) or have not updated the cluster functional level to 9 (Windows Server 2016) you will have a few issues.

In Replay Manager 7.8  itself you’ll notice that the nodes of your cluster only see the CSV LUNs under local volumes that they are the owner of currently. Normally you’ll see all of the CSV LUNs of the (Hyper-V) cluster on all of the nodes of that cluster. So that’s not the expected behavior. This leads to failed  restore points when you run a snapshot from a host that is not the owner of the CSV etc.

image

On top of that when you try to run a backup job it will fail. The reason given is:

The requested volumes is not supported because it is not managed by the provider, is a dynamic volume, or it has some other incompatibility with the current operation.

The fix? Just update your upgrade cluster to cluster functional level  (level 9)

It’s as easy as that. The moment you upgrade your cluster functional level to 9 you will see all the CSV on the cluster on every node of that cluster you connect to. At that moment the replays will also work. That’s OK, you want to move swiftly trough the rolling upgrade and once you’re comfortable all drivers and firmware are working fine. You do not want to be in a the lower cluster version too long, but upgrade to benefit from the new capabilities in Windows Server 2016 Failover clustering. You do need to know this when you start your upgrades

image

Close your backups apps, restart the Replay manager service on the cluster nodes, refresh / reconnect to the backup apps, and voila. You’ll see the image you are use to in Replay Manager 7.8 (green text / arrows) and the backup jobs will work as well as any other backup product using the Compellent Replay Manager 7.8 hardware VSS provider.image

I hope this helps some of you out there. So yes Replay Manager 7.8 supports Windows Server 2016 Clusters with CSV LUNs but if you upgraded your cluster via cluster operating system rolling upgrade you need to have upgraded your cluster functional level! Until then, Replay Manager 7.8 isn’t going to work very well.

So there you go, that’s another reason to move through that process fast and smooth as you can.

Still missing in action for Hyper-V with Replay Manager 7.8

I’d really like for Replay Manager to be a bit more cluster friendly. No matter what node you are connected to they show you all CSV LUNs in the cluster. Since Replay manager 7.8 with Windows Server 2016 when you run a job manually you must start it when connected to the cluster node that owns the CSV or the job will fail with “No resources found on current cluster node for backup set”.

image

This was not the case with Windows Server 2012(R2) and earlier versions of Replay Manager. That did throw some benign errors in the event logs on the cluster node but it did work. I would love for DELLEMC to make sure the Replay Manager Client is smart enough to detect who owns the CSV and make sure it’ starts the job from that node. That would be a lot more user friendly. At the very least it should indicate which of the CSV LUNs you see are owned by the cluster node you are connected to.But when launching a backup job for a CSV that’s not owned by the node you are connect to the job quits/fails. They can detect the node they need, launch the job on that node and show it to you. That avoids having to go find out yourself what cluster node to connect to in Replay manager when you need to run a out of schedule job manually? The tech/logic is already there as the scheduled jobs get launched on the correct node.

It would also be great if they finally could get the logic built into Replay manager for the Hyper-V VM backups to know on what CSV and Hyper-V node the VM lives and deal with that. Sure it might cause more more snapshots to be made but that’s an invalid argument. When the VMs are on the same node,but different  CSV’s that’s already happening. Really on VM per job to avoid this isn’t a great answer.

Manage Your Brocade Fibre Channel Switch with recent Java & browser versions

Introduction

I was in the process of setting up a new jump server a management station server virtual machine on Windows Server 2016 Hyper-V. The guest was also Windows Server 2016 (desktop install). That station needed to be used to manage some aging Brocade fibre channel switches. With the default setting and links this will give you some headaches and some solution require you to keep older and insecure browser or java versions installed. We’ll show you how to get GUI access to your FC switches without needing to do that so you can manage your Brocade Fibre channel switch with recent Java & browser versions. Well not all of them, but it can be done with IE 11 and Firefox 52.0.1 (at the time of writing).

Another solution is to use the CLI naturally.

Manage Your Brocade Fibre Channel Switch with recent Java & browser versions

It’s OK to use the most recent Java version available. At the moment that I wrote this blog post that was Java 1.8.0.121. I can’t give guarantees other than that, but for now that does work.

Instead of navigating to http or https to just the IP address which will send you to https://x.x..x.x/switchexplorer you need to create a shortcut link to the following: https://10.30.2.2/switchexplorer_installed.html (or http://10.30.2.2/switchexplorer_installed.html if you have not enabled https on your switch).

Like this:

clip_image001

I normally change the icon to the shortcut to indicate it’s pointing to a network device. I actually created some ico files based on an image of brocades Fibre Channel switches that I use for this. I just place then under C:\Programdata\BrocadeFC for safe keeping together with a cop of the short cuts. On the management station, I add them to the desktop for easy access. Below is a screenshot of my Windows 10 or Windows Server 2016 (Desktop Experience) management station.

clip_image002

But we’re not there yet. You need to go to Java configuration and select the Security Tab. Make sure Enable Java Content in the browser is enabled. Leave the security at high but don’t forget to add the IP addresses of your Brocade switch to the Exception Site List.

clip_image004

You’ll need to add http or https or both depending on your situation. I think we can all agree we should go for https in this day and age.

In Firefox when you launch the shortcut you’ll get asked what app to use for opening this file.

clip_image005

Make sure you point it to javaws.exe (in C:\Program Files (x86)\Java\jre1.8.0_121\bin) if that’s not the case.

clip_image007Also, check to “Do this automatically for files like this from now on” for faster access during normal operations.

In Internet Explorer allow the add-on “Java SE Runtime Environment 8 Update 121 from Oracle America Inc.” to run.

clip_image009

When it comes to Chrome, this doesn’t’ work anymore. See https://www.java.com/en/download/faq/chrome.xml

When the application is launched, depending on the age of the fibre channel switch and the version of the firmware you’ll be greeted by a more or less harsh security warning.

clip_image010

clip_image011

Check the “I accept the risk and want to run this application” or “Do not show this again for this app from the publisher above” depending on the case. This also allows for easy access the next time you launch the shortcut. The app will launch and you’ll be greeted by the login screen.

clip_image012

Juts log in and there’s nothing more to it. You can now manage your FC switches from Firefox again.

image

Hope this helps some of you out there that come across this issue.

Off Host Backup Jobs with Veeam and Replay Manager 7.8

It’s all about application consistent hardware VSS provider snapshots

I was browsing to see if I could already download Replay Manager 7.8 for our Compellent (SC) SANs. No luck yet, but I did find the release notes. There was a real gem in there on Off Host Backup Jobs with Veeam and Replay Manager 7.8. We’ll get back to that after the big deal here.

image

So what kind of goodness is in there? Well obviously there is the way too long overdue support for Windows Server 2016, including the Hyper-V role and its features. That is great news. We now have application consistent hardware VSS provider snapshots.I do not know what took them so long but they need to get with the program here. I have given this a s feedback before and again at DELL EMC World 2017 The Compellent still is one of the best “traditional” centralized storage SAN solutions out there hat punches fare above its weight. On top of that, having looked at Unity form DELL EMC, I can tell you that in my humble opinion the Compellent has no competition from it.

Off Host Backup Jobs Veeam Replay Manager 7.8

Equally interesting to me, as someone who leverages Compellent and Veeam Baclup & Replication with Off Host Proxies (I wrote FREE WHITE PAPER: Configuring a VEEAM Off Host Backup Proxy Server for backing up a Windows Server 2012 R2 Hyper-V cluster with a DELL Compellent SAN (Fiber Channel))  is the following. Under fixed Issues is found:

RMS-24 Off-host backup jobs might fail during the volume discover scan when using Veeam backup software.

I have Off host proxies with transportable snapshots working pretty smooth but it has the occasional hiccup. Maybe some of those will disappear with Replay Manager 7.8. I’m looking forward to putting that to the test and roll forward with Windows Server 2016 for those nodes where we need and want to leverage the Compellent Hardware VSS provider. When I do I’ll let you know the results.