Missing Hyper-V Service Connection Point caused failed off-host backup proxy jobs

The issue

We have a largish Windows Server 2016 Hyper-V cluster (9 nodes) that is running as smooth as can be, but for one issue. The off-host backups with Veeam Backup & Replication v9.5 (based on transportable hardware snapshots) are failing. They only fail for the LUNs that currently reside on a few of the nodes in that cluster. So when a CSV is owned by node 1 it works; when it is owned by node 6 it fails. In this case we had three nodes with the issue.

As said, everything else on these nodes, cluster-wise or Hyper-V-wise, was working 100% perfectly. As a matter of fact, this was the perfect Hyper-V cluster we’d all sign for. Bar that one very annoying issue.

Finding the cause

When looking at the application log on the off-host backup proxy, it’s quite clear that there is an issue with the hardware VSS provider snapshots.

We get event ID 0 stating that the snapshot is already mounted to a different server.


This is followed by event ID 12293 stating that the import of the snapshot has failed.


When we check the SAN and monitor a problematic host in the cluster, we see that the snapshot was taken just fine. What was failing was the transport to the backup repository server. It also seemed like an attempt was made to mount the snapshot on the Hyper-V host itself, which also failed.

What was causing this? We dove into the Hyper-V and cluster logs and found nothing that could help us explain the above. We did find the old, very cryptic and almost undocumented error:

Event ID 12660 — Storage Initialization

Updated: April 7, 2009

Applies To: Windows Server 2008

This is preliminary documentation and subject to change.


This aspect refers to events relevant to the storage of the virtual machine that are caused by storage configuration.

Event Details

Product: Windows Operating System

ID: 12660

Source: Microsoft-Windows-Hyper-V-VMMS

Version: 6.0

Symbolic Name: MSVM_VDEV_OPEN_STOR_VSP_FAILED

Message: Cannot open handle to Hyper-V storage provider.

Resolve

Reinstall Hyper-V

A possible security compromise has been created. Completely reimage the server (sometimes called a bare metal restoration), install a new operating system, and enable the Hyper-V role.

Verify

The virtual machine with the storage attached is able to launch successfully.

This doesn’t sound good, does it? Now you can web search this one and find very little information, or people having serious issues with normal Hyper-V functions like starting a VM. Really bad stuff. But we could start, stop, restart, live migrate, storage live migrate, create checkpoints etc. at will without any issues or even so much as a hint of trouble in the logs.

On top of that, Event ID 12660 did not occur during the backups. It only appeared when you opened Hyper-V Manager and looked at the settings of Hyper-V or of a virtual machine. Everything else on these nodes, cluster-wise or Hyper-V-wise, was working 100% perfectly. Again, this was the perfectly behaving Hyper-V cluster we’d all sign for, if it didn’t have that very annoying issue with transportable snapshots on some of the nodes.
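
If you want to check when and where that event shows up without clicking through every node, here is a minimal sketch that shells out to wevtutil (which ships with Windows). The node names are placeholders and the Hyper-V-VMMS Admin channel name is an assumption on my part, so verify it with wevtutil el on one of your hosts first.

# Scan each cluster node's Hyper-V VMMS Admin log for the cryptic Event ID 12660.
# Assumes remote event log access is allowed and that the channel name is correct.
import subprocess

NODES = ["HV-NODE1", "HV-NODE2", "HV-NODE3"]      # placeholder node names
CHANNEL = "Microsoft-Windows-Hyper-V-VMMS-Admin"  # assumed channel name

for node in NODES:
    result = subprocess.run(
        ["wevtutil", "qe", CHANNEL, "/q:*[System[(EventID=12660)]]",
         "/c:5", "/rd:true", "/f:text", f"/r:{node}"],
        capture_output=True, text=True)
    hits = result.stdout.strip()
    print(f"=== {node} ===")
    print(hits if hits else "no Event ID 12660 entries found")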

We extended our search outside of the Hyper-V cluster nodes and then we hit a clue. On the nodes that owned the LUNs being backed up and that showed the problematic transportable backup behavior, we noticed that the Hyper-V Service Connection Point (SCP) was missing.


We immediately checked the other nodes in the cluster having a backup issue. BINGO! That was the one and only common factor: the missing Hyper-V SCP.
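
You can do that check with the usual AD tools, but with many nodes a small script helps. Below is a minimal sketch using the Python ldap3 package, assuming the SCP shows up as a serviceConnectionPoint object named Microsoft Hyper-V directly under each node’s computer account; the domain controller, base DN, credentials and node names are placeholders for your own environment.

# Audit every Hyper-V node's computer account in AD for the "Microsoft Hyper-V"
# serviceConnectionPoint child object. All names below are placeholders.
from ldap3 import Server, Connection, NTLM, SUBTREE, LEVEL

DC = "dc01.contoso.local"            # hypothetical domain controller
BASE_DN = "DC=contoso,DC=local"      # hypothetical domain root
NODES = ["HV-NODE1", "HV-NODE2", "HV-NODE3"]

conn = Connection(Server(DC), user="CONTOSO\\svc_audit", password="********",
                  authentication=NTLM, auto_bind=True)

for node in NODES:
    # Locate the computer object for this node.
    conn.search(BASE_DN, f"(&(objectClass=computer)(cn={node}))", SUBTREE,
                attributes=["distinguishedName"])
    if not conn.entries:
        print(f"{node}: computer object not found")
        continue
    computer_dn = conn.entries[0].entry_dn
    # The Hyper-V SCP should sit directly under the computer object.
    conn.search(computer_dn,
                "(&(objectClass=serviceConnectionPoint)(cn=Microsoft Hyper-V))",
                LEVEL, attributes=["cn"])
    print(f"{node}: {'SCP present' if conn.entries else 'SCP MISSING'}")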

Fixing the issue

Now, you can create one manually, but that leaves you with missing security settings, and you can’t set those manually. The Hyper-V SCP is created and its attributes are populated on the fly when the server boots. So it’s normal not to see one when a server is shut down.

The fastest way to solve the issue was to evacuate the problematic hosts, evict them from the cluster and remove them from the domain. For good measure, we reset the computer accounts in AD for those hosts; if you want, you can even remove the Hyper-V role. We then rejoined those nodes to the domain. If you removed the Hyper-V role, you now reinstall it. At that point the SCP issue was already fixed in AD. We then added the hosts back to the cluster and they have been running smoothly ever since. The Event ID 12660 entries are gone, as are the VSS errors. It’s a perfect Hyper-V cluster now.

Root Cause?

We think that somewhere during the life cycle of the hosts, the servers were renamed while still joined to the domain and with the Hyper-V role installed. This might have caused the issue. During a Cluster Operating System Rolling Upgrade with an in-place upgrade, we also sometimes see the need to remove and re-add the Hyper-V role. That might also have caused the issue. We are not 100% certain, but that’s the working theory and a point of attention for future operations.

Off Host Backup Jobs with Veeam and Replay Manager 7.8

It’s all about application consistent hardware VSS provider snapshots

I was browsing to see if I could already download Replay Manager 7.8 for our Compellent (SC) SANs. No luck yet, but I did find the release notes. There was a real gem in there on Off Host Backup Jobs with Veeam and Replay Manager 7.8. We’ll get back to that after the big deal here.


So what kind of goodness is in there? Well, obviously there is the way too long overdue support for Windows Server 2016, including the Hyper-V role and its features. That is great news. We now have application-consistent hardware VSS provider snapshots. I do not know what took them so long, but they need to get with the program here. I have given this as feedback before, and again at DELL EMC World 2017. The Compellent is still one of the best “traditional” centralized storage SAN solutions out there that punches far above its weight. On top of that, having looked at Unity from DELL EMC, I can tell you that in my humble opinion the Compellent has no competition from it.

Off Host Backup Jobs with Veeam and Replay Manager 7.8

Equally interesting to me, as someone who leverages Compellent and Veeam Backup & Replication with off-host proxies (I wrote FREE WHITE PAPER: Configuring a VEEAM Off Host Backup Proxy Server for backing up a Windows Server 2012 R2 Hyper-V cluster with a DELL Compellent SAN (Fiber Channel)), is the following. Under fixed issues we find:

RMS-24 Off-host backup jobs might fail during the volume discover scan when using Veeam backup software.

I have off-host proxies with transportable snapshots working pretty smoothly, but with the occasional hiccup. Maybe some of those will disappear with Replay Manager 7.8. I’m looking forward to putting that to the test and rolling forward with Windows Server 2016 for those nodes where we need and want to leverage the Compellent hardware VSS provider. When I do, I’ll let you know the results.

Full or Thick Provisioned Volume on Compellent

Introduction

There are pundits out there who claim that you cannot create a fully provisioned LUN on a Compellent SAN. Now that’s what I call unsubstantiated rumors, better known as bullshit.

Sure, the magic sauce of many modern storage arrays lies in thin provisioning. Let there be no mistake about that. But there are scenarios where you might want to leverage a fully provisioned volume, also known as a thick provisioned LUN. You can read about one such scenario where it makes perfect sense in this blog post: Mind the UNMAP Impact On Performance In Certain Scenarios.

Create a Full or Thick Provisioned Volume on Compellent

First of all, you create a brand new volume in the Storage Center System Explorer. That’s as standard as it gets.

You then map this volume to a server.

At that moment, before you even mount that volume on your server, let alone do anything else such as bringing it online or formatting it, you “Preallocate Storage” for that volume in Storage Center.


You’ll get a warning, as this is not a default action and you should only do so when the IO conditions warrant it.


When you continue, you’ll get some feedback. This can take quite some time depending on the size of the volume.


When it’s done, peek at the statistics of that full or thick provisioned volume on the Compellent. This is what it looks like after it was done, so before we even formatted the volume on a server or wrote data to it: it’s using all the space on the SAN from the start.


Due to data protection it’s even more. It’s clear from those statistics that a 500GB disk in RAID 10, fully provisioned, is using 1TB of space, as it is all still in RAID 10 (no tiering down has occurred yet). RAID 10 has an overhead factor of 2. The volume is for a large part in Tier 2 because my Tier 1 is full, so writes spilled over into Tier 2.
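
To make the overhead math explicit, here is a trivial sketch that computes the raw space a fully preallocated volume consumes for a given RAID overhead factor; the 500 GB and factor 2 numbers match the example above, anything else is purely illustrative.

def raw_space_gb(volume_gb: float, raid_overhead_factor: float) -> float:
    # Raw SAN space consumed by a fully preallocated volume.
    return volume_gb * raid_overhead_factor

# RAID 10 mirrors every write, so the overhead factor is 2:
print(raw_space_gb(500, 2.0))  # 1000 GB, i.e. roughly the 1 TB shown for the 500 GB volume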

Now compare this to a thinly provisioned volume that we just created; again, we haven’t even touched it in any other way.


Yup, until we actually write data to the volume it’s highly space efficient. There is absolutely no space used, and we’ll see only a little when we mount the disk in Windows, initialize it, create a simple volume and format it.

This is completely in Tier 2, as my Tier 1 is full. I accept donations of SANs and SSDs for my lab if this bothers you. When we write data to it you’ll see this rise, and over time you’ll see it tier down and up as well.

Dell Compellent SCOS 6.7 ODX Bug Heads Up

UPDATE 3: Bad and disappointing news. After update 2 we’ve seen DELL change the CSTA (CoPilot Services Technical Alert) on the customer website to “will be fixed” in a future version. According to the latest comment on this blog post, that would be in Q1 2017. Basically this is unacceptable, and it’s a shame to see a SAN that was one of the best when it comes to Hyper-V support in Windows Server 2012 / 2012 R2 decline in this way. If 7.x is required for Windows Server 2016 support, this is pretty bad, as it means early adopters are stuck or we’ll have to find and recommend another solution. This is not a good day for Dell storage.

UPDATE 2: As you can read in the comments below, people are still having issues. Do NOT just update without checking everything.

UPDATE: This issue has been resolved in Storage Center 6.7.10 and 7.X.

If you are on a 6.7.x version below 6.7.10, it’s time to think about moving to 6.7.10!

No vendor is exempt from errors, issues, mistakes and trouble with advanced features, and unfortunately Dell Compellent has an issue with Windows Server 2012 (R2) ODX in the current release of SCOS 6.7. Bar a performance issue in a 6.4 version, they have had a very good track record in regards to ODX, UNMAP, … so far. But no matter how good you are, bad things can happen.


I’ve had two people who were bitten by it contact me. The issue is described below.

In SCOS 6.7 an issue has been determined when the ODX driver in Windows Server 2012 requests an Extended Copy between a source volume which is unknown to the Storage Center and a volume which is presented from the Storage Center. When this occurs the Storage Center does not respond with the correct ODX failure code. This results in the Windows Server 2012 not correctly recognizing that the source volume is unknown to the Storage Center. Without the failure code Windows will continually retry the same request which will fail. Due to the large number of failed requests, MPIO will mark the path as down. Performing ODX operations between Storage Center volumes will work and is not exposed to this issue.

You might think that this is not a problem as you might only use Compellent storage, but think again. Local disks on the hosts where data is stored temporarily, and external storage you use to transport data in and out of your datacenter or to copy backups to, are all use cases we can encounter. When ODX is enabled, as it is by default on Windows Server 2012 (R2), the file system will try to use it and, when that fails, fall back to normal (non-ODX) operations. All of this is transparent to the users. But here MPIO will mark the Compellent path as down. Ouch. I will not risk that. Any IO between a non-Compellent LUN and a Compellent LUN might cause this to happen.

The only workaround for now is to disable ODX on all your hosts. To me that’s unacceptable and I will not be upgrading to 6.7 for now. We rely on ODX to gain performance benefits at both the physical and virtual layer. We even have our SMB 3 capable clients in the branch offices leverage ODX to avoid costly data copies to our clustered Transparent Failover File Servers.
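
For reference, disabling ODX host-wide comes down to setting the documented FilterSupportedFeaturesMode registry value to 1 (0 re-enables it). Below is a minimal sketch using Python’s winreg module; run it elevated on each host and treat it as an illustration, not a turnkey tool.

# Toggle ODX via HKLM\SYSTEM\CurrentControlSet\Control\FileSystem\FilterSupportedFeaturesMode.
# 1 = ODX disabled, 0 = ODX enabled (the default when the value is absent).
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\FileSystem"
VALUE_NAME = "FilterSupportedFeaturesMode"

def set_odx(enabled: bool) -> None:
    # Requires administrative rights on the host.
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_READ | winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 0 if enabled else 1)

def odx_enabled() -> bool:
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _ = winreg.QueryValueEx(key, VALUE_NAME)
    except FileNotFoundError:
        return True  # value not set: ODX is on by default
    return value == 0

if __name__ == "__main__":
    set_odx(False)  # the workaround: turn ODX off on this host
    print("ODX enabled:", odx_enabled())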

When a version arrives that fixes the issue, I’ll test it even more elaborately than before. We’ve come to pay attention to performance issues or data corruption with many vendors, models and releases, but this MPIO issue is a new one for me.