Live Migration Fails due to non-existent SharedStoragePath or ConfigStoreRootPath

Introduction

I was tasked with troubleshooting a cluster where Cluster-Aware Updating (CAU) failed because the nodes never succeeded in going into maintenance mode. None of the obvious or well-known issues and mistakes that might break live migrations seemed to be present. When I looked at the cluster and tested live migration, not a single VM on any node would live migrate to any other node.
So, I take a peek at the event ID and description and it hits me. I have seen this particular event ID before.

Live Migration Fails due to non-existent SharedStoragePath or ConfigStoreRootPath

Log Name:      System
Source:        Microsoft-Windows-Hyper-V-High-Availability
Date:          9/27/2018 15:36:44
Event ID:      21502
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Live migration of ‘Virtual Machine ADFS1’ failed.
Virtual machine migration operation for ‘ADFS1’ failed at migration source ‘NODE-B’. (Virtual machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7)
Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).

The live migration fails due to a non-existent SharedStoragePath or ConfigStoreRootPath, which is where the collections metadata lives.

More errors are logged

There usually are more related tell-tale events. They are, however, clear in pinpointing the root cause.

On the destination host

On the destination host you’ll find event id 21066:

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:45
Event ID:      21066
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-A.datawisetech.corp
Description:
Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).

You’ll also find a bunch of 1106 events per failed live migration per VM, like the one below:

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Operational
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:45
Event ID:      1106
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-A.datawisetech.corp
Description:
vm\service\migration\vmmsvmmigrationdestinationtask.cpp(5617)\vmms.exe!00007FF77D2171A4: (caller: 00007FF77D214A5D) Exception(998) tid(1fa0) 80070002 The system cannot find the file specified.

On the source host

On the source host you’ll find event id 1840 logged:
Log Name:      Microsoft-Windows-Hyper-V-Worker-Operational
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          9/27/2018 15:36:44
Event ID:      1840
Task Category: None
Level:         Error
Keywords:
User:          NT VIRTUAL MACHINE\4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7
Computer:      NODE-B.datawisetech.corp
Description:
[Virtual machine 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7] onecore\vm\worker\migration\workertaskmigrationsource.cpp(281)\vmwp.exe!00007FF6E7C46141: (caller: 00007FF6E7B8957D) Exception(2) tid(ff4) 80042001     CallContext:[\SourceMigrationTask]

As well as event id 21111:
Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-High-Availability
Date:          9/27/2018 15:36:44
Event ID:      21111
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Live migration of ‘Virtual Machine ADFS1’ failed.

… event id 21066:
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:44
Event ID:      21066
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).

… and event id 21024:
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:44
Event ID:      21024
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Virtual machine migration operation for ‘ADFS1’ failed at migration source ‘NODE-B’. (Virtual machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7)
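
If you want to pull these events from several nodes in one go instead of clicking through Event Viewer, a minimal sketch could look like the one below. It only uses Get-WinEvent with the log names and event IDs quoted in the listings above. The function names are made up for this illustration, and the one-day look-back window is an assumption you should adjust.

```powershell
# Sketch only: the function names are hypothetical, not built-in cmdlets.
# Builds one Get-WinEvent filter per log, using the event IDs shown above.
function Get-MigrationEventFilter {
    param([datetime]$Since = (Get-Date).AddDays(-1))   # look-back window is an assumption
    @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-Hyper-V-High-Availability'; Id = 21502; StartTime = $Since }
    @{ LogName = 'Microsoft-Windows-Hyper-V-High-Availability-Admin'; Id = 21111; StartTime = $Since }
    @{ LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'; Id = 21066, 21024; StartTime = $Since }
}

# Queries every given node and returns the matching events.
function Get-MigrationFailureEvent {
    param(
        [Parameter(Mandatory)][string[]]$ComputerName,
        [datetime]$Since = (Get-Date).AddDays(-1)
    )
    foreach ($computer in $ComputerName) {
        # SilentlyContinue because Get-WinEvent raises an error when nothing matches.
        Get-WinEvent -ComputerName $computer -FilterHashtable (Get-MigrationEventFilter -Since $Since) -ErrorAction SilentlyContinue |
            Select-Object MachineName, TimeCreated, Id, LogName, Message
    }
}

# Example call, using the node names from the listings above:
# Get-MigrationFailureEvent -ComputerName 'NODE-A', 'NODE-B' | Sort-Object TimeCreated
```

Sorting on TimeCreated puts the source and destination events next to each other, which makes it easier to see that they all belong to the same failed migration attempt.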

Live migration fails due to non-existent SharedStoragePath or ConfigStoreRootPath explained

If you have worked with guest clusters and the ConfigStoreRootPath you know about issues with collections/groups & checkpoints. This is related to those. If you haven’t heard about this yet, read https://blog.workinghardinit.work/2018/09/10/correcting-the-permissions-on-the-folder-with-vhds-files-checkpoints-for-host-level-hyper-v-guest-cluster-backups/.

This is what a Windows Server 2016/2019 cluster that has not been configured with a SharedStoragePath looks like.

Get-VMHostCluster -ClusterName "W2K19-LAB"

[Image: Get-VMHostCluster output for a cluster without a SharedStoragePath]

Under HKLM\Cluster\Resources\GUIDofWMIResource\Parameters there is a value called ConfigStoreRootPath, which in PowerShell is known as the SharedStoragePath property. You can also query it via the parameters of the Virtual Machine Cluster WMI resource.

And this is what it looks like in the registry (the 0.Cluster and Cluster keys). The resource ID we are looking at is the one of the Virtual Machine Cluster WMI resource.

[Image: the ConfigStoreRootPath value in the registry]

If it returns a path, you must verify that it exists. If it doesn’t, you’re in trouble with live migrations. You will also be in trouble with host level guest cluster backups or Hyper-V replicas of them. Maybe you don’t have guest clusters, or you use in guest backups, and this is just a remnant of trying them out.
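
The check described above is easy to sketch in PowerShell. The helper below is made up for this illustration (it is not a built-in cmdlet); it only classifies a path value as not configured, existing or missing. Run it on a cluster node, since the path normally lives under C:\ClusterStorage.

```powershell
# Sketch only: Test-SharedStoragePath is a hypothetical helper name.
# Classifies the value returned by Get-VMHostCluster.
function Test-SharedStoragePath {
    param([string]$Path)
    if ([string]::IsNullOrWhiteSpace($Path)) { return 'NotConfigured' }   # default state, nothing to verify
    if (Test-Path -LiteralPath $Path -PathType Container) { return 'Exists' }
    return 'Missing'                                                      # this is the case that breaks live migration
}

# Example calls on a cluster node (cluster name taken from the examples in this post):
# $cluster = Get-VMHostCluster -ClusterName 'W2K19-LAB'
# Test-SharedStoragePath -Path $cluster.SharedStoragePath
#
# The same value, read from the cluster resource parameters:
# Get-ClusterResource 'Virtual Machine Cluster WMI' | Get-ClusterParameter -Name ConfigStoreRootPath
```

A result of Missing is the situation described in this post. NotConfigured is what a cluster that never used VHD Set guest clusters should return.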

When I run it on the problematic cluster I get a path that points to a folder on a CSV that doesn’t exist.

Get-VMHostCluster -ClusterName "W2K19-LAB"
ClusterName SharedStoragePath
----------- -----------------
W2K19-LAB   C:\ClusterStorage\ReFS-01\SharedStoragePath

What happened?

Did they rename the CSV? Replace the storage array? Well, as it turned out, they reorganized and resized the CSVs. As they can’t shrink SAN LUNs, they created new ones. They then leveraged storage live migration to move the VMs.

The old CSVs were left in place for about 6 weeks before they were cleaned up. As this was the first time they ran Cluster-Aware Updating after removing them, this is the first time they hit this problem. Bingo! You probably think you’ll just change it to an existing CSV folder path or delete it. Well, as it turns out, you cannot do that. You can try …

PS C:\Users\administrator1> Set-VMHostCluster -ClusterName "W2K19-LAB" -SharedStoragePath "C:\ClusterStorage\Volume1\SharedStoragePath"

Set-VMHostCluster : The operation on computer ‘W2K19-LAB’ failed: The WS-Management service cannot process the request. The WMI service or the WMI provider returned an unknown error: HRESULT 0x80070032
At line:1 char:1
+ Set-VMHostCluster -ClusterName "W2K19-LAB" -SharedStoragePath "C:\Clu …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo          : NotSpecified: (:) [Set-VMHostCluster], VirtualizationException
+ FullyQualifiedErrorId : OperationFailed,Microsoft.HyperV.PowerShell.Commands.SetVMHostCluster

Or try …
$path = "C:\ClusterStorage\Volume1\Hyper-V\Shared"
Get-ClusterResource "Virtual Machine Cluster WMI" | Set-ClusterParameter -Name ConfigStoreRootPath -Value $path -Create

Whatever you try, deleting, overwriting, … no joy. As it turns out you cannot change it and this is by design. A shaky design I would say. I understand the reasons: if it changes or is deleted and you have guest clusters with collections depending on what’s in there, you have backup and live migration issues with those guest clusters. But if you can’t change it you also run into issues when the storage changes. You’re damned if you do, damned if you don’t.

Workaround 1

What

Create a CSV with the old name and folder(s) to which the current path is pointing. That works. It could even be a very small one. As a test I used one of 1GB. I’m not sure if that’s enough over time, but if you can easily extend your CSV that should not pose a problem. It might actually be a good idea to have this as a best practice: have a dedicated CSV for the SharedStoragePath. I’ll need to ask Microsoft.

How

You know how to create a CSV and a folder I guess, that’s about it.  I’ll leave it at that.

Workaround 2

What

Either set the path to a new one in the registry (mind you, this won’t fix any problems you might already have with existing guest clusters).

Or delete the value for the current path and leave it empty. This one is only a good idea if you don’t have a need for VHD Set guest clusters anymore. Basically, this is resetting it to the default value.

How

There are 2 ways to do this. Both cost down time. You need to bring the cluster service down on all nodes, and then you don’t have your CSVs. That means your VMs must be shut down on all nodes of the cluster.

The Microsoft Support way

Well, that’s what they make you do (which doesn’t mean you should just do it without them instructing you to do so).

  1. Export your HKLM\Cluster\Resources\GUIDofWMIResource\Parameters key for safekeeping and restore if needed.
  2. Shut down all VMs in the cluster, including the ones residing on a CSV even if they are not clustered.
  3. Stop the cluster service on all nodes (the cluster is shut down if you do that), leaving the one you are working on for last.
  4. From one node, open up the registry editor.
  5. Click on HKEY_LOCAL_MACHINE and then click on File, then select Load Hive.
  6. Browse to c:\windows\cluster, and select CLUSDB.
  7. Click OK, and then name it DB.
  8. Expand DB, then expand Resources.
  9. Select the GUID of the Virtual Machine Cluster WMI resource.
  10. Click on Parameters. On ConfigStoreRootPath you will find the value.
  11. Double click on it, and delete it or set it to a new path on a CSV that you created already.
  12. Start the cluster service on this node.
  13. Then start the cluster service on all the other nodes, node by node.
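
Step 1 of the list above, the export for safekeeping, can be done from PowerShell while the cluster service is still running. This is a minimal sketch, not part of the Microsoft instructions. The function name and the backup location are made up for this illustration, and it assumes that the Id property returned by Get-ClusterResource matches the GUID key name under HKLM\Cluster\Resources.

```powershell
# Sketch only: Export-ConfigStoreParameters is a hypothetical helper name.
# Exports the Parameters key of the Virtual Machine Cluster WMI resource to a .reg file.
function Export-ConfigStoreParameters {
    param([string]$BackupFolder = $env:TEMP)   # backup location is an assumption

    $resource = Get-ClusterResource -Name 'Virtual Machine Cluster WMI'
    # Assumption: the resource Id is the GUID used as the registry key name.
    $key  = "HKLM\Cluster\Resources\$($resource.Id)\Parameters"
    $file = Join-Path $BackupFolder ('ConfigStoreParameters-{0:yyyyMMdd-HHmmss}.reg' -f (Get-Date))

    # reg.exe export writes the key to the file; /y overwrites without prompting.
    reg.exe export $key $file /y | Out-Null
    $file   # return the path of the backup file
}

# Example call on a cluster node, before stopping anything:
# Export-ConfigStoreParameters -BackupFolder 'C:\Temp'
```

Keep the resulting file somewhere that is not on a CSV, as the CSVs are gone once the cluster service is stopped.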

My way

Not supported, at your own risk, big boy rules apply. I have tried and tested this a dozen times in the lab on multiple clusters and this also works.

  1. In the registry key Cluster (HKLM\Cluster\Resources\GUIDofWMIResource\Parameters) of every cluster node, delete the content of the REG_SZ value for ConfigStoreRootPath so it is empty, or change it to a new path on a CSV that you created already for this purpose.
  2. If you have a cluster with a disk witness, the node that owns the disk witness also has a 0.Cluster key (HKLM\0.Cluster\Resources\GUIDofWMIResource\Parameters). Make sure you also change the value there.
  3. When you have done this, you have to shut down all the virtual machines. You then stop the cluster service on every node. I try to work on the node owning the disk witness and shut down the cluster service on that one as the final step. This is also the one where I start the cluster again first, so I can easily check that the value remains empty in both the Cluster and the 0.Cluster keys. Do note that with a file share / cloud share witness, knowing what node was shut down last can be important. See https://blog.workinghardinit.work/2017/12/11/cluster-shared-volumes-without-active-directory/. That’s why I always remember what node I’m working on and shut down last.
  4. Start up the cluster service on the other nodes one by one.
  5. This avoids having to load the registry hive, but editing the registry on every node in large clusters is tedious. Sure, this can be scripted in combination with shutting down the VMs, stopping the cluster service on all nodes, changing the value and then starting the cluster services again as well as the VMs. You can control the order in which you go through the nodes in a script as well. I actually did script this, but I used my method. You can find it at the bottom of this blog post.

Both methods will work and live migrations will work again. Any existing problematic guest cluster VMs in backup or live migration are food for another blog post perhaps. But you’ll have things like that driving you crazy.
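
Before and after editing, it helps to see what every node currently holds in both keys. The sketch below is read-only and changes nothing. It is not the script mentioned in the list above. The function name is made up for this illustration, and it assumes PowerShell remoting works between the nodes, that the cluster service is running when you call it, and that the Id property of the cluster resource matches the GUID key name.

```powershell
# Sketch only: Get-ConfigStoreRootPathPerNode is a hypothetical helper name.
# Reads ConfigStoreRootPath from the Cluster and 0.Cluster keys on every node.
function Get-ConfigStoreRootPathPerNode {
    # Assumption: the resource Id is the GUID used as the registry key name.
    $id    = (Get-ClusterResource -Name 'Virtual Machine Cluster WMI').Id
    $nodes = (Get-ClusterNode).Name

    Invoke-Command -ComputerName $nodes -ArgumentList $id -ScriptBlock {
        param($ResourceId)
        foreach ($root in 'HKLM:\Cluster', 'HKLM:\0.Cluster') {
            $key = "$root\Resources\$ResourceId\Parameters"
            # The 0.Cluster key only exists on the node that owns the disk witness.
            if (Test-Path -Path $key) {
                [pscustomobject]@{
                    Node                = $env:COMPUTERNAME
                    Key                 = $key
                    ConfigStoreRootPath = (Get-ItemProperty -Path $key).ConfigStoreRootPath
                }
            }
        }
    }
}

# Example call from any cluster node:
# Get-ConfigStoreRootPathPerNode | Sort-Object Node | Format-Table Node, Key, ConfigStoreRootPath
```

Every row should show the same value. After the cluster service is back up on all nodes, run it again to confirm the change stuck in both the Cluster and the 0.Cluster keys.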

Some considerations

Workaround 1 is a bit of a “you got to be kidding me” solution, but at least it leaves some freedom to replace, rename and reorganize the other CSVs as you see fit. So perhaps having a dedicated CSV just for this purpose is not that silly. Another benefit is that this does not involve messing around in the cluster database via the registry. This is something we advise against all the time but now has become a way to get out of a pickle.

Workaround 2 speaks for itself. There are two ways to achieve this, which I have shown. But a word of warning. The moment the path changes and you have some already existing VHD Set guest clusters that somehow depend on it, you’ll see that backups start having issues and possibly even live migrations. But you’re toast for all your live migrations anyway already so … well yeah, what can I do.

So, this is by design. Maybe it is, but it isn’t very realistic that you’re stuck to a path and name that hard and that it causes this much grief or allows for people to shoot themselves in the foot. It’s not like all this is documented somewhere.

Conclusion

This needs to be fixed. While I can get you out of this pickle it is a tedious operation with some risk in a production environment. It also requires down time, which is bad. On top of that it will only have a satisfying result if you don’t have any VHD Set guest clusters that rely on the old path. The mechanism behind the SharedStoragePath isn’t as robust and flexible yet as it should be when it comes to changes & dealing with failed host level guest cluster backups.

I have tested this in the Windows Server 2019 Insider Preview. The issue is still there. No progress on that front. Maybe in some of the future cumulative updates things will be fixed to make guest clustering with VHD Set a more robust and reliable solution. The fact that Microsoft relies on guest clustering to support some deployment scenarios with S2D makes this even more disappointing. It is also a reason I still run physical shared storage-based file clusters.

The problematic host level backups I can work around by leveraging in guest backups. But the path issue is unavoidable if changes are needed.

After 2 years of trouble with the framework around guest cluster backups / VHD Set, it’s time this “just works”. No one will use it when it remains this troublesome and you won’t fix this if no one uses this. The perfect catch 22 situation.

The Script

The script is below as promised. If you use this without testing in a production environment and it blows up in your face, you are going to get fired and it is your fault. You can use it both to introduce and to fix the issue. The actions are logged in the directory where the script is run from.

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups

Introduction

It’s no secret that, while guest clustering with VHD Sets works very well, we’ve had some struggles with host level backups. Right now I leverage Veeam Agent for Windows (VAW) to do in guest backups. The most recent versions of VAW support Windows Failover Clustering. I’d love to leverage host level backups but I was struggling to make this reliable for quite a while. As it turned out recently, there are some virtual machine permission issues involved that we need to fix. Both Microsoft and Veeam have published guidance on this in a KB article. We automated correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup.

The KB articles

In early August Microsoft published a KB article with all the tips for when things fail: “Errors when backing up VMs that belong to a guest cluster in Windows”. Veeam also recapitulated the conditions and settings needed to leverage guest clustering and perform host level backups. The Veeam article is “Backing up Hyper-V guest cluster based on VHD set”. Read these articles carefully and make sure all you need to do has been done.

For some reason another prerequisite is not mentioned in these articles. It is however discussed in “ConfigStoreRootPath cluster parameter is not defined” and here: https://docs.microsoft.com/en-us/powershell/module/hyper-v/set-vmhostcluster?view=win10-ps. You will need to set this to make the proper Hyper-V collections needed for recovery checkpoints on VHD Sets. It is a little-known setting with very little documentation.

But the big news here is fixing a permissions related issue!

The latest addition to the list of attention points is a permission issue. These permissions are not correct by default for the guest cluster VMs’ shared files. This leads to the following hard-to-pinpoint error.

Error Event 19100 Hyper-V-VMMS 19100 ‘BackupVM’ background disk merge failed to complete: General access denied error (0x80070005).

To fix this issue, the folder that holds the VHDS files and their snapshot files must be modified to give the VMMS process additional permissions. To do this, follow these steps for correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup.

Determine the GUIDs of all VMs that use the folder. To do this, start PowerShell as administrator, and then run the following command:

Get-VM | Format-List Name, Id
Output example:
Name : BackupVM
Id : d3599536-222a-4d6e-bb10-a6019c3f2b9b

Name : BackupVM2
Id : a0af7903-94b4-4a2c-b3b3-16050d5f80f2

For each VM GUID, assign the VMMS process full control by running the following command:
icacls <Folder with VHDS> /grant "NT VIRTUAL MACHINE\<VM GUID>":(OI)F

Example:
icacls "c:\ClusterStorage\Volume1\SharedClusterDisk" /grant "NT VIRTUAL MACHINE\a0af7903-94b4-4a2c-b3b3-16050d5f80f2":(OI)F
icacls "c:\ClusterStorage\Volume1\SharedClusterDisk" /grant "NT VIRTUAL MACHINE\d3599536-222a-4d6e-bb10-a6019c3f2b9b":(OI)F
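
The two manual steps above (look up the VM GUIDs, then run icacls per GUID) can be looped in PowerShell. What follows is a minimal sketch of that loop. The function names are made up for this illustration. Note that Get-VM only sees the VMs on the host you run it on, so with guest cluster nodes spread over several hosts you would run it on each host.

```powershell
# Sketch only: both function names are hypothetical helpers.
# Builds the icacls grant argument for one VM GUID, as in the KB example above.
function Get-VhdsGrantArgument {
    param([Parameter(Mandatory)][guid]$VmId)
    "NT VIRTUAL MACHINE\${VmId}:(OI)F"
}

# Grants every named VM on this host full control on the folder with the VHDS files.
function Grant-VhdsFolderAccess {
    param(
        [Parameter(Mandatory)][string]$Folder,
        [Parameter(Mandatory)][string[]]$VMName
    )
    foreach ($vm in Get-VM -Name $VMName) {
        # Same command as the manual example: icacls <folder> /grant "<account>:(OI)F"
        icacls.exe $Folder /grant (Get-VhdsGrantArgument -VmId $vm.Id)
    }
}

# Example call with the folder and VM names from the KB example above:
# Grant-VhdsFolderAccess -Folder 'C:\ClusterStorage\Volume1\SharedClusterDisk' -VMName 'BackupVM', 'BackupVM2'
```

Building the argument in a separate function keeps the string format in one place, so you can check it against the KB article before anything touches the ACLs.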

My little PowerShell script

The above is tedious manual labor with a lot of copy-pasting, and time consuming at best. With larger guest clusters the probability of mistakes increases. To fix this we wrote a PowerShell script to handle this for us.

Below is an example of the output of this script. It provides some feedback on what is happening.

[Image: example output of the script correcting the permissions on the folder with VHDS files & checkpoints]

PowerShell for the win. This saves you some searching and typing and potentially making some mistakes along the way. Have fun. More testing is underway to make sure things are now predictable and stable. We’ll share our findings with you.

Collect cluster nodes with HBA WWN info

Introduction

Below is a script that I use to collect cluster nodes with HBA WWN info. It grabs the cluster nodes and their HBA (virtual ports) WWN information from an existing cluster. In this example the nodes have Fibre Channel (FC) HBAs. It works equally well for iSCSI HBAs or other cards. You can use the collected info in real time. As an example I also demonstrate writing and reading the info to and from a CSV.

This script comes in handy when you are replacing the storage arrays. You’ll need that info to do the FC zoning, for example, and to create the cluster and server objects with the correct HBAs on the new storage arrays if they allow for automation. As a Hyper-V cluster admin you can grab all that info from your cluster nodes without the need to have access to the SAN or FC fabrics. You can use it yourself and hand it over to those handling them, who can use it to cross check the info they see on the switch or the old storage arrays.


Script to collect cluster nodes with HBA WWN info

The script demos a single cluster but you could use it for many. It collects the cluster name, the cluster nodes and their Emulex HBAs. It writes that information to a CSV file you can read easily in an editor or Excel.


The script demonstrates reading that CSV file and parsing the info. That info can be used in PowerShell to script the creation of the cluster and server objects on your SAN and add the HBAs to the server objects. I recently used it to move a bunch of Hyper-V and File clusters to new DELLEMC SC Series storage arrays. These have the DELL Storage PowerShell SDK. You might find it useful as an example and adapt it to your own needs (iSCSI, brand, model of HBA etc.).
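
To give an idea of the approach, here is a minimal sketch of collecting the WWNs per node and round-tripping them through a CSV file. It is not the script described above. It swaps in the generic Get-InitiatorPort cmdlet from the Storage module instead of anything Emulex specific, and the function name, the CSV file name and the 'Fibre*' filter on ConnectionType are assumptions made for this illustration.

```powershell
# Sketch only: Get-ClusterHbaInfo is a hypothetical helper name.
# Collects the node and port WWNs of the FC initiator ports on every cluster node.
function Get-ClusterHbaInfo {
    param([Parameter(Mandatory)][string]$ClusterName)

    foreach ($node in Get-ClusterNode -Cluster $ClusterName) {
        # Assumption: FC ports report a ConnectionType starting with 'Fibre'.
        Get-InitiatorPort -CimSession $node.Name |
            Where-Object { $_.ConnectionType -like 'Fibre*' } |
            Select-Object @{ Name = 'Cluster'; Expression = { $ClusterName } },
                          @{ Name = 'Node'; Expression = { $node.Name } },
                          NodeAddress, PortAddress
    }
}

# Example: write the info to a CSV file and read it back, grouped per node.
# Get-ClusterHbaInfo -ClusterName 'W2K19-LAB' | Export-Csv -Path .\ClusterHbaInfo.csv -NoTypeInformation
# Import-Csv -Path .\ClusterHbaInfo.csv | Group-Object -Property Node
```

For iSCSI you would change the filter on ConnectionType. The CSV file is what you hand over to whoever handles the zoning or the server objects on the array.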

Does the DELL VRTX Support Storage Spaces anno 2018?

Someone asked on my blog if the DELL VRTX supported Storage Spaces. It’s 2018 and when I wrote about the VRTX it was mainly as a Cluster in a Box (CiB) solution. This is based on a shared SAS RAID controller. The addition of a second controller improved the redundancy (past the write-through requirement we had in 2014), even though I would really like to see a native in box redundant network solution here as well. Whether this is suitable for your needs is something only you can determine.

But as far as support for Microsoft Shared Storage Spaces or Storage Spaces goes, that isn’t there and I would advise against it. A storage controller configuration (pass-through) for the DELL Technologies VRTX series that supported any form of Storage Spaces never came. With 2 nodes and the VRTX supporting two storage controllers this would theoretically be possible, but with 3 or 4 nodes (the VRTX supports up to 4 nodes) that’s another challenge.

While I have liked the idea and even suggested it as a possible path, it has never materialized. If S2D, especially in combination with ReFSv3 or beyond, becomes immensely popular, they might consider it. But for now it’s not something I see happening, and they might very well choose other offerings to serve that demand anyway, ones with a better design for the separate pass-through capable storage controllers.

As a cluster in a box solution the VRTX does hold merit. As said, I’d love to see a few improvements made to make it fully redundant all in box. A ruggedized version for industrial or highly mobile environments could make an unbeatable offering.

DISCLAIMER: I don’t work for DELL, I don’t get paid by DELL, I don’t speak for DELL. This is my current independent opinion.