Hyper-V Cluster Node Pause & Drain fails – Live Migrations fail with “The requested operation cannot be completed because a resource has locked status”

One night I was doing some maintenance on a Hyper-V cluster and I wanted to Pause and drain one of the nodes that was up next for some tender loving care. But I was greeted by some messages:

image

[Window Title]
Resource Status

[Main Instruction]
The requested operation cannot be completed because a resource has locked status.

[Content]
The requested operation cannot be completed because a resource has locked status.

[OK]

Strange, the cluster is up and running, none of the other nodes had issues and operational wise all VMs are happy as can be. So what’s up? Not to much in the error logs except for this one related to a backup. Aha …We fire up disk part and see some extra LUNs mounted + using “vssadmin list writers“ we find:

clip_image002

 

 

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Unexpected error

Bingo! Hello old “friend”, I know you! The Microsoft Hyper-V VSS Writer goes into an error state during the making of hardware snapshots of the LUNs due to almost or completely full partitions inside the virtual machines. Take a look at this blog post on what causes this and how to fix fit. As a result we can’t do live migrations anymore or Pause/Drain the node on which the hardware snapshots are being taken.

And yes, after fixing the disk space issue on the VM (a SDT who’s pumped the VM disks 99.999% full) the Hyper-V VSS writer get’s out of the error state and the hardware provider can do it’s thing. After the snapshots had completed everything was fine and I could continue with my maintenance.

PowerShell: Monitoring DrainStatus of a Hyper-V Host & The Time Limited Value of Information In Beta & RC Era Blogs

I was writing some small PowerShell scripts to kick pause and resume Hyper-V cluster hosts and I wanted to monitor the progress of draining the virtual machines of the node when pausing it. I found this nice blog about Draining Nodes for Planned Maintenance with Windows Server 2012 discussing this subject and providing us with the properties to do just that.

It seems we have two common properties at our disposal: NodeDrainStatus and NodeDrainTarget.

image

So I set to work but I just didn’t manage to get those properties to be read. It was like they didn’t exist. So I pinged Jeff Wouters who happens to use PowerShell for just about anything and asked him if it was me being stupid and missing the obvious. Well it turned out to be missing the obvious for sure as those properties do no exist. Jeff told me to double check using:

Get-ClusterNode MyNode -cluster MyCluster | Select-Object -Property *

Guess what, it’s not NodeDrainStatus and NodeDrainTarget but DrainStatus and DrainTarget.

image

What put me off here was the following example in the same blog post:

Get-ClusterResourceType "Virtual Machine" | Get-ClusterParameter NodeDrainMoveTypeThreshold

That should have been a dead give away. As we’ve been using MoveTypeTresHold a lot the recent months and there is no NodeDrain in that value either. But it just didn’t register. By the way you don’t need to create the property either is exists. I guess this code was valid with some version (Beta?) but not anymore. You can just get en set the property like this

Get-ClusterResourceType “Virtual Machine” -Cluster MyCluster | Get-ClusterParameter MoveTypeThreshold

Get-ClusterResourceType “Virtual Machine” -Cluster MyCluster | Set-ClusterParameter MoveTypeThreshold 2000

So lessons learned. Trust but verify Smile.  Don’t forget that a lot of things in IT have a time limited value. Make sure that to look at the date of what you’re reading and about what pre RTM version of the product the information is relevant to.

To conclude here’s the PowerShell snippet I used to monitor the draining process.


Suspend-clusternode –Name crusader -Cluster warrior -Drain

Do
{
    Write-Host (get-clusternode –Name “crusader” -Cluster warrior).DrainStatus -ForegroundColor Magenta    
    Sleep 1
}
until ((get-clusternode –Name “crusader” -Cluster warrior).DrainStatus -ne "InProgress")

If ((get-clusternode –Name “crusader” -Cluster warrior).DrainStatus -eq "Completed")
{
    Write-Host (get-clusternode –Name “crusader” -Cluster warrior).DrainStatus -ForegroundColor Green
}

Which outputs

image

Understanding Virtual Machine Priority and Preemption Behavior

Introduction

By reading Aidan Finn his blog You Pause A Clustered Hyper-V Host And Low Priority VMs are QUICK MIGRATED! you will learn something about how virtual machine priorities work during the pausing and draining of a clustered Hyper-V host. They are either Live or quick migrated depending on the value of the MoveTypeThreshold cluster parameter for resources of the type “Virtual Machine”. By default it’s at 2000 and that happens to be the value of the virtual machine priority “Low”.

Changing this value can alter the default behavior. For example setting the MoveTypeThreshold value to 1000 using PowerShell

Get-ClusterResourceType “Virtual Machine” | Set-ClusterParameter MoveTypeThreshold 1000

makes sure that only VMs with a priority set to “No Auto Restart”  are quick migrated. The  low priority machines would than also live migrate where by default they quick migrate.

  • Virtual Machines with Priority equal to or higher than the value specified in MoveTypeThreshold will be moved using Live Migration.
  • Virtual Machines with Priority lower than the value specified in MoveTypeThreshold will be moved using Quick Migration.

Virtual Machine Priorities
3000 = High
2000 = Medium
1000 = Low
0 = Virtual machine does not restart automatically.

Another Scenario to be aware of  to avoid surprises

Note that al this also comes into play in other scenario’s. One of them is when you attempt to start a guest that requires more resources than available on the host. Preemption kicks in and the lower priority virtual machines go into a saved state.  If you didn’t plan for this it could be a bit of a surprise, causing service interruption. What’s also important to know is that preemption kicks in even when there is no chance that putting lower priority virtual machines into saved mode will free enough resources for (all) the VMs you’re trying to start. So that service interruption might do you no good. If this is the case the Low priority VMS come back up when there are sufficient resources left.  Do note however that the ones set top “No Auto Restart” remain in a saved state. Look below for an example on how this could happen.

How does this happen?

Let’s say you have a brand new VM that has gotten 16GB of RAM as requested by the business. When that large memory guest starts it will fail due to the fact that there are not enough memory resources available on the host that only has 16GB available. But as it attempts to start, the need for memory resources is detected and preemption comes into play. The guests with “Low” and “No Auto Restart” priorities are put into a saved state as the large memory VM has the default medium priority and the MoveTypeTreshold is at the default of 2000. You need to be ware of this behavior. Preemption kicks in and the machines are still saving while starting the large memory VM has already failed as they couldn’t free enough resources anyway.

image

The good new is that, as you can see below, is that the low priority guest starts again after starting the large memory guest has failed. No use keeping it saved as it can run and service customers. So the service interruption for this VM is limited but it does happen. Please also note that the guest set to No Auto Restart doesn’t come up again as it’s priority status says exactly that. So, this one becomes collateral damage.

image

As you can see it’s important to know how priorities and preemption work together and behave. It also good to know that changing the threshold come into play in more situations that just pausing & draining a host of during a fail over. While the cluster will try it’s best to keep as many VMs up and running you might have some unintended consequences under certain conditions. A good understanding of this can prevent you from being bitten here. So build a small cheap lab so you can play with stuff. This helps to gain a better understanding of how features work and behave. If you want to play some more, set the priority of the memory hungry VM to high you’ll see even more interesting things happen.

KB2803748 Failover Cluster Management snap-in crashes after you install update 2750149 on a Windows Server 2012-based failover cluster

When you install KB2750149 (An update is available for the .NET Framework 4.5 in Windows 8, Windows RT and Windows Server 2012) you’ll have an issue with the Cluster GUI.image

Basically it shows an error message. The issue caused by installing the above update 2750149 on a Windows Server 2012-based failover cluster or a management station running the Failover Cluster Management snap-in. In this situation, the Failover Cluster Management snap-in crashes. Do NOT worry, the entire cluster is fine, this is just a GUI bug that will leave your GUI work/results pane blank after closing the error screen and basically unusable.

clip_image002

The only known workaround was to uninstall the hotfix or not install it at all on any node where you need to use the Cluster GUI (Windows 8 with RSAT for example). But now there is a fix released with KB2803748.

The update requires no reboot unless you have the Cluster GUI running as that it locks the file that need replacing. So keep them closed and you’re good to go. Also, it’s also great opportunity to use Cluster Aware Updating (CAU) with the hotfix plug-in to install the hotfix in an orchestrated fashion.

UPDATE: This update is also available now via WSUS. So updating is possible via the CAU windows update plug-in Smile

image