Hyper-V Cluster Node Pause & Drain fails – Live Migrations fail with “The requested operation cannot be completed because a resource has locked status”

One night I was doing some maintenance on a Hyper-V cluster and wanted to pause and drain one of the nodes that was up next for some tender loving care. But I was greeted by this message:

image

[Window Title]
Resource Status

[Main Instruction]
The requested operation cannot be completed because a resource has locked status.

[Content]
The requested operation cannot be completed because a resource has locked status.

[OK]

Strange. The cluster is up and running, none of the other nodes have issues and, operationally, all VMs are as happy as can be. So what’s up? Not too much in the error logs, except for one entry related to a backup. Aha… We fire up diskpart and see some extra LUNs mounted, and with “vssadmin list writers“ we find:

image

 

 

Writer name: ‘Microsoft Hyper-V VSS Writer’
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
   State: [5] Waiting for completion
   Last error: Unexpected error

Bingo! Hello, old “friend”, I know you! The Microsoft Hyper-V VSS Writer goes into an error state while hardware snapshots of the LUNs are being taken, due to nearly or completely full partitions inside the virtual machines. Take a look at this blog post on what causes this and how to fix it. As a result we can no longer do live migrations or pause/drain the node on which the hardware snapshots are being taken.

And yes, after fixing the disk space issue on the VM (an SDT had pumped the VM disks 99.999% full) the Hyper-V VSS Writer got out of the error state and the hardware provider could do its thing. After the snapshots had completed everything was fine and I could continue with my maintenance.
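
If you want to check quickly which node is holding things up, something along these lines can query the Hyper-V VSS writer state on every cluster node. It’s a minimal sketch only, assuming PowerShell remoting is enabled on the nodes and using “MyCluster” as an example cluster name:

# A sketch only: list the Hyper-V VSS writer status on each node of the cluster.
# Assumes PowerShell remoting is enabled; "MyCluster" is an example name.
Import-Module FailoverClusters
$nodes = Get-ClusterNode -Cluster "MyCluster" | Select-Object -ExpandProperty Name

foreach ($node in $nodes) {
    Write-Host "=== $node ==="
    Invoke-Command -ComputerName $node -ScriptBlock {
        # vssadmin prints a text block per writer; grab the Hyper-V writer line
        # plus the four lines that follow it (Writer Id, Instance Id, State, Last error).
        vssadmin list writers | Select-String -Pattern "Hyper-V VSS Writer" -Context 0,4
    }
}

Any node where the writer is not reporting “State: [1] Stable” with “Last error: No error” is the one to investigate first.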

Logging Cluster Aware Updating Hotfix Plug-in Installations To A File Share

As an early adopter of Windows Server 2012 it’s not about being the first, it’s about using the great new features. When you leverage the Cluster Aware Updating (CAU) Hotfix Plug-in to deploy hardware vendor updates, like the DUPs (Dell Update Packages) from DELL, you have the option to log the process via the /L parameter.

This looks as follows in the config XML file for the CAU Hotfix Plug-in (I’ll address this XML file in more detail later).

<Folder name="Optiplex980DUPS" alwaysReboot="false">
    <Template path="$update$" parameters="/S /L=\\zulu\CAULogging\CAULog.log"/>
</Folder>

 

As you can see I use a file share, as I don’t want to log locally; that would mean I’d have to collect the logs from all nodes of the cluster. Now, if you log to a file share, you need to do two things, which we’ll discuss below.

1. Set up a share where you can write the log or logs to

Please note that you cannot and should not use the CAU hotfix file share for this. First of all, only a few accounts are allowed to have write permissions on the CAU file share. This is documented in How CAU Plug-ins Work:

Only certain security principals are permitted (but are not required) to have Write or Modify permission. The allowed principals are the local Administrators group, SYSTEM, CREATOR OWNER, and TrustedInstaller. Other accounts or groups are not permitted to have Write or Modify permission on the hotfix root folder.

This makes sense. SMB signing and encryption are used to protect the files against tampering in transit and to make sure you talk to the one and only real CAU file share. To protect the actual content of that share you need to make sure that no one but some trusted accounts and a select group of trusted administrators can add installers to the share. If not, you might be installing malicious content on your cluster nodes without ever realizing it. Perhaps some auditing on that folder structure might be a good idea?

image

This means that you need a separate file share where you can grant Modify, or at least Write, permissions to the necessary accounts on the folder. Which brings us to the second thing you need to do.

2. Set up Write or Modify permissions on the log share

You’ll need to set up Write or Modify permissions on the log share for all cluster node computer accounts. To make this more practical with larger clusters, you can add the computer accounts to an AD group, which makes for easier administration.

image

The two nodes here have permissions to write to the location

image

As you can see, the first node to create the log file is the owner:

image
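
If you prefer to script this instead of clicking through the GUI, a minimal sketch could look like this. It assumes a share called CAULogging on a server called zulu (matching the log path in the XML above), an example AD group CAU-ClusterNodes that holds the node computer accounts, and an example domain TESTLAB:

# A sketch only: create the log share and give the cluster node computer accounts
# (collected in the example AD group "CAU-ClusterNodes") Modify rights.
# Run this on the file server ("zulu" in the example above).
New-Item -Path "D:\CAULogging" -ItemType Directory -Force | Out-Null

New-SmbShare -Name "CAULogging" -Path "D:\CAULogging" -ChangeAccess "TESTLAB\CAU-ClusterNodes" -FullAccess "TESTLAB\Domain Admins"

# Share permissions alone are not enough; grant Modify at the NTFS level as well.
icacls "D:\CAULogging" /grant "TESTLAB\CAU-ClusterNodes:(OI)(CI)M"

If you grant permissions to individual computer accounts instead of a group, remember to use the NODENAME$ form.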

Some extra tips

The log can grow quite large if used a lot. Keep an eye on it to avoid space issues or a log that gets too big to handle and be useful. And for clarity’s sake you might want a separate log per cluster or even per folder type. You can customize this to your needs.

Cluster Aware Updating – Cluster CNO Name 15 Characters (NETBIOS name length) GUI Issue

There seems to be a small bug in the Cluster Aware Updating GUI when the cluster name exceeds 15 characters. In our example we’ll look at a cluster with the name XXXCLUSSQLSERVERS or xxxclussqlservers.test.lab. We’ll try to connect to that cluster to do some cluster aware updating.

Click on the dropdown arrow and select our cluster

image

 

Once selected, click “Connect”

image

 

Now we’re greeted by this little message

image

No, you didn’t make a typo; you selected the cluster from the drop-down list. You also know that your cluster is up and running. So what happened? Well, the GUI queries AD and returns the CNOs it finds. Those are limited to the NetBIOS name and as such are at most 15 characters long. In this case the name is XXXCLUSSQLSERVERS, which gives a CNO of XXXCLUSSQLSERVE, and that is not found as a cluster.

The fix is easy and simple. Just type in the full cluster name, XXXCLUSSQLSERVERS, and voilà: you can connect and are on your way.

image

Let’s see if the FQDN is accepted as well, shall we? And yes, the below screenshot proves this.

http://workinghardinit.files.wordpress.com/2012/12/image43.png?w=584

Conclusion

So this is not a problem once you know about it :-) The CAU GUI returns the cluster CNO name, and that’s the NetBIOS name, which can be at most 15 characters long. Selecting it in CAU to connect to the cluster doesn’t work; you need to fill out the complete name. As we demonstrated, the CAU GUI also accepts an FQDN. To prevent running into this issue, consider not making your cluster names longer than 15 characters. Then the CNO and the cluster name will be identical, which is a smart thing to do anyway as you’ll avoid possible duplicate CNOs trying (and failing) to be created, or other bugs ;-)

In PowerShell you always submit the cluster name yourself, so you don’t hit this issue. Perhaps the GUI drop-down list could translate the CNOs into the actual cluster names?
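
For example, a quick check from PowerShell using the cluster from this example (treat this as a sketch; the cmdlets come from the Failover Clustering and CAU modules):

# Connecting by the full cluster name or the FQDN works, because you type the
# name yourself instead of picking the truncated CNO from a list.
Get-Cluster -Name "xxxclussqlservers.test.lab"

# The CAU cmdlets take the cluster name the same way, e.g. scanning for updates:
Invoke-CauScan -ClusterName "XXXCLUSSQLSERVERS" -Verbose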

Active-Active File sharing with SMB 2.2 Scale Out in Windows 8 Rocks

Introduction

Wow. That’s what I have to say. WOW! I configured a two-node virtual machine cluster running Windows 8 Server Developer Preview to test the SMB 2.2 Scale-Out functionality and I’m smiling. In my previous blog, Transparent Failover & Node Fault Tolerance With SMB 2.2 Tested, I already tested transparent failover with a more traditional active-passive file cluster, and that was pretty neat. But there are two things to note:

  1. The most important one to me is that the experience with transparent failover isn’t as fluid for the end user as it should be, in my opinion. That freeze is a bit too long to be comfortable. Whether that will change remains to be seen. It’s early days yet.
  2. The entire active-passive concept doesn’t scale very well, to put it mildly. Whether this is important to you depends on your needs. Today one beefy, well-configured server can serve up a massive amount of data to a large number of users. So in a lot of environments this might not be an issue at all (it’s OK not to be running a 300,000-user global file server infrastructure, really ;-)).

So bring in “File Server For Scale-Out Application Data”, which is an active/active cluster. This is intended for use by applications like SQL Server and Hyper-V, for example. It’s high-speed, low-drag, highly available file sharing based on SMB 2.2, Cluster Shared Volumes and failover clustering. The thing is, at this moment it is not aimed at end-user file sharing (hence its name, “File Server For Scale-Out Application Data”). When I heard that, I was going “come on Microsoft, get this thing going for end-user data as well”. Now that I have tested this in the lab, I want this even more, because the experience is much more fluid. So I have to ask Microsoft to please get this setup supported in a production environment for all file sharing purposes! This is so awesome as an experience for both applications AND end users. The other approach that would work (except perhaps for scaling) is making transparent failover for an active-passive file cluster more fluid. But again, early days yet.

Setting Up The Lab

Build a “File Server for scale-out application data” cluster

You need three virtual machines running Windows 8 Server: two to build the cluster and one to use as a client. Once you have the cluster, you configure storage to be used as a Cluster Shared Volume (CSV).
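
For reference, building the two-node cluster itself can also be scripted; a minimal sketch, where the cluster name, node names and static address are just examples:

# A sketch only: cluster name, node names and the static address are examples.
Import-Module FailoverClusters
New-Cluster -Name "ScaleOutLab" -Node "Node1", "Node2" -StaticAddress "192.168.1.50"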

image

You’ll see the progress bar adding the storage to CSV

image

And voilà, you have CSV storage configured. Note that you no longer have to enable CSV separately and that there are no more warnings that this is only supported for Hyper-V data.

image
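
If you’d rather do this step from PowerShell, a minimal sketch (the disk name “Cluster Disk 1” is an example; check Get-ClusterResource for the actual one):

# Add an available cluster disk to Cluster Shared Volumes.
Add-ClusterSharedVolume -Name "Cluster Disk 1"

# Verify the CSV is online; its mount point appears under C:\ClusterStorage.
Get-ClusterSharedVolume | Format-List Name, State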

Now navigate to Roles, right-click and select “Configure Roles”

image

This brings up the High Availability Wizard

image

Click Next and select “File Server for scale-out application data”

image

Give the Client Access Point a name

image

Click Next and on the following wizard page click confirm

image

And voilà, you’re done. Do notice that the wizard skips the “Configure High Availability” step here.

image
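
The PowerShell equivalent of this wizard is a single cmdlet. As a sketch, using “ScalingOut” for the client access point name as in the rest of this walkthrough:

# Create the Scale-Out File Server role; -Name becomes the client access point.
Add-ClusterScaleOutFileServerRole -Name "ScalingOut"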

Get a share up and running for use

Don’t make the mistake of trying to double-click on what you see listed under Roles. Go to the node that owns the role, navigate to the role “ScaleOut”, right-click it and select “Add Shared Folder”.

image

Select the cluster shared volume on the server “ScalingOut” which is actually the client access point.

image

I gave the share the name SOFS (Scale Out File Share)

image

I like Access-Based Enumeration, so I enable that in addition to “Enable continuous availability”, which is on by default.

image

Then you get to the permission settings. Here you have to make sure you set the share permissions to more than Read if you want to do some writing to the share. Nothing new here ;-)

image

After that you’re almost done. Confirm your settings & click Commit

image

Watch the wizard do its magic

image

And it’s all set up

image
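
The same share can also be created from PowerShell. A minimal sketch, assuming the CSV ended up under C:\ClusterStorage\Volume1 and using example accounts from a TESTLAB domain:

# Create a folder on the Cluster Shared Volume and share it through the
# Scale-Out File Server; paths, share name and accounts are examples.
New-Item -Path "C:\ClusterStorage\Volume1\Shares\SOFS" -ItemType Directory -Force | Out-Null

New-SmbShare -Name "SOFS" -Path "C:\ClusterStorage\Volume1\Shares\SOFS" `
    -ScopeName "ScalingOut" `
    -ContinuouslyAvailable $true `
    -FolderEnumerationMode AccessBased `
    -ChangeAccess "TESTLAB\Domain Users" -FullAccess "TESTLAB\Domain Admins"

Here -ScopeName ties the share to the ScalingOut client access point, -ContinuouslyAvailable corresponds to the “Enable continuous availability” checkbox and -FolderEnumerationMode AccessBased is the Access-Based Enumeration option.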

Play Time

We have a third virtual machine, “Independence”, running Windows 8 Server to use as a client. As you can see, we can easily navigate to the “server” via the access point.

image

And yes, that’s about all you have to do. You can see the ease of namespace management at work here.

Now let’s copy some data and turn off one of the cluster nodes, the one that owns the role for example…

image
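
To take the owning node down from PowerShell rather than through the GUI, a sketch like this will do (the node name is an example; Suspend-ClusterNode with -Drain is how it looks on current builds):

# See which node currently owns the roles and CSVs.
Get-ClusterGroup | Format-Table Name, OwnerNode, State

# Drain and pause the owning node while the copy is running ("Node1" is an example),
# or simply stop its cluster service / power it off for a harsher test.
Suspend-ClusterNode -Name "Node1" -Drain

# Bring it back afterwards.
Resume-ClusterNode -Name "Node1"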

I was copying the contents of the Windows 8 Server folder from Independence and failed over the node; the client did not notice anything. I then turned off the node holding the role and still the client only noticed a short delay (a couple of seconds at most). This was a completely transparent experience. I cannot stress enough how much I want this technology for my business customers. You can patch, repair and replace file server nodes at will at any given moment and no application or user has to notice a thing. People, this is Valhalla. This is the place where brave file server administrators who have served their customers well over the years, against all odds, have the right to go. They’ve earned this! Get this technology in their hands, and yes, even for end-user file data. Or at least make the transparent failover for user file sharing as fluid. Make it happen, Microsoft! And while I’m asking, will there ever be an SMB 2.2 installable client for Windows 7? In SP2, please?!

Learn more here by watching the sessions from the Build conference at http://www.buildwindows.com/Sessions

Noticed bugs

The shares don’t always show up in the Shares pane after a failover.

Conclusion

This is awesome, this is big, this is a game changer in the file serving business. Listen, file services are not dead, far from it. They weren’t very sexy and we didn’t get the holy grail of high availability for that role until now. I have seen the future and it looks great. Set up a lab, people, and play at will. Take down servers in any way imaginable and see your file activities survive without a hint of disruption. As long as you make sure that you have multiple nodes in the cluster, and that if these are virtual machines they always reside on different hosts in a failover cluster, it will take a total failure of the entire cluster to bring your file services down. So how do you like them apples?