Cluster Validation Failure while setting up a Windows 2012 Continuous Available File Share: The password does not meet the password policy requirements

We were installing a Windows Server 2012 cluster in a W2K8R2 domain and while we were checking out our work by running the cluster validation we got one warning we’ve never seen before:

Validate CSV Settings

Description: Validate that settings and configuration required by Cluster Shared Volumes are present. This test can only be run with an administrative account, and it only tests servers that are cluster nodes.

Start: 9/24/2012 5:01:18 PM.

Validating Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT), and connecting with the user account associated with validation.

Begin Cluster Shared Volumes support testing on node server1.test.lab.

Failure while setting up to run Cluster Shared Volumes support testing on node server1.test.lab: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements.

Begin Cluster Shared Volumes support testing on node server2.test.lab.

Failure while setting up to run Cluster Shared Volumes support testing on node server2.test.lab: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements.

This test requires more than one node. If your cluster contains more than one node, please run validation tests again with more than one node specified.

Now as it turns out this Active Directory domain does enforce some lengthy and complex passwords. By this they are basically driving the admins to use pass sentences which are lot more secure. That also means that the account we are using to run the validation have adequate lengths & complexities.

So, what if we tune down the password length requirements and than run GPUDATE from an elevated command prompt on all nodes of the cluster? Bingo! The cluster valid now passes with flying colors.

I’m guessing that perhaps the local doesn’t have a strong enough password to meet the requirements. But this is just guessing. This is the account that is involved in reducing the clusters dependency on Active Directory so that CSV for example can come on line even if there is not domain controller to contact. Hence my guess that this is related. This did not happen in a lab environment so I’m not going to change the password on all nodes to a more complex one. That is for a lab Smile

image

Continuously Available File Shares Don’t Support Short File Names – “The request is not supported” & “CA failure – Failed to set continuously available property on a new or existing file share as Resume Key filter is not started.”

If you ever get the following error while trying to create a Continuously Available File Share in Windows Server 2012  “The request is not supported”

If on top you find this entry in the Microsoft-Windows-SmbServer/Operational event log:

Log Name:      Microsoft-Windows-SmbServer/Operational
Source:        Microsoft-Windows-SmbServer
Date:          24/09/2012 17:56:59
Event ID:      1801
Task Category: (1801)
Level:         Error
Keywords:      (8)
User:          SYSTEM
Computer:      server1.lab.test
Description:
CA failure – Failed to set continuously available property on a new or existing file share as Resume Key filter is not started.

First of all check  with fsutil if you have short file names enabled on the volumes on which you are trying to create the continuous available file share:

  • Log on to the node running the File role and open a elevated command prompt to run the following on the volume/partition in play, F: in this example.

fsutil 8dot3name query F:
The volume state is: 0 (8dot3 name creation is enabled).
The registry state is: 2 (Per volume setting – the default).
Based on the above two settings, 8dot3 name creation is enabled on F:

  • I chose to enable or disable short file names per volume

fsutil 8dot3name set 2
The registry state is now: 2 (Per volume setting – the default).

  • Disable short file names on the volume at hand

fsutil 8dot3name set f: 1
Successfully disabled 8dot3name generation on f:

  • Remove any short file names present on this volume

fsutil 8dot3name strip f:
Scanning registry…
Total affected registry keys:                   0
Stripping 8dot3 names…
Total files and directories scanned:            6
Total 8dot3 names found:                        3
Total 8dot3 names stripped:                     3
For details on the operations performed please see the log:
“C:UsersUSER~1AppDataLocalTemp28dot3_removal_log @(GMT 2012-09-24 18-40-05).log”

  • Now, move the role over to the next node to rinse & repeat

fsutil 8dot3name set 2
The registry state is now: 2 (Per volume setting – the default).

fsutil 8dot3name set f: 1
Successfully disabled 8dot3name generation on f:

fsutil 8dot3name query f:
The volume state is: 1 (8dot3 name creation is disabled).
The registry state is: 2 (Per volume setting – the default).
Based on the above two settings, 8dot3 name creation is disabled on f:

fsutil 8dot3name strip f:
Scanning registry…
Total affected registry keys:                   0
Stripping 8dot3 names…
Total files and directories scanned:            6
Total 8dot3 names found:                        0
Total 8dot3 names stripped:                     0
For details on the operations performed please see the log:
“C:UsersUSER~1AppDataLocalTemp38dot3_removal_log @(GMT 2012-09-24 18-44-36).log”

I know this now because I hit the wall on this one and Claus Joergensen at Microsoft turned me to the solution. He actually blogged about this as well, but I never really registered this until today.

Disable 8.3 name generation

SMB Transparent Failover does not support cluster disks with 8.3 name generation enabled. In Windows Server 2012 8.3 name generation is disabled by default on any data volumes created. However, if you import volumes created on down-level versions of Windows or by accident create the volume with 8.3 name generation enabled, SMB Transparent Failover will not work. An event will be logged in (Applications and Services Log – Microsoft – Windows – ResumeKeyFilter – Operational) notifying that it failed to attach to the volume because 8.3 name generation is enabled.

You can use fsutil to query and setting the state of 8.3 name generation system-wide and on individual volumes. You can also use fsutil to remove previously generated short names from a volume.

There’s also a little note here http://support.microsoft.com/kb/2709568

SMB Transparent Failover

Both the SMB client and SMB server must support SMB 3.0 to take advantage of the SMB Transparent Failover functionality.
SMB 1.0- and SMB 2.x-capable clients will be able to connect to, and access, shares that are configured to use the Continuously Available property. However, SMB 1.0 and SMB 2.x clients will not benefit from the SMB Transparent Failover feature. If the currently accessed cluster node becomes unavailable, or if the administrator makes administrative changes to the clustered file server, the SMB 1.0 or SMB 2.x client will lose the active SMB session and any open handles to the clustered file server. The user or application on the SMB client computer must take corrective action to reestablish connectivity to the clustered file share.
Note SMB Transparent Failover is incompatible with volumes enabled for short file name (8.3 file name) support or with compressed files (such as NTFS-compressed files).

Frankly, all my testing of Continuous available share, from the BUILD conference till RTM setups have been green field, meaning squeaky clean, brand new LUNs. So this time, in real live with a LUN that has a history in a Windows 2008 R2 environment I got bitten.

So, read, read and than read some more Smile is my advise and be grateful for the help of patient and knowledgeable people.

Anyway, It’s full steam ahead here once again getting the most out of our Software Assurance by leveraging everything we can out of Windows Server 2012.

Disk to Disk Backup Solution with Windows Server 2012 & Commodity DELL Hardware – Part II

As I blogged in a previous post we’ve been building a Disk2Disk based backup solution with commodity hardware as all the appliances on the market are either to small in capacity for our needs, ridiculously expensive or sometimes just suck or a combination of the above (Virtual Library Systems or Virtual Tape Libraries come to mind, one of my biggest technology mistakes ever, at least the ones I had and in my humble opinion Disappointed smile) .

Here’s a logical drawing of what we’re talking about. We are using just two backup media agent building blocks (server + storage)  in our setup for now so we can scale out.

image

Now in future post I hope to be discussing storage spaces & Windows deduplication thrown into the mix.

So what do we get?

Not to shabby …  > 1TB/Hour

image

To great …

image

In close up you are seeing just 2 Windows 2012 Hyper-V cluster nodes, each being backed up over a native LBFO team of 2*1Gbps NIC ports to one Windows Server 2012 Backup Media Agent with a 10Gbps pipe. Look at the max throughput we got  …

image

Sure this is under optimal conditions, but guess what? When doing backup from multiple hosts to dual backup media servers or more we’re getting very fast backups at very low cost compared to some other solutions out there. This is our backup beast Smile. More bandwidth needed at the backup media server? It has dual port 10Gbps that can be teamed and/or leverage SMB 3.0 multichannel. High volume hosts can use 10Gbps at the source as well.

Lessons learned

  • The Windows 2012 networking improvements rock. Upgrade and benefit from it! We’re seeing great results thanks to Multichannel leveraging RSS and in box NIC teaming (LBFO).
  • A couple of 1Gbps NICS teamed on Windows Server 2012 work really well. Don’t worry about not having 10Gbps on all your hosts.
  • Having 10Gbps on your backup media hosts (target) is great as you’ll be pushing a lot of data to them from multiple (source) hosts.
  • Make sure your backup software supports enough streams before it keels over under the load you’re pushing through. More streams means more concurrent files (read VHDs/VMs) and thus more throughput and allows multichannel to shine over RSS capable NICs.
  • Find the sweet sport for number of disks per node and total IOPS versus the throughput you can send to the backup media agents. 4 Nodes of 50TB might be better than 2 nodes of a 100TB. If you can, experiment a bit to find your optimal backup block size.
  • Isolate your backup network traffic from data traffic either physically or by other means (QOS) and don’t route it all over the place to end up where it needs to be.
  • We’re doing this using Dell PowerConnect 5424 (end of life) /5524 switches … no need for the real  high end very expensive gear to do this. The 10Gbps switch, well yes that’s always high end at the moment.
  • Use JBODS with SAS/Storage spaces & you’ll be fine. Select them carefully for performance. You can use bays like the MD3X00 if you want to replicates the backups somewhere otherwise MD12x0 will do or any other decent JBOD => even cheaper. You can also mix, some building blocks that can replicate & other on Storage Spaces /JBOS. Mix and match with different backup needs means you have flexibility. Note that at TechEd Europe (June 2012), in a session by DELL, they mentioned the need for a firmware update with the MD1200 to optimize performance with Storage Spaces.

It’s all about the money in a smart way!

As I said before, you will not get fired for:

  • Increasing backup throughput at least 4 fold (without dedupe)
  • Increasing backup capacity 3.5 fold (without deduplication)
  • Doing the above for 20% of systems that are replaced & new offerings with specialized appliances (even at hilarious discount rates). That’s CAPEX reduction.
  • This helps pay for the primary storage, DRC site & extra SAN for data replication in case of disaster
  • Make backups faster, more reliable & reduce OPEX (The difference for us is huge and important)
  • Putting an affordable scale up & scale out Disk2Disk backup solution into place to the business can safely handle future backup loads as very acceptable costs.
  • It’s a modular solution which we like. On top of that it’s about as zero vendor lock in as it gets. You can mix servers, bays, switches. Use what you like best for your needs. Only the bays have to remain the same within an individual “building block”.

Cost reduction is one thing but look at what we get whilst saving money… wow!

What am I missing?  Specialized dedupe. Yes, but we’re  going for the poor mans workaround there. More on that later.  As long as we get enough throughput that doesn’t hurt us. And give the cost we cannot justify it + it’s way to much vendor lock in. If you can’t get backups done without, sure you might need to walk that route. But for now I’m using a different path. Our money is better spend in different places.

Now how to get the same economic improvements from the backup software? Disk capacity licensing sucks. But we need a solution that can handle this throughput & load , is reliable, has great support & product information, get’s support for new release fast after RTM (come on CommVault, get a move on) and is simple to use ==> even more money and time saved.

Spin off huge file server project?

Why is support for new releases in backup software important. Because the lack of it is causing me delays. Delays cost me, time, money & opportunities. I’m really interested to covert our large LUN file servers to Windows Server 2012 Hyper-V virtual machines, which I now can rather smoothly thanks to those big VDHX sizes that are possible now and slash the backup times of those millions of small files to pieces by backing them up per VHDX over this setup. But I’ll have to wait and see when CommVault will support VHDX files and GPT disks in guests because they are not moving as fast as a leading (and high cost) product should. Altaro & Veeam have them beaten solid there and I don’t like to be held back.

Trouble Shooting Windows Server 2012 host based CommVault Backups with DELL Compellent hardware VSS provider of Hyper-V guests: ‘Microsoft Hyper-V VSS Writer’ State: [5] Waiting for completion

We have been running CommVault Simpana 9.0 R2 SP7 in combination with the DELL Compellent Hardware VSS provider to do host based backups of the virtual machines on our Windows Server 2012 Hyper-V clusters host with great success and speed.

We’ve run into two issues so far. One, I blogged about in DELL Compellent Hardware VSS Provider & Commvault on Windows Server 2012 Hyper-V nodes – Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0×80070005, Access is denied was an due to some missing permissions for the domain account we configured the Compellent Replay manager Service to run with. The solution for that issue can be found in that same blog post.

The other one was that sometimes during the backup of a Hyper-V host we got an error from CommVault that put the job in a “pending” status, kept trying and failing. The error is:

Error Code: [91:9], Description: Volume Shadow Copy Service (VSS) error. VSS service or writers may be in a bad state. Please check vsbkp.log and Windows Event Viewer for VSS related messages. Or run vssadmin list writers from command prompt to check state of the VSS writers.

clip_image001

When we look at the Compellent controller we see the following things happen:

  • The snapshots get made
  • They are mounted briefly and then dismounted.
  • They are deleted

The result at the CommVault end is that the job goes into a pending state with the above error. When we look at the state of the Microsoft Hyper-V VSS Writer by running “vssadmin list writer” …

image

… from an elevated command prompt we see:

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Retryable error

Note at this stage:

  1. Resuming the job doesn’t help (it actually keep trying by itself but no joy).
  2. Killing the job and restarting brings no joy. On top of that our friendly error “Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0×80070005, Access is denied.“ is back, but this time related to the error state of the ‘Microsoft Hyper-V VSS Writer’. The error now has changed a little and has become:

clip_image002

 

 

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Unexpected error

To get rid of this one we can restart the host or, less drastic, restart the Hyper-V Virtual Machine management Service (VMMS.exe) which will do the trick as well.  Before you do this , drain the node when you pause it, then resume it with the option failing back the roles. Windows 2012 makes it a breeze to do this without service interruption Smile

image

clip_image003

image

The Cause: Almost or completely full partitions inside the virtual machines

Looking for solutions when CommVault is involved can be tedious as their consultancy driven sales model isn’t focused on making information widely available. Trouble shooting VSS issues can also be considered a form of black art at times. Since this is Windows 2012 RTM an the date is September 20th 2012 as the moment of writing, there are not yet any hotfixes related to host level backups of Virtual machines and such. CommVault Simpana 9.0 R2 SP7 is also fully patched.

This,combined with the fact that we did not see anything like this during testing (and we did a fair amount) makes us look at the guests. That’s the big difference on a large production cluster. All those unique guests with their own history. We also know from the past years with VSS snapshots in Windows 2008(R2) that these tend to fail due to issues in the guests. Take a peak at Troubleshoot VSS issues that occur with Windows Server Backup (WBADMIN) in Windows Server 2008 and Windows Server 2008 R2 just for starters  As an example we already had seen one guest (dev/test server) that had 5 user logged in doing all kinds of reconfigurations and installs go into save mode during a backup, so it could be due to something rotten in certain guests. There is very much to consider when doing these kinds of backups.

By doing some comparing of successful & failed backups it really looks as if it was related to certain virtual machines. A lot of issues are caused by the VSS service, not running or not being able to do snapshots because of lack of space so perhaps this was the case here as well?

We poked around a bit. First let’s see what we can find in the Hyper-V specific logs like the Microsoft-Windows-Hyper-V-VMMS-Admin event log. Ah lot’s of errors relating to a number of guests!

image

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          19/09/2012 22:14:37
Event ID:      10102
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      undisclosed server
Description:
Failed to create the volume shadow copy inside of virtual machine ‘undisclosedserver’. (Virtual machine ID 84521EG0G-8B7A-54ED-2F24-392A1761ED11)

Well people, that is called a clue Winking smile. So we did some Live Migration to isolate suspect VMs to a single node, run backups, see them fail, do the the same with a new and clean VM an it all works. and indeed … looking at the guest involved when the CommVault backup fails we that the VSS service is running and healthy but we do see all kind of badness related to disk space:

  • Large SQL Server backup files put aside on the system partition or or other disks
  • Application & service pack installers left behind,
  • Log and tempdb volumes running out of space.
  • Application Logs running out of control

That later one left 0MB of disk space on the system (Test Controller TFS shitting itself), but we managed to clear just enough to get to just over 1GB of free space which was enough to make the backup succeed.

clip_image001[8]

image

Servers, virtual or physical ones, should to be locked down to prevent such abuse. I know, I know. Did I already tell you I do not reside in a perfect world? We cannot protect against dev and test server admins who act without much care on their servers. We’ll just keep hammering at it to raise their awareness I guess. For end users and production servers we monitor those well enough to proactively avoid issues. With dev & test servers we don’t do so, or the response team would have a day’s work reacting to all alerts that daily dev & test usage on those servers generate.

The fix

  • Clear at least 1GB or a bit more inside each partition in the guest running on the host that has a failing backup. I prefer to have at least a couple of GB free  (10% to 15% => give yourself some head room people).
  • Then you can resume the backup job manually or let CommVault do that for you if it’s still in a pending state.
  • If you’ve killed the job make sure you restore the
  • Microsoft Hyper-V VSS Writer  to a healthy state as described above. Thanks to Live Migration this can be achieved without any down time.

Conclusion

There is experimenting, testing, production testing, production and finally real life environments where not all is done as it should be. Yes, really the world isn’t perfect. Managers sometimes think it’s click, click, Next, click and voila we’ve got a complex multisite system running. Well it isn’t like that and you need some time and skills to make it all work. Yes even in todays “cheap, fast, easy to run your business form your smartphone”  ecosystem of the private, hybrid and public cloud, where all is bliss and world peace reigns.

The DELL Compellent Hardware VSS provider & replay manager service handle all this without missing a beat, which is very comforting. As previous experiences with hardware VSS provides of other vendors make us think that these would probably have blown up by now.