Active-Active File Sharing with SMB 2.2 Scale-Out in Windows 8 Rocks

Introduction

Wow. That’s what I have to say. WOW! I configured a two-node virtual machine cluster running the Windows 8 Server Developer Preview to test the SMB 2.2 Scale-Out functionality, and I am smiling. In my previous blog post, Transparent Failover & Node Fault Tolerance With SMB 2.2 Tested, I already tested transparent failover with a more traditional active-passive file cluster, and that was pretty neat. But there are two things to note:

  1. The most important one to me is that the experience with transparent failover isn’t as fluid for the end user as it should be, in my opinion. That freeze is a bit too long to be comfortable. Whether that will change remains to be seen; it’s early days yet.
  2. The entire active-passive concept doesn’t scale very well, to put it mildly. Whether this matters to you depends on your needs. Today one beefy, well-configured server can serve up a massive amount of data to a large number of users, so in a lot of environments this might not be an issue at all (it’s OK not to be running a 300,000-user global file server infrastructure, really).

So bring in the “File Server For Scale-Out Application Data” role, which is an active/active cluster. This is intended for use by applications like SQL Server & Hyper-V, for example. It’s high-speed, low-drag, highly available file sharing based on SMB 2.2, Cluster Shared Volumes and failover clustering. The thing is, at this moment, it is not aimed at end-user file sharing (hence its name, “File Server For Scale-Out Application Data”). When I heard that, I was going “come on Microsoft, get this thing going for end-user data as well”. Now that I have tested this in the lab, I want it even more, because the experience is much more fluid. So I have to ask Microsoft to please get this setup supported in a production environment for all file sharing purposes! This is an awesome experience for both applications AND end users. The other approach that would work (except perhaps for scaling) is making the transparent failover for an active-passive file cluster more fluid. But again, early days yet.

Setting Up The Lab

Build a “File Server for scale-out application data” cluster

You need three virtual machines running Windows 8 Server: two to build the cluster and one to use as a client. Once you have the cluster, you configure storage to be used as a Cluster Shared Volume (CSV).


You’ll see the progress bar as the storage is added to CSV


And voilà, you have CSV storage configured. Note that you no longer have to enable CSV separately and that there are no more warnings that it is only supported for Hyper-V data.
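
If you’d rather script this than click, the same step can be done from PowerShell. A minimal sketch, assuming the FailoverClusters module and a disk resource named “Cluster Disk 1” (names from my lab, adjust to yours):

  # Load the failover clustering cmdlets and list the physical disk resources
  Import-Module FailoverClusters
  Get-ClusterResource | Where-Object { $_.ResourceType.Name -eq "Physical Disk" }
  # Add an available disk resource to Cluster Shared Volumes
  Add-ClusterSharedVolume -Name "Cluster Disk 1"
  # The volume should now appear under C:\ClusterStorage
  Get-ClusterSharedVolume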


Now navigate to Roles, right-click and select “Configure Roles”


This brings up the High Availability Wizard


Click Next and select “File Server for scale-out application data”


Give the Client Access Point a name


Click Next and on the following wizard page click Confirm


And voilà, you’re done. Do notice that the wizard skips the “Configure High Availability” step here.
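
For reference, this role can also be created from PowerShell. A hedged sketch, as the cmdlet appears in my lab build of the Developer Preview, with “ScaleOut” simply being the name I picked:

  # Create the scale-out file server role with a client access point "ScaleOut"
  Add-ClusterScaleOutFileServerRole -Name "ScaleOut"
  # Verify the role came online
  Get-ClusterGroup -Name "ScaleOut"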


Get a share up and running for use

Don’t make the mistake of trying to double-click what you see listed under Roles. Go to the node that owns the role, navigate to the role “ScaleOut”, right-click and select “Add Shared Folder”.


Select the Cluster Shared Volume on the server “ScalingOut”, which is actually the client access point.


I gave the share the name SOFS (Scale Out File Share)


I like Access-Based Enumeration, so I enable it next to “Enable continuous availability”, which is enabled by default.


Then you get to the permissions settings. Here you have to make sure you set the share permissions to more than read if you want to do some writing to the share. Nothing new here.


After that you’re almost done. Confirm your settings & click Commit


Watch the wizard do its magic


And it’s all set up
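
The wizard pages above map onto a single SMB cmdlet, too. A minimal sketch, assuming the SmbShare cmdlets from this build; the folder path, scope name and “DOMAIN\FileAdmins” account are placeholders from my lab:

  # Create the folder on the CSV first, then share it scoped to the client
  # access point, continuously available and with access-based enumeration
  New-Item -ItemType Directory -Path "C:\ClusterStorage\Volume1\Shares\SOFS" -Force
  New-SmbShare -Name "SOFS" -Path "C:\ClusterStorage\Volume1\Shares\SOFS" `
      -ScopeName "ScalingOut" -ContinuouslyAvailable $true `
      -FolderEnumerationMode AccessBased -FullAccess "DOMAIN\FileAdmins"
  Get-SmbShare -Name "SOFS"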


Play Time

We have a third node, “Independence”, running Windows 8 Server to use as a client. As you can see, we can easily navigate to the “server” via the access point.


And yes, that’s about all you have to do. You can see the ease of namespace management at work here.

Now let’s copy some data and turn off one of the cluster nodes, the one that owns the role, for example…


I was copying the content of the Windows 8 Server folder from Independence and failed over the role; the client did not notice anything. I then turned off the node holding the role, and still the client only noticed a short delay (a couple of seconds at most). This was a completely transparent experience. I cannot stress enough how much I want this technology for my business customers. You can patch, repair or replace file server nodes at will at any given moment, and no application or user has to notice a thing. People, this is Valhalla, the place where brave file server administrators who have served their customers well over the years, against all odds, have the right to go. They’ve earned this! Get this technology into their hands, and yes, even for end-user file data. Or at least make the transparent failover for end-user file sharing as fluid. Make it happen, Microsoft! And while I’m asking, will there ever be an installable SMB 2.2 client for Windows 7? In SP2, please?!
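
If you want to reproduce the test, this is roughly what I did, expressed as a PowerShell sketch (paths and node names are from my lab; stopping the cluster service on the owner is the scripted equivalent of turning that node off):

  # On the client: start a long-running copy against the client access point
  Copy-Item -Path "D:\Windows 8 Server\*" -Destination "\\ScalingOut\SOFS" -Recurse
  # Meanwhile, on a cluster node: see who owns the role, then take that node down
  Get-ClusterGroup -Name "ScaleOut" | Select-Object Name, OwnerNode, State
  Stop-ClusterNode -Name "Node1"
  # The copy on the client keeps running, with at most a delay of a few seconds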

Learn more here by watching the sessions from the Build conference at http://www.buildwindows.com/Sessions

Noticed bugs

The shares don’t always show up in the share pane after a failover.

Conclusion

This is awesome, this is big, this is a game changer in the file serving business. Listen, file services are not dead, far from it. The role was never very sexy, and we didn’t have the holy grail of high availability for it until now. I have seen the future and it looks great. Set up a lab, people, and play at will. Take down servers in any way imaginable and see your file activities survive without a hint of disruption. As long as you make sure that you have multiple nodes in the cluster, and that if these are virtual machines they always reside on different nodes of a failover cluster, it will take a total failure of the entire cluster to bring your file services down. So how do you like them apples?

Transparent Failover & Node Fault Tolerance With SMB 2.2 Tested

Transparent failover and node fault tolerance with SMB 2.2 in Windows 8 Server is something that caught my attention immediately. The entire effort in infrastructure has been to keep the plumbing as invisible & unnoticed as possible. In some areas we had great success, in others not so much. Planned & unplanned downtime of file servers has always been an issue, as there was always a shorter or longer outage, and any failover meant disconnecting & reconnecting, leading to all kinds of end-user problems and confusion. To them, the network is down. The same issues exist on the server side with apps depending on file shares, or servers like SQL Server that write backups to a remote share or read data from one. It often takes some kind of human intervention to correct the situation. No, not even third-party clustered file systems and active-active clustering software could achieve this. The SMB protocol prior to 2.2 did not allow for it.

So when one hears it is now a possibility, we want to test it! We throw some virtual machines on the test cluster and build a file cluster with Windows 8 Server, and we also have a third server running SMB 2.2 to act as a client. Open the Failover Cluster Manager, right-click Roles and choose to configure a role.


You’ll see the familiar wizard; click Next


And choose the file server role


Give the Client Access Point a name and add an IP address.


Add some storage


And voilà… after the confirmation we’re asked to configure high availability


This opens the New Share Wizard


…this is all pretty straightforward, so I’ll leave out the screenshots except for the most important one, where we explicitly uncheck “Enable continuous availability”, as we want to first run a test without it.


Continue through the wizard and voilà, you have a clustered file server with a Client Access Point as a single namespace. Please note that you connect to this using that single namespace. No need for \\ServerA\BizzShare & \\ServerB\BizzShare or going fancy with redundant DFS namespaces and the like.
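
The classic clustered file server built here can also be created in one line. A sketch assuming the FailoverClusters module; the role name, disk and IP address are examples to substitute with your own:

  # Create a traditional (active-passive) clustered file server role with a
  # client access point, storage and a static IP address
  Add-ClusterFileServerRole -Name "MyOldFileServer" `
      -Storage "Cluster Disk 2" -StaticAddress 10.0.0.42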

Remember, we still need to make this share continuously available, but let’s do some file copies and fail over the node first to see what this looks like without transparent failover. Select the role “Transparent”, right-click, choose “Move” and “Select Node”.


Choose an available node and click OK


As you can see, this looks rather familiar.
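
The same move can be scripted. A minimal sketch with my lab’s role and node names:

  # Move the file server role to another node (a planned failover)
  Move-ClusterGroup -Name "Transparent" -Node "Node2"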

Let’s make that share continuously available. Go to the share pane and double-click the share you want to configure.


You’ll see a progress dialog whilst information is retrieved …


…and then the share properties are presented. Most of it is familiar stuff, but we need the bottom section, “Settings”. Select the check box to make the share continuously available.
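
That check box corresponds to a share property you can also flip from PowerShell. A sketch assuming the SmbShare cmdlets in this build, with a placeholder share name:

  # Make an existing share continuously available (the same setting as the
  # "Enable continuous availability" check box)
  Set-SmbShare -Name "BizzShare" -ContinuouslyAvailable $true
  Get-SmbShare -Name "BizzShare" | Select-Object Name, ContinuouslyAvailable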


Now let’s try that file copy again whilst failing over the file server role to another node.


So there is no loss of data, no need for the client to reconnect, and you don’t have to retry, but you do have a freeze that lasts for about 20 seconds in my test lab. I hope this will still improve before RTM.

What we learned here is that we can have transparent (file share) failover with SMB 2.2 in a virtualized environment, and we can give it a Client Access Point name like “MyOldFileServer” so that users are not confused and don’t need to learn another UNC path. There are many options for keeping old namespaces around for end-user ease of use, but this is an extra ace up our sleeve. For now, planned (patching, server maintenance) or unplanned (crash) failover is a 20-second freeze experience as the file share fails over. This freeze is probably due to the active-passive clustering. For now, active-active is not recommended/supported for file sharing in an end-user scenario. I think they are “worried” about huge file shares with a zillion metadata updates to sync. But this is supported for apps like Hyper-V, SQL Server backups or apps needing file data, etc. I’m going to try that next, and for user data as well. Things might change before RTM, and with multichannel, RDMA, 10Gbps and NIC teaming in the OS, perhaps that active-active scenario might become feasible for user file data? PLEASE? Otherwise, here’s another request for “Windows 8 Server R2”.

The secret sauce is in:

  • SMB 2.2 on both client & server
  • Resume Key
  • The SMB 2.2 Witness service, which is set to running when you make a share continuously available (see the quick check below).
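
A quick way to check that last one yourself; a hedged sketch that matches on the display name rather than guessing the exact service name:

  # Confirm the SMB witness service is running on the cluster nodes
  Get-Service -DisplayName "*witness*" | Select-Object Name, DisplayName, Status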


Go watch the sessions from the Build conference to hear more on all this; the work they’ve put into this, plus some of the complexities, is quite amazing: http://www.buildwindows.com/Sessions

Things to find out: how to rename a Client Access Point or how to delete it. Adding a new one is easy.

Warning: it’s September 23rd, 2011, and the Developer Preview is a little rough around the edges. Don’t run this on anything you need to get your bills paid yet.

Exchange 2010 DAG Issue: Cluster IP address resource ‘Cluster IP Address’ cannot be brought online

Today I was called upon to investigate an issue with an Exchange 2010 Database Availability Group that had serious backup issues: Symantec Backup Exec was not working. As it turned out, while the DAG was still providing mail services and clients did not notice anything, the underlying Windows cluster service had an issue. The cluster resource could not be brought online; instead we got an error:

“Cluster IP address resource ‘Cluster IP Address’ cannot be brought online because the cluster network ‘Cluster Network 1’ is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.”

I have been dealing with Windows 2008 (R2) clusters since the betas and had seen some causes of this, so I started to check the cluster & Exchange DAG configuration. Nothing was wrong, not a single thing. Weird. I had seen such weird behavior once before with a Hyper-V R2 cluster. There I fixed it by disabling and enabling the NICs on the nodes that were having the issue, thus resetting the network. If you don’t have DRAC/ILO or KVM-over-IP access, you can temporarily allow client access via another cluster network, or you’ll need physical access to the server console.
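
For the record, that NIC reset can be scripted too; a sketch with a hypothetical adapter name (“LAN”) that you’d replace with yours, run from an elevated prompt:

  # Disable and re-enable a NIC to reset the network
  netsh interface set interface name="LAN" admin=disabled
  netsh interface set interface name="LAN" admin=enabled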

In the event viewer I found some more errors:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          6/18/2010 2:02:41 PM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      node1.company.com
Description: Cluster resource ‘IPv4 DHCP Address 1 (Cluster Group)’ in clustered service or application ‘Cluster Group’ failed.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          6/18/2010 1:54:47 PM
Event ID:      1223
Task Category: IP Address Resource
Level:         Error
Keywords:     
User:          SYSTEM
Computer:     node1.company.com
Description: Cluster IP address resource ‘Cluster IP Address’ cannot be brought online because the cluster network ‘Cluster Network 1’ is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          6/18/2010 1:54:47 PM
Event ID:      1223
Task Category: IP Address Resource
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      node1.company.com
Description: Cluster IP address resource ‘IPv4 DHCP Address 1 (Cluster Group)’ cannot be brought online because the cluster network ‘Cluster Network 3’ is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.

So these cluster networks (it’s a geographically dispersed cluster with routed subnets) are indicating they do not have “Allow clients to connect through this network” set. Well, I checked, and they did! Both “Allow cluster network communications on this network” and “Allow clients to connect through this network” are enabled.

Weird, OK, but as mentioned, I’ve encountered something similar before. In this case I did not want to just disable/enable those NICs. The DAG was functioning fine and providing services to clients, so I did not want to cause any interruption or failover now that the cluster was having an issue.

So before going any further I did a search and almost within a minute I found following TechNet blog post: Cluster Core Resources fail to come online on some Exchange 2010 Database Availability Group (DAG) nodes (http://blogs.technet.com/b/timmcmic/archive/2010/05/12/cluster-core-resources-fail-to-come-online-on-some-exchange-2010-database-availability-group-dag-nodes.aspx)

Well, well, the issue is known to Microsoft and they offer three fixes, which are actually one fix that can be applied using the Failover Cluster Manager GUI, cluster.exe or PowerShell: simply disable and re-enable “Allow clients to connect through this network” on the affected cluster network. The long-term fix will be included in Exchange 2010 SP1. The workaround works immediately, and their Backup Exec started functioning again. They’ll just have to keep an eye on this issue until the permanent fix arrives with SP1.
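
For reference, this is what the PowerShell variant of that workaround looks like, using the cluster network name from the errors above. On a cluster network, Role 1 means cluster communication only and Role 3 means cluster and client traffic:

  # Toggle client access off and back on for the affected cluster network
  Import-Module FailoverClusters
  $net = Get-ClusterNetwork -Name "Cluster Network 1"
  $net.Role = 1   # cluster communication only (client access off)
  $net.Role = 3   # cluster and client (client access back on)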

SCVMM 2008 R2 Phantom VM guests after Blue Screen

UPDATE: Microsoft posted a SQL clean-up script to deal with this issue. It’s not exactly a fix, and let’s hope it gets integrated into SCVMM vNext 🙂 Look at the script here: http://blogs.technet.com/b/m2/archive/2010/04/16/removing-missing-vms-from-the-vmm-administrator-console.aspx. There is a link to this and another related blog post in the newsgroup link at the bottom of this article as well.

I’ve seen an annoying hiccup in SCVMM 2008 R2 (November 2009) in combination with Hyper-V R2 live migration two times now. In both cases a blue screen (due to the “Nehalem” bug http://support.microsoft.com/kb/975530) was the cause. Basically, when a node in the Hyper-V cluster blue screens, you can end up with some (I’ve never seen all) VMs on that node being in a failed/missing state. The VMs, however, did fail over to another node and are actually running happily. They will even fail back to the original node without an issue. So, as a matter of fact, all things are up and running. Basically you have a running VM and a phantom one; there are just multiple entries in different states for the same VM. Refreshing SCVMM doesn’t help, and a repair of the VM does not work.

While it isn’t a showstopper, it is very annoying and confusing to see a VM guest in a missing state, especially since the VM is actually up and running. You’re just seeing a phantom entry. However, be careful when deleting the phantom VM, as you’ll throw away the running VM as well; they point to the same files.

Removing the failed/orphaned VM in SCVMM is a no-go when you use shared storage such as CSV, as it points to the same files as the running VM and is visible to both the good VM’s node and the phantom’s, meaning it will ruin your good VM as well.

Snooping around in the SCVMM database tables revealed multiple VMs with the same name but with separate GUIDs. In production it’s really a NO GO to mess around with those records, not even as a last resort, because we don’t know enough about the database schema and its dependencies. So I have found two workarounds that do work (I’ve used them both); a sketch for spotting the phantom entries from the VMM PowerShell shell follows the list.

  1. Export the good VM for safekeeping, delete the missing/orphaned VM entry in SCVMM (this takes the good one with it, hence the export) and import the exported VM again. This means downtime for the VM guest.
  2. Remove the Hyper-V cluster from VMM and re-add it. This has the benefit that it causes no downtime for the good VM, and the bad/orphaned one is gone.
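
As promised, here is a way to spot the duplicate entries before you pick a workaround; a hedged sketch, as I recall the VMM 2008 R2 snap-in and object properties, so verify the names in your own VMM shell:

  # From the VMM administrator PowerShell shell: list VM entries grouped by
  # name so duplicates (the running VM plus its phantom) stand out
  Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager
  Get-VMMServer -ComputerName "vmmserver" | Out-Null
  Get-VM | Group-Object Name | Where-Object { $_.Count -gt 1 } |
      ForEach-Object { $_.Group | Select-Object Name, ID, Status }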

Searching the net didn’t reveal much info, but I did find this thread that discusses the issue as well: http://social.technet.microsoft.com/Forums/en-US/virtualmachinemanager/thread/1ea739ec-306c-4036-9a5d-ecce22a7ab85 and this one: http://social.technet.microsoft.com/Forums/en/virtualmachinemgrclustering/thread/a3b7a8d0-28dd-406a-8ccb-cf0cd613f666

I’ve also contacted some Hyper-V people about this, but it’s a rare and not well-known issue. I’ll post more on this when I find out.