Continuously available general purpose file shares & ReFSv3 provide highly available backup targets

Introduction

In our previous two blog posts on Veeam and SMB 3 we've seen how and when Veeam Backup & Replication can leverage SMB Multichannel and SMB Direct. See Veeam Backup & Replication leverages SMB Multichannel and Veeam Backup & Replication Preferred Subnet & SMB Multichannel. The benefits of this are more bandwidth, high availability, better throughput and, with RDMA, low latency and CPU offload. What's not to like, right? In a world where compute and network needs keep rising because storage capabilities (flash storage) keep pushing the limits, this is all very welcome.

We have also seen earlier that Veeam B & R 9.5 leverages ReFSv3 in Windows Server 2016. This provides clear and present benefits in regard to space efficiency and speed for many backup file related operations. Read Veeam Leads the way by leveraging ReFSv3 capabilities.

When it comes to ReFSv3 in Windows Server 2016 most of the focus has gone to solutions based around Storage Spaces Direct (S2D). That’s a great solution and it is the poster child use case of these technologies.

But what other options do you have out there to build efficient and effective highly available backup targets creatively, other than S2D? What if you would like to repurpose existing hardware to build those? Let's take a look together at how continuously available general purpose file shares & ReFSv3 provide highly available backup targets.

CSV, S2D, ReFSv3 & Archival Data

In Windows Server 2016, traditional shared storage (iSCSI, FC, shared SAS, shared RAID) with CSV is not recommended for use with ReFSv3. Why exactly isn't clear. The biggest impact you'll see is the performance difference when you are not writing to the owner node of the CSV. Even with a well-configured RDMA network that difference is significant. But that doesn't mean the performance is bad. It's just that many of the super-fast metadata operations are significantly slower relative to writes against the owner node; it's not that either of the two is slow.
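If you want to see that effect for yourself, you can check which node owns a CSV and move the ownership around while testing. A minimal PowerShell sketch, where the cluster disk and node names are placeholders:

Import-Module FailoverClusters
# Show which node currently owns each CSV
Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State
# Move a CSV to the node you are writing from to compare owner-node vs. redirected performance
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node "NODE-A"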


Microsoft does state that S2D with ReFSv3 and SOFS shares can be used for archival data. Storage Spaces and ReFSv3 also have the benefit of offering automatic repair of corrupt data from a redundant copy on the fly when needed. So yes, the best known supported scenario is this one.

Continuously available general purpose file shares and ReFSv3 provide highly available backup targets

But what if we need a highly available backup target and would love to leverage ReFSv3 with Veeam Backup & Replication 9.5? Well, you can have 95% of your cookie and eat it too. All this without ignoring the cautions offered.

We could set up SOFS shares on a Windows Server 2016 cluster with ReFSv3 on traditional shared storage. Some storage vendors actually state this is supported.

That only means you don't have the auto-repair functionality that ReFSv3 combined with Storage Spaces offers. But perhaps you want to avoid the risk of using ReFSv3 with CSV in a non-S2D scenario altogether. What you could do is forgo ReFSv3 and use NTFS. How well this holds up for archival data or backups is something you'll need to test and find out. There is not much info out there, only other cautions and warnings that might keep you up at night.

There is another scenario, however: using Windows Server 2016 failover clustering to set up continuously available general purpose file shares that leverage SMB 3 transparent failover.

The good news is that general purpose file shares (no CSV) do work consistently with ReFSv3 because such a share/LUN is only exposed on one cluster node at a time, the owner. By having multiple shares and setting preferred owners we can load balance the workload across all cluster nodes.
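To make this a bit more concrete, here is a rough PowerShell sketch of what such a setup could look like. All the names (nodes, cluster disk, file server role, share, account, IP address) are placeholders for illustration and you'd run this on the node that owns the disk; it's a sketch, not a turnkey script.

Import-Module FailoverClusters
# Format the clustered disk with ReFS for the backup repository volume
Format-Volume -DriveLetter R -FileSystem ReFS -NewFileSystemLabel "BackupRepo01"
# Create a general purpose file server role (not SOFS) on that disk
Add-ClusterFileServerRole -Name "FS-BACKUP01" -Storage "Cluster Disk 1" -StaticAddress 10.0.0.50
# Create the share with continuous availability so SMB 3 transparent failover kicks in
New-Item -Path "R:\BackupRepo01" -ItemType Directory
New-SmbShare -Name "BackupRepo01" -Path "R:\BackupRepo01" -ScopeName "FS-BACKUP01" -ContinuouslyAvailable $true -FullAccess "DOMAIN\VeeamSvc"
# Set preferred owners per file server role to spread the load across the cluster nodes
Set-ClusterOwnerNode -Group "FS-BACKUP01" -Owners "NODE-A", "NODE-B"

Repeat this per share/LUN, giving each file server role a different preferred owner order, and you have your load balancing across the nodes.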

Thanks to continuous availability for general purpose file shares and SMB 3 transparent failover we can still get a highly available backup target this way. The failover is fast enough to make this happen and all we see in Veeam Backup & Replication is a short pause in throughput before it resumes after the failover. To put the icing on the cake, you can leverage SMB Multichannel and SMB Direct for both backups and restores.
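If you want to verify that SMB Multichannel and SMB Direct are actually being used during a backup or restore, a quick check from the SMB client side (the Veeam gateway/repository server in this case) could look like this; it's just a sketch using the standard SMB cmdlets:

# List the active SMB Multichannel connections to the file server
Get-SmbMultichannelConnection
# Show which client NICs are RDMA (SMB Direct) capable
Get-SmbClientNetworkInterface | Where-Object { $_.RdmaCapable -eq $true }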

It would take a sizeable whitepaper to walk through the setup, so instead I'll show you a quick video of a POC we did in the lab here: https://vimeo.com/212886392.


If you want to learn more, come to the community and other conferences I'm speaking at, where I'll also be around for Ask The Experts opportunities. I'll be at the German Hyper-V community meetup, the Cloud & Datacenter Conference in Germany 2017, Dell EMC World 2017 and last but not least VeeamON 2017 (see May 2017 will be a travelling month).

Conclusion

What do you lose?

Potentially there is one big loss in regard to the capabilities of ReFSv3 with this solution when you are not using Storage Spaces: you lose the ability to automatically repair corrupt data. The ability of ReFSv3 to do so is tied to the redundant copies (parity/mirror) that Storage Spaces provides.

What do you get?

That's fine; the strength of this design is that you get the speed and space efficiencies of ReFSv3 and highly available backup targets in way more scenarios than "just" S2D. After all, not everyone is in a position to choose their storage fabric for backup targets green field or at will. But they might be able to leverage existing storage and opt to use SMB 3 for their data transport.

So even if you can't have it all, you can still build very good solutions. This one offers the ReFSv3 benefits and high availability for your backup target via SMB transparent failover on continuously available general purpose file shares. It also only requires Windows Server 2016 Standard Edition, which is a cost saving, and you still get to leverage SMB Multichannel and SMB Direct. All this while not ignoring the cautions of using ReFSv3 in certain scenarios.

On top of that, if you use NTFS with this approach it will also work for Windows Server 2012 (R2) as the OS for the backup target cluster hosts.

Disclaimer

I do not work for or at Microsoft, nor am I perfect or infallible just because I'm an MVP. You'll have to do your own testing and validation. Based on our testing, and as long as ReFSv3 bugs don't ruin the show, this is to me a very valid and cost-effective approach.

Cluster Operating System Rolling Upgrade Leaves Traces

Introduction

When you perform a cluster OS rolling upgrade of a Windows Server 2012 R2 cluster to a Windows Server 2016 cluster you have two options.

1. You evict the nodes, one after the other, perform a clean OS install and join them to the existing cluster.

2. You do an in-place OS upgrade of the nodes (there is no need to evict the nodes, but you can if you want to). I tested this and blogged about it in In Place upgrades of cluster nodes to Windows Server 2016.

Both of these give you the benefit that you can keep your workloads (Hyper-V, SOFS, SQL Server) running and you don't have to create a new cluster to do so. The moment you have Windows Server 2016 nodes added to an existing Windows Server 2012 R2 cluster you are running in mixed mode. Until all your nodes have been upgraded to Windows Server 2016, the cluster will remain running in mixed mode.

Illustration showing the three stages of a cluster OS rolling upgrade: all nodes Windows Server 2012 R2, mixed-OS mode, and all nodes Windows Server 2016

When there are only Windows Server 2016 nodes left you can decide to also upgrade the cluster functional level. This enables all the new capabilities in Windows Server 2016 Failover Clustering and also means you cannot go back to a Windows Server 2012 R2 cluster anymore. So, only take this step after a final validation of all drivers and firmware, to make sure you don't need to go back and you're ready to fully commit to a fully functional Windows Server 2016 failover cluster.

A cluster operating system rolling upgrade does leave some traces, but that’s OK. Let’s take a look. 

This is what Get-Cluster against a Windows Server 2016 cluster that was upgraded from Windows Server 2012 R2 looks like.

[Screenshot: Get-Cluster output before Update-ClusterFunctionalLevel]

As you can see the cluster functional level is 8 and not 9. This means we have not yet run the Update-ClusterFunctionalLevel command on this cluster, which still allows us to roll back all the way to a cluster running only Windows Server 2012 R2 nodes. The ClusterUpgradeVersion has a value of 3.

We now execute the Update-ClusterFunctionalLevel command and take a look at Get-Cluster again.
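For reference, this is roughly what those steps look like in PowerShell when run from one of the Windows Server 2016 nodes:

Import-Module FailoverClusters
# Check the current level before committing
Get-Cluster | Format-List Name, ClusterFunctionalLevel, ClusterUpgradeVersion
# Commit the upgrade - there is no way back to Windows Server 2012 R2 after this
Update-ClusterFunctionalLevel
# Verify the result
Get-Cluster | Format-List Name, ClusterFunctionalLevel, ClusterUpgradeVersion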

[Screenshot: Get-Cluster output after Update-ClusterFunctionalLevel]

As you can see we are now at cluster functional level 9, which enables all the capabilities offered by Windows Server 2016 Failover Clustering. The ClusterUpgradeVersion is 8. That's the previous cluster functional level we were at before we executed Update-ClusterFunctionalLevel.

Note that both properties, ClusterFunctionalLevel and ClusterUpgradeVersion, are only available with Windows Server 2016. You will not find them on a Windows Server 2012 R2 or lower cluster. If you run this command from Windows Server 2016 against a Windows Server 2012 R2 cluster, both properties will be empty. If you run it on a Windows Server 2012 R2 host against a Windows Server 2012 R2 or lower, or even a Windows Server 2016 cluster, these properties are not even there. The cmdlet is older on those OS versions and doesn't know about these properties yet.

What about if you create a brand-new cluster, perhaps even on freshly installed Windows Server 2016 nodes? What does ClusterUpgradeVersion have as a value then? Well, it's also 8. In the end, there is no difference between an in-place upgraded Windows Server 2016 cluster and a cleanly created one. So where are those traces?

Cluster Operating System Rolling Upgrade Leaves Traces

What gives a rolling upgrade away is that in the registry, under HKLM\Cluster, the OS and OSVersion values are not updated (purple in the picture below). This is a benign artifact and I don't know whether this is on purpose or not. I have changed them to Windows Server 2016 Datacenter as an experiment and have not found any issues by doing so. Now, please don't take this as a recommendation to do so. The smartest and safest thing is to leave them alone. These are not used, so don't worry about them.

[Screenshot: OS, OSVersion and ClusterFunctionalLevel values under HKLM\Cluster on an upgraded cluster]

But even if you would change those values, a cluster resulting from a cluster operating system rolling upgrade still has other ways of telling it was not born as a Windows Server 2016 cluster.

Under HKLM\Cluster (and Cluster.0) you'll find the value ClusterFunctionalLevel, which does not exist on a cleanly installed Windows Server 2016 cluster (green in the picture above). As you can see this is a Windows Server 2016 cluster running at functional level 9.

There is even an extra key OperatingVersion under HKLM\Cluster that you will not find on a cleanly installed cluster either. It also has a Mixed Mode value under that key which indicates whether the cluster is still running in mixed mode or not.

[Screenshot: OperatingVersion key with its Mixed Mode value under HKLM\Cluster on an upgraded cluster]

Here is a screenshot of a newly installed/created Windows Server 2016 cluster. No ClusterFunctionalLevel value, the OS and OSVersion values are correct and there is no OperatingVersion key to be found.

[Screenshot: HKLM\Cluster registry values on a cleanly installed Windows Server 2016 cluster]
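If you want to check for these traces on your own clusters, a quick read-only sketch in PowerShell could look like this (run it on a cluster node and, as said, don't go changing anything in there):

# The OS and OSVersion values and, on upgraded clusters only, the ClusterFunctionalLevel value
Get-ItemProperty -Path HKLM:\Cluster | Select-Object OS, OSVersion, ClusterFunctionalLevel
# The OperatingVersion key (with its Mixed Mode value) only exists on clusters that went through a rolling upgrade
Test-Path -Path HKLM:\Cluster\OperatingVersion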

What if you don’t like traces?

First of all, these traces are harmless. One thing you can do if you want to weed out all traces of a rolling upgrade (as far as the cluster is concerned) is to destroy the cluster and create a new one with the same CNO (and IP address, if that was a fixed one). This might be a bit more involved when it comes to CSV naming and other existing resources, but then these remnants will be gone in a supported way. Of course, this defeats one of the main purposes of this feature: no downtime. The operating system itself might also contain traces if you did in-place OS upgrades, but the cluster will not.

Just adapting OS/OSVersion and ClusterFunctionalLevel or deleting the OperatingVersion key from HKLM\Cluster (and HKLM\Cluster.0) are not supported actions, and messing around in the cluster registry keys can lead to problems, so don't! The advice is to just leave it all alone. Microsoft developed cluster operating system rolling upgrade the way they did for a reason, and leaving things as Microsoft has set or left them makes sure you are always in a fully supported condition. So, use it if it fits the circumstances and you comply with all the prerequisites. Look at these traces as a badge of honor, not a smudge on your shining armor. When I see these artifacts, I see people who have used this feature to their benefit. Well done, I say.

Learn more about the Cluster OS Rolling Upgrade process

Next to my blogs like First experiences with a rolling cluster upgrade of a lab Hyper-V Cluster (Technical Preview) and In Place upgrades of cluster nodes to Windows Server 2016, there are many resources out there by fellow bloggers and Microsoft. A great video on the subject is Introducing Cluster OS Rolling Upgrades in Windows Server 2016 with Rob Hindman, who actually works on this feature and knows it inside out.

An important thing to keep in mind is that this can be automated using PowerShell or by leveraging SCVMM for orchestration, for example. Third-party tools could also support this and help you automate the process in order to scale it when needed.
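As a very rough sketch of what the per-node part of such automation could look like for the in-place upgrade variant (the node names are placeholders and the actual OS upgrade step is left out, so this is not a ready-to-run script):

Import-Module FailoverClusters
foreach ($node in "NODE-A", "NODE-B") {
    # Drain the workloads off the node and pause it
    Suspend-ClusterNode -Name $node -Drain
    # ... perform the clean install or in-place upgrade to Windows Server 2016 here ...
    # Bring the node back into the cluster and fail the workloads back
    Resume-ClusterNode -Name $node -Failback Immediate
}
# Only when all nodes run Windows Server 2016 and you are sure you won't roll back:
# Update-ClusterFunctionalLevel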

Finally, the official documentation can be found here: Cluster operating system rolling upgrade.

The exceptional value of a great technical community

There is tremendous value in being an active community member. You learn from other people's experiences, both their successes and their mistakes. They learn from you. All this at the cost of the time and effort you put in. This, by itself, is of great value.

There are moments when this value reaches a peak. It becomes so huge it cannot be dismissed by even the biggest cynic of a penny-pinching excuse for a manager. You see, one day bad things happen to even the nicest, most experienced and extremely competent people. That day, in the middle of the night, you reach out to your community. The message is basically "HELP!".

Guess what, the community, spread out across the globe over all time zones, answers that call. You get access to support and skills from your peers when you most need it. Even if you had to pay an hourly fee, it would still be a magnitude cheaper than many "premium" support schemes that, while very much needed for that vertical support, cannot match the depth and breadth of the community.

For sure, you don't have a piece of paper, an SLA, or an escalation manager. That might upset some people. But what you do get are hard-core skills, extra eyes and hands when you need them the most. That, ladies and gentlemen, is the exceptional value of a great technical community at work. Your backup when the system fails. Whoever has committed community experts as employees, partners or owners of a business indirectly has access to a global network of knowledge, talent, skills and experience. If you truly think people are the biggest capital you have, then these are the gems.

In Place upgrades of cluster nodes to Windows Server 2016

You will all have heard about rolling cluster upgrades from Windows Server 2012 R2 to Windows Server 2016 by now. The best and recommended practice is to do a clean install of any node you want to move to Windows Server 2016. However, an in-place upgrade does work. Actually, it works better than ever before. I'm not recommending this for production, but I did do a bunch just to see what the experience was like and whether it was consistent. I was actually pleasantly surprised and it saved me some time in the lab.

Today, if you want to you can upgrade your Windows Server 2012 R2 hosts in the cluster to Windows Server 2016.

The main things to watch out for are that all the VMs on that host have to be migrated to another node or be shut down.

You cannot have teamed NICs on the host. Most often these will be used for a vSwitch, so it's smart and prudent to note down the vSwitch (or vSwitches) name and remove them before removing the NIC team. After you've upgraded the node you can recreate the NIC team and the vSwitch(es).
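A rough sketch of that dance in PowerShell, with hypothetical team, vSwitch and NIC names (note everything down before you start deleting):

# Before the upgrade: note down what is there
Get-VMSwitch | Select-Object Name, NetAdapterInterfaceDescription
Get-NetLbfoTeam | Select-Object Name, Members
# Remove the vSwitch first, then the team
Remove-VMSwitch -Name "vSwitch-LAN" -Force
Remove-NetLbfoTeam -Name "Team-LAN"
# After the upgrade: recreate the team and put the vSwitch back on top of it
New-NetLbfoTeam -Name "Team-LAN" -TeamMembers "NIC1", "NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
New-VMSwitch -Name "vSwitch-LAN" -NetAdapterName "Team-LAN" -AllowManagementOS $true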

Note that you don’t even have to evict the node from the cluster anymore to perform the upgrade.


I have successfully upgraded 4 clusters this way. One was based on PC hardware but the other ones were:

  • DELL R610 2 node cluster with shared SAS storage (MD3200).
  • Dell R720 2 node cluster with Compellent SAN (and ancient 4Gbps Emulex and QLogic FC HBAs)
  • Dell R730 3 node cluster with Compellent SAN (8Gbps Emulex HBAs)

Naturally all these servers were rocking the most current firmware and drivers possible. After the upgrades I updated the NIC drivers (Mellanox, Intel) and the FC drivers (Emulex) to the vendors' latest supported versions. I also made sure they got all the available updates before moving on with these lab clusters.

Issues I noticed:

  • The most common issue I saw was that the Hyper-V Manager GUI was not functional and I could not connect to the host. The fix was easy: uninstall Hyper-V and re-install it. This requires a few reboots. Other than that it went incredibly well.
  • Another issue I've seen with upgrades was that the Netlogon service was set to manual, which caused various issues with authentication but is easily fixed (a quick fix is sketched below). This has also been reported here. Microsoft is aware of this bug and a fix is being worked on.
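For that last one, setting Netlogon back to automatic and starting it again is all it takes, for example:

# Check the Netlogon service after the upgrade and put it back to automatic if needed
Get-Service -Name Netlogon | Select-Object Name, Status, StartType
Set-Service -Name Netlogon -StartupType Automatic
Start-Service -Name Netlogon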

 
