I have blogged about Virtual Machine Resiliency in Windows Server 2016 Failover Clustering before, in "Testing Virtual Machine Compute Resiliency in Windows Server 2016".
Those tests and demos were done with block-level storage: CSV on Fibre Channel, iSCSI or shared SAS. Today we'll look at the experience when you're running your VMs on a continuously available file share on a Scale Out File Server (SOFS). This configuration offers the best possible experience.
Why? Well, when a cluster node is in Isolated mode, this has no impact on the SOFS share, as that is a resource external to the Hyper-V cluster. In other words, it remains online. This means that the VMs keep running, even though they have lost their high availability for as long as the node is Isolated. After all, there is nothing wrong with Hyper-V itself. With block-level CSV storage you lose access to the storage, as that is a cluster resource and the node got isolated. That's why the VMs go into a paused-critical state during a transient failure with block-level storage, but they don't when you're using SOFS.
The virtual machine compute resiliency feature in action shows you that the VMs survive a transient failure without issues. Your services need never know something was up. Even when the transient failure is recurring, that doesn't mean it will cause downtime. The node will be quarantined and, when it comes back up, the workload will be live migrated away.
You can watch a video of this in action here on Vimeo:
The quarantine threshold and duration, as well as the resiliency period, can be tweaked for your environment to get the best possible results.
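To get a feel for how the quarantine threshold and duration interact, here is a minimal Python sketch of the bookkeeping involved. It is a toy model, not the cluster's actual implementation. The class and field names are hypothetical, loosely mirroring the QuarantineThreshold and QuarantineDuration cluster properties, and the numbers (3 failures within an hour, 7200 seconds of quarantine) are what I recall the defaults to be, so check the values on your own cluster.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuarantineTracker:
    """Toy model of node quarantine bookkeeping (hypothetical, illustrative only)."""
    threshold: int = 3       # assumed: failures needed to trigger quarantine
    window_s: int = 3600     # assumed: window in which failures are counted
    duration_s: int = 7200   # assumed: how long the node stays quarantined
    failures: List[float] = field(default_factory=list)
    quarantined_until: Optional[float] = None

    def record_failure(self, now: float) -> bool:
        """Record a transient failure at time `now` (seconds); return quarantine state."""
        # Forget failures that have aged out of the counting window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.quarantined_until = now + self.duration_s
        return self.is_quarantined(now)

    def is_quarantined(self, now: float) -> bool:
        return self.quarantined_until is not None and now < self.quarantined_until
```

Lowering the threshold makes the cluster give up on a flapping node sooner. Raising the duration keeps it out of rotation longer, which gives you more time to fix the underlying issue.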
SMB 3 for the win! This is yet one more convincing argument to start looking into SOFS and leveraging the capabilities of SMB 3. Remember that you can run a SOFS cluster against your existing shared storage to get started, if you can get the IOPS/latency you require. But also look into Storage Spaces, especially Storage Spaces Direct, which avoids some of the drawbacks SANs have in such a scenario. High time for storage vendors to really scale out, implement SMB 3 well and completely, and keep the great added-value features they already have in their offerings. It's this or becoming yet a bit more irrelevant in today's storage scene in the Microsoft ecosystem.
Okay, but what if I use Storage Spaces Direct in a hyper-converged deployment? What will the scenario be there? Will the VMs stay online, like with SMB3, or will they go into a paused state, like with block-level storage?
Well, let's figure this one out together. With SMB3 SOFS the VMs can stay running because the CA share is serviced by a SOFS cluster that has no dependencies on the Hyper-V cluster. Now, when the issue occurs on the Hyper-V cluster itself and that issue affects the CSV (which depends on the cluster), the VMs can't stay online, as the CSVs can't be reached. So, as far as I know, if the cluster issue affects CSV availability, the VMs will be paused while being unmonitored. This is the case for block-level storage and for hyper-converged, as then the VMs aren't running on a SOFS share but on CSV directly on the cluster. Makes sense? Nice scenario to test.
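The reasoning in this reply boils down to one question: does the VM storage depend on the Hyper-V cluster that has the problem? Here is a small Python sketch of that rule. All names are hypothetical labels made up for illustration, and the hyper-converged entry reflects the expectation stated above, not a tested result.

```python
from enum import Enum


class VmState(Enum):
    RUNNING_UNMONITORED = "running, unmonitored"
    PAUSED_UNMONITORED = "paused, unmonitored"


# Hypothetical labels: does the VM storage depend on the Hyper-V cluster itself?
STORAGE_DEPENDS_ON_HYPERV_CLUSTER = {
    "sofs_ca_share": False,       # served by a separate SOFS cluster
    "block_level_csv": True,      # CSV is a resource of the Hyper-V cluster
    "hyper_converged_s2d": True,  # expectation from the reply above, untested
}


def expected_vm_state(storage_kind: str) -> VmState:
    """Expected VM state while the hosting node is Isolated and CSV access is lost."""
    if STORAGE_DEPENDS_ON_HYPERV_CLUSTER[storage_kind]:
        # The storage went away together with cluster membership.
        return VmState.PAUSED_UNMONITORED
    # The storage is external, so the VM keeps running without HA monitoring.
    return VmState.RUNNING_UNMONITORED
```

Calling `expected_vm_state("hyper_converged_s2d")` returns the paused state, which is exactly the outcome that still needs to be confirmed in a lab.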