Veeam Availability Suite 9.5 Update 3 Released

Veeam Availability Suite 9.5 Update 3 was released just in time to put under the X-Mas tree of your IT staff. It’s a major release and they’ve gone above and beyond what you’d expect from an update. Also read this post to see all the goodness it offers. It’s clear Veeam intends to become the backup provider for even more environments, whether these are physical, virtual or containers and whether they are on-premises or in private, hybrid or public clouds. Read up on some details in this blog post by Rick Vanover on the Veeam web site.

As a Veeam Vanguard I did not waste time and downloaded the updates to test the deployment in the lab and that went fine. From there it went to the proving grounds (real life hardware & labs) before deploying it in production. All of these deployments went super smooth.

I’m happy to report that Veeam Backup & Replication 9.5 and Veeam ONE 9.5 Update 3 are running very well in production and I have not seen any issues yet. In case you’re wondering or nervous about it, the SAN deployments with Off Host Proxies didn’t miss a beat. That’s the quality and smooth experience of Veeam that we have come to rely on. I’m still impressed, even after many updates, with how smoothly it goes and how streamlined the update process is.

Two more things. You only have a few days left to enter for a chance to win a fully paid trip to a VeeamON 2018 Event in your region. So act now if you’d like to attend for free. You have 2 days left!

It’s also still possible to nominate yourself or someone else for the Veeam Vanguard program. Don’t delay! You have until December 29th!

Veeam Vanguard Program

 

That’s it. Time for some down time for the end of year & new year festivities.

Hyper V Amigos Showcast 15 – VMFleet explained

After a short break in broadcasts, due to extremely busy times, the Hyper-V Amigos ride again! Yes, fellow Microsoft MVP Carsten Rachfahl (@hypervserver) from Rachfahl IT Solutions and yours truly are back in the saddle!

In episode 15 of the Hyper-V Amigos Showcast we look at VMFleet. VMFleet is a free Microsoft test suite to show/test and help analyze the performance of a Hyper-converged Storage Spaces Direct Setup.

In this episode, Hyper V Amigos Showcast 15 – VMFleet explained, we show you how to set it up, how it works and how to measure and evaluate your S2D deployment with it. We also share some real life tips with you. All this in just around 2 hours. So stick around for the long haul!

Cluster Shared Volumes without Active Directory

Introduction

Cluster Shared Volumes without Active Directory will come online if certain conditions are met. Basically, when the cluster can form. That’s what we’ll talk about here. There are a lot of things to consider when virtualizing your Active Directory environment, partially or completely. Too much for this blog post alone and much of it has been discussed before. Here I’ll discuss the case where people count on their Cluster Shared Volumes coming online when Active Directory is not available, are disappointed when they don’t, and think that it’s a bug or a broken promise. It is not! We know that since Windows Server 2012 the CSVs coming online does not depend on Active Directory being available (thanks to the local CLIUSR account being used for cluster starts).

Cluster Shared Volumes without Active Directory

So, let’s address getting your Cluster Shared Volumes online when AD is not available. The thing to realize is that not having Active Directory has been taken care of. What can still go wrong is that the cluster doesn’t come online properly, and that means that the CSV LUNs won’t be online either, as those are cluster resources. Hyper-V can boot the VMs without the cluster being formed, but the VMs reside on a CSV and when that is not available they cannot start. That means that getting your cluster to come up is the real issue.

Getting your cluster to come up is the real issue.

If cluster shared volumes can come up without a domain controller / AD being available since Windows Server 2012 (Failover Clustering and Active Directory Integration), how is it that so many people still have issues with it? How do you make this work for you? You can use a disk witness to protect you in most cases. When you have a file share or cloud witness you can take some steps to avoid running into this issue.

There is a great article on cluster behavior here: Failover Cluster Node Startup Order in Windows Server 2012 R2 (the same rules apply for Windows Server 2016). Read it and you’ll notice that the behavior differs depending on whether you have a witness defined and whether that witness is a disk witness or a file share or cloud witness. The disk witness has a copy of the cluster database. If it’s available, it will help you out of any situation where the paxos on the disk witness is more recent than the one on the cluster node, as the node will download the cluster database info from the disk witness. But a file share or cloud witness doesn’t have a copy, so it has a small disadvantage in certain scenarios. CSVs that won’t come up after booting the nodes in the wrong order when using a file share or cloud witness is one of those scenarios, but not having AD isn’t the root cause of this.
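Since the behavior depends on the witness type, it helps to know which one a cluster actually uses. The following is a minimal sketch, assuming it runs in an elevated PowerShell session on a cluster node with the FailoverClusters module installed. The remarks it prints simply restate the rules described above.

```powershell
# Minimal sketch: report which kind of witness the local cluster uses.
# Assumption: run elevated on a cluster node that has the FailoverClusters module.
Import-Module FailoverClusters

$quorum  = Get-ClusterQuorum
$witness = $quorum.QuorumResource

if ($null -eq $witness) {
    'No witness configured: the cluster relies on node votes only.'
}
else {
    # The resource type name is "Physical Disk" for a disk witness,
    # "File Share Witness" or "Cloud Witness" for the other two.
    $type = $witness.ResourceType.Name
    "Witness resource '$($witness.Name)' is of type '$type'."

    if ($type -eq 'Physical Disk') {
        'Disk witness: it holds a copy of the cluster database.'
    }
    else {
        'No cluster database copy on this witness: note which node is shut down last.'
    }
}
```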

Some people will never notice this issue at all, especially when they have a disk witness, but when they do, it might be at a very inconvenient moment in time. An example of where this situation can occur is a single cluster environment where the domain controllers run highly available on CSV in that cluster, the cluster was shut down completely, it has a file share witness and the nodes are started in the wrong order.

Making sure your CSVs come online = the cluster being formed

Well for one make sure you are using Windows Server 2012 or higher for your cluster. That’s a given.

Beyond that you basically just need to know what to do in what scenario to get your cluster up and running so you won’t have issues with getting the CSV to come up. You just have to follow some rules of thumb and you’ll be fine. Also, there’s almost always a way to get out of a pickle, just don’t panic. But also remember that you can make your life easier when you design your solutions with failure in mind and by knowing your options so you can act correctly. In my example I’ll be using a Hyper-V cluster with CSVs but the same goes for SOFS, SQL Server clusters leveraging CSVs etc.

Planned down time

If you’re shutting down a complete cluster you have two options to make sure things go as smoothly as possible.

Option 1 – A clean cluster shutdown by the book

  • Shut down the workload, i.e. the VMs.
  • Stop the cluster
  • Shut down the cluster nodes.
  • Boot the cluster nodes. You can start with any you like. The CSVs will come online. You will be able to start the VMs. Do start with the domain controllers and wait for them to come online before starting the others.
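The steps above can be sketched in PowerShell as follows. This is only a sketch under assumptions: it runs elevated on one cluster node, the Hyper-V and FailoverClusters modules are available on it, and every running VM on the nodes may be shut down. Test it in the lab before relying on it.

```powershell
# Minimal sketch of a "by the book" full cluster shutdown.
# Assumption: run elevated on one cluster node; adapt the VM selection to your needs.
Import-Module FailoverClusters, Hyper-V

# Remember the node names while the cluster is still up.
$nodes = (Get-ClusterNode).Name

# 1. Shut down the workload: a guest shutdown of every running VM on every node.
foreach ($node in $nodes) {
    Get-VM -ComputerName $node |
        Where-Object State -eq 'Running' |
        Stop-VM
}

# 2. Stop the cluster service on all nodes (-Force skips the confirmation prompt).
Stop-Cluster -Force

# 3. Shut down the nodes. If your domain controllers were among the VMs stopped
#    in step 1, remote authentication may no longer work; in that case run
#    "Stop-Computer -Force" locally on each node instead.
Stop-Computer -ComputerName $nodes -Force
```

Because the cluster is stopped as a whole before the nodes go down, every node holds the same cluster database and the boot order afterwards does not matter.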

During this you’ll see some “collateral” events, errors and warnings due to Active Directory not being available. The cluster name has issues without Active Directory, but that’s a management connection point; it doesn’t mean the cluster doesn’t work. Once Active Directory is available the cluster name will come online automatically when the default failure policy restarts it. You can also manage the cluster via RDP or console by connecting to “.” locally on that node or by using the running node’s name. You can also try to bring the cluster name online manually when Active Directory is up and running.
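Managing the cluster through the local node rather than the cluster name can look like the sketch below. It assumes an elevated session on a running cluster node and that the cluster name resource still carries its default name, "Cluster Name". Check yours with Get-ClusterResource first.

```powershell
# Minimal sketch: work against the local node ("." as the cluster) while the
# cluster name is offline. Assumption: run elevated on a running cluster node.
Import-Module FailoverClusters

# Check that the cluster formed and the CSVs are online.
Get-ClusterNode -Cluster . | Format-Table Name, State -AutoSize
Get-ClusterSharedVolume -Cluster . | Format-Table Name, State, OwnerNode -AutoSize

# Look at the cluster name resource; it is expected to be offline or failed
# while no domain controller is reachable.
Get-ClusterResource -Cluster . -Name 'Cluster Name'

# Once a domain controller is available again you can bring it online manually
# instead of waiting for the failure policy to restart it.
Start-ClusterResource -Cluster . -Name 'Cluster Name'
```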

Option 2 – A clean cluster shutdown as it happens most of the time in real life

This is the one a lot of people use to keep part of the workload running as long as possible.

  • Shut down the non-critical workload, i.e. the VMs.
  • Pause a node so the critical workloads live migrate to other nodes
  • When the node is paused, shut down the node.
  • You rinse and repeat this until the last node is left with only the most critical VMs

  • Finally, you stop the workload on that last cluster node and shut it down as well.
  • Now comes the critical part: remember which node you shut down last. You have to start that one first and you’ll see that your CSV will come online. If you boot another node, you might panic as the CSV will not come online.

Now I need to correct this a little bit. With a disk witness you are OK whatever node you boot. When the paxos numbers on the first node to boot and the disk witness are compared, the most recent copy will be used. Either the local one on the node will be used directly or after it has been updated with the data from the disk witness. To make things simple for the ops team I always tell them to note the last node they shut down anyway, no matter what type of witness they have. It’s good info to have.

With a file share or cloud witness the shutdown/startup order really comes into play. The reason this happens is that by shutting down node by node we end up with a one node cluster (last man standing). When that node is shut down, it’s the only node that knows about the last (potential) changes in the cluster, as it holds a copy of the cluster database. Remember that a file share or a cloud witness has no copy of the cluster database. That’s why the last node to be shut down has to come online first: when the paxos numbers are compared with the witness, that node can form a cluster. If the node that boots first does not hold the most recent paxos number, it cannot download the cluster database info from the file share or cloud witness. Such a node cannot form a cluster and bring the CSVs online. If the first node to boot was the last one to shut down, it can form a cluster and the CSVs will come online. This is the big difference with option one, where you stop the entire cluster and then take the nodes offline.
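The node-by-node shutdown of Option 2 can be sketched like this. The node name and the note path are hypothetical placeholders. The sketch assumes an elevated session on a node that stays up, and that you keep the note somewhere that does not depend on the cluster.

```powershell
# Minimal sketch of draining and shutting down one node at a time.
# 'HV-NODE2' and the note path are hypothetical; replace them with your own.
Import-Module FailoverClusters

$node     = 'HV-NODE2'
$notePath = '\\mgmt01\ops\last-cluster-node.txt'   # assumed location outside the cluster

# Pause the node and live migrate its roles to the remaining nodes;
# -Wait returns only when the drain has completed.
Suspend-ClusterNode -Name $node -Drain -Wait

# The node is paused and empty, so it can be shut down.
Stop-Computer -ComputerName $node -Force

# See which nodes are still up before repeating the steps for the next one.
Get-ClusterNode | Where-Object State -eq 'Up' | Format-Table Name, State -AutoSize

# On the very last node, just before shutting it down, record its name so the
# ops team knows which node to boot first.
"$env:COMPUTERNAME was shut down last on $(Get-Date -Format s)" |
    Set-Content -Path $notePath
```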

You might not know or remember the order. If that’s the case you still have options like starting the cluster node with the -fixquorum option (net.exe start clussvc /forcequorum) at the risk of losing some cluster changes that are not in the local cluster database.
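As a sketch, forcing quorum on one node and checking the result could look like this. It assumes an elevated session directly on the node you choose to force, with the cluster service currently stopped on it. The node you force becomes authoritative, which is where the risk of losing changes comes from.

```powershell
# Minimal sketch: force the cluster to form on this node, then verify.
# Assumption: run elevated, locally on the node, with the cluster service stopped.
net.exe start clussvc /forcequorum

Import-Module FailoverClusters

# The local node should report "Up"; the other nodes join as they are started.
Get-ClusterNode -Cluster . | Format-Table Name, State -AutoSize

# With the cluster formed, the CSVs should come online.
Get-ClusterSharedVolume -Cluster . | Format-Table Name, State, OwnerNode -AutoSize
```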

No need to immediately go to backups, extract the domain controller VMs from SAN snapshots or mount a LUN to a different machine to extract the VM files for the DC or so. Don’t panic!

One or more failed nodes

Well, as long as the cluster survives, your domain controller VMs should fail over. If all domain controller virtual machines are going to be running highly available on a cluster, keep ’em on separate nodes (anti-affinity), on separate CSV LUNs and, if possible, on separate clusters. As long as that cluster is still functional, there are no issues here.
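Anti-affinity for the domain controller VMs can be set with the AntiAffinityClassNames property of their cluster groups. The sketch below assumes two clustered VMs with the hypothetical names DC01 and DC02 and a class name of my own choosing. Note that the assignment replaces any class names already set on those groups.

```powershell
# Minimal sketch: ask the cluster to keep the DC VMs on different nodes.
# 'DC01', 'DC02' and 'DomainControllers' are hypothetical names.
Import-Module FailoverClusters

$class = New-Object System.Collections.Specialized.StringCollection
[void]$class.Add('DomainControllers')

foreach ($vm in 'DC01', 'DC02') {
    # Groups sharing a class name are placed on different nodes when possible.
    (Get-ClusterGroup -Name $vm).AntiAffinityClassNames = $class
}

# Verify the setting and see where the groups currently run.
Get-ClusterGroup -Name 'DC01', 'DC02' |
    Format-Table Name, OwnerNode, AntiAffinityClassNames -AutoSize
```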

Total cluster failure

The cluster nodes all go down due to a “global” BSOD or are turned off due to a power failure or a storage array crash. This is more the realm of bad dreams, I know, but it does happen. Often things will recover and you’ll be fine, but you can do your part. The same rules apply: get the cluster to form and you’ll have your CSVs online. In a bad case -fixquorum is your friend, but normally it’s not your first option. In the worst case you’ll need to recover the cluster from backup or rebuild it. It’s a very bad day if it comes to that. And cluster recovery is not the subject of this post.

Conclusion

Don’t blame Active Directory and start troubleshooting or fixing the wrong problem. So yes, CSVs will come online when certain conditions are met and you can work yourself out of a pickle if needed. But during a disaster that’s only extra work and stress you might not want to worry about if you can avoid it. It’s good to know how to resolve issues around CSVs not coming online when the shit hits the fan, as even the best-laid plans tend to get sidetracked by reality when disaster strikes.

If you cannot guarantee control over all the prerequisites and might not have the skills in place when needed, you might consider other options. Some of these are actually the best practices of the past, when a CSV would indeed not come online without Active Directory in any way. This is great for AD related issues but not for your offline CSVs, they need the cluster to form properly!

Sure, you can run the domain controller virtual machines on local storage, not made highly available. This could be on one of the cluster nodes (*) or on a stand-alone Hyper-V host. Having a physical domain controller is also a possibility. This helps avoid issues with AD in virtualized environments, as many other services are very dependent on domain controllers and it’s good to have one available all of the time and to get the others back online a.s.a.p.

I’ll leave you with the fact that virtualizing domain controllers can be done, but it pays to study up on how to do it well and test your assumptions in the lab. There is a lot of information on virtualizing domain controllers for a reason. Read it and process what you’ve learned from it. You might find that this CSV thingy is not the most complex subject to deal with.

(*) Please note that some cluster deployments like HCI based on S2D do not support running other (local) storage in addition to the boot OS and the S2D storage pool volumes.

My favorite deployment for VMs with Discrete Device Assignment for GPU

Introduction

Recently I had an interesting discussion on how to leverage Discrete Device Assignment (DDA) for GPU needs when it’s only needed for a certain number of virtual machines. Someone had read my blogs on leveraging DDA, which made her optimistic and enthusiastic. But she noticed in the lab that she could not leverage DDA on a VM running on a cluster and she could not use storage QoS policies on a stand-alone Hyper-V host with local storage. So, what could she do?

Well for one, her findings are correct. Microsoft did not enable DDA on clustered virtual machines. It doesn’t make sense, as the GPU hardware is tied to the virtual machine and high availability, whether planned (live migration) or unplanned (failover), isn’t possible anyway. It just cannot be done. I hear you when you say “but they pulled it off for SR-IOV for networking”. Sure, but please keep in mind that network cards with Ethernet and TCP/IP allow for different approaches than high end video.

My favorite deployment for VMs with Discrete Device Assignment for GPU

My favorite deployment for VMs with Discrete Device Assignment (DDA) for GPU leverages SMB3 SOFS shares for the virtual hard disks and stand-alone Hyper-V hosts that are member servers in the domain. Let me explain why.

Based on what we discussed above we have some options. One workaround is running the DDA virtual machines, not highly available, on local storage on a cluster node. But that would mean you would have a few VMs on all the nodes and that all those nodes must have a DDA capable GPU. Or, if you limit the number of nodes that have such a GPU, you’ll have a few oddballs in your cluster. You’ll need to manage some extra complexity and must safeguard against assigning a GPU via DDA that is already in use for RemoteFX. That causes all kinds of unpleasantness, nothing too deadly but not something you want to do on your production VDI clusters for fun. It’s a bit like not running a domain controller on a CSV and not making it highly available. If that’s the only option you have you can do that, and I do when needed, as Microsoft has improved a lot of things to make this a better and less risky experience. But I prefer to either have a physical one or host it on a separate non-clustered Hyper-V host if that’s an option, because not all storage solutions and environments have all the capabilities needed to make that foolproof.

Also note that running other storage on an S2D node isn’t supported. You have your OS on the boot disks and the disks used in storage spaces. Odd ones out aren’t supposed to be there as S2D will try to recruit them. You can get away with it when using traditional shared storage.

What I also don’t like about that is that if the cluster storage is not SMB3 SOFS you don’t get the benefit of storage QoS policies in Windows Server 2016, as that only works with CSV. So optionally you could leave the non-clustered VM on a CSV. But that’s perhaps a bit confusing and some people might think someone forgot to make the machine highly available etc.

My preferred setup to get highly available storage for virtual machines with DDA needs, which also benefits from what storage QoS policies have to offer for VDI, is to use stand-alone Hyper-V hosts that have DDA capable GPUs and leverage SMB3 SOFS shares for the virtual machines.

The virtual machines cannot be highly available anyway so you lose nothing there. The beauty is that in this case, as you leverage a Windows Server 2016 SOFS cluster for Hyper-V storage over SMB3 shares, you do get Storage QoS policies.
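Creating such a policy and attaching it to the disks of a DDA virtual machine could look like the sketch below. The policy, VM and host names are the ones used in the queries in this post. The SOFS node name and the IOPS values are assumptions for illustration only, and PowerShell remoting to the SOFS node is assumed to work.

```powershell
# Minimal sketch: create a dedicated Storage QoS policy on the SOFS cluster and
# assign it to the virtual hard disks of a VM on a stand-alone Hyper-V host.
# 'SOFS-NODE1' and the IOPS limits are hypothetical.
$policyId = Invoke-Command -ComputerName 'SOFS-NODE1' -ScriptBlock {
    $policy = New-StorageQosPolicy -Name 'DedicatedTier1Policy' `
        -PolicyType Dedicated -MinimumIops 500 -MaximumIops 5000
    $policy.PolicyId
}

# Attach the policy to every virtual hard disk of the VM. The disks must live on
# a share of the SOFS cluster that owns the policy.
Get-VM -Name 'DDAVMSOFSStorage' -ComputerName 'RemoteFXHost' |
    Get-VMHardDiskDrive |
    Set-VMHardDiskDrive -QoSPolicyID $policyId
```

A Dedicated policy applies its minimum and maximum to each virtual hard disk individually, while an Aggregated policy shares them across all disks assigned to it.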

#On a SOFS node

Get-StorageQosPolicy -Name DedicatedTier1Policy | Get-StorageQosFlow | Format-Table InitiatorName, *IOPS, Status, PolicyId, FilePath -AutoSize

#Query for the VM disks on the Hyper-V node

Get-VM -Name DDAVMSOFSStorage -ComputerName RemoteFXHost | Get-VMHardDiskDrive | Format-List *

#We generate some IO and get some stats on a SOFS node

Get-StorageQosFlow

Get-StorageQosVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\

Get-StorageQosVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\ | Format-List

You can start out with one Hyper-V node and add more when needed; that’s scale out. Depending on the needs of the virtual machines, the specs of the servers (memory, CPU cores) and the capability and number of GPUs in the video cards, you get some scale up as well.

To learn more about DDA go here:  https://blog.workinghardinit.work/?s=DDA&submit=Search

To learn more about storage QoS policies go here:

Some more considerations

By going disaggregated, you can leverage a SOFS share for virtual machines running on a Hyper-V cluster as well as on stand-alone (non-clustered) Hyper-V hosts that are domain members. The SOFS cluster can leverage S2D, traditional storage spaces with shared SAS (JBODs) or even an FC, iSCSI or shared SAS SAN if that’s the only option you have. That’s all OK as long as the SOFS is running on Windows Server 2016 and the Hyper-V hosts (stand-alone or clustered) are running 2016 as well (needed for Storage QoS policies and DDA). There is no need for the Hyper-V host to be part of a cluster to get the results you need. If I use SOFS for both scenarios I can use the same storage array, but I don’t need to. I could also use separate storage arrays. If the Hyper-V cluster is leveraging CSV instead of SOFS I will need to use a separate one for SOFS, as it’s ill-advised to mix Hyper-V workloads with the SOFS role. Keep things easy, clear and supportable. I’ll borrow a picture I got from a Microsoft PM recently. Do seek out the bad ideas.

clip_image008