Optimizing Backups: PowerShell Script To Move All Virtual Machines On A Cluster Shared Volume To The Node Owning That CSV

When you are optimizing the number of snapshots taken for backups, or are dealing with storage vendors’ software that leverages their hardware VSS provider, you sometimes encounter requirements that are at odds with virtual machine mobility and dynamic optimization.

For example, when backing up multiple virtual machines leveraging a single CSV snapshot, you’ll find that:

  • Some SAN vendor software requires that the virtual machines in that job are owned by the same host or the backup will fail.
  • Backup software can also require that all virtual machines are running on the same node when you want them to be protected using a single CSV snapshot. The better ones don’t let the backup job fail; they just create multiple snapshots when needed, but that’s less efficient and potentially makes you run into issues with your hardware VSS provider.

Image: VEEAM B&R v8 in action … 8 SQL Server VMs with multiple disks on the same CSV being backed up by a single hardware VSS provider snapshot (DELL Compellent 6.5.20 / Replay Manager 7.5) and an off-host proxy.

Organizing & orchestrating backups requires some effort, but can lead to great results.

Normally when designing your cluster you balance things out as well as you can. That helps reduce the need for constant dynamic optimizations. You also make sure that, if at all possible, you keep all files related to a single VM together on the same CSV.

Naturally you’ll have drift. If not, you either have a very stable environment or are not leveraging the capabilities of your Hyper-V cluster. Mobility, dynamic optimization and high to continuous availability are what we want, and we don’t block that to serve the backups. We do try to help the backups out as much as possible, however. A good design does this.

If you’re not running a backup every 15 minutes in a very dynamic environment, you can deal with this by live migrating resources to where they need to be in order to optimize backups.

Here’s a little PowerShell snippet that will live migrate all virtual machines on the same CSV to the owner node of that CSV. You can run this as a script prior to the backups starting, or you can run it as a weekly scheduled task to prevent the drift from the ideal situation for your backups from becoming too large, which would require more VSS snapshots or even cause backups to fail. The exact approach depends on the storage vendors and/or backup software you use in combination with the needs and capabilities of your environment.

Clear-Host

$Cluster = Get-Cluster
$AllCSV = Get-ClusterSharedVolume -Cluster $Cluster

ForEach ($CSV in $AllCSV)
{
    Write-Output "$($CSV.Name) is owned by $($CSV.OwnerNode.Name)"

    #We grab the friendly name of the CSV
    $CSVVolumeInfo = $CSV | Select-Object -ExpandProperty SharedVolumeInfo
    $CSVPath = ($CSVVolumeInfo).FriendlyVolumeName

    #We escape the path, as \ is an escape character in the regular expression that -match uses.
    $FixedCSVPath = [regex]::Escape($CSVPath)

    #We grab all VMs whose owner node is different from the owner node of the CSV we're working with
    #From those we grab the ones that are located on the CSV we're working with
    $VMsToMove = Get-ClusterGroup | Where-Object {($_.GroupType -eq 'VirtualMachine') -and ($_.OwnerNode.Name -ne $CSV.OwnerNode.Name)} | Get-VM | Where-Object {($_.Path -match $FixedCSVPath)}

    ForEach ($VM in $VMsToMove)
    {
        Write-Output "`tThe VM $($VM.Name) located on $CSVPath is not running on host $($CSV.OwnerNode.Name) who owns that CSV"
        Write-Output "`tbut on $($VM.ComputerName). It will be live migrated."
        #Live migrate that VM off to the node that owns the CSV it resides on
        Move-ClusterVirtualMachineRole -Name $VM.Name -MigrationType Live -Node $CSV.OwnerNode.Name
    }
}
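
If you go the scheduled task route, registering the snippet above as a weekly task could look something like the sketch below. The script path, task name, schedule and the prompted service account are all assumptions for illustration, so adapt them to your environment. The account needs the rights to live migrate cluster roles.

```powershell
# Hypothetical location where the snippet above was saved as a script.
$ScriptPath = 'C:\Scripts\Move-VMsToCsvOwner.ps1'

# Prompt for the (assumed) service account that will run the task.
$Cred = Get-Credential -Message 'Account that runs the live migrations'

# Run the script with Windows PowerShell, without loading a profile.
$Action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument "-NoProfile -File `"$ScriptPath`""

# Assumed schedule: weekly on Saturday evening, ahead of the weekend backups.
$Trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Saturday -At '18:00'

# Register the task so it runs elevated under the supplied account.
Register-ScheduledTask -TaskName 'Align VMs with CSV owner node' `
    -Action $Action -Trigger $Trigger -RunLevel Highest `
    -User $Cred.UserName -Password $Cred.GetNetworkCredential().Password
```

Running it from one cluster node is enough, as the script queries the cluster that node belongs to.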

Now there is a lot more to discuss, e.g. what and how to optimize for virtual machines that are clustered. For optimal redundancy you’ll have those running on different nodes and CSVs. But even beyond that, you might have the clustered VMs running on different clusters, as the cluster is the failure domain. But I get the remark that my blogs are wordy and verbose so … that’s for another time ;-)

Hyper-V Amigos Showcast Episode 8: Storage Replica in a Stretched Cluster

We finally got to make the next “Hyper-V Amigos Showcast”. Due to very busy schedules we had to postpone this a couple of times. But we made it! In this episode (the 8th one) Carsten and I show one application of a great new feature in Windows Server vNext: Storage Replica. This allows us to replicate a volume between two storage systems without caring what that storage system is, as long as you have Windows volumes on it. Replication can be synchronous or asynchronous and there are multiple scenarios in which to use this.

Here we focus on trying out replication between two clusters or in a stretched cluster scenario. I have already made a video demonstrating server to server replication. In this showcast we demonstrate  the Stretched Cluster scenario (and troubleshoot our own lab).

More info is available here:

Enjoy and see you next time!

Video Interview On Rolling Cluster Upgrades in Windows Server vNext

Carsten Rachfahl from Rachfahl IT-Solutions (quite possibly Germany’s leading Hyper-V, Storage Spaces & Private cloud consultancy) and I got together in Berlin last November at the Microsoft Technical Summit 2014. Between presenting (I delivered What’s new in Failover Clustering in Windows Server 2012 R2), workshops and interviews we found some time to do a video interview.

We discussed a very welcome new capability in Windows Server vNext: “Rolling cluster updates”, or “Cluster Operating System Rolling Upgrade” as Microsoft calls it in the Windows Server Technical Preview. I blogged about this rather soon after the release of the Technical Preview in “First experiences with a rolling cluster upgrade of a lab Hyper-V Cluster (Technical Preview)”.

Video interview with Didier Van Hoye about Rolling Cluster Upgrade

We’ve been able to do rolling updates of Windows NLB for a long time and we’ve been asking for that same capability in Windows Failover Clustering for many years and now, it’s finally coming! And yes, as you will notice we like that a lot!

You need to realize that making the transition from one version to another as smooth, easy and risk free as possible is of great value to the customer, as it enables them to upgrade faster and get the benefits of their investment quicker. For Microsoft it means they can have more people move to more modern environments faster, which helps with support and delivering value in a secure and modern environment.

At the end we also joke around a bit about DevOps and how this is just a set of training wheels on the road to true site resilience engineering. All fun and all good. Enjoy!

SMB Direct With RoCE in a Mixed Switches Environment

I’ve been setting up a number of Hyper-V clusters with Mellanox ConnectX3 Pro dual port 10Gbps Ethernet cards. These Mellanox cards provide a nice number of queues (128) for DVMQ and also give us RDMA/SMB Direct capabilities for CSV & live migration traffic.

Mixed Switches Environments

Now RoCE and DCB come with a learning curve for all of us and are not for the faint of heart. DCB configuration is non-trivial, certainly not across multiple hops and different switches. Some say it’s to be avoided or can’t be done.

You can only get away with a single pair of (uniform) switches in smaller deployments. On top of that I’m seeing more and more different types of switches being used to optimize value, so it’s not just a lab exercise to do this. Combine this with the fact that DCB is an unavoidable technology in networking, unless it gets replaced with something better and easier, and you might as well try and learn. So I did.

Well, right now I’m successfully seeing RoCE traffic going across cluster nodes spread over different racks in different rows at excellent speeds. The core switches are DELL Force10 S4810 and the rack switches are PowerConnect 8132Fs. By borrowing an approach from spine/leaf designs, this setup delivers bandwidth where they need it at a price point they can afford. They don’t need more expensive switches for the rack or the core, as these do support DCB and give the port count needed at the best price point. This isn’t supposed to be the top in non-blocking network design. Nope, but what’s available & affordable today in your hands is better than perfection tomorrow. On top of that, this is a functional learning experience for all involved.

We see some pause frames being sent once in a while and this doesn’t impact speed very much. It does guarantee lossless traffic, which is what we need for RoCE. When we live migrate 300GB worth of memory across the nodes in the different racks we get great results. It varies a bit depending on the load the switches & switch ports are under, but that’s to be expected.
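
On the Windows nodes, the lossless behavior RoCE needs comes from tagging SMB Direct traffic with a priority and enabling Priority Flow Control (PFC) for that priority only. Here’s a minimal sketch of what that can look like. Priority 3, the 50% ETS bandwidth share and the adapter names are assumptions for illustration, not the values used in this setup, and the switches have to be configured to match end to end.

```powershell
# Install the DCB feature on the Hyper-V node.
Install-WindowsFeature -Name Data-Center-Bridging

# Tag SMB Direct (NetworkDirect, port 445) traffic with 802.1p priority 3 (assumed value).
New-NetQosPolicy -Name 'SMB Direct' -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# Enable PFC for that priority only and disable it for all the others.
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# Reserve a minimum bandwidth share for that priority via ETS (50% is an assumption).
New-NetQosTrafficClass -Name 'SMB Direct' -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

# Apply DCB/QoS on the RDMA capable ports (hypothetical adapter names).
Enable-NetAdapterQos -Name 'RDMA-1', 'RDMA-2'
```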

Now tests have shown us that we can live migrate just as fast with non-RDMA 10Gbps as we can with RDMA, leveraging “only” Multichannel. So why even bother? The name of the game is low latency and preserving CPU cycles for SQL Server or storage traffic over SMB3. Why? We can just buy more CPUs/cores. Great, easy & fast, right? But then SQL Server licensing comes into play and it becomes very expensive. Also, storage scenarios under heavy load are not where you want to drop packets.

Will this matter in your environment? Great question! It depends on your environment. Sometimes RDMA is needed/warranted, sometimes it isn’t. But the Mellanox cards are price competitive and why not test and learn right? That’s time well spent and prepares you for the future.

But what if it goes wrong … ah well, if the nodes fail to connect over RDMA you still have Multichannel, and if the DCB stuff turns out not to be what you need or can handle, turn it off and you’ll be good.
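
To see what you actually get, and to back out if needed, something along these lines can help. It’s a sketch, not a procedure. The adapter name is hypothetical. With RDMA disabled on an adapter, SMB falls back to regular TCP/IP with Multichannel.

```powershell
# Is RDMA enabled on the adapters of this node?
Get-NetAdapterRdma

# Does the SMB client see its interfaces as RSS/RDMA capable?
Get-SmbClientNetworkInterface

# Which connections is SMB Multichannel using right now? Run this while traffic flows.
Get-SmbMultichannelConnection

# Backing out on one adapter (hypothetical name): turn off RDMA and DCB/QoS.
Disable-NetAdapterRdma -Name 'RDMA-1'
Disable-NetAdapterQos -Name 'RDMA-1'
```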

RoCE stuff to test: Routing

Some claim it can’t be done reliably. But hey, they said that about non-uniform switch environments too ;-). So will it all fall apart and will we need to standardize on iWarp in the future? Maybe, but isn’t DCB the technology used for lossless, high performance environments (FCoE but also iSCSI)? So why would iWarp not need it? Sure, it works quite well without it. So does iSCSI, right, up to a point? I see these comments a lot more from virtualization admins who have a hard time doing DCB (I’m one, so I do sympathize) than from hard core network engineers. As I have RoCE cards and they have become routable with the latest firmware and drivers, I’d love to try and see if I can make RoCE v2 or Routable RoCE work over different types of switches, but unless someone is going to sponsor the hardware I can’t even start doing that. Anyway, lossless is the name of the game whether it’s iWarp or RoCE. Who knows what we’ll be doing in 5 years? 100Gbps iWarp & iSCSI both covered by DCB vNext while FC, FCoE, Infiniband & RoCE have fallen into oblivion? We’ll see.