Shared nothing live migration with a virtual switch change and VLAN ID configuration

Introduction

I was working on a hardware refresh, consolidation and upgrade to Windows Server 2019 project. This mainly boils down to cluster operating system rolling upgrades from Windows Server 2016 to Windows Server 2019 with new servers replacing the old ones. Pretty straightforward. So what does this have to do with shared nothing live migration with a virtual switch change and VLAN ID configuration?

Due to the consolidation aspect, we also had to move virtual machines from some older clusters to the new clusters. The old cluster nodes have multiple virtual switches. These connect to different VLANs. Some of the virtual machines have only one virtual network adapter that connects to one of the virtual switches. Many of the virtual machines are multihomed. The number of virtual NICs per virtual machine was anywhere from 1 to 3. For this purpose, we had the challenge of doing a shared nothing live migration with a virtual switch change and VLAN ID configuration. All this without downtime.

Meeting the challenge

In the new cluster, there is only one converged virtual switch. This virtual switch attaches to trunked network ports with all the required VLANs. As we have only one virtual switch on the new Hyper-V cluster nodes, the name differs from those on the old Hyper-V cluster nodes. This prevents live migration. Fixing this is our challenge.

First of all, Compare-VM is your friend for finding blocking incompatibilities between the source and the target nodes. You can read about that in many places. Here, we focus on our challenge.

Making shared nothing migration work

The first step is to make sure shared nothing migration works. We can achieve this in several ways.

Option 1

We can disconnect the virtual machine network adapters from their virtual switch. While this allows you to migrate the virtual machines, it leads to connectivity loss. This is not acceptable.

Option 2

We can preemptively connect the virtual machine network adapters to a virtual switch with the same name as the one on the target and enable the VLAN ID. This means you have to create those switches and need NICs to do so. But unless you configure and connect them to the network just like on the new Hyper-V hosts, this also leads to connectivity loss. That was not possible in this case, so this option is again unacceptable.

Option 3

What I did was create dummy virtual switches on the target hosts. For this purpose, I used some spare LOM NICs. I did not configure them otherwise. As a matter of fact, I did not even connect them. Just the fact that they exist with the same names as on the old Hyper-V hosts is sufficient to make shared nothing migration possible. Actually, this is a great time to remind ourselves that we don't even need spare NICs. Dummy private virtual switches that are not even attached to a NIC will also do.

After we have finished the migrations we just delete the dummy virtual switches. That's all there is to do if you used private ones. If you used spare NICs, just disable them again. Now everything is as it was and should be on the new cluster nodes.
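
Here is a minimal sketch of that idea. The host and switch names are just examples (they match the ones used in the sample script further down); adapt them to your environment.

    #Create dummy private virtual switches on the target host so the switch names
    #match those on the old Hyper-V hosts. No NIC or network configuration is needed.
    $TargetNode = 'ZULU'
    $DummySwitches = 'vSwitch-VLAN500', 'vSwitch-VLAN600'
    foreach ($DummySwitch in $DummySwitches) {
        New-VMSwitch -ComputerName $TargetNode -Name $DummySwitch -SwitchType Private
    }

    #Once all virtual machines have been moved and reconnected, clean up the dummies.
    foreach ($DummySwitch in $DummySwitches) {
        Remove-VMSwitch -ComputerName $TargetNode -Name $DummySwitch -Force
    }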

Turning shared nothing migration into shared nothing live migration

Remember, we need zero downtime. You have to keep in mind that as long as the shared nothing live migration is running all is well. We have connectivity to the original virtual machines on the old cluster nodes. As soon as the shared nothing live migration finishes we do 2 things. First of all, we connect the virtual network adapters of the virtual machines to the new converged virtual switch. We also enable the VLAN ID. To achieve this, we script it out in PowerShell. As a result, it is so fast that we drop only 1 or 2 pings. Just like a standard live migration.

Below you can find a conceptual script you can adapt for your own purposes. For real migrations, add logging and error handling. Please note that to leverage shared nothing migration you need to be aware of the security requirements. Credential Security Support Provider (CredSSP) is the default option. If you want or must use Kerberos, you must configure constrained delegation in Active Directory.

I chose to use CredSSP as we would decommission the old hosts soon afterward anyway. It also means we did not need any Active Directory work done, which can be handy if that is not easy to get done in the environment you are in. We started the script on every source Hyper-V host, migrating a bunch of VMs to a new Hyper-V host. This worked very well for us. Hope this helps.
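
As a minimal sketch (the host names are examples), enabling live migration with the default CredSSP authentication on the hosts involved boils down to something like this. Remember that with CredSSP you have to start the migration from a session on the source host itself.

    #Enable incoming/outgoing live migrations with CredSSP on every host involved.
    #With CredSSP, run the migration while logged on to the source host.
    $HyperVHosts = 'NODE-A', 'ZULU'
    foreach ($HyperVHost in $HyperVHosts) {
        Enable-VMMigration -ComputerName $HyperVHost
        Set-VMHost -ComputerName $HyperVHost -VirtualMachineMigrationAuthenticationType CredSSP
    }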

Sample Script

    #The source Hyper-V host
    $SourceNode = 'NODE-A'
    #The LUN (CSV) you want to storage migrate your VMs away from
    $SourceRootPath = "C:\ClusterStorage\Volume1*"

    #The target Hyper-V host
    $TargetNode = 'ZULU'
    #The storage path you want to storage migrate your VMs to
    $TargetRootPath = "C:\ClusterStorage\Volume1"

    $OldVirtualSwitch01 = 'vSwitch-VLAN500'
    $OldVirtualSwitch02 = 'vSwitch-VLAN600'
    $NewVirtualSwitch = 'ConvergedVirtualSwitch'
    $VlanId01 = 500
    $VlanId02 = 600

    #Grab all the VMs we find that have virtual disks on the source CSV - WARNING: for W2K12 you'll need to loop through all cluster nodes.
    $AllVMsOnRootPath = Get-VM -ComputerName $SourceNode | Where-Object { $_.HardDrives.Path -like $SourceRootPath }

    #We loop through all VMs we find on our SourceRootPath
    ForEach ($VM in $AllVMsOnRootPath) {
        #We generate the final VM destination path
        $TargetVMPath = $TargetRootPath + "\" + ($VM.Name).ToUpper()
        #Grab the VM name and write out some info to track progress
        $VMName = $VM.Name
        $VM.VMid
        $VMName

        #A clustered VM must be removed from the cluster before we can move it to a host outside that cluster
        if ($VM.IsClustered -eq $True) {
            Write-Host -ForegroundColor Magenta $VM.Name "is clustered and is being removed from the cluster"
            Remove-ClusterGroup -VMId $VM.VMid -Force -RemoveResources
            Do { Start-Sleep -Seconds 1 } While ($VM.IsClustered -eq $True)
            Write-Host -ForegroundColor Yellow $VM.Name "has been removed from the cluster"
        }

        #Do the actual shared nothing live migration of the VM. $TargetVMPath creates the default subfolder structure
        #for the virtual machine config, snapshots, smart paging & virtual hard disk files.
        Move-VM -Name $VMName -ComputerName $VM.ComputerName -IncludeStorage -DestinationStoragePath $TargetVMPath -DestinationHost $TargetNode

        #Reconnect any vNICs that were on the first old virtual switch to the converged switch and set the VLAN ID
        $AdaptersOnOldvSwitch01 = Get-VMNetworkAdapter -ComputerName $TargetNode -VMName $VMName | Where-Object SwitchName -eq $OldVirtualSwitch01
        if ($Null -ne $AdaptersOnOldvSwitch01) {
            foreach ($VMNetworkAdapter in $AdaptersOnOldvSwitch01) {
                Write-Host 'Moving to correct vSwitch'
                Connect-VMNetworkAdapter -VMNetworkAdapter $VMNetworkAdapter -SwitchName $NewVirtualSwitch
                Write-Host "Setting VLAN $VlanId01"
                Set-VMNetworkAdapterVlan -VMNetworkAdapter $VMNetworkAdapter -Access -VlanId $VlanId01
            }
        }

        #Reconnect any vNICs that were on the second old virtual switch to the converged switch and set the VLAN ID
        $AdaptersOnOldvSwitch02 = Get-VMNetworkAdapter -ComputerName $TargetNode -VMName $VMName | Where-Object SwitchName -eq $OldVirtualSwitch02
        if ($Null -ne $AdaptersOnOldvSwitch02) {
            foreach ($VMNetworkAdapter in $AdaptersOnOldvSwitch02) {
                Write-Host 'Moving to correct vSwitch'
                Connect-VMNetworkAdapter -VMNetworkAdapter $VMNetworkAdapter -SwitchName $NewVirtualSwitch
                Write-Host "Setting VLAN $VlanId02"
                Set-VMNetworkAdapterVlan -VMNetworkAdapter $VMNetworkAdapter -Access -VlanId $VlanId02
            }
        }
    }



Shared Nothing Live Migration Leverages SMB 3.0 Under the Hood

Shared Nothing Live Migration

By now most of you must have heard about the Shared Nothing Live Migration capabilities introduced with Windows Server 2012 Hyper-V. If not, I suggest you check it out and then come back here for some extra insights into how it works.

Shared Nothing Live Migration is not magic, however. It is made possible by the fact that it relies on some of the new capabilities SMB 3.0 in Windows Server 2012 brought us. Once you know this, you also realize that this can be quite fast. The reason is that you can design your network for Shared Nothing Live Migration with 10Gbps or higher, Multichannel and RDMA for unprecedented throughput. Yup, if you invest in setting up networking right, the remaining bottleneck might be the amount of storage IO you can handle whilst reading from the source and writing to the target, or the CPU load you put on your host. Windows will protect you from draining your host beyond reason, by the way.
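
If you want to check whether that fast path is actually in play, the inbox cmdlets below are a quick, rough way to do so on the hosts involved. This is just a sketch, not an exhaustive validation.

    #Are there RDMA-capable, RDMA-enabled NICs on this host?
    Get-NetAdapterRdma | Where-Object { $_.Enabled }
    Get-SmbClientNetworkInterface | Where-Object { $_.RdmaCapable }
    #During a transfer, are multiple channels actually being used?
    Get-SmbMultichannelConnection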

Making Shared Nothing Live Migration Work

You need to set it up, of course, and do it right. Here's a list of steps you need to do / check on every Hyper-V host involved; a small PowerShell sketch of these settings follows below.

  1. Enable incoming and outgoing live migrations on all involved Hyper-V hosts, otherwise it will not work. If your hosts are part of a cluster this is taken care of for you.
  2. Select an authentication protocol (CredSSP or Kerberos)
    Kerberos authentication allows you to Live Migrate VMs without having to log in to the source host's server itself. Kerberos authentication does require you to configure constrained delegation in Active Directory (don't go for "Trust this computer for delegation to any services"; follow the principle of least privilege).
  3. Set the number of simultaneous Live Migrations. Experiment a little to find the best value for your environment.
  4. Set the network(s) for incoming Live Migrations. It's best to design this and not just use any network.

See Keith Mayer’s excellent blog for more details.
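
Here is a minimal PowerShell sketch of the settings above. The host name, migration count and subnet are examples; adjust them to your own design.

    $HyperVHost = 'NODE-A'
    #1. Enable incoming and outgoing live migrations
    Enable-VMMigration -ComputerName $HyperVHost
    #2. Select the authentication protocol (CredSSP or Kerberos)
    Set-VMHost -ComputerName $HyperVHost -VirtualMachineMigrationAuthenticationType Kerberos
    #3. Set the number of simultaneous live migrations
    Set-VMHost -ComputerName $HyperVHost -MaximumVirtualMachineMigrations 4
    #4. Restrict incoming live migrations to a dedicated network
    Set-VMHost -ComputerName $HyperVHost -UseAnyNetworkForMigration $false
    Add-VMMigrationNetwork -ComputerName $HyperVHost -Subnet '10.10.10.0/24'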

Constrained Delegation

Shared Nothing Live Migration needs some prep work security wise before it will work. In Active Directory you need to set up some constrained delegation permissions. To some people the concept of constrained delegation is brand new, but if you've been deploying multi-tiered web applications in your environment this is something you've dealt with many times before. It's the same approach you need to get a web client using Windows Authentication to talk via an IIS web app or service to a SQL Server database and/or read file data from a file share; you've configured this plenty of times.

Use an account to perform the Shared Nothing Live Migration that has administrator privileges on all computers that are involved. While you can use groups in AD to make your life and permission management easier when it comes to granting share permissions & NTFS rights on folders, it doesn't work that way with constrained delegation. Groups cannot be used here, so you'll need to use individual accounts. PowerShell scripting can help lessen the work if you have many Hyper-V hosts involved. In large environments (up to 64 nodes!) this inundates the constrained delegation tab with computer names, so PowerShell really is your friend here.

On each computer object you need to set the delegation permissions for the CIFS service and the Microsoft Virtual System Migration Service to all other computers you want to involve in Shared Nothing Live Migration as a source or a target.
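
A conceptual sketch of doing this with the ActiveDirectory module is shown below. The host names and domain are placeholders; it populates msDS-AllowedToDelegateTo, which corresponds to "Kerberos only" constrained delegation for the listed services.

    #Grant every Hyper-V host constrained delegation for CIFS and the
    #Microsoft Virtual System Migration Service to every other host (Kerberos only).
    Import-Module ActiveDirectory
    $Domain = 'contoso.com'
    $HyperVHosts = 'NODE-A', 'NODE-B', 'ZULU'
    foreach ($SourceHost in $HyperVHosts) {
        $Services = foreach ($TargetHost in ($HyperVHosts | Where-Object { $_ -ne $SourceHost })) {
            "cifs/$TargetHost", "cifs/$TargetHost.$Domain"
            "Microsoft Virtual System Migration Service/$TargetHost"
            "Microsoft Virtual System Migration Service/$TargetHost.$Domain"
        }
        Get-ADComputer -Identity $SourceHost |
            Set-ADObject -Add @{ 'msDS-AllowedToDelegateTo' = $Services }
    }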

IMPORTANT! Hey, why do we need CIFS constrained delegation here? Well, because Shared Nothing Live Migration under the hood leverages SMB 3.0. It creates a temporary file share on the target to get the job done! So once you realize that Shared Nothing Live Migration uses SMB 3.0 shares to do its magic, it becomes obvious why these constrained delegation permissions for CIFS in Active Directory are needed.

Visualizing the SMB 3.0 share in action

At the source server (ZULU), after starting the Shared Nothing Live Migration, we see that we have a connection to a share on the target server. That share is named after the source server with an ID, like ZULU.3341302342$. So it's a hidden share.

On the target server we run Get-SmbSession | fl and see that indeed the source computer has two sessions open on the target server.

Let's see if a share is created, using Get-SmbShare on the target. Yes there is.
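
For reference, here is a sketch of the cmdlets used to watch this from both sides. The target-side cmdlets are the ones mentioned above; Get-SmbConnection and the property and path filters are assumptions on my part, so adapt them as needed.

    #On the source host: the outbound SMB connection to the temporary share on the target
    Get-SmbConnection | Format-List ServerName, ShareName, Dialect

    #On the target host: the inbound sessions and the temporary (hidden) migration share
    Get-SmbSession | Format-List ClientComputerName, ClientUserName, NumOpens
    Get-SmbShare | Where-Object { $_.Name -like '*$' -and $_.Path -like '*\$VSM$\*' }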

In Computer Management the share also shows up on the target server.

In Explorer you can see this as a $VSM$ folder in the root of C:, which has a subfolder with the name of the source server and an ID, like ZULU.2541288334$. This subfolder is shared (hidden) and contains a shortcut to the volume where the selected target folder resides. That could be local storage (DAS), shared storage (CSV) or an SMB 3.0 share as well. Note that the folder name doesn't always match up with the share name, as they are taken from different Shared Nothing Live Migrations.

Security wise we're supposed to keep our hands off, and the security settings reflect this. But if you take ownership, you can go peek at what's in there, when writing a blog post for example. We indeed saw the copied disk size of the VM being live migrated increase in the selected target folder.

Conclusion

I find it pretty cool to see how this all works under the hood. Hope you found this educational and interesting as well. It's a testimonial to how SMB 3.0 can be leveraged for all kinds of interesting scenarios.

Shared Nothing Live Migration White Board Time – Scenario I

The Problem

Let's say you are very happy with your SAN. You just love the snapshots, the thin provisioning, deduplication, automatic storage tiering, replication, ODX and the SMI-S support. Life is good! But you have one annoying issue. For example, to get the really crazy IOPS for your SQL Server 2012 DAG nodes you would have to buy 72 SSDs to add to your tier 1 storage in that SAN. That's a lot of money if you know the price range of those. But perhaps you don't even have SSDs in your SAN. To get the required amount of IOPS from your SAN with SAS or NL-SAS disks in the second and third level storage tiers, you would need to buy a ridiculous amount of disks and, let's face it, waste that capacity. Per IOPS that becomes a very expensive and unrealistic option.

Some SSD-only SAN vendors will happily sell you a SAN that addresses the high IOPS need to help out with that problem. After all, that is their niche, their unique selling point: fixing the IOPS bottlenecks of the big storage vendors where and when needed. This is a cheaper solution per IOPS than your standard SAN can deliver, but it's still a lot of money, especially if you need more than a couple of terabytes of storage. Granted, they might give you some extra SAN functionality you are used to, but you might not need that.

Yes, I know there are people who say that when you have such needs you also have the matching budgets. Maybe, but what if you don't? Or what if you do, but you can put 500.000 € towards another need or goal? Your competitive advantage for pricing your products and winning customers might come from that budget.

Creative Thinking or Nuts?

Let's see if we can come up with a home grown solution based on Windows Server 2012 Hyper-V. If we can, this might solve your business need, save a ton of money and extend (or even save) the usefulness of your SAN in your environment. The latter is possible because you successfully eliminated the biggest disk IO from your SAN.

The Solution Scenario

So let's build 3 Hyper-V hosts, non-clustered, each with its own local SAS-based storage with commodity SSD drives. You can use either storage pools/spaces with a non-RAID SAS HBA or a RAID SAS HBA with controller-based virtual disks for this. If you've seen what Microsoft achieved with this during demos, you know you can easily get to hundreds of thousands of IOPS. Let's say you achieve half of what MSFT did in both IOPS and latency. Let's just put a number on it => that's about 500.000 IOPS and 5GB/s. Now reduce that for the overhead of virtualization, the position of the moon and the fact that things turn out a bit less than expected. So let's settle for 250.000 IOPS and 2.5GB/s. Anybody here who knows what these kinds of numbers would cost you with the big storage vendors' SANs? Right, case closed. Don't just look at the cost, put it into context and look at the value here. What does and can your SAN do, and at what cost?
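
Purely as an illustration, a minimal Storage Spaces sketch for the storage pools/spaces route could look like the lines below. The pool and disk names are made up, and the storage subsystem friendly name can differ per Windows version, so treat this as a starting point rather than a recipe.

    #Pool the local SSDs behind the non-RAID SAS HBA and carve out a mirrored virtual disk
    $PoolDisks = Get-PhysicalDisk -CanPool $true
    New-StoragePool -FriendlyName 'SSDPool' -StorageSubSystemFriendlyName 'Storage Spaces*' -PhysicalDisks $PoolDisks
    New-VirtualDisk -StoragePoolFriendlyName 'SSDPool' -FriendlyName 'SQLData' -ResiliencySettingName Mirror -UseMaximumSize
    #Then initialize, partition and format the resulting disk as usual and point the VM's virtual hard disks at it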

OK, we lose some performance due to the virtualization overhead. But let's face it, we can use SR-IOV to get the very best network performance. We have hundreds of thousands of IOPS. All the cores on the hosts are dedicated to a single virtual machine running a SQL Server DAG node, and bar 4GB of RAM for the OS, we can give all the RAM in the hosts to the VM. This one-VM-to-one-host mapping delivers a tremendous amount of CPU, memory, network and storage capabilities to your SQL Server. This is because it gets exclusive use of the resources on the host, bar those that the host requires to function properly.

In this scenario it is the DAG that provides high availability to the SQL Server database. So we do not mind losing shared storage here.

Because we have virtualized the SQL Server, you can leverage Shared Nothing Live Migration to move the virtual machines with SQL Server to the central storage of the SAN without downtime if the horsepower is no longer needed. That means that you might migrate another application to those standalone Hyper-V hosts. That could be a high disk IO intensive application, perhaps load balanced in some way, so you can have multiple virtual machines mapped to the hosts (1 to 1, many to one). You could even automate this all and use the "Beast" as a dynamic resource based on temporal demands.

In the case of the SQL Server DAG you might opt to keep one DAG member on the SAN so it can be replicated and backed up via snapshots or whatever technology you are leveraging on that storage.

Extend to Other Use Cases

More scenarios are possible. You could build such a beast to be a Scale-Out File Server, or use PCI RAID/shared SAS if you need shared storage to build a Hyper-V cluster when your apps require it for high availability.

The latter looks a lot like a cluster in a box actually. I don't think we'll see a lot of iSCSI in cluster-in-a-box scenarios; SAS might be the big winner here up to 4 nodes (without a "SAS switch", which brings even "bigger" scenarios to life with zoning, high availability, active cables and up to 24Gbps of bandwidth per port).

Using a SOFS means that if you also have SMB 3.0 support on your central SAN, you can leverage RDMA for shared nothing live migration, which could help out with the potentially very large VHDs of your virtual SQL Servers.

Please note that the big game changer here compared to previous versions of Windows is Shared Nothing Live Migration. This means that you now have virtual machine mobility. High performance storage and the right connectivity (10Gbps, teaming, possibly RDMA if using SMB 3.0 as source and target storage) mean we no longer mind having separate storage silos that much. This opens up new possibilities for alleviating IOPS issues. Just size this example to your scenarios & needs and think about what it can do for you.

Disclaimer: This is white board thinking & design, not a formal solution. But I cannot ignore the potential and possibilities. And for the critics: no, this doesn't mean that I'm saying modern SANs don't have a place anymore. Far from it, they solve a lot of needs in a lot of scenarios, but they do have some (very expensive) pain points.

Hyper-V Shared Nothing Live Migration In Windows Server 2012– VM Mobility Rules

I see and hear some people shrug at the idea of Shared Nothing Live Migration, dismissing it as marginally useful. Some do state they’ll have it as well but that it’s not that valuable. Well I disagree totally. A lot of the time these remarks are due to a lack of understanding about how several technologies in the Microsoft stack work together. Combine this with tunnel vision and the fear of some vendors and you get a lot of FUD.

I advise you to look beyond the virtualization stack, to the issues that people who are building infrastructure for dynamic, flexible cloud data centers are dealing with.

Look, as "architects" we have to design & build for failure. We all know that it's just a matter of time before things go BOINK. So we build in redundancy, some of it within a silo, some of it between silos. The two approaches complement each other. What this gives you is options, and everybody who knows me, especially those who work with me, has heard my mantras: "Assumptions are the mother of all F* Ups" and "Options, options, options". Make sure you design & build in options. This way you can maneuver yourself out of a bad situation. Don't ever assume you're out of options, especially not when you put some in the design on purpose. It's also very useful beyond that, because a lot of you might agree with me that silos and forklift, downtime-inducing upgrades, migrations, transitions or replacements are expensive and bad. This is where Shared Nothing Live Migration comes into play. You gain mobility over silos. That silo might be a server, a cluster, storage or a mixture of them all.

With Shared Nothing Live Migration we can migrate virtual machines between those silos with nothing more than a network cable. This is huge, people. You are no longer trapped in that silo. In this context it provides you with all the options & flexibility mobility gives you, even if the technology itself is not about high availability.

Some very useful scenarios

  1. Migrate virtual machines from an old cluster to a new cluster without any down time
  2. Migrate virtual machines from stand alone Hyper-V hosts to a failover cluster without any down time
  3. Migrate virtual machines from one stand alone host to another one for maintenance, again, without any down time
  4. Choose different types of storage & Hyper-V deployment depending on IOPS, redundancy, availability and manageability needs. With Shared Nothing Live Migration you can be confident that you can move your virtual machine from one environment to the other when needs change. This is breaking the storage silo boundaries open, people! This is huge … think about it.

How it works

The details are for another post, but basically it is made possible by the combination of Live Storage Migration and Live Migration.

First the Storage is Live Migrated

After the Live Storage Migration is done the state of the virtual machines is copied and synchronized.

This Is Mobility

I hear the competition shrug. It isn't high availability. Well indeed, no one who understands the feature ever said it was. It's virtual machine mobility. Look at the scenarios above and you'll see that this ability could very well be a game changer in how we look at storage & design solutions.

Speed & Performance

What did we hear on this front: “it will be too slow to be really useful”. Really? Well let’s see:

  1. The world is converging to 10Gbps and after that 40Gbps and up will come
  2. NIC Teaming in box with Windows Server 2012, which can provide more bandwidth.
  3. SMB 3.0 Multichannel. This provides multiple channels per connection spreading the load over multiple CPUs
  4. SMB Direct, have you seen the speeds this achieves?

Before you state that this doesn't work on Live Migration … as confirmed at TechEd 2012 Europe with Jose Baretto, this does work when both the source AND the target are an SMB 3.0 share. This means yet another reason to use SMB 3.0 shares for your Hyper-V storage needs! So unlike what Tad at vLimited keeps saying, unhindered by any knowledge, it is a very valuable feature and it can be extremely fast given the right connectivity and storage that can handle the IOPS. And no, the fact that it's unbuffered doesn't impact this too much. Test this by using xcopy/robocopy /J with a VHD over your infrastructure.
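
For example, a rough way to run that test (the host names, paths and file name below are made up) is to push a large virtual disk across the same network with unbuffered I/O:

    #Copy a large VHDX between the hosts with unbuffered I/O (/J), roughly mimicking the migration traffic
    robocopy '\\NODE-A\D$\TestVMs' '\\ZULU\D$\TestVMs' 'BigTestDisk.vhdx' /J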

Even if you're on a budget and cannot go for the RDMA NICs & SMB 3.0, you have several options to get very decent virtual machine mobility and not be stuck in a silo. And for those who want to leverage this feature to create an agile & mobile virtual environment, you have some very nice technologies available to optimize for your needs & budgets.

Conclusion

Virtual Machine mobility and storage mobility are very interesting features that provide for a previously unknown flexibility. Windows Server 2012 makes us rethink our storage approaches (I sure am) and I’m very interested in seeing how this will evolve.