Shared nothing live migration with a virtual switch change and VLAN ID configuration

Introduction

I was working on a hardware refresh, consolidation and upgrade to Windows Server 2019 project. This mainly boils down to cluster operating system rolling upgrades from Windows Server 2016 to Windows Server 2019, with new servers replacing the old ones. Pretty straightforward. So what does this have to do with shared nothing live migration with a virtual switch change and VLAN ID configuration?

Due to the consolidation aspect, we also had to move virtual machines from some older clusters to the new clusters. The old cluster nodes have multiple virtual switches that connect to different VLANs. Some of the virtual machines have only one virtual network adapter that connects to one of those virtual switches. Many of the virtual machines are multihomed. The number of virtual NICs per virtual machine ranged from 1 to 3. So we had the challenge of doing a shared nothing live migration with a virtual switch change and VLAN ID configuration. All this without downtime.

Meeting the challenge

In the new cluster, there is only one converged virtual switch. This virtual switch attaches to trunked network ports with all the required VLANs. As we have only one virtual switch on the new Hyper-V cluster nodes, its name differs from the virtual switch names on the old Hyper-V cluster nodes. This prevents live migration. Fixing this is our challenge.

First of all, Compare-VM is your friend for finding blocking incompatibilities between the source and the target nodes. You can read about that in many places. Here, we focus on our challenge.
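As a minimal sketch (the VM name is a placeholder here; NODE-A and ZULU are the example host names used in the sample script below), you can generate a compatibility report and inspect any blockers it lists:

#Generate a compatibility report for a planned shared nothing migration (VM name is a placeholder)
$Report = Compare-VM -Name 'SomeVM' -ComputerName 'NODE-A' -DestinationHost 'ZULU'
#Blocking issues, such as a virtual switch that does not exist on the target, show up here
$Report.Incompatibilities | Format-Table Message, MessageId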

Making shared nothing migration work

The first step is to make sure shared nothing migration works. We can achieve this in several ways.

Option 1

We can disconnect the virtual machine network adapters from their virtual switch. While this allows you to migrate the virtual machines, it leads to connectivity loss. This is not acceptable.

Option 2

We can preemptively connect the virtual machine network adapters to a virtual switch with the same name as the one on the target and enable the VLAN ID. This means you have to create such a switch on the old hosts and need NICs to do so. But unless you configure and connect those to the network just like on the new Hyper-V hosts, this also leads to connectivity loss. That was not possible in this case, so this option is again unacceptable.

Option 3

What I did was create dummy virtual switches on the target hosts. For this purpose, I used some spare LOM NICs. I did not configure them otherwise. As a matter of fact, I did not even connect them. Just the fact that they exist with the same names as on the old Hyper-V hosts is sufficient to make shared nothing migration possible. Actually, this is a good time to remind ourselves that we don't even need spare NICs. Dummy private virtual switches that are not attached to a NIC at all will also do.

After we have finished the migrations, we just delete the dummy virtual switches. That's all there is to do if you used private ones. If you used spare NICs, just disable them again. Now everything on the new cluster nodes is as it was and should be.
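As a minimal sketch of that approach (the switch names and the target host ZULU match the sample script below), creating and later removing the dummy private switches could look like this:

#Create dummy private vSwitches on the target host, named exactly like the virtual switches on the old hosts
New-VMSwitch -ComputerName 'ZULU' -Name 'vSwitch-VLAN500' -SwitchType Private
New-VMSwitch -ComputerName 'ZULU' -Name 'vSwitch-VLAN600' -SwitchType Private
#Once all migrations are done, clean them up again
Remove-VMSwitch -ComputerName 'ZULU' -Name 'vSwitch-VLAN500' -Force
Remove-VMSwitch -ComputerName 'ZULU' -Name 'vSwitch-VLAN600' -Force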

Turning shared nothing migration into shared nothing live migration

Remember, we need zero downtime. Keep in mind that as long as the shared nothing live migration is running, all is well. We have connectivity to the original virtual machines on the old cluster nodes. As soon as the shared nothing live migration finishes, we do 2 things. First, we connect the virtual network adapters of the virtual machines to the new converged virtual switch. Second, we enable the VLAN ID. To achieve this, we script it out in PowerShell. As a result, it is so fast that we drop only 1 or 2 pings. Just like a standard live migration.

Below you can find a conceptual script you can adapt for your own purposes. For real migrations, add logging and error handling. Please note that to leverage shared nothing migration you need to be aware of the security requirements. Credential Security Support Provider (CredSSP) is the default option. If you want or must use Kerberos, you must configure constrained delegation in Active Directory.

I chose to use CredSSP as we would decommission the old hosts soon afterward anyway. It also means we did not need any Active Directory work done, which can be handy if that is not easy to arrange in the environment you are in. We started the script on every source Hyper-V host, migrating a bunch of VMs to a new Hyper-V host. This worked very well for us. Hope this helps.
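If you need to check or change those migration settings, a hedged sketch looks like this (NODE-A and ZULU are the example host names; adapt to your environment). Keep in mind that with CredSSP you have to kick off the move while logged on to the source host itself, which is exactly how we ran the script.

#Check whether live migration is enabled and which authentication protocol is configured
Get-VMHost -ComputerName 'NODE-A', 'ZULU' | Format-Table ComputerName, VirtualMachineMigrationEnabled, VirtualMachineMigrationAuthenticationType
#Enable live migration and select CredSSP (the default) or Kerberos on a host
Enable-VMMigration -ComputerName 'ZULU'
Set-VMHost -ComputerName 'ZULU' -VirtualMachineMigrationAuthenticationType CredSSP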

Sample Script

    #The source Hyper-V host
    $SourceNode = 'NODE-A'
    #The LUN you want to storage migrate your VMs away from
    $SourceRootPath = "C:\ClusterStorage\Volume1*"

    #The target Hyper-V host
    $TargetNode = 'ZULU'
    #The storage path you want to storage migrate your VMs to
    $TargetRootPath = "C:\ClusterStorage\Volume1"

    $OldVirtualSwitch01 = 'vSwitch-VLAN500'
    $OldVirtualSwitch02 = 'vSwitch-VLAN600'
    $NewVirtualSwitch = 'ConvergedVirtualSwitch'
    $VlanId01 = 500
    $VlanId02 = 600

    #Grab all the VMs we find that have virtual disks on the source CSV - WARNING: for W2K12 you'll need to loop through all cluster nodes.
    $AllVMsOnRootPath = Get-VM -ComputerName $SourceNode | Where-Object { $_.HardDrives.Path -like $SourceRootPath }

    #We loop through all VMs we find on our SourceRootPath
    ForEach ($VM in $AllVMsOnRootPath) {
        #We generate the final VM destination path
        $TargetVMPath = $TargetRootPath + "\" + ($VM.Name).ToUpper()
        #Grab the VM name
        $VMName = $VM.Name
        $VM.VMid
        $VMName


        if ($VM.isclustered -eq $True) {
            write-Host -ForegroundColor Magenta $VM.Name "is clustered and is being removed from cluster"
            Remove-ClusterGroup -VMId $VM.VMid -Force -RemoveResources
            Do { Start-Sleep -seconds 1 } While ($VM.isclustered -eq $True)
            write-Host -ForegroundColor Yellow $VM.Name "has been removed from cluster"
        }
    
        #Do the actual storage migration of the VM, $TargetVMPath creates the default subfolder structure
        #for the virtual machine config, snapshots, smart paging & virtual hard disk files.
        Move-VM -Name $VMName -ComputerName $VM.ComputerName -IncludeStorage -DestinationStoragePath $TargetVMPath -DestinationHost $TargetNode

        $OldvSwitch01 = Get-VMNetworkAdapter -ComputerName $TargetNode -VMName $VMName | Where-Object SwitchName -eq $OldVirtualSwitch01

        if ($Null -ne $OldvSwitch01) {
            foreach ($VMNetworkAdapter in $OldvSwitch01) {
                Write-Host 'Moving to correct vSwitch'
                Connect-VMNetworkAdapter -VMNetworkAdapter $VMNetworkAdapter -SwitchName $NewVirtualSwitch
                Write-Host "Setting VLAN $VlanId01"
                Set-VMNetworkAdapterVlan -VMNetworkAdapter $VMNetworkAdapter -Access -VlanId $VlanId01
            }
        }
        $OldvSwitch02 = Get-VMNetworkAdapter -ComputerName $TargetNode -VMName $VMName | Where-Object SwitchName -eq $OldVirtualSwitch02
        if ($NULL -ne $OldvSwitch02) {
            foreach ($VMNetworkAdapter in $OldvSwitch02) {
                Write-Host 'Moving to correct vSwitch'
                Connect-VMNetworkAdapter -VMNetworkAdapter $VMNetworkAdapter -SwitchName $NewVirtualSwitch
                Write-Host "Setting VLAN $VlanId02"
                Set-VMNetworkAdapterVlan -VMNetworkAdapter $VMNetworkAdapter -Access -VlanId $VlanId02
            }
        }
    }
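After a migration run, a quick spot check on the target host shows whether the virtual network adapters ended up in Access mode with the expected VLAN (the VM name is a placeholder):

#Verify the VLAN configuration of a migrated VM on the target host
Get-VMNetworkAdapterVlan -ComputerName 'ZULU' -VMName 'SomeMigratedVM'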



Trunking With Hyper-V Networking

When doing lab work or real-life implementations, you'll need to go beyond the basic 101 stuff to build solutions every now and then. This is especially true when using virtual network appliances. Networking means you'll be dealing with Link Aggregation Groups, trunking, MLAG, routing, LACP … in short, the tools of the trade when doing networking. In my experience, I use trunking in Hyper-V mostly to mimic real-world scenarios where trunking is used (firewalls, routers, load balancers). These tend to be limited in usable ports in real life. So even before you run out of physical ports on your Hyper-V host to work with, we leverage trunking to mimic the real-life environment. This leads us to trunking with Hyper-V networking.

I for one have used this on 10Gbps ports on both physical and virtual load balancers in the uplink to the switches. As you can imagine, when doing redundant (teamed) cabling with HA load balancers you're consuming 10Gbps ports, and not all VLANs warrant a dedicated 10Gbps uplink, even if you had them.

Trunking & VLAN’s are the way we deal with this in the network hardware world and we can do the same in Hyper-V. In the Hyper-V Manager GUI you will not find a way to define a trunk on an vNIC attached to a vSwitch. But this can be done via PowerShell. So please do not reject Hyper-V as not being up to the job. It is. Let me show you how you can do trunking with Hyper-V networking.

Generally on a clean install I dump the default vNIC. DO NOT DO this blindly on an existing deployed appliance virtual machine.

#Delete the default network adapter
Remove-VMNetworkAdapter -VMName VLM200-1 -Name "Network Adapter"

I then add the number of Ethernet ports I need on my Kemp Technologies virtual Load Master.

#Create the VLM200 ports (4, like its physical counterpart)
For ($Count=0; $Count -le 3; $Count++)
{
    Add-VMNetworkAdapter -VMName VLM200-1 -Name "Eth$Count"
}

A peek at our handiwork via Get-VMNetworkAdapter -VMName VLM200-1 shows our 4 ports.


As you can see, I like to give my network adapters distinctive names. In combination with the switch name, it enables me to identify the NICs better. Combine that with a good naming policy inside the VM if possible. In Windows Server 2016 you can hot add and remove vNICs, and the new "Device Naming" functionality (see Hot add/remove of network adapters and enabling device naming in Windows Server Hyper-V) only makes the experience better in relation to uptime and automation.
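Device naming is switched on per vNIC; a small sketch, reusing the adapter from the example above (Windows Server 2016 or later):

#Expose the vNIC name inside the guest via device naming
Set-VMNetworkAdapter -VMName VLM200-1 -Name "Eth3" -DeviceNaming On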

Now let’s say we use Eth0 for management and for the HA heartbeat. That leaves Eth2 and Eth3 for workloads. We could even aggregate these (redundancy, heartbeat). In this demo we’ll configure Eth3 as a trunk with a list of allowed VLANs. We keep the native VLAN ID at 0, as it is by default. Only change it in specific situations where you have changed the native VLAN in the network.

#Trunk Eth3 and add the required VLAN IDs
Set-VMNetworkAdapterVlan -VMName VLM200-1 -VMNetworkAdapterName "Eth3" -Trunk -AllowedVlanIdList "10,20,30" -NativeVlanId 0

Which delivers us what we need to get our network appliance going.
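You can check the result from the host with Get-VMNetworkAdapterVlan:

#Show the VLAN configuration of all vNICs on the appliance VM
Get-VMNetworkAdapterVlan -VMName VLM200-1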


In your virtual appliance you can now create VLANs on Eth3. How this shows up depends on the appliance, in this example a Kemp Virtual Load Master. Here we mimic a 4-port Load Master. We’re not doing trunking because we ran out of the maximum supported number of NICs we can add to a virtual machine; we do it to mimic the physical appliance.


A word of warning: you will not see this configuration in the settings via the GUI. Manipulating the VLAN settings in the GUI will overwrite them without a warning. So be careful with the configuration of your virtual network appliance(s). As an example, I’ll touch the VLAN setting of Eth3 in the GUI and give it VLAN 500.


We now take a look at the VLAN settings of the appliance’s vNICs.
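Again with Get-VMNetworkAdapterVlan, this time scoped to the adapter we just touched in the GUI:

#Check just the Eth3 adapter after the GUI change
Get-VMNetworkAdapterVlan -VMName VLM200-1 -VMNetworkAdapterName "Eth3"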


That vNIC is now in Access mode with VLAN 500. Ouch, that will seriously ruin your day in production! Be careful!

On top of this, some appliances do not respond well to such misconfigurations on the switch side (both physical and virtual switches). This leads not only to service interruption but can also leave you unable to manage the appliance, requiring a reboot, etc.

Anyway, so yes, you can do trunking with Hyper-V networking on a vNIC, but this normally only makes sense if you have an appliance running that knows what to do with a trunk, such as a virtual firewall, router or load balancer.

The Mysterious Case of Infrequent Network Connectivity Issues on 2 Hyper-V VMs Out of 40 Guests

In The Mysterious Case of Infrequent Network Connectivity Issues on 2 Hyper-V VMs Out of 40 Guests I share a troubleshooting experience with you. I was asked if I could possibly take a look at a weird but very infrequent network issue with 2 VMs (W2K12R2) on a cluster (W2K12R2) running over 40 guests. Sure! These 2 virtual machines worked well 98% of the time. About 2% of the time they just fell off the network, sometimes both vNICs, sometimes both VMs. When I asked what they meant, they said unreachable. But they could not find anything wrong, as all other VMs ran fine with the same configuration on the same hosts. They told me there was nothing in the event logs of either the hosts or the guests to explain any of this. A reboot or two or even a live migration sometimes fixed the issue. Normally the monthly patch cycle prevented too many problems with connectivity. Pretty weird! Usually bad firmware, drivers or bad offload feature support can cause issues, but that would not target just 2 out of 40 VMs that have the same settings.

It was only these 2 VMs, no matter what host they were running on in the cluster. As the vNICs shared the same 2 vSwitches (teamed) with all the other VMs that never had issues, I was pretty sure the configuration of the switches, NICs, teams and vSwitches was OK. This was verified for due diligence and it checked out on all hosts as expected. All firmware, drivers and offloads were configured correctly.

I also checked the VLAN settings of the vNICs themselves for those two VMs and compared them to a couple of VMs that had no issues whatsoever, and found them to be identical.

At first everything seemed fine and I was stumped. The event logs, both in the VMs and on the hosts, were squeaky clean. After that exercise I started running some PowerShell cmdlets to take a look at the configuration of the VMs on the hosts. You see, the GUI does not expose all possible configurations and I wanted to look at every configuration option. That’s when I found the following.
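The kind of check that surfaced it looks like this (DNS01 is the VM name used in the commands further down; the healthy VM name is a placeholder):

#Dump the VLAN configuration of a problem VM and of a known good VM (placeholder name) and compare them
Get-VMNetworkAdapterVlan -VMName DNS01
Get-VMNetworkAdapterVlan -VMName SomeHealthyVM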


The vNICs of the 2 offending VMs were in Access mode while the VlanList had a single value of 0 (basically meaning untagged; it’s a reserved VLAN used for priority tagging and its use is not 100% standard across switch vendors). This just didn’t compute. In the GUI we did not see this; there, things looked normal.


You cannot even set this in the GUI, it won’t allow you.


But when run as a PowerShell command, it does allow you to make this configuration. So maybe that’s what happened.

Set-VMNetworkAdaptervlan -VMName DNS01 -Access -VlanId 0

No one knew, nor can I tell you. But I tested and verified that the command does run and makes that configuration without complaint, which is weird. Anyway, I resolved the issue by running the following command.

Set-VMNetworkAdapterVlan -VMName DNS01 -Untagged


The rare connectivity issue disappeared and all was well in 100% of the cases. That’s how The Mysterious Case of Infrequent Network Connectivity Issues on 2 Hyper-V VMs Out of 40 Guests came to a happy end.

Troubleshooting Intermittent Virtual Machine Network Connectivity

I was asked to take a look at an issue with virtual machines losing network connectivity. The problems were described as follows:

Sometimes some VMs had connectivity, sometimes they didn’t. It was not tied to specific virtual machines. Sometimes the problem was not there, then it showed up again. It was not an issue of a wrong subnet mask or gateway.

They suspected firmware or driver issues. Maybe it was a Windows NIC teaming bug or problems with DVMQ or NIC offload settings. There are a lot of potential reasons; just Google “intermittent VM connectivity issues Hyper-V” and you’ll get a truckload of options.

So a round of wishful firmware and driver upgrading started, followed by a round of wishful disabling of network features. That’s one way to do it. But why not sit back and look at the issue?

Based on what they said, I looked at the environment and asked if it was tied to a specific host, as only VMs on one of the hosts had the issue. Could it be after a live migration or a VM restart? They didn’t really know, but it could. So we started looking at the hosts. All teams for the vSwitch were correctly configured on all hosts. No tagged VLAN on the member NICs. No extra team interfaces that would violate the rule that there can be only one if the team is used by a Hyper-V switch. They used switch independent teaming mode with the load balancing mode set to Dynamic, all members active. Perfect.
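A quick, hedged way to verify that part from PowerShell (team and interface names will differ in your environment):

#Check the teaming mode and load balancing algorithm of the host teams
Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm
#Verify there is only one team interface per vSwitch team and that it carries no VLAN
Get-NetLbfoTeamNic | Format-Table Name, Team, VlanID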

I asked if they sometimes used tagged VLANs on the VMs. They said yes, which gave me a clue they had trunking or general mode configured on the ports. So I looked at the switches to see what the port configuration was like. Guess what. All ports on both switches were correctly configured, bar the ports of the vSwitch team members on one Hyper-V host. The one with the problematic VMs. The two ports were in general mode, but the port on the top switch had PVID* 100 and the one on the bottom switch had PVID 200. That was the issue. If a VM “landed” on the team member with PVID 200, it had no network connectivity.

[Diagram: Hyper-V vSwitch team with the wrong native VLAN (PVID) on one of the member switch ports]

* PVID (switchport general pvid 200) is the default VLAN of the port; in Cisco speak that would translate into “native VLAN”, as in switchport trunk native vlan 200.

Yes, NIC firmware and drivers have issues. There are bugs or problems with advanced features once in a while. But you really do need to check that the configuration is correct and that the design or setup makes sense. Do yourself a favor by not assuming anything. Trust but verify.