Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster

Introduction

You’ll find the following recommendations online about optimizing Live Migrations:

  1. Use bigger pipes (10Gbps is better than 1Gbps)
  2. Enable Jumbo Frames
  3. Up the Receive Buffer to 8192 (Exchange 2010 virtualization recommendation for Live Migration)

As we’ve been building Hyper-V clusters since the early betas, let me share some experiences with this. For the curious: I used Intel® Ethernet X520 SFP+ Direct Attach Server Adapters & DELL PowerConnect 8024F 10Gbps switches for my testing. See my blog posts on considerations about the use of 10Gbps in Hyper-V clusters here:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

Bigger pipes are better

On bigger pipes I can only say that if you can afford them and need them you should get them. End of discussion.

Jumbo frames rock

Jumbo frames help out a lot (+/- 20 %), especially with the larger memory virtual machines.
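If you prefer to script the jumbo frame setting rather than click through the NIC’s advanced driver properties, the sketch below shows the idea. Assumptions: the NetAdapter cmdlets only exist on Windows Server 2012 and later (on 2008 R2 you set this in the driver’s advanced properties dialog), the adapter name "LM-NIC" is a placeholder, and the exact keyword and accepted values depend on your driver. The ping test at the end works on any version and is how you verify jumbo frames actually make it end to end, switches included.

# Assumption: Windows Server 2012+ NetAdapter module; "LM-NIC" is a placeholder adapter name.
# Set the standardized jumbo packet keyword to 9014 bytes (accepted values vary per driver).
Set-NetAdapterAdvancedProperty -Name "LM-NIC" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Check what the driver actually accepted.
Get-NetAdapterAdvancedProperty -Name "LM-NIC" -RegistryKeyword "*JumboPacket"

# Verify end to end with a don't-fragment ping: 8972 bytes of payload + 28 bytes of headers = a 9000 byte frame.
# 10.10.180.12 is a hypothetical Live Migration IP of another cluster node.
ping -f -l 8972 10.10.180.12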

The golden nugget

So far so good, but there is one golden nugget of information I want to share. There is a little trip wire that can prevent you from getting optimal performance: advanced power settings in the BIOS. If you read my blogs you might have come across the post Consider CPU Power Optimization Versus Performance When Virtualizing. I encourage you to go and read it, as it holds a lot of good info and is very relevant to this post, because we have yet another reason to make sure your BIOS is set right to achieve a decent return on your investment in quality hardware.

In our experience those power saving settings, the C states and the C1E state, are also not very helpful when it comes to Live Migration & such. I went from a meager 20% bandwidth use all the way up to 35-45% at best with jumbo frames enabled and the power settings set to "Full Power". A lot better, but still not very impressive.


Now go ahead and disable the C states AND the C1E state to achieve 55% to 65% bandwidth use.
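The C states and C1E themselves live in the BIOS, so you can’t flip them from within Windows, but you can at least make sure the operating system isn’t adding its own throttling by selecting the High performance power plan. A quick sketch with powercfg; the GUID below is the built-in High performance scheme on Windows Server 2008 R2, but double check it with the list command on your own boxes.

# List the available power plans and see which one is active.
powercfg -list

# Activate the built-in "High performance" plan.
powercfg -setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c

# Confirm the active plan.
powercfg -getactivescheme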


Now the speed of a live migration varies greatly between virtual machines that are idle and ones running a full load, both CPU & memory wise. It also depends on the load on the hosts you’re migrating from and to, but this impact is smaller when you disable those advanced CPU power settings.

Look at the following screenshots:

[Screenshot] A SQL Server with 50GB of RAM being live migrated over 10Gbps. Jumbo frames enabled, Power Settings optimized but with C1E & C States enabled.

[Screenshot] A SQL Server with 50GB of RAM being live migrated over 10Gbps. Jumbo frames enabled, Power Settings optimized but with C1E & C States disabled.

The live migration of this virtual SQL Server takes between 74 and 78 seconds. Not bad! As a rough sanity check: 50 GB is 400 Gbit, which at the 55-65% link utilization mentioned above works out to roughly 62-73 seconds of raw memory copy, so 74-78 seconds end to end, including re-copying dirty pages and the final switchover, is right in line with expectations.


By the way, these settings also help with 1Gbps, but there the gain isn’t as spectacular: you use 99% instead of 75-80% of your bandwidth. An improvement, yes, but not on the same scale as with 10Gbps for speeding up Live Migrations.

As you can see in this post on the TechNet support forums, this seems to be a common occurrence, so it’s not just me who’s seeing things: Live Migration on 10GbE only 16%. Even Dell chimed in there, confirming these findings in their labs.

Receive Buffer

There is one setting that’s been advised for Exchange 2010 virtualization with Hyper-V that I have not seen improve speeds, and that’s upping the Receive Buffer to 8192. You can read this in Best Practices for Virtualizing Exchange Server 2010 with Windows Server® 2008 R2 Hyper V™. In some of the cases I tested, this even reduced the results, especially when you have C1E & C states enabled. It is also a confusing recommendation, as they state to set the Receive Buffer to 8192. This value, however, is dependent on the NIC type and driver, so you might only be able to set it to 4096 or so. The guidance should state to set it as high as possible, but I have not seen any benefits from doing so. Do mind that I did not test this with a Hyper-V cluster running a virtualized Exchange 2010 guest, so your mileage may vary. Trust but verify is the age-old adage. Also keep in mind I’m running 10Gbps, so the effect of this setting might not be what it could do for a 1Gbps network, but on the whole I’m not convinced. If you implement all the other recommendations you’ll already saturate a 1Gbps link.
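For completeness, if you do want to experiment with the receive buffers yourself, this is the kind of thing to script so you first see what the driver supports before pushing a value. Again a sketch with assumptions: the NetAdapter cmdlets are Windows Server 2012 and later (on 2008 R2 it’s the driver’s advanced properties dialog), "LM-NIC" is a placeholder name, and the *ReceiveBuffers keyword and its maximum are driver dependent, which is exactly the point made above.

# Assumption: Windows Server 2012+; "LM-NIC" is a placeholder adapter name.
# See what the driver currently uses and which values it will accept.
Get-NetAdapterAdvancedProperty -Name "LM-NIC" -RegistryKeyword "*ReceiveBuffers" |
    Select-Object DisplayName, RegistryValue, ValidRegistryValues

# Push it up, using the maximum your driver reports (e.g. 4096 instead of 8192 on some NICs).
Set-NetAdapterAdvancedProperty -Name "LM-NIC" -RegistryKeyword "*ReceiveBuffers" -RegistryValue 8192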

What does this mean?

The sad news is that in virtual environments or other high performance configurations the penguins (read: power savings) have to give way to performance. I wish it were different, but unfortunately it isn’t.

By the way, this is vendor agnostic. You’ll see this with HP, DELL and CISCO in all form factors, whether they are tower, rack or blade servers. The main thing you need to make sure of is that the BIOS allows you to disable the C states and set the power options. Not all vendors/BIOS versions allow for this, I read, so make sure you check. Some CISCO blades have been annoying on this front, ruining the performance of VDI projects with less than optimal CPU performance, but they have since released an updated BIOS to fix this.

Look, it makes no sense saving on power if it means you’ll buy more servers to compensate for the lack of performance per unit. In my honest opinion a lot of the hardware power optimizations are awesome, but they still have a long way to go in making sure they don’t incur such a hit on performance. Right sizing servers in number & type of CPU, power supplies etc. still seems the best way to avoid wasting energy and money. Buying more power than needed and counting on the power consumption optimizations to reduce operating cost can be a good idea for protecting your investment against expected future increases in resource demand within the service life of your hardware. On average that is 3 to 5 years, depending on the environment & needs.

Conclusion

Three things are needed for lightning fast Live Migrations:

  1. Bandwidth. Hence the 10Gbps network. There is no substitute for bigger pipes.
  2. Jumbo Frames. Configure them right & you’ll reap the benefits.
  3. Disable C1E & C states, and configure your servers’ power options for maximum performance.

As for the receive buffer: I have not been able to confirm it has a big impact on Live Migration speed or does any good at all. Test it to find out whether it works for you.

Remember that you’ll be able to do multiple Live Migrations in parallel with Windows 8, so a 10Gbps pipe will be used at full capacity then. Being able to use more networks for Live Migration will only increase the ability to evacuate a host fast or to move virtual machines for load balancing across a cluster. If you look at the RDMA, InfiniBand and 40/100Gbps evolutions becoming available in the next 12 to 36 months, 10Gbps will become a lot more mainstream while at the same time the options for network connectivity will become more diversified. 10Gbps prices are dropping, but for the moment they remain high enough to keep people away.
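As a teaser for that Windows 8 (Windows Server 2012) scenario: the number of simultaneous live migrations becomes a per-host setting you can script. A minimal sketch, assuming the Windows Server 2012 Hyper-V PowerShell module (not available on 2008 R2) and a network that can actually feed that many streams:

# Assumption: Windows Server 2012+ Hyper-V module.
# Allow up to 4 simultaneous live migrations and 2 simultaneous storage migrations on this host.
Set-VMHost -MaximumVirtualMachineMigrations 4 -MaximumStorageMigrations 2

# Check the current settings.
Get-VMHost | Select-Object ComputerName, MaximumVirtualMachineMigrations, MaximumStorageMigrations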

System Center Virtual Machine Manager 2008 R2 Error 12711 & The cluster group could not be found (0x1395)

The Issues

I recently had to go and fix some issues with a couple of virtual machines in SCVMM 2008 R2. There was one that failed to live migrate with the following error:

Error (12711)
VMM cannot complete the WMI operation on server HopelessVm.test.lab because of
error: [MSCluster_ResourceGroup.Name=" df43bf60-7216-47ed-9560-7561d24c7dc8"] The cluster group could not be found.

(The cluster group could not be found (0x1395))
 
Recommended Action
Resolve the issue and then try the operation again

Other than that it looked fine and could be managed with SCVMM 2008 R2. Another one was totally wrecked, it seemed: it was in a failed state after an attempted live migration and you couldn’t do anything with it anymore. Repair was “available” but every option there failed, so basically that was the end of the game with that VM. Both issues can be resolved with the approach I’ll describe below.

The Cause

After some investigation the cause turned out to be the fact that this virtual machine had been removed from the failover cluster as a resource, exported & imported using Hyper-V Manager on one of the cluster nodes, and then added back to the failover cluster to make it highly available again. All this was done without removing it from SCVMM 2008 R2. By the way, as mentioned above in “The Issues”, this can get even worse than just failing live migrations. The same scenario can lead to virtual machines going into a failed state that you can’t repair (retry or undo fail) or ignore, and basically you’re stuck at that point. You can’t even stop, start or shut down the virtual machine anymore; not one single operation works in SCVMM, while in the Failover Cluster GUI and in Hyper-V Manager everything is fully operational. This is important to note, as the services are fully online and functional. It’s just in SCVMM that you’re in trouble.

Why did they do it this way? They did it to move the VM to a new CSV. The fact that deleting a VM with SCVMM 2008 R2 also deletes the VM files made them use Hyper-V Manager instead. Now this approach (whatever you think of it) can work, but then you need to delete the VM in SCVMM 2008 R2 after exporting the virtual machine AND before proceeding with the import and making the virtual machine highly available.

People get creative in how to achieve things. Inconsistencies and differences in functionality between Hyper-V Manager and SCVMM 2008 R2 (in the latter especially the lack of complete control over naming, files & folders, and export/migration behavior), as well as the needs of the failover cluster, can lead to some confusing scenarios.

The Supported Fix

Now the easy way to fix this is to export the virtual machine again and delete it in SCVMM 2008 R2. That will remove the virtual machine object from SCVMM, the failover cluster and Hyper-V Manager. However, this virtual machine was so large (50 GB + 750 GB data disk) that there was no room for an export to be made. Secondly, an export of such a large VM takes a considerable time and it has to be offline for this operation. This is annoying because, even though SCVMM might be uncooperative at this point, the virtual machine is online and performing its duties for the business. So this presented us with a bit of a problem: stopping the virtual machine, exporting it using Hyper-V Manager (which makes it go “missing” in SCVMM so you can delete it there), importing the virtual machine again and adding it back to the failover cluster causes downtime.

The Root Cause

Why does this happen? Well, when you import a virtual machine into a failover cluster it creates a new unique ID for the virtual machine resource group. This always happens; choosing to reuse an existing ID during import in Hyper-V Manager has nothing to do with this. But VMM uses IDs/names to identify a VM, independent of the cluster. So when you don’t remove the VM from SCVMM before adding the VM back to the cluster, you end up with a different cluster group ID in the cluster than the one in SCVMM. They both have the same name, but there is a disconnect, leading to the issues described above.

By the way, exporting & importing a VM without first removing the virtual machine from the failover cluster leads to some issues in the failover cluster as well, so don’t do that either.

The “No Down Time” Fix

This is not the first time we’ve needed to dive into the SCVMM database to fix issues. One of my main beefs about SCVMM, other than its inconsistency with the other tools and its lack of control & options in some scenarios, is the fact that it doesn’t have enough self-maintenance intelligence & functionality. This leads to workarounds like the one above, which are slow and rather annoying, or to messing around in the SCVMM database, which isn’t exactly supported. Mind you, Microsoft has published some T-SQL to clean up such issues themselves. See You cannot delete a missing VM in SCVMM 2008 or in SCVMM 2008 R2 and RemoveMissingVMs. See also my blog post SCVMM 2008 R2 Phantom VM guests after Blue Screen on this subject.

The usual tricks of the trade, like refreshing the virtual machine configuration in the Failover Cluster GUI, don’t work here. Neither does the solution to this error described in Migrating a System Center Virtual Machine Manager 2008 VM from one cluster to another fails with error 12711; the error is the same but the cause is not. Forcing a refresh of the VM with the VMM PowerShell cmdlets doesn’t help either:

# Add the VMM cmdlets
Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager

# Connect to the VMM server
Get-VMMServer -ComputerName MySCVMMServer.test.lab

# Grab the problematic VM and put it into the object $vm
$vm = Get-VM -Name "HopelessVM"

# Force a refresh
Refresh-VM -Force $vm

In the end we have to fix the mismatch between the VMResourceGroupID in the failover cluster and the one in SCVMM by editing the database.

First you navigate to the registry key HKEY_LOCAL_MACHINE\Cluster\Groups on one of the cluster nodes, do a find for the problematic VM’s name and grab the name of its key; this is the VMResourceGroupID the cluster knows and works with. So now we have the correct VMResourceGroupID: 0f8cabe4-f773-4ae4-b431-ada5a3c9926c
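If you’d rather not click through regedit, a small PowerShell sketch run on one of the cluster nodes can do the search for you. It assumes the layout described above (one key per group under HKLM\Cluster\Groups with a Name value) and uses a hypothetical VM name; verify the result against regedit before you touch anything.

# Run on a cluster node: search HKLM:\Cluster\Groups for the group whose Name matches the VM.
$vmName = "HopelessVM"   # hypothetical name, adjust to your problematic VM
Get-ChildItem "HKLM:\Cluster\Groups" | ForEach-Object {
    $groupName = (Get-ItemProperty -Path $_.PSPath).Name
    if ($groupName -like "*$vmName*") {
        # The key name is the VMResourceGroupID the cluster uses for this group.
        "{0} => {1}" -f $groupName, $_.PSChildName
    }
}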


Now you connect to the SCVMM database and run the following query to find the VMResourceGroupID that SCVMM thinks the VM has and that it is using, which is what’s causing the issues:

SELECT  VMResourceGroupID  FROM tbl_WLC_VMInstance WHERE ComputerName = 'hopelessVM.test.lab'
GO 

The results:

VMResourceGroupID

————————————————–

df43bf60-7216-47ed-9560-7561d24c7dc8

(1 row(s) affected)

The trick then is to simply update that value to the one you just got from the registry by running:

UPDATE tbl_WLC_VMInstance SET VMResourceGroupID = '0f8cabe4-f773-4ae4-b431-ada5a3c9926c' WHERE VMResourceGroupID = 'df43bf60-7216-47ed-9560-7561d24c7dc8'
GO 

Then you need some patience & refresh the GUI a few times. Things will turn back to normal, but in between you might see some “missing” statuses appear for your problematic VM. These go away fast, however. If not, you can always use the Microsoft provided script to remove missing VMs as mentioned above in RemoveMissingVMs.

Warning

What I described above is something you can do to fix these issues fast and effectively when needed. But I’m not telling you this is the way to go, let alone that it is supported. Make sure you have backups of your VMs, hosts, SCVMM database etc. It only takes one mistake or misinterpretation to royally shoot yourself in the foot. It hurts like hell; recovery is long and seldom complete. On top of that it might generate a vacancy in your company whilst you’re escorted out of the building. Be careful out there.

Upgrading Windows Server 2008R2 Editions With DISM

When an environment evolves (growth, mergers, different needs) you might very well have resource needs above and beyond the limits of the original Windows edition that was installed. Scaling out might not be the right (or even possible) solution, so scaling up is an alternative option. Today, with Windows Server 2008 R2, this is very easy. However, again and again I see people resorting to labor-intensive and often tedious solutions. Some go the whole 9 yards and do a complete clean install and migration. Others get creative and do a custom install with the Windows media to achieve an in-place upgrade. But none of this is needed at all. Using DISM (Windows Edition-Servicing Command-Line Options) you can achieve anything you need, and every role, feature and app on your server will remain in good working condition. Recently I had to upgrade some Standard Edition Hyper-V guest servers to Enterprise Edition to make use of more than 32 GB of RAM. Another reason might be to move from Windows Server 2008 R2 Enterprise Edition to Datacenter Edition for a Hyper-V host to make use of that specific licensing model for virtual machines.

Please note the following:

  • You can only do upgrades. You CANNOT downgrade.
  • The server you upgrade cannot be a domain controller (demote, upgrade, promote).
  • This works on Standard and Enterprise editions, both full & core installations.
  • You cannot switch from core to full or vice versa. It’s an edition upgrade only, not a way to switch the type of install.

This is how to find the possible target editions for your server:

C:\Windows\system32>DISM /online /Get-TargetEditions

Deployment Image Servicing and Management tool
Version: 6.1.7600.16385

Image Version: 6.1.7600.16385
Editions that can be upgraded to:

Target Edition : ServerDataCenter
Target Edition : ServerEnterprise

The operation completed successfully.
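
Before switching you can also confirm which edition you are currently running; DISM has a matching option for that:

# Shows the edition currently installed on the running system (run from an elevated prompt).
DISM /online /Get-CurrentEdition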

So I went to Enterprise Edition by executing the command below. The process takes some time but is painless, apart from one reboot.

C:\Windows\system32>Dism /online /Set-Edition:ServerEnterprise /ProductKey:489J6-VHDMP-X63PK-3K798-CPX3Y

Deployment Image Servicing and Management tool
Version: 6.1.7600.16385

Image Version: 6.1.7600.16385

Starting to update components...
Starting to install product key...
Finished installing product key.

Removing package Microsoft-Windows-ServerStandardEdition~31bf3856ad364e35~amd64~~6.1.7601.17514
[==========================100.0%==========================]
Finished updating components.

Starting to apply edition-specific settings...
Restart Windows to complete this operation.
Do you want to restart the computer now (Y/N)?

You either use a MAK key (if you don’t have a KMS server) or the default key for your volume license media. When you have KMS in place (and the matching server group KMS key A, B, or C) the activation will be done automatically and transparently for you. Standard troubleshooting applies if you run into an issue there.
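If activation doesn’t kick in automatically after the edition switch, you can check and trigger it yourself with the built-in slmgr script. A short sketch, run from an elevated prompt; slmgr ships with Windows Server 2008 R2, and the output will obviously differ per environment.

# Show the current licensing channel, partial product key and activation status.
cscript //nologo C:\Windows\System32\slmgr.vbs /dlv

# Force an activation attempt (against your KMS host if one is configured).
cscript //nologo C:\Windows\System32\slmgr.vbs /ato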

These are the public keys for use with a KMS server:

  • Windows 7 Professional – FJ82H-XT6CR-J8D7P-XQJJ2-GPDD4
  • Windows 7 Professional N – MRPKT-YTG23-K7D7T-X2JMM-QY7MG
  • Windows 7 Enterprise – 33PXH-7Y6KF-2VJC9-XBBR8-HVTHH
  • Windows 7 Enterprise N – YDRBP-3D83W-TY26F-D46B2-XCKRJ
  • Windows 7 Enterprise E – C29WB-22CC8-VJ326-GHFJW-H9DH4
  • Windows Server 2008 R2 HPC Edition – FKJQ8-TMCVP-FRMR7-4WR42-3JCD7
  • Windows Server 2008 R2 Datacenter – 74YFP-3QFB3-KQT8W-PMXWJ-7M648
  • Windows Server 2008 R2 Enterprise – 489J6-VHDMP-X63PK-3K798-CPX3Y
  • Windows Server 2008 R2 for Itanium-Based Systems – GT63C-RJFQ3-4GMB6-BRFB9-CB83V
  • Windows Server 2008 R2 Standard – YC6KT-GKW9T-YTKYR-T4X34-R7VHC
  • Windows Web Server 2008 R2 – 6TPJF-RBVHG-WBW2R-86QPH-6RTM4

Don’t worry, this is public information (KMS Client Setup Keys); these will only activate if you have a KMS server and the key to make that KMS server work.

Either way, there is no need for a reinstall & migration or an upgrade installation for a simple edition upgrade scenario. So do yourself a favor and always check if you can use DISM to achieve your goals!

Assigning Large Memory To Virtual Machine Fails: Event ID 3320 & 3050

We had a kind reminder recently that we shouldn’t forget to complete all steps in a Hyper-V cluster node upgrade process. The proof of a plan lies in the execution. We needed to configure a virtual machine with a whopping 50GB of memory for an experiment. No sweat, we have plenty of memory in those new cluster nodes. But when trying to do so it failed with a rather obscure error in System Center Virtual Machine Manager 2008 R2:

Error (12711)

VMM cannot complete the WMI operation on server hypervhost01.lab.test because of error: [MSCluster_Resource.Name="Virtual Machine MYSERVER"] The group or resource is not in the correct state to perform the requested operation.

(The group or resource is not in the correct state to perform the requested operation (0x139F))

Recommended Action

Resolve the issue and then try the operation again.


One option we considered was that SCVMM 2008 R2 didn’t want to assign that much memory because one of the old hosts was still a member of the cluster and “only” had 48GB of RAM. But nothing that advanced was going on here. Looking at the logs, we found the culprit pretty fast: lack of disk space.

We saw the following errors in the Microsoft-Windows-Hyper-V-Worker-Admin event log:

Log Name:      Microsoft-Windows-Hyper-V-Worker-Admin
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          17/08/2011 10:30:36
Event ID:      3050
Task Category: None
Level:         Error
Keywords:     
User:          NETWORK SERVICE
Computer:      hypervhost01.lab.test
Description:
‘MYSERVER’ could not initialize memory: There is not enough space on the disk. (0x80070070). (Virtual machine ID DEDEFFD1-7A32-4654-835D-ACE32EEB60EE)

Log Name:      Microsoft-Windows-Hyper-V-Worker-Admin
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          17/08/2011 10:30:36
Event ID:      3320
Task Category: None
Level:         Error
Keywords:     
User:          NETWORK SERVICE
Computer:      hypervhost01.lab.test
Description:
‘MYSERVER’ failed to create memory contents file ‘C:\ClusterStorage\Volume1\MYSERVER\Virtual Machines\DEDEFFD1-7A32-4654-835D-ACE32EEB60EE\DEDEFFD1-7A32-4654-835D-ACE32EEB60EE.bin’ of size 50003 MB. (Virtual machine ID DEDEFFD1-7A32-4654-835D-ACE32EEB60EE)

Sure enough, a smaller amount of memory, 40GB, less than the remaining disk space on the CSV, did work. That made me remember we still needed to expand the LUNs on the SAN to provide the storage space for the large BIN files associated with these kinds of large memory configurations. Can you say “luxury problems”? The BIN file contains the memory of a virtual machine or snapshot that is in a saved state. Now you need to know that the BIN file actually requires the same disk space as the amount of physical memory assigned to a virtual machine. That means it can require a lot of room. Under “normal” conditions these don’t get this big, and we provide a reasonable buffer of free space on the LUNs anyway for performance reasons, growth etc. But this was a bit more than that buffer could absorb.
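A quick way to avoid this kind of surprise is to compare the memory you’re about to assign with the free space left on the CSV before you do it. A minimal sketch using the failover clustering PowerShell module on a cluster node; the 50GB figure is just the example from this post, and the property names are the ones I recall the module exposing for CSVs, so verify on your own cluster.

# Run on a cluster node: compare planned VM memory with free space on each CSV,
# since the BIN file will need roughly as much disk space as the assigned RAM.
Import-Module FailoverClusters

$plannedMemoryGB = 50
Get-ClusterSharedVolume | ForEach-Object {
    foreach ($volume in $_.SharedVolumeInfo) {
        $freeGB = [math]::Round($volume.Partition.FreeSpace / 1GB, 1)
        $enough = $freeGB -gt $plannedMemoryGB
        "{0} ({1}): {2} GB free - room for the BIN file: {3}" -f $_.Name, $volume.FriendlyVolumeName, $freeGB, $enough
    }
}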

As the planning already stated that we needed to expand the LUNs a bit to be able to deal with these kinds of memory hogs, the storage to do so was available and the LUN wasn’t maxed out yet. If not, we would have been in a bit of a pickle.

So there you go, a real life example of what Aidan Finn warns about when using dynamic memory. Also see KB 2504962 “Dynamic Memory allocation in a Virtual Machine does not change although there is available memory on the host”, which discusses the scenario where dynamic memory allocation seems not to work due to lack of disk space. Don’t forget about your disk space requirements for the BIN files when using virtual machines with this much memory assigned. They tend to consume considerable chunks of your storage space. And even if you don’t forget about it in your planning, please don’t forget to execute every step of the plan.