Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment

This is a 1st post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into  Your Network Infrastructure (Part 4/4)

A lot of early and current Hyper-V clusters are built on 1Gbps network architectures. That’s fine and works very well for a large number of environments. Perhaps at this moment in time you’re running solutions using blades with 10Gbps mezzanine cards/switches and all this with or without cutting up the bandwidth available for all the different networks needs a Hyper-V cluster has or should have for optimal performance and availability. This depends on the vendor and the type of blades you’re using. It also matters when you bought the hardware (W2K8 or W2K8R2 era) and if you built the solution yourself or bought a fast track or reference architecture kit, perhaps even including all Microsoft software and complete with installation services.

I’ve been looking into some approaches to introducing a 10Gbps network for use with Hyper-V clusters mainly for Clustered Shared Volume (CSV) and Live Migration (LM) Traffic. In brown field environments that are already running Hyper-V clusters there are several scenarios to achieve this, but I’m not offering the “definite guide” on how to do this. This is not a best practices story. There is no one size fits all. Depending on your capabilities, needs & budget you’ll approach things differently, reflecting what’s best for your environment. There are some “don’t do this in production whatever you environment is” warnings that you should take note of, but apart from that you’re free to choose what suits you best.

The 10Gbps implementations I’m dealing with are driven by one very strong operational requirement: reduce the live migration time for virtual machines with a lot of memory running a under a decent to heavy load. So here it is all about bandwidth and speed. The train of taught we’re trying to follow is that we do not want introduce 10Gbps just to share its bandwidth between 4 or more VLANs as you might see in some high density blade solutions. There that has often to do with limited amount of NIC/switch ports in some environments where they also want to have high availability. In high density scenarios the need to reduce cabling is also more urgent. All this is also often driven a desire to cut costs or keep those down as much as possible. But as technology evolves fast my guess is that within a few years we won’t be discussing the cost of 10Gbps switches anymore and even today there very good deals to be made. The reduction of cabling safes on labor & helps achieve high density in the racks. I do need to stress however that way too often discussions around density, cooling and power consumption in existing data centers or server rooms is not as simple as it appears. I would state that the achieve real and optimal results from an investment in blades you have to have the server room, cooling, power and ups designed around them. I won’t even go into the discussion over when blade servers become a cost effective solutions for SMB needs.

So back to 10Gbps networking. You should realize that Live Migration and Redirected Access with CSV absolutely benefit from getting a 10Gbps pipes just for their needs. For VMs consuming 16 Gb to 32 GB of memory this is significant. Think about it. Bringing 16 seconds back to 4 seconds might not be too big of a deal for a node with 10 to 15 VMs. But when you have a dozen SQL Servers that take 180 to 300 seconds to live migrate and reduce that to 20 to 30 seconds that helps. Perhaps not so during automated maintenance but when it needs to be done fast (i.e. on a node indicating serious hardware issues) those times add up. To achieve such results we gave the Live Migration & CSV network both a dedicated 10Gbps network. They consume about 50% of the available bandwidth so even a failover of the CSV traffic to our Live Migration network or vice versa should be easily handled. On top of the “Big Pipes” you can test jumbo frames, VMQ, …

Now the biggest part of that Live Migration time is in the “Brown-Out” phase (event id 22508 in the Hyper-V-Worker log) during which the memory transfer happens. Those are the times we reduce significantly by moving to 10Gbps. The “Black-Out” phase during which the virtual machine is brought on line on the other node creates a snapshot with the last remaining delta of “dirty memory pages”, followed by quiescing the virtual machine for the last memory copy to be performed and finally by the unquiescing of the virtual machine which is then running on the other node. This is normally measured in hundreds of milliseconds (event id 22509 in the Hyper-V-Worker log) . We do have a couple of very network intensive applications that sometimes have a GUI issue after a live migration (the services are fine but the consoles monitoring those services act up). We plan on moving those VMs to 10Gbps to find out if this reduces the “Black-Out” phase a bit and prevents that GUI of acting up. When can give you more feedback on this, I’ll let you known how that worked out.

An Example of these events in the Hyper-V-Worker event log is listed below:

Event ID 22508:

‘XXXXXXXX-YYYY-ZZZZ-QQQQ-DC12222DE1’ migrated with a Brown-Out time of 64.975 seconds.

Event ID 22509:

‘XXXXXXXX-YYYY-ZZZZ-QQQQ-DC12222DE1’ migrated with a Black-Out time of 0.811 seconds, 842 dirty page and 4841 KB of saved state.

Event ID 22507:

Migration completed successfully for ‘XXXXXXXX-YYYY-ZZZZ-QQQQ-DC12222DE1’ in 66.581 seconds.

In these 10Gbps efforts I’m also about high availability but not when that would mean sacrificing performance due to the fact I need to keep costs down and perhaps use approaches that are only really economical in large environments. The scenarios I’m dealing with are not about large hosting environments or cloud providers. We’re talking about providing the best network performance to some Hyper-V clusters that will be running SQL Server for example, or other high resource applications. These are relatively small environments compared to hosting and cloud providers. The economics and the needs are very different. As a small example of this: saving a ten thousand switch ports means that you’ll need you’ll save 500 times the price of a switch. To them that matters a lot more, not just in volume but also in relation to the other costs. They’re probably running services with an architecture that survives loosing servers and don’t require clustering. It all runs on cheap hardware with high energy efficiency as they don’t care about losing nodes when the service has been designed with that in mind. Economics of scale is what they are all about. They’d go broke building all that on highly redundant hardware and fail at achieving their needs. But most of us don’t work in such an environment.

I would also like to remind you that high availability introduces complexity. And complexity that you can’t manage will sink your high availability faster than a torpedo mid ship downs a cruiser. So know what you do, why and when to do it. One final piece of advice: TEST!

So to conclude this part take note of the fact I’m not discussing the design of a “fast track” setup that I’ll resell for all kinds of environments and I need a very cost effective rinse & repeat solution that has a Small, Medium & Large variety with all bases covered. I’m not saying those aren’t good or valuable, far from it, a lot of people will benefit from those but I’m serving other needs. If you wonder why they want to virtualize the applications at all, it has to do with disaster recovery & business continuity and replicating the environment to a remote site.

I intend to follow up on this in future blog posts when I have more information and some time to write it all up.

Hotfixes For Hyper-V & Failover Clustering Can Be Confusing KB2496089 & KB2521348

As I’m building or extending a number of Hyper-V Clusters in the next 4 months I’m gathering/updating my list with the Windows 2008 R2 SP1 hotfixes relating to Hyper-V and Failover Clustering. Microsoft has once published KB2545685: Recommended hotfixes and updates for Windows Server 2008 R2 SP1 Failover Clusters but that list is not kept up to date, the two hotfixes mentioned are in the list below. I also intend to update my list for Windows Server 2008 SP2 and Windows 2008 R2 RTM. As I will run into these and it’s nice to have a quick reference list.

I’ll include my current list below. Some of these fixes are purely related to Hyper-V, some to a combination of hyper-V and clusters, some only to clustering and some to Windows in general. But they are all ones that will bite you when running Hyper-V (in a failover cluster or stand-alone). Now for the fun part with some hotfixes, I’ll address in this blog post. Confusion! Take a look at the purple text and the green text hotfixes and the discussion below. Are there any others like this I don’t know about?

* KB2496089 is included in SP1 according to “Updates in Win7 and WS08R2 SP1.xls” that can be downloaded here (http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=269) but the Dutch language KB article states it applies to W2K8R2SP1 http://support.microsoft.com/kb/2496089/nl

Artikel ID: 2498472 – Laatste beoordeling: dinsdag 10 februari 2011 – Wijziging: 1.0

Vereisten

Deze hotfix moet worden uitgevoerd een van de volgende besturings systemen:

  • Windows Server 2008 R2
  • Servicepack 1 (SP1) voor Windows Server 2008 R2
Voor alle ondersteunde x64 versies van Windows Server 2008 R2

6.1.7600.20881
4,507,648
15-Jan-2011
04: 10
x64

Vmms.exe
6.1.7601.21642
4,626,944
15-Jan-2011
04: 05
x64

When you try to install the hotfix it will. So is it really in there? Compare file versions! Well the version after installing the hotfix on a W2K8R2 SP1 Hyper-V server the version of vmms.exe was 6.1.7601.21642 and on a Hyper-V server with SP1 its was 6.1.7061.17514. Buy the way these are English versions of the OS, no language packs installed.

With hotfix installed on SP1

Without hotfix installed on SP1

To make matters even more confusing while the Dutch KB article states it applies to both W2K8R2 RTM and W2K8R2SP1 but the English version of the article has been modified and only mentions W2K8R2 RTM anymore.

http://support.microsoft.com/kb/2496089/en-us

Article ID: 2496089 – Last Review: February 23, 2011 – Revision: 2.0

For all supported x64-based versions of Windows Server 2008 R2

Vmms.exe
6.1.7600.20881
4,507,648
15-Jan-2011
04:10
x64

So what gives? Has SP1 for W2K8R2 been updated with the fix included and did the SP1 version I installed (official one right after it went RTM) in the lab not yet include it? Do the service packs differ with language, i.e. only the English one got updated?. Sigh :-/ Now for the good news: ** It’s all very academic because of this KB 2521348 A virtual machine online backup fails in Windows Server 2008 R2 when the SAN policy is set to “Offline All” which brings the vmms.exe version to 6.1.7601.21686 and this hot fix supersedes KB2496089. See http://blogs.technet.com/b/yongrhee/archive/2011/05/22/list-of-hyper-v-windows-server-2008-r2-sp1-hotfixes.aspx where this is explicitly mentioned.

Ramazan Can mentions hotfix 2496089 and whether it is included in SP1 in the comments on his blog post http://ramazancan.wordpress.com/2011/06/14/post-sp1-hotfixes-for-windows-2008-r2-sp1-with-failover-clustering-and-hyper-v/ but I’m not very convinced it is indeed included. The machines I tested on are running W2K8R2 English RTM updated to SP1, not installations for the media including SP1 so perhaps there could also be a difference. It also should not matter that if you install SP1 before adding the Hyper-V role, so that can’t be the cause.

Anyway, keep your systems up to date and running smoothly, but treat your Hyper-V clusters with all due care and attention.

  1. KB2277904: You cannot access an MPIO-controlled storage device in Windows Server 2008 R2 (SP1) after you send the “IOCTL_MPIO_PASS_THROUGH_PATH_DIRECT” control code that has an invalid MPIO path ID
  2. KB2519736: Stop error message in Windows Server 2008 R2 SP1 or in Windows 7 SP1: “STOP: 0x0000007F”
  3. KB2496089: The Hyper-V Virtual Machine Management service stops responding intermittently when the service is stopped in Windows Server 2008 R2
  4. KB2485986: An update is available for Hyper-V Best Practices Analyzer for Windows Server 2008 R2 (SP1)
  5. KB2494162: The Cluster service stops unexpectedly on a Windows Server 2008 R2 (SP1) failover cluster node when you perform multiple backup operations in parallel on a cluster shared volume
  6. KB2496089: The Hyper-V Virtual Machine Management service stops responding intermittently when the service is stopped in Windows Server 2008 R2 (SP1)*
  7. KB2521348: A virtual machine online backup fails in Windows Server 2008 R2 (SP1) when the SAN policy is set to “Offline All”**
  8. KB2531907: Validate SCSI Device Vital Product Data (VPD) test fails after you install Windows Server 2008 R2 SP1
  9. KB2462576: The NFS share cannot be brought online in Windows Server 2008 R2 when you try to create the NFS share as a cluster resource on a third-party storage disk
  10. KB2501763: Read-only pass-through disk after you add the disk to a highly available VM in a Windows Server 2008 R2 SP1 failover cluster
  11. KB2520235: “0x0000009E” Stop error when you add an extra storage disk to a failover cluster in Windows Server 2008 R2 (SP1)
  12. KB2460971: MPIO failover fails on a computer that is running Windows Server 2008 R2 (SP1)
  13. KB2511962: “0x000000D1” Stop error occurs in the Mpio.sys driver in Windows Server 2008 R2 (SP1)
  14. KB2494036: A hotfix is available to let you configure a cluster node that does not have quorum votes in Windows Server 2008 and in Windows Server 2008 R2 (SP1)
  15. KB2519946: Timeout Detection and Recovery (TDR) randomly occurs in a virtual machine that uses the RemoteFX feature in Windows Server 2008 R2 (SP1)
  16. KB2512715: Validate Operating System Installation Option test may identify Windows Server 2008 R2 Server Core installation type incorrectly in Windows Server 2008 R2 (SP1)
  17. KB2523676: GPU is not accessed leads to some VMs that use the RemoteFX feature to not start in Windows Server 2008 R2 SP1
  18. KB2533362: Hyper-V settings hang after installing RemoteFX on Windows 2008 R2 SP1
  19. KB2529956: Windows Server 2008 R2 (SP1) installation may hang if more than 64 logical processors are active
  20. KB2545227: Event ID 10 is logged in the Application log after you install Service Pack 1 for Windows 7 or Windows Server 2008 R2
  21. KB2517329: Performance decreases in Windows Server 2008 R2 (SP1) when the Hyper-V role is installed on a computer that uses Intel Westmere or Sandy Bridge processors
  22. KB2532917: Hyper-V Virtual Machines Exhibit Slow Startup and Shutdown
  23. KB2494016: Stop error 0x0000007a occurs on a virtual machine that is running on a Windows Server 2008 R2-based failover cluster with a cluster shared volume, and the state of the CSV is switched to redirected access
  24. KB2263829: The network connection of a running Hyper-V virtual machine may be lost under heavy outgoing network traffic on a computer that is running Windows Server 2008 R2 SP1
  25. KB2406705: Some I/O requests to a storage device fail on a fault-tolerant system that is running Windows Server 2008 or Windows Server 2008 R2 (SP1) when you perform a surprise removal of one path to the storage device
  26. KB2522766: The MPIO driver fails over all paths incorrectly when a transient single failure occurs in Windows Server 2008 or in Windows Server 2008 R2

KB Article 2522766 & KB Article 2135160 Published Today

At this moment in time I don’t have any more Hyper-V clusters to support that are below Windows Server 2008 R2 SP1. That’s good as I only have one list of patches to keep up to date for my own use. As for you guys still taking care of Windows 2008 R2 RTM Hyper-V cluster you might want to take a look at KN article 2135160 FIX: "0x0000009E" Stop error when you host Hyper-V virtual machines in a Windows Server 2008 R2-based failover cluster that was released today. The issue however is (yet again) an underlying C-State issue that already has been fixed in relation to another issue published as KB article 983460 Startup takes a long time on a Windows 7 or Windows Server 2008 R2-based computer that has an Intel Nehalem-EX CPU installed.

And for both Windows Server 2008 R2 RTM and SP1 you might take a look at an MPIO issue that was also published today (you are running Hyper-V on a cluster and your are using MPIO for redundant storage access I bet) KB article 2522766 The MPIO driver fails over all paths incorrectly when a transient single failure occurs in Windows Server 2008 or in Windows Server 2008 R2

It’s time I add a page to this blog for all the fixes related to Hyper-V and Failover Clustering with Windows Server 2008 R2 SP1 for my own reference Smile