Don’t Let Live Migration Settings Behavior Confuse You

Configuring live migration settings on a cluster

In the cluster, under Networks, Live Migration Settings, you can select which networks are available for live migration and their order of preference.


Basically, this setting determines which NICs can be used and in what order of preference the available networks will be used by live migration. It does not provide bandwidth aggregation or failover. All it does is provide the order in which the redundant networks will be used. It’s up to the cluster service (NETFT), SMB Multichannel or SMB Direct to provide the bandwidth aggregation and failover if possible. As you can see, we use LM/CSV over SMB, and as our two NICs are RDMA-capable 10Gbps NICs, multichannel will discover the RDMA capabilities and leverage SMB Direct if it can be established; otherwise it will just stick with multichannel. If you team NICs, that team shows up as just one network. Also note that if you lose a NIC during live migration it might fail for some VMs under certain scenarios, but your cluster nodes will maintain the capability and recover. The names of the networks reflect this: LM1+CSV2 and CSV1+LM2 will both be used, but if for some reason multichannel goes completely south, the names reflect the metrics of these networks. The lowest metric is CSV1+LM2 and the second lowest is LM1+CSV2, which tells you which network NETFT will select automatically based on those metrics. I.e. it’s “self-documentation” for human consumption.
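If you want to see the metrics NETFT bases that ordering on, you can query them with PowerShell on any cluster node. A minimal sketch (the network names you’ll see are of course your own, such as the CSV1+LM2 and LM1+CSV2 names above):

```powershell
# List the cluster networks with the metrics NETFT uses to order them.
# The lowest metric wins; AutoMetric shows whether the value was
# assigned automatically or set by hand.
Get-ClusterNetwork |
    Sort-Object -Property Metric |
    Format-Table -Property Name, Role, AutoMetric, Metric -AutoSize
```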


Sometimes you might get surprising results. An example: you’ve selected SMB for live migration and you have selected only one of the NICs here. Still, when you look at Performance Monitor, you might see both being used! What’s happening is that multichannel kicks in (and uses two or more similar NICs when it finds them) and, if applicable, moves to RDMA.


So here we select SMB as the live migration performance option and, with two equally capable 10Gbps NICs available for live migration, it will use both of them, even if you selected only one of them in the cluster network settings for live migration.
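You can check what is really happening on the wire while such a live migration runs. A rough sketch, assuming you run this on the host during the migration:

```powershell
# Set SMB as the live migration performance option on this host
# (the same choice as in the Hyper-V Manager dialog).
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

# While a live migration is in flight, list the SMB multichannel
# connections; one row per NIC in use, with the RDMA capability shown.
Get-SmbMultichannelConnection |
    Format-Table -Property ServerName, ClientIpAddress, ServerIpAddress,
        ClientRdmaCapable -AutoSize
```

If multichannel has dragged both 10Gbps NICs into action you’ll see connections over both client IP addresses here, regardless of what the cluster GUI says.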

Why is that? Well, there is still another location where live migration settings are defined and that is in Hyper-V Manager. Take a look at the screenshot below.


The incoming live migration setting is set to “Use any available network for live migration”. If you have this on, it will still leverage both NICs, as when one is used multichannel drags the other one into action anyway, no matter what you set in the network settings for live migration in the cluster GUI (there it is set to use only one and dimmed out).

Do note that in Hyper-V Manager the live migration settings specify “Incoming live migrations”. That leads us to believe that it’s the target, the node the VMs are migrated to, that determines which NICs get used. But let’s put this to the test.
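That same Hyper-V Manager setting can be flipped from PowerShell on each node; a sketch, where the subnet is just an example value:

```powershell
# Enable live migration and allow any available network, mirroring the
# "Use any available network for live migration" radio button.
Enable-VMMigration
Set-VMHost -UseAnyNetworkForMigration $true

# The alternative is to pin live migration to specific addresses:
# Set-VMHost -UseAnyNetworkForMigration $false
# Add-VMMigrationNetwork 10.10.180.0/24   # example subnet
```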

Testing A Hyper-V cluster with two nodes – Node A and Node B

In the cluster network settings you select only one network.


On Hyper-V cluster Node A you have configured live migration via Hyper-V Manager to “Use these IP addresses for live migration”. You cannot add or remove networks; the networks used are defined by the cluster.


On Hyper-V cluster Node B you have configured live migration via Hyper-V Manager to “Use any available network for live migration”.

As we now kick off a live migration from Node A to Node B, we’ll see both NICs being used. Why? Because Node B is the target and Node B has the setting “Use any available network for live migration”. So why then only these two, why not pick up any other suitable NICs? Well, we’ve configured live migration on both nodes to use SMB. As this cluster is RDMA capable, that means it will leverage multichannel/SMB Direct. The auto configuration will select the best, equally capable NICs for this, and that’s these two in our scenario. Remember that the capabilities of the NICs have to match: no mixtures of 1 x 1Gbps and 1 x 10Gbps, or one multichannel-only NIC and one SMB Direct NIC.

The confusion really sets in when you live migrate from Node B to Node A: it will also use both NICs! Hmm, so “Incoming Live Migrations” is not always “correct”, it seems, at least not when using SMB as a performance option. Apparently multichannel will kick in in both directions.

If you set both Node A and Node B to “Use these IP addresses for live migration” and leave the cluster network settings with only one network, it does only use one, even with SMB as a performance option for live migration. This is expected.
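When testing scenarios like this, it helps to dump the relevant settings of both nodes side by side so you know exactly what each one is configured to do. Something along these lines, where NodeA and NodeB are the placeholder names from this example:

```powershell
# Compare the live migration configuration of both cluster nodes.
Invoke-Command -ComputerName NodeA, NodeB -ScriptBlock {
    Get-VMHost |
        Select-Object -Property ComputerName,
            VirtualMachineMigrationEnabled,
            VirtualMachineMigrationPerformanceOption,
            UseAnyNetworkForMigration
}

# And the networks each node is allowed to use when it is pinned
# to specific IP addresses:
Invoke-Command -ComputerName NodeA, NodeB -ScriptBlock {
    Get-VMMigrationNetwork
}
```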

Note: I had one interesting hiccup while testing this configuration: when doing the latter, one of the VMs failed during a live migration of the entire host. I ran it again and that one VM still used both networks. All the others went well during the host migration with just one network being used. That was a bit of a “huh?” moment and it sure tripped me up & kept me busy for a while. I blame RDMA and the position of the planets & constellations.

Things aren’t always what they seem at first, and it’s good to keep that in mind. The moment you think you’ve got it figured out, you’re wrong. So look again & investigate.

Linux Integration Services Version 3.5 for Hyper-V Available For Download

Yesterday, December 19th 2013, Microsoft made the Linux Integration Services Version 3.5 for Hyper-V available for download.

The Linux Integration Services (LIS) package downloaded from Microsoft is meant to deliver support for older Linux distros. In the most recent Linux distros, the KVP component, like the other Hyper-V related drivers, is part of the upstream Linux kernel and as such is included in the distro releases. So you should not need this download if you run newer distros that have LIS built in. The list of supported distros is slowly growing.
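A quick way to see whether a Linux guest has working integration services, built-in or from the download, is to query them from the Hyper-V host. A sketch, where “MyLinuxVM” is a placeholder name:

```powershell
# From the Hyper-V host, list the integration services a VM exposes.
# A Linux guest with functioning LIS reports its enabled services here.
Get-VMIntegrationService -VMName "MyLinuxVM" |
    Format-Table -Property Name, Enabled, PrimaryStatusDescription -AutoSize
```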


If you are running (or need to run) older versions of Linux in your VMs and leverage the 100% fully featured Hyper-V Server 2012 R2, which is also 100% free of charge, this is your way to leverage all those features. The aim is that you’re never left behind when running Hyper-V (within the limits of supportability; DOS 6.0, NT 4.0 or Windows 2000 is not an acceptable OS today).

In Microsoft speak:

Hyper-V supports both emulated (“legacy”) and Hyper-V-specific (“synthetic”) devices for Linux virtual machines. When a Linux virtual machine is running with emulated devices, no additional software is required to be installed. However, emulated devices do not provide high performance and cannot leverage the rich virtual machine management infrastructure that the Hyper-V technology offers.

To make full use of all benefits that Hyper-V provides, it is best to use Hyper-V-specific devices for Linux. The collection of drivers that are required to run Hyper-V-specific devices is known as Linux Integration Services (LIS).
 
For certain older Linux distributions, Microsoft provides an ISO file containing installable LIS drivers for Linux virtual machines. For newer Linux distributions, LIS is built into the Linux operating system, and no separate download or installation is required. This guide discusses the installation and functionality of LIS drivers on older Linux distributions.

For some extra info and tips, see Enabling Linux Support on Windows Server 2012 R2 Hyper-V.

Failed Live Migrations with Event ID 21502 Planned virtual machine creation failed for virtual machine ‘VM Name’: An existing connection was forcibly closed by the remote host. (0x80072746) Caused By Wrong Jumbo Frame Settings

OK, so live migration fails and you get the following error in the System event log with event ID 21502:


Planned virtual machine creation failed for virtual machine ‘DidierTest01’: An existing connection was forcibly closed by the remote host. (0x80072746). (Virtual Machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).

Failed to receive data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).

There are some threads on the TechNet forums on this like here http://social.technet.microsoft.com/Forums/en-US/805466e8-f874-4851-953f-59cdbd4f3d9f/windows-2012-hyperv-live-migration-failed-with-an-existing-connection-was-forcibly-closed-by-the and some blog post pointing to TCP/IP Chimney settings causing this but those causes stem back to the Windows Server 2003 / 2008 era.

In the Hyper-V event log Microsoft-Windows-Hyper-V-VMMS-Admin you also see a series of entries related to the failed live migration, pointing to the same issue:

  
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:15 AM
Event ID:      20413
Task Category: None
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
The Virtual Machine Management service initiated the live migration of virtual machine  ‘DidierTest01’ to destination host ‘SRV2’ (VMID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      22038
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Failed to send data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      21018
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Planned virtual machine creation failed for virtual machine ‘DidierTest01’: An existing connection was forcibly closed by the remote host. (0x80072746). (Virtual Machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      22040
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Failed to receive data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      21024
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      srv1.blog.com
Description:
Virtual machine migration operation for ‘DidierTest01’ failed at migration source ‘SRV1’. (Virtual machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27)

There is something wrong with the network, and if all checks out on your cluster & hosts, it’s time to look beyond those. Well, as it turns out, it was the jumbo frame setting on the CSV and LM NICs.

Those servers had been connected to a couple of DELL Force10 S4810 switches. These can handle an MTU size up to 12000, and that’s how they were configured. The Mellanox NICs allow for MTU sizes up to 9614 in their jumbo frame property. Now, super-sized jumbo frames are all cool until you attach the network cables to another switch, like a PowerConnect 8132, that has a max MTU size of 9216. At that moment your network won’t do what it’s supposed to, and you see errors like those above. If you test via an SMB share things seem OK, and standard pings don’t show the issue. But some ping tests with different MTU sizes and the -f (do not fragment) switch will unmask the issue soon. Setting the jumbo frame size on the CSV & LM NICs to 9014 resolved the issue.
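The test itself is simple: non-fragmentable pings of increasing size will unmask an MTU mismatch quickly. Remember that ping’s -l value is the ICMP payload, so for a 9000-byte IP MTU the largest payload that fits is 9000 minus 28 bytes of IP and ICMP headers, i.e. 8972. A rough sketch; the NIC name pattern and target address are examples:

```powershell
# Check the current jumbo packet setting on the live migration NICs.
Get-NetAdapterAdvancedProperty -Name "CSV*", "LM*" -RegistryKeyword "*JumboPacket"

# Probe the path with the do-not-fragment flag set.
# 8972 = 9000 (MTU) - 20 (IP header) - 8 (ICMP header). If this fails
# while smaller sizes succeed, something along the path has a smaller MTU.
ping.exe -f -l 8972 10.10.180.2

# If needed, set a matching jumbo frame size on a NIC, e.g.:
# Set-NetAdapterAdvancedProperty -Name "CSV1+LM2" `
#     -RegistryKeyword "*JumboPacket" -RegistryValue 9014
```

Run the ping from each node to each live migration address on the other node so you test every path end to end.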

Now if on the server side everything matches up but not on the switches you’ll also get an event id 21502 but with a different error message:

Event ID: 21502 The Virtual Machine Management Service failed to establish a connection for a Virtual machine migration with host XXXX. A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because connected host has failed to respond (0X8007274C)


This is the same message you’ll get for a known cause of shared nothing live migration failing as described in this blog post by Microsoft Shared Nothing Migration fails (0x8007274C).

So there you go. Keep an eye on those jumbo frame settings, especially in a mixed switch environment. They all have their own capabilities, rules & peculiarities. Make sure to test end to end and you’ll be just fine.

I’m In Austin Texas For Dell World 2013

This is the night time skyline of where I’m at right now: Austin, Texas, USA. That famous “Lone Star State” that until now I only knew from the movies & the media. Austin is an impressive city in an impressive state and, as with most US experiences I’ve had, isn’t comparable with anything in my home country Belgium. That works both ways, naturally, and I’m lucky I get to travel a bit and see a small part of the world.

Dell World 2013

So why am I here? Well, I’m here to attend DELL World 2013, but you got that already.


That’s nice, Didier, but why DELL World? Well, several reasons. For one, I wanted to come and talk to as many product owners & managers, architects & strategists as I can. We’re seeing a lot of interest in the new capabilities that Windows Server 2012 (R2) brought to the Microsoft ecosystem. I want to provide all the feedback I can on what I, as a customer, a Microsoft MVP and a technologist, expect from DELL to help us make the most of those. I’m convinced DELL has everything we need but can use some guidance on what to add or enhance. It would be great to get our priorities and those of DELL aligned. From them I expect to hear their plans, ideas and opinions, and see how those match up.

Dell has a cost/value leadership position when it comes to servers right now. They have a great line-up of economy switches that pack a punch (PowerConnect) & some state-of-the-art network gear with Force10. It would be nice to align these with guidance & capabilities to leverage SMB Direct and NVGRE network virtualization. Dell still has the chance to fill some gaps way better than others have. A decent Hyper-V network virtualization gateway that doesn’t cost you your two firstborn children and can handle dozens to hundreds of virtual networks comes to mind. That, and real-life guidance on SMB Direct with DCB configurations. Storage wise, the MD series, EqualLogic & Compellent arrays offer great value for money. But we need to address the needs & interest that SMB 3.0, Storage Spaces and RDMA have awoken, and how Dell is planning to address those. I also think that OEMs need to pick up the pace & change some of their priorities when it comes to providing answers to what their customers in the MSFT ecosystem ask for & need; doing that can put them in a very good position versus their competitors. But I have no illusions about my place in & impact on the universe.

Secondly, I was invited to come. As it turns out, DELL has the same keen interest in talking to people who are in the trenches using their equipment to build solutions that address real-life needs in an economically feasible way. No, this is not “just” marketing. A smart vendor today communicates in many ways with existing & potential customers. Social media is a big part of that, but so is the offline side at conferences and events, as both contributor and sponsor. Feedback on how that works & is received is valuable for both parties: they learn what works & what doesn’t, and we get the content we need. Sure, you’ll have the corporate social media types that are bound by legal & marketing constraints, but the real value lies in engaging with your customers & partners about their real technological challenges & needs.

Third is the fact that all these trends & capabilities, in both the Microsoft ecosystem and the hardware world, are not happening in isolation. They are happening in a world dominated by cloud computing in all its forms. This impacts everything from the clients and servers to the data centers, as well as the people involved. It’s a world in which we need to balance existing and future needs with a mixture of approaches & where no one size fits all, even if the solutions come via commodity products & services. It’s a world where the hardware & software giants are entering each other’s turf. That’s bound to cause some sparks. Datacenter abstraction layer, software-defined “anything” (storage, networking, …), converged infrastructure. Will they collaborate or fight?

So put these three together and here I am. My agenda is full of meetings, think tanks, panels, briefings and some down time to chat to colleagues & DELL employees alike.

Why & How?

Some time ago I was asked why I do this and how I’m even able to do this. It takes time, money and effort. Am I some kind of hot-shot manager or visionary guru? No, not at all. Believe me, there’s nothing “hot” about working on a business-down issue at zero dark thirty. I’m a technologist. I’m where the buck stops. I need things to work. So I deal in realities, not fantasies. I don’t sell methods, processes or services, people; I sell results. That’s what pays my bills long term. But I do dream, and I try to turn those dreams into realities. That’s different from a fantasy world where reality is an unwelcome guest. I’m no visionary, I’m no guru. I’m a hard-working IT Pro (hence the blog title and twitter handle) who realizes all too well that he’s standing on the shoulders of not just giants but of all those people who create the ecosystem in which I work.

But there’s more. Being a mere technologist only gets you so far. I also give architectural & strategic advice, as that’s also needed to make the correct decisions. Solutions don’t exist in isolation and need to be made in relation to trends, strategies and needs. That takes insight & vision. Those don’t come to you by only working in the data center, at your desktop or in eternal meetings with the same people in the same place. My peers, employers and clients actively support this for the benefit of our business, customers, networks & communities. That’s the what, why and who that are giving me the opportunities to learn & grow both personally & professionally. People like Arlindo Alves and many others at MSFT, my fellow MVPs (Aidan Finn, Hans Vredevoort, Carsten Rachfahl, …), Florian Klaffenbach & Peter Tsai. As a person you have to grab those opportunities. If you want to be heard, you need to communicate. People listen, and if the discussions and subjects are interesting it becomes a two-way conversation and a great learning experience.

As with all networking and community endeavors, you need to put in the effort to reap the rewards in the form of information, insights and knowledge you can leverage both for your own needs and for those in your network. That means speaking your mind, being honest and open, even if at times you’re wrong. That’s learning. That, to me, is what being in the DELL TechCenter Rock Star program is all about.

Learning, growing, sharing. That, and a sustained effort in your own development, slowly but surely makes you an “expert”. An expert that realizes all too well how much he doesn’t know & cannot possibly learn it all. Luckily, to help deal with that fact, you can turn to the community.