Failed Live Migrations with Event ID 21502 Planned virtual machine creation failed for virtual machine ‘VM Name’: An existing connection was forcibly closed by the remote host. (0x80072746) Caused By Wrong Jumbo Frame Settings

OK so Live Migration fails and you get the following error in the System even log with event id 21502:

image

Planned virtual machine creation failed for virtual machine ‘DidierTest01’: An existing connection was forcibly closed by the remote host. (0x80072746). (Virtual Machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).

Failed to receive data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).

There are some threads on the TechNet forums on this like here http://social.technet.microsoft.com/Forums/en-US/805466e8-f874-4851-953f-59cdbd4f3d9f/windows-2012-hyperv-live-migration-failed-with-an-existing-connection-was-forcibly-closed-by-the and some blog post pointing to TCP/IP Chimney settings causing this but those causes stem back to the Windows Server 2003 / 2008 era.

In the Hyper-V event log Microsoft-Windows-Hyper-V-VMMS-Admin you also see a series of entries related to the failed live migration point to the same issue: image

  
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:15 AM
Event ID:      20413
Task Category: None
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
The Virtual Machine Management service initiated the live migration of virtual machine  ‘DidierTest01’ to destination host ‘SRV2’ (VMID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      22038
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Failed to send data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      21018
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Planned virtual machine creation failed for virtual machine ‘DidierTest01’: An existing connection was forcibly closed by the remote host. (0x80072746). (Virtual Machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      22040
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Failed to receive data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      21024
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      srv1.blog.com
Description:
Virtual machine migration operation for ‘DidierTest01’ failed at migration source ‘SRV1’. (Virtual machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27)

There is something wrong with the network and if all checks out on your cluster & hosts it’s time to look beyond that. Well as it turns out it was the Jumbo Frame setting on the CSV and LM NICs.

Those servers had been connected to a couple of DELL Force10  S4810 switches. These can handle an MTU size up to 12000. And that’s how they are configured. The Mellanox NICs allow for MTU Sizes up to 9614 in their Jumbo Frame property.  Now super sized jumbo frames are all cool until you attach the network cables to another switch like a PowerConnect 8132 that has a max MTU size of 9216. That moment your network won’t do what it’s supposed to and you see errors like those above. If you test via an SMB share things seem OK & standard pings don’t show the issue. But some ping tests with different mtu sizes & the –f (do no fragment) switch will unmask the issue soon. Setting the Jumbo Frame size on the CSV & LM NICs to 9014 resolved the issue.

Now if on the server side everything matches up but not on the switches you’ll also get an event id 21502 but with a different error message:

Event ID: 21502 The Virtual Machine Management Service failed to establish a connection for a Virtual machine migration with host XXXX. A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because connected host has failed to respond (0X8007274C)

image

This is the same message you’ll get for a known cause of shared nothing live migration failing as described in this blog post by Microsoft Shared Nothing Migration fails (0x8007274C).

So there you go. Keep an eye on those Jumbo Frame setting especially in a mixed switch environment. They all have their own capabilities, rules & peculiarities. Make sure to test end to end and you’ll be just fine.

Live Migration Can Benefit From Jumbo Frames

Does live migration benefit from Jumbo frames? This question always comes back so I’d just blog it hear again even if I have mentioned it as part of other blog posts. Yes it does! How do I know. Because I’ve tested and used it with Windows Server 2008 R2, 2012 & 2012 R2. Why? because I have a couple of mantra’s:

  • Assumption are the mother of all fuckups
  • Assume makes an ASS out of U and ME
  • Trust but verify

What can I say. I have been doing 10Gbps since for Live Migration with Hyper-V. And let me tell you my experiences with an otherwise completely optimized server (mainly BIOS performance settings): It will help you with up to 20% more bandwidth use.

And thanks to Windows Server 2012 R2 supporting SMB for live migration we can very nicely visualize this with 2*10Gbps NICS, not teamed, used by live migration leveraging SMB Multichannel. On one of the 10Gbps we enable Jumbo Frames on the other one we do not. We than live migrate a large memory VM back and forth. Now you tell me which one is which.

image

Now enable Jumbo frames on both 10Gbps NICs and again we live migrate the large memory VM back and forth. More bandwidth used, faster live migration.

image

I can’t make it any more clear. No jumbo frames will not kill your performance unless you have it messed up end to end. Don’t worry if you have a cheaper switch where you can only enable it switch wide instead op port per port. The switch is a pass through. So unless you set messed up sizes on sender/receiving host that the switch in between can’t handle, it will work even without jumbo frames and without heaven falling down on your head Smile. Configure it correctly, test it, and you’ll see.

Enabling Jumbo Frames Inside Virtual Machines Enhances Throughput & Reduces CPU Load

Let’s play a bit with a Windows Server 2012 R2 Hyper-V cluster with 2*10Gbps Intel X520 teamed switch independent and in dynamic mode, which is optimal for DVMQ. On these NICs we enabled Jumbo frames (and on the switches of course). That team is used to create a virtual switch for consumption by the virtual machines. The switches used are 2*DELL PowerConnect 8132F. So we have full fault tolerance. But the important thing to note is that this is commodity, quality hardware that we can leverage for great results.

Now we’ll compare 2 scenarios. In both test a sending VM will try to saturate a receiving VMs network bandwidth. Due to how this NIC teaming setup works, that’s about 10Gbps in a two member 2*10Gbps team. We work around this by leaving both VMs on the same host, so the traffic doesn’t need to pass across the wire. The VMs have VMQ & vRSS enabled with the host team members having VMQ enabled.

Jumbo frames disabled inside the VM (Both VMs on same host)

Without Jumbo Frames enabled in the Guest VM and all other things being equal the very best we can achieve non-sustained is 21Gbps (average +/- 17Gbps)  receiving traffic in a VM. Not bad, not bad at all.

image

In the picture you can see the host with DVMQ doing it’s job at the left while vRSS is at work in the VM. Pretty clear.

Jumbo frames enabled inside the VM

Now, let’s enable Jumbo frames in the VM.  Fire up PowerShell or use the GUI.

image

Don’t forget to do this on both the sending and the receiving VM Smile. Here we get +/- 30Gbps receiving traffic inside of the VM. A nice improvement isn’t it? Not just that but we consume less CPU resources as well! Sweet Smile

image

Useful Power at your finger tips or just showing off?

vRSS & DVMQ is one of may scalability & performance improvements in Windows 2012 R2. And yes, Jumbo Frames inside the VM do make a difference but in a 10Gbps environment it’s not  “in your face” that visible. The 10Gbps limit of a single NIC team member makes this your bottle neck. But it DOES help to reduce CPU cycles in that case. Just look at the two screen shots below.

No Jumbo frames in VM (sending & receiving VMs on different hosts)

image

We’re consuming 7% CPU resources on the host and 15% in the VM.

Jumbo Frames in VM (sending & receiving VMs on different hosts)

image

We’ve dropped down to 3% on the host and to 9% inside the VM. All bits help I say!

How far can we push this?

Again if you take the NIC Team bottleneck out of the way you can see some serious differences. Take a look at the screenshot below, that’s 36,2Gbps inside of a VM courtesy of vRSS during some other experiments. Tallyho!

image

So let’s face it, I guess we’ll need some faster memory (DDR4) and multiple 40Gbps/100Gbps Cards to see what the limits of Windows Server 2012 R2 are or to find out if we reached it. Right now the operating system is giving hardware people a run for their money.

Also note that if you have a number of VMs doing a lot of network IO you’ll be using quite a number of CPU cycles. While vRSS & DVMQ make this scale you might want to consider leveraging SMB Direct for the various tastes of live migration as this will definitely help you out on that front as the NIC will do the heavy lifting.

Reality Check

But perhaps we also need little reality check. While 100Gbps and DDR4 is very nice you might not need it for your current needs. When the environment is built right you’ll find that your apps are usually your limiting factor before the hardware, let alone Windows Server 2012 R2.  So why is knowing this important? Well I verify Microsoft claims so I can talk from experience and not just from what I read in a Microsoft presentation. Secondly you can trust that your investment in Windows Server 2012 R2 is going to carry you long and far. It’s future proofed and that’s god for you. Both when your when needs grow exponential and for the longevity of your environment. Third, we can leverage this to virtualize high through put environments and get the best possible results and ROI.

Also, please, please test & find out, verify what settings suit your environment best and just just blindly enable stuff. Good luck!

Presentation & Demos E2EVC Rome 2013

Well my E2EVC presentation has been given and it went well. Sweet experience for the 20th edition of this excellent community conference.

image

Thank you to all those attendees that attended my session. I hope you enjoyed it, learned something and got a taste of experimenting with some of the perhaps lesser know features at your disposal in Windows Server 2012 R2.

A big thanks to Alex and Clare for the splendid organization of E2EVC for the twentieth time!

For more information on VMQ & vRSS go here: https://blog.workinghardinit.work/2013/10/23/windows-server-2012-r2-virtual-rss-vrss-in-action/
http://blogs.technet.com/b/networking/archive/2013/09/10/vmq-deep-dive-1-of-3.aspx
http://blogs.technet.com/b/networking/archive/2013/09/24/vmq-deep-dive-2-of-3.aspx
http://blogs.technet.com/b/networking/archive/2013/10/22/vmq-deep-dive-3-of-3.aspx

For a good start on SMB Direct with iWarp or RoCE go here: https://blog.workinghardinit.work/2013/08/28/adventures-in-rdma-the-roce-path-to-windows-server-2012-r2-smb-3-0-glory/

You can download the presentation here. The video of the session will be made available later by E2EVC so you can see the demo.

Just to make you all drool for the video to become live here’s a screenshot of what DVMQ/vRSS can achieve. Just pushing it to the limit. No, this is not “Photoshopped” I have witnesses Winking smile. That’s a W2K12R2 VM receiving 37.4Gbps of traffic. That will do or most of you I guess until 100Gbps NICs are the standard LOM on your future computing device.

image

By the way some people really loved some of the drawn art I used in the presentation. For these I owe thanks to Kathy Sierra and her great blog art. It’s a shame she felt the need to go dark.

More stuff is in the work as working hard in IT never fails to deliver more good subjects, findings and results to blog about.

If you’re a managerial type and feel offended, you probably should, as you’re doing it all wrong Smile. Otherwise you would have smiled because you have a sense of humor and nodded at the mistakes of your colleagues.