I was asked to take a look at an issue with virtual machines losing network connectivity. The problems were described as follows:
Sometimes some VMs had connectivity, some times they didn’t. It was not tied to specific virtual machines. Sometimes the problem was not there, than it showed up again. It was, not an issue of a wrong subnet mask or gateway.
They suspected firmware or driver issues. Maybe it was a Windows NIC teaming bug or problems with DVMQ or NIC offload settings. There’s a lot of potential reasons, just Google Intermittent VM connectivity Issues Hyper-V and you’ll get a truckload of options.
So a round of wishful firmware, driver upgrading started. Followed by a round of wishful disabling network features. That’s one way to do it. But why not sit back an look at the issue.
Based on what they said I looked at the environment and asked it was tied to specific host as only VMs on one of the hosts had the issue. Could it be be after a live migration or a VM restart. They didn’t really know but it could. So we started looking at the hosts. All teams for the vSwitch were correctly configured on all host. No tagged VLAN on the member NIC. No extra team interfaces that would violate the rule that there can be only one if the team is used by a Hyper-V switch. They used the switch independent teaming mode with the load balancing mode set to Dynamic, all member active. Perfect.
I asked it they used tagged VLAN on the VMs some times. They said yes. Which gave me a clue they had trunking or general mode configured on the ports. So I looked at the switches to see what the port configuration was like? Guess what. All ports on both switches were correctly configured bar the ports of the vSwitch team members on one Hyper-V host. The one with problematic VMs. The two ports were in general mode but the port on the top switch had PVID* 100 and the one on the bottom switch had PVID 200. That was the issue. If the VM “landed” on the team member with PVID 200 it has no network connectivity.
* PVID (switchport general pvid 200) is the default VLAN of the port, in CISCO speak that would translate into “”native VLAN as in switchport trunk native vlan 200
Yes NIC firmware and drivers have issues. There are bugs or problems with advanced features once in a while. But you really do need to check that the configuration is correct and the design or setup makes sense. Do yourself a favor by not assuming anything. Trust but verify.
I attended and presented at E2EVC 2015 in Berlin from June 12th to June 14th. The networking was a blast. No “marchitecure” bull shit or vendor fairy tales what so ever and lots of very open discussions on the realities we’re seeing and facing in virtualization and cloud. Most account managers and esoteric presales would die a painful (but fast) death in this environment.
One session was with my Hyper-V Amigo buddy Carsten Rachfahl and was pure demo extravaganza, so no slides. My own session was “SMB Direct – The Secret Decoder Ring” and was an attempt to position this technology what by looking at the why and where followed by the how by who and when.
I hope a lot of people had at least a better understanding of SMB Direct, RDMA and DCB. The second aim was to take away the fear many people have of this tech by showcasing it in short demos. Time constraints where a challenge so it was not a 200 level session.
You cannot afford to ignore SMB3 and it’s capabilities related to storage traffic such as multichannel, RDMA and encryption. SMB Direct over RoCE seems to have a bright future as it continuous to evolve and improve in Windows Server 2106. The need for DCB (PFC and optionally ETS) intimidates some people. But it should not.
All these talks are at extremely affordable community driven events to make sure you can attend. The sessions are given by speakers who do this for the community (speakers and attendees do this in their own time and pay for their our own travel/expenses) and who work with these technology in real life and provide feedback to vendors on the issues or opportunities we see. This makes the sessions very interesting and anything but marketing, slide ware or sales pitches. See you there!
I recently had to trouble shoot a Windows Server 2012 R2 Hyper-V cluster where SMB Direct is leveraged for live migration. It seemed to work, sometime perfectly but at times it but it was in “slow” motion. The VMs got queued for live migration, it took some time for it started and sometimes it would finish or it would fail. This did not happen between all the nodes. I diligently checked out the SMB Direct network but that was OK on all nodes. Basically the LM network was perfectly fine.
To me this indicated that the hosts potentially had issues communicating with each other to coordinate the live migration. But pings and such looked good, there was connectivity, on the surface all seemed well. In the event log details we saw indications that this was indeed the case. Unfortunately I did not get the opportunity to take screenshots or copies of the events in this particular situation.
The nodes had a separate 2*1Gbps native team LAN access and backups. But diving deeper I noticed that they had set Jumbo Frames on some of those member NICs and not on others. So these setting differed from node to node and that was leading to the symptoms we described above.
You can use Jumbo Frames on your live migration network. Testing has shown this to be beneficial. When you’re doing SMB direct it won’t make such a big difference but it doen not hurt. When SMB Direct fails you’ll fall back to SMB with Multichannel and there it helps more! See Live Migration Can Benefit From Jumbo Frames. While SMB Direct (infiniband, RoCE & iWarp) know Jumbo frames the limited testing I have ever done there indicates only a small increase (2%) in throughput so I’m not sure it’s even worthwhile when doing RDMA.
When you can use Jumbo Frames on you host LAN NIC or team of NICs (handy is you use it to do backups as well) you need to be consistent end to end. Meaning ALL hosts, ALL NICS & all switches/ switch ports. Being inconsistent in this on the cluster nodes was what cause the slow to failing live migrations. You need to have good communications between the hosts themselves and AD. Just unplug the LAN from a Hyper-V cluster host to demo this => live migration from to that node and the rest of the cluster won’t work. Mismatching Jumbo Frames or potentially other network settings make this less obvious. Another “fun” example to trouble shoot is a NIC team where the member NICs are in different VLANs.