A VM that would not route

This blog post covers a troubleshooting exercise with a VM (virtual machine) that would not route. As it turned out, it had a default gateway of 0.0.0.0 configured next to the actual gateway IP address. The VM still did its job, as the workload it serves sits in the same subnet as its clients, which also happens to be the subnet of the DC and DNS. That meant it never lost its trust with Active Directory.

But the admins could not RDP into that VM, nor would it install updates. Since it kept doing its job, many months went by until it fell so far behind on updates that they could not ignore it anymore. That’s how things go in real life.

Finding & fixing the issue

Superficially, the configuration of the VM was totally OK. The gateway for the NIC was correct.

Under Advanced, we saw no other entries that would cause any issues.

But we could not deny that we had a VM on our hands that would not route. Let’s figure this out and fix it.

So what does one do? If you don’t trust the CLI, check the GUI, and if you don’t trust the GUI, check the CLI. As everything seemed fine in the GUI, we checked via the CLI. Name resolution worked fine, both internally and externally, when testing with nslookup. But actually reaching anything outside the local subnet failed. Naturally, I checked whether a forward proxy was in play, but that was not the case, and the issue affected more than just HTTP(S).
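
In practice, those checks looked something like the lines below. This is just a sketch; the host names and the off-subnet address are hypothetical stand-ins for the real environment.

  nslookup dc01.corp.local        # internal name resolution - worked
  nslookup www.microsoft.com      # external name resolution - worked
  ping 10.10.20.5                 # a host in another subnet - no reply
  tracert 10.10.20.5              # traffic never got past the local subnet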

When I ran ipconfig /all, I quickly saw the culprit: a Default Gateway entry pointing to 0.0.0.0 next to the one for the actual gateway!
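
In the NIC section of the ipconfig /all output, that looks roughly like the abridged, illustrative snippet below (192.168.2.7 is the actual gateway used later in this post):

  Default Gateway . . . . . . . . . : 0.0.0.0
                                      192.168.2.7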

So where did that come from? Not from the GUI settings, as far as we could see. So I ran route print, and that showed us the root cause.
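
The Persistent Routes section of route print held an entry along the lines of the illustrative snippet below, with the VM’s own address (a made-up 192.168.2.10 here) listed as the gateway for 0.0.0.0 instead of 192.168.2.7:

  Persistent Routes:
    Network Address          Netmask  Gateway Address  Metric
            0.0.0.0          0.0.0.0     192.168.2.10  Default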

So we needed to drop the persistent route that sent traffic for 0.0.0.0 to the VM’s own IP address as the gateway. The admins had missed this because it does not show up in the GUI at all.

I dropped all persistent routes for 0.0.0.0 via route delete 0.0.0.0 mask 0.0.0.0 and verified via route print that all of them were gone.

At that moment routing does not work at all, so we need to add the gateway back to the NIC. You can use the GUI or run route add -p 0.0.0.0 MASK 0.0.0.0 192.168.2.7 IF 9. Once I did that, things lit up. We could download and install updates from the WSUS server, and the admins had remote desktop access again. In other words, routing worked again.
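
Put together, the fix boiled down to the three commands below, run from an elevated prompt. The gateway 192.168.2.7 and interface index 9 are specific to this environment; expect a brief loss of off-subnet connectivity between the delete and the add.

  route delete 0.0.0.0 mask 0.0.0.0                    # drop all persistent 0.0.0.0 routes, including the rogue one
  route print                                          # verify the Persistent Routes section is clean
  route add -p 0.0.0.0 MASK 0.0.0.0 192.168.2.7 IF 9   # re-add the real default gateway persistently on interface 9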

How did it happen? Ah, somewhere, somehow, someone added that route. I am not paid to do archaeology or forensics in this case, so I did not try to find out the what, when, and why. My guess is that the VM had another NIC at some point with those settings, and that it was removed from the Hyper-V settings without cleaning up, leaving that entry behind in the routing table and a 0.0.0.0 gateway on the NIC that is only visible via ipconfig /all. Or someone tried to add a gateway manually to address this or another issue.

A final note

When faced with this issue, some folks on the internet will tell you to reset the TCP/IP stack and Winsock with netsh, or to add a new NIC with a new IP (static or via DHCP) and dump the old one. But that is all a bit drastic. Find the root cause and try to fix that first. You can try drastic measures when all else fails.
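
For reference, those more drastic resets are typically done with the well-known netsh commands below, each followed by a reboot. Again, keep them as a last resort.

  netsh winsock reset      # reset the Winsock catalog
  netsh int ip reset       # reset the TCP/IP stack to its defaults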

SMB over QUIC POC

I have had the distinct pleasure of being one of the first people to implement an SMB over QUIC POC (proof of concept). I did it with Windows Server 2022 Azure Edition while it was in public preview.

That was a fun and educational exercise, and I learned a lot. As a result, I decided to write a lab and test guide, primarily for my own reference, but also to share my experience with others.

I am so happy I did this POC, and I am very happy with the results!

You can read the lab guide in a two-part series of articles: SMB over QUIC: How to use it – Part I | StarWind Blog (starwindsoftware.com) and SMB over QUIC Testing Guide – Part II | StarWind Blog (starwindsoftware.com).

I am convinced it will fill a need for people who require remote access to SMB file shares without a VPN. On top of that, the integration with the KDC proxy service makes it a Kerberos-integrated solution. In addition, the KDC proxy service has the added benefit of allowing remote password changes.
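
To give you an idea of what that looks like in practice, here is a minimal PowerShell sketch. The server name, share, drive letter, and thumbprint are placeholders, and it assumes the rest of the setup (certificates, KDC proxy) is done as described in the lab guide.

  # On the Windows Server 2022 Azure Edition file server: bind a certificate to the SMB server name.
  New-SmbServerCertificateMapping -Name "fs01.contoso.com" -Thumbprint "<thumbprint>" -StoreName "My"

  # On a Windows 11 client: map the share over QUIC (UDP 443), no VPN required.
  New-SmbMapping -LocalPath "Z:" -RemotePath "\\fs01.contoso.com\data" -TransportType QUIC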

If you need to get up to speed on what SMB over QUIC is all about, I refer you to my article SMB over QUIC Technology | StarWind Blog (starwindsoftware.com). I’m sure it will do the trick.

Finally, I hope you will find these articles useful. I’m pretty sure they will help you with your own SMB over QUIC POC and testing.

Thank you for reading!

Microsoft and QUIC

If you are interested in Microsoft and QUIC, I have some good news for you. Recently a new article, SMB over QUIC Technology | StarWind Blog (starwindsoftware.com), was published. It is the first in a series about what Microsoft is working on in regard to QUIC. While not without some controversy, QUIC addresses a number of issues that connectivity over “the internet at large” has been dealing with:

  • It leverages UDP.
  • TLS 1.3 is built into the protocol.
  • It reduces RTT during connection and encryption setup.
  • It handles and optimizes flow control and loss recovery.
Figure: QUIC reduces the round trips during the TLS handshake significantly.

Over the internet, with mobile clients, this is a big deal. Since it is secure by default, people have really started thinking about where it can be used to improve things for everyone involved.

I think QUIC is going to be more and more important in the future and this article positions QUIC in the Microsoft ecosystem. So, head over there, read it, and let me know what you think.

TLS 1.3, QUIC, HTTP/3, and SMB 3.1.1 are shaking things up a bit by challenging TCP. Microsoft dropped QUIC into Windows Server 2022 Azure Edition. That went into public preview last week, and I dove into the lab to figure out what I could do with it.

As a technologist, I am having a lot of fun testing this out in the lab. Last weekend I was busy with SMB over QUIC and QUIC in IIS. I learned a lot and have made up my mind that I can use this in the real world to solve actual requirements. I will share my findings and musings with you in the near future. But for today, start with an introduction in SMB over QUIC Technology | StarWind Blog (starwindsoftware.com).

Check vhid, password and virtual IP address Kemp LoadMaster

Introduction

Recently I was implementing a highly available Kemp LoadMaster X15 system. I prepared everything, documented the switch and LM-X15 configuration, and created a Visio diagram to visualize it all. That, together with the migration and rollback scenario, was presented to the team lead and the engineer who was going to work on this with us. I told the team lead that all would go smoothly if my preparations were good and I did not make any mistakes in the configuration. Well, guess what, I made a mistake somewhere and had to solve a Kemp LoadMaster Bad digest – md2=[31084da3…] md=[20dcd914…] – Check vhid, password and virtual IP address log entry.

Check vhid, password and virtual IP address

While all was working well, we saw the following entry inundate the system message file log:

<date> <LoadMasterHostName> ucarp[2193]: Bad digest – md2=[xxxxx…] md=[xxxxx…] – Check vhid, password and virtual IP address

Every second …

Wait a minute, as far as I knew all was OK. The VHID was unique to the HA pair, and we did not have duplicate IP addresses set anywhere on other network appliances. So what was this about?

Figuring out the cause

Well, we have bond0 on eth0 and eth2 for appliance management. We also have eth1, which is a special interface used for L1 health checks between the LoadMasters. We don’t use a direct link (the appliances are in different racks), so we configure those interfaces with an IP in a separate, dedicated subnet. Then we have the bonds with the VLANs for the actual workloads via Virtual Services.

We have heartbeat health checks on bond0, on eth1, and on at least one VLAN per bond for the workloads.

Confirm that Promiscuous mode and PortFast are enabled. Check!
HA is configured for multicast traffic in our setup so we confirm that the switch allows multicast traffic. Check!

Make sure that switch configurations that block multicast traffic, such as ‘IGMP snooping’, are disabled on the switch/switch ports as needed. Check!

Now let’s look at the remaining possible causes and check our configuration.

So what else? The documentation lists the following other possible causes:

  1. There is another device on the network with the same HA Virtual ID. The LoadMasters in an HA pair should have the same HA Virtual ID, but it is possible that a third device is interfering with these units. As of LoadMaster firmware version 7.2.36, the LoadMaster selects an HA Virtual ID based on the last 8 bits of the shared IP address of the first configured interface (for example, a shared IP ending in .21 would yield a default HA Virtual ID of 21). You can change the value to whatever number you want (in the range 1 – 255), or you can keep the value already selected. Check!
  2. An interface used for HA checks is receiving a packet from a different interface/appliance. If the LoadMaster has two interfaces connecting to the same switch, with Use for HA checks enabled, this can also cause these error messages. Disable the Use for HA checks option on one of the interfaces to confirm the issue. If confirmed, either leave the option disabled or move the interface to a separate switch.

I am sure there was no interference from another appliance. Check! As we had ruled out every other possible cause, the second cause in that list caught my attention. Could it be?

Time for some packet captures

So we took a TCP dump on bond0 and looked at it in Wireshark. You can take a TCP dump via the debug options under System Log Files.

Debug Options; once there, find TCP dump.

Select your interface, click Start, wait 10 seconds or so, then click Stop and download the dump.


Do note that Wireshark identifies this traffic as VRRP, but the LoadMaster uses CARP (an open-source alternative), so set Wireshark to decode it as CARP; that way you will see more interesting information in the Info column.

No, not proprietary VRRP but CARP

Also filter on ip.dst == 224.0.0.18 (the CARP/VRRP multicast address). What we see is that on eth0 we receive multicasts coming from eth1. That is exactly the case described in the documentation. Aha!

Aha, we see CARP multicasts from eth1 on eth0; that is what we call a clue!

So now what, do we need to move eth1 to another switch to solve this? Or disable the HA check? No, luckily not. Read on.

The fix for Check vhid, password and virtual IP address

No, I did not use one or more separate switches just to plug in the HA heartbeat interfaces of the LoadMasters. What I did was create a separate VLAN for the eth1 HA heartbeat uplink interfaces on the switches. This way I make sure they are in a separate multicast group from the management interface uplinks on the switches.

By selecting a different VLAN for the management and heartbeat interface uplinks, they end up in different Multicast TV VLAN groups by default.

By default, the Multicast TV VLAN membership is per VLAN. The reason the actual workload interfaces did not cause an issue when we enabled HA checks on them is that those were trunk ports with a number of allowed VLANs, all different from the management VLAN, which prevented this error from being logged in the first place.

That this works was confirmed by a packet trace from the LM-X15 after making the change.

No more packets received from a different interface. Mission accomplished.

So that was it. The error was gone and we could move along with the project.

Conclusion

Well, I should have known, as normally I not only put those networks in a separate subnet but also make sure they are on different VLANs. This goes to show that no matter how experienced you are and how well you prepare, you will still make mistakes. That’s normal and that’s OK; it means you are actually doing something. The key is how you deal with a mistake, and that is why I wrote this: to share how I found the root cause of the issue and how I fixed it. Mistakes are a learning opportunity, so use them as such. I know many organizations frown upon mistakes, but really, they should grow up and stop acting so silly.