Do Azure VPN Gateways that leverage BGP support BFD?

Introduction

Anyone building redundant, highly available VPN gateways that leverage BGP (Border Gateway Protocol) has encountered BFD (Bidirectional Forwarding Detection). That said, BFD is not limited to BGP; it also works with OSPF and OSPF6. But before we answer whether Azure VPN Gateways that leverage BGP support BFD, I will briefly discuss what BFD does.

Bidirectional Forwarding Detection

The BFD (Bidirectional Forwarding Detection) protocol provides fast, efficient detection of link failures. It works even when the physical link itself has no failure detection support. As such, it helps routing protocols such as BGP fail over much more quickly than they could if left to their own devices.

BFD control packets are transmitted via UDP from source ports in the range 49152-65535 to destination port 3784 (single-hop; RFC 5880, RFC 5881, and RFC 5882) or 4784 (multi-hop; RFC 5883). It works over IPv4 as well as IPv6. See Bidirectional Forwarding Detection on Wikipedia for more information. Note that BFD works both between directly connected routers (single-hop) and between routers several hops apart (multi-hop).
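If you want to verify on the wire that BFD control packets are actually flowing, a capture filter on those UDP ports is enough. A minimal sketch, assuming a router or firewall where tcpdump is available; the interface name em0 is an assumption:

    # BFD control packets: single-hop (3784) and multi-hop (4784)
    tcpdump -ni em0 'udp and (dst port 3784 or dst port 4784)'
    # BFD echo packets, if the echo function is used, go to UDP port 3785
    tcpdump -ni em0 'udp dst port 3785'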

BFD support in routers and firewalls

Currently, OPNsense and pfSense, with the FRR (FRRouting) plugin, support BFD integration with BGP, Open Shortest Path First (OSPF), and Open Shortest Path First version 6 (OSPF6). Naturally, most vendors support this, but I mention OPNsense and pfSense because they offer free, fully functional products that are very handy for demos and lab testing.
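For reference, this is roughly what tying BFD to a BGP neighbor looks like in FRR's own configuration. It is a minimal sketch: the peer address 192.168.1.1, the AS numbers, and the timer values are made-up lab values, and on OPNsense or pfSense you would normally enter the equivalent settings through the plugin's GUI rather than editing frr.conf directly.

    bfd
     peer 192.168.1.1
      detect-multiplier 3
      receive-interval 300
      transmit-interval 300
      no shutdown
     !
    !
    router bgp 65010
     neighbor 192.168.1.1 remote-as 65020
     ! Tell bgpd to tear down the session as soon as bfdd declares the peer down
     neighbor 192.168.1.1 bfd
    !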

Do Azure VPN Gateways that leverage BGP support BFD?

TL;DR: No.

You will not find much information when you search for BFD in the context of Microsoft Azure networking. Only for Azure ExpressRoute does Microsoft clearly state that BFD is supported and provide documentation.

But what about Azure VPN gateways with BFD? Well, no, that is not supported at all. You can try to set it up, but your on-premises VPN gateway will be shouting into a void. The session status with the peers will always be “down.” It just won’t work.
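If you do try it anyway, the quickest way to confirm that nothing is coming back from the Azure side is to ask FRR for its BFD peer state on the on-premises firewall. A minimal sketch, assuming vtysh is available there:

    # Summary of all BFD sessions and their state; against an Azure VPN gateway it stays down
    vtysh -c "show bfd peers brief"
    # Per-peer details, including counters and the configured timers
    vtysh -c "show bfd peers"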

A VM that would not route

This blog post addresses a troubleshooting exercise with a VM (virtual machine) that would not route. As it turned out, it had a default gateway set to 0.0.0.0 next to the actual gateway IP address. The VM still did its job, as the workload it serves is in the same subnet as its clients and, as it happens, in the same subnet as the DC and DNS server. That also meant it did not lose its trust relationship with Active Directory.

But the admins could not RDP into that VM, nor would it update. Because it still did its job, many months went by until it fell so far behind in updates that they could not ignore it anymore. That’s how things go in real life.

Finding & fixing the issue

Superficially, the configuration of the VM was totally OK. The gateway for the NIC was correct.

Under Advanced, we saw no other entries that would cause any issues.

But we could not deny that we had a VM that would not route on our hands. Let’s figure this out and fix it.

So what does one do? If you don’t trust the CLI, check the GUI, and if you don’t trust the GUI, check the CLI. As everything seemed fine in the GUI, we checked via the CLI. Name resolution worked fine, both internally and externally, when checking with nslookup. But actually reaching anything outside the VM’s own subnet failed. Naturally, I did check whether a forward proxy was in play, but that was not the case, and the issue affected more than just HTTP(S).
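The checks themselves were nothing fancy, just the usual suspects from a command prompt. A minimal sketch; the host names and IP addresses below are placeholders, not the customer’s:

    :: Name resolution, internal and external, worked fine
    nslookup dc01.domain.local
    nslookup www.microsoft.com
    :: Hosts in the VM's own subnet answered ...
    ping 192.168.2.10
    :: ... but anything beyond the default gateway did not
    ping 192.168.1.10
    tracert 8.8.8.8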

When I ran ipconfig /all, I quickly saw the culprit: a Default Gateway entry pointing to 0.0.0.0 next to the one for the actual gateway!

Where did that come from, then? Not from the GUI settings, as far as we could see. So I ran route print, and that showed us the root cause.
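If you do not want to wade through the full routing table, you can also ask for the default routes specifically. A minimal sketch; both commands ship with current Windows versions:

    :: Full IPv4 routing table, including the Persistent Routes section at the bottom
    route print -4
    :: Or, via PowerShell, list only the default routes (0.0.0.0/0)
    powershell -Command "Get-NetRoute -DestinationPrefix '0.0.0.0/0'"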

So we needed to drop the persistent route that sent traffic for 0.0.0.0 to the VM’s own IP address as the gateway. They had missed this because it does not show up in the GUI at all.

I dropped all routes for 0.0.0.0 via route delete 0.0.0.0 mask 0.0.0.0 and then checked via route print that all persistent routes were indeed gone.

At that point, routing does not work at all, and we need to add the gateway back to the NIC. You can use the GUI or run route add -p 0.0.0.0 MASK 0.0.0.0 192.168.2.7 IF 9 (see the sketch below). Once I did that, things lit up: we could download and install updates from the WSUS server, and the admins had remote desktop access again. Routing worked again, in other words.
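Put together, the cleanup boiled down to three commands. A minimal sketch; 192.168.2.7 is the gateway in this environment and interface index 9 came from route print, so adjust both to your own values:

    :: Remove every route for 0.0.0.0/0, including the bogus persistent one
    route delete 0.0.0.0 mask 0.0.0.0
    :: Verify that the default routes, persistent ones included, are gone
    route print -4
    :: Add the correct default gateway back as a persistent route on interface 9
    route add -p 0.0.0.0 MASK 0.0.0.0 192.168.2.7 IF 9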

How did it happen? Ah, somewhere, somehow, someone added that route. I am not paid to do archaeology or forensics in this case, so I did not try to find out the what, when, and why. But my guess is that the VM had another NIC at some point with those settings, and it was removed from the Hyper-V settings without cleaning up, leaving that entry behind in the routing table and a 0.0.0.0 gateway on the NIC that is only visible via ipconfig /all. Or someone tried to add a gateway manually to address this or another issue.

A final note

When faced with this issue, some folks on the internet will tell you to reset the TCP/IP stack and Winsock with netsh, or to add a new NIC with a new IP (statically or via DHCP) and dump the old one. But that is all a bit drastic. Find the root cause and try to fix that first. You can resort to drastic measures when all else fails.
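For completeness, the drastic netsh measures people usually mean are these; both reset far more than this one setting and want a reboot afterwards:

    :: Reset the TCP/IP stack to its defaults
    netsh int ip reset
    :: Reset the Winsock catalog
    netsh winsock reset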