The Perfect Storm of Azure DNS resolver, a custom DNS resolver, and DNS configuration ambiguities

TL;DR

The very strict Azure recursive DNS resolver, when combined with a custom DNS resolver, can cause a timeout-sensitive application to experience service disruption due to ambiguities in a third party's DNS NS delegation configuration.

Disclaimer

I am using fantasy FQDNs and made-up IP addresses here. Not the real ones involved in the issue.

Introduction

A GIS-driven business noticed a timeout issue in one of its services. Upon investigation, this was believed to be a DNS issue. That turned out to be the case, but not due to a network or DNS infrastructure error, let alone a gross misconfiguration.

The Azure platform DNS resolver (168.63.129.16) is a high-speed and very strict resolver. While it can return the IP information for the name in question, it also indicates a server error.

nslookup pubdata.coast.be
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
pubdata.coast.be    canonical name = www.coast.be.
Name:   www.coast.be
Address: 154.152.150.211
Name:   www.coast.be
Address: 185.183.181.211

** server can't find www.coast.be: SERVFAIL

Azure handles this by responding fast and reporting the issue. The custom DNS service, which provides DNS name resolution for the service by forwarding recursive queries to the Azure DNS resolver, reports the same problem, but not nearly as fast as Azure. Here, it takes 8 seconds (the recursive query timeout value), potentially 4 seconds more due to the additional timeout value. So while DNS works, something is wrong, and the extra time before the timeout occurs is what causes the service issues.
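
You can see the difference for yourself by timing the query against both resolvers. A minimal sketch in PowerShell, assuming you run it from a VM that can reach both; 10.0.0.4 is a made-up stand-in for the custom DNS server's address:

# Time the same failing query against the Azure platform resolver and
# the custom DNS server (10.0.0.4 is a stand-in address).
foreach ($server in '168.63.129.16', '10.0.0.4') {
    $t = Measure-Command {
        Resolve-DnsName -Name 'pubdata.coast.be' -Server $server -ErrorAction SilentlyContinue
    }
    '{0} answered in {1:N0} ms' -f $server, $t.TotalMilliseconds
}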

When I was first asked to help out, my questions were whether it had ever worked and whether anything had changed. The next question was whether they had any control over the timeout period to adjust it upward, which would allow the service to function correctly. The latter was not possible, or at least not easy, so they came to me for troubleshooting and a potential workaround or fix.

So I dove in with the tools of the trade: nslookup, nameresolver, dig, https://dnssec-analyzer.verisignlabs.com/, and https://dnsviz.net/. The usual suspects were DNSSEC and zone delegation mismatches.

First, I ran:

nslookup -debug pubdata.coast.be

In the output, we find:

Non-authoritative answer:
Name:    www.coast.be
Addresses:  154.152.150.211
            185.183.181.211
Aliases:  pubdata.coast.be

We learn that pubdata.coast.be is a CNAME for www.coast.be. Let’s see if any CNAME delegation or DNSSEC issues are in play. Run:

dig +trace pubdata.coast.be

;; global options: +cmd
.                       510069  IN      NS      a.root-servers.net.
.                       510069  IN      NS      b.root-servers.net.
...
.                       510069  IN      NS      l.root-servers.net.
.                       510069  IN      NS      m.root-servers.net.
.                       510069  IN      RRSIG   NS 8 0 518400 20250807170000 20250725160000 46441 . <RRSIG_DATA_ANONYMIZED>
;; Received 525 bytes from 1.1.1.1#53(1.1.1.1) in 11 ms

be.                     172800  IN      NS      d.nsset.be.
...
be.                     172800  IN      NS      y.nsset.be.
be.                     86400   IN      DS      52756 8 2 <DS_HASH_ANONYMIZED>
be.                     86400   IN      RRSIG   DS 8 1 86400 20250808050000 20250726040000 46441 . <RRSIG_DATA_ANONYMIZED>
;; Received 753 bytes from 198.41.0.4#53(a.root-servers.net) in 13 ms

coast.be.               86400   IN      NS      ns1.corpinfra.be.
coast.be.               86400   IN      NS      ns2.corpinfra.be.
<hash1>.be.             600     IN      NSEC3   1 1 0 - <next-hash1> NS SOA RRSIG DNSKEY NSEC3PARAM
<hash1>.be.             600     IN      RRSIG   NSEC3 8 2 600 20250813002955 20250722120003 62188 be. <RRSIG_DATA_ANONYMIZED>
<hash2>.be.             600     IN      NSEC3   1 1 0 - <next-hash2> NS DS RRSIG
<hash2>.be.             600     IN      RRSIG   NSEC3 8 2 600 20250816062813 20250724154732 62188 be. <RRSIG_DATA_ANONYMIZED>
;; Received 610 bytes from 194.0.37.1#53(b.nsset.be) in 10 ms

pubdata.coast.be.       3600    IN      CNAME   www.coast.be.
www.coast.be.           3600    IN      NS      dns-lb1.corpinfra.be.
www.coast.be.           3600    IN      NS      dns-lb2.corpinfra.be.
;; Received 151 bytes from 185.183.181.135#53(ns1.corpinfra.be) in 12 ms

The DNSSEC configuration is not the issue, as the signatures and DS records appear to be correct. So the delegation inconsistency is what causes the SERVFAIL, and the duration of the custom DNS server's recursive query timeout is what causes the service issues.

The real trouble is here:

pubdata.coast.be.    3600 IN CNAME www.coast.be.
www.coast.be.        3600 IN NS    dns-lb1.corpinfra.be.

This means pubdata.coast.be is a CNAME to www.coast.be, but www.coast.be is served by a different set of name servers than the parent zone (coast.be uses ns1/ns2.corpinfra.be). This creates a delegation inconsistency:

The resolver must follow the CNAME and query a different set of nameservers. If those nameservers don’t respond authoritatively or quickly enough, or if glue records are missing, resolution may fail.

Strict resolvers (such as Azure DNS) may treat this as a lame delegation or a broken chain, even if DNSSEC is technically valid.
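
Before blaming anyone, it is worth checking whether those name servers answer at all when queried directly. A quick sketch with Resolve-DnsName, reusing the fictitious names from this story:

# Query the name servers that the coast.be zone delegates www.coast.be to.
# Slow or missing answers here are what the recursive resolver runs into.
foreach ($ns in 'dns-lb1.corpinfra.be', 'dns-lb2.corpinfra.be') {
    Resolve-DnsName -Name 'www.coast.be' -Type A -Server $ns -DnsOnly
}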

Workarounds

I have already mentioned that fixing the issue in the service configuration setting was not on the table, so what else do we have to work with?

  • A quick workaround is to use the Azure platform DNS resolver (168.63.129.16) directly, which, thanks to its speed, avoids the extra time needed to finalize the query. However, due to other DNS requirements in the environment, this workaround is not always an option.
  • The other is to reduce the recursive query timeout and additional timeout values on the custom DNS solution. That is what we did to resolve the issue as quickly as possible: the timeout value is now 2 seconds (the default is 8) and the additional timeout value is now 2 seconds (the default is 4); the sketch after this list shows how. Monitor this to ensure that no other problems arise after taking this action.
  • Third, we could conditionally forward coast.be to the dns-lb1.corpinfra.be and dns-lb2.corpinfra.be NS servers (also shown in the sketch after this list). That works, but it requires maintenance whenever those name servers change, so we would need to keep an eye on that, and we already have enough work.
  • A fourth workaround is to have the application code query a public DNS server, such as 1.1.1.1 or 8.8.8.8, for the IP address whenever pubdata.coast.be is involved. This is tedious and not desirable.
  • The most elegant solution would be to fix the DNS configuration that Azure has an issue with. That is out of our hands, but it can be requested from the responsible parties. For that purpose, you will find the details of our findings below.
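
For reference, here is what the second and third workarounds look like, assuming the custom DNS solution is Windows Server DNS (the defaults of 8 and 4 seconds quoted above are that product's recursion timeout defaults). The forwarder addresses are made up for this sketch:

# Workaround 2: lower the recursion timeout (default 8 seconds) and the
# additional timeout (default 4 seconds) so the SERVFAIL comes back faster.
Set-DnsServerRecursion -Timeout 2 -AdditionalTimeout 2

# Workaround 3: conditionally forward coast.be to its own name servers.
# 203.0.113.10 and 203.0.113.11 stand in for dns-lb1/dns-lb2.corpinfra.be.
Add-DnsServerConditionalForwarderZone -Name 'coast.be' -MasterServers 203.0.113.10, 203.0.113.11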

Issue Summary

The .be zone delegates coast.be to the NS servers:

dns-lb1.corpinfra.be
dns-lb2.corpinfra.be

However, the coast.be zone itself lists different NS servers:

ns1.corpinfra.be
ns2.corpinfra.be

This discrepancy between the delegation NS records in .be and the authoritative NS records inside the coast.be zone is a violation of DNS consistency rules.

Some DNS resolvers, especially those performing strict DNSSEC and delegation consistency checks, such as the Azure native DNS resolver, interpret this as a misconfiguration and return SERVFAIL errors. This happens even when the IP address(es) for pubdata.coast.be can indeed be resolved.

Other resolvers (e.g., Google Public DNS, Cloudflare) may be more tolerant and return valid answers despite the mismatch, without mentioning any issue.
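
You can detect this class of problem yourself by comparing the NS set the parent hands out with the NS set the zone itself serves. A rough sketch, reusing the fictitious names from the trace (dig gives a cleaner view of referrals, but this shows the idea):

# NS records for coast.be as seen at the parent (.be) and at the zone itself.
$parent = Resolve-DnsName -Name 'coast.be' -Type NS -Server 'b.nsset.be' -DnsOnly
$child  = Resolve-DnsName -Name 'coast.be' -Type NS -Server 'ns1.corpinfra.be' -DnsOnly
# Any output here means the delegation and the zone disagree.
Compare-Object ($parent.NameHost | Sort-Object) ($child.NameHost | Sort-Object)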

Why could this be a problem?

DNS relies on consistent delegation to ensure:

  • Security
  • Data integrity
  • Reliable resolution

When delegation NS records and authoritative NS records differ, recursive resolvers become uncertain about the actual authoritative servers.

This uncertainty often triggers a SERVFAIL: when NS records differ between parent and child zones, resolvers may reject responses rather than risk returning stale or spoofed data.

Overview

Zone level      NS records                                    Notes
.be (parent)    dns-lb1.corpinfra.be, dns-lb2.corpinfra.be    Delegation NS for coast.be
coast.be        ns1.corpinfra.be, ns2.corpinfra.be            Authoritative NS for the zone

Corpinfra.be (see https://www.dnsbelgium.be/nl/whois/info/corpinfra.be/details; this is an example, the domain is fictitious) operates all four NS servers, which resolve to IPs in the same subnet, but the naming inconsistency causes delegation mismatches.

Recommended Fixes

Option 1: Update coast.be zone NS records to match the delegation NS

Add dns-lb1.corpinfra.be and dns-lb2.corpinfra.be as NS records in the coast.be zone alongside existing ones (ns1 and ns2), so the zone’s NS RRset matches the delegation.

coast.be.   IN  NS  ns1.corpinfra.be.
coast.be.   IN  NS  ns2.corpinfra.be.
coast.be.   IN  NS  dns-lb1.corpinfra.be.
coast.be.   IN  NS  dns-lb2.corpinfra.be.

Option 2: Update .be zone delegation NS records to match the zone’s NS records

Change the delegation NS records in the .be zone to use only:

ns1.corpinfra.be
ns2.corpinfra.be

and remove dns-lb1.corpinfra.be and dns-lb2.corpinfra.be.

Option 3: Align both the .be zone delegation and coast.be NS records to a consistent unified set

Either use only ns1.corpinfra.be and ns2.corpinfra.be for both the delegation and the authoritative zone NS records, or use only dns-lb1.corpinfra.be and dns-lb2.corpinfra.be for both. Or use all of them; three or more geographically dispersed DNS servers are recommended anyway. It depends on who owns and manages the zones.

What to choose?

Option  Description                                         Pros                           Cons
1       Add dns-lb1 and dns-lb2 to the zone file            Quick fix, minimal disruption  The zones may be managed by different entities
2       Update .be delegation to match zone NS (ns1, ns2)   Clean and consistent           Requires coordination with DNS Belgium
3       Unify both delegation and zone NS records           Most elegant                   Requires full agreement between all parties

All three options are valid, but Option 3 is the most elegant and future-proof. That said, the current configuration is valid as is, and one might argue that the strictness of Azure's DNS resolver is the cause of the issue. Sure, but in a world where DNSSEC is growing in importance, such strictness might well become more common. Additionally, if the service configuration could handle a longer timeout, that would also address this issue. However, that is outside my area of responsibility.

Simulation: Resolver Behavior

Resolver             Behavior with mismatch   Notes
Azure DNS resolver   SERVFAIL                 Strict DNSSEC & delegation checks
Google Public DNS    Resolves normally        Tolerant of NS mismatches
Cloudflare DNS       Resolves normally        Ignores delegation inconsistencies
Unbound (default)    May vary                 Depends on configuration flags
BIND (strict mode)   SERVFAIL                 Enforces delegation consistency

Notes

  • No glue records are needed for coast.be, because the NS records point to a different domain (corpinfra.be), so-called out-of-bailiwick name servers, and .be correctly delegates using standard NS records.
  • After changes, flush the DNS caches (a quick sketch follows below)
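
Flushing those caches is a one-liner on each end, assuming Windows on both:

# On clients:
Clear-DnsClientCache
# On the custom (Windows) DNS servers:
Clear-DnsServerCache -Force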

Conclusion

When wading through the RFCs, we can summarize the findings as follows:

RFC Summary: Parent vs. Child NS Record Consistency

RFC       Section       Position on NS matching       Key takeaway
RFC 1034  §4.2.2        No mandate on matching        Describes resolver traversal and authoritative zones, not strict delegation consistency
RFC 1034  §6.1 & §6.2   No strict matching rule       Discusses glue records and zone cuts, but doesn't say they must be aligned
RFC 2181  §5.4.1        Explicit: child may differ    The parent's NS records are not authoritative for the child; the child can define its own set
RFC 4035  §2.3          DNSSEC implications           Mismatched NS sets can cause issues with DNSSEC validation if not carefully managed
RFC 7719  Glossary      Reinforces delegation logic   Delegation does not imply complete control or authority over the child zone

In a nutshell, RFC 2181 Section 5.4.1 is explicit: the NS records in a parent zone are authoritative only for that parent, not for the child. That means the child zone can legally publish entirely different NS records, and the RFC allows it. So, why is there an issue with some DNS resolvers, such as Azure?

Azure DNS “Soft” Enforces Parent-Child NS Matching

Azure DNS resolvers implement strict DNS validation behavior, which aligns with principles of security, reliability, and operational best practice, not just the letter of the RFC. This is a soft enforcement; the name resolution does not fail.

Why

1. Defense Against Misconfigurations and Spoofing

Mismatched NS records can indicate stale or hijacked delegations.

Azure treats mismatches as potential risks, especially in DNSSEC-enabled zones, and returns SERVFAIL to warn about potentially spoofed responses, but it does not fail the name resolution.

2. DNSSEC Integrity

DNSSEC depends on a trusted chain of delegation.

If the parent refers to NS records that don’t align with the signed child zone, validation can’t proceed.

Azure prioritizes integrity over leniency, which is why there is stricter enforcement.

3. Predictable Behavior for Enterprise Networks

In large infrastructures (like hybrid networks or private resolvers), predictable resolution is critical.

Azure's strict policy ensures that DNS resolution failures are intentional and traceable, not silent or inconsistent as in looser implementations.

4. Internal Resolver Design

Azure resolvers often rely on cached referral points.

When those referrals don’t match authoritative data at the zone apex, Azure assumes the delegation is unreliable or misconfigured and aborts resolution.

Post Mortem summary

Azure DNS resolvers enforce delegation consistency by returning a SERVFAIL error when parent-child NS records mismatch, thereby signaling resolution failure rather than silently continuing or aborting. While RFC 2181 §5.4.1 allows child zones to publish different NS sets than the parent, Azure chooses to explicitly flag inconsistencies to uphold DNSSEC integrity and minimize misconfiguration risks. This deliberate error response enhances reliability in enterprise environments, ensuring resolution failures are visible, traceable, and consistent with secure design principles.

This was a perfect storm. A too-tight timeout setting in the service (which I do not control), combined with the rigorous behavior of the Azure DNS resolver, fronted by a custom DNS solution that is required to serve all possible DNS needs in the environment, led to longer recursive DNS resolution times that finally tripped up the calling service.

Hot add/remove of network adapters and enabling device naming in Windows Server Hyper-V

One of the cool new features in Windows Server vNext Hyper-V (in Technical Preview at the moment of writing) is that you gain the ability to hot add and remove NICs. That might not sound too important to the non-initiated in the fine art of virtualization & clouds. But it is. You see, anything you can do to a VM configuration-wise that does not require downtime is good. That's what helps shift the needle of high availability toward that holy grail of continuous availability.

On top of that, the names of the network adapters are now exposed to the guest, which is also great. It makes it a lot easier to automate the VM network configuration.

Hot adding NICs can be done via the GUI and PoSh.

But naming the network adapter seems a PowerShell-only game for now (nothing hard, no sweat). This can be done when creating the network adapter. Here I add a NIC to VM RAGNAR, connected to the ISCSI-GUEST switch and named ISCSI.

Add-VMNetworkAdapter -VMName RAGNAR -SwitchName ISCSI-GUEST -Name ISCSI

Now I want this name to be reflected in the VM's NIC configuration properties. This is done by enabling device naming. You can do this via the GUI or PoSh.

Set-VMNetworkAdapter -VMName RAGNAR -Name ISCSI -DeviceNaming On

That’s it.
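
To verify what you ended up with, you can list the VM's network adapters; a quick check (the property names are as I see them in the Technical Preview, so treat them as subject to change):

Get-VMNetworkAdapter -VMName RAGNAR | Format-Table Name, SwitchName, DeviceNaming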

So now let's play with our existing network adapter "Network Adapter", which connects our Hyper-V guests to the LAN via the HYPER-V-GUESTS switch. Can you rename it? Yes, you can. In PoSh, run this:

Rename-VMNetworkAdapter -VMName RAGNAR -Name "Network Adapter" -NewName "LAN"

And that's it. If you refresh the settings of your VM or reopen them, you'll see the name change.

The one thing that I see in the Technical Preview is that I need to reboot the VM to see the name change reflected inside the VM in the NIC configuration under advanced properties, in the field called "Hyper-V Network Adapter Name". Existing ones show their old name and new ones are empty until then.


Two important characteristics to note about enabling device naming

You'll notice that one can edit this field in the NIC configuration inside the VM, but it doesn't move up the stack into the settings of the VM. Security-wise, this seems logical to me, and it's not intended to work. It's a GUI limitation that the field cannot be disabled for editing, but no one can try to be "funny" by renaming the Ethernet adapter in the VM's settings via the guest ;)

Do note that this is not exactly the same as Consistent Device Naming in Windows 2012 or later. It's not reflected in the name of the NIC in the GUI; these are still Ethernet, Ethernet 2, … Device naming is mainly meant to enable identifying the adapter assigned to the VM inside the VM, mainly for automation. You can name the NIC in the guest whatever works best for you, and you'll never lose the correlation between the network adapter in your VM settings and the Hyper-V Network Adapter Name in the NIC configuration properties. In that respect, it is a bit more solid/permanent: even if someone found it funny to rename all vNICs to random names, you'd still be OK with this feature.
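
As an illustration of that automation angle, here is a minimal sketch I'd run inside the guest to correlate the vNICs with the names set in the VM configuration (assuming a Windows Server 2012 R2 or newer guest where the advanced property shows up):

# Map each guest NIC to the Hyper-V network adapter name set on the host.
Get-NetAdapter | ForEach-Object {
    $prop = Get-NetAdapterAdvancedProperty -Name $_.Name `
        -DisplayName 'Hyper-V Network Adapter Name' -ErrorAction SilentlyContinue
    [pscustomobject]@{ GuestNic = $_.Name; HyperVName = $prop.DisplayValue }
}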

That's it, off you go! Download the Technical Preview bits from MSDN, start exploring and learning. Knowledge is seldom a bad thing ;)

April 24th–Windows 2003 Is 10 Years Old

I'd like to chime in on a recent blog post by Aidan Finn, Hey Look – Your Business Is Running On A 10-Year Old Server Operating System (W2003). The sad thing is that this is so true, and "the good" thing is that some are even still on Windows Server 2000, so in even worse shape. Now, I realize that not all industries are the same, but keeping your operating systems up to date does have its benefits for all types of companies.

  • Security Improvements
  • Improved, richer, enhanced features
  • New functionality
  • Support for state of the art hardware & software
  • Supported for that day the SHTF
  • Future Proofing of your current investments

For one, all of the above will save you time and money. On top of that, it mitigates the risk of lost revenue due to security incidents and unsupported environments that no one can fix for you.

Think about it: if you're running Windows Server 2000 or 2003, chances are you are paying for software to provide functionality that's available right out of the box today. You're also putting in extra effort and jumping through hoops to run those on modern server-grade hardware.

You're also building up debt. Instead of yearly improvements keeping your infrastructure and services top notch, you're actively digging an ever bigger, very expensive, complex, and high-risk hole that you'll have to dig yourself out of. If you can, that is. Not a good place to be. Still think leveraging Software Assurance is a bad thing?

So while way too many companies now have to assign resources to mitigating that looming problem, we're focusing on other ventures (such as Hyper-V, Azure, Hybrid Cloud, …) and just keep our OS up to date at a steady pace, like before. Well, people, that doesn't happen by accident. We've maintained a very healthy pace of upgrading to the most recent version of Windows in our environments, and at times I have had to fight for that, and I'm sure I will again. But look at our baseline: even if the economy tanks completely, we're in darn good shape to weather that storm and come out ahead. It's not going to happen by sitting there avoiding change out of fear or laziness. So start today.

A point where I agree with Aidan completely: if your "Zombie ISV" and other vendors are telling you Windows 2003 is great and you shouldn't use those new, unproven versions of the OS, they are really touting BS. They have fallen so far behind on the technology stack that they need you to stay in their black hole of despair with them, or they'll go broke. Just move on. Trust me, they need you more than the other way around.

Some SAN Storage Fun

At the end of this day, I was doing some basic IO tests on some LUNs on one of the new Compellent SANs. It's amazing what 10 SSDs can achieve … We can still beat them in certain scenarios, but it takes 15 times more disks. But that's not what this blog post is about. This is about goofing off after 20:00, following another long day in another very long week; it's about kicking the tires of Windows and the SAN now that we can.

For fun, I created a 300TB LUN on a DELL Compellent, thin provisioned of course, as I only have 250TB :)

I then mounted it to a Windows 2008 R2 test server.

The documented limit for a volume in Windows 2008 R2 is 256TB when you use a 64K allocation unit size. So I tested this limit by trying to format the entire LUN and create a 300TB simple volume. I brought the disk online, initialized it as a GPT disk, and created a simple volume with an allocation unit size of 64K, and, well, that failed.

There is nothing unexpected about this. It has to do with the maximum NTFS volume size supported on a GPT disk, which depends on the cluster size selected at the time of formatting. NTFS is currently limited to 2^32-1 allocation units. This yields a 256TB volume using 64K clusters. However, this has only been tested to 16TB, or 17,592,186,040,320 bytes, using a 4K cluster size. You can read up on this in Frequently asked questions about the GUID Partitioning Table disk architecture. The table below shows the NTFS limits based on cluster size.

Cluster size   Maximum NTFS volume size
4 KB           16 TB
8 KB           32 TB
16 KB          64 TB
32 KB          128 TB
64 KB          256 TB
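
The arithmetic behind that table is easy to verify:

# Maximum NTFS volume size = (2^32 - 1) clusters * cluster size
foreach ($cluster in 4KB, 8KB, 16KB, 32KB, 64KB) {
    $maxBytes = ([math]::Pow(2, 32) - 1) * $cluster
    '{0,2} KB clusters -> {1:N0} TB' -f ($cluster / 1KB), ($maxBytes / 1TB)
}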

This was the first time I had the opportunity to test these limits. I formatted part of that LUN to a size close to the limit and then formatted the remainder as a second simple volume.

I still need to get a Windows Server 2012 test server hooked up to the SAN to see if anything has changed there. One thing is for sure: you could put at least three 64TB VHDX files on a single volume in Windows. Not too shabby :) It's more than enough to put just about any backup software into problems. Be warned: MSFT tested and guarantees performance & behavior up to 64TB in Windows Server 2012, but beyond that you'd better do your own due diligence.

The next thing I'll do when I have a Windows Server 2012 host hooked up is create a 64TB VHDX file and see if I can go beyond that before things break. Why? Well, because I can, and I want to take the new SAN and Windows 2012 for a ride to see what boundaries we can push. The SANs are just being set up, so now is the time to do some testing.