PowerShell script to maintain Azure Public DNS zone conditional forwarders

Introduction

I recently wrote a PowerShell script to maintain Azure Public DNS zone conditional forwarders. If you look at the list is quite long. Adding these manually is tedious and error-prone. Sure you might only need a few, but hey, I think and prepare long term.

Some background on DNS and private endpoints

When using private endpoints in Azure correct DNS name resolution is essential. While Azure can do a lot of things for you under the hood it is important to wrap your head around name resolution in Azure, for all your public, private, and custom DNS requirements. In the end, you need a DNS solution that is maintainable and works for current and future use cases. Your peaceful IT existence will fall apart fast without the ability to correctly resolve the private endpoint IP addresses to their fully qualified domain name (FQDN).

That in itself is a big subject I will not dive into right now. I will say that host files (if applicable) are OK for testing but not a maintainable solution, except for the smallest environments. In Azure, you can link virtual networks to Private DNS zones to resolve DNS queries for private endpoints. As an alternative, you can use your own custom DNS Server(s) with a forwarder to Azure’s VIP 168.63.129.16 and, at least on-premises conditional forwarders.

The latter is a requirement to resolve DNS queries for Azure resources with private endpoints for on-premises. At least until Azure DNS Private Resolver becomes generally available. That will be the way forward in the future if you otherwise have no need for custom DNS servers.

Please note that name resolution for private endpoints uses the public DNS zones. This allows existing Microsoft Azure services with DNS configurations for a public endpoint to keep functioning when accessed from the internet. Azure will intercept queries that originate from Azure or connected on-premises locations and reply with the private IP address of private endpoints. This configuration must be overridden to connect using your private endpoint.

On-premises DNS Servers

While your custom DNS servers in Azure can forward queries they are not authoritative for to the Azure VIP 168.63.129.16, on-premises servers cannot reach that IP address. They need to send the DNS queries for private Azure resources to a custom DSN Server in Azure via conditional forwarding. The Azure custom DNS server will forward the query to 168.63.129.16.

Example on-prem / Azure ADDS environment with Azure FW DNS proxy

PowerShell script to maintain Azure Public DNS zone conditional forwarders

Below you will find the code for the script. I created a CSV file with all the Azure public DNS Conditional forwarder zones. Zones with placeholders for regions, partitions, or SQL instances will be generated. For that, you need to provide the correct parameters. if not these are ignored.

PowerShell script to maintain Azure Public DNS zone conditional forwarders
Adding all Azure public DNS conditional forwarders to an on-premises DNS server

Another attention point is the fact that you can opt to store the zones in Active Directory or not. If so you can specify in what builtin or custom partition.

There are examples in the script TestAzurePublicDNSZoneForwardersScript.ps1 on how to use it. You will need at least one playground DNS server or better, 2 AD integrated DC/DNS servers for testing.

You can find the script at WorkingHardInIT/AzurePublicDnsZoneForwarders (github.com)

Conclusion

That’s it. I can extend the PowerShell script to maintain Azure Public DNS zone conditional forwarders with extra options when adding or updating conditional forwarders. Right now, for its current role, it does what I need. I do not plan to add an option to update the “store this conditional forwarder in Active Directory” setting as this has a bug.

See Bug when changing the “store this conditional forwarder in active directory” setting for more info. The gist is that it makes changing the setting causes DNS queries for the conditional forwarder to fail. We avoid that issue by removing and adding the conditional forwarders again. In many (most?) use cases so far, the default setting of not storing the conditional forwarder in Active Directory is what I need, so the script has no option to change that default setting until I might need it.

Code

Bug when changing the “store this conditional forwarder in active directory” setting

Bug when changing the “store this conditional forwarder in active directory” setting

Recently I encountered a bug when changing the “store this conditional forwarder in active directory” setting. I have been doing quite some active directory extensions to Azure lately. Part of that, post-process, is making sure that DNS name resolution from on-premises to Azure and vice versa is working optimally. When it comes to resolving Azure private endpoints and other private DNS zones from on-premises we need to add the conditional forwarders for the respective Azure DNS zones.

As we have different needs for this configuration on-premises versus in Azure we disable “Store this conditional forwarder in Active Directory, and replicate as follows” for all zones. This is the defaultm when you add a conditional forwarder.

However, you will also need to do this, in certain cases for other conditional forwarders depending on the DNS infrastructure between Azure and on-premises. I tend to change those non-Azure resource conditional forwarders before I add the one needed for Azure.

Bug when changing the "store this conditional forwarder in active directory" setting
The “store this conditional forwarder in active directory” setting

While that sounds easy enough, you can easily get into a pickle. When you change this, while the configuration seems perfectly fine, the name resolution for those zones where you change this stops working. That is bad. No bueno!

That can break a lot of services and applications leading to support calls, causing upset application owners, and lost revenue while leaving you scrambling to find a fix.

So how do we fix this?

Well, the only solution is to remove each and every conditional forwarder involved and add them again, While re-adding it you might get an “unknown error” in the GUI, but ignore it. Just go ahead. When your reverse lookup zones are in order it will resolve to the FQDN and name resolution will start working again. You can also use PowerShell or the command line. It is worth checking if changing the setting via PowerShell or the command line triggers the bug or not.

Please note that, as your are not replication the conditional forwarders in Active Directory, you must do that on all DNS servers on-premises involved in resolving Azure resources.

Is this a known bug?

Well, it looks like it, but I have yet to find a knowledge base article about it. There are mentions of other people running into the issue. This is not per se Azure-related. Take a look here DNS Conditional Forwarder stops working as soon as it’s Domain Replicated – Microsoft Q&A and AD Integrating conditional DNS forwarders stops them working (microsoft.com).

Note that this bug when changing the “store this conditional forwarder in active directory” setting will appear when you either enable or disable it.

This bug has existed for many years and over many versions of Windows DNS. The last encounters I had was with Windows Server 2019 and 2022. But beware with Windows Server 2016 and 2012 (R2) as well.

Linux AD computer object operating system values

Introduction

So, why am I dealing with Linux AD computer object operating system values? OK, here is some background. In geographic services, engineering, etc. people often run GIS and CAD software from various big-name vendors on Windows Servers. But it also has a rich and varied open source ecosystem driven by academic efforts. Often a lot of these handy tools only run in Linux.

The Windows Linux Subsystem might be an option for client-based or interactive tools. But when running a service I tend to use Ubuntu. It is the most approachable for me and, you can buy support for it in an enterprise setting if so desired or required.

To keep things as easy as possible and try to safeguard the concept of single sign-on we join these Ubuntu servers to Active Directory (AD) so they can log with their AD credentials.

Pre-staging computer objects

When joining an Ubuntu server to AD it partially fills out the Operating System values.

Not too detailed and only partially filled out.

However, we tend to pre-stage the computer accounts in the correct OU and not create them automatically in the default Computer OU when joining. In that case, the Operating System values seem to be left all blank. We can fix that with PowerShell.

Don’t worry, the screenshot is from my lab with my fictitious Active Directory forest/domain. You also have a lab right?

Linux AD computer object operating system values
Fill out the operating system info for pre-staged computer objects of Active Directory joined Ubuntu servers

Actually we need PowerShell Core

Now, this all very good and well, but how do we find out the values for the operating system. During deployment, we know, but over time they will update and upgrade. So it would be nice to figure out those values automatically and remotely.

PowerShell Core to the rescue! With PowerShell Core, we can do PowerShell Remoting Over SSH to run a remote session on our Linux server over SSH and get all the information we need. To make this automation-friendly you must certificate bases authentication for your SSH connection. Setting that up can be a bit tricky, especially on Windows. That is a subject for a future blog post I hope. You can also use the SecretStore to securely store the AD automation account credentials. Note that I also use a dedicated automation account on all my Linux systems for this purpose. Here is a “quick & dirty” code snippet to give you some inspiration on how to do that for Ubuntu.

#Grab the AD automation account credentials - please don't use a domain admin for this.
#Use a dedicated account with just enough privileges to get the job done.
$Creds = Get-Credential -UserName 'DATAWISETECH\dwtautomationaccount'
 
#Connect to a remote PowerShell session on our Linux server using certificate authentication.
#Setting this up is beyond the scope of this article but I will try to post a blog post on this later.
#Note you need to configure all Linux servers and desktops with the $public cert and allow the user to authenticate with it.
#We use a cert as that is very automation friendly! You will not get #prompted for a password for the Linux host.
$RemoteSession = New-PSSession -Hostname GRIZZLY -UserName autusrli
 
#Grab the OS information. Note that $PSVersionTable.OS only exist on PowerShell Core.
#which is OK as that is the version that is available for Linux.
 
$OS = Invoke-command -Session $RemoteSession { $PSVersionTable.OS }
 
#Grab the OSVersion.VersionString.
$VersionString = Invoke-command -Session $RemoteSession { [System.environment]::OSversion.VersionString }
 
#Clean up, we no longer need the remote session.
Remove-PSSession $RemoteSession
 
#Sanitize the strings for filling out the Active Directory computer object operating system values.
$UbuntuVersionFull = ($OS | Select-String -pattern '(\d+\.)(\d+\.)(\d)-Ubuntu').Matches.Value
$OperatingSystem = $UbuntuVersionFull.Split('-')[1] + " " + (($UbuntuVersionFull.Split('-')[0])).Substring(0, 5)
 
#Grab the Active Directory computer object and fill out the operating system values.
$Instance = (Get-AdComputer -Credential $Creds -Identity GRIZZLY -Server datawisetech.corp)
$Instance.OperatingSystem = $OperatingSystem
$Instance.OperatingSystemVersion = $VersionString
$Instance.OperatingSystemServicePack = $UbuntuVersionFull
Set-AdComputer -Instance $Instance

That’s it! Pretty cool huh?!

Conclusion

While you cannot edit the Linux AD computer object operating system values in the GUI you can do this via PowerShell. With Windows, this is not needed. This is handled for you. When joining Ubuntu to Active Directory this only gets set if you do not pre-stage the computer accounts. When you do pre-stage them, these are left blank. I showed you a way of adding that info via PowerShell. The drawback is that you need to maintain this and as such you will want to automate it further by querying those computers and updating the values as you update or upgrade these Ubuntu servers. Remote PowerShell over SSH and PowerShell Core on Linux are your friends for this. Good luck!

Cluster Shared Volumes without Active Directory

Introduction

Cluster Shared Volumes without Active Directory will come online if certain conditions are met. Basically, when the cluster can form. That’s what we’ll talk about here. There are a lot of things to consider when virtualizing your Active Directory environment, partially or completely. Too much for this blog post alone and much of it has been discussed before. Here I’ll discuss when people count on their Cluster Shared Volumes coming online when active directory is not available but are disappointed and think that it’s a bug or a broken promise. It is not! We know that since Windows Server 2012 the CSVs coming online do not depend on Active Directory being available (thanks to the local CLIUSR account being used for cluster starts).

Cluster Shared Volumes without Active Directory

So, let’s address getting your Cluster Shared Volumes online when AD is not available. The things to realize is that not having Active Directory has been taken care off. The thing that still can go wrong is that the cluster doesn’t come on line properly and that means that the CSV LUNs won’t be online as those are cluster resource. Hyper-V can boot the VMs without the cluster being formed but as the VMs reside on a CSV and those are not available, that’s what’s causing the problem. That means that getting your cluster to come up is the real issue.

Getting your cluster to come up is the real issue.

If cluster shared volumes can come up without a domain controller / AD being available since Windows Server 2012 (Failover Clustering and Active Directory Integration) how is it that some many people still have issues with it? How do you make this work for you? You can use a disk witness to protect you in most cases. When you have a file share or cloud witness you can take some step to avoid running into this issue.

There is a great article on cluster behavior here: Failover Cluster Node Startup Order in Windows Server 2012 R2 (the same rules apply for Windows Server 2016). Read it and you’ll notice that the behavior differs depending on whether you have a witness defined and whether that witness is a witness disk or a file share or cloud witness. The disk witness has a copy of the cluster database and if available will help you out of any situation where the paxos on the disk witness (if it’s available) is more recent then the one on the cluster node as the node will download the cluster database info from the disk witness. But a file share or cloud witness don’t have a copy, so they have a small disadvantage under certain scenarios. CSV’s that won’t come up is when booting the nodes in the wrong order when using a file share or cloud witness is one of those scenarios but not having AD isn’t the root cause of this.

Some people will never notice this issue at all, especially not when they have disk witness, but when they do, it might be at a very inconvenient moment in time. Examples of where this situation can occur are a single cluster environment where the domain controllers are running on CSV and high available in that cluster that was shut down completely, the cluster has a file share witness and the nodes are started in the wrong order.

Making sure your CSVs come online = clustering being formed

Well for one make sure you are using Windows Server 2012 or higher for your cluster. That’s a given.

Beyond that you basically just need to know what to do in what scenario to get your cluster up and running so you won’t have issues with getting the CSV to come up. You just have to follow some rules of thumb and you’ll be fine. Also, there’s almost always a way to get out of pickle, just don’t panic. But also remember that you can make your live easier when you design your solutions with failure in mind and by knowing your options so you can act correctly. I my example I’ll be using Hyper-V cluster with CSVs but the same goes for SOFS, SQL Server clusters leveraging CSVs etc.

Planned down time

If you’re shutting down a complete cluster you have two options to make sure things go a smooth as possible.

Option 1 – A clean cluster shutdown by the book

  • Shut down the workload, i.e. the VMs.
  • Stop the cluster
  • Shutdown the Cluster nodes.
  • Boot the cluster nodes. You can start with any you like. The CSVs will come on line. You will be able to start the VMs. Do start with the domain controllers and wait for them to come on line before starting the others.

During this you’ll see some “collateral” events, errors, warnings due to Active Directory not being available. The cluster name has issues without Active Directory but that a management connection point, it doesn’t mean the cluster doesn’t work. Once Active Directory is available the cluster name will come on line automatically when the default failure policy restarts it. You can also manage the cluster via RDP or console by connection to “.” locally on that node or use the running node name. You can also try to bring cluster name on line manually when Active Directory is up and running.

Option 2 – A clean cluster shutdown as it happens most of the times in real life

Which is one a lot of people do to keep part of the work load running as long as possible.

  • Shut down the no critical workload, i.e. the VMs.
  • Pause a node so the critical workloads live migrate to other nodes
  • When the node is paused, shut down the node.
  • You rinse and repeat this until the last node is left with only the most critical VMs

clip_image002

  • Finally, you stop the workload on that last cluster node and shut it down as well.
  • Now comes the critical part: Remember what node you shut down last. You have to start that one first and you’ll see that your CSV will come on line. If you boot another node, you might panic as the CSV will not come on line.

Now I need to correct this a little bit. With a disk witness you are OK whatever node you boot. When the paxos numbers on the first node to boot and the disk witness are compared the most recent copy will be used. Either the local one on the node will be used directly or after it has been updated with the data form the disk witness. To make things simple for the ops team I always tell them to note the last node they shut down anyway no matter what type of witness they have. It’s good info to have.

With a file share or cloud witness the shutdown/startup order really comes into play. The reason this happens is that by shutting down node by node we end up with a one node cluster (last man standing).  When that’s shut down that’s the only node that knows about the last (potential) changes in the cluster as it holds a copy of the cluster database. Remember that a files hare or a cloud witness has no copy of the cluster database. That’s why the last node to be shut down has to come on line first when comparing the paxos numbers with the witness as that node can form a cluster. If the node that boots first does not hold the most recent paxos number it cannot download the cluster databases info from the file share or cloud witness. Such a node cannot form a cluster and bring the CSVs online. If the first node to boot was the last one to shut down, it can form a cluster and the CSVs will come on line. This is the big difference with option one where you shut down the entire cluster and then take the nodes of line.

You might not know or remember the order. If that’s the case you still have options like starting the cluster node with the -fixquorum option (net.exe start clussvc /forcequorum) at the risk of loosing some cluster changes that are not in the local cluster database.

No need to go to immediately backups, extract the domain controller VMs from SAN snapshots or mount a LUN to a different machine to extract the VM files for the DC or so. Don’t panic!

One or more failed nodes

Well as long as the cluster survives your domain controller VMs should fail over. Keep ‘m on separate node (anti-affinity), separate CSV LUNs, if possible separate clusters if all domain controller virtual machines are going to be running high available on a cluster node and that cluster is still functional after all. No issues here.

Total cluster failure

The cluster nodes all show due to a “global” BSOD or are turned off due to a power failure or a storage array crash. This is more the realm of bad dreams I know, it does happen. Often things will recover and you’ll be fine but you can do your part. The same rules apply, get the cluster to form and you’ll have your CSVs on line. In a bad case -fixquorum is your friend but normally it’s not your first option. In the worst case you’ll need recovery from backup of the cluster or rebuild it. It’s a very bad day if it comes to that. And cluster recovery is not the subject of this post.

Conclusion

Don’t blame Active Directory and start troubleshooting or fixing the wrong problem. So yes, CSV will come on line when certain conditions are met and you can work yourself out of a pickle if needed. But during a disaster that’s only extra work and stress you might not want to worry about if you can avoid it. It’s good to know how to resolve issues around CSVs not coming online when the shit hits the fan as even the best laid out plans tend to get side tracked by reality when disaster strikes.

If you cannot guarantee control over all the prerequisites and might not have the skills in please when needed, you might consider other options. Some of these are actually the best practices of the past when a CSV would indeed not come on line without active directory in any way. This is great for AD related issues but not for you offline CSVs, they need the cluster to form properly!

Sure, you can run the domain controller virtual machines on local storage, and not made high available. This cloud be on one of the cluster nodes (*) or on a stand-alone Hyper-V host. Having a physical domain controller is also a possibility. This helps avoid issues with AD in virtualized environments as many other services are very dependent on them and it’s good to have on one available all of the time and get them back on line a.s.a.p..

I’ll leave you with the fact that virtualizing domain controllers can be done but it pays to study up on how to do it well and test your assumptions in the lab. There is a lot of information on virtualizing domain controllers for a reason. Read it and process what you’ve learned from it. You might find that this CCV thingies is not the most complex subject to deal with.

(*) Please note that some cluster deployments like HCI based on S2D do not support running other (local) storage in addition to the boot OS and the S2D storage pool volumes.