Bug when changing the “Store this conditional forwarder in Active Directory” setting
Recently I encountered a bug when changing the “Store this conditional forwarder in Active Directory” setting. I have been doing quite a few Active Directory extensions to Azure lately. Part of the post-deployment work is making sure that DNS name resolution from on-premises to Azure and vice versa works optimally. When it comes to resolving Azure private endpoints and other private DNS zones from on-premises, we need to add conditional forwarders for the respective Azure DNS zones.
As we have different needs for this configuration on-premises versus in Azure, we disable “Store this conditional forwarder in Active Directory, and replicate as follows” for all zones. This is the default when you add a conditional forwarder.
However, in certain cases, you will also need to do this for other conditional forwarders, depending on the DNS infrastructure between Azure and on-premises. I tend to change those non-Azure conditional forwarders before I add the ones needed for Azure.
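For reference, adding such a forwarder with PowerShell could look like the minimal sketch below. The zone name and the IP address of the DNS server in Azure that we forward to are placeholders for your environment; leaving out -ReplicationScope should keep the forwarder out of Active Directory.
# Placeholder zone and forwarder IP; omit -ReplicationScope so the conditional
# forwarder is not stored in Active Directory.
Add-DnsServerConditionalForwarderZone -Name 'privatelink.blob.core.windows.net' `
    -MasterServers 10.10.0.4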
The “Store this conditional forwarder in Active Directory” setting
While that sounds easy enough, you can easily get into a pickle. When you change this setting, the configuration seems perfectly fine, but name resolution for the zones where you changed it stops working. That is bad. No bueno!
That can break a lot of services and applications, leading to support calls, upset application owners, and lost revenue, all while leaving you scrambling to find a fix.
So how do we fix this?
Well, the only solution is to remove each and every conditional forwarder involved and add them again. While re-adding one, you might get an “unknown error” in the GUI, but ignore it and just go ahead. When your reverse lookup zones are in order, it will resolve to the FQDN, and name resolution will start working again. You can also use PowerShell or the command line. It is worth checking whether changing the setting via PowerShell or the command line triggers the bug or not.
Please note that, as you are not replicating the conditional forwarders in Active Directory, you must do this on all on-premises DNS servers involved in resolving Azure resources.
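A minimal sketch of that remove-and-re-add cycle per DNS server (the server names, zone, and master server IP are placeholders):
# Run against each on-premises DNS server involved (placeholder names and IPs).
$Zone = 'privatelink.blob.core.windows.net'
foreach ($Server in 'DNS01', 'DNS02') {
    Remove-DnsServerZone -Name $Zone -ComputerName $Server -Force
    Add-DnsServerConditionalForwarderZone -Name $Zone -MasterServers 10.10.0.4 -ComputerName $Server
}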
Note that this bug when changing the “Store this conditional forwarder in Active Directory” setting will appear whether you enable or disable it.
This bug has existed for many years and over many versions of Windows DNS. The last encounters I had were with Windows Server 2019 and 2022. But beware with Windows Server 2016 and 2012 (R2) as well.
While this post is about an offline Azure DevOps Windows 2012 R2 build server with failing builds, let me first talk about the deprecation of TLS 1.0/1.1. Now, this is just my humble opinion, as someone who has been implementing TLS 1.3, QUIC, and even SMB over QUIC. The phasing out of TLS 1.0/1.1 in favor of TLS 1.2 has been an effort done at a snail’s pace. But hey, here we are: TLS 1.0/1.1 are still working for Azure DevOps Services, many years after all the talk, hints, tips, hunches, and efforts to get rid of them. They did finally disable it on November 30th, 2021 (Deprecating weak cryptographic standards (TLS 1.0 and TLS 1.1) in Azure DevOps), but on January 31st, 2022, Microsoft had to re-enable it since too many customers ran into issues. Sigh.
Tech Debt
The biggest reason for these issues is tech debt, i.e. old server versions. So it was in this case, but with a twist. Why was the build server still running Windows Server 2012 R2? Well, in this case the developers won’t allow an upgrade or migration of the server to a newer version because they are scared they won’t be able to get the configuration running again and won’t be able to build their code anymore. This is not a joke, but it is better to laugh than to cry. That place chased away most good developers long ago and left few willing to fight the good fight, as there is no reward for doing the right things, quite the opposite.
Offline Azure DevOps Windows 2012 R2 build server with failing builds
But Microsoft, rightly so, must disable TLS 1.0/1.1 and will do so on March 31st, 2022. To let customers detect issues ahead of time, they temporarily disabled it on March 22nd (https://orgname.visualstudio.com) and 24th (https://dev.azure.com/orgname) from 09:00 to 21:00 UTC.
Guess what? On March 24th, I got a call to troubleshoot Azure DevOps Services build server issues. A certain critical on-premises build server showed as offline in Azure, and its builds, with a deadline of March 25th, were failing. Who you going to call?
That’s right, WorkingHardInIT! Sure enough, a quick test (Invoke-WebRequest -Uri https://status.dev.azure.com -UseBasicParsing).StatusDescription did not return OK.
Well, first of all, that server only had .NET 4.6 installed. .NET 4.7 or higher is a requirement after March 31st, 2022 for connectivity to Azure DevOps Services.
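To verify which .NET Framework 4.x version is installed, you can check the Release value in the registry; the threshold below (460805 for .NET 4.7 on anything other than Windows 10 Creators Update) comes from Microsoft’s documented release keys.
# Read the .NET Framework 4.x release key; 460805 or higher means 4.7+.
$Release = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full').Release
if ($Release -ge 460805) { "OK, .NET 4.7 or later (Release $Release)" }
else { "Upgrade needed, Release is only $Release" }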
So, I checked that there were working backups and made a Hyper-V checkpoint of the VM. I then installed .NET 4.8 and rebooted the server. I ran (Invoke-WebRequest -Uri https://status.dev.azure.com -UseBasicParsing).StatusDescription again, but no joy.
There is another requirement that you must pay extra attention to: the enabled cipher suites! Specifically, for Windows Server 2012 R2, the below cipher suites are the only two that will work with Azure DevOps Services.
But pay attention to the part about the AEAD ciphers, which are the only ones that work on Windows Server 2012 R2. The above two ciphers were missing, and I added them.
Add the two ciphers needed for W2K12R2 with Azure DevOps
Add those two ciphers to the part for Windows Server 2012 R2 and run the script again. That requires a server reboot. After the reboot, our check with (Invoke-WebRequest -Uri https://status.dev.azure.com -UseBasicParsing).StatusDescription returned OK. The build server was online again in Azure DevOps, and they could build whatever they wanted.
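For reference, the TLS cmdlets can add cipher suites without editing a script. The two suite names below are an assumption on my part, based on Microsoft’s Azure DevOps TLS 1.2 guidance for Windows Server 2012 R2, so verify them against the docs for your situation.
# Assumed suite names; verify against the Azure DevOps TLS 1.2 guidance.
Enable-TlsCipherSuite -Name 'TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384' -Position 0
Enable-TlsCipherSuite -Name 'TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256' -Position 1
Get-TlsCipherSuite | Select-Object -ExpandProperty Name   # verify, then reboot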
Conclusion
Tech debt is all around us. I avoid it as much as possible. On this occasion, I was able to fix the issue quite easily. But I walked away telling them to either move the builds to Azure or replace the VM with Windows Server 2022 (they won’t). There are reasons, such as cost and consistent build speed, to stay with an on-premises virtual machine. But then one should keep it in tip-top shape. The situation where no one dares touch it is disconcerting. And in the end, I come in and do touch it, minimally, so they can work again. Touching tech is unavoidable, from monthly patching, over software upgrades, to operating system upgrades. Someone needs to do this. Either you take that responsibility, or you let someone else (Azure) do that for you.
You can compile Desired State Configuration (DSC) configurations in Azure Automation State Configuration, which functions as a pull server. Next to doing this via the Azure portal, you can also use PowerShell. The latter allows for easy integration in DevOps pipelines and provides the flexibility to deal with complex parameter constructs. So, this is my preferred option. Of course, you can also push DSC configurations to Azure virtual machines via ARM templates. But I like the pull mechanisms for life cycle management just a bit more as we can update the DSC config and push it out when needed. So, that’s all good, but under certain conditions, you can get the following error: Cannot connect to CIM server. The specified service does not exist as an installed service.
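For reference, kicking off a compilation with PowerShell can look like this minimal sketch; the resource group, Automation account, configuration name, and parameters are placeholders.
# Compile a DSC configuration that was already imported into the Automation account.
$Params = @{ NodeName = 'localhost' }   # hypothetical parameters
Start-AzAutomationDscCompilationJob -ResourceGroupName 'rg-automation' `
    -AutomationAccountName 'aa-dsc' `
    -ConfigurationName 'MyDscConfig' `
    -Parameters $Params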
When can you get into this pickle?
DSC itself is PowerShell, and that comes in quite handy. Sometimes, the logic you can use inside DSC blocks is insufficient to get the job done. With PowerShell, we can leverage the power of scripting to gather the information and build the logic we need. One such example is formatting data disks; configuring network interfaces would be another. A disk number is not always reliable and consistent, which can lead to failed DSC configurations. For example, the block below is a classic way to wait for a disk and, when it shows up, initialize, format, and assign a drive letter to it.
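A sketch of such a block, using the WaitForDisk and Disk resources from the StorageDsc module (the module choice and disk number 2 are assumptions):
Configuration DataDiskByNumber {
    Import-DscResource -ModuleName StorageDsc
    Node localhost {
        # Wait for the disk to show up, identified by its (unreliable) number.
        WaitForDisk Disk2 {
            DiskId           = 2
            RetryIntervalSec = 60
            RetryCount       = 60
        }
        # Initialize, format, and assign a drive letter.
        Disk DataVolume {
            DiskId      = 2
            DriveLetter = 'F'
            FSLabel     = 'Data'
            DependsOn   = '[WaitForDisk]Disk2'
        }
    }
}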
The disk number may vary depending on whether your Azure virtual machine has a temp disk or not; using disk encryption can also trip up disk numbering. No worries, DSC has more up its sleeve and allows you to use the disk id instead of the disk number. That is truly unique and consistent. You can quickly grab a disk’s unique id with PowerShell like below.
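A quick way to list each disk’s unique id; you can then pass that value as DiskId with DiskIdType = 'UniqueId' in the resources above.
# List disk numbers alongside their stable unique ids.
Get-Disk | Select-Object -Property Number, FriendlyName, UniqueId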
So, life is good, right? Yes, until you try and compile that (DSC) configuration in Azure Automation State Configuration. Then, you will get a nasty compile error.
“Exception: The running command stopped because the preference variable “ErrorActionPreference” or common parameter is set to Stop: Cannot connect to CIM server. The specified service does not exist as an installed service.”
Or in the Azure Portal:
The Azure compiler wants to validate the code, and as it cannot access the host, compilation fails. So the configs compile on the Azure Automation server, not on the target node (which does not even exist yet) or on localhost. I find this odd. When I compile code in C# or C++ or VB.NET, it will not fail because it cannot connect to a server to validate my code by grabbing disk or interface information at compile time. The DSC code only needs to be correct and valid. I wish Microsoft would fix this behavior.
Workarounds
Compile DSC locally and upload
Yes, I know you can pre-compile the DSC locally and upload it to the automation account. However, the beauty of using the automation account is that you don’t have to bother with all that. I like to keep the flow as easy-going and straightforward as possible for automation. Unfortunately, compiling locally and uploading doesn’t fit into that concept nicely.
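For completeness, that flow would look something like this sketch (file, account, and configuration names are placeholders):
# Compile locally ...
. .\MyDscConfig.ps1                  # dot source the configuration script
MyDscConfig -OutputPath .\Output     # produces .\Output\localhost.mof
# ... and upload the compiled node configuration to the Automation account.
Import-AzAutomationDscNodeConfiguration -ResourceGroupName 'rg-automation' `
    -AutomationAccountName 'aa-dsc' `
    -ConfigurationName 'MyDscConfig' `
    -Path .\Output\localhost.mof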
Upload a PowerShell script to a storage container in a storage account
We can store a PowerShell script in an Azure storage account. In our example, that script can do what we want: find, initialize, and format a disk.
But we need to set up a storage account and upload a PowerShell script to a blob. We also need a SAS token to download that script, or we must allow public access to it. Instead of hardcoding this information in the DSC script, we can store it in Automation variables. We could even abuse Automation credentials to store the SAS token securely. All that is possible, but it requires more infrastructure, maintenance, and security work while integrating this into the DevOps flow.
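Downloading and running the script on the node would then be something like this sketch; the storage account, container, script name, and SAS token are placeholders.
# Placeholder blob URL with SAS token; keep the token out of source control.
$SasUri = 'https://mystorage.blob.core.windows.net/scripts/Format-DataDisk.ps1?sv=<SAS>'
Invoke-WebRequest -Uri $SasUri -OutFile 'C:\Scripts\Format-DataDisk.ps1' -UseBasicParsing
. 'C:\Scripts\Format-DataDisk.ps1'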
PowerShell to generate a PowerShell script
The least convoluted workaround that I found is to generate a PowerShell script in the Script block of the DSC configuration and save it to the Azure VM when DSC runs. In our example, this becomes the below script block in DSC.
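A minimal sketch of that Script resource; the disk-formatting logic inside the generated script is a placeholder, and $OutputPath is a variable defined at the configuration level.
Script GenerateAndRunScript {
    SetScript  = {
        # Build the PowerShell we want to execute on the host as a string.
        # The disk logic below is a placeholder; adapt it to your needs.
        $ScriptContent = @'
Get-Disk | Where-Object PartitionStyle -eq 'RAW' |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel 'Data' -Confirm:$false
'@
        # Persist the generated script to the VM ...
        Set-Content -Path $using:OutputPath -Value $ScriptContent -Force
        # ... and execute it by dot sourcing it.
        . "$using:OutputPath"
    }
    TestScript = { Test-Path -Path $using:OutputPath }
    GetScript  = { @{ Result = "$using:OutputPath" } }
}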
So, in SetScript, we build the actual PowerShell command we want to execute on the host as a string. Then, we persist it to file using our $OutputPath variable, which we can access inside the Script block via $using:OutputPath. Finally, we execute the persisted script by dot sourcing it with . "$using:OutputPath". In TestScript, we test for the existence of the file, and we ignore the output of GetScript, but it needs to be there. Maintenance is easy: you edit the string variable holding the generated PowerShell in the DSC configuration file, which you then upload and compile. That’s it.
To be fair, this will not work in all situations, and you might need to download protected files. In that case, the above solutions will help out.
Conclusion
Creating a PowerShell script in the DSC configuration file requires less effort and infrastructure maintenance than uploading such a script to a storage account. So that’s the pragmatic trick I use. I wish the compilation in an Automation account would succeed, but it doesn’t. So, this is the next best thing. I hope this helps someone out there facing the same issue to work around the error: Cannot connect to CIM server. The specified service does not exist as an installed service.
IMPORTANT UPDATE: Microsoft released Azure AD Connect 2.1.1.0 on March 24th, 2022, which fixes the issue described in this blog post. You can read about it here: Azure AD Connect: Version release history | Microsoft Docs. The fun thing is they wrote a doc about how to fix it on March 25th, 2022. The best option is to upgrade to AD Connect 2.1.1.0 or higher.
IMPORTANT UPDATE 2: Upgrade to version 2.1.15.0 (or higher), as that version also addresses LocalDB corruption issues!
Introduction
On Windows Server 2019 and Windows Server 2022 running AD Connect v2, I have been seeing an issue since October/November 2021 where the Microsoft Azure AD Sync service fails to start (event id 528). It does not happen in every environment, but when it does, it does not seem to go away. It manifests clearly in the Microsoft Azure AD Sync service failing to start after a reboot. If you make application-consistent backups or snapshots, you will notice errors related to the SQL Server VSS writer even before the reboot leaves the Microsoft Azure AD Sync service in a bad state. All this made backups a candidate for the cause. But that does not seem to be the case.
Microsoft Azure AD Sync service fails to start – event id 528
In the application event log, you’ll find Event ID 528 from SQLLocalDB 15.0 with the below content.
Windows API call WaitForMultipleObjects returned error code: 575. Windows system error message is: {Application Error} The application was unable to start correctly (0x%lx). Click OK to close the application. Reported at line: 3714.
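You can pull those entries quickly with PowerShell; this assumes the provider name matches the “SQLLocalDB 15.0” source shown above.
# Grab the most recent SQLLocalDB event id 528 entries from the Application log.
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'SQLLocalDB 15.0'
    Id           = 528
} -MaxEvents 10 | Format-List TimeCreated, Message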
Getting the AD Connect Server operational again
So, what does one do? Well, a Veeam Vanguard turns to Veeam and restores the VM from a restore point with a recent known good AD Connect installation.
But then the issue comes back
But then it comes back. Even worse, the AD Connect staging server suffers the same fate. So, again, we restore from backups. And guess what, a couple of weeks later, it happens again. So, you rebuild clean AD Connect VMs, and it happens again. We upgraded to every new version of AD Connect but no joy. You could think it was caused by failed updates or such, but no.
The most dangerous time is when the AD Connect service restarts. Usually that is during a reboot, often after monthly patching.
Our backup reports a failure with the application-consistent backup of the AD Connect server, often before Azure does. The backup notices the issues with LocalDB before the AD Sync service fails to start.
The failing backups indicate that there is an issue with the LocalDB database …
However, if you reboot enough, you can sometimes trigger the error even when no backups are involved. That means it is not related to Veeam or any other application-consistent backup; the backup process just stumbles over the LocalDB issue, it does not cause it. The error returns even if we turn off application-consistent backups in Veeam. We also have SAN snapshots running, but these do not seem to cause the issue either.
So, backups and VSS: it seems there is correlation, but not causation.
What goes wrong with LocalDB
After a while, by digging through the event and error logs of a server with the issue, we found that the model.mdf and model.ldf somehow get corrupted for some inexplicable reason on a pseudo-regular basis. Below you see a screenshot from C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019\Error.log. Remember, your path might differ.
That’s it, the model db seems corrupt for some reason.
You’ll find entries like “The log scan number (37:218:29) passed to log scan in database ‘model’ is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf).”
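To spot those entries quickly, you can tail the error log; the path below assumes the service runs under the default NT SERVICE\ADSync account.
# Look for corruption messages in the LocalDB error log.
Get-Content 'C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019\Error.log' -Tail 200 |
    Select-String -Pattern 'model', 'corrupt'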
Bar restoring from backup, the fastest way to recover is to replace the corrupt model DB files with known good ones. I will explain the process here because I am sure some of you don’t have a recent known good backup.
Sure, you can always deploy new AD Connect servers, but that is a bit more involved, and as things are going, they might get corrupted as well. Again, this is not due to cosmic radiation on a one-off server. We see it happen sometimes three weeks to a month apart, sometimes only a few days apart.
Manual fix by replacing the corrupt model db files
Once you see the SQLLocalDB event ID 528 entries in the application logs when your Microsoft Azure AD Sync service fails to start, first check the logs for corruption issues with the model DB; you’ll find them. Then, to fix the problem, do the following.
Disable the Microsoft Azure AD Sync service. To stop the service when it hangs in “starting”, you will need to reboot the host. You can also try to force kill ADSync.exe via its PID.
Depending on what user account the AD Sync Service runs under, you need to navigate to a different path. If you run under NT SERVICE\ADSync you need to navigate to
The account the Microsoft Azure AD Sync service runs under
C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019
Welcome to the home of the AD Connect LocalDB model database
If you don’t use the default account but another one, you need to go to C:\Users\YOURADSyncUSER\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019
Open a second Explorer window and navigate to C:\Program Files\Microsoft SQL Server\150\LocalDB\Binn\Templates. From there, copy the model.mdf and modellog.ldf files and paste them into the folder you opened above, overwriting the existing, corrupt model.mdf and model.ldf files.
You can now change the Microsoft Azure AD Sync service back to start automatically and start the service.
If all goes well, the Microsoft Azure AD Sync service is running, and you can synchronize to your heart’s content.
Conclusion
If this doesn’t get resolved soon, I will automate the process. Just shut down or kill the ADSync process and replace the model.mdf and model.ldf files from a known good copy.
Here is an example script, which needs more error handling, but which you can run manually or trigger by monitoring for event id 528 or by leveraging Task Scheduler. As always, run this script in the lab first. Test it, and make sure you understand what it does. You are the only one responsible for what you run on your servers! Once you are done testing, replace Write-Host with Write-Output, or turn it into a function and use cmdletbinding and param to gain Write-Verbose if you don’t want all the output/feedback. Both those options are more automation friendly.
Clear-Host
$SQLServerTemplates = "C:\Program Files\Microsoft SQL Server\150\LocalDB\Binn\Templates"
$ADConnectLocalDB = "C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019"

Write-Host -ForegroundColor Yellow "Setting ADSync startup type to disabled ..."
Set-Service ADSync -StartupType Disabled

Write-Host -ForegroundColor Yellow "Stopping ADSync service ..."
Stop-Service ADSync -Force

$ADSyncStatus = Get-Service ADSync

if ($ADSyncStatus.Status -eq 'Stopped') {
    Write-Host -ForegroundColor Cyan "The ADSync service has been stopped ..."
}
else {
    if ($ADSyncStatus.Status -eq 'Stopping' -or $ADSyncStatus.Status -eq 'Starting') {
        Write-Host -ForegroundColor Yellow "Setting ADSync startup type to disabled ..."
        Set-Service ADSync -StartupType Disabled
        Write-Host -ForegroundColor Red "ADSync service was not stopped but stuck in stopping or starting ..."
        $ADSyncService = Get-CimInstance -Class win32_service | Where-Object name -eq 'ADSync'
        $ADSyncProcess = Get-Process | Where-Object ID -eq $ADSyncService.ProcessId
        # Kill the ADSync process if need be ...
        Write-Host -ForegroundColor Red "Killing ADSync service process forcefully ..."
        Stop-Process $ADSyncProcess -Force
        # Kill the sqlservr process if need be (in order to be able to overwrite the corrupt model db files) ...
        Write-Host -ForegroundColor Red "Killing sqlservr process forcefully ..."
        $SqlServerProcess = Get-Process -Name "sqlservr" -ErrorAction SilentlyContinue
        if ($SqlServerProcess) {
            Stop-Process $SqlServerProcess -Force
        }
    }
}

$ADSyncStatus = Get-Service ADSync

if ($ADSyncStatus.Status -eq 'Stopped') {
    Write-Host -ForegroundColor Magenta "Copying known good copy of the model DB data file to the AD Connect LocalDB path ..."
    Copy-Item "$SQLServerTemplates\model.mdf" $ADConnectLocalDB
    Write-Host -ForegroundColor Magenta "Copying known good copy of the model DB log file to the AD Connect LocalDB path ..."
    Copy-Item "$SQLServerTemplates\modellog.ldf" $ADConnectLocalDB
    Write-Host -ForegroundColor Magenta "Setting ADSync startup type to automatic ..."
    Set-Service ADSync -StartupType Automatic
    Write-Host -ForegroundColor Magenta "Starting ADSync service ..."
    Start-Service ADSync
}

$ADSyncStatus = Get-Service ADSync

if ($ADSyncStatus.Status -eq 'Running' -and $ADSyncStatus.StartType -eq 'Automatic') {
    Write-Host -ForegroundColor Green "The ADSync service is running ..."
}
else {
    Write-Host -ForegroundColor Red "ADSync service is not running, something went wrong! You must troubleshoot this."
}
That fixes this cause of the Microsoft Azure AD Sync service failing to start with event id 528. For now, we keep an eye on it and get alerts from the AD Connect health service in Azure when things break or when event id 528 occurs on the AD Connect servers. Let’s see if Microsoft comes up with anything.