Microsoft Azure AD Sync service fails to start – event id 528

IMPORTANT UPDATE: Microsoft released Azure AD Connect 2.1.1.0 on March 24th, 2022, which fixes the issue described in this blog post. You can read about it here: Azure AD Connect: Version release history | Microsoft Docs. The fun thing is that they wrote a doc about how to fix it on March 25th, 2022. The best option is to upgrade to AD Connect 2.1.1.0 or higher.

IMPORTANT UPDATE 2: Upgrade to version 2.1.15.0 (or higher) as that version also addresses LocalDB corruption issues!
Introduction

On Windows Server 2019 and Windows Server 2022 running AD Connect v2, I have been seeing an issue since October/November 2021 where the Microsoft Azure AD Sync service fails to start with event ID 528. It does not happen in every environment, but when it does, it does not seem to go away. It manifests clearly by the Microsoft Azure AD Sync service failing to start after a reboot. If you do application-consistent backups or snapshots, you will notice errors related to the SQL Server VSS writer even before the reboot leaves the Microsoft Azure AD Sync service in a bad state. All this made backups a candidate for the cause, but that does not seem to be the case.


In the application event log, you’ll find Event ID 528 from SQLLocalDB 15.0 with the below content.

Windows API call WaitForMultipleObjects returned error code: 575. Windows system error message is: {Application Error}
The application was unable to start correctly (0x%lx). Click OK to close the application.
Reported at line: 3714.

Getting the AD Connect Server operational again

So, what does one do? Well, a Veeam Vanguard turns to Veeam and restores the VM from a recent restore point with a known good AD Connect installation.

But then the issue comes back

But then it comes back. Even worse, the AD Connect staging server suffers the same fate. So, again, we restore from backups. And guess what, a couple of weeks later, it happens again. So, you rebuild clean AD Connect VMs, and it happens again. We upgraded to every new version of AD Connect but no joy. You could think it was caused by failed updates or such, but no.

The most dangerous time is when the AD Connect service restarts. Usually that is during a reboot, often after monthly patching.

Our backup reports a failure with the application-consistent backup of the AD Connect server, often before Azure does so. The backup notices the issues with LocalDB before the AD Sync service fails to start because of those problems.

The failing backups indicate that there is an issue with the LocalDB database …

However, if you reboot enough, you can sometimes trigger the error with no backups involved at all. That means it is not related to Veeam or any other application-consistent backup. The backup process just stumbles over the LocalDB issue; it does not cause it. The error returns even if we turn off application-consistent backups in Veeam. We also have SAN snapshots running, but these do not seem to cause the issue either.

We did try all the tricks from an issue a few years back with backing up AD Connect servers (see https://www.veeam.com/kb2911), but even with the trick to prevent the unloading of the user profile (COM+ application stops working when users log off – Windows Server | Microsoft Docs) we could not get rid of the issue.

So backups and VSS: it seems there is a correlation, but not causation.

What goes wrong with LocalDB

After a while, and by digging through the event and error logs of a server with the issue, we find that the model.mdf and model.ldf files get corrupted, for some inexplicable reason, on a pseudo-regular basis. Below you see a screenshot from C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019\Error.log. Remember, your path might differ.

That’s it, the model db seems corrupt for some reason.

You’ll find entries like “The log scan number (37:218:29) passed to log scan in database ‘model’ is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf).”
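
If you want to look for these entries quickly, a short PowerShell check along these lines will do (this uses the default ADSync2019 instance path from above; adjust it to your environment):

# Scan the LocalDB error log for model DB corruption messages
Get-Content 'C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019\Error.log' |
    Select-String -Pattern 'model', 'log scan number', 'corruption'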

Bar restoring from backup, the fastest way to recover is to replace the corrupt model DB files with good ones. I will explain the process here because I am sure some of you don’t have a recent, known good backup.

Sure, you can always deploy new AD Connect servers, but that is a bit more involved, and as things are going, they might get corrupted as well. Again, this is not due to cosmic radiation on a one-off server. We now see it happen sometimes three weeks to a month apart, sometimes only a few days apart.

Manual fix by replacing the corrupt model DB files

Once you see the SQLLocalDB event ID 528 entries in the application log when your Microsoft Azure AD Sync service fails to start, first check the logs for corruption issues with the model DB. You’ll find them. To fix the problem, do the following.

Disable the Microsoft Azure AD Sync service. To stop the service when it hangs in “starting” you will need to reboot the host. You can also try to force kill ADSync.exe via its PID.
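
If you need to find that PID, a minimal sketch from an elevated PowerShell prompt looks like this; it simply looks up the process the ADSync service is running as and kills it:

# Get the PID of the ADSync service and kill the process if the service hangs in "Starting"
$ADSyncService = Get-CimInstance -ClassName Win32_Service -Filter "Name='ADSync'"
if ($ADSyncService.ProcessId -ne 0) {
    Stop-Process -Id $ADSyncService.ProcessId -Force
}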

Depending on which user account the AD Sync service runs under, you need to navigate to a different path. If it runs under NT SERVICE\ADSync, navigate to

The account the Microsoft Azure AD Sync service runs under

C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019

Welcome to the home of the AD Connect LocalDB model database

If you don’t use the default account but another one, you need to go to C:\Users\YOURADSyncUSER\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019

Open a second Explorer window and navigate to C:\Program Files\Microsoft SQL Server\150\LocalDB\Binn\Templates. From there, copy the model.mdf and modellog.ldf files and paste them into the folder you opened above, overwriting the existing, corrupt files.
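
If you prefer the command line over Explorer, the equivalent copy from an elevated PowerShell prompt is just two Copy-Item calls (these assume the default NT SERVICE\ADSync profile path shown above):

# Overwrite the corrupt model DB files with the known good templates
Copy-Item 'C:\Program Files\Microsoft SQL Server\150\LocalDB\Binn\Templates\model.mdf' 'C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019' -Force
Copy-Item 'C:\Program Files\Microsoft SQL Server\150\LocalDB\Binn\Templates\modellog.ldf' 'C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019' -Force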

You can now change the Microsoft Azure AD Sync service back to start automatically and start the service.

If all goes well, the Microsoft Azure AD Sync service is running, and you can synchronize to your heart’s content.

Conclusion

If this doesn’t get resolved soon, I will automate the process. Just shut down or kill the ADSync process and replace the model.mdf and model.ldf files from a known good copy.

Here is an example script, which needs more error handling, but which you can run manually, trigger by monitoring for event ID 528, or schedule via Task Scheduler (see the example after the script). As always, run this script in the lab first. Test it and make sure you understand what it does. You are the only one responsible for what you run on your servers! Once you are done testing, replace Write-Host with Write-Output, or turn it into a function and use cmdletbinding and param to gain Write-Verbose if you don’t want all the output/feedback. Both of those options are more automation friendly.

cls
$SQLServerTemplates = "C:\Program Files\Microsoft SQL Server\150\LocalDB\Binn\Templates"
$ADConnectLocalDB = "C:\Windows\ServiceProfiles\ADSync\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync2019"

Write-Host -ForegroundColor Yellow "Setting ADSync startup type to disabled ..."
Set-Service ADSync -StartupType Disabled

Write-Host -ForegroundColor Yellow "Stopping ADSync service  ..."
Stop-Service ADSync -force

$ADSyncStatus = Get-Service ADSync

if ($ADSyncStatus.Status -eq 'Stopped'){
    Write-Host -ForegroundColor Cyan "The ADSync service has been stopped  ..."
}
else {
    if ($ADSyncStatus.Status -eq 'Stopping' -or $ADSyncStatus.Status -eq 'Starting'){
        
        Write-Host -ForegroundColor Yellow "Setting ADSync startup type to disabled ..."
        Set-Service ADSync -StartupType Disabled

        Write-Host -ForegroundColor Red "ADSync service was not stopped but stuck in stopping or starting ..."
        $ADSyncService = Get-CimInstance -class win32_service | Where-Object name -eq 'ADSync'
        $ADSyncProcess = Get-Process | Where-Object ID -eq $ADSyncService.processid

        #Kill the ADSync process if need be ...
        Write-Host -ForegroundColor red "Killing ADSync service process forcefully ..."
        Stop-Process $ADSyncProcess -Force

        #Kill the sqlserver process if need be ... (in order to be able to overwrite the corrupt model db files)
        Write-Host -ForegroundColor red "Killing sqlservr process forcefully ..."
        $SqlServerProcess = Get-Process -Name "sqlservr" -ErrorAction SilentlyContinue
        if ($SqlServerProcess) {
            Stop-Process $SqlServerProcess -Force
        }

        }
    }

$ADSyncStatus = Get-Service ADSync
if ($ADSyncStatus.Status -eq 'Stopped'){

    Write-Host -ForegroundColor magenta "Copying a known good copy of the model DB data file to the AD Connect LocalDB path ..."
    Copy-Item "$SQLServerTemplates\model.mdf" $ADConnectLocalDB

    Write-Host -ForegroundColor magenta "Copying a known good copy of the model DB log file to the AD Connect LocalDB path ..."
    Copy-Item "$SQLServerTemplates\modellog.ldf" $ADConnectLocalDB


    Write-Host -ForegroundColor magenta "Setting ADSync startup type to automatic ..."
    Set-Service ADSync -StartupType Automatic

    Write-Host -ForegroundColor magenta "Starting ADSync service ..."
    Start-Service ADSync
}

$ADSyncStatus = Get-Service ADSync
if ($ADSyncStatus.Status -eq 'Running' -and $ADSyncStatus.StartType -eq 'Automatic'){
    Write-Host -ForegroundColor green "The ADSync service is running ..."
}
else {
    Write-Host -ForegroundColor Red "ADSync service is not running, something went wrong! You must troubleshoot this."
}
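
As an example of the Task Scheduler option mentioned above, here is a sketch of how you could register a task that runs a saved copy of the script whenever SQLLocalDB 15.0 logs event ID 528 in the Application log. The script path C:\Scripts\Repair-ADSyncLocalDB.ps1 is just an assumption for this example; as with the script itself, test this in the lab first.

# Event trigger: Application log, provider SQLLocalDB 15.0, event ID 528
$TriggerClass = Get-CimClass -ClassName MSFT_TaskEventTrigger -Namespace Root/Microsoft/Windows/TaskScheduler
$Trigger = New-CimInstance -CimClass $TriggerClass -ClientOnly
$Trigger.Enabled = $true
$Trigger.Subscription = @'
<QueryList><Query Id="0" Path="Application"><Select Path="Application">*[System[Provider[@Name='SQLLocalDB 15.0'] and EventID=528]]</Select></Query></QueryList>
'@

# Run the repair script (assumed path) as SYSTEM when the trigger fires
$Action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-ExecutionPolicy Bypass -File C:\Scripts\Repair-ADSyncLocalDB.ps1'
$Principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName 'Repair ADSync LocalDB on event 528' -Action $Action -Trigger $Trigger -Principal $Principal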


That fixes this cause of the Microsoft Azure AD Sync service failing to start with event ID 528. For now, we keep an eye on it and get alerts from the AD Connect Health service in Azure when things break or when event ID 528 occurs on the AD Connect servers. Let’s see if Microsoft comes up with anything.

IMPORTANT UPDATE: Microsoft released Azure AD Connect 2.1.1.0 on March 24th, 2022, which fixes the issue described in this blog post. You can read about it here: Azure AD Connect: Version release history | Microsoft Docs. The fun thing is that they wrote a doc about how to fix it on March 25th, 2022. The best option is to upgrade to AD Connect 2.1.1.0 or higher.

PS: I am not the only one seeing this issue: Azure AD Sync Connect keeps getting corrupted – Spiceworks.

A VM that would not route


This blog post will address a troubleshooting exercise with a VM (virtual machine) that would not route. As it turned out, it had a default gateway set to 0.0.0.0 next to the actual gateway IP address. The VM did its job because the workload it serves is in the same subnet as its clients and, as it happens, in the same subnet as the DC and DNS. This meant it did not lose its trust with Active Directory.

But the admins could not RDP into that VM, nor would it update. But as it did its job, many months went by until it fell so far behind in updates that they could not ignore it anymore. That’s how things go in real life.

Finding & fixing the issue

Superficially, the configuration of the VM was totally OK. The gateway on the NIC was correct.

Under Advanced we see no other entries that would cause any issues.

But we could not deny that we had a VM that would not route at hand. Let’s figure this out and fix it.

So what does one do? If you don’t trust the CLI, check the GUI, and if you don’t trust the GUI, check the CLI. As everything seemed fine in the GUI, we checked via the CLI. Name resolution worked fine, both internally and externally, when checking with nslookup. But actually getting anywhere not on the same subnet was not successful. Naturally, I did check whether any forward proxy was in play, but that was not the case, and this was an issue for more than just HTTP(S).

When I ran ipconfig /all I quickly saw the culprit. We have a Default Gateway entry pointing to 0.0.0.0 next to one for the actual gateway!

So where does that come from? Not from the GUI settings, as far as we can see. So I ran route print, and that showed us the root cause.

So we needed to drop the route sending traffic for 0.0.0.0 to its own IP address as the gateway. They missed this as it does not show up in the GUI at all.

I dropped all persistent routes for 0.0.0.0 via route delete 0.0.0.0 mask 0.0.0.0. I checked whether this deleted all persistent routes via route print.

At that moment, routing does not work at all, and we need to add the gateway back to the NIC. You can use the GUI or route add -p 0.0.0.0 MASK 0.0.0.0 192.168.2.7 IF 9. Once I did that, things lit up. We could download and install updates from the WSUS server, and they had remote desktop access again. In other words, routing worked again.
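
For reference, you can do the same cleanup with the NetTCPIP cmdlets, which also list routes the GUI hides. This is roughly equivalent to the route.exe commands above, using the gateway and interface index from this example:

# Show all default routes, including the bogus 0.0.0.0 gateway entry the GUI does not display
Get-NetRoute -DestinationPrefix "0.0.0.0/0"

# Remove every default route, then add the real default gateway back (interface index 9, gateway 192.168.2.7 in this example)
Remove-NetRoute -DestinationPrefix "0.0.0.0/0" -Confirm:$false
New-NetRoute -DestinationPrefix "0.0.0.0/0" -InterfaceIndex 9 -NextHop 192.168.2.7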

How did it happen? Ah, somewhere, somehow, someone added that route. I am not paid to do archeology or forensics in this case, so I did not try to find out the what, when, and why. But my guess is that the VM had another NIC at some point with those settings, and they removed it from the Hyper-V settings without cleaning up, leaving that entry behind in the routing table and a 0.0.0.0 gateway on the NIC that is only visible via ipconfig /all. Or they tried to add a gateway manually to address this or another issue.

A final note

When faced with this issue, some folks on the internet will tell you to reset the TCP/IP stack and Winsock with netsh, or add a new NIC with a new IP (static or via DHCP) and dump the old one. But this is all a bit drastic. Find the root cause and try to fix that first. You can try drastic measures when all else fails.

Check/repair/defragment an XFS volume

Introduction

As I have started to use XFS in bite-size deployments to gain experience with it, I wanted to write up some of the tooling I found to manage XFS file systems. Here’s how to check, repair, and defragment an XFS volume.

My main use case for XFS volumes is on hardened Linux repositories with immutability to use with Veeam Backup & Replication v11 and higher. It’s handy to be able to find out if an XFS file system needs repairing and, if it does, repair it. Another consideration is fragmentation. You can also check that and defrag the volume.

Check XFS Volume and repair it

xfs_repair is the tool you need. You can both check whether a volume needs repair and actually repair it with the same tool. Note that xfs_check has been deprecated or is not even available (anymore).

To work with xfs_repair you have to unmount the filesystem, so there will be downtime. Plan for a maintenance window.

To check the file system, use the -n (no modify) switch:

sudo xfs_repair -n /dev/sdc
A dry run with xfs_repair -n

There is not much to do here, but let’s now run the repair anyway.

sudo xfs_repair /dev/sdc
Repairing an XFS file system

The output is similar to that of the check; the check for anything to repair is basically a dry run of what will be done. In this case, nothing.

Now, don’t forget to mount the file system again!

sudo mount /dev/sdc /mnt/veeamsfxrepo01-02

Check a volume for fragmentation and defrag it

Want to check the fragmentation of an XFS volume? You can, but this time with xfs_db. The file system has to be unmounted for that, or you will get the error “xfs_db: can’t determine device size”. To check for fragmentation, run the following command against the storage device/file system.

sudo xfs_db -c frag -r /dev/sdc
A lab simulation of sudo xfs_db -c frag -r /dev/sdc – yeah, I know, it’s meaningless 😉

Cool, now we know that we can defrag it online. For that we use xfs_fsr.

sudo xfs_fsr /dev/sdc /mnt/veeamxfsrepo01-02
There is nothing to do in our example

xfs_scrub – the experimental tool

xfs_scrub is a more recent addition, but the program is still experimental. The good news is that it will check and repair a mounted XFS filesystem. That sounds promising, right? It does, but it doesn’t work (Ubuntu 20.04.1 LTS).

No joy – still a confirmed bug – not assigned yet, importance undecided. Not yet my friends.

Conclusion

That’s it. I hope this helps you when you decide to take XFS for a spin for your storage needs knowing a bit more about the tooling. As said, for me, the main use case is hardened Linux repositories with immutability to use with Veeam Backup & Replication v11. In a Hyper-V environment of course.

Check vhid, password and virtual IP address – Kemp LoadMaster

Introduction

Recently I was implementing a highly available Kemp LoadMaster X15 system. I prepared everything, documented the switch and LM-X15 configuration, and created a Visio diagram to visualize it all. That, together with the migration and rollback scenario, was presented to the team lead and the engineer who was going to work on this with us. I told the team lead that all would go smoothly if my preparations were good and I did not make any mistakes in the configuration. Well, guess what, I made a mistake somewhere and had to solve a Kemp LoadMaster Bad digest – md2=[31084da3…] md=[20dcd914…] – Check vhid, password and virtual IP address log entry.

Check vhid, password and virtual IP address

While all was working well, we saw the following entry inundate the system message file log:

<date> <LoadMasterHostName> ucarp[2193]: Bad digest – md2=[xxxxx…] md=[xxxxx…] – Check vhid, password and virtual IP address

This entry appears every second …

Wait a minute, as far as I know all was OK. The VHID was unique for the HA pair and we did not have duplicate IP addresses set anywhere on other network appliances. So what was this about?

Figuring out the cause

Well, we have a bond0 on eth0 and eth2 for the appliance management. We also have eth1, which is a special interface used for L1 health checks between the LoadMasters. We don’t use a direct link (different racks), so we configure them with an IP on a separate, dedicated subnet. Then we have the bonds with the VLANs for the actual workloads via Virtual Services.

We have heartbeat health checks on bond0, eth1, and on at least one VLAN per bond for the workloads.

Confirm that Promiscuous mode and PortFast are enabled. Check!
HA is configured for multicast traffic in our setup so we confirm that the switch allows multicast traffic. Check!

Make sure that switch configurations that block multicast traffic, such as ‘IGMP snooping’, are disabled on the switch/switch ports as needed. Check!

Now let’s look at possible causes and check our configuration.

So what else? The documentation lists the following as possible other causes:

  1. There is another device on the network with the same HA Virtual ID. The LoadMasters in a HA pair should have the same HA Virtual ID. It is possible that a third device could be interfering with these units. As of LoadMaster firmware version 7.2.36, the LoadMaster selects a HA Virtual ID based on the shared IP address of the first configured interface (the last 8 bits). You can change the value to whatever number you want (in the range 1 – 255), or you can keep it at the value already selected. Check!
  2. An interface used for HA checks is receiving a packet from a different interface/appliance. If the LoadMaster has two interfaces connecting to the same switch, with Use for HA checks enabled, this can also cause these error messages. Disable the Use for HA checks option on one of the interfaces to confirm the issue. If confirmed, either leave the option disabled or move the interface to a separate switch.

I am sure there is no interference from another appliance. Check! As we had checked every other possible cause, the second item above caught my attention. Could it be?

Time for some packet captures

So we took a TCP dump on bond0 and looked at it in Wireshark. You can make a TCP dump via debug options under System Log Files.

Debug Options; once there, find TCP dump.

Select your interface, click start, wait 10 seconds or so, then click stop and download the dump.

TCP dump

Do note that Wireshark identifies this as VRRP, but the LoadMaster uses CARP (open source), so set it to decode as CARP; that way, you’ll see more interesting information in the Info column.

No, not proprietary VRRP but CARP

Also filter on ip.dst == 224.0.0.18 (the CARP/VRRP multicast address). What we see here is that on eth0 we receive multicasts from eth1. That is the case described in the documentation. Aha!

Aha, we see CARP multicasts from eth1 on eth0 – that is what we call a clue!

So now what, do we need to move eth1 to another switch to solve this? Or disable the HA check? No, luckily not. Read on.

The fix for Check vhid, password and virtual IP address

No, I did not use one or more separate switches just to plug in the heartbeat HA interfaces on the LoadMasters. What I did was create a separate VLAN for the HA heartbeat (eth1) uplink ports on the switches. This way, I ensure that they are in a separate multicast group from the management interface uplinks on the switches.

By selecting a different VLAN for the management and heartbeat interface uplinks, they are in different Multicast TV VLAN groups by default.

By default, the Multicast TV VLAN membership is per VLAN. The reason the actual workload interfaces did not cause an issue when we enabled HA checks is that those were trunk ports with a number of allowed VLANs, different from the management VLAN, which prevents this error from being logged in the first place.

That this works was confirmed in the packet trace from the LM-X15 after making the change.

No more packets received from a different interface. Mission accomplished.

So that was it. The error was gone and we could move along with the project.

Conclusion

Well, I should have known, as normally I put those networks not just in a separate subnet but also on different VLANs. This goes to show that no matter how experienced you are and how well you prepare, you will still make mistakes. That’s normal and that’s OK; it means you are actually doing something. The key is how you deal with a mistake, and that is why I wrote this: to share how I found the root cause of the issue and how I fixed it. Mistakes are a learning opportunity, so use them as such. I know many organizations frown upon mistakes, but really, they should grow up and stop acting this silly.