Replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity

Introduction

I use Storage Spaces in various environments for several use cases, even with clients (see Move Storage Spaces from Windows 8.1 to Windows 10). In this blog post, we’ll walk through replacing a failed disk in a stand-alone Storage Spaces setup with Mirror Accelerated Parity. I have a number of DELL R740XD stand-alone servers with a ton of storage that I use as backup targets. See A compact, high capacity, high throughput, and low latency backup target for Veeam Backup & Replication v10 for a nice article on a high-performance design with such servers. They deliver the repositories for the extents in Veeam Backup & Replication Scale-out Backup Repositories. They have NVMes for the performance tier in Storage Spaces with Mirror Accelerated Parity.

Even with the best hardware and vendor, a disk can fail, and yes, it happened to one of our NVMe drives. Reseating the disk did not help and we were one disk short in Device Manager. So yes, the disk was dead (or worse, the bus where it was seated, but that is less likely).

Replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity

So let’s take a look by running

Get-PhysicalDisk 

I immediately see that we have an NVMe that has lost communication. It is gone and no longer displays a disk number. It seems to be broken.
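If you want to zoom in on the relevant columns, a quick sketch using standard Get-PhysicalDisk output properties looks like this:

# Show the columns that reveal a failed disk at a glance
Get-PhysicalDisk | Sort-Object FriendlyName |
    Format-Table FriendlyName, SerialNumber, MediaType, OperationalStatus, HealthStatus, Usage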


That means we need to remove it from the storage pool so we can replace it.

Getting rid of the failed disk properly

I put the disk that lost communication into a variable

$ProblemDisk = Get-PhysicalDisk | Where-Object OperationalStatus -like "*lost*"

We then retire the problematic disk

$ProblemDisk | Set-PhysicalDisk -Usage Retired

We then run Get-PhysicalDisk again and yes, we see the disk was retired.


Now grab that retired disk and save it to a variable by running

$RetiredDisk = Get-PhysicalDisk | Where-Object Usage -like "*Retired*"

Now remove the retired disk from the storage pool by running

Get-StoragePool -FriendlyName BackupStoragePool | Remove-PhysicalDisk -PhysicalDisk $RetiredDisk

Let this complete and check again with Get-PhysicalDisk; you will see that the problematic disk is gone. Note that there are only 7 NVMe disks left.


The OS does not even show an unrecognized disk somewhere, so there is nothing left to reset in an attempt to get it back into action. We need to replace it, so we request a replacement disk from DELL support and swap them out.

Putting the new disk into service

Now that we have our new disk, we want to add it to the storage pool. You should see the new disk in Disk Management as online and not initialized, or when you select Add Physical Disk in the storage pool in Server Manager.

But we were doing so well in PowerShell so let’s continue there. We will add the new disk to the storage pool. Run

$DiskToAddToPool = Get-PhysicalDisk | Where-Object CanPool -eq $True

Get-StoragePool -FriendlyName BackupStoragePool | Add-PhysicalDisk -PhysicalDisk $DiskToAddToPool

When you run Get-PhysicalDisk again you will see that there are no disks left that can be pooled, meaning they are all in the storage pool. And we have 8 NVMe disks again. Good!

Now run

Optimize-StoragePool -FriendlyName BackupStoragePool

And let it run. You can check up on its progress via this little script.

while ($true) {
    Get-StorageJob
    Write-Host 'Wait'
    Start-Sleep -Seconds 10
}
Keeping an eye on the storage pool optimization process
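If you would rather not break out of the loop by hand, a variant that exits on its own once the storage jobs are done could look like the sketch below (adjust the polling interval to taste):

# Poll until no storage job is left running, then report completion
while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
    Get-StorageJob | Format-Table Name, JobState, PercentComplete
    Start-Sleep -Seconds 10
}
Write-Host 'All storage jobs have completed.'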

That’s it. All is well again and rebalanced. It also ensures that the storage capacity contributed by the replaced disk will be available in the performance tier when I want to create an extra virtual disk. Storage Spaces at its best, giving me the opportunity to leverage NVMe alongside other disks while maximizing the benefits of ReFS.
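For illustration, creating such an extra virtual disk with Mirror Accelerated Parity could look like the sketch below. It assumes the pool already contains storage tiers named Performance (the NVMe mirror tier) and Capacity (the parity tier), created earlier with New-StorageTier; the volume name and tier sizes are examples.

# Create a tiered ReFS volume: a mirror performance tier plus a parity capacity tier
New-Volume -StoragePoolFriendlyName BackupStoragePool `
    -FriendlyName BackupVolume02 -FileSystem ReFS `
    -StorageTierFriendlyNames Performance, Capacity `
    -StorageTierSizes 2TB, 20TB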

For more info on stand-alone Storage Spaces and PowerShell, see Deploy Storage Spaces on a stand-alone server.

Conclusion

As you have seen, replacing a failed disk in a stand-alone Storage Spaces setup with Mirror Accelerated Parity is not too hard to do. You just need to wrap your head around how Storage Spaces works and investigate the commands a little. For that, I recommend practicing on a virtual machine.

Troubleshooting 100% stalled Veeam backup jobs

Introduction

Recently I got to diagnose a really interesting Veeam Backup & Replication symptom. Imagine you have a backup environment that runs smoothly all week long, but then, suddenly, running backup jobs stall. New jobs that start do not make an ounce of progress. It is as if the state of every job is frozen in time. Let’s investigate and dive into troubleshooting 100% stalled Veeam backup jobs.

That morning, the backup jobs had not made an ounce of progress since the night before, and they never would have. You can leave this for days. Sometimes a job in between does seem to work properly, but most often not, so the job queue builds up.


When looking at the stalled jobs, nothing in the Veeam GUI indicates an error. Looking at the Windows event logs, we see no warning, error, or critical messages. All seems fine. As this Veeam environment uses ReFS on Storage Spaces, we are a bit wary. While the bugs that caused slowdowns have been fixed, we are still alert to potential issues. The difference with the known (fixed) ReFS issues is that this is no slowdown. No sir, the Veeam backup jobs have literally frozen in time, but everything else seems to be functional.

Another symptom of this issue is that the synthetic full backups complete perfectly well, but they finish with an error message nonetheless, due to a timeout. This has no effect on the synthetic backup result (they are usable), but it is disconcerting to see an issue with this.

On top of that, data copies into the ReFS volumes work just fine and at an excellent speed. Via performance monitor, we can see that the rotation of full regions from mirror to parity is also working well once the mirror tier has reached a specified capacity level.

Time to dive into the Veeam logs I would say.

Veeam backup job log

So the next stop is the Veeam logs themselves. While those can seem a little intimidating, they are very useful to scroll through. And sure enough, we find the following in the backup log of one of the stalled jobs.

This goes on all through the night …

For hours on end … it goes on that way.

Virtual machine task logs 1

When we look at the task log of a virtual machine that is still at 0%, we see the same reflected there. Note that nothing happens between 22:46 and 05:30; that’s when I disabled and enabled the vNICs of the preferred networks in the VBR virtual machine and it all sprang back to life.

Notice the total standstill from 22:46 to 05:30 …

So it is clear we have a network issue of some sort. We checked the repository servers and the Hyper-V cluster, but there everything is just fine. So where is it?

Virtual machine task logs 2

We dive into the task log of one of the virtual machines that is being backed up and is hanging at 88%. There we see one reconnect after the other to the repository IP (over the preferred network as defined in VBR). That also happens all night long, until we reset the VBR virtual machine’s preferred network vNICs. In the log snippet below, notice the following:

Error    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond (System.Net.Sockets.SocketException)

From the logs we deduce that the network error appears to be on the VBR virtual machine itself. This is confirmed by the fact that bouncing the vNICs of the preferred network (10.10.110.x is the preferred network subnet) on the VBR virtual machine kicks the jobs back into action. So what is the issue? We start checking the network configurations and settings: the switch ports, pNICs, vNICs, vSwitches, etc. to find out what’s going on. As it seems to work for days or a week before the issue shows up, we suspect a jumbo frame issue, so we start there.
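For reference, bouncing the vNICs can be done from PowerShell as well. A sketch, where 'Preferred Network' stands in for whatever your vNIC is actually called:

# Disable and re-enable the preferred-network vNIC on the VBR virtual machine
Get-NetAdapter -Name 'Preferred Network' | Disable-NetAdapter -Confirm:$false
Start-Sleep -Seconds 5
Get-NetAdapter -Name 'Preferred Network' | Enable-NetAdapter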

The solution

While checking the configuration, we make sure jumbo frames are enabled on the vNIC and on the pNICs of the vSwitch’s NIC team. That’s when we notice the jumbo frames are missing from those pNICs. So we set those again.
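A quick way to check this on all adapters is the sketch below, using the standard *JumboPacket registry keyword (some vendors expose the setting under a different keyword):

# List the jumbo packet setting for every adapter that exposes it
Get-NetAdapterAdvancedProperty -Name * -RegistryKeyword '*JumboPacket' |
    Format-Table Name, DisplayName, DisplayValue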

From the VBR virtual machine we run some ping tests. The default works fine.


When we test with jumbo frames, however, we notice something. With the “do not fragment” option set, the ping tests do not complain that the jumbo frames are too large with “Packet needs to be fragmented but DF set.” Note it just says “Request timed out.” This indicates an issue right here: jumbo frames are set, but they do not work.
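Such a test looks like the one below, where 10.10.110.2 is a hypothetical repository IP on the preferred network and 8500 bytes is a payload that only fits inside a jumbo frame:

# -f sets "do not fragment", -l sets the payload size in bytes
ping 10.10.110.2 -f -l 8500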


As the requests time out and the ping test does not complain about the jumbo frames, we have another issue here than just the jumbo frame settings. It smells of a firmware and/or driver issue. So we dive a bit further. That’s when I notice the driver for the relevant pNICs (Broadcom) is the inbox Windows driver. That’s no good. The inbox drivers only exist to be able to go out and fetch the vendor’s drivers and firmware when needed, as a courtesy so to speak. We copy the vendor’s drivers and firmware to the hosts that require an update. In this case, the nodes where the VBR virtual machine can run. The firmware update requires a reboot. When the host is up and the VBR virtual machine is running, I test again.
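By the way, spotting an inbox driver is easy from PowerShell. A sketch; an inbox driver typically lists Microsoft as the provider rather than the NIC vendor:

# Show who supplies the driver for each physical NIC, plus version and date
Get-NetAdapter -Physical |
    Format-Table Name, InterfaceDescription, DriverProvider, DriverVersion, DriverDate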

Bingo, now a ping test succeeds.


What happened?

So did we really forget to update the drivers? Did we walk out of the offices to go into lockdown for the Corona crisis and forget about it? In the end, it turned out they did run the updates for the physical hosts. But for some reason, the Broadcom firmware and drivers did not get updated properly. However, that failed update seems to have also removed the jumbo frame settings from the pNICs that are used for the virtual switch. After fixing both of these, we have not seen the issue return.

Remarks

The preferred networks do not absolutely have to be present on the VBR server itself. Defined, yes; present, no. But it speeds up backup job initialization a lot when they are present on the VBR server, and Veeam also indicates to do so in their documentation.

Why jumbo frames? Ah well, the networks we use for the preferred networks are end-to-end jumbo frame enabled. So we maintain this through to the VBR server. We might get away with not setting jumbo frames on the VBR server, but we want to be consistent.

Conclusion

It pays to make sure you have all settings correctly configured and are running the latest known good firmware. But that should have been the case here. And it all worked so well for quite a while before the backup jobs stalled. The issue can lie in the details, and sometimes things are not what you assume they are. Always verify, and verify again.

I hope this helps someone out there if they are ever troubleshooting 100% stalled Veeam backup jobs. If you need help, reach out in the comments. There are a lot of very experienced and respected people around in my network that can help. Maybe even I can lend a hand and learn something along the way.

Resolve the 500 Internal Server Error for PHP 7.4 with IIS on Windows Server for WordPress 5.3.1

Resolve the 500 Internal Server Error for PHP 7.4 with IIS on Windows Server for WordPress

Many of my readers will know I run a tight ship when it comes to keeping my blog infrastructure up to date and as secure as I can. This means, among other things, I keep my operating system recent and patched. The same goes for my MySQL database, PHP versions, and WordPress + plugins.

Recently I wanted to upgrade to PHP 7.4, which was just released and should be compatible with WordPress 5.3.1.


But once I switched to PHP 7.4, my site responded with the dreaded 500 Internal Server Error. Instead of going for trial and error, I decided to zoom in on the root cause to resolve the 500 Internal Server Error for PHP 7.4 with IIS on Windows Server.

Troubleshooting steps

The fastest way to figure out what issues PHP has is to open an elevated command prompt and navigate to the install path of your PHP version. For me, that would be C:\Program Files\PHP\php-7.4.0-nts-Win32-vc15-x64. Note that IIS requires the non-thread-safe version of PHP.

Once there run php -m
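In PowerShell, that boils down to the following; the path is my install path, so adjust it to yours:

# Go to the PHP install folder and list the loaded modules;
# startup warnings, like a runtime mismatch, surface right away
Set-Location 'C:\Program Files\PHP\php-7.4.0-nts-Win32-vc15-x64'
.\php.exe -m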

I got greeted by this warning.

PHP Warning: ‘vcruntime140.dll’ 14.0 is not compatible with this PHP build linked with 14.16 in Unknown on line 0


That’s clear enough. I need to upgrade my Visual C++ runtime. I have one, but I need a more recent version, basically the one for Visual C++ 2015-2019. You can grab those here for both x64 and x86.

https://aka.ms/vs/16/release/VC_redist.x64.exe
https://aka.ms/vs/16/release/VC_redist.x86.exe

I tend to install both. Just in case I have an issue with x64 PHP, I can try the x86 version to try and resolve it. Normally I default to x64 and for the past years that has worked very well.

Once I had installed those, my blog sprang back to life and that was it. If you run php -m again, you won’t see the warning anymore. No need to switch themes (I have a simple one that has longevity). I also do not have to empirically disable/enable plugins. I select quality, free, and modern plugins that I keep up to date or replace when needed. So, for me, that was the fix.

Since most troubleshooting guides don’t start here, give it a shot. The HTTP Error 500.0 – Internal Server Error is an annoying one to troubleshoot. The PHP error log might not tell you anything, as it goes wrong before PHP can log anything, and enabling more detailed error handling does not always pinpoint something precisely. It might even send you on a wild goose chase.


If you did change the error pages feature settings for your site to “Detailed errors” to get more detailed information, don’t forget to set it back. Configure it back to the default of “Detailed errors for local requests and custom error pages for remote requests” or your custom pages. You don’t want the details shown publicly.
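If you prefer to do that from PowerShell rather than the IIS Manager GUI, a sketch with the WebAdministration module could look like this, where 'Default Web Site' is an example site name:

# Restore the default error mode: detailed errors locally, custom pages remotely
Import-Module WebAdministration
Set-WebConfigurationProperty -PSPath 'IIS:\Sites\Default Web Site' `
    -Filter 'system.webServer/httpErrors' -Name 'errorMode' -Value 'DetailedLocalOnly'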

That’s it. I hope it helps someone who tries to resolve the 500 Internal Server Error for PHP 7.4 with IIS on Windows Server.

November 2019 updates caused Windows Server 2012 reboot loop

Introduction

Some laggards that still have Windows Server 2012 virtual machines running got a bit of a nasty surprise last weekend. A number of them went into a boot loop. Apparently, the November 2019 updates caused a Windows Server 2012 reboot loop. Well, we have dealt with update issues before, like in Quick Fix Publish: VM won’t boot after October 2017 Updates for Windows Server 2016 and Windows 10 (KB4041691). No need to panic.

Symptoms

Not all Windows Server 2012 virtual machines were affected. We did not see any issues with Windows Server 2012 R2, 2016, or 2019. Well, the symptom is a reboot loop, and below I have a sequence of what it looked like visually.

The screenshots of the sequence tell the story: restarting, looking good so far, stage 2 of 4 still seems OK, it could still be OK … but no, it starts again in an endless loop.

So what caused this Windows Server 2012 reboot loop?

Fix

We turned off the virtual machines on which the November 2019 updates caused the Windows Server 2012 reboot loop. We started them up again in “Safe mode”, which completed successfully. Finally, we then did a normal reboot and that completed as well. All updates had been applied bar one. That was the 2019-11 Servicing Stack Update for Windows Server 2012 (KB4523208).
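If you cannot interrupt the boot to pick Safe mode, one way to force it is via bcdedit. A sketch, to be run from an elevated prompt while the guest is briefly up or from a recovery command prompt:

# Force the next boots into Safe mode (minimal)
bcdedit /set '{default}' safeboot minimal

# After fixing things, remove the setting to boot normally again
bcdedit /deletevalue '{default}' safeboot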

We manually installed it via Windows update and that succeeded.

When reading the information about this update at https://www.catalog.update.microsoft.com/ScopedViewInline.aspx?updateid=5fa2a68f-e7cd-43c7-a48a-5e080472cb77, it states it needs to be installed exclusively.
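For what it’s worth, installing such a package on its own from a catalog download is a one-liner; the MSU filename below is a placeholder for whatever the catalog actually gives you:

# Install a standalone update package on its own, without other updates
wusa.exe .\windows8-rt-kb4523208-x64.msu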

Maybe that was the root cause. It got deployed via WSUS with the other updates for November 2019.

Anyway, all is well now. I remind you that we had at least two ways of restoring those virtual machines in case we had not been able to fix them. Have known good backups and a way to restore them, people.

Conclusion

We fixed the issue and patched those servers completely. So all is well now. Except for the fact that they have now, once more, been urged to get off this operating system asap. You don’t go more than N-2 behind; it incurs operational overhead and risks. They did not test the updates against this old server VM and got bitten. Technology debt without a plan is never worth it.