As I have started to use XFS in bite-size deployments to gain experience with it I wanted to write up some of the toolings I found to manage XFS file systems. Here’s how to check/repair/defragment an XFS volume.
My main use case for XFS volumes is on hardened Linux repositories with immutability to use with Veeam Backup & Replication v11 and higher. It’s handy to be able to find out if XFS needs repairing and if they do, repair them. Another consideration is fragmentation. You can also check that and defrag the volume.
Check XFS Volume and repair it
xfs_repair is the tool you need. You can both check if a volume needs repair and actually repair it with the same tool. Note that the use of xfs_check has been depreciated or is not even available (anymore).
To work with xfs_repair you have to unmount the filesystem, so there will be downtime. Plan for a maintenance window.
To check the file system use the -n switch
sudo xfs_repair -n /dev/sdc
There is nothing much to do but we’ll now let’s run the repair.
sudo xfs_repair /dev/sdc
The output is similar as for the check we did for anything to repair is basically a dry run of what will be done. In this case, nothing.
Now, don’t forget to mount the file system again!
sudo mount /dev/sdc /mnt/veeamsfxrepo01-02
Check a volume for fragmentation and defrag it
Want to check the fragmentation of an XFS volume? You can but again, with xfs_db. The file system has to be unmounted for that or you will get the error xfs_db: can’t determine device size. To check for fragmentation run the following command against the storage device /file system.
sudo xfs_db -c frag -r /dev/sdc
Cool, now we know that we can defrag it online. For that we use xfs_fsr.
xfs_fsr /devsdc /mnt/veeamxfsrepo01-02
xfs_scrub – the experimental tool
xfs_scrub is a more recent addition but the program is still experimental. The good news is it will check and repair a mounted XFS filesystem. At least it sounds promising, right? It does, but it doesn’t work (Ubuntu 20.04.1 LTS).
That’s it. I hope this helps you when you decide to take XFS for a spin for your storage needs knowing a bit more about the tooling. As said, for me, the main use case is hardened Linux repositories with immutability to use with Veeam Backup & Replication v11. In a Hyper-V environment of course.
Recently I was implementing a high available Kemp LoadMaster X15 system. I prepared everything, documented the switch and LM-X15 configuration, and created a VISIO to visualize it all. That, together with the migration and rollback scenario was all presented to the team lead and the engineer who was going to work on this with us. I told the team lead that all would go smoothly if my preparations were good and I did not make any mistakes in the configuration. Well, guess what, I made a mistake somewhere and had to solve a Kemp LoadMaster ad digest – md2=[31084da3…] md=[20dcd914…] – Check vhid, password and virtual IP address log entry.
Check vhid, password and virtual IP address
As, while all was working well, we saw the following entry inundate the system message file log:
<date> <LoadMasterHostName> ucarp: Bad digest – md2=[xxxxx…] md=[xxxxx…] – Check vhid, password and virtual IP address
Wait a minute, as far as I know all was OK. The VHID was unique for the HA pair and we did not have duplicate IP addresses set anywhere on other network appliances. So what was this about?
Figuring out the cause
Well, we have a bond0 on eth0 and eth2 for the appliance management. We also have eth1 which is a special interface used for L1 health checks between the Loadmasters. We don’t use a direct link (different racks) so we configure them with an IP on a separate dedicated subnet. Then we have the bonds with the VLAN for the actual workloads via Virtual Services.
We have heartbeat health checks on bond0, eth1 and on at least one VLAN per bonds for the workloads.
Confirm that Promiscuous mode and PortFast are enabled. Check! HA is configured for multicast traffic in our setup so we confirm that the switch allows multicast traffic. Check!
Make sure that switch configurations that block multicast traffic, such as ‘IGMP snooping’, are disabled on the switch/switch ports as needed. Check!
Now let’s look at possible causes and check our confguration:
So what else? The documentation states as possible other causes the following:
There is another device on the network with the same HA Virtual ID. The LoadMasters in a HA pair should have the same HA Virtual ID. It is possible that a third device could be interfering with these units. As of LoadMaster firmware version 7.2.36, the LoadMaster selects a HA Virtual ID based on the shared IP address of the first configured interface (the last 8 bits). You can change the value to whatever number you want (in the range 1 – 255), or you can keep it at the value already selected. Check!
An interface used for HA checks is receiving a packet from a different interface/appliance. If the LoadMaster has two interfaces connecting to the same switch, with Use for HA checks enabled, this can also cause these error messages. Disable the Use for HA checks option on one of the interfaces to confirm the issue. If confirmed, either leave the option disabled or move the interface to a separate switch.
I am sure there is no interference from another appliance. Check! As we had checked every other possible cause the line in red caught my attention. Could it be?
Time for some packet captures
So we took a TCP dump on bond0 and looked at it in Wireshark. You can make a TCP dump via debug options under System Log Files.
Select your interface, click start, after 10 seconds or so click stop and download the dump
Do note that Wireshark identifies this as VRRP, but the LoadMaster uses CARP (open source) do set it to decode as CARP, that way you’ll see more interesting information in Info
Also filter on ip.dst == 244.0.0.18 (multicast address). What we get here is that on eth0 we see multicasts from eth1. That is the case described in the documentation. Aha!
So now what, do we need to move eth1 to another switch to solve this? Or disable the HA check? No, luckily not. Read on.
The fix for Check vhid, password and virtual IP address
No, I did not use one or more separate switches just to plug in the heartbeat HA interfaces on the LoadMasters. What I did is create a separate VLAN for the eth0 HA heartbeat uplink interfaces on the switches. This way I ensure that they are in a separate unicast group from the management interface uplinks on the switches
By default the Multicast TV VLAN Membership is per VLAN. The reason the actual workload interfaces did not cause an issue when we enabled HA checks is that these were trunk ports with a number of allowed VLANs, different from the management VLAN, which prevents this error being logged in the first place.
That this works was confirmed in the packet trace from the LM-X15 after making the change.
So that was it. The error was gone and we could move along with the project.
Well, I should have know as normally I do put those networks not just in a separate subnet but also make sure they are on different VLANs. This goes to show that no matter how experienced you are and how well you prepare you will still make mistakes. That’s normal and that’s OK, it means you are actually doing something. Key is how you deal with a mistake and that why I wrote this. To share how I found out the root cause of the issue and how I fixed it. Mistakes are a learning opportunity, use them as such. I know many organizations frown upon a mistake but really, these should grow up and don’t act this silly.
Even with the best hardware and vendor, a disk can fail and yes, it happened to one of our NVME drives. Reseating the disk did not help and we were one disk shot in device manager. So yes the disk was dead (or worse the bus where it was seated, but that is less likely).
Replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity
So let’s take a look by running
I immediately see that we have an NVME that has lost communication, it is gone and also no longer displays a disk number. It seems to be broken.
That means we need to get rid of it in the storage spaces pool so we can replace it.
Getting rid of the failed disk properly
I put the disk that lost communication into a variable
Let this complete and check again with Get-PhysicalDisk, you will see the problematic disk has gone. Note that there are only 7 NVME disks left.
It does not show an unrecognized disk that is still visible to the OS somehow. So we cannot try to reset it to try to get it back into action. We need to replace it and so we request a replacement disk with DELL support and swap them out.
Putting the new disk into service
Now we have our new disk we want it to be added to the storage pool. You should now see the new disk in Disk Manager as online and not initialized. Or when you select Add Physical Disk in the storage pool in Server Manager.
But we were doing so well in PowerShell so let’s continue there. We will add the new disk to the storage pool. Run
That’s it. All is well again and rebalanced. It also ensures the storage capacity contributed by the replaced disk will be available in the performance tier when I want to create an extra virtual disk. Storage Spaces at its best giving me the opportunity to leverage NVMe with other disks while maximizing the benefits of ReFS.
As you have seen replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity is not too hard to do. You just need to wrap your head around how storage spaces work and investigate the commands a little. For that I recommend practicing on a Virtual Machine.
Recently I got to diagnose a really interesting Veeam Backup & Replication symptom. Imagine you have a backup environment that runs smoothly. All week long but then, suddenly, running backup jobs stall. News jobs that start do not make an ounce of progress. It is as the state of every job is frozen in time. Let’s investigate and dive into troubleshooting 100% stalled Veeam backup jobs.
Troubleshooting 100% stalled Veeam backup jobs
When looking at the stalled jobs, nothing in the Veeam GUI indicates an error. Looking at the Windows event logs we see no warning, error, or critical messages. All seems fine. As this Veeam environment uses ReFS on storage spaces we are a bit weary. While the bugs that caused slowdowns have been fixed, we are still alert to potential issues. The difference with the know (fixed) ReFS issues that this is no slowdown, No sir, the Veeam backup jobs have literally frozen in time but everything seems to be functional otherwise.
Another symptom of this issue is that the synthetic full backups complete perfectly well, but they finfish with an error message none the less due to a time out. This has no effect on the synthetic backup result (they are usable) but it is disconcerting to see an issue with this.
On top of that, data copies into the ReFS volumes work just fine and at an excellent speed. Via performance monitor, we can see that the rotation of full regions from mirror to parity is also working well once the mirror tier has reached a specified capacity level.
Time to dive into the Veeam logs I would say.
Veeam backup job log
So the next stop is the Veeam logs themselves. While those can seem a little intimidating, they are very useful to scroll through. And sure enough, we find the following in one of the stalled jobs its backup log.
For hours on end … it goes on that way.
VIRTUAL MACHINE TASK LOGS 1
When we look at the task log of ar virtual machine that is still at 0% we see the same reflected there. Note that nothing happens between 22:465 and 05:30, that’s when I disabled and enabled the vNIC of the preferred networks in the VBR virtual machine and it all sprung back to life.
So it is clear we have a network issue of some sort. We checked the repository servers and the Hyper-V cluster but there everything is just fine. So where is it?
Virtual Machine task logs 2
We dive into the task log of one of the virtual machines who’s backing up and that is hanging at 88%. There we see one after the other reconnect to the repository IP (over the preferred network as defined in VB). That also happens all night long until we reset the VBR virtual machine’s preferred network vNICs. In the log snippet below notice the following:
Error A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond (System.Net.Sockets.SocketException)
From the logs we deduct that the network error appears to be on the VBR virtual machine itself. This is confirmed by the fact that bouncing the vNICs of the preferred network (10.10.110.x is the preferred network subnet) on the VBR virtual machine kicks the jobs back into action. So what is the issue? So we start checking the network configurations and settings. The switch ports, pNICs, vNIC, vSwitches etc. to find out what’s going on, As it seems to work for days or a week before the issue shows up we suspect a jumbo frame issue so we start there.
While checking the configuration we to make sure jumbo frames are enabled on the vNIC and the pNICs of the vSwitch’s NIC team. That’s when we notice the jumbo frames are missing from those pNICs. So we set those again.
From the VBR virtual machine we run some ping tests. The default works fine.
When we test with jumbo frames however we notice something. The ping tests do not complain about jumbo frames being too large and that with the “do not fragment” option set the “Packet needs to be fragmented but DF set.” Note it just says “request timed out”. This indicates an issue right here, jumbo frames are set but they do not work.
As the requests time out and the ping test does not complain about the jumbo frames we have another issue here than just the jumbo frame settings. It smells of a firmware and/or driver issue. So we dive a bit further. That’s when I notice the driver for the relevant pNICs (Broadcom) is the inbox Windows driver. That’s no good. The inbox drivers only exists to be able to go out and fetch the vendor’s driver and firmware when need, as a courtesy so to speak. We copy those to the hosts that require an update. In this case, the nodes where the VBR virtual machine can run. The firmware update requires a reboot. When the host is up and the VBR virtual machine is running I test again.
Bingo, now a ping test succeeds.
So did we really forget to update the drivers? Did we walk out of the offices to go in lockdown for the Corona crisis and forget about it? In the end, it turned out they did run the updates for the physical hosts. But for some reason, the Broadcom firmware and drivers did not get updated properly. However, that failed update seems to have also removed the Jumbo frame settings from the pNICs that are used for the virtual switch. After fixed both of these we have not seen the issue return.
The preferred networks do not absolutely have to be present on the VBR server itself. Define, yes, present, no. But it speeds up backup job initialization a lot when they are there present on the VBR server and Veeam also indicates to do so in their documentation.
Why jumbo frames? Ah well the networks we use for the preferred networks are end to end jumbo frame enabled. So we maintain this in to the VBR server. We might get away by not setting jumbo frames on the VBR server but we want to be consistent.
It pays to make sure you have all settings correctly configured and are running the latest and greatest known good firmware. But that should have been the case here. And it all worked so well for quite a while before the backup jobs stall. The issue can lie in the details and sometimes things are not what you assume they are. Always verify and verify it again.
I hope this helps someone out there if they are ever troubleshooting 100% stalled Veeam backup jobs If you need help, reach out in the comments. There are a lot of very experienced and respected people around in my network that can help. Maybe even I can lend a hand and learn something along the way.