Optimize the Veeam preferred networks backup initialization speed

When Veeam preferred networks cause slow backup initialization speeds

When you use preferred networks in Veeam, you choose a network other than the default host network for backups and restores. In this post, we’ll discuss how to optimize the backup initialization speed with Veeam preferred networks, because we aim for optimal performance. TL;DR: the Veeam Backup & Replication server itself needs connectivity to the preferred networks as well. Forgetting this is a common mistake I run into every now and then. Ultimately it makes people think Veeam is slow. It is not; it is just a configuration mistake.

Why use a preferred network?

Backups can fill up a 1Gbps pipe very fast. Many people still use 1Gbps networking as the default connectivity to their hosts. Even when they leverage 10Gbps or better, it is often in a converged network setup, which means that only part of the bandwidth goes to host connectivity. Few have 10Gbps for “just” host connectivity. So it makes sense to select a different, higher-bandwidth network for backup and restore traffic.

Hence, for high-volume, high-performance backups and restores it is smart to look for a bigger pipe to leverage. Some environments have dedicated backup networks at 10Gbps or better. But far more often we find high-bandwidth networks that exist for other purposes. In Hyper-V environments, you’ll have those for SMB traffic such as CSV, the Live Migration variants and storage replication. Hyper-Converged Infrastructure deployments use these networks for storage as well. With S2D you’ll find more and more 25/50/100Gbps. All of these can be leveraged as a preferred backup network in Veeam.

Setting up a preferred network

Setting up a preferred network is easy. First of all, you figure out which networks to use. You then add those to the preferred networks as follows:

In the file menu, select “Network Traffic Rules”.

Click “Add” and specify the source IP range as well as the target IP range. You can opt to encrypt the traffic and/or set a bandwidth limit.

We have two SMB storage networks available, so we enter both.

There is no need to have the preferred network registered in DNS. It will work fine without.

I hope it is clear that the source (Hyper-V hosts), the target (the backup repository or the extents in a Scale-Out Backup Repository) and any off-host proxies need connectivity to the preferred network(s). If you leverage WAN accelerators, gateway servers or log shipping servers, then these also need access. Last but not least, you should also make sure that the Veeam Backup & Replication (VBR) server has access to the preferred networks. This is the one that a lot of people seem to forget, maybe because the VBR server is most often a VM (when it is not a shared role on the repository server or such) and things do work without it.

When the VBR server has no access to the preferred networks, things still work, but the initialization of backup and restore jobs is a lot slower. Let’s test this.
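A quick way to check this from the VBR server is to see whether it can open a TCP connection to the preferred network addresses of your repositories, extents and proxies. What follows is only a minimal sketch, not something Veeam ships. The function name Test-TcpEndpoint is made up for illustration, and it relies on nothing but the .NET TcpClient class.

```powershell
# Hypothetical helper: tries to open a TCP connection and reports whether it succeeded.
function Test-TcpEndpoint {
    param(
        [Parameter(Mandatory)][string]$Address,
        [Parameter(Mandatory)][int]$Port,
        [int]$TimeoutMs = 2000
    )

    $client = New-Object System.Net.Sockets.TcpClient
    try {
        # Wait() returns $false when the timeout expires and throws when the connection is refused.
        $task = $client.ConnectAsync($Address, $Port)
        $reachable = $task.Wait($TimeoutMs) -and $client.Connected
    }
    catch {
        $reachable = $false
    }
    finally {
        $client.Close()
    }

    [pscustomobject]@{
        Address   = $Address
        Port      = $Port
        Reachable = [bool]$reachable
    }
}
```

For example, running `Test-TcpEndpoint -Address 10.10.110.2 -Port 2509` on the VBR server uses the address and port from the log entry shown later in this post; substitute the addresses and ports of your own environment. On Windows, the built-in `Test-NetConnection -ComputerName 10.10.110.2 -Port 2509` does a similar check.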

Slow Initialization of backup and restore jobs

When the VBR server cannot reach the preferred networks, you will probably notice the following:

  • First of all, a slowdown in the overall initialization of the backup and restore job.
  • This manifests itself as a slow start of the actual VM backup/restore and a reduced number of simultaneous VM backups/restores within a job.

Without the VBR server having connectivity to the preferred networks

23:54 to complete the backup job (no connectivity to the preferred network)

With the VBR server having connectivity to the preferred networks. Notice how smooth and continuous the throughput is.

07:55 to complete the backup job (with connectivity to the preferred network) => 3 times as fast.
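For those who like to check the math, here is a small sketch that computes the speedup from the two job durations above. The variable names are mine, nothing more.

```powershell
# Job durations taken from the two test runs in this post.
$withoutConnectivity = New-TimeSpan -Minutes 23 -Seconds 54
$withConnectivity    = New-TimeSpan -Minutes 7 -Seconds 55

# Ratio of the two durations and the absolute time saved.
$speedup   = [math]::Round($withoutConnectivity.TotalSeconds / $withConnectivity.TotalSeconds, 2)
$timeSaved = $withoutConnectivity - $withConnectivity

"Speedup: {0}x, time saved: {1:mm\:ss}" -f $speedup, $timeSaved
```

That works out to 1434 seconds versus 475 seconds, so the job ran roughly 3.02 times as fast and finished 15 minutes and 59 seconds sooner.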

When you look into the Veeam backup logs for this job, you will find attempts by the VBR server to connect to the preferred networks at various stages. If it can’t connect, it has to wait until the attempt times out. You see entries like:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.110.2:2509 (System.Net.Sockets.SocketException)

This is just a small part of all the socket timeouts you will find for every single VM in the job. Here VBR is trying to connect to one of the extents in the SOBR.

This happens for every file in the backups (config files and disks) and for every extent in the Scale-Out Backup Repository (per VM backup chain). This slows down the entire backup job tremendously.
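If you want to see how often this happens and for which endpoints, you can count the timeouts per address in the job logs. Below is a rough sketch. The function name Get-TimedOutEndpoint is hypothetical, and the pattern is based purely on the log entry shown above, so adjust it if your entries look different.

```powershell
# Hypothetical helper: counts SocketException entries per IP:port in the log lines piped to it.
function Get-TimedOutEndpoint {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Line
    )
    begin {
        $counts  = @{}
        # Matches e.g. "10.10.110.2:2509 (System.Net.Sockets.SocketException)".
        $pattern = '(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\s+\(System\.Net\.Sockets\.SocketException\)'
    }
    process {
        foreach ($match in [regex]::Matches($Line, $pattern)) {
            $endpoint = '{0}:{1}' -f $match.Groups[1].Value, $match.Groups[2].Value
            $counts[$endpoint] = 1 + [int]$counts[$endpoint]
        }
    }
    end {
        # Emit one object per endpoint, most frequent first.
        $counts.GetEnumerator() |
            Sort-Object Value -Descending |
            ForEach-Object { [pscustomobject]@{ Endpoint = $_.Key; Hits = $_.Value } }
    }
}
```

Usage would be something like `Get-ChildItem 'C:\ProgramData\Veeam\Backup' -Recurse -Filter *.log | Get-Content | Get-TimedOutEndpoint`. That log folder is an assumption on my part; point it at wherever your VBR server keeps its job logs.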

Conclusion

I always make sure that the VBR servers in my environments have preferred network connectivity. Consequently, initialization is faster for both backups and restores. Test it out for yourself! It is the first thing I check when people complain about really slow backups. Do they have preferred networks set up? Then check whether the VBR server has connectivity to them!

When using file shares as backup targets you should leverage continuously available SMB 3 file shares

Introduction

When using file shares as backup targets you should leverage Continuously Available SMB 3 file shares. For now, at least. A while back Anton Gostev wrote a very interesting piece in his “The Word from Gostev”. It was about an issue they saw with people using SMB 3 file shares as backup targets with Veeam Backup & Replication. To some it was a reason to cry wolf. But it is probably a too little-known issue that can, and as such might (will), occur. You need to be aware of it to make good decisions and give good advice.

I’m in the business of building rock solid solutions that range from highly available to continuously available. This means I’m always looking into the benefits and drawbacks of design choices. By that I mean I study, test and verify them as well. I don’t do “Paper Proof of Concepts”. Those are just borderline fraud.

So, what’s going on and what can you do to mitigate the risk or avoid it altogether?

Setting the scenario

Your backup software (in our case Veeam Backup & Replication) running on Windows leverages an SMB 3 file share as a backup target. This could be a Windows Server file share but it doesn’t have to be. It could be a 3rd party appliance or storage array.

The SMB client

The client is the SMB 3 client Microsoft delivers in the OS (the version depends on the OS version), so this client is under the control of Microsoft. Let’s face it, the source in these scenarios is a Hyper-V host/cluster or a Windows SMB 3 file share, clustered or not.

The SMB server

With regard to the target, i.e. the SMB server, you have a couple of possibilities: Microsoft or 3rd party.

If it’s a third-party SMB 3 implementation on Linux or an appliance, you might not even know what is used under the hood as the OS and 3rd party SMB 3 solution. It could be a storage vendor’s native SMB 3 implementation on their storage array, or a simple commodity NAS whose vendor bought a 3rd party solution to leverage. It might be highly available, but in many (most?) cases it is not. It’s hard to know whether the 3rd party implements/leverages the full capabilities of the SMB 3 stack as Microsoft does. You might not know whether there are any bugs in there either.

You get the picture. If you bank on appliances, find out and test it (trust but verify). But let’s assume its capabilities are on par with what Windows offers. That means the subject being discussed goes for both 3rd party offerings and Windows Server.

When the target is Windows Server we are talking about SMB 3 file shares that are either Continuously Available or not. For backup targets, General Purpose File Shares will do. You could even opt to leverage SOFS (on S2D for example). In this case you know what’s implemented in which version and you get bug fixes from MSFT.

When you have continuously available (CA) SMB 3 shares you should be able to sleep soundly. SMB 3 has you covered. The risk we are discussing relates to non-CA SMB 3 file shares.
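On Windows Server you can check which of your shares are continuously available with the SMB cmdlets. A small sketch follows; the share name and file server role name are hypothetical, so replace them with your own. Keep in mind that continuous availability only makes sense for a share on a clustered file server role.

```powershell
# List the regular (non-administrative) shares and whether they are continuously available.
Get-SmbShare -Special $false |
    Select-Object Name, ScopeName, Path, ContinuouslyAvailable

# Hypothetical share and clustered file server role names; replace them with your own.
$ShareName = "VeeamBackups"
$ScopeName = "BACKUP-FS"

# Turn on continuous availability for an existing share on a clustered file server role.
Set-SmbShare -Name $ShareName -ScopeName $ScopeName -ContinuouslyAvailable $true -Force
```

Run this on a node of the file server cluster, and test a failover of the role during a backup before you rely on it.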

What could go wrong?

Let’s walk through this. When your backup software writes to an SMB 3 share it leverages the SMB 3 client & server in the SMB 3 stack. Unlike when Veeam uses its own data mover, all the cool data persistence stuff is handled by Windows transparently. The backup software literally hands off the job to Windows. Which is why you can also leverage SMB Multichannel and SMB Direct with your backups if you so desire. Read “Veeam Backup & Replication leverages SMB Multichannel” and “Veeam Backup & Replication Preferred Subnet & SMB Multichannel” for more on this.

If you are writing to a non-CA SMB 3 share, your backup software receives the message that the data has been written. What this actually means is that the data is cached in the SMB client’s “queue” of data to write, but it might not have been written to the storage yet.

For short interruptions this is survivable, and for Office and the like this works well and delivers fast performance. If the connection is interrupted or the share is unavailable, the queue keeps the data in memory for a while. So, if the connection is restored, the data can still be written. The SMB 3 client is smart.

However, this has its limits. The data cached in the queue doesn’t exist eternally. If the connectivity loss or the file share unavailability lasts too long, the data in the SMB 3 client cache is lost. But it was never written to storage! To add a little insult to injury, the SMB client sends back “we’re good” even when the share has been unreachable for a while.

For backups this isn’t optimal. Actually, the alarm bells should start ringing when it is about backups. Your backup software got a message that the data has been written and doesn’t know any better. But the data is not on the backup target. This means the backup software will run into issues with corrupted backups sooner or later (next backup, restores, synthetic full backups, merges, whatever comes first).

Why did they make it this way?

This is OK default behavior. It works just fine for Office files and most knowledge worker client software, which have temp files, auto recovery and all such lovely capabilities, and where the work is mostly individual and interactive. Those applications are resilient to this by nature. Mind you, all my SMB 3 file share deployments are clustered and highly available where appropriate. By “appropriate” I mean when we don’t have offline caching for those shares as a requirement, as those two don’t mix well (https://blogs.technet.microsoft.com/filecab/2016/03/15/offline-files-and-continuous-availability-the-monstrous-union-you-should-not-consecrate/). But when you know what you’re doing it rocks. I can actually fail over my file server roles all day long for patching, maintenance & fun when the clients do talk SMB 3. Oh, and it was a joy to move that data to new SANs under the hood. More on that perhaps in another post. But I digress.

You need adequate storage in all use cases

This is a no-brainer. Nothing will save you if the target storage isn’t up to the task. Not the Veeam data mover, and not SMB 3 shares with continuous availability. Let’s be very clear about this. Even at the cost-effective side of the equation the storage has to be of sufficiently decent quality to prevent data loss. That means decent controllers with battery-backed cache as a safeguard, etc. Whether that’s a SAN, a “simple” RAID controller or pass-through HBAs for Storage Spaces doesn’t matter. You have to have it. Putting your data on SATA drives without any safeguard is a sure way of risking data loss. That’s as simple as it gets. You don’t do that, unless you don’t care. And if you didn’t care, you would not be reading this!

Can this be fixed?

Well as a non-SMB 3 developer I would say we need an option added that the SMB 3 client can be configured to not report success until that data has been effectively written on the target, or at least has landed somewhere on quality, cache protected storage.

This option does not exist today. I do not work for Microsoft but I know some people there and I’m pretty sure they want to fix it. I’m just not sure how big of a priority it is at the moment. For me it’s important that when a backup application writes to a non-continuously available file share it can request that nothing is cached and the SMB server says “OK, got it, I will behave accordingly”. Now the details of the implementation will be different, but you get the message.

I would like to make the case that it should be a configurable option. It is not needed for all scenarios and it might (will) have an impact on performance. How big that would be I have no clue. I’m just a blogger who does IT as a job. I’m not a principal PM at Microsoft or so.

If you absolutely want to make sure, use clustered continuously available file shares. That works like a charm. Read the blog post “Continuous available general purpose file shares & ReFSv3 provide high available backup targets”; it even has one of my not so professional videos showcasing this.

It’s also important not to panic. Most of you might never even have heard of or experienced this. But depending on the use case and the quality of the network and processes, you might. In a backup scenario this is not something that makes for a happy day.

The cry wolf crowd

I’ll be blunt. WARNING. Take a hike if you have a smug “Windoze sucks” attitude. If you want to deal dope you shouldn’t be smoking too much of your own stuff, but above all you should know it inside out. NFS in all its varied implementations has potential issues as well. So, I’d also do my due diligence with any solution you recommend. Trust but verify, remember?! Actually, Veeam gave an example of one such issue for an appliance with NFS. Guess what, everyone has issues. Choose your poison, drink it and let others choose theirs. Condescending remarks just make you look bad every time. And guess what, that impression tends to last. Now on the positive side, I hear that caching can be disabled on modern NFS client implementations. So, the potential issue is known and is being addressed there as well.

Conclusion

Don’t panic. I just discussed a potential issue that can occur and that you should be aware of when deciding on a backup target. If you have rock solid networking and great server management processes you can go far without issues, but that’s not 100% fail proof. As I’m in the business of building the best possible solutions, it’s something you need to be aware of.

But know that these issues can occur, and know when and why, so you can manage the risk optimally. Making Windows Server SMB 3 file shares Continuously Available will protect against this effectively. It does require failover clustering. But at least now you know why I say that when using file shares as backup targets you should leverage continuously available SMB 3 file shares.

When you buy appliances or 3rd party SMB 3 solutions, this issue also exists, so be extra diligent, even with highly available shares. Make sure it works as it should!

I hope Microsoft resolves this issue as soon as possible. I’m sure they want to. They want their products to be the best and fix any possible concerns you might have.

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups

Introduction

It’s no secret that, while guest clustering with VHD Sets works very well, we’ve had some struggles with host level backups. Right now I leverage Veeam Agent for Windows (VAW) to do in-guest backups. The most recent versions of VAW support Windows Failover Clustering. I’d love to leverage host level backups, but I struggled to make this reliable for quite a while. As it turned out recently, there are some virtual machine permission issues involved that we need to fix. Both Microsoft and Veeam have published guidance on this in KB articles. We automated correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups.

The KB articles

In early August Microsoft published a KB article with all the tips for when things fail: “Errors when backing up VMs that belong to a guest cluster in Windows”. Veeam also recapped the conditions and settings needed to leverage guest clustering and perform host level backups. The Veeam article is “Backing up Hyper-V guest cluster based on VHD set”. Read these articles carefully and make sure all you need to do has been done.

For some reason another prerequisite is not mentioned in these articles. It is, however, discussed in “ConfigStoreRootPath cluster parameter is not defined” and in the Set-VMHostCluster documentation (https://docs.microsoft.com/en-us/powershell/module/hyper-v/set-vmhostcluster?view=win10-ps). You will need to set this to create the Hyper-V collections needed for recovery checkpoints on VHD Sets. It is a little-known setting with very little documentation.

But the big news here is fixing a permissions related issue!

The latest addition to the list of attention points is a permission issue. The permissions on the shared files of the guest cluster VMs are not correct by default. This leads to the following hard to pinpoint error.

Error Event 19100 Hyper-V-VMMS 19100 ‘BackupVM’ background disk merge failed to complete: General access denied error (0x80070005).

To fix this issue, the folder that holds the VHDS files and their snapshot files must be modified to give the VMMS process additional permissions. To do this, follow these steps for correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup.

Determine the GUIDs of all VMs that use the folder. To do this, start PowerShell as administrator, and then run the following command:

get-vm | fl name, id
Output example:
Name : BackupVM
Id : d3599536-222a-4d6e-bb10-a6019c3f2b9b

Name : BackupVM2
Id : a0af7903-94b4-4a2c-b3b3-16050d5f80f2

For each VM GUID, assign the VMMS process full control by running the following command:
icacls <Folder with VHDS> /grant "NT VIRTUAL MACHINE\<VM GUID>":(OI)F

Example:
icacls "c:\ClusterStorage\Volume1\SharedClusterDisk" /grant "NT VIRTUAL MACHINE\a0af7903-94b4-4a2c-b3b3-16050d5f80f2":(OI)F
icacls "c:\ClusterStorage\Volume1\SharedClusterDisk" /grant "NT VIRTUAL MACHINE\d3599536-222a-4d6e-bb10-a6019c3f2b9b":(OI)F
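To verify that the grants landed, you can list the access control entries for the virtual machine accounts on the folder. The snippet below is a small sketch that reuses the example folder from above. I am assuming the accounts resolve to their NT VIRTUAL MACHINE\<VM GUID> names on the Hyper-V host; if they don’t, you will see SIDs instead and need to adjust the filter.

```powershell
# Example folder from the commands above; replace it with your own path.
$Folder = "C:\ClusterStorage\Volume1\SharedClusterDisk"

# Show only the entries that belong to virtual machine accounts.
(Get-Acl -Path $Folder).Access |
    Where-Object { $_.IdentityReference -like "NT VIRTUAL MACHINE\*" } |
    Select-Object IdentityReference, FileSystemRights, InheritanceFlags
```

You should see one entry with FullControl per VM GUID that uses the folder.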

My little PowerShell script

The above is tedious manual labor with a lot of copy pasting. It is time consuming at best, and with larger guest clusters the probability of mistakes increases. To fix this we wrote a PowerShell script to handle it for us.

#Didier Van Hoye
#Twitter: @WorkingHardInIT
#Blog: https://blog.Workinghardinit.work
#Correct shared VHD Set disk permissions for all nodes in a guest cluster

$GuestCluster = "DemoGuestCluster"
$HostCluster = "LAB-CLUSTER"

$PathToGuestClusterSharedDisks = "C:\ClusterStorage\NTFS-03\GuestClustersSharedDisks"

#The guest cluster node names are used as the VM names on the host cluster.
$GuestClusterNodes = Get-ClusterNode -Cluster $GuestCluster

ForEach ($GuestClusterNode in $GuestClusterNodes)
{
    #Passing the cluster name to -ComputerName only works in W2K16 and up.
    #As this is about VHDS you need to be running 2016, so no worries here.
    $GuestClusterNodeGuid = (Get-VM -Name $GuestClusterNode.Name -ComputerName $HostCluster).Id

    Write-Host $GuestClusterNodeGuid "belongs to" $GuestClusterNode.Name

    #Build the icacls arguments: "<folder>" /grant "NT VIRTUAL MACHINE\<VM GUID>":(OI)F
    $IcaclsExecute = """$PathToGuestClusterSharedDisks""" + " /grant " + """NT VIRTUAL MACHINE\" + $GuestClusterNodeGuid + """:(OI)F"
    Write-Host "Executing icacls" $IcaclsExecute
    CMD.EXE /C "icacls $IcaclsExecute"
}

The script provides some feedback on what is happening: for each guest cluster node it prints the VM GUID it found and the icacls command it executes.

PowerShell for the win. This saves you some searching and typing and potentially making some mistakes along the way. Have fun. More testing is underway to make sure things are now predictable and stable. We’ll share our findings with you.

The lure of having a Ransomware Fund

Introduction

What is the lure of having a ransomware fund all about? It’s the idea that just paying is the best way to deal with a ransomware incident. While preventing as many ransomware attacks as possible is great, prevention will never be 100% effective. Detecting an incident as early as possible is key to minimizing the effects. But even with successful and early detection, some data will have been compromised (encrypted). The nature and function of that data will determine the blast radius and the fallout. To recover, the attack needs to be stopped by finding and eliminating the points of infection. Next to that, the proven ability to restore data, and to do so fast, is a key capability when it comes to recovering from a ransomware attack. If you don’t have it, you’ll either need to eat the loss or try to pay up.

Dealing with Ransomware step by step

  • Prevention is not 100% effective. Don’t bank on it.
  • Early detection
  • Swift & adequate response
  • Quarantine, wipe (nuke from orbit) of contaminated systems & data
  • See if a free decryption solution is available via the security community or your police services cyber crime department
  • Restore your data. You must have multiple options. You must have implemented the 3-2-1 rule. But beware, your off site, air gapped copy cannot be too old. You need to have fairly recent backups in there to have a decent RPO that is meaningful to the business.
  • Bring data, systems and services back into production.

Now make sure you can do this for end user files and server data (images, VMs, databases, configuration files, backups), regardless of where it lives (on-premises, private, hybrid & public cloud) and what delivery model it comes in (physical, virtual, IaaS, PaaS, SaaS, serverless).

The lure of having a Ransomware Fund (Isn’t it cheaper to pay?)

Now some bean counter might come up with the idea that paying is cheaper (and easier) than prevention, let alone backup & restore capabilities.

Some would even consider it a “cost of doing business”. This is the lure of having a ransomware fund. Ouch. I know many parts of the world are a lot less safe than mine, but this is a slope so slippery and dangerous that you will fall down it sooner or later. Let’s look at why that is.

First, let’s not forget about the downtime caused, no matter how you resolve it. So prevention and early detection are key. You might not even survive if you pay and get your data back.

Secondly, while I love the idea of prevention and early detection, this doesn’t mean that you can get rid of your backup and restore capabilities. Prevention is a mitigation strategy; it doesn’t eradicate the issue. Early detection minimizes the immediate and secondary damage in many cases. But not in all cases, and it is also not perfect.

Third, when you pay your ransom, how sure are you that you’ll get your decryption key and be able to access your data? Well, it seems that happens in only 50% of the cases. Now, some ransomware “businesses” have better customer service than many commercial companies and governments. But that doesn’t mean all of them do, and by definition they are not honest people. Unless you consider ransomware “Encryption As A Service” that helps you with GDPR. I think not. You might think that a smart ransomware player delivers so as not to ruin future revenue streams by acquiring a bad reputation. Probably true, but they too can make mistakes, you can make mistakes, and you can become the roadkill of vandals or of criminals who desire, or are hired, to wreak havoc on a certain industry.

Finally, you might end up being a repeat victim as you have shown the willingness & ability to pay. Don’t forget that ransomware is not like mobster protection money. It will not protect you from others or the same ones doing it again.

Conclusion

Banking on having an emergency stash of Bitcoin (ransomware fund) just to pay ransomware isn’t your best option. It might be a last resort faced with the alternative of bankruptcy but even then it remains a costly and risky gamble.

I know that for some people in IT, backups seem outdated and from a bygone era, a solution to a problem from yesterday. I kid you not. Well, I advise you to think again and act upon what you conclude.