When using file shares as backup targets you should leverage continuous available SMB 3 file shares

Introduction

When using file shares as backup targets you should leverage Continuous Available SMB 3 file shares. For now, at least. A while back Anton Gostev wrote a very interesting piece in his “The Word from Gostev”. It was about an issue that they saw with people using SMB 3 files shares as backup targets with Veeam Backup & Replication. To some it was a reason to cry wolf. But it’s a probably too little-known issue that can and a such might (will) occur. You need to be aware of it to make good decisions and give good advice.

I’m the business of building rock solid solutions that are highly available to continuous available. This means I’m always looking into the benefits and drawbacks of design choices. By that I mean I study, test and verify them as well. I don’t do “Paper Proof of Concepts”. Those are just border line fraud.

So, what’s going on and what can you do to mitigate the risk or avoid it all together?

Setting the scenario

Your backup software (in our case Veeam Backup & Recovery) running on Windows leverages an SMB 3 file share as a backup target. This could be a Windows Server file share but it doesn’t have to be. It could be a 3rd party appliance or storage array.

When using file shares as backup targets you should leverage Continuous Available SMB 3 file shares.

The SMB client

The client is the SMB 3 Client Microsoft delivers in the OS (version depends on the OS version). But this client is under control of Microsoft. Let’s face it the source in these scenarios is a Hyper-V host/cluster or a Windows SMB 3 Windows File share, clustered or not.

The SMB server

In regards to the target, i.e. the SMB Server you have a couple of possibilities. Microsoft or 3rd party.

If it’s a third-party SMB 3 implementation on Linux or an appliance. You might not even know what is used under the hood as an OS and 3rd party SMB 3 solution. It could be a storage vendors native SMB 3 implementation on their storage array or simple commodity NAS who bought a 3rd party solution to leverage. It might be high available or in many (most?) cases it is not. It’s hard to know if the 3rd party implements / leverages the full capabilities of the SMB 3 stack as Microsoft does or not. You light not know of there are any bugs in there or not.

You get the picture. If you bank on appliances, find out and test it (trust but verify). But let’s assume its capabilities are on par with what Windows offers and that means the subject being discussed goes for both 3rd party offerings and Windows Server.

When the target is Windows Server we are talking about SMB 3 File Shares that are either Continuous Available or not. For backup targets General Purpose File Shares will do. You could even opt to leverage SOFS (S2D for example). In this case you know what’s implemented in what version and you get bug fixes from MSFT.

When you have continuously available (CA) SMB 3 shares you should be able to sleep sound. SMB 3 has you covered. The risks we are discussing is related to non-CA SMB 3 file shares.

What could go wrong?

Let’s walk through this. When your backup software writes to an SMB 3 share it leverages the SMB 3 client & server in the SMB 3 stack. Unlike when Veeam uses its own data mover, all the cool data persistence stuff is handled by Windows transparently. The backup software literally hands of the job to Windows. Which is why you can also leverage SMB Multichannel and SMB direct with your backups if you so desire. Read Veeam Backup & Replication leverages SMB Multichannel and Veeam Backup & Replication Preferred Subnet & SMB Multichannel for more on this.

If you are writing to a non-CA SMB 3 share your backup software receives the messages the data has been written. Which actually means that the data is cached in the SMB Clients “queue” of data to write but which might not have been written to the storage yet.

For short interruptions this is survivable and for Office and the like this works well and delivers fast performance. If the connection is interrupted or the share is unavailable the queue keeps the data in memory for a while. So, if the connection restores the data can be written. The SMB 3 Client is smart.

However, this has its limits. The data cache in the queue doesn’t exist eternally. If the connectivity loss or file share availability take too long the data in the SMB 3 client cache is lost. But it was not written to storage! To add a little insult to injury the SBM client send back “we’re good” even when the share has been unreachable for a while.

For backups this isn’t optimal. Actually, the alarm bell should start ringing when it is about backups. Your backup software got a message the data has been written and doesn’t know any better. But is not on the backup target. This means the backup software will run into issues with corrupted backups sooner or later (next backup, restores, synthetic full backups, merges, whatever comes first).

Why did they make it this way?

This is OK default behavior. it works just fine for Office files / most knowledge worker client software that have temp files, auto recovery, and all such lovely capabilities and work is mostly individual and interactive. Those applications are resilient to this by nature. Mind you, all my SMB 3 file share deployments are clustered and highly available where appropriate. By “appropriate” I mean when we don’t have off line caching for those shares as a requirement as those too don’t mix well (https://blogs.technet.microsoft.com/filecab/2016/03/15/offline-files-and-continuous-availability-the-monstrous-union-you-should-not-consecrate/). But when you know what your doing it rocks. I can actually failover my file server roles all day long for patching, maintenance & fun when the clients do talk SMB 3. Oh, and it was a joy to move that data to new SANs under the hood. More on that perhaps in another post. But I digress.

You need adequate storage in all uses cases

This is a no brainer. Nothing will save you if the target storage isn’t up to the task. Not the Veeam data move or SMB3 shares with continuous availability. Let’s be very clear about this. Even at the cost-effective side of the equation the storage has to be of sufficient decent quality to prevent data loss. That means decent controllers with battery cached IO as safe guard etc. Whether that’s a SAN or a “simple” raid controller or pass through HBA’s for storage spaces, doesn’t matter. You have to have it. Putting your data on SATA drives without any save guard is sure way of risking data loss. That’s as simple as it gets. You don’t do that, unless you don’t care. And if you care, you would not be reading this!

Can this be fixed?

Well as a non-SMB 3 developer I would say we need an option added that the SMB 3 client can be configured to not report success until that data has been effectively written on the target, or at least has landed somewhere on quality, cache protected storage.

This option does not exist today. I do not work for Microsoft but I know some people there and I’m pretty sure they want to fix it. I’m just not sure how big of a priority it is at the moment. For me it’s important that when a backup application goes to a non-continuous available file share it can request that it will not cache and the SMB Server says “OK” got it, I will behave accordingly. Now the details in the implementation will be different but you get the message?

I would like to make the case that it should be a configurable option. It is not needed for all scenarios and it might (will) have an impact on performance. How big that would be I have no clue. I’m just a blogger who does IT as a job. I’m not a principal PM at Microsoft or so.

If you absolutely want to make sure, use clustered continuous available file shares. Works like a charm. Read this blog Continuous available general purpose file shares & ReFSv3 provide high available backup targets, there is even one of my not so professional videos show casing this.

It’s also important not to panic. Most of you might even never has heard or experienced this. But depending on the use case and the quality of the network and processes you might. In a backup scenario this is not something that makes for a happy day.

The cry wolf crowd

I’ll be blunt. WARNING. Take a hike if you have a smug “Windoze sucks” attitude. If you want to deal dope you shouldn’t be smoking too much of your own stuff, but primarily know it inside out. NFS in all its varied implementations has potential issues as well. So, I’d also do my due diligence with any solution you recommend. Trust but verify, remember?! Actually, an example of one such an issue was given for an appliance with NFS by Veeam. Guess what, every one has issues. Choose your poison, drink it and let other chose theirs. Condescending remarks just make you look bad every time. And guess what that impression tends to last. Now on the positive side, I hear that caching can be disabled on modern NFS client implementations. So, the potential issue is known and is is being addressed there as well.

Conclusion

Don’t panic. I just discussed a potential issue than can occur and that you should be aware off when deciding on a backup target. If you have rock solid networking and great server management processes you can go far without issues, but that’s not 100 % fail proof. As I’m in the business of building the best possible solutions it’s something you need to be aware off.

But know that they can occur, when and why so you can manage the risk optimally. Making Windows Server SMB 3 file shares Continuously Available will protect against this effectively. It does require failover clustering. But at least now you know why I say that when using file shares as backup targets you should leverage continuous available SMB 3 file shares

When you buy appliances or 3rd party SMB 3 solutions, this issue also exists but be extra diligent even with highly available shares. Make sure it works as it should!

I hope Microsoft resolves this issue as soon as possible. I’m sure they want to. They want their products to be the best and fix any possible concerns you might have.

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups

Introduction

It’s not a secret that while guest clustering with VHDSets works very well. We’ve had some struggles in regards to host level backups however. Right now I leverage Veeam Agent for Windows (VAW) to do in guest backups. The most recent versions of VAW support Windows Failover Clustering. I’d love to leverage host level backups but I was struggling to make this reliable for quite a while. As it turned out recently there are some virtual machine permission issues involved we need to fix. Both Microsoft and Veeam have published guidance on this in a KB article. We automated correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup

The KB articles

Early August Microsoft published KB article with all the tips when thins fail Errors when backing up VMs that belong to a guest cluster in Windows. Veeam also recapitulated on the needed conditions and setting to leverage guest clustering and performing host level backups. The Veeam article is Backing up Hyper-V guest cluster based on VHD set. Read these articles carefully and make sure all you need to do has been done.

For some reason another prerequisite is not mentioned in these articles. It is however discussed in ConfigStoreRootPath cluster parameter is not defined and here https://docs.microsoft.com/en-us/powershell/module/hyper-v/set-vmhostcluster?view=win10-ps You will need to set this to make proper Hyper-V collections needed for recovery checkpoints on VHD Sets. It is a very unknown setting with very little documentation.

But the big news here is fixing a permissions related issue!

The latest addition in the list of attention points is a permission issue. These permissions are not correct by default for the guest cluster VMs shared files. This leads to the hard to pin point error.

Error Event 19100 Hyper-V-VMMS 19100 ‘BackupVM’ background disk merge failed to complete: General access denied error (0x80070005). To fix this issue, the folder that holds the VHDS files and their snapshot files must be modified to give the VMMS process additional permissions. To do this, follow these steps for correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup.

Determine the GUIDS of all VMs that use the folder. To do this, start PowerShell as administrator, and then run the following command:

get-vm | fl name, id
Output example:
Name : BackupVM
Id : d3599536-222a-4d6e-bb10-a6019c3f2b9b

Name : BackupVM2
Id : a0af7903-94b4-4a2c-b3b3-16050d5f80f

For each VM GUID, assign the VMMS process full control by running the following command:
icacls <Folder with VHDS> /grant “NT VIRTUAL MACHINE\<VM GUID>”:(OI)F

Example:
icacls “c:\ClusterStorage\Volume1\SharedClusterDisk” /grant “NT VIRTUAL MACHINE\a0af7903-94b4-4a2c-b3b3-16050d5f80f2”:(OI)F
icacls “c:\ClusterStorage\Volume1\SharedClusterDisk” /grant “NT VIRTUAL MACHINE\d3599536-222a-4d6e-bb10-a6019c3f2b9b”:(OI)F

My little PowerShell script

As the above is tedious manual labor with a lot of copy pasting. This is time consuming and tedious at best. With larger guest clusters the probability of mistakes increases. To fix this we write a PowerShell script to handle this for us.

Below is an example of the output of this script. It provides some feedback on what is happening.

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup

PowerShell for the win. This saves you some searching and typing and potentially making some mistakes along the way. Have fun. More testing is underway to make sure things are now predictable and stable. We’ll share our findings with you.

The lure of having a Ransomware Fund

Introduction

What is the the lure of having a ransomware fund all about? It’s the idea that just paying is the best way to deal with a ransomware incident.While preventing as many ransomware attacks as possible is great, it is not something that will be 100% effective. Detecting an incident as early as possible is key to minimizing the effects. This even in the event of successful and early detection some data has been compromised (encrypted). The nature and function of that data will determine the blast radius and the fall out. To recover from that the attack needs to be stopped by finding and eliminating the points of infection.Next to that, the proven ability to restore data and do so fast is a key capability when it comes to recovering form a ransomware attack. If you don’t you’ll either need to eat the loss or try to pay up.

Dealing with Ransomware step by step

  • Prevention is not 100% effective. Don’t bank on it.
  • Early detection
  • Swift & adequate response
  • Quarantine, wipe (nuke from orbit) of contaminated systems & data
  • See if a free decryption solution is available via the security community or your police services cyber crime department
  • Restore your data. You must have multiple options. You must have implemented the 3-2-1 rule. But beware, your off site, air gapped copy cannot be too old. You need to have fairly recent backups in there to have a decent RPO that is meaningful to the business.
  • Bring data, systems and services back into production.

Now make sure you can do this for end user files, server data (images, VMs, Databases, configuration files,  backups) regardless of where it is (on-premises, private, hybrid & public cloud) what delivery model it comes in (Physical, virtual, IAAS, PAAS, SAAS, Serverless).

The lure of having a Ransomware Fund (Isn’t it cheaper to pay?)

Now some bean counter might come up with the idea that paying is cheaper (and easier) than prevention, let alone backup & restore capabilities.

The lure of having a Ransomware Fund

Some would even consider it a “cost of doing business”. This is the the lure of having a ransomware Fund. Ouch, well I know many parts of the world are a lot less save than mine but this is a path down a slippery slope so dangerous you will fall down sooner or later. Let’s look at why that is.

petya ransomware

The lure of having a Ransomware Fund

First, let’s not forget about the down time caused no matter how you resolve it. So prevention and early detection are key. You might not even survive if you pay and get your data back.

Secondly, while I love the idea of prevention and early detection this doesn’t mean that you can get rid of your backup and restore capabilities. Prevention is an mitigation strategy, it doesn’t eradicate the issue. Early detection minimizes the immediate and secondary damage in many cases. But not in all cases and it is also not perfect.

Third, when you pay your ransom how sure are you you’ll get your decryption key and be able to access your data? Well it seems only in 50% of the cases. Now, some ransomware “businesses’’ have a better customer service than many commercial companies and governments. But that doesn’t mean all of them do and by definition they are not honest people. Unless you consider ransomware “Encryption As A Service” that helps you with GDPR. I think not. You might think that a smart ransomware player delivers not to ruin future revenue streams by acquiring a bad reputation. Probably true, but they to can make mistakes, you can make mistakes, you can become road kill of vandals or of criminals who desire or are hired to incur havoc on a certain industry.

Finally, you might end up being a repeat victim as you have shown the willingness & ability to pay. Don’t forget that ransomware is not like mobster protection money. It will not protect you from others or the same ones doing it again.

Conclusion

Banking on having an emergency stash of Bitcoin (ransomware fund) just to pay ransomware isn’t your best option. It might be a last resort faced with the alternative of bankruptcy but even then it remains a costly and risky gamble.

I know that for some people in IT, backups seem outdated and from a gone by era, a solution to a problem form yesterday. I kid you not. Well, I advise you to think again and act upon what you concluded.

 

ReFS Supported Deployment Scenarios Updated

Introduction

Some support statements for ReFS have been updated recently. These reflect well over a year of me, fellow MVPs and others testing and providing feedback to Microsoft. For all practical purposes I’m talking about ReFSv3, which was introduced with Windows Server 2016. Read up on this because that’s what I’m discussing here: Resilient File System (ReFS) overview

As many you know the ReFS supported storage deployment option has “fluctuated a bit. It was t limited ReFS to Storage Spaces and standalone disks only. That meant no RAID controllers, no FC or iSCSI LUNs via a SAN whether that was a high end one or and entry level one that you normally only use for backup purposes.

I was never really satisfied with the reasons why and I kept being a passionate advocate for a decent explanation as tying a files system with the capabilities and potential of ReFS to almost a single storage solution (S2D, and yes that’s a very good HCI offering) isn’t going to help proliferate the goodness of ReFS around the globe.

I was not alone and many others, amongst them fellow MVPs Anton Gostev (Senior Vice President, Product Management at Veaam and an industry heavy weight when it comes to credibility and technical skill), Cars ten Rachfahl and Jan Kappen (both at Rachfahl IT-Solutions) were arguing he case for broader ReFS support. Last week we go the news that the ReFS deployment documentation had been revised. Guest what? Progress! A big thank you to Andrew Hansen for taking the time to hear us plead or case, listen to our testing results and passionate feedback. He picked up the ball, ran with it and delivered! Let’s take a look.

ReFS Storage Deployment Options

Storage Spaces Direct

Deploying ReFS on Storage Spaces Direct is recommended for virtualized workloads or network-attached storage. This is well known and is used for a Hyper Converged Infrastructure and Converged (SOFS) solution (Hyper-V, IIS, SQL, User Profile Disks and even archival or backup targets). You can deploy it with simple, mirrored (2-way or 3-way), parity or Mirror accelerated parity volumes.

Storage Spaces

Storage Spaces supports local non-removable direct-attached via BusTypes SATA, SAS, NVME, or attached via HBA (aka RAID controller in pass-through mode). You can deploy it with simple, mirrored (2-way or 3-way) or parity volumes. Do note that this can be both non-shared as shared storage spaces (Shared SAS enclosures). This is the high available solution with storage spaces we have before Windows Server 2016 added S2D.

Basic disks

Deploying ReFS on basic disks is best suited for applications that implement their own software resiliency and availability solutions. Applications that introduce their own resiliency and availability software solutions can leverage integrity-streams, block-cloning, and the ability to scale and support large data sets. A poster child for this use case is and Exchange DAG.

Now it is important to note that basic disks with ReFS are supported with local non-removable direct-attached disks via BusTypes SATA, SAS, NVME, or RAID. So yes, you can have RAID 1, 5,6,10 and make the storage redundant. Now, be smart, ReFS is great but it is not magic. If your workload requires redundancy and high availability you should provide it. This is not different when you use NTFS. When you have shared PCI RAID controllers (which can be redundant like in a DELL VRTX) this can be uses as well to create high availability deployments with shared storage.

SAN Storage

You can also use ReFS with a SAN over FC or iSCSI, normally those are always configured with some form of storage redundancy. You can consume the ReFS SAN storage on stand alone, member or clustered serves for high availability. As long as you use that storage for supported use cases. For example, it is and remains not support to put knowledge worker data on SOFS shares, not matter what the underlying storage for ReFS or NTFS volumes is. For backups this can leveraged to build some very capable solutions.

What were the concerns that made ReFS Support so limited at a given point in time?

Well one of them was confusion and concerns around how data gets flushed and persisted with non-storage spaces and simple disks. A valid concern but one you have with any file system so any storage array or controller needs to handle this well. As it turns out any decent piece of storage hardware/controller that’s on the Microsoft Hardware Compatibility List and is certified does its job well enough to guarantee this happens correctly. So, any certified OEM SAN, both entry level ones to high end enterprise grade gear is supported. Just like any good (certified) raid controller. Those are backed with battery backed caches that can survive down time for days to many weeks. You just pick the one that fits your needs, use case and budget form the options you have. That can be S2D, a SAN, a raid controller, or even basic directly attached disks.

My take on things

Why do I like the new supported options? Well because I have been testing them for backup targets, both high available one as non- high available one. I can have the benefits of ReFS that can be leveraged by backup software (Veeam Backup & Replication 9.5 for example) and have better performance, data protection with more type of storage than S2D. I like to have options and choices when designing as solution.

It is important to note one thing when you do not use ReFS in combination with Storage Spaces (S2D, Shared storage Spaces or “stand alone” storage spaces) with any form of data redundancy (2-way or 3-way mirror, parity, mirror accelerate parity). You will not have the built-in capability to repair data corruption than can occur while data sits on disk (bit rot) by leveraging the redundant copies in storage Spaces. That only comes when ReFS is combined with redundant Storage Spaces. Not with Simple Storage Spaces or any other storage array, redundant or not. The combination of ReFS with Storage Spaces offers this capability and is one of its selling points.

Other than that, the above ReFS storage deployment options let you leverage the benefits ReFS has to offer and yes, for some use case that will be preferred over NTFS. But don’t think NTFS should now only be used for the OS and such. That’s not the case. It is and remains very much the dominant file system for Windows. It’s just that now we get to leverage the goodness of ReFS for suitable scenarios with a lot more storage deployment options. This has a reason. For example, if you are going to do Hyper-V with a SAN the supported file system is NTFS, not ReFS. Mind you ReFS works but it’s not supported. I have tested this and while it works one of the concerns is the redirect IO traffic this incurs. With S2D the network fabric to deal with this is there by design: SMB Direct (RDMA) over 10Gbps or better. With a SAN that’s not necessarily so and as a result the network leveraged by CSV traffic might take a beating. The network traffic behavioral patterns are also different with ReFS versus NTFS on SAN based CSV than what you are used to with NFTS when it comes to owner and non-owner nodes. While I can make things work I must consider the benefits versus the risk of being unsupported. On a good SAN with ODX support that’s not worth the risk. Might this ever change? Maybe, but for now that’s it.

That said, when I design my ReFS LUNs and fabric well with a SAN and use them for a supported uses case like backup targets I am supported and I get to leverage the benefits of ReFS as it fits the use case very well (DPM, Veeam).

A side note on mirror accelerated parity

Mirror accelerated parity is only supported with S2D. That’s the only thing that, in regards to backup an archive targets that I want to keep testing (see Hyper-V Amigos Showcast Episode 12 – ReFS and Backup )and asking Microsoft to support at least on non-shared Storage spaces. I know shared storage spaces is being depreciated, no worries. That would make for some great, budget, archival and backup targets due to the fact you get bit rot protection due to the combination ReFS with redundant Storage Spaces. I even have some ideas on how to add tuning capabilities to the mirror / parity movement of data based on data age etc. I can dream right ?

Conclusion

To all the naysayers, the ones that bashed me when I discussed options for and the potential for ReFSv3 outside of S2D, take note, this is where we are today.

clip_image001

And I like it. I like the options ReFSv3 offers with variety of storage solutions to design and implement backup targets for many different needs and budgets. That’s what I like as I’m convinced that one size fits all solution are an illusion. Even at economies of scale and with commodity materials understanding the context in which to design and implement a solution matters, as it allows you to chose the proper methods for the given needs when you genuinely understand the challenge.

If you need help with this there are quite a number of highly skilled, experienced people with the right mindset to make help you maximize your ROI and TCO in an effective and efficient way. Many of these are MVPs and have their own business or work for IT firms where customers are not milked like cattle but really do provide high value services. Just reach out.