Troubleshooting Veeam B&R Error code: ‘32768’. Failed to create VM recovery snapshot

I recently had to move a Windows Server 2016 VM over to another cluster (2012R2 to 2016 cluster)  and to do so I uses shared nothing live migration. After the VM was happily running on the new cluster I kicked of a Veeam backup job to get a first restore point for that VM. Better safe than sorry right?

image

But the job and the retries failed for that VM. The error details are:

Failed to create snapshot Compellent Replay Manager VSS Provider on repository01.domain.com (mode: Veeam application-aware processing) Details: Job failed (‘Checkpoint operation for ‘FailedVM’ failed. (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD). ‘FailedVM’ could not initiate a checkpoint operation: %%2147754996 (0x800423F4). (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD)’). Error code: ‘32768’.
Failed to create VM recovery snapshot, VM ID ‘3459c3068-9ed4-427b-aaef-32a329b953ad’.

Also when the job fails over to the native Windows VSS approach when the HW VSS provider fails it still does not work. At first that made me think of a bug that sued to exist in Windows Server 2016 Hyper-V where a storage live migration of any kind would break RCT and new full was needed to fix it. That bug has long since been fixed and no a new full backup did not solve anything here. Now there are various reasons why creating a checkpoint will not succeed so we need to dive in deeper. As always the event viewer is your friend. What do we see? 3 events during a backup and they are SQL Server related.
image

image

image

On top of that the SQLServerWriter  is in a non retryable error when checking with vssadmin list writers.

image

It’s very clear there is an issue with the SQL Server VSS Writer in this VM and that cause the checkpoint to fail. You can search for manual fixes but in the case of an otherwise functional SQL Server I chose to go for a repair install of SQL Server. The tooling for hat is pretty good and it’s probably the fastest way to resolve the issues and any underlying ones we might otherwise still encounter.

After running a successful repair install of SQL Server we get greeted by an all green result screen.

image

So now we check vssadmin list writers again to make sure they are all healthy if not restart the SQL s or other relevant service if possible. Sometime you can fix it by restarting a service, in that case reboot the server. We did not need to do that. We just ran a new retry in Veeam Backup & Replication and were successful.

There you go. The storage live migration before the backup of that VM made me think we were dealing with an early Windows Server 2016 Hyper-V bug but that was not the case. Trouble shooting is also about avoiding tunnel vision.

VeeamOn 2017 Points of Interest

Introduction

I’m back form attending, speaking, learning and sharing experiences and knowledge at VeeamON 2017 (and DELL EMC World before). It was a blast and I had the opportunity to engage in very interesting discussions with experts from around the globe.

image

As it was a Veeam event it wil be no surprise that we got some very interesting information about the new Veeam offerings now as well as in the near future. Points of particular interest to me are:

  • Veeam backup for file shares. Really this might solve my entire dubio around virtualizing very large capacity clustered files shares (100-200TB) I have to protect. I’m looking forward to testing and leveraging the various restore options like File share rollback. Handy when ransomware just struck.
  • I like what Veeam is doing for disaster recovery in Microsoft’s Azure public cloud. Veeam’s Direct Restore and new Power Network (PN) in order to facilitate and automate the disaster recovery process.
  • The Veeam agent that can protect Windows ad Linux based physical servers and endpoints, along with applications running in Microsoft Azure, AWS and other public clouds tied into Veeam Backup & Replication. We will also get support for failover clusters with this. Something I have been lobbying for!
  • They support native object storage support using Amazon S3, Amazon Glacier, Microsoft Azure Blob etc.
  • They announced improved and extended Office 365 protection including OneDrive for Business and SharePoint Online. One of those improvements is very handy with multiple tenants.
    Ramsomware did something very significant beyond reminding everyone of the importance of recoverable backups and that is reigniting the interest in tape as a backup medium. The inherent “air gap” that tape offers has become more interesting to many people as ransomware can also delete or encrypt backups. So the 3-2-1 rule has never been more important and is being extended by additional rules of thumb. The product to investigate for me is Starwind Virtual Tape Library (VTL). What I like is that I can have an air gapped backup integrated with Veeam in Amazon AWS. Even while my entire business might run in Azure, this separates my data protection technology and location form my production / development environment. Ideal for maximum isolation to protect us form both external and insider threats and risks while avoiding the need to deal with physical tapes. This is and remains a major concern for operational costs and RTO.

Conclusion

The new capabilities are very welcome to help solve the challenges we have now and the ones we see coming in the near future. We have plenty of ideas and plans to build the next generation of data protection and data availability solutions. Whatever the need, on-premises, IAAS, PAAS, SAAS, private/hybrid/public cloud, the need to protect data against loss and down time is there in one form or another. That is and remains a primary responsibility of any business regardless of the technology. As always, my fellow MVPs and Vanguards are ready, willing and able to get the job done.

Veeam Backup & Replication leverages SMB Multichannel

Introduction

Is it true that Veeam Backup & Replication leverages SMB Multichannel? That is a question that I was asked recently. The answer is yes, when you have a backup design and configuration that allows for this. If that’s the case it will even happen automatically when possible. That’s how SMB 3 works. That means it’s a good idea to pay attention to the network design so that you’re not surprised by the route your backup traffic flows. Mind you that this could be a good surprise, but you might want to plan for it.

I’ll share a quick lab setup where SMB 3 Multichannel kicks in. Please don’t consider this a reference guide for your backup architectural design but as a demo of how SMB multichannel can be leveraged to your advantage.

Proving Veeam Backup & Replication leverages SMB Multichannel

Here’s a figure of a quick lab setup I threw together.

clip_image002

There are a couple of significant things to note here when it comes to the automatic selection of the best possible network path.

SMB 3 Multichannel picks the best solution based on its logic. You can read more about that here. I’ve included the figure with the overview below.

clip_image003

The figure nicely show the capabilities of the NIC situation. To select the best possible network path SMB 3 uses the following logic:

1. RDMA capable NICs (rNICs) are preferred and chosen first. rNICs combine the highest throughput, the lowest latency and bring CPU offloading. on the processor when pushing through large amounts of data.

2. RSS capable NICs: NIcs with Receive Side Scaling (RSS) improve scalability by not being limited to core zero on the server. Configured correctly RSS offers the second-best capabilities.

3. The speed of the NICs is the 3rd evaluation criteria: a 10 Gbps NIC offers way more throughput than a 1 Gbps NIC.

Following this logic it is clear that Multichannel will select our 2 RDMA capable 10Gbps NICs over the management LBFO interface which does not support RDMA and while supporting RSS can only deliver 2Gbps throughput at best. That’s exactly what you see in the screenshot below.

image

Conclusion

So yes, Veeam Backup & Replication leverages SMB Multichannel! Please note that this did not require us to set SMB 3 Multichannel constraints or a preferred network for backups in Veeam Backup & Replication. It’s possible to do so when needed but ideally you design your solution to have no need for this and let automatic detection chose the best network path correctly. This is the case in our little lab setup. The backup traffic flows over 10.1.0.0/16 network even when our Veeam Backup & Replication VM, the Hyper-V host and the backup target have 10.10.0.0/16 as their management subnet. That’s the one they exist on the Active Directory domain they belong to for standard functionality. But as both the source and the target can be reached via 2*10Gbps RDMA capable NICs on the 10.1.0.0/16 subnet SMB3 will select those according to its selection criteria. No intervention needed.

SMB Direct Support

Now that we have shown that Veeam Backup & Replication backups in certain configurations can and will leverage SMB Multichannel to your benefit another question pops up. Can and does Veeam Backup & Replication leverage SMB Direct? The answer to that is also, yes. If SMB Direct is correctly configured on all the hosts and switches their networks paths in between it will. Multichannel is the mechanism used to detect SMB Direct capabilities, so if multichannel works and sees SMB Direct is possible it will leverage that. That’s why when SMB Direct or RDMA is enabled on your NICs it’s important that it is configured correctly throughout the entire network path used. Badly configured SMB Direct leads to very bad experiences.

Now think about that. High throughput, low latency and CPU offloading, minimizing the CPU impact on your Hyper-V hosts, SOFS nodes, S2D nodes and backup targets. Not bad at all, especially not since you’re probably already implementing SMB Direct in many of these deployments. It’s certainly something that could and should be considered when design solutions or optimizing existing ones.

More SMB3 and Windows Server 2016 Goodness

When you put your SMB3 file share continuously available on a Windows 2012 (R2) or Windows Server 2016 cluster (it doesn’t need to be on a CSV disk) you’ll gain high availability trough transparent failover with SMB3 and except for a short pause your backups will keep running even when the backup target node reboots or crashes after the File Server role has failed over. Now, start combing that with ReFSv3 in Windows Server 2016 and the Veeam Backup & Replication v9.5 support of this and you can see a lot of potential here to optimize many aspects of your backup design delivering effective and efficient solutions.

Things to investigate further

One question that pops up in my mind is what happens if we configure a preferred back-up network in Veeam Backup & Replication. Will this affect the operation of SMB multichannel at all? By that I means would enabling a preferred network in Veeam prevent multichannel from using more than one NIC?

I my opinion it should allow for multiple scenarios actually. When you have equally capable NICs that are on different subnets you might want to make sure it uses only one. After all, Veeam uses the subnet to configure a preferred path, or multiple subnets for that matter. Now multichannel will kick in with multiple equally capable NICs whether they are on the same subnet or not and if they are on the same subnet you might want them both to be leveraged even when setting a preferred path in Veeam. Remember that 1 IP / NIC is used to set up an SMB session and then it detects capabilities available, i.e. multiple paths, SMB Direct, RSS, speed, within 1 or across multiple subnets.

I’ll leave the combination of Veeam Backup & Replication and SMB multichannel for a future blog post.

Hyper-V Amigos Showcast Episode 12–ReFS v3.1 and Backup

In this Episode Carsten and I look at a single host deployment with Storage Spaces on Windows Server 2016. We create a “Hybrid” disk just like in Storage Spaces Direct by combining SSD & HDD in a storage Tier. We were very happy to discover that ReFSv3.1 does real time tiering.

image

We’re very excited about this because we want to leverage the benefits if Veeam Backup & Replication 9.5 brings by leveraging ReFSv3.1 (Block Cloning) in regards to backup transformation actions and Grandfather-Father-Son (GFS) spaces savings. To do so we’re looking at our options to get these benefits and capabilities leveraging affordable yet performant storage for our backup targets. S2D is one such option but might be cost prohibitive or overkill in certain environments.

ReFS v3.1 on non-clustered Windows Server 2016 hosts bring us integrity streaming, file corruption repair with instant recovery as protection against bit rot, the performance of tiered storage and SMB3 as a backup target at a great price point.

We encourage you to watch the video and see for yourself. As always, we had fun and hope your can learn something together with us, the Hyper-V Amigos Smile