Troubleshooting Veeam B&R Error code: ‘32768’. Failed to create VM recovery snapshot

I recently had to move a Windows Server 2016 VM over to another cluster (2012R2 to 2016 cluster), and to do so I used shared nothing live migration. After the VM was happily running on the new cluster I kicked off a Veeam backup job to get a first restore point for that VM. Better safe than sorry, right?

But the job and the retries failed for that VM. The error details are:

Failed to create snapshot Compellent Replay Manager VSS Provider on repository01.domain.com (mode: Veeam application-aware processing) Details: Job failed (‘Checkpoint operation for ‘FailedVM’ failed. (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD). ‘FailedVM’ could not initiate a checkpoint operation: %%2147754996 (0x800423F4). (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD)’). Error code: ‘32768’.
Failed to create VM recovery snapshot, VM ID ‘3459c3068-9ed4-427b-aaef-32a329b953ad’.
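
The number after %% in the checkpoint error is simply the decimal form of the HRESULT shown in parentheses. As a minimal sketch, the Python below converts it and looks it up in a small hand-picked table. The table is an assumption on my part: a subset of the VSS writer error constants as named in the Windows SDK header vsserror.h, not an exhaustive list, and decode_status is a hypothetical helper name.

```python
from typing import Tuple

# Hand-picked subset of VSS writer HRESULTs (names as in vsserror.h).
# Assumption: this is only for illustration and is far from complete.
VSS_WRITER_ERRORS = {
    0x800423F2: "VSS_E_WRITERERROR_TIMEOUT",
    0x800423F3: "VSS_E_WRITERERROR_RETRYABLE",
    0x800423F4: "VSS_E_WRITERERROR_NONRETRYABLE",
}


def decode_status(text: str) -> Tuple[str, str]:
    """Turn '%%2147754996' (or '2147754996') into (hex string, symbolic name)."""
    code = int(text.lstrip("%")) & 0xFFFFFFFF
    name = VSS_WRITER_ERRORS.get(code, "unknown (not in this small table)")
    return "0x{:08X}".format(code), name


if __name__ == "__main__":
    # The value taken from the Veeam job details above.
    print(decode_status("%%2147754996"))
```

For the value in this job log that points at a non-retryable VSS writer error, which is a hint to go look at the writers inside the VM rather than at the storage or the host.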

Even when the job fails over to the native Windows VSS approach after the HW VSS provider fails, it still does not work. At first that made me think of a bug that used to exist in Windows Server 2016 Hyper-V, where a storage live migration of any kind would break RCT and a new full backup was needed to fix it. That bug has long since been fixed and no, a new full backup did not solve anything here. There are various reasons why creating a checkpoint will not succeed, so we need to dive in deeper. As always, the event viewer is your friend. What do we see? Three events during a backup, and they are SQL Server related.

On top of that, the SQLServerWriter is in a non-retryable error state when checking with vssadmin list writers.
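
If you have to check many VMs, eyeballing that output gets old fast. Here is a rough Python sketch that parses the text vssadmin list writers prints and reports every writer that is not stable or has a last error. Assumptions: an English-language Windows (the labels "Writer name", "State" and "Last error" are localized), an elevated prompt inside the VM, and a trimmed, illustrative SAMPLE rather than the real output from this server. The function names are my own.

```python
import re
import subprocess
from typing import Dict, List

WRITER_RE = re.compile(r"^\s*Writer name:\s*'(?P<name>[^']+)'", re.IGNORECASE)
FIELD_RE = re.compile(r"^\s*(?P<key>State|Last error):\s*(?P<value>.+?)\s*$", re.IGNORECASE)

# Illustrative, trimmed sample (the Writer Id lines are left out).
SAMPLE = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   State: [8] Failed
   Last error: Non-retryable error
"""


def parse_writers(output: str) -> List[Dict[str, str]]:
    """Collect name, state and last error for each writer block."""
    writers: List[Dict[str, str]] = []
    for line in output.splitlines():
        name_match = WRITER_RE.match(line)
        if name_match:
            writers.append({"name": name_match.group("name"), "state": "", "last_error": ""})
            continue
        field_match = FIELD_RE.match(line)
        if field_match and writers:
            key = field_match.group("key").lower().replace(" ", "_")
            writers[-1][key] = field_match.group("value")
    return writers


def unhealthy(writers: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """A writer is healthy only when it is Stable and reports no error."""
    return [
        w for w in writers
        if "stable" not in w["state"].lower() or w["last_error"].lower() != "no error"
    ]


def list_writers() -> str:
    """Run vssadmin inside the VM; needs an elevated prompt."""
    result = subprocess.run(
        ["vssadmin", "list", "writers"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Swap SAMPLE for list_writers() when running inside the VM itself.
    for writer in unhealthy(parse_writers(SAMPLE)):
        print(writer["name"], "|", writer["state"], "|", writer["last_error"])
```

The same check is handy after a repair or a service restart, to confirm the writers are back to a stable state before retrying the backup job.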

It’s very clear there is an issue with the SQL Server VSS Writer in this VM and that causes the checkpoint to fail. You can search for manual fixes, but in the case of an otherwise functional SQL Server I chose to go for a repair install of SQL Server. The tooling for that is pretty good and it’s probably the fastest way to resolve the issues and any underlying ones we might otherwise still encounter.

After running a successful repair install of SQL Server we get greeted by an all green result screen.

So now we check vssadmin list writers again to make sure they are all healthy. If not, restart the SQL Server VSS Writer service or other relevant service if possible. Sometimes you cannot fix it by restarting a service; in that case reboot the server. We did not need to do that. We just ran a new retry in Veeam Backup & Replication and were successful.

There you go. The storage live migration before the backup of that VM made me think we were dealing with an early Windows Server 2016 Hyper-V bug, but that was not the case. Troubleshooting is also about avoiding tunnel vision.

Testing Compellent Replay Manager 7.8

So today I found the Replay Manager 7.8 bits to download.

I was awaiting this eagerly (see Off Host Backup Jobs with Veeam and Replay Manager 7.8), so naturally I started off my day by testing Compellent Replay Manager 7.8. I deployed it on a 2 node DELL PowerEdge cluster with FC access to a secondary DELL Compellent running SC 6.7.30 (you need to be on 6.7 or higher).

The first thing I noticed is the new icon.

That test cluster is running Windows Server 2016 Datacenter edition and is fully patched. The functionality is much the same as it was. There is one difference: if you manually launch a backup set of a local volume for a CSV and that CSV is not owned by the node on which you launch it, the backup is blocked.

This did not use to be the case. With scheduled backup sets this is not an issue: Replay Manager detects the owner of the CSV and uses that.

Just remember that when running a backup manually you need to launch it from the CSV owner node in Replay Manager and all is fine.
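
To find out which node that is without clicking through Failover Cluster Manager, something along these lines can help. It is a hedged sketch, not anything Replay Manager itself provides: it assumes the FailoverClusters PowerShell module is available on the node, it shells out to Get-ClusterSharedVolume and parses the JSON, and the CSV and node names in SAMPLE_JSON are made up for illustration.

```python
import json
import subprocess
from typing import Dict

# PowerShell one-liner: CSV name plus the name of the node that owns it.
PS_COMMAND = (
    "Get-ClusterSharedVolume | "
    "Select-Object Name, @{n='OwnerNode';e={$_.OwnerNode.Name}} | "
    "ConvertTo-Json"
)

# Hypothetical output for a two-CSV cluster (names are invented).
SAMPLE_JSON = """
[
  {"Name": "CSV-01", "OwnerNode": "NODE-A"},
  {"Name": "CSV-02", "OwnerNode": "NODE-B"}
]
"""


def csv_owners(json_text: str) -> Dict[str, str]:
    """Map each CSV name to its current owner node."""
    if not json_text.strip():
        return {}
    data = json.loads(json_text)
    if isinstance(data, dict):  # ConvertTo-Json emits a bare object for a single CSV
        data = [data]
    return {item["Name"]: item["OwnerNode"] for item in data}


def is_local_owner(owners: Dict[str, str], csv_name: str, local_node: str) -> bool:
    """Compare short host names case-insensitively."""
    owner = owners[csv_name].split(".")[0].lower()
    return owner == local_node.split(".")[0].lower()


def query_cluster() -> str:
    """Ask the cluster for the current CSV owners (run on a cluster node)."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Swap SAMPLE_JSON for query_cluster() on a real cluster node.
    owners = csv_owners(SAMPLE_JSON)
    for name, node in owners.items():
        print(name, "is owned by", node)
```

If is_local_owner returns False for the CSV you want to protect, either launch the manual backup from the owner node or move the CSV first.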

Other than that testing has been smooth and naturally we’ll be leveraging RM 7.8 with transportable snapshots with Veeam B&R 9.5 as well.

Things to note

Replay Manager 7.8 is not backward compatible with 7.7.1 or lower so you have to have the same version on your Replay Manager management server as on the hosts you want to protect. You also have to be running SC 6.7 or higher.

Wish list

I’d love to see Replay Manager become more intelligent and handle VM mobility better. The fact that VMs are tied to the node on which the backup set is created is really not compatible with the mobility of VMs (maintenance, dynamic optimization, CSV balancing, …). A little time and effort here would go a long way.

Second, Live Volumes has gotten a lot better but we still need to choose between Replay Manager snapshots & Live Volumes. In an ideal world that would not be the case and Replay Manager would have the ability to handle this dynamically. A big ask perhaps, but it would be swell.

I just keep giving the feedback as I’m convinced this is a great SAN for Hyper-V environments and they could beat anyone by making a few more improvements.

DELL EMC Ready Nodes and Storage Spaces Direct

Introduction

Unless you have been living under a rock you must have heard about Storage Spaces Direct (S2D) in Windows Server 2016, which went RTM in Q4 2016.

There is a lot of enthusiasm for S2D and we have seen, heard about, and assisted in early adopter situations. But that’s a bit of pioneering with OEM/MSFT approved components. So now bring in the DELL EMC Ready Nodes and Storage Spaces Direct.

DELL EMC Storage Spaces Direct Ready Nodes

So enter the DELL EMC Ready Nodes. These will be available this summer and should help less adventurous but interested customers get on the S2D bandwagon. These were announced at DELL EMC World 2017, and on May 30th they published some information on the TechCenter.

It offers a fully OEM supported S2D solution to customers that cannot or will not carry the engineering effort associated with a self-built solution.

I was sort of hoping these would leverage the PowerEdge R740xd from the start but they seem to have opted to begin with the DELL R730xd. I’m pretty sure the R740xd will follow soon as it’s a perfect fit for the use case, having 25Gbps support. In that respect I expect a refresh of the switches offered as well, as the S4048 is a great switch but keeps us at 10Gbps. If I was calling the shots I’d have that ready and done sooner rather than later, as the 25/50/100Gbps network era is upon us. There’s a reason I’ve been asking for 25Gbps capable switches with smaller port counts for SME.

Maybe this is an indication of where they think this offering will sell best. But I’d be considering future deployments when evaluating network gear purchases. These have a long service time. And when S2D proves itself I’m sure the size of the deployments will grow and with it the need for more bandwidth. Mind you, 10Gbps isn’t bad, even if for Hyper-V nodes I would be doing 2 * dual port Mellanox ConnectX-3 Pro cards.

Having mentioned them, I am very happy to see the Mellanox RoCE cards in there. That’s the best choice they could have made. The 1Gbps on board NICs are Intel, which matches my preference. The game is afoot!