Replay Manager 7.8 and cluster OS rolling upgrade Tips

Compellent Replay manager 7.8  Windows Server 2016 Clusters in mixed mode or at cluster functional lever 8

Consider this a a quick publish about tips for when you combine Replay Manager 7.8, Compellent and Windows Server 2016. Many of you will be doing cluster operating system rolling upgrade of your Windows Server 2012 R2 clusters to Windows Server 2016. If you have done your homework and made sure your hardware is supported you can still run into a surprise. As long as your in mixed mode (Wi2K12R2 mixed with W2K16 nodes) or have not updated the cluster functional level to 9 (Windows Server 2016) you will have a few issues.

In Replay Manager 7.8  itself you’ll notice that the nodes of your cluster only see the CSV LUNs under local volumes that they are the owner of currently. Normally you’ll see all of the CSV LUNs of the (Hyper-V) cluster on all of the nodes of that cluster. So that’s not the expected behavior. This leads to failed  restore points when you run a snapshot from a host that is not the owner of the CSV etc.

image

On top of that when you try to run a backup job it will fail. The reason given is:

The requested volumes is not supported because it is not managed by the provider, is a dynamic volume, or it has some other incompatibility with the current operation.

The fix? Just update your upgrade cluster to cluster functional level  (level 9)

It’s as easy as that. The moment you upgrade your cluster functional level to 9 you will see all the CSV on the cluster on every node of that cluster you connect to. At that moment the replays will also work. That’s OK, you want to move swiftly trough the rolling upgrade and once you’re comfortable all drivers and firmware are working fine. You do not want to be in a the lower cluster version too long, but upgrade to benefit from the new capabilities in Windows Server 2016 Failover clustering. You do need to know this when you start your upgrades

image

Close your backups apps, restart the Replay manager service on the cluster nodes, refresh / reconnect to the backup apps, and voila. You’ll see the image you are use to in Replay Manager 7.8 (green text / arrows) and the backup jobs will work as well as any other backup product using the Compellent Replay Manager 7.8 hardware VSS provider.image

I hope this helps some of you out there. So yes Replay Manager 7.8 supports Windows Server 2016 Clusters with CSV LUNs but if you upgraded your cluster via cluster operating system rolling upgrade you need to have upgraded your cluster functional level! Until then, Replay Manager 7.8 isn’t going to work very well.

So there you go, that’s another reason to move through that process fast and smooth as you can.

Still missing in action for Hyper-V with Replay Manager 7.8

I’d really like for Replay Manager to be a bit more cluster friendly. No matter what node you are connected to they show you all CSV LUNs in the cluster. Since Replay manager 7.8 with Windows Server 2016 when you run a job manually you must start it when connected to the cluster node that owns the CSV or the job will fail with “No resources found on current cluster node for backup set”.

image

This was not the case with Windows Server 2012(R2) and earlier versions of Replay Manager. That did throw some benign errors in the event logs on the cluster node but it did work. I would love for DELLEMC to make sure the Replay Manager Client is smart enough to detect who owns the CSV and make sure it’ starts the job from that node. That would be a lot more user friendly. At the very least it should indicate which of the CSV LUNs you see are owned by the cluster node you are connected to.But when launching a backup job for a CSV that’s not owned by the node you are connect to the job quits/fails. They can detect the node they need, launch the job on that node and show it to you. That avoids having to go find out yourself what cluster node to connect to in Replay manager when you need to run a out of schedule job manually? The tech/logic is already there as the scheduled jobs get launched on the correct node.

It would also be great if they finally could get the logic built into Replay manager for the Hyper-V VM backups to know on what CSV and Hyper-V node the VM lives and deal with that. Sure it might cause more more snapshots to be made but that’s an invalid argument. When the VMs are on the same node,but different  CSV’s that’s already happening. Really on VM per job to avoid this isn’t a great answer.

Troubleshooting Veeam B&R Error code: ‘32768’. Failed to create VM recovery snapshot

I recently had to move a Windows Server 2016 VM over to another cluster (2012R2 to 2016 cluster)  and to do so I uses shared nothing live migration. After the VM was happily running on the new cluster I kicked of a Veeam backup job to get a first restore point for that VM. Better safe than sorry right?

image

But the job and the retries failed for that VM. The error details are:

Failed to create snapshot Compellent Replay Manager VSS Provider on repository01.domain.com (mode: Veeam application-aware processing) Details: Job failed (‘Checkpoint operation for ‘FailedVM’ failed. (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD). ‘FailedVM’ could not initiate a checkpoint operation: %%2147754996 (0x800423F4). (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD)’). Error code: ‘32768’.
Failed to create VM recovery snapshot, VM ID ‘3459c3068-9ed4-427b-aaef-32a329b953ad’.

Also when the job fails over to the native Windows VSS approach when the HW VSS provider fails it still does not work. At first that made me think of a bug that sued to exist in Windows Server 2016 Hyper-V where a storage live migration of any kind would break RCT and new full was needed to fix it. That bug has long since been fixed and no a new full backup did not solve anything here. Now there are various reasons why creating a checkpoint will not succeed so we need to dive in deeper. As always the event viewer is your friend. What do we see? 3 events during a backup and they are SQL Server related.
image

image

image

On top of that the SQLServerWriter  is in a non retryable error when checking with vssadmin list writers.

image

It’s very clear there is an issue with the SQL Server VSS Writer in this VM and that cause the checkpoint to fail. You can search for manual fixes but in the case of an otherwise functional SQL Server I chose to go for a repair install of SQL Server. The tooling for hat is pretty good and it’s probably the fastest way to resolve the issues and any underlying ones we might otherwise still encounter.

After running a successful repair install of SQL Server we get greeted by an all green result screen.

image

So now we check vssadmin list writers again to make sure they are all healthy if not restart the SQL s or other relevant service if possible. Sometime you can fix it by restarting a service, in that case reboot the server. We did not need to do that. We just ran a new retry in Veeam Backup & Replication and were successful.

There you go. The storage live migration before the backup of that VM made me think we were dealing with an early Windows Server 2016 Hyper-V bug but that was not the case. Trouble shooting is also about avoiding tunnel vision.

Hyper-V integration components 6.3.9600.18692

After the July 2017 round of patching we got a new version of the Hyper-V integration components on Windows Server 2012 R2. Yes, something that you no longer need to deal with manually since Windows Server 2016. But hey, my guess is that many of you are still taking care of Windows Server 2012 R2 Hyper-V deployments. I’m still taking care of a couple of Windows Server 2012 R2 Clusters, so don’t be shy now.

The newest version (at the time of writing) is 6.3.9600.18692 and 1st appeared in the June 27, 2017—KB4022720 (Preview of Monthly Rollup) update. It has since  been release in the July 11, 2017—KB4025336 (Monthly Rollup) update. You can follow up on the versions of the IC via this link Hyper-V Integration Services: List of Build Numbers

image

That means that you’ll need to upgrade the integration components for the VMs running on your Hyper-V (cluster) nodes after patching those.

image

And yes despite some issues we have seen with QA on updates in the past we still keep our environment very well up to date as when doing balanced risk management the benefits of a modern, well patched environment are very much there. Both for fixing bugs and mitigating security risks. Remember WannaCry ?

So my automation script has run against my Windows Server 2012 R2  Clusters. have you taken care of yours? I did adapt it to deal with the ever growing number of Windows Server 2016 VMs we see running, yes even on Windows Server 2012 R2 Hyper-V hosts.

image

You cannot connect multiple NICs to a single Hyper-V vSwitch without teaming on the host

Can you connect multiple NICs to a single Hyper-V vSwitch without teaming on the host

Recently I got a question on whether a Hyper-V virtual switch can be connected to multiple NICs without teaming. The answer is no. You cannot connect multiple NICs to a single Hyper-V vSwitch without teaming on the host.

This question makes sense as many people are interested in the ease of use and the great results of SMB Multichannel when it comes to aggregation and redundancy. But the answer lies in the name “SMB”. It’s only available for SMB traffic. Believe it or not but there is still a massive amount of network traffic that is not SMB and all that traffic has to pass through the Hyper-v vSwitch.

What can we do?

Which means that any redundant scenario that requires other traffic to be supported than SMB 3 will need to use a different solution than SMB Multichannel. Basically, this means using NIC teaming on a server. In the pre Windows Server 2012 era that meant 3rd party products. Since Windows Server 2012 it means native LBFO (switch independent, static or LACP). In Windows Server 2016 Switch Embedded Teaming (SET) was added to your choice op options. SET only supports switch independent teaming (for now?).

If redundancy on the vSwitch is not an option you can use multiple vSwitches connected to separate NIC and physical switches with Windows native LBFO inside the guests. That works but it’s a lot of extra work and overhead so you only do this when it makes sense. One such an example is SR-IOV which isn’t exposed on top of  a LBFO team.