Replay Manager 7.8 and cluster OS rolling upgrade Tips

Compellent Replay manager 7.8  Windows Server 2016 Clusters in mixed mode or at cluster functional lever 8

Consider this a a quick publish about tips for when you combine Replay Manager 7.8, Compellent and Windows Server 2016. Many of you will be doing cluster operating system rolling upgrade of your Windows Server 2012 R2 clusters to Windows Server 2016. If you have done your homework and made sure your hardware is supported you can still run into a surprise. As long as your in mixed mode (Wi2K12R2 mixed with W2K16 nodes) or have not updated the cluster functional level to 9 (Windows Server 2016) you will have a few issues.

In Replay Manager 7.8  itself you’ll notice that the nodes of your cluster only see the CSV LUNs under local volumes that they are the owner of currently. Normally you’ll see all of the CSV LUNs of the (Hyper-V) cluster on all of the nodes of that cluster. So that’s not the expected behavior. This leads to failed  restore points when you run a snapshot from a host that is not the owner of the CSV etc.

image

On top of that when you try to run a backup job it will fail. The reason given is:

The requested volumes is not supported because it is not managed by the provider, is a dynamic volume, or it has some other incompatibility with the current operation.

The fix? Just update your upgrade cluster to cluster functional level  (level 9)

It’s as easy as that. The moment you upgrade your cluster functional level to 9 you will see all the CSV on the cluster on every node of that cluster you connect to. At that moment the replays will also work. That’s OK, you want to move swiftly trough the rolling upgrade and once you’re comfortable all drivers and firmware are working fine. You do not want to be in a the lower cluster version too long, but upgrade to benefit from the new capabilities in Windows Server 2016 Failover clustering. You do need to know this when you start your upgrades

image

Close your backups apps, restart the Replay manager service on the cluster nodes, refresh / reconnect to the backup apps, and voila. You’ll see the image you are use to in Replay Manager 7.8 (green text / arrows) and the backup jobs will work as well as any other backup product using the Compellent Replay Manager 7.8 hardware VSS provider.image

I hope this helps some of you out there. So yes Replay Manager 7.8 supports Windows Server 2016 Clusters with CSV LUNs but if you upgraded your cluster via cluster operating system rolling upgrade you need to have upgraded your cluster functional level! Until then, Replay Manager 7.8 isn’t going to work very well.

So there you go, that’s another reason to move through that process fast and smooth as you can.

Still missing in action for Hyper-V with Replay Manager 7.8

I’d really like for Replay Manager to be a bit more cluster friendly. No matter what node you are connected to they show you all CSV LUNs in the cluster. Since Replay manager 7.8 with Windows Server 2016 when you run a job manually you must start it when connected to the cluster node that owns the CSV or the job will fail with “No resources found on current cluster node for backup set”.

image

This was not the case with Windows Server 2012(R2) and earlier versions of Replay Manager. That did throw some benign errors in the event logs on the cluster node but it did work. I would love for DELLEMC to make sure the Replay Manager Client is smart enough to detect who owns the CSV and make sure it’ starts the job from that node. That would be a lot more user friendly. At the very least it should indicate which of the CSV LUNs you see are owned by the cluster node you are connected to.But when launching a backup job for a CSV that’s not owned by the node you are connect to the job quits/fails. They can detect the node they need, launch the job on that node and show it to you. That avoids having to go find out yourself what cluster node to connect to in Replay manager when you need to run a out of schedule job manually? The tech/logic is already there as the scheduled jobs get launched on the correct node.

It would also be great if they finally could get the logic built into Replay manager for the Hyper-V VM backups to know on what CSV and Hyper-V node the VM lives and deal with that. Sure it might cause more more snapshots to be made but that’s an invalid argument. When the VMs are on the same node,but different  CSV’s that’s already happening. Really on VM per job to avoid this isn’t a great answer.

Microsoft Active Directory Replication Status Tool won’t upgrade

For getting a quick insight into the AD replication health of an environment the Microsoft Active Directory Replication Status Tool is a very handy instrument. The only annoyance is the expiration of the license that forces you to download a new one and upgrade. A bit of a convoluted way to update free software but hey it is handy and free.

image

And then again …

image

OK, I’ll download the new one. But the Microsoft Active Directory Replication Status Tool won’t upgrade. That’s because the currently installed version is newer than the one you just downloaded form the Microsoft site. That’s annoying, did they post the wrong version?

image

Let’s install the new version quickly in a VM. Now looking at the executable in the current install and the new one they are the same … so the license is the only thing causing an issue here; not a version difference actually.

Old version

image

New version

image

 

Let’s look at the license.xml file in C:\Program Files (x86)\Microsoft Active Directory Replication Status Tool\Licensing

image

The only difference between the old and the new installed is the license file.You can see it has the expiration dates in the future.

image

So the fix is easy, just uninstall the currently installed version of AD Replication Status tool wherever it is installed and reinstall the one you downloaded. It seems to be exactly the same version but that’s how you get it working again with a fresh license.xml file. Note that you cannot copy the license file between machine, the generated signature is wrong.

Hope this helps someone.

Troubleshooting Veeam B&R Error code: ‘32768’. Failed to create VM recovery snapshot

I recently had to move a Windows Server 2016 VM over to another cluster (2012R2 to 2016 cluster)  and to do so I uses shared nothing live migration. After the VM was happily running on the new cluster I kicked of a Veeam backup job to get a first restore point for that VM. Better safe than sorry right?

image

But the job and the retries failed for that VM. The error details are:

Failed to create snapshot Compellent Replay Manager VSS Provider on repository01.domain.com (mode: Veeam application-aware processing) Details: Job failed (‘Checkpoint operation for ‘FailedVM’ failed. (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD). ‘FailedVM’ could not initiate a checkpoint operation: %%2147754996 (0x800423F4). (Virtual machine ID 459C3068-9ED4-427B-AAEF-32A329B953AD)’). Error code: ‘32768’.
Failed to create VM recovery snapshot, VM ID ‘3459c3068-9ed4-427b-aaef-32a329b953ad’.

Also when the job fails over to the native Windows VSS approach when the HW VSS provider fails it still does not work. At first that made me think of a bug that sued to exist in Windows Server 2016 Hyper-V where a storage live migration of any kind would break RCT and new full was needed to fix it. That bug has long since been fixed and no a new full backup did not solve anything here. Now there are various reasons why creating a checkpoint will not succeed so we need to dive in deeper. As always the event viewer is your friend. What do we see? 3 events during a backup and they are SQL Server related.
image

image

image

On top of that the SQLServerWriter  is in a non retryable error when checking with vssadmin list writers.

image

It’s very clear there is an issue with the SQL Server VSS Writer in this VM and that cause the checkpoint to fail. You can search for manual fixes but in the case of an otherwise functional SQL Server I chose to go for a repair install of SQL Server. The tooling for hat is pretty good and it’s probably the fastest way to resolve the issues and any underlying ones we might otherwise still encounter.

After running a successful repair install of SQL Server we get greeted by an all green result screen.

image

So now we check vssadmin list writers again to make sure they are all healthy if not restart the SQL s or other relevant service if possible. Sometime you can fix it by restarting a service, in that case reboot the server. We did not need to do that. We just ran a new retry in Veeam Backup & Replication and were successful.

There you go. The storage live migration before the backup of that VM made me think we were dealing with an early Windows Server 2016 Hyper-V bug but that was not the case. Trouble shooting is also about avoiding tunnel vision.

Dell iDRAC 6 Remote Console Connection Failed

Dell iDRAC 6 Remote Console Connection Failed

I recently had the honor to fix a real annoying issue with the iDRAC on rather old DELL hardware, R710 servers that are stilling puling their weight. They have been upgraded to the latest firmware naturally and DELL allows access to those updates to anyone without the need for a support contract (happy users/customers).You can perfectly configure Java site exceptions and use Firefox or Chrome to connect to it (IE is different story, you can connect but the view is messed up). Anyway the browser isn’t the big issue. The  problem was that Dell iDRAC 6 remote console connection failed consistently at the very last moment with “Connection Failed”

image

image

Note: are you nuts?

Yes I like 25/50/100Gbps RDMA, S2D, All Flash etc. I do live the vanguard live on the bleeding edge, but part of that is funding solutions that fit the environment. In this case. They have multiple spare servers and extra disks on top the ones they use in the lab or even in production. So even when a server or a component fails they can use that to fix it. They have the hands on and savvy staff members to do that. No problem. This is not an organization driven by fear of risk and responsibility but by results and effective TCO/ROI. They know very well what they can handle and what not. On top of that they know very well what part of IT sectors sales and marketing promises/predictions are FUD and which are reality. This means they can make decisions based on optimizing for their needs delivering real results.

Leveraging old hardware does mean that sometimes you’ll  run into silly issues but annoying issues like older DRAC cards with modern client operating systems, browsers and recent Java versions.

Most tricks are to be found on line to get those to work together but sometimes even those fails. First of all make sure all network requirements are in order (ports, firewall etc) and on top of that:

  • Upgraded the DRAC Firmware to the latest v2.85
  • Add DRAC IP into the Java Exception List.
  • Change Java Network Setting from Browser to Direct Connect
  • Hack the Java config files
  • Disable Encrypted Video on the DRAC
  • Reset the DRAC
  • On top of this you can run and older version of the browser and Java but at a certain point this becomes a silly option. You see at a given moment the entire stack as moved ahead and one trick like running an old version of Java won’t do it anymore and keeping a VM around that’s at a 10 year old tech/version level is a pain.

The missing piece for me: generate & upload SHA256 certs

So let me share you what extra step got the remote console of the DELL R710 iDRAC to work with the most recent version of Java, Windows 10 and the latest of the greatest Firefox browser at the time of writing.

The trick that finally did it is to generate a CSR on the DRAC while you are connected to it. You see, many people never upload their own certs and if they did, it might have been many years ago. Those old SHA1 certs are frowned upon by modern browsers and Java.

image

image

Open the CSR file, copy the content and submit it to a PKI you have or a free one on line like at getacert.com. Just fill out some random info in the request and you’ll get a SHA256 cert for download immediately that “valid” a couple of months. Enough for testing or getting out of a pickle. Your own corporate CA will do better for long term needs.

image

On top of that you’ll need to reset the DRAC card and give it a few minutes.

image

Reconnect to the DRAC and after that, without failure, we could connect to the on all R710 servers where before we kept getting the dreaded “Connection Failed” error otherwise.

That’s it! Good luck.