WEBINAR: Troubleshooting Microsoft Hyper-V – 4 Tales from the Trenches

I’ve teamed up with Altaro as a guest in one of their Hyper-V webinars to discuss troubleshooting Microsoft Hyper-V. I’ll be joining my fellow Microsoft Cloud and Datacenter Management MVP Andy Syrewicze for this in a virtual trans-Atlantic team.

We’ll give you some pointers on good practices and where to start when things go south. We’ll also add some real-world examples to the mix to spice things up. As you can imagine, these issues are often a lot more fun after the fact, once you’ve solved them. If you work in IT long enough, you know that one day (or night) trouble will come knocking and that the stress that comes with it can be gut-wrenching.

The good news is that it isn’t the end of the world. In a well-designed and well-managed environment you can minimize downtime, and you do not have to suffer unrecoverable losses as long as you’re well prepared. We hope this webinar will help the attendees prevent those problems in their environment. If not, it will provide you with some insight into how to prepare for and handle them.

The webinar is on February 25th, 2016 at 4pm CET / 10am EST!

The time has been chosen to make it feasible for as many people as possible across all time zones to attend. So go on, register. It’s free, and both Andy and I are looking forward to sharing our troubleshooting tales from the trenches.

Recover From Expanding VHD or VHDX Files On VMs With Checkpoints

So you’ve expanded the virtual disk (VHD/VHDX) of a virtual machine that has checkpoints (or snapshots, as they used to be called) on it. Did you forget about them? Did you really leave them lingering around for that long? That’s bad practice and not supported (we don’t have production checkpoints yet, that’s for Windows Server 2016). Anyway, your virtual machine won’t boot. Depending on the importance of that VM you might be chewed out big time or ridiculed. But what if you don’t have a restore that works? Suddenly it might have become a resume-generating event.

All does not have to be lost. There might be hope if you didn’t panic and make even more bad decisions. Please, if you’re unsure what to do, call an expert, a real one, or at least someone who knows real experts. It also helps if you have spare disk space, the fast sort if possible, and a Hyper-V node where you can work without risk. We’ll walk you through the scenarios for both a VHDX and a VHD.

How did you get into this pickle?

If you go to the Edit Virtual Hard Disk Wizard via the VM settings, it won’t allow you to expand the disk if the VM has checkpoints, whether the VM is online or not.

VHDs cannot be expanded online. If the VM had checkpoints, it must have been shut down when you expanded the VHD. If you went to the Edit Disk tool in Hyper-V Manager directly to open up the disk, you don’t get a warning. It’s treated as a virtual disk that’s not in use. Same deal if you do it in PowerShell:

Resize-VHD -Path "C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD.vhd" -SizeBytes 15GB

That just works.
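Since neither the Edit Disk tool nor Resize-VHD warns you here, a small guard in your own scripts can help. The following is only a sketch, assuming the Hyper-V PowerShell module is available on the host; the variable names are mine, while the VM name and path are the lab examples from this post.

```powershell
# Sketch only: expand a virtual disk only when its VM has no checkpoints.
# Assumes the Hyper-V PowerShell module; the variable names are illustrative.
$vmName  = 'DidierTest06'   # lab VM from this post; replace with your own
$vhdPath = 'C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD.vhd'
$newSize = 15GB

# Get-VMSnapshot returns one object per checkpoint of the VM.
$checkpoints = @(Get-VMSnapshot -VMName $vmName)

if ($checkpoints.Count -gt 0) {
    Write-Warning "$vmName has $($checkpoints.Count) checkpoint(s). Merge them before resizing."
    $checkpoints | Select-Object VMName, Name, CreationTime
}
else {
    Resize-VHD -Path $vhdPath -SizeBytes $newSize
}
```

The check runs against the VM rather than the disk file, which is exactly the context the Edit Disk tool lacks when you open a disk directly.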

VHDXs can be expanded online if they’re attached to a vSCSI controller. But if the VM has checkpoints, it will not allow you to expand them.

So yes, you deliberately shut it down to be able to do it with the Edit Disk tool in Hyper-V Manager. I know, the warning message was not specific enough, but consider this: the Edit Disk tool, when launched directly, has no idea what the disk you’re opening is used for, only whether it’s online / locked.

Anyway the result is the same for the VM whether it was a VHD or a VHDX. An error when you start it up.

[Window Title]
Hyper-V Manager

[Main Instruction]
An error occurred while attempting to start the selected virtual machine(s).

[Content]
‘DidierTest06’ failed to start.

Synthetic SCSI Controller (Instance ID 92ABA591-75A7-47B3-A078-050E757B769A): Failed to Power on with Error ‘The chain of virtual hard disks is corrupted. There is a mismatch in the virtual sizes of the parent virtual hard disk and differencing disk.’.

Virtual disk ‘C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD_8DFF476F-7A41-4E4D-B41F-C639478E3537.avhd’ failed to open because a problem occurred when attempting to open a virtual disk in the differencing chain, ‘C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD.vhd’: ‘The size of the virtual hard disk is not valid.’.

You might want to delete the checkpoint, but the merge will only succeed for the virtual disks that have not been expanded. You actually don’t need to do this now. It’s better if you don’t, as it saves you some stress and extra work. You could remove the expanded virtual disks from the VM. It will boot, but in many cases the missing data on those disks is very bad news. At least you’ve proven the root cause of your problems.

If you inspect the AVHD/AVHDX file you’ll get an error that states:

The differencing virtual disk chain is broken. Please reconnect the child to the correct parent virtual hard disk.

However attempting to do so will fail in this case.

Failed to set new parent for the virtual disk.

The Hyper-V Virtual Machine Management service encountered an unexpected error: The chain of virtual hard disks is corrupted. There is a mismatch in the virtual sizes of the parent virtual hard disk and differencing disk. (0xC03A0017).

Is there a fix?

Let’s say you don’t have a backup (shame on you). So now what? Make copies of the VHDX/AVHDX or VHD/AVHD files and safeguard those. You can then work either on the copies or on the original files. I’ll just use the originals, as this blog post is already way too long. Note that some extra disk space and speed come in very handy now. You might even copy them off to a lab server. That takes more time, but at least you’re not working on a production host then.

Working on the original virtual disk files (VHD/AVHD and / or VHDX/AVHDX)

If you know the original size of the VHDX before you expanded it you can shrink it to exactly that. If you don’t there’s PowerShell to the rescue if you want to find out the minimum size.
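As a rough sketch of that lookup, Get-VHD reports both the current virtual size and the minimum size a disk can be shrunk to. The path below is a hypothetical placeholder, not a file from my lab, and MinimumSize may come back empty when it cannot be determined for a disk.

```powershell
# Sketch only: show the current and minimum virtual size of a VHDX.
# The path is a hypothetical placeholder; point it at your own disk.
$vhdxPath = 'D:\VMs\Example\Virtual Hard Disks\Example.vhdx'

Get-VHD -Path $vhdxPath |
    Select-Object Path, VhdFormat, VhdType,
        @{ Name = 'SizeGB';        Expression = { $_.Size / 1GB } },
        @{ Name = 'MinimumSizeGB'; Expression = { $_.MinimumSize / 1GB } }
```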

But even better, you can shrink it to its minimum size directly. It’s a parameter! (This works for VHDX files only; the VHD case is handled further down.)

Resize-VHD -Path "C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD.vhd" -ToMinimumSize

Now, you’re not home yet. If you restart the VM right now it will fail with the following error:

‘DidierTest06’ failed to start. (Virtual machine ID 7A54E4DB-7CCB-42A6-8917-50A05354634F)

‘DidierTest06’ Synthetic SCSI Controller (Instance ID 92ABA591-75A7-47B3-A078-050E757B769A): Failed to Power on with Error ‘The chain of virtual hard disks is corrupted. There is a mismatch in the identifiers of the parent virtual hard disk and differencing disk.’ (0xC03A000E). (Virtual machine ID 7A54E4DB-7CCB-42A6-8917-50A05354634F)

What you need to do is reconnect the AVHDX to its parent and choose to ignore the ID mismatch. You can do this via Edit Disk in Hyper-V Manager or in PowerShell. For more information on manually merging & repairing checkpoints, see my blogs on this subject here. In this post I’ll just show the screenshots as a walkthrough.

[Screenshots: reconnecting the AVHDX to its parent with Edit Disk in Hyper-V Manager]

Once that’s done your VHDX is good to go.
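For those who prefer the PowerShell route, the reconnect can be sketched with Set-VHD and its -IgnoreIdMismatch switch. Both paths below are hypothetical placeholders, so substitute your own parent VHDX and differencing AVHDX. Treat this as a sketch to try on copies first, not as the one true procedure.

```powershell
# Sketch only: reconnect a differencing disk to its (shrunk) parent.
# Both paths are hypothetical placeholders.
$parentPath = 'D:\VMs\Example\Virtual Hard Disks\Example.vhdx'
$childPath  = 'D:\VMs\Example\Virtual Hard Disks\Example_checkpoint.avhdx'

# -IgnoreIdMismatch accepts the parent even though its identifier changed.
Set-VHD -Path $childPath -ParentPath $parentPath -IgnoreIdMismatch

# Test-VHD returns $true when the differencing chain can be opened again.
Test-VHD -Path $childPath

# Confirm which parent the child now points at.
Get-VHD -Path $childPath | Select-Object Path, ParentPath, VhdType
```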

You can’t shrink a VHD with the inbox tools. There is, however, a free command line tool named VHDTool.exe that can do that. The original is hard to find on the web, so here is the installer if you need it. You only need the executable, which is actually portable, so don’t install this on a production server. It has a repair switch to deal with just this occurrence!

Here’s an example of my lab …

D:\SysAdmin>VhdTool.exe /repair "C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD.vhd" "C:\ClusterStorage\Volume2\DidierTest06\Virtual Hard Disks\RuinFixedVHD_8DFF476F-7A41-4E4D-B41F-C639478E3537.avhd"

That’s it for the VHD …

You’re back in business! All that’s left to do is get rid of the checkpoints. So you delete them. If you wanted to apply them and get rid of the delta, you could actually have just removed the disks, re-added the VHD/VHDX and been done with it. But in most of these scenarios you want to keep the delta, as you most probably didn’t even realize you still had checkpoints around. Zero data loss ;-)
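Deleting the checkpoints can also be done from PowerShell. A minimal sketch, assuming the Hyper-V module and using the lab VM name from this post:

```powershell
# Sketch only: delete all checkpoints of a VM so the deltas merge into the parents.
$vmName = 'DidierTest06'   # lab VM from this post; replace with your own

Get-VMSnapshot -VMName $vmName | Remove-VMSnapshot

# The merge runs in the background. Once it has finished, the VM's disks
# should point at .vhd/.vhdx files again rather than .avhd/.avhdx files.
Get-VMHardDiskDrive -VMName $vmName |
    Select-Object ControllerType, ControllerNumber, ControllerLocation, Path
```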

Conclusion

Save yourself the stress, hassle and possibly the expense of hiring an expert. How? Please do not expand a VHD or VHDX of a virtual machine that has checkpoints. It will cause boot issues with the expanded virtual disk or disks! You will be in a stressful, painful pickle that you might not get out of if you make the wrong decisions and choices!

As a closing note, you must have backups and restores that you have tested. Do not rely on your smarts and creativity or those of others, let alone luck. Luck runs out. Options run out. Even for the best and luckiest of us. Veeam has saved my proverbial behind a few times already.

The Hitch Hikers Guide to Hyper-V Administration: Don’t Panic

Not all the information you see or are presented with is valid. You need to check; that’s the prime reason we have the “trust but verify” mantra in IT. If you don’t, you might start troubleshooting a ghost issue. An example of this is GUI issues, such as when you leave the Hyper-V Manager GUI open for way too long and the cached information goes stale.

The screenshot below is what caused some diligent admins to start troubleshooting a nonexistent problem. They figured that the VMs were left in a locked state due to backups failing. But hey, all backups had run and succeeded?! So they searched and found KB article 2964439, “Hyper-V virtual machine backup leaves the VM in a locked state”. When they wanted to install the hotfix, it failed, stating it was not applicable to their system.

At that moment they considered killing the VMMS.exe process and/or failing over the nodes. While preparing for that they’d logged in to all nodes, only to see that the issue was not present there. That made ’em think and step back for a while.

image

In this case it’s just a quirk of a Hyper-V Manager that was left open way too long. Right-clicking the host and refreshing, or closing the GUI and reopening it, is all that’s needed to see the real information.
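Another way to “trust but verify” is to ask the hosts directly instead of relying on a long-lived GUI session. The following is a minimal sketch, assuming it runs on a cluster node with both the FailoverClusters and Hyper-V PowerShell modules available:

```powershell
# Sketch only: query the live VM state on every cluster node,
# bypassing whatever a stale Hyper-V Manager window is showing.
Get-ClusterNode | ForEach-Object {
    Get-VM -ComputerName $_.Name |
        Select-Object ComputerName, Name, State, Status
}
```

If the state reported here looks healthy while the GUI disagrees, refresh the GUI before you touch anything else.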

So slow down before you start troubleshooting and recovering from a “ghost” problem; doing so may cause real issues. The lesson here is that you should not go into “Action Jackson” mode. You can move swiftly and efficiently, but the ability to execute is not just about speed; it’s about doing what’s needed, when it’s needed. Here ends the lesson :-)

Get-ClusterLog Got Better In Windows Server 2016

When the going gets tough, the tough get going. But that doesn’t mean we don’t like an edge or won’t take advantage of tools and features that make our job easier.

In Windows Server 2016 Failover Clustering, Microsoft added some features to do just that when it comes to troubleshooting.

This is what Get-ClusterLog does for you: it writes the FailoverClustering/Diagnostics events to a cluster.log file on every member node of the cluster. Collecting them all from there is tedious, so they gave us the -Destination parameter to set a common target folder on the host where we run the command.

So unless you get paid by the hour, you’d normally run Get-ClusterLog with the -Destination parameter so the cluster logs from all cluster members are dumped into the destination folder for you. But in Windows Server 2016 they went the extra mile. More often than not, other event logs are asked for and needed as well. So a great improvement is that this command now also dumps the other relevant channels into the generated cluster.log files and separates them with a “header” of the form [=== LOG IN QUESTION ===].

We now find following logs included:
[=== Microsoft-Windows-ClusterAwareUpdating-Management/Admin logs ===]
[=== Microsoft-Windows-ClusterAwareUpdating/Admin logs ===]
[=== Microsoft-Windows-FailoverClustering/DiagnosticVerbose ===]
[=== System ===]

This saves a lot of time as more often than not those are asked for and needed to troubleshoot. Note the DiagnosticVerbose log. This is a permanent parallel event channel that logs the verbose information. This avoids the overhead of having to set the logging level of the normal Diagnostic log to verbose and trying to reproduce the issue. Pretty cool, the info is there and it doesn’t cause the standard logging to roll over faster as that logs at the default level.

We also get the cluster objects listed in the log now to help with diagnosing issues.

[=== Resources ===]
[=== Groups ===]
[=== Resource Types ===]
[=== Nodes ===]
[=== Networks ===]
[=== Network Interfaces ===]
[=== Volume ===]
[=== Volume Logs ===]
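Because every section starts with such a header, it’s easy to get a quick table of contents for a large cluster.log. The helper below is only a sketch: Get-ClusterLogSection is a name I made up, not a built-in cmdlet, and it assumes the headers start at the beginning of a line, as in the lists above.

```powershell
# Sketch only: list the "[=== ... ===]" section headers found in a cluster.log.
# Get-ClusterLogSection is a hypothetical helper, not part of any module.
function Get-ClusterLogSection {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # Match lines that begin with a header such as "[=== System ===]".
    Select-String -Path $Path -Pattern '^\[=== (.+?) ===\]' |
        ForEach-Object {
            [pscustomobject]@{
                Section    = $_.Matches[0].Groups[1].Value
                LineNumber = $_.LineNumber
            }
        }
}
```

You’d call it as Get-ClusterLogSection -Path followed by the path to one of the generated log files, and then jump straight to the line number of the section you’re interested in.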

Another improvement is that the log now indicates the offset against UTC, or you can specify the -UseLocalTime parameter to get the log in the time settings of the server. Both of these options can be handy when correlating events.
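Putting the parameters together, a run could look like the sketch below. The destination folder is a hypothetical example. -TimeSpan is not discussed above; it is an existing Get-ClusterLog parameter that limits the log to the last number of minutes, and I’ve added it here only to keep the files small.

```powershell
# Sketch only: collect the logs of all cluster nodes into one folder, in local time.
$destination = 'C:\Temp\ClusterLogs'   # hypothetical folder; pick your own
New-Item -Path $destination -ItemType Directory -Force | Out-Null

# -TimeSpan 60 restricts the output to the last 60 minutes.
Get-ClusterLog -Destination $destination -UseLocalTime -TimeSpan 60
```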

I’m happy with these efforts to make gathering the information needed to diagnose an issue easier and faster. It’s not about perfection but about making progress, and that’s what’s happening.