Not all information you see or are presented with is valid. You need to check; that’s the prime reason we have the “trust but verify” mantra in IT. If you don’t, you might start troubleshooting a ghost issue. A good example is a GUI issue, such as when you leave the Hyper-V Manager GUI open for way too long and the information in its cache goes stale.
The screenshot below is what caused some diligent admins to start troubleshooting a nonexistent problem. They figured that the VMs were left in a locked state due to failed backups. But hey, all backups had run and succeeded?! So they searched and found KB article 2964439, “Hyper-V virtual machine backup leaves the VM in a locked state”. When they tried to install the hotfix, it failed, stating it was not applicable to their system.
At that moment they considered killing the VMMS.exe process and/or failing over the nodes. While preparing for that they logged in to all nodes, only to see that the issue was not present there. That made them think and step back for a while.
In this case it’s just a quirk of a Hyper-V Manager that was left open way too long. Right-clicking the host and choosing Refresh, or closing the GUI and reopening it, is all that’s needed to see the real information.
So slow down before you start troubleshooting and recovering from a “ghost” problem; doing so may cause real issues. The lesson here is that you should not go into “Action Jackson” mode. You can move swiftly and efficiently, but the ability to execute is not just about speed; it’s about doing what’s needed, when it’s needed. Here ends the lesson.
When the going gets tough, the tough get going. But that doesn’t mean we don’t like an edge or won’t take advantage of tools and features that make our job easier.
In Windows Server 2016 Failover Clustering, Microsoft added some features to do just that when it comes to troubleshooting.
This is what Get-ClusterLog does for you: it writes the FailoverClustering/Diagnostic events to a cluster.log file on every member node of the cluster. Collecting them all from there is tedious, so they gave us the -Destination parameter to set a common target folder on the host where we run the command.
So unless you get paid by the hour, you’d normally run Get-ClusterLog with the -Destination parameter so the cluster logs from all cluster members are dumped into the destination folder for you. But in Windows Server 2016 they went the extra mile. More often than not, other event logs are asked for and needed as well. So a great improvement here is that this command now dumps the other relevant channels into the generated cluster.log files and separates them with a header of the form [=== log in question ===].
We now find the following logs included:
[=== Microsoft-Windows-ClusterAwareUpdating-Management/Admin logs ===]
[=== Microsoft-Windows-ClusterAwareUpdating/Admin logs ===]
[=== Microsoft-Windows-FailoverClustering/DiagnosticVerbose ===]
[=== System ===]
This saves a lot of time as more often than not those are asked for and needed to troubleshoot. Note the DiagnosticVerbose log. This is a permanent parallel event channel that logs the verbose information. This avoids the overhead of having to set the logging level of the normal Diagnostic log to verbose and trying to reproduce the issue. Pretty cool, the info is there and it doesn’t cause the standard logging to roll over faster as that logs at the default level.
We also get the cluster objects listed in the log now to help with diagnosing issues.
[=== Resources ===]
[=== Groups ===]
[=== Resource Types ===]
[=== Nodes ===]
[=== Networks ===]
[=== Network Interfaces ===]
[=== Volume ===]
[=== Volume Logs ===]
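Since every section starts with a [=== name ===] header, it’s easy to slice a cluster.log into its parts when you only care about one of them. Here’s a minimal sketch in Python, assuming each header sits on a line of its own exactly as listed above. The function name and the "preamble" label are my own inventions for illustration, not anything the cluster log tooling provides.

```python
import re

# Matches a section header line such as "[=== System ===]".
HEADER = re.compile(r"^\[===\s*(.+?)\s*===\]\s*$")


def split_sections(lines):
    """Group log lines under the most recent [=== name ===] header."""
    sections = {}
    current = "preamble"  # hypothetical label for lines before the first header
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = HEADER.match(line)
        if match:
            # Start a new section; keep any lines already collected under that name.
            current = match.group(1)
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return sections


sample = [
    "[=== Nodes ===]",
    "node line",
    "[=== System ===]",
    "event a",
    "event b",
]
for name, body in split_sections(sample).items():
    print(name, len(body))
```

To use it on a real log you would pass in the lines of the file; check the file’s text encoding first, as that is not something this sketch handles.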
Another improvement is that the log now indicates the offset from UTC, or you can specify the -UseLocalTime parameter to get the log in the server’s local time. Both options can be handy when correlating events.
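When you correlate a local-time log with one written in UTC, the arithmetic is simply “local time minus the offset”. Below is a small Python sketch of that conversion. The timestamp layout (2016/05/18-14:30:00.250) is an assumption on my part, so adjust the format string to what your log actually shows, and the offset in minutes is something you read from the log and pass in yourself.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout; adjust to match your own log lines.
STAMP_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def to_utc(stamp, offset_minutes):
    """Convert a local-time stamp to UTC, given the local offset from UTC in minutes."""
    local = datetime.strptime(stamp, STAMP_FORMAT)
    local_zone = timezone(timedelta(minutes=offset_minutes))
    return local.replace(tzinfo=local_zone).astimezone(timezone.utc)


# A server running at UTC+2 (offset of 120 minutes).
print(to_utc("2016/05/18-14:30:00.250", 120).isoformat())
```

With everything on the same clock, lining up events from different nodes and other logs becomes a plain sort on the timestamp.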
I’m happy with these efforts to make gathering the information needed to diagnose an issue easier and faster. It’s not about perfection but about making progress, and that’s what’s happening.
Here are two little tips to solve some small hardware issues you might run into with a Compellent SAN. But first, you’re never on your own with CoPilot support. They are just one phone call away, so I suggest that if you see these two minor issues you give them a call. I speak from experience when I say that CoPilot rocks. They are really good and go the extra mile. Best storage support I have ever experienced.
- Always notify CoPilot, as they will see the alerts come in and will contact you for sure. Afterwards they’ll almost certainly do a quick health check for you. Even better, during the entire process they keep an eye on things to make sure your SAN is doing just fine. And if you’d like them to tackle this, they will send out an engineer, I’m sure.
- Note that we’re talking about the SC40 controllers & disk bays here. The newer genuine Dell hardware is better than the Supermicro ones.
The audible alert without any issues whatsoever
We kept getting an audible alert long after we had solved all issues on one of the SANs. The system had been checked a couple of times and everything was in perfect working order, except for that audible alarm that just didn’t want to quit. A low-priority issue, I know, but every time we walked into the data center we were going “oh oh” over a false alert. That’s not the kind of conditioning you want. Alerts should only be raised when needed, and then they do need to be acted upon!
Working on this with CoPilot support, we got rid of it by reseating the upper I/O module. You can do this on the fly without pulling out any SAS cables, thanks to the redundancy, as long as you do the modules one by one and the cabling is done right (CoPilot can verify that remotely for you if needed).
We got lucky with the first one. After the “Swap Clear” was requested, every warning condition was cleared and we got rid of the audible alert beep! CoPilot was on the line with us and made sure all paths were up and running so no bad things could happen. That’s what you have a copilot for.
Front panel display dimming out on a Compellent Disk Bay
We have multiple Compellent SANs and on one of those we had a disk bay with an info panel that didn’t light up anymore. A silly issue but an annoying one, as this panel also shows you the disk bay ID.
Do we really replace the disk bay to solve this one? As that light had come on and off a couple of times, it could just be a bad contact, so my colleague decided to take a look. First he removed the protective cover and then, using some short, curved screwdrivers, he took off the body part. The red arrow indicates the little latch that holds the small ribbon cable in place.
That latch was standing wide open. After locking it down, the info appeared on the panel again. The covers were screwed on again and voilà. Solved.
As this year comes to an end I’d like to draw your attention to Microsoft’s new Top Support Solutions blog on TechNet. It was created as part of their continuous efforts to keep the various technical communities informed about the most relevant answers to the top questions or issues experienced with their products. They identify these top issues by analyzing the questions in their forums and their other support channels.
So if you need to find answers for yourself or your customers, go take a look at the "Top Solutions Content" blog. Chances are you’ll find valuable information about the Microsoft top support solutions for several of their popular products in Server and Tools. It might save you and your clients or manager a lot of time, effort and money. It’s also a great resource to make your colleagues, community, user group or clients aware of.