Nowadays everyone seems to be heading for the hills to try to cash in on the new gold rush: data! You have all heard that data is the new gold. Some call it the new oil, if that is their favorite fantasy; that’s all fine. But there are drawbacks.
The cost of data and data waste
Like any resource you mine, it comes with associated costs. Costs that have to be covered by the value you derive from it. That value has to exceed the money spent to gather, process and consume it. That can be an expensive business.
On top of that you have to deal with “the waste” it creates as a byproduct. Waste can be toxic. Mining data tends to produce nuclear data waste, the bad kind where “safe levels” are hard to determine. In the rush to grab the gold, many forget a couple of important lessons from history. We should know by now that we need to act proactively to avoid waste. That is the cheapest option in the long run and mitigates many of the risks. We should also know that not all data is gold; some of it just glitters, but it isn’t valuable. Fool’s data, like fool’s gold, is essentially worthless no matter how much money you have spent. Even worse, it still produces a nuclear data waste asset that can get you into (legal) trouble.
Data storages and backups
How much data do you need to get to the gold, and at what cost? Storage capabilities as well as storage capacity grow fast, and costs seem under control for now. But will this last forever? And even if so, what’s the ratio of data gold versus raw data stored? Can we improve this ratio? Because even when things are cheap, why do it at all if it is not needed?
Protecting the data and the waste
And then there is the cost of protecting that data, as well as the governance around it. The sad reality with data is that once you have it, the probability that it will get you into trouble is real. Data, sooner or later, will get lost, misplaced, sold, hacked, leaked, … it’s almost guaranteed. Ask any real InfoSec professional (not the standard issue, policy-quoting security officers, those are just window dressing) and they will open your eyes to the reality of the risks. It’s very sobering.
The gold rush
As with any hype or gold rush, we can avoid costly mistakes by looking at history. Think about who benefited and who lost out. Think about why this happened and how. Can you see any parallels?
Many people are drawn to the data gold fields. Very few strike a gold vein.
There is a lot of money to be made selling the tools, supplies and gear to mine the data, process it, store and protect it.
Gathering raw data and processing it can be highly toxic.
Storing and protecting the gold is expensive and hard.
Let’s dive a bit deeper into these issues, what they mean and how they materialize. They all have one thing in common: the fear of missing out is one of the driving factors.
Gold diggers
The reality is that many people who now become “data scientists” are not all highly skilled mathematicians and experts at statistical analysis. It’s a new hype, just like OLAP tools and data mining were before. We now have BI, big data and data science. That’s where the gold can be found, so that’s where gold diggers flock. Some have the skills, abilities and luck to derive wealth from that. Most will just have a job digging.
There is a lot more data than there is science in the hype created around data scientists. Data scientists should be great at math and statistics. Those are fields of human endeavor that do not scale well. They are not even popular. Attaching “scientist” to something doesn’t make it a science. Be sure of the quality of your gold and make sure it is not fool’s gold. But the gold rush is on. There’s money to be made. In an era where science is viewed by many as “an opinion”, the urge to derive some credibility from adding “science” to any endeavor is a paradox, and it is on the rise. Clearly, this shows the value of real science, even when some only seem to like it when it suits their agenda.
But the field is exploding as companies want people working on all the raw data they collect. As one statistician stated: “I used to be a boring, underpaid geek with glasses; now that I’m a data scientist I’m cool, in demand and paid very well, even if the work is less scientific.” That’s the nature of the beast. Her employer got a real statistician, but as the mines require a lot more bodies, many will make do with less.
Merchants
The “sure” money is in the supply chain: storage, networking, compute … no matter where it is (cloud, fog or on-premises computing). There is money to be made with tools to process the data, protect it (backups) and secure it against unwanted prying eyes and theft. If you’re selling any of those, business is booming.
Everyone seems obsessed with collecting data. Luckily storage costs per GB are down and we can store ever more. We also need to protect more. But who ever deletes data? Who dares push the button? A lot of data is collected “just in case”. We might find gold in there later, and if we don’t have it, we can’t. The fear of missing out in action. That is great if you’re selling stuff. Data lakes, data ponds, storage blobs or tables, MongoDB or SQL PaaS, storage arrays, data processing technology and data protection. These can be products or services, it doesn’t matter; there is money to be made. And while you’re selling, you’re not asking the buyers if they really need it. You don’t question them; you praise their insights and help them protect their investment. Everyone is doing it, so must you. The copy/paste strategy in action.
Nuclear data waste
While the vast growth in data is spectacular, a lot of it is crap. But there is very little effort put into being selective. It’s too cheap right now to collect and store it. No one wants to say “we don’t need it” and be the one to blame if you don’t have it.
But in the age of data leaks, hackers, privacy concerns and ever more legislation around data protection, it’s worth making sure you don’t store data just because you can. Storing data holds inherent risks: the risk of losing it, corrupting it, deriving faulty information from it, leaking it, having it stolen or abused.
In the age of GDPR and many other rightful privacy and data protection concerns, collecting data should be treated like nuclear power. The value it brings is undeniable. But you don’t need vast amounts of nuclear fuel to deliver that value. You do need very good processes, fail-safes, regulation, capable people and technology.
We should start looking at data as nuclear fuel, and as such, after use and processing, part of it is left as toxic nuclear data waste. It’s a long-term toxic by-product of the process of deriving information from data. Minimize the collection, storage and processing of data to achieve your goals at minimum cost and risk. Luckily, we have a very good solution for toxic data waste: you can delete it and wipe it securely.
We have to stop thinking that more is better when much of it is junk. The overhead of caring for that junk is ridiculous. We might have to do so for nuclear waste out of need. But for data there are alternatives. Destroy it if you don’t need it. It’s the safest way to handle the legal and reputation risks related to it. That will take a conscious effort.
Efforts and costs
Critical thinking about collecting data is lacking. That is understandable. A lot of the money to be made in data mining is in providing the tools to collect, process, store and protect the data. Even with the many people who warn us of the security issues and legal responsibilities around it, it is often about selling services and products. For many, all this might turn out to be a lot like the other gold rushes: far more suppliers of tools got rich than actual finders of profitable gold mines. This means there is also a lot of pressure and incentive to feed the “data is the new gold” beast.
Where a SQL database or a data warehouse at least meant you had to put effort into collecting the data, the rise of unstructured data technologies means that way too often we don’t care and we’ll figure it out later. Imagine doing that with nuclear fuel! For now, the technical advances in storage and data technologies have allowed us to act without too much deliberation on the sanity of our choices. That might change; it might be wise to avoid the cold shower when it does and benefit from minimizing toxic data risks today.
Conclusion
Now, true data gold is very valuable, but make sure you can recognize it. Just going through the motions, buying the tools and copying “in the know” statements from the internet isn’t going to cut it. That is called pretending. Sure, it’s fun. It is also a very dangerous and costly mistake when things get real. At best you look like an idiot with money. Many salespeople will separate you from your money very efficiently.
The smarter organizations already have a data strategy that includes waste avoidance, reduction and management. Many don’t, unfortunately. For those, collecting data is the main goal, driven by the tyranny of action over strategy. You have to be seen acting and being in charge. The buzzwords have to be present and you have to come across as a “can do sir, yes sir” person. Well, that is what will kill you. The late Norman Schwarzkopf knew this all too well.
Take care of your weaknesses; figure them out before they hurt you and before they destroy your ability to exploit your strengths. That, people, is a strategy exercise. I can do that for you and it will cost you a lot of money. But remember, strategies are not products you can buy; they are not commodities, and as such buying them is a paradox. A strategy is what will give you the edge over your competitors. If you have others determine your strategy, your competitors will pay them more to find out. So, roll up your sleeves and put in the effort yourself. In the end, it’s all about common sense, and this is true for data mining, AI and BI as well.
I was tasked with troubleshooting a cluster where Cluster Aware Updating (CAU) failed because the nodes never succeeded in going into maintenance mode. None of the obvious or well-known issues and mistakes that might break live migrations were present. Looking at the cluster and testing live migration, not a single VM on any node would live migrate to any other node. So, I took a peek at the event id and description, and it hit me: I have seen this particular event id before.
Log Name: System
Source: Microsoft-Windows-Hyper-V-High-Availability
Date: 9/27/2018 15:36:44
Event ID: 21502
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: NODE-B.datawisetech.corp
Description: Live migration of ‘Virtual Machine ADFS1’ failed. Virtual machine migration operation for ‘ADFS1’ failed at migration source ‘NODE-B’. (Virtual machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7) Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).

The live migration fails due to a non-existent SharedStoragePath or ConfigStoreRootPath, which is where the collections metadata lives.
More errors are logged
There usually are more related tell-tale events. They are, however, clear in pinpointing the root cause.
On the destination host
On the destination host you’ll find event id 21066:
Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin
Source: Microsoft-Windows-Hyper-V-VMMS
Date: 9/27/2018 15:36:45
Event ID: 21066
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: NODE-A.datawisetech.corp
Description: Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).
A bunch of 1106 events, one per failed live migration per VM, like the one below:
Log Name: Microsoft-Windows-Hyper-V-VMMS-Operational
Source: Microsoft-Windows-Hyper-V-VMMS
Date: 9/27/2018 15:36:45
Event ID: 1106
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: NODE-A.datawisetech.corp
Description: vm\service\migration\vmmsvmmigrationdestinationtask.cpp(5617)\vmms.exe!00007FF77D2171A4: (caller: 00007FF77D214A5D) Exception(998) tid(1fa0) 80070002 The system cannot find the file specified.
As well as event id 21111:

Log Name: Microsoft-Windows-Hyper-V-High-Availability-Admin
Source: Microsoft-Windows-Hyper-V-High-Availability
Date: 9/27/2018 15:36:44
Event ID: 21111
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: NODE-B.datawisetech.corp
Description: Live migration of ‘Virtual Machine ADFS1’ failed.
… event id 21066:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin
Source: Microsoft-Windows-Hyper-V-VMMS
Date: 9/27/2018 15:36:44
Event ID: 21066
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: NODE-B.datawisetech.corp
Description: Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).
… and event id 21024:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin
Source: Microsoft-Windows-Hyper-V-VMMS
Date: 9/27/2018 15:36:44
Event ID: 21024
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: NODE-B.datawisetech.corp
Description: Virtual machine migration operation for ‘ADFS1’ failed at migration source ‘NODE-B’. (Virtual machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7)
Live migration fails due to non-existent SharedStoragePath or ConfigStoreRootPath explained
This is what a Windows Server 2016/2019 cluster that has not been configured with a shared storage path looks like.

Get-VMHostCluster -ClusterName “W2K19-LAB”

Under HKLM\Cluster\Resources\GUIDofWMIResource\Parameters there is a value called ConfigStoreRootPath, which in PowerShell is known as the SharedStoragePath property. You can query it via Get-VMHostCluster as shown above.

And this is what it looks like in the registry (0.Cluster and Cluster keys). The resource ID we are looking at is the one of the Virtual Machine Cluster WMI resource.
If it returns a path, you must verify that it exists. If it doesn’t, you’re in trouble with live migrations. You will also be in trouble with host level guest cluster backups or Hyper-V replicas of them. Maybe you don’t have guest clusters or use in-guest backups and this is just a remnant of trying them out.
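A quick check, sketched below under the assumption that the cluster name from above is used, is to read the property and test whether the configured path still exists:

```powershell
# Read the configured shared storage path; it is empty on a cluster
# that was never configured with one.
$Path = (Get-VMHostCluster -ClusterName "W2K19-LAB").SharedStoragePath

if ([string]::IsNullOrEmpty($Path)) {
    "No SharedStoragePath configured."
}
elseif (Test-Path -Path $Path) {
    "SharedStoragePath '$Path' exists."
}
else {
    "SharedStoragePath '$Path' does NOT exist - live migrations will fail."
}
```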
When I ran it on the problematic cluster I got a path that points to a folder on a CSV that doesn’t exist.
Did they rename the CSV? Replace the storage array? As it turned out, they had reorganized and resized the CSVs. As they couldn’t shrink the SAN LUNs, they created new ones. They then leveraged storage live migration to move the VMs.
The old CSVs were left in place for about six weeks before they were cleaned up. As this was the first time they ran Cluster Aware Updating after removing them, this was the first time they hit this problem. Bingo! You probably think you’ll just change the path to an existing CSV folder path or delete it. Well, as it turns out, you cannot do that. You can try …
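The command that triggers the error below looks like this, for example (the target path here is just an illustration):

```powershell
# Attempt to point the shared storage path to a new CSV folder.
# This fails by design.
Set-VMHostCluster -ClusterName "W2K19-LAB" -SharedStoragePath "C:\ClusterStorage\Volume1\SharedPath"
```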
Set-VMHostCluster : The operation on computer ‘W2K19-LAB’ failed: The WS-Management service cannot process the request. The WMI service or the WMI provider returned an unknown error: HRESULT 0x80070032
At line:1 char:1
+ Set-VMHostCluster -ClusterName “W2K19-LAB” -SharedStoragePath “C:\Clu …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Set-VMHostCluster], VirtualizationException
    + FullyQualifiedErrorId : OperationFailed,Microsoft.HyperV.PowerShell.Commands.SetVMHostCluster
Whatever you try, deleting, overwriting, … no joy. As it turns out, you cannot change it and this is by design. A shaky design, I would say. I understand the reasons: if it changes or is deleted and you have guest clusters with collections depending on what’s in there, you have backup and live migration issues with those guest clusters. But if you can’t change it, you also run into issues when storage changes. You’re damned if you do, damned if you don’t.
Workaround 1
What
Create a CSV with the old name and folder(s) to which the current path is pointing. That works. It could even be a very small one. As a test I used one of 1 GB. I’m not sure that’s enough over time, but if you can easily extend your CSV, that should not pose a problem. It might actually be a good idea to have this as a best practice: have a dedicated CSV for the SharedStoragePath. I’ll need to ask Microsoft.
How
You know how to create a CSV and a folder, I guess. That’s about it. I’ll leave it at that.
Workaround 2
What
Set the path to a new one in the registry (mind you, this won’t fix any problems you might already have with existing guest clusters).
Delete the value for that current path and leave it empty. This one is only a good idea if you don’t have a need for VHD Set Guest clusters anymore. Basically, this is resetting it to the default value.
How
There are two ways to do this. Both cost downtime. You need to bring the cluster service down on all nodes, and then you don’t have your CSVs. That means your VMs must be shut down on all nodes of the cluster.
The Microsoft Support way
Well, that’s what they make you do (which doesn’t mean you should just do it, even without them instructing you to do so):
Export your HKLM\Cluster\Resources\GUIDofWMIResource\Parameters key for safekeeping and restore it if needed.
Shut down all VMs in the cluster, or even the ones residing on a CSV even if not clustered.
Stop the cluster service on all nodes (the cluster is shutdown if you do that), leave the one you are working on for last.
From one node, open up the registry key
Click on HKEY_LOCAL_MACHINE and then click on file, then select load hive
Browse to c:\windows\cluster, and select CLUSDB
Click ok, and then name it DB
Expand DB, then expand Resources
Select the GUID of Virtual Machine WMI
Click on Parameters; there you will find the configStoreRootPath value
Double click on it, and delete it or set it to a new path on a CSV that you created already
Start the cluster service
Then start the cluster service from all nodes, node by node
My way
Not supported, at your own risk, big boy rules apply. I have tried and tested this a dozen times in the lab on multiple clusters and this also works.
In the Cluster registry key (HKLM\Cluster\Resources\GUIDofWMIResource\Parameters) of every cluster node, delete the content of the REG_SZ value configStoreRootPath so it is empty, or change it to a new path on a CSV that you created already for this purpose.
If you have a cluster with a disk witness, the node that owns the disk witness also has a 0.Cluster key (HKLM\0.Cluster\Resources\GUIDofWMIResource\Parameters). Make sure you also change the value there.
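To find the right key you need the ID of the Virtual Machine Cluster WMI resource. A minimal sketch of reading the current value on a node (the key paths follow the layout described above):

```powershell
# Find the ID of the Virtual Machine Cluster WMI resource
$Id = (Get-ClusterResource -Name "Virtual Machine Cluster WMI").Id

# Read the current value from the Cluster key
# (repeat for HKLM:\0.Cluster\... on the node that owns the disk witness)
Get-ItemProperty -Path "HKLM:\Cluster\Resources\$Id\Parameters" -Name ConfigStoreRootPath
```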
When you have done this, you have to shut down all the virtual machines. You then stop the cluster service on every node. I try to work on the node owning the disk witness and shut down the cluster service on that one as the final step. This is also the one where I start the cluster service again first, so I can easily check that the value remains empty in both the Cluster and the 0.Cluster keys. Do note that with a file share / cloud share witness, knowing what node was shut down last can be important. See https://blog.workinghardinit.work/2017/12/11/cluster-shared-volumes-without-active-directory/. That’s why I always remember what node I’m working on and shut down last.
Start up the cluster service on the other nodes one by one.
This avoids having to load the registry hive, but editing the registry on every node in large clusters is tedious. Sure, this can be scripted: shutting down the VMs, stopping the cluster service on all nodes, changing the value and then starting the cluster services and the VMs again. You can control the order in which you go through the nodes in a script as well. I actually did script this using my method; you can find it at the bottom of this blog post.
Both methods work and live migrations will work again. Any existing problematic guest cluster VMs in backup or live migration are food for another blog post perhaps. But you’ll have things like that driving you crazy.
Some considerations
Workaround 1 is a bit of a “you’ve got to be kidding me” solution, but at least it leaves you some freedom to replace, rename and reorganize the other CSVs as you see fit. So perhaps having a dedicated CSV just for this purpose is not that silly. Another benefit is that it does not involve messing around in the cluster database via the registry. That is something we advise against all the time, but it has now become a way to get out of a pickle.
Workaround 2 speaks for itself. There are two ways to achieve it, which I have shown. But a word of warning: the moment the path changes and you have existing VHD Set guest clusters that somehow depend on it, you’ll see backups start having issues and possibly live migrations as well. But you’re toast for all your live migrations anyway already, so … well yeah, what can I do.
So, this is by design. Maybe it is, but it isn’t very realistic that you’re stuck with a path and name that hard, that it causes this much grief, and that it allows people to shoot themselves in the foot. It’s not like all this is documented somewhere.
Conclusion
This needs to be fixed. While I can get you out of this pickle, it is a tedious operation with some risk in a production environment. It also requires downtime, which is bad. On top of that it will only have a satisfying result if you don’t have any VHD Set guest clusters that rely on the old path. The mechanism behind the SharedStoragePath isn’t as robust and flexible yet as it should be when it comes to changes and dealing with failed host level guest cluster backups.
I have tested this in Windows 2019 insider preview. The issue is still there. No progress on that front. Maybe in some of the future cumulative updates, things will be fixed to make guest clustering with VHD Set a more robust and reliable solution. The fact that Microsoft relies on guest clustering to support some deployment scenarios with S2D makes this even more disappointing. It is also a reason I still run physical shared storage-based file clusters.
The problematic host level backups I can work around by leveraging in guest backups. But the path issue is unavoidable if changes are needed.
After 2 years of trouble with the framework around guest cluster backups / VHD Set, it’s time this “just works”. No one will use it when it remains this troublesome and you won’t fix this if no one uses this. The perfect catch 22 situation.
The Script
$ClusterName = "W2K19-LAB"
$OwnerNodeWitnessDisk = $Null
$RemberLastNodeThatWasShutdown = $Null
$LogFileName = "ConfigStoreRootPathChange"
$RegistryPathCluster = "HKLM:\Cluster\Resources\$WMIClusterResourceID\Parameters"
$RegistryPathClusterDotZero = "HKLM:\0.Cluster\Resources\$WMIClusterResourceID\Parameters"
$REGZValueName = "ConfigStoreRootPath"
$REGZValue = $Null #We need to empty the value
#$REGZValue = "C:\ClusterStorage\ReFS-01\SharedPath" #We need to set a new path.
#Region SupportingFunctionsAndWorkFlows
Workflow ShutDownVMs {
param ($AllVMs)
Foreach -parallel ($VM in $AllVMs) {
InlineScript {
try {
If ($using:VM.State -eq "Running") {
Stop-VM -Name $using:VM.Name -ComputerName $using:VM.ComputerName -force
}
}
catch {
$ErrorMessage = $_.Exception.Message
$ErrorLine = $_.InvocationInfo.Line
$ExceptionInner = $_.Exception.InnerException
Write-2-Log -Message "!Error occurred!:" -Severity Error
Write-2-Log -Message $ErrorMessage -Severity Error
Write-2-Log -Message $ExceptionInner -Severity Error
Write-2-Log -Message $ErrorLine -Severity Error
Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
}
}
}
}
#Code to shut down all VMs on all Hyper-V cluster nodes
Workflow StartVMs {
param ($AllVMs)
Foreach -parallel ($VM in $AllVMs) {
InlineScript {
try {
if ($using:VM.State -eq "Off") {
Start-VM -Name $using:VM.Name -ComputerName $using:VM.ComputerName
}
}
catch {
$ErrorMessage = $_.Exception.Message
$ErrorLine = $_.InvocationInfo.Line
$ExceptionInner = $_.Exception.InnerException
Write-2-Log -Message "!Error occurred!:" -Severity Error
Write-2-Log -Message $ErrorMessage -Severity Error
Write-2-Log -Message $ExceptionInner -Severity Error
Write-2-Log -Message $ErrorLine -Severity Error
Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
}
}
}
}
function Write-2-Log {
[CmdletBinding()]
param(
[Parameter()]
[ValidateNotNullOrEmpty()]
[string]$Message,
[Parameter()]
[ValidateNotNullOrEmpty()]
[ValidateSet('Information', 'Warning', 'Error')]
[string]$Severity = 'Information'
)
$Date = get-date -format "yyyyMMdd"
[pscustomobject]@{
Time = (Get-Date -f g)
Message = $Message
Severity = $Severity
} | Export-Csv -Path "$PSScriptRoot\$LogFileName$Date.log" -Append -NoTypeInformation
}
#endregion
Try {
Write-2-Log -Message "Connecting to cluster $ClusterName" -Severity Information
$MyCluster = Get-Cluster -Name $ClusterName
$WMIClusterResource = Get-ClusterResource "Virtual Machine Cluster WMI" -Cluster $MyCluster
Write-2-Log -Message "Grabbing Cluster Resource: Virtual Machine Cluster WMI" -Severity Information
$WMIClusterResourceID = $WMIClusterResource.Id
Write-2-Log -Message "The Cluster Resource Virtual Machine Cluster WMI ID is $WMIClusterResourceID " -Severity Information
#Now that we know the resource ID we can (re)build the registry paths that use it
$RegistryPathCluster = "HKLM:\Cluster\Resources\$WMIClusterResourceID\Parameters"
$RegistryPathClusterDotZero = "HKLM:\0.Cluster\Resources\$WMIClusterResourceID\Parameters"
Write-2-Log -Message "Checking for quorum config (disk, file share / cloud witness) on $ClusterName" -Severity Information
If ((Get-ClusterQuorum -Cluster $MyCluster).QuorumResource -eq "Witness") {
Write-2-Log -Message "Disk witness in use. Looking up the owner node of the witness disk as that holds the 0.Cluster registry key" -Severity Information
#Store the current owner node of the witness disk.
$OwnerNodeWitnessDisk = (Get-ClusterGroup -Name "Cluster Group" -Cluster $MyCluster).OwnerNode
Write-2-Log -Message "Owner node of witness disk is $OwnerNodeWitnessDisk" -Severity Information
}
}
Catch {
$ErrorMessage = $_.Exception.Message
$ErrorLine = $_.InvocationInfo.Line
$ExceptionInner = $_.Exception.InnerException
Write-2-Log -Message "!Error occurred!:" -Severity Error
Write-2-Log -Message $ErrorMessage -Severity Error
Write-2-Log -Message $ExceptionInner -Severity Error
Write-2-Log -Message $ErrorLine -Severity Error
Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
Break
}
try {
$ClusterNodes = $MyCluster | Get-ClusterNode
Write-2-Log -Message "We have grabbed the cluster nodes $ClusterNodes from $MyCluster" -Severity Information
Foreach ($ClusterNode in $ClusterNodes) {
#If we have a disk witness we also need to change the value in the 0.Cluster registry key on the current witness disk owner node.
If ($ClusterNode.Name -eq $OwnerNodeWitnessDisk) {
if (Test-Path -Path $RegistryPathClusterDotZero) {
Write-2-Log -Message "Changing $REGZValueName in 0.Cluster key on $OwnerNodeWitnessDisk who owns the witnessdisk to $REGZvalue" -Severity Information
Invoke-command -computername $ClusterNode.Name -ArgumentList $RegistryPathClusterDotZero, $REGZValueName, $REGZValue {
param($RegistryPathClusterDotZero, $REGZValueName, $REGZValue)
Set-ItemProperty -Path $RegistryPathClusterDotZero -Name $REGZValueName -Value $REGZValue -Force | Out-Null}
}
}
if (Test-Path -Path $RegistryPathCluster) {
Write-2-Log -Message "Changing $REGZValueName in Cluster key on $($ClusterNode.Name) to $REGZvalue" -Severity Information
Invoke-command -computername $ClusterNode.Name -ArgumentList $RegistryPathCluster, $REGZValueName, $REGZValue {
param($RegistryPathCluster, $REGZValueName, $REGZValue)
Set-ItemProperty -Path $RegistryPathCluster -Name $REGZValueName -Value $REGZValue -Force | Out-Null}
}
}
Write-2-Log -Message "Grabbing all VMs on all clusternodes to shut down" -Severity Information
$AllVMs = Get-VM -ComputerName ($ClusterNodes)
Write-2-Log -Message "We are shutting down all running VMs" -Severity Information
ShutdownVMs $AllVMs
}
catch {
$ErrorMessage = $_.Exception.Message
$ErrorLine = $_.InvocationInfo.Line
$ExceptionInner = $_.Exception.InnerException
Write-2-Log -Message "!Error occurred!:" -Severity Error
Write-2-Log -Message $ErrorMessage -Severity Error
Write-2-Log -Message $ExceptionInner -Severity Error
Write-2-Log -Message $ErrorLine -Severity Error
Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
Break
}
try {
#Code to stop the cluster service on all cluster nodes
#ending with the witness owner if there is one
Write-2-Log -Message "Shutting down cluster service on all nodes in $MyCluster that are not the owner of the witness disk" -Severity Information
Foreach ($ClusterNode in $ClusterNodes) {
#First we shut down all nodes that do NOT own the witness disk
If ($ClusterNode.Name -ne $OwnerNodeWitnessDisk) {
Write-2-Log -Message "Stop cluster service on node $($ClusterNode.Name)" -Severity Information
if ((Get-ClusterNode -Cluster $MyCluster | Where-Object {$_.State -eq "Up"}).Count -ne 1) {
Stop-ClusterNode -Name $ClusterNode.Name -Cluster $MyCluster | Out-Null
}
Else {
Stop-Cluster -Cluster $MyCluster -Force | Out-Null
$RemberLastNodeThatWasShutdown = $ClusterNode.Name
}
}
}
#We then shut down the node that owns the witness disk
#If we have a fileshare etc, this won't do anything.
Foreach ($ClusterNode in $ClusterNodes) {
If ($ClusterNode.Name -eq $OwnerNodeWitnessDisk) {
Write-2-Log -Message "Stopping cluster and as such last node $($ClusterNode.Name)" -Severity Information
Stop-Cluster -Cluster $MyCluster -Force | Out-Null
$RemberLastNodeThatWasShutdown = $OwnerNodeWitnessDisk
}
}
#Code to start the cluster service on all cluster nodes,
#starting with the original owner of the witness disk
#or the one that was shut down last
Foreach ($ClusterNode in $ClusterNodes) {
#First we start the node that was shut down last. This is either the one that owned the witness disk
#or just the last node that was shut down in case of a fileshare
If ($ClusterNode.Name -eq $RemberLastNodeThatWasShutdown) {
Write-2-Log -Message "Starting the cluster node $($ClusterNode.Name) that was the last to shut down" -Severity Information
Start-ClusterNode -Name $ClusterNode.Name -Cluster $MyCluster | Out-Null
}
}
Write-2-Log -Message "Starting all other cluster nodes in $MyCluster" -Severity Information
Foreach ($ClusterNode in $ClusterNodes) {
#We then start all the other nodes in the cluster.
If ($ClusterNode.Name -ne $RemberLastNodeThatWasShutdown) {
Write-2-Log -Message "Starting the cluster node $($ClusterNode.Name)" -Severity Information
Start-ClusterNode -Name $ClusterNode.Name -Cluster $MyCluster | Out-Null
}
}
}
catch {
$ErrorMessage = $_.Exception.Message
$ErrorLine = $_.InvocationInfo.Line
$ExceptionInner = $_.Exception.InnerException
Write-2-Log -Message "!Error occurred!:" -Severity Error
Write-2-Log -Message $ErrorMessage -Severity Error
Write-2-Log -Message $ExceptionInner -Severity Error
Write-2-Log -Message $ErrorLine -Severity Error
Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
Break
}
Start-sleep -Seconds 15
Write-2-Log -Message "Grabbing all VMs on all clusternodes to start them up" -Severity Information
$AllVMs = Get-VM -ComputerName ($ClusterNodes)
Write-2-Log -Message "We are starting all stopped VMs" -Severity Information
StartVMs $AllVMs
#Hit it again ...
$AllVMs = Get-VM -ComputerName ($ClusterNodes)
StartVMs $AllVMs
The script above is the one promised. If you use it in a production environment without testing and it blows up in your face, you are going to get fired and it is your fault. You can use it both to introduce and to fix the issue. The actions are logged in the directory the script is run from.
Recently I was involved in troubleshooting a load-balanced web service. That led me to quickly write a PowerShell script to monitor a web service. The original web service was actually not the problem (well, not this time, and after we’d set it to recycle a lot more until someone fixes the code, bar the fact there is no health check on the load balancer!?). It “failed” because it didn’t really handle another failed web service it depended on very well, so that was unclear during initial troubleshooting. That web service is not highly available, bar manually switching over to a “standby” server via ARR, no load balancing. But that’s another discussion.
The culprit & the state of our industry
When we found the problematic web service and saw it ran on Tomcat, we tried restarting the Tomcat service, but that didn’t help. Rebooting the servers on which it ran didn’t help either. Until someone sent us a document with the restart procedure for those servers. This also stated that the Catalina folder needed to be deleted for this to work and to get the service back up and running. It also stated they often needed to do this twice. Well, OK … Based on that note we worked under their assumption that nothing in that folder is needed, as nothing was said about safeguarding any of it.
Having said that, why on earth, over all those years, the developers did not find out what is causing the issue and fix it beats me. For years and years they’ve been doing this manually. Sometimes several days a week, sometimes multiple times a day. On several servers. Good luck when no one is around to do so, or knows the process. The doc was from a developer. A developer in what is supposed to be a DevOps environment. No one ever made the effort to find out what makes the web service crash or to automate recovery.
PowerShell Script to Monitor a web service
I think it’s safe to say I won’t get them to any form of site resilience engineering soon. But I did leave them with a script they can schedule to automate their manual actions. This does mean that an “ordinary” restart of the server does not fix any issues with the web service. So, ideally this script is also run at server startup!
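To cover the server startup scenario, the same script can be registered as a second scheduled task that fires at boot. A minimal sketch, assuming the same script path as in the script header; the task name is a made-up example, so adjust both to your environment:

```shell
REM Run the monitor script once at every server startup (NOTE LINE WRAP!).
Schtasks.exe /CREATE /TN MonitorMyWebServiceAtBoot /TR "Powershell.exe C:\SysAdmin\Scripts\MonitorMyWebService.ps1"
/RU SYSTEM /RL HIGHEST /F /SC ONSTART
```

With `/SC ONSTART` the task triggers at boot rather than on a daily repetition interval, so the interval-based task in the script header and this one complement each other.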
The script has basic error handling and logging, but it’s a quick fix for a manual process, so it’s not a spick-and-span script. But it’s enough to do the job for now and hopefully inspire them to do better. It is 2018 after all, and even site resilience engineering needs a new incarnation in this fashion-driven industry.
I’ve included this PowerShell script to monitor a web service below as an example and reference to my future self. Enjoy.
<#
Author: Didier Van Hoye
Date: 2018/09/24
Version: 0.9.1
Blog: https://blog.workinghardinit.work
Twitter: @WorkingHardInIT

This PowerShell script automates the restart of Tomcat7 when needed. The need is based
on whether the web service running on Tomcat7 returns HTTP status 200 or not.
The work this script does is based on the memo that describes the manual procedure.
It takes away the manual, reactive actions that they did multiple days per week,
sometimes multiple times per day.

You can register this script as a scheduled task to run every X minutes.
Below is an example. NOTE LINE WRAPS!!!
Schtasks.exe /CREATE /TN MonitorMyWebService /TR "Powershell.exe C:\SysAdmin\Scripts\MonitorMyWebService.ps1"
/RU SYSTEM /RL HIGHEST /F /SC DAILY /RI 15 /ST 00:00

Having said that, why on earth, over all those years, the developers did not find out
what was causing the issue and fix it beats me. Also, since the Catalina folder needs
to be deleted for this to work, we work under their assumption that nothing in there is needed.
This does mean that an "ordinary" restart of the server does not fix any issues with the web service.
So, ideally this script is also run at server startup.
The script logs its findings and actions to a log file in the script directory.
#>
$ErrorActionPreference = "Stop"
$FolderToDelete = "C:\Program Files\Apache Software Foundation\Tomcat 7.0\work\Catalina"
$MyStatusRunning = 'Running'
$MyStatusStopped = 'Stopped'
$MyService = "Tomcat7"
$MyWebService = "https://mywebservice.company.com/metadatasearch"
$MyWebServiceStatus = 0
$MyServiceCheckLogFile = "MetaDataMonitor"
#region CheckWebServiceStatus
Function CheckWebService {
    [CmdletBinding()]
    param(
        [String]$WebService
    )
    try {
        # Create a new web request.
        $HTTP_Request = [System.Net.WebRequest]::Create($WebService)
        # Get a response from the site.
        $HTTP_Response = $HTTP_Request.GetResponse()
        # Cast the status of the service to an integer.
        $HTTP_Status = [int]$HTTP_Response.StatusCode
        If ($HTTP_Status -eq 200) {
            ##Write-Host "All is OK!"
        }
        Else {
            ##Write-Host "The service might be down!"
        }
        # Don't litter :-) Close the response before returning.
        $HTTP_Response.Close()
        Return $HTTP_Status
    }
    Catch {
        #Write-Host $_.Exception.InnerException
        if ($_.Exception.InnerException.Message -like "*The remote server returned an error: (500) Internal Server Error.*") {
            $HTTP_Status = [int]500
        }
        else {
            $HTTP_Status = [int]999
        }
        Return $HTTP_Status
    }
}
#endregion
#region Write-2-Log
function Write-2-Log {
    [CmdletBinding()]
    param(
        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [string]$Message,
        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [ValidateSet('Information', 'Warning', 'Error')]
        [string]$Severity = 'Information'
    )
    $Date = Get-Date -Format "yyyyMMdd"
    [pscustomobject]@{
        Time     = (Get-Date -f g)
        Message  = $Message
        Severity = $Severity
    } | Export-Csv -Path "$PSScriptRoot\$MyServiceCheckLogFile-$Date.log" -Append -NoTypeInformation
}
#endregion
Write-2-Log -Message "Starting Web Service Status check" -Severity Information
$MyWebServiceStatus = CheckWebService $MyWebService
<#
For some reason once might not be enough
So we strafe them twice - what's worth shooting once is worth shooting twice.
#>
For ($counter = 1; $counter -le 2; $counter++) {
    Try {
        If ($MyWebServiceStatus -ne 200) {
            #Write-Host "The webservice has a problem"
            Write-2-Log -Message "The webservice has a problem as it did not return HTTP Status 200" -Severity Warning
            #Stop the Tomcat service if it is running
            $ServiceObject = $Null
            $ServiceObject = Get-Service $MyService
            If ($ServiceObject | Where-Object { $_.Status -eq $MyStatusRunning }) {
                #Write-Host "Running"
                Write-2-Log -Message "Stopping $MyService ..." -Severity Information
                Stop-Service $MyService
                $ServiceObject.WaitForStatus($MyStatusStopped, "00:00:05")
                #Get-Service $MyService
                Write-2-Log -Message "$MyService has been stopped..." -Severity Information
                #Write-Host "Stopped"
            }
            #Write-Host "Delete folder"
            if (Test-Path $FolderToDelete -PathType Container) {
                #Write-Host Folder Exists
                Write-2-Log -Message "The $FolderToDelete exists ..." -Severity Information
                #Delete the folder & its contents
                Get-ChildItem -Path "$FolderToDelete\*" -Recurse | Remove-Item -Force -Recurse
                Write-2-Log -Message "$FolderToDelete content has been recursively deleted ..." -Severity Information
                Remove-Item $FolderToDelete -Recurse -Force
                Write-2-Log -Message "$FolderToDelete has been deleted ..." -Severity Information
                #Write-Host Folder Deleted
            }
            #Start the Tomcat service
            $ServiceObject = $Null
            $ServiceObject = Get-Service $MyService
            if ($ServiceObject | Where-Object { $_.Status -eq $MyStatusStopped }) {
                Write-2-Log -Message "Starting $MyService ..." -Severity Information
                Start-Service $MyService
                $ServiceObject.WaitForStatus($MyStatusRunning, "00:00:05")
                #Get-Service $MyService
                Write-2-Log -Message "$MyService has been started..." -Severity Information
                #Write-Host Started
            }
        }
    }
    catch {
        Write-2-Log -Message "Bailing out of the script due to error" -Severity Warning
        Write-2-Log -Message $_.Exception.Message -Severity Error
        Write-2-Log -Message $_.InvocationInfo.PositionMessage -Severity Error
        Write-2-Log -Message $_.FullyQualifiedErrorID -Severity Error
        Write-2-Log -Message "Stopping Web Service Status check" -Severity Information
    }
    Finally {
        Start-Sleep -Seconds 5
        # Re-check the status so the second pass only acts if the service is still down.
        $MyWebServiceStatus = CheckWebService $MyWebService
    }
}
Sometimes I get asked: how do you know all this shit? Especially after fixing an issue. Well, partially because I read and experiment a lot. Partially experience. That’s it. It doesn’t come to me in visions, dreams or by a 25Gbps fibre up-link to the big brain in the cloud.
It takes time and effort. That’s it. Time is priceless and we all have 24 hours in a day. Effort is something we chip into the equation. That has a cost and as a result a price. And as with everything there is a limit to what one can do. So where I spend my time is a balance between need, interest, ROI, fun factor and avoiding BS.
What’s also important to know is that I know far less than I would like to. I mean that. I have met so many people that are smarter, quicker, better and more entrepreneurial than me that … it would be demotivating. But it isn’t, I just enjoy the insights & education it brings me. On top of that it’s a welcome change from modern landscape office chatter and helps maintain / restore some sort of faith in mankind, I guess. It’s also fun.
Fix my problems already!
That’s fine, you say, but “why can’t you fix all our problems then, huh?” That’s easy. I don’t know enough to fix ALL your problems. I also probably don’t want to fix them, as you don’t or won’t pay me enough to fix them. And even if you did, I might not have time for it, or you might be beyond saving. A lot of your issues are being created by a lack of context, insight & understanding. I call that wishful management. Basically it means that you’re digging yourself into a hole faster than we can get you out. It’s not even a question of skills, resources or money, it’s just hopeless. A bit like Enterprise IT at times.
You’re being negative
“Geez, such negativity”. No, it’s not. It’s recognizing the world is not perfect. That not all issues with technical solutions are technology induced. It’s about realizing that things can and will go wrong.
So part of my endeavors is making sure I know what to do when the shit hits the fan. To be able to do so you need to understand the technology used: build it, break it, recover it. The what & how depends on the solution at hand (cattle versus holey cows). Failure is not an option you choose “not to select”. Failure is guaranteed. By the time I fail for real, I try to be prepared by repeated failure in the lab.
I spend time in the lab for hands-on testing. I also spend time at my desk, on the road, in my comfy chair reading, scribbling down notes, writing & drawing concepts & ideas. Nothing else. And during walks I tend to process all my impressions. It’s something that helps me, so I make room for that. I highly recommend that you figure out what works for you. A favorite of mine is to grab a coffee and sit down. Without my e-mail open, with my phone muted, without a calendar nagging me or the pseudo crisis of the moment stealing my time.
That in combination with actually working with the technology is what brings the understanding, the insight, the context. My core team members and network buddies can always get a hold of me in that reserved time and I will answer their call. Why? Because I know they won’t abuse it and have a serious need, not some self-inflicted crisis which looks bad but poses little danger.