Altaro Free Webinar: What’s New in Windows Server 2019

Altaro Software is running a free webinar: What’s New in Windows Server 2019. The timing, October 3rd, could not be better as Microsoft Ignite lies right behind us and we can all use some help in putting that barrage of announcements into context. That help is here and offered by industry experts like Andy Syrewicze, Rob Corradini and Symon Perriman.

So register here and get some insights, views and guidance from industry experts on the value of Windows Server 2019 and what this means to so many of us in the IT field. As our industry is changing and new balances are found, let me say that Windows Server 2019 will play a significant role in building the future of IT over the next years.

My take

The industries I advise know I advocate serverless, containers and servers as well as PaaS & SaaS. My strength lies in the fact that I know the IT stack: compute, storage, memory and networks. I started my career developing code and as such I know code needs an excellent environment to run in so it can shine. I also know there is a lot I don’t know. So I’m active in the community, attend conferences, and listen to and learn from other people’s and vendors’ points of view and insights. You must avoid tunnel vision and echo chambers. But you also must grasp your own industry and business in order to make decisions and move ahead.

Observe, orient, decide and act in a never-ending cycle. So register for Altaro’s free webinar, What’s New in Windows Server 2019, and get a head start in this process. I have registered and intend to attend unless work priorities prevent me from doing so. I most certainly hope not! See you there. Register here. It’s free, all you have to do is show up and invest some time in your own future.

Live Migration Fails due to non-existent SharedStoragePath or ConfigStoreRootPath

Introduction

I was tasked with troubleshooting a cluster where Cluster-Aware Updating (CAU) failed because the nodes never succeeded in going into maintenance mode. None of the obvious or well-known issues and mistakes that might break live migrations seemed to be present. Looking at the cluster and testing live migration, I found that not a single VM on any node would live migrate to any other node.
So, I took a peek at the event ID and description and it hit me. I had seen this particular event ID before.

Live Migration Fails due to non-existent SharedStoragePath or ConfigStoreRootPath

Log Name:      System
Source:        Microsoft-Windows-Hyper-V-High-Availability
Date:          9/27/2018 15:36:44
Event ID:      21502
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Live migration of ‘Virtual Machine ADFS1’ failed.
Virtual machine migration operation for ‘ADFS1’ failed at migration source ‘NODE-B’. (Virtual machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7)
Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).
The live migration fails due to non-existent SharedStoragePath or ConfigStoreRootPath which is where collections metadata lives.

More errors are logged

There usually are more related tell-tale events. They are, however, clear in pinpointing the root cause.

On the destination host

On the destination host you’ll find event id 21066:

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:45
Event ID:      21066
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-A.datawisetech.corp
Description:
Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).

You’ll also find a bunch of 1106 events per failed live migration per VM, like the one below:

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Operational
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:45
Event ID:      1106
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-A.datawisetech.corp
Description:
vm\service\migration\vmmsvmmigrationdestinationtask.cpp(5617)\vmms.exe!00007FF77D2171A4: (caller: 00007FF77D214A5D) Exception(998) tid(1fa0) 80070002 The system cannot find the file specified.

On the source host

On the source host you’ll find event id 1840 logged
Log Name:      Microsoft-Windows-Hyper-V-Worker-Operational
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          9/27/2018 15:36:44
Event ID:      1840
Task Category: None
Level:         Error
Keywords:
User:          NT VIRTUAL MACHINE\4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7
Computer:      NODE-B.datawisetech.corp
Description:
[Virtual machine 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7] onecore\vm\worker\migration\workertaskmigrationsource.cpp(281)\vmwp.exe!00007FF6E7C46141: (caller: 00007FF6E7B8957D) Exception(2) tid(ff4) 80042001     CallContext:[\SourceMigrationTask]

As well as event id 21111:
Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-High-Availability
Date:          9/27/2018 15:36:44
Event ID:      21111
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Live migration of ‘Virtual Machine ADFS1’ failed.

… event id 21066:
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:44
Event ID:      21066
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Failed to verify collection registry for virtual machine ‘ADFS1’: The system cannot find the file specified. (0x80070002). (Virtual Machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7).

… and event id 21024:
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          9/27/2018 15:36:44
Event ID:      21024
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:      NODE-B.datawisetech.corp
Description:
Virtual machine migration operation for ‘ADFS1’ failed at migration source ‘NODE-B’. (Virtual machine ID 4B5F2F6C-AEA3-4C7B-8342-E255D1D112D7)
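
To collect the events listed above in one go, something along these lines can help. This is a minimal sketch, not part of the original troubleshooting: the function name Get-LiveMigrationFailureFilter is hypothetical, and the log names and event IDs are simply copied from the events shown above. It assumes those logs exist on the node you query.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for the event IDs
# and log names that appear in the events shown above.
function Get-LiveMigrationFailureFilter {
    param(
        # Only look at recent events; one hour back is an arbitrary default.
        [datetime]$Since = (Get-Date).AddHours(-1)
    )

    @{
        LogName   = @(
            'System',
            'Microsoft-Windows-Hyper-V-High-Availability-Admin',
            'Microsoft-Windows-Hyper-V-VMMS-Admin',
            'Microsoft-Windows-Hyper-V-VMMS-Operational',
            'Microsoft-Windows-Hyper-V-Worker-Operational'
        )
        Id        = 21502, 21066, 1106, 1840, 21111, 21024
        StartTime = $Since
    }
}

# Usage on or against a cluster node (node name taken from the events above):
# Get-WinEvent -FilterHashtable (Get-LiveMigrationFailureFilter) -ComputerName 'NODE-B' |
#     Sort-Object TimeCreated |
#     Format-List TimeCreated, Id, LogName, Message
```

Run it against both the source and the destination node right after a failed live migration, so the events from both sides line up on the timestamp.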

Live migration fails due to non-existent SharedStoragePath or ConfigStoreRootPath explained

If you have worked with guest clusters and the ConfigStoreRootPath you know about issues with collections/groups & checkpoints. This is related to those. If you haven’t heard anything about this yet, read https://blog.workinghardinit.work/2018/09/10/correcting-the-permissions-on-the-folder-with-vhds-files-checkpoints-for-host-level-hyper-v-guest-cluster-backups/.

This is what a Windows Server 2016/2019 cluster that has not been configured with a SharedStoragePath looks like.

Get-VMHostCluster -ClusterName "W2K19-LAB"

image

Under HKLM\Cluster\Resources\GUIDofWMIResource\Parameters there is a value called ConfigStoreRootPath, which in PowerShell is known as the SharedStoragePath property.  You can also query it via

And this is what it looks like in the registry (0.Cluster and Cluster keys.) The resource ID we are looking at is the one of the Virtual Machine Cluster WMI resource.

image

If it returns a path you must verify that it exists. If it doesn’t, you’re in trouble with live migrations. You will also be in trouble with host level guest cluster backups or Hyper-V replicas of them. Maybe you don’t have guest clusters, or you use in-guest backups, and this is just a remnant of trying them out.
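
The check described above can be sketched as follows. This is only an illustration under a few assumptions: the function name Get-SharedStoragePathStatus and its three return values are hypothetical, and the check must run on a cluster node, since the path lives under C:\ClusterStorage there.

```powershell
# Hypothetical helper: classifies a SharedStoragePath value.
#   NotConfigured -> the property is empty (the default)
#   Present       -> the folder the path points to exists
#   Missing       -> a path is set but the folder is gone
function Get-SharedStoragePathStatus {
    param(
        [string]$SharedStoragePath
    )

    if ([string]::IsNullOrWhiteSpace($SharedStoragePath)) {
        return 'NotConfigured'
    }
    if (Test-Path -LiteralPath $SharedStoragePath -PathType Container) {
        return 'Present'
    }
    return 'Missing'
}

# Usage on a cluster node (requires the Hyper-V PowerShell module):
# $path = (Get-VMHostCluster -ClusterName 'W2K19-LAB').SharedStoragePath
# Get-SharedStoragePathStatus -SharedStoragePath $path
```

A result of Missing is the situation this post is about. NotConfigured and Present should not block live migrations for this reason.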

When I run it on the problematic cluster I get a path that points to a folder on a CSV that doesn’t exist.

Get-VMHostCluster -ClusterName "W2K19-LAB"
ClusterName SharedStoragePath
----------- -----------------
W2K19-LAB   C:\ClusterStorage\ReFS-01\SharedStoragePath

What happened?

Did they rename the CSV? Replace the storage array? Well, as it turned out they reorganized and resized the CSVs. As they can’t shrink SAN LUNs they created new ones. They then leveraged storage live migration to move the VMs.

The old CSVs were left in place for about 6 weeks before they were cleaned up. As this was the first time they ran Cluster-Aware Updating after removing them, this is the first time they hit this problem. Bingo! You probably think you’ll just change it to an existing CSV folder path or delete it. Well, as it turns out, you cannot do that. You can try …

PS C:\Users\administrator1> Set-VMHostCluster -ClusterName "W2K19-LAB" -SharedStoragePath "C:\ClusterStorage\Volume1\SharedStoragePath"

Set-VMHostCluster : The operation on computer ‘W2K19-LAB’ failed: The WS-Management service cannot process the request. The WMI service or the WMI provider returned an unknown error: HRESULT 0x80070032
At line:1 char:1
+ Set-VMHostCluster -ClusterName "W2K19-LAB" -SharedStoragePath "C:\Clu …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo          : NotSpecified: (:) [Set-VMHostCluster], VirtualizationException
+ FullyQualifiedErrorId : OperationFailed,Microsoft.HyperV.PowerShell.Commands.SetVMHostCluster

Or try …
$path = "C:\ClusterStorage\Volume1\Hyper-V\Shared"
Get-ClusterResource “Virtual Machine Cluster WMI” | Set-ClusterParameter -Name ConfigStoreRootPath -Value $path -Create

Whatever you try, deleting, overwriting, … no joy. As it turns out you cannot change it and this is by design. A shaky design, I would say. I understand the reasons: if it changes or is deleted and you have guest clusters with collections depending on what’s in there, you have backup and live migration issues with the guest clusters. But if you can’t change it you also run into issues if storage changes. You’re damned if you do, damned if you don’t.

Workaround 1

What

Create a CSV with the old name and the folder(s) to which the current path is pointing. That works. It could even be a very small one. As a test I used one of 1GB. I’m not sure if that’s enough over time, but if you can easily extend your CSV that should not pose a problem. It might actually be a good idea to have this as a best practice: have a dedicated CSV for the SharedStoragePath. I’ll need to ask Microsoft.

How

You know how to create a CSV and a folder I guess, that’s about it.  I’ll leave it at that.

Workaround 2

What

Set the path to a new one in the registry. This could be a new path (mind you this won’t fix any problems you might already have now with existing guest clusters).

Delete the value for that current path and leave it empty. This one is only a good idea if you don’t have a need for VHD Set Guest clusters anymore. Basically, this is resetting it to the default value.

How

There are 2 ways to do this. Both cost downtime. You need to bring the cluster service down on all nodes and then you don’t have your CSVs. That means your VMs must be shut down on all nodes of the cluster.

The Microsoft Support way

Well that’s what they make you do (which doesn’t mean you should just do it even without them instructing you to do so)

  1. Export your HKLM\Cluster\Resources\GUIDofWMIResource\Parameters for safekeeping and restore if needed.
  2. Shut down all VMs in the cluster, and also the ones residing on a CSV even if they are not clustered.
  3. Stop the cluster service on all nodes (the cluster is shut down if you do that), leave the one you are working on for last.
  4. From one node, open up the registry editor
  5. Click on HKEY_LOCAL_MACHINE and then click on File, then select Load Hive
  6. Browse to c:\windows\cluster, and select CLUSDB
  7. Click OK, and then name it DB
  8. Expand DB, then expand Resources
  9. Select the GUID of Virtual Machine WMI
  10. Click on Parameters, on (ConfigStoreRootPath) you will find the value
  11. Double click on it, and delete it or set it to a new path on a CSV that you created already
  12. Start the cluster service
  13. Then start the cluster service on all nodes, node by node
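
Step 1 of the list above can be sketched like this. It is a minimal illustration, not what Microsoft Support hands you: the function name Get-WmiResourceParametersKey and the backup file name are hypothetical, and it assumes you look up the resource ID with Get-ClusterResource while the cluster is still up.

```powershell
# Hypothetical helper: builds the registry key path (reg.exe notation) that
# holds ConfigStoreRootPath for the Virtual Machine Cluster WMI resource.
function Get-WmiResourceParametersKey {
    param(
        [Parameter(Mandatory)]
        [string]$ResourceId,

        # 'Cluster' exists on every node, '0.Cluster' only on the node
        # that owns the disk witness.
        [ValidateSet('Cluster', '0.Cluster')]
        [string]$Hive = 'Cluster'
    )

    "HKLM\$Hive\Resources\$ResourceId\Parameters"
}

# Usage on a cluster node, before you stop anything:
# $id  = (Get-ClusterResource 'Virtual Machine Cluster WMI').Id
# $key = Get-WmiResourceParametersKey -ResourceId $id
# reg.exe export $key "$env:TEMP\ConfigStoreRootPath-backup.reg" /y
```

Keep the exported .reg file somewhere that does not live on a CSV, as the CSVs are gone once the cluster service is stopped.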

My way

Not supported, at your own risk, big boy rules apply. I have tried and tested this a dozen times in the lab on multiple clusters and this also works.

  1. In the registry key Cluster (HKLM\Cluster\Resources\GUIDofWMIResource\Parameters) of every cluster node, delete the content of the REG_SZ value for ConfigStoreRootPath so it is empty, or change it to a new path on a CSV that you created already for this purpose.
  2. If you have a cluster with a disk witness, the node that owns the disk witness also has a 0.Cluster key (HKLM\0.Cluster\Resources\GUIDofWMIResource\Parameters). Make sure you also change the value there.
  3. When you have done this, you have to shut down all the virtual machines. You then stop the cluster service on every node. I try to work on the node owning the disk witness and shut down the cluster on that one as the final step. This is also the one where I start the cluster again first, so I can easily check that the value remains empty in both the Cluster and the 0.Cluster keys. Do note that with a file share / cloud share witness, knowing what node was shut down last can be important. See https://blog.workinghardinit.work/2017/12/11/cluster-shared-volumes-without-active-directory/. That’s why I always remember what node I’m working on and shut down last.
  4. Start up the cluster service on the other nodes one by one.
  5. This avoids having to load the registry hive, but editing the registry on every node in large clusters is tedious. Sure, this can be scripted in combination with shutting down the VMs, stopping the cluster service on all nodes, changing the value and then starting the cluster services again as well as the VMs. You can control the order in which you go through the nodes in a script as well. I actually did script this, but I used my method. You can find it at the bottom of this blog post.

Both methods will work and live migrations will work again. Any existing problematic guest cluster VMs in backup or live migration are food for another blog post perhaps. But you’ll have things like driving you crazy.

Some considerations

Workaround 1 is a bit of a “you got to be kidding me” solution, but at least it leaves some freedom to replace, rename and reorganize the other CSVs as you see fit. So perhaps having a dedicated CSV just for this purpose is not that silly. Another benefit is that this does not involve messing around in the cluster database via the registry. This is something we advise against all the time but now has become a way to get out of a pickle.

Workaround 2 speaks for itself. There are two ways to achieve this, which I have shown. But a word of warning. The moment the path changes and you have some already existing VHD Set guest clusters that somehow depend on it, you’ll see that backups start having issues and possibly even live migrations. But you’re toast for all your live migrations anyway already so … well yeah, what can I do.

So, this is by design. Maybe it is, but it isn’t very realistic that you’re stuck to a path and name that hard and that it causes this much grief or allows people to shoot themselves in the foot. It’s not like all this is documented somewhere.

Conclusion

This needs to be fixed. While I can get you out of this pickle it is a tedious operation with some risk in a production environment. It also requires down time, which is bad. On top of that it will only have a satisfying result if you don’t have any VHD Set guest clusters that rely on the old path. The mechanism behind the SharedStoragePath isn’t as robust and flexible yet as it should be when it comes to changes & dealing with failed host level guest cluster backups.

I have tested this in the Windows Server 2019 Insider Preview. The issue is still there. No progress on that front. Maybe in some of the future cumulative updates, things will be fixed to make guest clustering with VHD Set a more robust and reliable solution. The fact that Microsoft relies on guest clustering to support some deployment scenarios with S2D makes this even more disappointing. It is also a reason I still run physical shared storage-based file clusters.

The problematic host level backups I can work around by leveraging in guest backups. But the path issue is unavoidable if changes are needed.

After 2 years of trouble with the framework around guest cluster backups / VHD Set, it’s time this “just works”. No one will use it when it remains this troublesome and you won’t fix this if no one uses this. The perfect catch 22 situation.

The Script

$ClusterName = "W2K19-LAB"
$OwnerNodeWitnessDisk = $Null
$RemberLastNodeThatWasShutdown = $Null
$LogFileName = "ConfigStoreRootPathChange"

$RegistryPathCluster = "HKLM:\Cluster\Resources\$WMIClusterResourceID\Parameters"
$RegistryPathClusterDotZero = "HKLM:\0.Cluster\Resources\$WMIClusterResourceID\Parameters"
$REGZValueName = "ConfigStoreRootPath" 
$REGZValue = $Null #We need to empty the value
#$REGZValue = "C:\ClusterStorage\ReFS-01\SharedPath" #We need to set a new path.

#Region SupportingFunctionsAndWorkFlows
Workflow ShutDownVMs {
    param ($AllVMs)
    
    Foreach -parallel ($VM in $AllVMs) {
        InlineScript {
            try {
                If ($using:VM.State -eq "Running") {
                    Stop-VM -Name $using:VM.Name -ComputerName $using:VM.ComputerName -force 
                } 
            }
            catch {
                $ErrorMessage = $_.Exception.Message
                $ErrorLine = $_.InvocationInfo.Line
                $ExceptionInner = $_.Exception.InnerException
                Write-2-Log -Message "!Error occurred!:" -Severity Error
                Write-2-Log -Message $ErrorMessage -Severity Error
                Write-2-Log -Message $ExceptionInner -Severity Error
                Write-2-Log -Message $ErrorLine -Severity Error
                Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
            }
        }
    }
}

#Code to start all VMs on all Hyper-V cluster nodes
Workflow StartVMs {
    param ($AllVMs)
    Foreach -parallel ($VM in $AllVMs) {
        InlineScript {
            try {
                if ($using:VM.State -eq "Off") {
                    Start-VM -Name $using:VM.Name -ComputerName $using:VM.ComputerName 
                }
            }
            catch {
                $ErrorMessage = $_.Exception.Message
                $ErrorLine = $_.InvocationInfo.Line
                $ExceptionInner = $_.Exception.InnerException
                Write-2-Log -Message "!Error occurred!:" -Severity Error
                Write-2-Log -Message $ErrorMessage -Severity Error
                Write-2-Log -Message $ExceptionInner -Severity Error
                Write-2-Log -Message $ErrorLine -Severity Error
                Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
            }
        }
    }
}
function Write-2-Log {
    [CmdletBinding()]
    param(
        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [string]$Message,
        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [ValidateSet('Information', 'Warning', 'Error')]
        [string]$Severity = 'Information'
    )
 
    $Date = get-date -format "yyyyMMdd"
    [pscustomobject]@{
        Time     = (Get-Date -f g)
        Message  = $Message
        Severity = $Severity
        
    } | Export-Csv -Path "$PSScriptRoot\$LogFileName$Date.log" -Append -NoTypeInformation
}


#endregion

Try {
    Write-2-Log -Message "Connecting to cluster $ClusterName" -Severity Information
    $MyCluster = Get-Cluster -Name $ClusterName
    $WMIClusterResource = Get-ClusterResource "Virtual Machine Cluster WMI" -Cluster $MyCluster
    Write-2-Log -Message "Grabbing Cluster Resource: Virtual Machine Cluster WMI" -Severity Information
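    $WMIClusterResourceID = $WMIClusterResource.Id
    #Build the registry paths here. The resource ID is only known at this point,
    #so any path built from $WMIClusterResourceID earlier would contain an empty ID.
    $RegistryPathCluster = "HKLM:\Cluster\Resources\$WMIClusterResourceID\Parameters"
    $RegistryPathClusterDotZero = "HKLM:\0.Cluster\Resources\$WMIClusterResourceID\Parameters"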
    Write-2-Log -Message "The Cluster Resource Virtual Machine Cluster WMI ID is $WMIClusterResourceID " -Severity Information
    Write-2-Log -Message "Checking for quorum config (disk, file share / cloud witness) on $ClusterName" -Severity Information

    If ((Get-ClusterQuorum -Cluster $MyCluster).QuorumResource -eq "Witness") {
        Write-2-Log -Message "Disk witness in use. Looking up the owner node of the witness disk as that holds the 0.Cluster registry key" -Severity Information
        #Store the name of the current owner node of the witness disk.
        $OwnerNodeWitnessDisk = (Get-ClusterGroup -Name "Cluster Group" -Cluster $MyCluster).OwnerNode.Name
        Write-2-Log -Message "Owner node of witness disk is $OwnerNodeWitnessDisk" -Severity Information
    }
}
Catch {
    $ErrorMessage = $_.Exception.Message
    $ErrorLine = $_.InvocationInfo.Line
    $ExceptionInner = $_.Exception.InnerException
    Write-2-Log -Message "!Error occurred!:" -Severity Error
    Write-2-Log -Message $ErrorMessage -Severity Error
    Write-2-Log -Message $ExceptionInner -Severity Error
    Write-2-Log -Message $ErrorLine -Severity Error
    Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
    Break
}

try {
    $ClusterNodes = $MyCluster | Get-ClusterNode
    Write-2-Log -Message "We have grabbed the cluster nodes $ClusterNodes from $MyCluster" -Severity Information

    Foreach ($ClusterNode in $ClusterNodes) {
        #If we have a disk witness we also need to change the value in the 0.Cluster registry key on the current witness disk owner node.
        If ($ClusterNode.Name -eq $OwnerNodeWitnessDisk) {
            if (Test-Path -Path $RegistryPathClusterDotZero) {
                Write-2-Log -Message "Changing $REGZValueName in 0.Cluster key on $OwnerNodeWitnessDisk who owns the witnessdisk to $REGZvalue" -Severity Information
                Invoke-command -computername $ClusterNode.Name -ArgumentList $RegistryPathClusterDotZero, $REGZValueName, $REGZValue {
                    param($RegistryPathClusterDotZero, $REGZValueName, $REGZValue)
                    Set-ItemProperty -Path $RegistryPathClusterDotZero -Name $REGZValueName -Value $REGZValue -Force | Out-Null}
            }
        }
        if (Test-Path -Path $RegistryPathCluster) {
            Write-2-Log -Message "Changing $REGZValueName in Cluster key on $($ClusterNode.Name) to $REGZvalue" -Severity Information
            Invoke-command -computername $ClusterNode.Name -ArgumentList $RegistryPathCluster, $REGZValueName, $REGZValue {
                param($RegistryPathCluster, $REGZValueName, $REGZValue)
                Set-ItemProperty -Path $RegistryPathCluster -Name $REGZValueName -Value $REGZValue -Force | Out-Null}
        }
    }

    Write-2-Log -Message "Grabbing all VMs on all clusternodes to shut down" -Severity Information
    $AllVMs = Get-VM -ComputerName ($ClusterNodes)
    Write-2-Log -Message "We are shutting down all running VMs" -Severity Information
    ShutdownVMs $AllVMs
}

catch {
    $ErrorMessage = $_.Exception.Message
    $ErrorLine = $_.InvocationInfo.Line
    $ExceptionInner = $_.Exception.InnerException
    Write-2-Log -Message "!Error occurred!:" -Severity Error
    Write-2-Log -Message $ErrorMessage -Severity Error
    Write-2-Log -Message $ExceptionInner -Severity Error
    Write-2-Log -Message $ErrorLine -Severity Error
    Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
    Break
}

try {
    #Code to stop the cluster service on all cluster nodes
    #ending with the witness owner if there is one
    Write-2-Log -Message "Shutting down cluster service on all nodes in $MyCluster that are not the owner of the witness disk" -Severity Information
    Foreach ($ClusterNode in $ClusterNodes) {
        #First we shut down all nodes that do NOT own the witness disk
    
        If ($ClusterNode.Name -ne $OwnerNodeWitnessDisk) {
            Write-2-Log -Message "Stop cluster service on node $($ClusterNode.Name)" -Severity Information
            if ((Get-ClusterNode -Cluster $MyCluster | where-object {$_.State -eq "Up"}).count -ne 1) {
                Stop-ClusterNode -Name $ClusterNode.Name -Cluster $MyCluster | Out-Null
            }
            Else {
                Stop-Cluster -Cluster $MyCluster -Force | Out-Null
                $RemberLastNodeThatWasShutdown = $ClusterNode.Name
            }
        }
    }
    #We then shut down the node that owns the witness disk
    #If we have a fileshare etc,  this won't do anything.
    Foreach ($ClusterNode in $ClusterNodes) {
        If ($ClusterNode.Name -eq $OwnerNodeWitnessDisk) {
            Write-2-Log -Message "Stopping cluster and as such last node $($ClusterNode.Name)" -Severity Information
            Stop-Cluster -Cluster $MyCluster -Force | Out-Null
            $RemberLastNodeThatWasShutdown = $OwnerNodeWitnessDisk
        }
    }  
    #Code to start the cluster service on all cluster nodes,
    #starting with the original owner of the witness disk
    #or the one that was shut down last


    Foreach ($ClusterNode in $ClusterNodes) {
        #First we start the node that was shut down last. This is either the one that owned the witness disk
        #or just the last node that was shut down in case of a fileshare
        If ($ClusterNode.Name -eq $RemberLastNodeThatWasShutdown) {
            Write-2-Log -Message "Starting the clusternode $($ClusterNode.Name) that was the last to shut down" -Severity Information
            Start-ClusterNode -Name $ClusterNode.Name -Cluster $MyCluster | Out-Null
        }           
    }

    Write-2-Log -Message "Starting all the other clusternodes in $MyCluster" -Severity Information
    Foreach ($ClusterNode in $ClusterNodes) {
        #We then start all the other nodes in the cluster.     
        If ($ClusterNode.Name -ne $RemberLastNodeThatWasShutdown) {
            Write-2-Log -Message "Starting the clusternode $($ClusterNode.Name)" -Severity Information
            Start-ClusterNode -Name $ClusterNode.Name -Cluster $MyCluster | Out-Null
        }
    }
}

catch {
    $ErrorMessage = $_.Exception.Message
    $ErrorLine = $_.InvocationInfo.Line
    $ExceptionInner = $_.Exception.InnerException
    Write-2-Log -Message "!Error occurred!:" -Severity Error
    Write-2-Log -Message $ErrorMessage -Severity Error
    Write-2-Log -Message $ExceptionInner -Severity Error
    Write-2-Log -Message $ErrorLine -Severity Error
    Write-2-Log -Message "Bailing out - Script execution stopped" -Severity Error
    Break
}

Start-sleep -Seconds 15
Write-2-Log -Message "Grabbing all VMs on all clusternodes to start them up" -Severity Information
$AllVMs = Get-VM -ComputerName ($ClusterNodes)
Write-2-Log -Message "We are starting all stopped VMs" -Severity Information
StartVMs $AllVMs
#Hit it again ...
$AllVMs = Get-VM -ComputerName ($ClusterNodes)
StartVMs $AllVMs

The script is above, as promised. If you use this without testing in a production environment and it blows up in your face you are going to get fired and it is your fault. You can use it both to introduce and to fix the issue. The actions are logged in the directory where the script is run from.

PowerShell Script to Monitor a web service

Introduction

Recently I was involved in troubleshooting a load balanced web service. That led me to quickly write a PowerShell script to monitor a web service. The original web service was actually not the problem (well, not this time, after we’d set it to recycle a lot more until someone fixes the code, and bar the fact that there is no health check on the load balancer!?). It “failed” as it didn’t really handle another failed web service it depended on very well, so that was unclear during initial troubleshooting. That web service is not highly available, bar manually switching over to a “stand-by” server via ARR, no load balancing. But that’s another discussion.

The culprit & the state of our industry

When we found the problematic web service and saw it ran on Tomcat we tried restarting the Tomcat service, but that didn’t help. Rebooting the servers on which it ran didn’t help either. Until someone sent us a document with the restart procedure for those servers. This also stated the Catalina folder needs to be deleted for this to work and get the service back up and running. It also stated they often needed to do this twice. Well, OK … Based on that note we worked under their assumption that nothing in that folder is needed, as nothing was said about safeguarding any of it.

Having said that, why on earth over all those years the developers did not find out what is causing the issue and fix it beats me. For years and years they’ve been doing this manually. Sometimes several days a week, sometimes multiple times a day. On several servers. Good luck when no one is around to do so, or knows the process. The doc was from a developer. A developer in what is supposed to be a DevOps environment. No one ever made the effort to find out what makes the web service crash or to automate recovery.

PowerShell Script to Monitor a web service

I think it’s safe to say I won’t get them to any form of site resilience engineering soon. But I did leave them with a script they can schedule to automate their manual actions.  This does mean that an “ordinary” restart of the server does not fix any issues with the web service. So, ideally this script is also run at server startup!
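
Running the script at server startup could look like the sketch below. It is an assumption on my part, not something from the memo: the function name Get-StartupTaskArguments is hypothetical, the task name with the AtStartup suffix is made up, and the script path is the one from the scheduled task example in the script header. It uses the ONSTART schedule type of Schtasks.exe.

```powershell
# Hypothetical helper: builds the Schtasks.exe arguments for a task that
# runs the monitoring script once at every system start.
function Get-StartupTaskArguments {
    param(
        [string]$TaskName = 'MonitorMyWebServiceAtStartup',
        [string]$ScriptPath = 'C:\SysAdmin\Scripts\MonitorMyWebService.ps1'
    )

    @(
        '/CREATE',
        '/TN', $TaskName,
        '/TR', "Powershell.exe $ScriptPath",
        '/RU', 'SYSTEM',
        '/RL', 'HIGHEST',
        '/SC', 'ONSTART',   # run at system startup instead of on a time schedule
        '/F'
    )
}

# Usage from an elevated prompt on the Tomcat server:
# $taskArgs = Get-StartupTaskArguments
# & Schtasks.exe $taskArgs
```

This task complements the recurring one rather than replacing it: the recurring task catches crashes during the day, the startup task covers the case where a plain reboot leaves the web service broken.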

The script has basic error handling and logging, but it’s a quick fix for a manual process so it’s not a spic & span script. But it’s enough to do the job for now and hopefully inspire them to do better. It is 2018 after all, and even Site Resilience Engineering needs a new incarnation in this fashion-driven industry.

I’ve included this PowerShell script to monitor a web service below as an example and reference to my future self. Enjoy.

<#
Author: Didier Van Hoye
Date: 2018/09/24
version: 0.9.1
Blog: https://blog.workinghardinit.work
Twitter: @WorkingHardInIT

This PowerShell script automates the restart of Tomcat7 when needed. The need is based
on whether the web service running on Tomcat7 returns HTTP status 200 or not.
The work this script does is based on the memo that describes the manual procedure.
It takes away the manual reactive actions that they did multiple days per week, sometimes
multiple times per day.

You can register this script as a scheduled task to run every X minutes.
Below is an example. NOTE LINE WRAPS!!!
Schtasks.exe /CREATE /TN MonitorMyWebService /TR "Powershell.exe C:\SysAdmin\Scripts\MonitorMyWebService.ps1" 
/RU SYSTEM /RL HIGHEST /F /SC DAILY /RI 15 /ST 00:00

Having said that, why on earth over all those years the developers did not find out
what was causing the issue and fixed that beats me. Also since the need to have the catalina
folder deleted for this to work we work under their assumption nothing in there is needed.
This does mean that an "ordinary" restart of the server does not fix any issues with the web service.
So, ideally this script is also run at server startup.

The script logs its findings and actions to a log file in the script directory.
#>
$ErrorActionPreference = "Stop"
$FolderToDelete = "C:\Program Files\Apache Software Foundation\Tomcat 7.0\work\Catalina"
$MystatusRunning = 'Running'
$MyStatusStopped = 'Stopped'
$MyService = "Tomcat7"
$MyWebService = "https://mywebservice.company.com/metadatasearch"
$MyWebServiceStatus = 0
$MyServiceCheckLogFile = "MetaDataMonitor"

#region CheckWebServiceStatus 
Function CheckWebService() {
    [CmdletBinding()]
    param(
        [String]$WebService
    )
    try {
        # Create new web request.
        $HTTP_Request = [System.Net.WebRequest]::Create($WebService)

        # Get a response from the site
        $HTTP_Response = $HTTP_Request.GetResponse()

        # Cast the status of the service to an integer
        $HTTP_Status = [int]$HTTP_Response.StatusCode

        # Don't litter :-) Close the response before returning,
        # as nothing after a Return statement is executed.
        $HTTP_Response.Close()

        # 200 means all is OK, anything else means the service might be down.
        Return $HTTP_Status
    }
    
    Catch {
        #Write-Host   $_.Exception.InnerException
        if ($_.Exception.InnerException -contains "The remote server returned an error: (500) Internal Server Error.") {
            $HTTP_Status = [int]500
        }
         else {
           $HTTP_Status = [int]999
         }   
        
        Return $HTTP_Status 
    }
    Finally {

    }
}
#endregion

#region Write-2-Log
function Write-2-Log {
    [CmdletBinding()]
    param(
        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [string]$Message,

        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [ValidateSet('Information','Warning','Error')]
        [string]$Severity = 'Information'
    )

    $Date = get-date -format "yyyyMMdd"
    [pscustomobject]@{
        Time = (Get-Date -f g)
        Message = $Message
        Severity = $Severity
        
    } | Export-Csv -Path "$PSScriptRoot\$MyServiceCheckLogFile-$Date.log" -Append -NoTypeInformation
}
#endregion

Write-2-Log -Message "Starting Web Service Status check" -Severity Information
$MyWebServiceStatus = CheckWebService $MyWebService

<#
For some reason once might not be enough
So we strafe them twice - what's worth shooting once is worth shooting twice.
#>

For ($counter = 1; $counter -le 2; $counter++) {

    Try {       
        
        If ($MyWebServiceStatus -ne 200) {
            #Write-Host "The webservice has a problem"
            Write-2-Log -message "The webservice has a problem as it did not return HTTP Status 200" -Severity Warning
            #Stop the TomCat service if it is running
            $ServiceObject = $Null
            $ServiceObject = Get-Service $MyService
            If ($ServiceObject | Where-Object {$_.Status -eq $myStatusRunning}) {
                #Write-Host "Running"
                Write-2-Log -Message "Stopping $MyService ..." -Severity Information
                Stop-Service $MyService
                $ServiceObject.WaitForStatus($MyStatusStopped, "00:00:05")
                #Get-Service $MyService
                Write-2-Log -Message "$MyService has been stopped..." -Severity Information
                #Write-Host "Stopped"
            }

            #Write-Host "Delete folder"
            if (Test-path $FolderToDelete -PathType Container) {    
                #Write-Host Folder Exists
                Write-2-Log -Message "The $FolderToDelete exists ..." -Severity Information
                #Delete the folder & its contents
                Get-ChildItem -Path "$FolderToDelete\*" -Recurse | Remove-Item -Force -Recurse
                Write-2-Log -Message "$FolderToDelete content has been recursively deleted ..." -Severity Information
                Remove-Item $FolderToDelete -Recurse -force
                Write-2-Log -Message "$FolderToDelete has been deleted ..." -Severity Information
                #Write-Host Folder Deleted
            }

            #Start the TomCat service
            $ServiceObject = $Null
            $ServiceObject = Get-Service $MyService 
            if ($ServiceObject | Where-Object {$_.status -eq $MyStatusStopped} ) {
                        
                Write-2-Log -Message "Starting $MyService ..." -Severity Information
                Start-Service $MyService
                $ServiceObject.WaitForStatus($MyStatusRunning, "00:00:05")
                #Get-Service $MyService
                Write-2-Log -Message "$MyService has been started..." -Severity Information
                #Write-Host Started
            }  
        }       
    }
    catch {
        Write-2-Log -Message "Bailing out of the script due to error" -Severity Warning
        Write-2-Log -Message $_.Exception.Message -Severity Error
        Write-2-Log -Message $_.InvocationInfo.PositionMessage -Severity Error
        Write-2-Log -Message $_.FullyQualifiedErrorID -Severity Error
        Write-2-Log -Message "Stopping Web Service Status check" -Severity Information
    }
    Finally {
        Start-Sleep -Seconds 5
    }    
}
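
The header comment of the script suggests also running it at server startup. A minimal sketch of such a registration, assuming the same script path as in the example in the header. The task name MonitorMyWebServiceAtStartup is made up for illustration, and the command needs to run from an elevated prompt.

```powershell
# Sketch only: run the monitoring script once at every system startup.
# The task name is hypothetical; the script path is the one used in the header example.
Schtasks.exe /CREATE /TN MonitorMyWebServiceAtStartup /TR "Powershell.exe -NoProfile -File C:\SysAdmin\Scripts\MonitorMyWebService.ps1" /RU SYSTEM /RL HIGHEST /F /SC ONSTART
```

This lives next to the recurring task rather than replacing it. One task covers the reboot, the other covers the periodic check.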

How do you know all this shit?

Sometimes I get asked “How do you know all this shit?”, especially after fixing an issue. Well, partially because I read and experiment a lot. Partially through experience. That’s it. It doesn’t come to me in visions, dreams or by a 25Gbps fibre up-link to the big brain in the cloud.

It takes time and effort. That’s it. Time is priceless and we all have 24 hours in a day. Effort is something we chip into the equation. That has a cost and as a result a price. And as with everything there is a limit to what one can do. So where I spend my time is a balance between need, interest, ROI, fun factor and avoiding BS.

What’s also important to know is that I know far less than I would like to. I mean that. I have met so many people that are smarter, quicker, better and more entrepreneurial than me that … it would be demotivating. But it isn’t, I just enjoy the insights & education it brings me. On top of that it’s a welcome change from modern landscape office chatter and helps maintain / restore some sort of faith in mankind I guess. It’s also fun.

Fix my problems already!

That’s fine, you say, but “why can’t you fix all our problems then, huh?” That’s easy. I don’t know enough to fix ALL your problems. I also probably don’t want to fix them as you don’t or won’t pay me enough to fix them. And even if you did, I might not have time for it or you might be beyond saving. A lot of your issues are being created by a lack of context, insight & understanding. I call that wishful management. Basically it means that you’re digging yourself into a hole faster than we can get you out. It’s not even a question of skills, resources or money, it’s just hopeless. A bit like Enterprise IT at times.

You’re being negative

“Geez, such negativity”. No, it’s not. It’s recognizing the world is not perfect. That not all issues with technical solutions are technology induced. It’s about realizing that things can and will go wrong.

So part of my endeavors is making sure I know what to do when the shit hits the fan. To be able to do so you need to understand the technology used, build it, break it, recover it. The what & how depends on the solution at hand (cattle versus holy cows). Failure is not an option you choose “not to select”. Failure is guaranteed. By the time I fail for real I try to be prepared by repeated failure in the lab.

I spend time in the lab for hands-on testing. I also spend time at my desk, on the road, in my comfy chair reading, scribbling down notes, writing & drawing concepts & ideas. Nothing else. And during walks I tend to process all my impressions. It’s something that helps me, so I make room for that. I highly recommend that you figure out what works for you. A favorite of mine is to grab a coffee and sit down. Without my e-mail open, with my phone muted, without a calendar nagging me or the pseudo crisis of the moment stealing my time.

That in combination with actually working with the technology is what brings the understanding, the insight, the context. My core team members and network buddies can always get a hold of me in that reserved time and I will answer their call. Why? Because I know they won’t abuse it and have a serious need, not some self-inflicted crisis which looks bad but poses little danger.