PowerShell Script to Monitor a web service

Introduction

Recently I was involved in troubleshooting a load-balanced web service. That led me to quickly write a PowerShell script to monitor a web service. The original web service was actually not the problem (well, not this time, and after we set it to recycle a lot more until someone fixes the code, and bar the fact there is no health check on the load balancer!?). It "failed" because it didn't handle the failure of another web service it depended on very well, which made things unclear during initial troubleshooting. That other web service is not highly available, bar manually switching over to a "standby" server via ARR, no load balancing. But that's another discussion.

The culprit & the state of our industry

When we found the problematic web service and saw it ran on Tomcat, we tried restarting the Tomcat service, but that didn't help. Rebooting the servers on which it ran didn't help either. Until someone sent us a document with the restart procedure for those servers. It stated that the Catalina folder needs to be deleted for the restart to work and get the service back up and running. It also stated they often needed to do this twice. Well, OK … Based on that note we worked under their assumption that nothing in that folder is needed, as nothing was said about safeguarding any of it.

Having said that, why on earth, over all those years, the developers did not find out
what was causing the issue and fix it beats me. For years and years they have been doing this manually. Sometimes several days a week, sometimes multiple times a day. On several servers. Good luck when no one is around to do so, or knows the process. The doc was from a developer. A developer in what is supposed to be a DevOps environment. No one ever made the effort to find out what makes the web service crash or to automate recovery.

PowerShell Script to Monitor a web service

I think it's safe to say I won't get them to any form of site resilience engineering soon. But I did leave them with a script they can schedule to automate their manual actions. Because the fix requires deleting the Catalina folder, an "ordinary" restart of the server does not fix issues with the web service. So, ideally this script is also run at server startup!
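For completeness, registering it to also run at boot could look something like the line below (one line; the task name is just an example and the script path is the same one used in the scheduled task example inside the script):

Schtasks.exe /CREATE /TN MonitorMyWebServiceAtStartup /TR "Powershell.exe C:\SysAdmin\Scripts\MonitorMyWebService.ps1" /RU SYSTEM /RL HIGHEST /F /SC ONSTART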

The script has basic error handling and logging, but it's a quick fix for a manual process, so it's not a spic-and-span script. Still, it's enough to do the job for now and hopefully inspire them to do better. It is 2018 after all, and even site resilience engineering needs a new incarnation in this fashion-driven industry.

I've included this PowerShell script to monitor a web service below as an example and a reference for my future self. Enjoy.

<#
Author: Didier Van Hoye
Date: 2018/09/24
version: 0.9.1
Blog: https://blog.workinghardinit.work
Twitter: @WorkingHardInIT

This PowerShell script automates the restart of Tomcat7 when needed. The need is based
on whether the web service running on Tomcat7 returns HTTP status 200 or not.
The work this script does is based on the memo that describes the manual procedure.
It takes away the manual, reactive actions that they did multiple days per week, sometimes
multiple times per day.

You can register this script as a scheduled task to run every X minutes.
Below is an example. NOTE LINE WRAPS!!!
Schtasks.exe /CREATE /TN MonitorMyWebService /TR "Powershell.exe C:\SysAdmin\Scripts\MonitorMyWebService.ps1" 
/RU SYSTEM /RL HIGHEST /F /SC DAILY /RI 15 /ST 00:00

Having said that, why on earth over all those years the developers did not find out
what was causing the issue and fix it beats me. Also, since the Catalina folder needs to be
deleted for this to work, we work under their assumption that nothing in there is needed.
This does mean that an "ordinary" restart of the server does not fix any issues with the web service.
So, ideally this script is also run at server startup.

The script logs its findings and actions to a log file in the script directory.
#>
$ErrorActionPreference = "Stop"
$FolderToDelete = "C:\Program Files\Apache Software Foundation\Tomcat 7.0\work\Catalina"
$MystatusRunning = 'Running'
$MyStatusStopped = 'Stopped'
$MyService = "Tomcat7"
$MyWebService = "https://mywebservice.company.com/metadatasearch"
$MyWebServiceStatus = 0
$MyServiceCheckLogFile = "MetaDataMonitor"

#region CheckWebServiceStatus 
Function CheckWebService() {
    [CmdletBinding()]
    param(
        [String]$WebService
    )
    try {
        # Create new web request.
        $HTTP_Request = [System.Net.WebRequest]::Create($WebService)

        # Get a response from the site
        $HTTP_Response = $HTTP_Request.GetResponse()

        # Cast the status of the service to an integer
        $HTTP_Status = [int]$HTTP_Response.StatusCode
 
        # Don't litter :-)
        $HTTP_Response.Close()

        If ($HTTP_Status -eq 200) {
            ##Write-Host "All is OK!"
            Return $HTTP_Status
        }
        Else {
            ##Write-Host "The service might be down!"
            Return $HTTP_Status
        }
    }
    
    Catch {
        #Write-Host   $_.Exception.InnerException
        if ($_.Exception.InnerException -like "*The remote server returned an error: (500) Internal Server Error*") {
            $HTTP_Status = [int]500
        }
         else {
           $HTTP_Status = [int]999
         }   
        
        Return $HTTP_Status 
    }
    Finally {

    }
}
#endregion

#region Write-2-Log
function Write-2-Log {
    [CmdletBinding()]
    param(
        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [string]$Message,

        [Parameter()]
        [ValidateNotNullOrEmpty()]
        [ValidateSet('Information','Warning','Error')]
        [string]$Severity = 'Information'
    )

    $Date = get-date -format "yyyyMMdd"
    [pscustomobject]@{
        Time = (Get-Date -f g)
        Message = $Message
        Severity = $Severity
        
    } | Export-Csv -Path "$PSScriptRoot\$MyServiceCheckLogFile-$Date.log" -Append -NoTypeInformation
}
#endregion

Write-2-Log -Message "Starting Web Service Status check" -Severity Information
$MyWebServiceStatus = CheckWebService $MyWebService

<#
For some reason once might not be enough
So we strafe them twice - what's worth shooting once is worth shooting twice.
#>

For ($counter = 1; $counter -le 2; $counter++) {

    Try {       
        
        If ($MyWebServiceStatus -ne 200) {
            #Write-Host "The webservice has a problem"
            Write-2-Log -message "The webservice has a problem as it did not return HTTP Status 200" -Severity Warning
            #Stop the TomCat service if it is running
            $ServiceObject = $Null
            $ServiceObject = Get-Service $MyService
            If ($ServiceObject | Where-Object {$_.Status -eq $myStatusRunning}) {
                #Write-Host "Running"
                Write-2-Log -Message "Stopping $MyService ..." -Severity Information
                Stop-Service $MyService
                $ServiceObject.WaitForStatus($MyStatusStopped, "00:00:05")
                #Get-Service $MyService
                Write-2-Log -Message "$MyService has been stopped..." -Severity Information
                #Write-Host "Stopped"
            }

            #Write-Host "Delete folder"
            if (Test-path $FolderToDelete -PathType Container) {    
                #Write-Host Folder Exists
                Write-2-Log -Message "The $FolderToDelete exists ..." -Severity Information
                #Delete the folder & its contents
                Get-ChildItem -Path "$FolderToDelete\\*" -Recurse | Remove-Item -Force -Recurse
                Write-2-Log -Message "$FolderToDelete content has been recursively deleted ..." -Severity Information
                Remove-Item $FolderToDelete -Recurse -force
                Write-2-Log -Message "$FolderToDelete has been deleted ..." -Severity Information
                #Write-Host Folder Deleted
            }

            #Start the TomCat service
            $ServiceObject = $Null
            $ServiceObject = Get-Service $MyService 
            if ($ServiceObject | Where-Object {$_.status -eq $MyStatusStopped} ) {
                        
                Write-2-Log -Message "Starting $MyService ..." -Severity Information
                Start-Service $MyService
                $ServiceObject.WaitForStatus($MyStatusRunning, "00:00:05")
                #Get-Service $MyService
                Write-2-Log -Message "$MyService has been started..." -Severity Information
                #Write-Host Started
            }  
        }       
    }
    catch {
        Write-2-Log -Message "Bailing out of the script due to error" -Severity Warning
        Write-2-Log -Message $_.Exception.Message -Severity Error
        Write-2-Log -Message $_.InvocationInfo.PositionMessage -Severity Error
        Write-2-Log -Message $_.FullyQualifiedErrorID -Severity Error
        Write-2-Log -Message "Stopping Web Service Status check" -Severity Information
    }
    Finally {
        Start-Sleep -Seconds 5
    }    
}

How do you know all this shit?

Sometimes I get asked, "How do you know all this shit?" Especially after fixing an issue. Well, partially because I read and experiment a lot. Partially experience. That's it. It doesn't come to me in visions, dreams or via a 25 Gbps fibre uplink to the big brain in the cloud.

It takes time and effort. That’s it. Time is priceless and we all have 24 hours in a day. Effort is something we chip into the equation. That has a cost and as a result a price. And as with everything there is a limit to what one can do. So where I spend my time is a balance between need, interest, ROI, fun factor and avoiding BS.

What's also important to know is that I know far less than I would like to. I mean that. I have met so many people who are smarter, quicker, better and more entrepreneurial than me that … it could be demotivating. But it isn't; I just enjoy the insights and education it brings me. On top of that, it's a welcome change from modern landscape-office chatter and helps maintain or restore some sort of faith in mankind, I guess. It's also fun.

Fix my problems already!

That's fine, you say, but "why can't you fix all our problems then, huh?" That's easy. I don't know enough to fix ALL your problems. I also probably don't want to fix them, as you don't or won't pay me enough to fix them. And even if you did, I might not have time for it, or you might be beyond saving. A lot of your issues are created by a lack of context, insight & understanding. I call that wishful management. Basically it means that you're digging yourself into a hole faster than we can get you out. It's not even a question of skills, resources or money; it's just hopeless. A bit like enterprise IT at times.

You’re being negative

"Geez, such negativity." No, it's not. It's recognizing that the world is not perfect and that not all issues with technical solutions are technology induced. It's about realizing that things can and will go wrong.

So part of my endeavors is making sure I know what to do when the shit hits the fan. To be able to do so you need to understand the technology used: build it, break it, recover it. The what & how depends on the solution at hand (cattle versus holey cows). Failure is not an option you choose "not to select". Failure is guaranteed. By the time I fail for real, I try to be prepared through repeated failure in the lab.

I spend time in the lab for hands-on testing. I also spend time at my desk, on the road and in my comfy chair reading, scribbling down notes, writing & drawing concepts & ideas. Nothing else. And during walks I tend to process all my impressions. It's something that helps me, so I make room for it. I highly recommend that you figure out what works for you. A favorite of mine is to grab a coffee and sit down. Without my e-mail open, with my phone muted, without a calendar nagging me or the pseudo-crisis of the moment stealing my time.

That, in combination with actually working with the technology, is what brings the understanding, the insight, the context. My core team members and network buddies can always get a hold of me in that reserved time and I will answer their call. Why? Because I know they won't abuse it and will have a serious need, not some self-inflicted crisis which looks bad but poses little danger.

When using file shares as backup targets you should leverage continuous available SMB 3 file shares

Introduction

When using file shares as backup targets, you should leverage Continuously Available SMB 3 file shares. For now, at least. A while back Anton Gostev wrote a very interesting piece in his "The Word from Gostev". It was about an issue they saw with people using SMB 3 file shares as backup targets with Veeam Backup & Replication. To some it was a reason to cry wolf. But it's a probably too little-known issue that can, and as such might (will), occur. You need to be aware of it to make good decisions and give good advice.

I'm in the business of building rock-solid solutions that range from highly available to continuously available. This means I'm always looking into the benefits and drawbacks of design choices. By that I mean I study, test and verify them as well. I don't do "paper proofs of concept". Those are borderline fraud.

So, what’s going on and what can you do to mitigate the risk or avoid it all together?

Setting the scenario

Your backup software (in our case Veeam Backup & Replication) running on Windows leverages an SMB 3 file share as a backup target. This could be a Windows Server file share, but it doesn't have to be. It could be a 3rd party appliance or storage array.


The SMB client

The client is the SMB 3 client Microsoft delivers in the OS (the version depends on the OS version), so this client is under the control of Microsoft. Let's face it, the source in these scenarios is a Hyper-V host/cluster or a Windows SMB 3 file share, clustered or not.

The SMB server

In regard to the target, i.e. the SMB server, you have a couple of possibilities: Microsoft or 3rd party.

If it's a third-party SMB 3 implementation on Linux or an appliance, you might not even know what is used under the hood as the OS and 3rd party SMB 3 solution. It could be a storage vendor's native SMB 3 implementation on their storage array, or a simple commodity NAS whose vendor bought a 3rd party solution to leverage. It might be highly available, or in many (most?) cases it is not. It's hard to know whether the 3rd party implements / leverages the full capabilities of the SMB 3 stack as Microsoft does. You might not know whether there are any bugs in there or not.

You get the picture. If you bank on appliances, find out and test it (trust but verify). But let's assume its capabilities are on par with what Windows offers, which means the subject being discussed applies to both 3rd party offerings and Windows Server.
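As a quick "trust but verify" check from the Windows side, you can see which SMB dialect was actually negotiated with a target once a session is open. A minimal sketch, run on the backup server while the share is in use (the server name below is just a placeholder):

#Show the SMB sessions the backup server has open and the negotiated dialect
Get-SmbConnection | Select-Object ServerName, ShareName, Dialect
#Or narrow it down to your backup target (placeholder name)
Get-SmbConnection -ServerName MyBackupTarget | Select-Object ServerName, ShareName, Dialect, NumOpens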

When the target is Windows Server, we are talking about SMB 3 file shares that are either continuously available or not. For backup targets, general purpose file shares will do. You could even opt to leverage SOFS (on S2D, for example). In this case you know what's implemented in what version, and you get bug fixes from Microsoft.

When you have continuously available (CA) SMB 3 shares, you should be able to sleep soundly. SMB 3 has you covered. The risk we are discussing is related to non-CA SMB 3 file shares.
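On a Windows Server file server, you can quickly verify whether a given share is continuously available. A minimal sketch, with a hypothetical share name:

#Check whether the share has continuous availability enabled (run on the file server)
Get-SmbShare -Name Backups | Select-Object Name, ScopeName, Path, ContinuouslyAvailable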

What could go wrong?

Let's walk through this. When your backup software writes to an SMB 3 share, it leverages the SMB 3 client & server in the SMB 3 stack. Unlike when Veeam uses its own data mover, all the cool data persistence stuff is handled by Windows transparently. The backup software literally hands off the job to Windows. Which is why you can also leverage SMB Multichannel and SMB Direct with your backups if you so desire. Read Veeam Backup & Replication leverages SMB Multichannel and Veeam Backup & Replication Preferred Subnet & SMB Multichannel for more on this.
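If you want to verify that SMB Multichannel is actually kicking in for the connection to your backup target, you can check on the backup server. Just a quick, illustrative check:

#List the multichannel connections; the RSS / RDMA capable columns tell you what the paths support
Get-SmbMultichannelConnection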

If you are writing to a non-CA SMB 3 share, your backup software receives the message that the data has been written. What this actually means is that the data is cached in the SMB client's "queue" of data to write, but it might not have been written to the storage yet.

For short interruptions this is survivable, and for Office and the like this works well and delivers fast performance. If the connection is interrupted or the share is unavailable, the queue keeps the data in memory for a while. So, if the connection is restored, the data can still be written. The SMB 3 client is smart.

However, this has its limits. The data cache in the queue doesn't exist eternally. If the connectivity loss or file share unavailability lasts too long, the data in the SMB 3 client cache is lost. But it was not written to storage! To add a little insult to injury, the SMB client sends back "we're good" even when the share has been unreachable for a while.

For backups this isn't optimal. Actually, the alarm bells should start ringing when it is about backups. Your backup software got a message that the data has been written and doesn't know any better. But it is not on the backup target. This means the backup software will run into issues with corrupted backups sooner or later (the next backup, restores, synthetic full backups, merges, whatever comes first).

Why did they make it this way?

This is OK as default behavior. It works just fine for Office files and most knowledge worker client software that have temp files, auto recovery and all such lovely capabilities, and where the work is mostly individual and interactive. Those applications are resilient to this by nature. Mind you, all my SMB 3 file share deployments are clustered and highly available where appropriate. By "appropriate" I mean when we don't have offline caching for those shares as a requirement, as those two don't mix well (https://blogs.technet.microsoft.com/filecab/2016/03/15/offline-files-and-continuous-availability-the-monstrous-union-you-should-not-consecrate/). But when you know what you're doing it rocks. I can actually fail over my file server roles all day long for patching, maintenance & fun when the clients do talk SMB 3. Oh, and it was a joy to move that data to new SANs under the hood. More on that perhaps in another post. But I digress.

You need adequate storage in all use cases

This is a no-brainer. Nothing will save you if the target storage isn't up to the task. Not the Veeam data mover, nor SMB 3 shares with continuous availability. Let's be very clear about this. Even at the cost-effective side of the equation, the storage has to be of sufficiently decent quality to prevent data loss. That means decent controllers with battery-backed cache as a safeguard, etc. Whether that's a SAN, a "simple" RAID controller or pass-through HBAs for Storage Spaces doesn't matter. You have to have it. Putting your data on SATA drives without any safeguard is a sure way of risking data loss. That's as simple as it gets. You don't do that, unless you don't care. And if you didn't care, you would not be reading this!

Can this be fixed?

Well, as a non-SMB 3 developer, I would say we need an option so that the SMB 3 client can be configured not to report success until the data has effectively been written on the target, or at least has landed somewhere on quality, cache-protected storage.

This option does not exist today. I do not work for Microsoft, but I know some people there and I'm pretty sure they want to fix it. I'm just not sure how big of a priority it is at the moment. For me it's important that when a backup application writes to a non-continuously available file share, it can request that the write not be cached, and the SMB server says "OK, got it, I will behave accordingly". Now, the details of the implementation will be different, but you get the message, right?

I would like to make the case that it should be a configurable option. It is not needed for all scenarios and it might (will) have an impact on performance. How big that would be, I have no clue. I'm just a blogger who does IT as a job, not a principal PM at Microsoft or anything of the sort.

If you absolutely want to make sure, use clustered, continuously available file shares. Works like a charm. Read the blog post Continuous available general purpose file shares & ReFSv3 provide high available backup targets; there is even one of my not-so-professional videos showcasing this.

It's also important not to panic. Most of you might never even have heard of or experienced this. But depending on the use case and the quality of the network and processes, you might. In a backup scenario this is not something that makes for a happy day.

The cry wolf crowd

I'll be blunt. WARNING: take a hike if you have a smug "Windoze sucks" attitude. If you want to deal dope, you shouldn't be smoking too much of your own stuff, but primarily know it inside out. NFS in all its varied implementations has potential issues as well. So do your due diligence with any solution you recommend. Trust but verify, remember?! Actually, an example of one such issue was given by Veeam for an appliance with NFS. Guess what, everyone has issues. Choose your poison, drink it and let others choose theirs. Condescending remarks just make you look bad every time. And guess what, that impression tends to last. Now, on the positive side, I hear that caching can be disabled on modern NFS client implementations. So the potential issue is known and it is being addressed there as well.

Conclusion

Don't panic. I just discussed a potential issue that can occur and that you should be aware of when deciding on a backup target. If you have rock-solid networking and great server management processes, you can go far without issues, but that's not 100% failproof. As I'm in the business of building the best possible solutions, it's something you need to be aware of.

But know that these issues can occur, and when and why, so you can manage the risk optimally. Making Windows Server SMB 3 file shares continuously available will protect against this effectively. It does require failover clustering. But at least now you know why I say that when using file shares as backup targets you should leverage continuously available SMB 3 file shares.
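On a clustered file server role, that boils down to creating (or reconfiguring) the share with continuous availability enabled. A minimal sketch with hypothetical share, path, scope and account names:

#Create a share on a clustered file server role with continuous availability enabled
New-SmbShare -Name Backups -Path D:\Backups -ScopeName MyClusteredFileServer -ContinuouslyAvailable $true -FullAccess "DOMAIN\BackupServiceAccount"
#Or enable it on an existing clustered share
Set-SmbShare -Name Backups -ScopeName MyClusteredFileServer -ContinuouslyAvailable $true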

When you buy appliances or 3rd party SMB 3 solutions, this issue also exists, so be extra diligent, even with highly available shares. Make sure it works as it should!

I hope Microsoft resolves this issue as soon as possible. I'm sure they want to. They want their products to be the best and to address any possible concerns you might have.

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups

Introduction

It's no secret that while guest clustering with VHD Sets works very well, we've had some struggles in regard to host level backups. Right now I leverage Veeam Agent for Windows (VAW) to do in-guest backups. The most recent versions of VAW support Windows Failover Clustering. I'd love to leverage host level backups, but I struggled to make this reliable for quite a while. As it turned out recently, there are some virtual machine permission issues involved that we need to fix. Both Microsoft and Veeam have published guidance on this in a KB article. We automated correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups.

The KB articles

In early August, Microsoft published a KB article with all the tips for when things fail: Errors when backing up VMs that belong to a guest cluster in Windows. Veeam also recapitulated the conditions and settings needed to leverage guest clustering while performing host level backups. The Veeam article is Backing up Hyper-V guest cluster based on VHD set. Read these articles carefully and make sure everything you need to do has been done.

For some reason another prerequisite is not mentioned in these articles. It is, however, discussed in ConfigStoreRootPath cluster parameter is not defined and here: https://docs.microsoft.com/en-us/powershell/module/hyper-v/set-vmhostcluster?view=win10-ps. You will need to set this to create the proper Hyper-V collections needed for recovery checkpoints on VHD Sets. It is a little-known setting with very little documentation.
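For reference, setting and checking it with PowerShell looks something like this (the cluster name matches my lab cluster used further down; the folder on the CSV is just an example):

#Point the host cluster's configuration store for collections to a location on shared storage
Set-VMHostCluster -ClusterName "LAB-CLUSTER" -ConfigStoreRootPath "C:\ClusterStorage\NTFS-03\HyperVConfigStore"
#Verify the current value
Get-VMHostCluster -ClusterName "LAB-CLUSTER"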

But the big news here is fixing a permissions related issue!

The latest addition to the list of attention points is a permission issue. These permissions are not correct by default for the guest cluster VMs' shared files. This leads to a hard-to-pinpoint error.

Error: Event ID 19100, Hyper-V-VMMS: 'BackupVM' background disk merge failed to complete: General access denied error (0x80070005). To fix this issue, the folder that holds the VHDS files and their snapshot files must be modified to give the VMMS process additional permissions. To do this, follow these steps for correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups.

Determine the GUIDs of all VMs that use the folder. To do this, start PowerShell as an administrator and then run the following command:

get-vm | fl name, id
Output example:
Name : BackupVM
Id : d3599536-222a-4d6e-bb10-a6019c3f2b9b

Name : BackupVM2
Id : a0af7903-94b4-4a2c-b3b3-16050d5f80f2

For each VM GUID, assign the VMMS process full control by running the following command:
icacls <Folder with VHDS> /grant "NT VIRTUAL MACHINE\<VM GUID>":(OI)F

Example:
icacls "c:\ClusterStorage\Volume1\SharedClusterDisk" /grant "NT VIRTUAL MACHINE\a0af7903-94b4-4a2c-b3b3-16050d5f80f2":(OI)F
icacls "c:\ClusterStorage\Volume1\SharedClusterDisk" /grant "NT VIRTUAL MACHINE\d3599536-222a-4d6e-bb10-a6019c3f2b9b":(OI)F

My little PowerShell script

The above is tedious manual labor with a lot of copy-pasting, and it is time-consuming at best. With larger guest clusters the probability of mistakes increases. To fix this, we wrote a PowerShell script to handle it for us.

#Didier Van Hoye
#Twitter: @WorkingHardInIT 
#Blog: https://blog.Workinghardinit.work
#Correct shared VHD Set disk permissions for all nodes in guests cluster

$GuestCluster = "DemoGuestCluster"
$HostCluster = "LAB-CLUSTER"

$PathToGuestClusterSharedDisks = "C:\ClusterStorage\NTFS-03\GuestClustersSharedDisks"


$GuestClusterNodes = Get-ClusterNode -Cluster $GuestCluster

ForEach ($GuestClusterNode in $GuestClusterNodes) {

    #Passing the cluster name to -ComputerName only works in W2K16 and up.
    #As this is about VHDS you need to be running 2016, so no worries here.
    $GuestClusterNodeGuid = (Get-VM -Name $GuestClusterNode.Name -ComputerName $HostCluster).Id

    Write-Host $GuestClusterNodeGuid "belongs to" $GuestClusterNode.Name

    $IcalsExecute = """$PathToGuestClusterSharedDisks""" + " /grant " + """NT VIRTUAL MACHINE\" + $GuestClusterNodeGuid + """:(OI)F"
    Write-Host "Executing " $IcalsExecute
    CMD.EXE /C "icacls $IcalsExecute"
}

The script provides some feedback on what is happening as it runs: it prints the GUID of each guest cluster node VM and the icacls command it executes.


PowerShell for the win. This saves you some searching and typing, and potentially some mistakes along the way. Have fun. More testing is underway to make sure things are now predictable and stable. We'll share our findings with you.