Using VEEAM FastSCP for Microsoft Azure to help protect my blog

My buddies in IT know some of my mantras, such as the fact that I like “* in depth”. Backup in depth, for example, which is just my variant of the 3-2-1 rule for backups. Things go wrong and relying on a single way to recover is risky. “One is none, two is one” is just one of the mantras I live by in IT. Or at least I try to, I’m not perfect.

So besides backups in Azure I also copy the backup files I make for my blog outside of the VM, out of Azure. That means the BackWPup files and the MySQL dumps I create regularly via a scheduled job.

That copy is not made manually but is automated with VEEAM FastSCP for Microsoft Azure. It’s easy, free and it works. I’ve blogged about it before but that post might have been lost in the huge onslaught of Microsoft Ignite 2015 announcements.

It’s all quite simple. First of all you need to create a data dump location for the backups we make on our blog server. That location is covered by my backups in Azure, but VEEAM FastSCP for Microsoft Azure ensures I have an extra copy of those files that doesn’t rely on Azure.

image

 

Add your VM in Azure to VEEAM FastSCP for Microsoft Azure

image

It’s easy: specify the information you can find about your VM in the Azure management portal. Optionally you can skip the SSL requirement and certificate verification. Do note that you need to use the correct PowerShell port (endpoint) for that particular VM in your Azure subscription.

image

When successful you can browse the file system of your Azure VM.

image

Create one or more jobs (depending on what & how you’re organizing your backups)

image

Give the job a descriptive name

image

Select which folders on the Azure VM you want to back up by simply browsing to them.

image

Select the target folder on the system where VEEAM FastSCP for Microsoft Azure is running by, again, simply browsing to it.

image

Set a schedule according to your needs

image

If you need to run some PowerShell before or after a download here’s the place to do so.

image

Click Finish and hit Start Job to kick it off and test it. Here’s the WordPress Blog backup download job running.

image

By using VEEAM FastSCP for Microsoft Azure you can download folders and files to your system at home or to a virtual machine, whether that is on premises or also in the cloud. Perhaps even in AWS (IaaS) if you’re really paranoid. By doing a simple restore of your blog and changing your DNS entry you can even get it up and running again if Azure were ever the target of an attack causing a major outage. You could even keep blogging about it :-)

So do yourself a favor. Check it out!

Hit me baby one more time or the Faster Fast Ring of Windows 10 Insider Builds

Hit me baby one more time!

This blog is brought to you by Francesco V. Buccoli, a brilliant ex Hyper-V MVP who went blue badge and became a PFE. Why? Because he called me a genius, that’s why!

image

Here we go again, things are heating up in the final straight towards RTM of Windows 10. We’re now getting build 10162 right on the heels of build 10159, which basically overran people who were still downloading 10158.

No, this is not some PM in Redmond hitting the publish button by mistake again a la “Oops, I did it again”. It’s done with intent and purpose: deliver an awesome client right from the start.

So far it’s all good. The quality of these latest builds, even during the limited time we get to spend with them, is very good and shows real improvements across the board. Windows 10 should be ready for rollout at RTM/GA if the quality is this good and only improves.

We lead, we weren’t born to follow.

Now, go download it already and I’ll quit the cheesy music references ;-)

Testing Virtual Machine Compute Resiliency in Windows Server 2016

No matter what high quality gear you use, how well you design your environment and how much redundancy you build in, you will see transient failures in your environment at some point in time. In combination with the push to ever more commodity hardware and the increased use of converged deployments leveraging Ethernet, transient failures have become a more frequent occurrence than they used to be.

Failover clustering by tradition reacts very “assertively” to failures in order to provide high to continuous availability to our virtual machines. That’s great, we want it to do that, but this binary approach comes at a cost under certain conditions. When reacting too fast and too proactively to transient failures we can actually get less high or continuous availability in certain scenarios than if the cluster had just evaluated the situation a bit more cautiously. It’s for this reason that Microsoft introduced increased “Virtual Machine Compute Resiliency” to deal with intra-cluster communication failures in a Windows Server 2016 cluster.

I have helped out a number of fellow MVPs with this new feature over the past 6 months, so I dove back into my lab notes to blog about it and help you out with your own testing. The early work was done with Technical Preview v1. In that release the feature was disabled by default (the value for the cluster property “ResiliencyDefaultPeriod” was set to 0) and the keyword “Default” was used in the cluster property “ResiliencyLevel” for what is now called “IsolateOnSpecialHeartbeat”, which is no longer the default at installation. If that doesn’t confuse you yet, I’ll find another reason to tell you to move to Technical Preview v2. In TPv2 Virtual Machine Compute Resiliency is enabled and configured by default, but in TPv1 you had to enable and configure it yourself. I advise you to stop testing with v1 and move to v2 and future Technical Preview releases so that you test with the most recent bits and functionality.

Investigating the feature configuration

When testing new features in Windows Server Technical Preview Hyper-V you’re on your own once in a while, as much is not documented yet. Playing around with PowerShell helps you discover stuff. A Get-Cluster | fl * teaches us all kinds of cool stuff, such as these new cluster properties:

ResiliencyDefaultPeriod
QuarantineDuration
ResiliencyLevel

Here’s a screenshot of Windows Server 2016 TPv1 (please stop using this version and move to TPv2!)

image

Now when you’re running Windows Server 2016 TPv2 this feature has been enabled by default (ResiliencyDefaultPeriod has been filled out, as well as QuarantineDuration) and the resiliency level has been set to “AlwaysIsolate”.

image

After some lab work with this I figured out what we need to know to make VM Compute Resiliency work in our labs:

  • Make sure your cluster functional level is at version 9
  • Make sure your VMs are at version 6.X
  • Make sure the operating system of the VMs is Windows Server Technical Preview v2 (again, move away from TPv1)
  • Enable Isolation/Quarantine via PowerShell:

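(Get-Cluster).ResiliencyLevel                                # show the current value
(Get-Cluster).ResiliencyLevel = 'AlwaysIsolate'              # or use the numeric value 2
(Get-Cluster).ResiliencyLevel                                # verify the change
(Get-Cluster).ResiliencyLevel = 'IsolateOnSpecialHeartbeat'  # or use the numeric value 1
(Get-Cluster).ResiliencyLevel                                # verify the change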

Please note that all nodes need to be online to make this change in the technical preview. I got the two accepted values by trial and error, and the blog by Subhasish Bhattacharya confirms these are the only two.

  • Set the timings to some not too high and not too low value to play in the lab without having to wait too long before it’s back to normal (the values I use in my current Technical Preview lab environment are not a recommendation whatsoever, they only facilitate my testing and learning, this has nothing to do with any production environment). For lab testing I chose:

(Get-Cluster).ResiliencyDefaultPeriod = 60

Note that setting this to 0 reverts you to pre Windows Server 2016 behavior and actually disables this feature. The default is 240 seconds.

(Get-Cluster).QuarantineDuration = 300

The default is 7200 seconds, but I’m way too impatient in my lab for that, so I set the quarantine duration lower as I want to see the results of my experiments fast. Beware of just messing with this duration in production without thinking about it. Just saying!

Testing the feature and its behaviour

Then you’re ready to start abusing your cluster to demo Isolation mode & quarantine. I basically crash the Cluster service on one of the nodes in the cluster. Note that cleanly stopping the service is not good enough, as it will nicely drain that node for you, which is not what we want to see. Crash it or force stop it via Stop-Process -Name clussvc -Force.

So what do we see happen:

  • The node on which we crashed the cluster service experiences a “transient” intra-cluster communication failure. This node is placed into an Isolated state and removed from its active cluster membership.

image

  • The VMs running at version 6.2 go into the Unmonitored state. The other ones just fail over. Unmonitored means that the cluster is no longer actively managing the VM, but you can still look at the condition of the VM via PowerShell or Hyper-V Manager.

image

image

image

Based on the type of storage you’re using for your VMs the story is different:

  1. File Storage backed (SMB3/SOFS): The VM continues to run in the Online state. This is possible because the SMB share itself has no dependency on the Hyper-V cluster. Pretty cool!
  2. Block Storage backed (FC / FCoE / iSCSI / Shared SAS / PCI RAID): The VMs go to Running-Critical and are then placed in the Paused Critical state. As you have an intra-cluster communication failure (in our case losing the cluster service) the isolated node no longer has access to the Cluster Shared Volumes in the cluster, so this is the only option there is.

image

  • If the isolated node doesn’t recover from this presumed transient failure it will, after the time specified in ResiliencyDefaultPeriod (default 240 s, or 4 minutes), go into a Down state. The VMs fail over to another node in the cluster. Normally during this experiment the cluster service will come back online automatically.
  • If a node does recover but goes into isolation 3 times within 1 hour, it is placed into a Quarantine state for the time specified in QuarantineDuration (default 7200 s, or two hours). The VMs running on this node are drained to another node in the cluster. So if you crash that service repeatedly (3 times within an hour) the Hyper-V node will go into “Quarantine” status for the time specified (in our lab 5 minutes, as we set it to 300 s). The VMs will be live migrated off even if the node is up and running when the cluster service comes up again.
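To make the "3 times within 1 hour" rule a bit more tangible, here is a toy model of it in Python. This is only a sketch of the rule as described in the bullets above, not how the cluster service actually implements it. The class name, the method names and the choice of a sliding one hour window are my own assumptions for illustration.

```python
from collections import deque


class NodeQuarantineTracker:
    """Toy model of the isolate/quarantine rule; hypothetical, not cluster code."""

    def __init__(self, max_isolations=3, window_s=3600, quarantine_s=7200):
        self.max_isolations = max_isolations  # isolations that trigger quarantine
        self.window_s = window_s              # assumed sliding window of 1 hour
        self.quarantine_s = quarantine_s      # stands in for QuarantineDuration
        self._isolations = deque()            # timestamps of recent isolations
        self.quarantined_until = None

    def record_isolation(self, now_s):
        """Register an isolation at time now_s (seconds) and return the node state."""
        # Forget isolations that have fallen out of the window.
        while self._isolations and now_s - self._isolations[0] >= self.window_s:
            self._isolations.popleft()
        self._isolations.append(now_s)
        if len(self._isolations) >= self.max_isolations:
            self.quarantined_until = now_s + self.quarantine_s
            self._isolations.clear()
            return "Quarantined"
        return "Isolated"

    def is_quarantined(self, now_s):
        """True while the quarantine period started by record_isolation is running."""
        return self.quarantined_until is not None and now_s < self.quarantined_until


# Lab settings from this post: QuarantineDuration = 300 seconds.
tracker = NodeQuarantineTracker(quarantine_s=300)
states = [tracker.record_isolation(t) for t in (0, 600, 1200)]
print(states)  # the third crash within the hour puts the node in quarantine
```

With the lab value of 300 seconds the node in this model is released again 5 minutes after the third crash, which mirrors what the lab settings are meant to achieve.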

You might notice that this screenshot is from a different lab cluster. Yes, it’s a TPv1 cluster, as for some reason the Live Migration part on Quarantine is broken in my TPv2 lab. It’s a clean install, completely green field. Probably a bug.

image

It’s the frequency of failures that determines whether the node goes into quarantine for the amount of time specified. That’s a clear sign for you to investigate and make sure things are OK. The node is no longer allowed to join the cluster for a fixed time period (default: 2 hours). The reason for this is to prevent “flapping nodes” from negatively impacting other nodes and the overall cluster health. There is also a fixed (not configurable as far as I know) number of nodes that can be quarantined at any given time: 20% of the nodes or only one node, whichever comes first (in the case of a 2, 3 or 4 node cluster it’s one node max that can be in quarantine).

If you want to get a quarantined node out of quarantine immediately you can rejoin it to the cluster via a single PowerShell command: Start-ClusterNode -CQ (CQ = ClearQuarantine). Handy in the lab, or in real life when things have been fixed and you want that node back in action ASAP.

Conclusion

Now this sounds pretty good, doesn’t it? And it is. Especially if you’re running your VMs on a SOFS share. Then the VMs will remain online during the Isolation / Unmonitored phase, but when you have “traditional” block level storage they won’t. They’ll go into the Paused Critical state, as in that design you have lost access to the CSV. Now, if you ever needed yet another reason to move to a Scale Out File Server & SMB 3 to deliver storage for your VMs I have just given you one! Hey storage vendors … how is that full SMB 3 feature stack coming on your storage arrays? Or do you really just want us to abstract you away behind a Windows SOFS cluster?

Subhasish Bhattacharya has blogged about this as well here. It’s a feature we’ll test at length to get a grip on the behavior, so we know how the cluster nodes will behave under certain conditions. “Trust, but verify” is my mantra, and it’s way better to figure out how a feature behaves in the lab than to have to figure it out when you see it for the very first time in production, based on assumptions. Just saying.

WorkingHardInIT Blog Maintenance Window & Tools Used

As you might have noticed, my blog was down last night for about 1 hour and 50 minutes, between 22:20 and 00:10. A bit longer than I wanted, but I needed more time to deal with the upgrade of MySQL as part of the routine maintenance I do on my WordPress blog server.

In the environments under my care I make sure to take the time for routine maintenance to avoid falling too far behind on firmware, drivers, patches, etc. This takes some effort, but as it helps prevent bigger issues in the long run it’s worthwhile. I take the same approach with my blog as much as possible. Most of this maintenance goes by without you ever noticing, the Windows Update reboots being the exception. WordPress upgrades, plugin upgrades, PHP upgrades, etc. usually all go swiftly, which means I can do them frequently and I’m pretty well covered there.

Upgrading MySQL, however, is always a bit of a time consuming effort, and depending on which version you’re upgrading from and to, it can actually mean multiple sequential upgrades (5.1 to 5.5.44 to 5.6.25).

image
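The idea of stepping through the release series one at a time can be sketched in a few lines of Python. This is purely an illustration of the planning step: the SERIES list and the upgrade_path function are hypothetical names of my own, and the list only covers the versions mentioned in this post.

```python
# Ordered MySQL release series; an assumption that covers only this post's versions.
SERIES = ["5.1", "5.5", "5.6"]


def upgrade_path(current, target, series=SERIES):
    """Return the release series to install, one after the other, to reach target."""
    start = series.index(current)
    end = series.index(target)
    if end < start:
        raise ValueError("downgrades are not covered by this sketch")
    # Every series after the current one, up to and including the target.
    return series[start + 1:end + 1]


print(upgrade_path("5.1", "5.6"))  # two sequential upgrades: first 5.5, then 5.6
```

It does nothing more than make the "no skipping a series" rule explicit, which is handy when you estimate how long a maintenance window needs to be.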

I practiced this upgrade on a copy of the VM in Azure to make sure I could handle whatever came up, and still I had to deal with some challenges I did not encounter in the test environment. That shows that I’m not a full time hard core MySQL guru, I guess.

Anyway, after getting to MySQL 5.6.25 from 5.5.44, fixing some issues with the “TIMESTAMP with implicit DEFAULT value is deprecated” warning (easy fix) and dealing with this error in MySQL Workbench:

An unhandled exception occurred (Error executing 'SELECT t.PROCESSLIST_ID,
IF (NAME = 'thread/sql/event_scheduler','event_scheduler',t.PROCESSLIST_USER) PROCESSLIST_USER,t.PROCESSLIST_HOST,t.PROCESSLIST_DB,t.PROCESSLIST_COMMAND,
t.PROCESSLIST_TIME,t.PROCESSLIST_STATE,t.THREAD_ID,t.TYPE,t.NAME,t.PARENT_THREAD_ID,
t.INSTRUMENTED,t.PROCESSLIST_INFO,a.ATTR_VALUE FROM performance_schema.threads t
LEFT OUTER JOIN performance_schema.session_connect_attrs a ON t.processlist_id = a.processlist_id AND (
a.attr_name IS NULL OR a.attr_name = 'program_name') WHERE t.TYPE <> 'BACKGROUND''
Native table 'performance_schema'.'threads' has the wrong structure.
SQL Error: 1682). Please refer to the log files for details.

which I fixed by running mysqld --performance_schema, I’m rocking everything up to date once more.

image

Always have good backups, make exports of your database schema, data and structures in MySQL, and have multiple ways out when things go south. In Azure I’m relying on Backup Vault, where I protect my virtual machine with scheduled backup jobs. I also back up my WordPress with the data via a plugin and export the database via MySQL Workbench.

image

Those dumps are copied out of the VM to wherever I want (Azure, OneDrive, home PC, a VM running in AWS …) to make sure I have multiple options to recover.
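An extra copy only helps if it is actually intact. As a minimal sketch, assuming the dumps sit in one folder and the copy in another, the Python below compares SHA-256 checksums of both sides. The function names and the folder paths in the usage note are hypothetical, and this is not something FastSCP does for you as far as I know.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dumps do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_copies(source_dir, copy_dir):
    """Return relative paths of files that are missing or different in copy_dir."""
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    bad = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        # A copy is bad when it does not exist or its checksum differs.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(rel.as_posix())
    return bad
```

Calling something like find_bad_copies("D:/BlogDumps", "E:/OffsiteCopy") (both paths made up for the example) returns an empty list when every dump has an identical copy, and the names of the offending files otherwise.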

image

VEEAM FastSCP for Microsoft Azure comes in very handy for this by the way. You might want to check it out if you’re in need of an automated and secure way to get data out of a VM running in Microsoft Azure!