In-place upgrade of an Azure virtual machine

Introduction

In the cloud it’s all about economies of scale, automation, wipe and (redeploy). Servers are cattle to be destroyed and rebuild when needed. And “needed” here is not like in the past. It has become the default way. But not every one or every workload can achieve that operational level.

There are use cases where I do use in place upgrades for my own infrastructure or for very well known environments where I know the history and health of the services. An example of this is my blog virtual machine. Currently that is running on Azure IAAS, Windows Server 2016, WordPress 4.9.1, MySQL 5.7.20, PHP. It has never been reinstalled.

I upgraded my WordPress versions many times for both small incremental as major releases. I did the same for my MySQL instance used for my blog. The same goes for PHP etc. The principal here is the same. Avoid risk of tech debt, security risks and major maintenance outages by maintaining a modern platform that is patched and up to date. That’s the basis of a well running and secure environment.

In-place upgrade of an Azure virtual machine

In this approach updating the operating system needs to be done as well so my blog went from Windows Server 2012 R2 to Windows Server 2016. That was also an in place upgrade. The normal way of doing in place upgrades, from inside the virtual machine is actually not supported by Azure and you can shoot yourself in the foot by doing so.

The reason for this is the risk. You do not have console access to an Azure IAAS virtual machine. This means that when things go wrong you cannot fix it. You will have to resort to restoring a backup or other means of disaster recovery. There is also no quick way of applying a checkpoint to the VM to return to a well known situation. Even when all goes well you might lose RDP access (didn’t have it happen yet). But even if all goes well and that normally is the case, you’ll be stuck at the normal OOBE screen where you need to accept the license terms that you get after and upgrade to Windows Server 2016.

clip_image002

The default upgrade will boot to that screen and you cannot confirm it as you have no console access. If you have boot diagnostics enabled for the VM you can see the screen but you cannot get console access. Bummer. So what can you do?

Supported way of doing an in-place upgrade of an Azure virtual machine

Microsoft gives you two supported options to upgrade an Azure IAAS virtual machine in

An in-place system upgrade is not supported on Windows-based Azure VMs. These approaches mitigate the risk. The first is actually a migration to a new virtual machine. The second one is doing the upgrade locally on the VHD disk you download from Azure and then upload to create a new IAAS virtual machine. All this avoids messing up the original virtual machine and VHD.

Unsupported way of doing an in-place upgrade of an Azure virtual machine

There is one  way to do it, but if it goes wrong you’ll have to consider the VM as lost. I have tested this approach in a restored backup of the real virtual machine to confirm it works. But, it’s not supported and you assume all risks when you try this.

Mount your Windows Server 2016 ISO in the Windows Server 2012 R2 IAAS virtual machine. Open an administrative command prompt and navigate to the drive letter (mine was ESmile of the mounted ISO. From there you launch the upgrade as follows:

E:\setup.exe /auto upgrade /DynamicUpdate enable /pkey CB7KF-BWN84-R7R2Y-793K2-8XDDG /showoobe none

The key is the client KMS key so it can activate and the /showoobe none parameter is where the “magic” is at. This will let you manually navigate through the wizard and the upgrade process will look very familiar (and manual). But the big thing here is that you told the upgrade not to show the OOBE screen where you accept the license terms and as such you won’t get stuck there. So fare I have done this about 5 times and I have never lost RDP access due to the in-place upgrade. So this worked for me. But whatever you do, make sure you have a backup, a  way out, ideally multiple ways out!

Note that you can use  /Quiet to automate things completely. See  Windows Setup Command-Line Options

Nested virtualization can give us console access

Since we now have nested virtualization you have an option to fix a broken in-^lace upgrade but by getting console access to a nested VM using the VHD of the VM which upgrade failed. See:

Conclusion

If Microsoft would give us virtual machine console access or  DRAC or ILO capabilities that would take care of this issue. Having said all that, I known that in place upgrades of applications, services or operating systems isn’t the cloud way. I also realize that dogmatic purism doesn’t help in a lot of scenarios so if I can help people leverage Azure even when they have “pre cloud” needs, I will as long as it doesn’t expose them to unmanaged risk. So while I don’t recommend this, you can try it if that’s the only option you have available for your situation. Make sure you have a way out.

IAAS has progressed a tremendous amount over the last couple of years. It still has to get on par with capabilities we have not only become accustomed to but learned to appreciate over over the years. But it’s moving in the right direction making it a valid choice for more use cases. As always when doing cloud, don’t do copy paste, but seek the best way to handle your needs.

Hyper V Amigos Showcast 15 – VMFleet explained

After a short break in broadcasts, due to extremely busy times, the Hyper-V Amigos ride again! Yes, fellow Microsoft MVP Carsten Rachfahl (@hypervserver) from Rachfahl IT Solutions and yours truly are back in the saddle!

In episode 15 of the Hyper-V Amigos Showcast we look at VMFleet. VMFleet is a free Microsoft test suite to show/test and help analyze the performance of a Hyper-converged Storage Spaces Direct Setup.

image

In this episode Hyper V Amigos Showcast 15 – VMFleet explained we show you how to set it up, how it works and how to measure and evaluate your S2D deployment with it. We also share some real life tips with you. All this in just around 2 hours. So stick around for the long haul!

Cluster Shared Volumes without Active Directory

Introduction

Cluster Shared Volumes without Active Directory will come online if certain conditions are met. Basically, when the cluster can form. That’s what we’ll talk about here. There are a lot of things to consider when virtualizing your Active Directory environment, partially or completely. Too much for this blog post alone and much of it has been discussed before. Here I’ll discuss when people count on their Cluster Shared Volumes coming online when active directory is not available but are disappointed and think that it’s a bug or a broken promise. It is not! We know that since Windows Server 2012 the CSVs coming online do not depend on Active Directory being available (thanks to the local CLIUSR account being used for cluster starts).

Cluster Shared Volumes without Active Directory

So, let’s address getting your Cluster Shared Volumes online when AD is not available. The things to realize is that not having Active Directory has been taken care off. The thing that still can go wrong is that the cluster doesn’t come on line properly and that means that the CSV LUNs won’t be online as those are cluster resource. Hyper-V can boot the VMs without the cluster being formed but as the VMs reside on a CSV and those are not available, that’s what’s causing the problem. That means that getting your cluster to come up is the real issue.

Getting your cluster to come up is the real issue.

If cluster shared volumes can come up without a domain controller / AD being available since Windows Server 2012 (Failover Clustering and Active Directory Integration) how is it that some many people still have issues with it? How do you make this work for you? You can use a disk witness to protect you in most cases. When you have a file share or cloud witness you can take some step to avoid running into this issue.

There is a great article on cluster behavior here: Failover Cluster Node Startup Order in Windows Server 2012 R2 (the same rules apply for Windows Server 2016). Read it and you’ll notice that the behavior differs depending on whether you have a witness defined and whether that witness is a witness disk or a file share or cloud witness. The disk witness has a copy of the cluster database and if available will help you out of any situation where the paxos on the disk witness (if it’s available) is more recent then the one on the cluster node as the node will download the cluster database info from the disk witness. But a file share or cloud witness don’t have a copy, so they have a small disadvantage under certain scenarios. CSV’s that won’t come up is when booting the nodes in the wrong order when using a file share or cloud witness is one of those scenarios but not having AD isn’t the root cause of this.

Some people will never notice this issue at all, especially not when they have disk witness, but when they do, it might be at a very inconvenient moment in time. Examples of where this situation can occur are a single cluster environment where the domain controllers are running on CSV and high available in that cluster that was shut down completely, the cluster has a file share witness and the nodes are started in the wrong order.

Making sure your CSVs come online = clustering being formed

Well for one make sure you are using Windows Server 2012 or higher for your cluster. That’s a given.

Beyond that you basically just need to know what to do in what scenario to get your cluster up and running so you won’t have issues with getting the CSV to come up. You just have to follow some rules of thumb and you’ll be fine. Also, there’s almost always a way to get out of pickle, just don’t panic. But also remember that you can make your live easier when you design your solutions with failure in mind and by knowing your options so you can act correctly. I my example I’ll be using Hyper-V cluster with CSVs but the same goes for SOFS, SQL Server clusters leveraging CSVs etc.

Planned down time

If you’re shutting down a complete cluster you have two options to make sure things go a smooth as possible.

Option 1 – A clean cluster shutdown by the book

  • Shut down the workload, i.e. the VMs.
  • Stop the cluster
  • Shutdown the Cluster nodes.
  • Boot the cluster nodes. You can start with any you like. The CSVs will come on line. You will be able to start the VMs. Do start with the domain controllers and wait for them to come on line before starting the others.

During this you’ll see some “collateral” events, errors, warnings due to Active Directory not being available. The cluster name has issues without Active Directory but that a management connection point, it doesn’t mean the cluster doesn’t work. Once Active Directory is available the cluster name will come on line automatically when the default failure policy restarts it. You can also manage the cluster via RDP or console by connection to “.” locally on that node or use the running node name. You can also try to bring cluster name on line manually when Active Directory is up and running.

Option 2 – A clean cluster shutdown as it happens most of the times in real life

Which is one a lot of people do to keep part of the work load running as long as possible.

  • Shut down the no critical workload, i.e. the VMs.
  • Pause a node so the critical workloads live migrate to other nodes
  • When the node is paused, shut down the node.
  • You rinse and repeat this until the last node is left with only the most critical VMs

clip_image002

  • Finally, you stop the workload on that last cluster node and shut it down as well.
  • Now comes the critical part: Remember what node you shut down last. You have to start that one first and you’ll see that your CSV will come on line. If you boot another node, you might panic as the CSV will not come on line.

Now I need to correct this a little bit. With a disk witness you are OK whatever node you boot. When the paxos numbers on the first node to boot and the disk witness are compared the most recent copy will be used. Either the local one on the node will be used directly or after it has been updated with the data form the disk witness. To make things simple for the ops team I always tell them to note the last node they shut down anyway no matter what type of witness they have. It’s good info to have.

With a file share or cloud witness the shutdown/startup order really comes into play. The reason this happens is that by shutting down node by node we end up with a one node cluster (last man standing).  When that’s shut down that’s the only node that knows about the last (potential) changes in the cluster as it holds a copy of the cluster database. Remember that a files hare or a cloud witness has no copy of the cluster database. That’s why the last node to be shut down has to come on line first when comparing the paxos numbers with the witness as that node can form a cluster. If the node that boots first does not hold the most recent paxos number it cannot download the cluster databases info from the file share or cloud witness. Such a node cannot form a cluster and bring the CSVs online. If the first node to boot was the last one to shut down, it can form a cluster and the CSVs will come on line. This is the big difference with option one where you shut down the entire cluster and then take the nodes of line.

You might not know or remember the order. If that’s the case you still have options like starting the cluster node with the -fixquorum option (net.exe start clussvc /forcequorum) at the risk of loosing some cluster changes that are not in the local cluster database.

No need to go to immediately backups, extract the domain controller VMs from SAN snapshots or mount a LUN to a different machine to extract the VM files for the DC or so. Don’t panic!

One or more failed nodes

Well as long as the cluster survives your domain controller VMs should fail over. Keep ‘m on separate node (anti-affinity), separate CSV LUNs, if possible separate clusters if all domain controller virtual machines are going to be running high available on a cluster node and that cluster is still functional after all. No issues here.

Total cluster failure

The cluster nodes all show due to a “global” BSOD or are turned off due to a power failure or a storage array crash. This is more the realm of bad dreams I know, it does happen. Often things will recover and you’ll be fine but you can do your part. The same rules apply, get the cluster to form and you’ll have your CSVs on line. In a bad case -fixquorum is your friend but normally it’s not your first option. In the worst case you’ll need recovery from backup of the cluster or rebuild it. It’s a very bad day if it comes to that. And cluster recovery is not the subject of this post.

Conclusion

Don’t blame Active Directory and start troubleshooting or fixing the wrong problem. So yes, CSV will come on line when certain conditions are met and you can work yourself out of a pickle if needed. But during a disaster that’s only extra work and stress you might not want to worry about if you can avoid it. It’s good to know how to resolve issues around CSVs not coming online when the shit hits the fan as even the best laid out plans tend to get side tracked by reality when disaster strikes.

If you cannot guarantee control over all the prerequisites and might not have the skills in please when needed, you might consider other options. Some of these are actually the best practices of the past when a CSV would indeed not come on line without active directory in any way. This is great for AD related issues but not for you offline CSVs, they need the cluster to form properly!

Sure, you can run the domain controller virtual machines on local storage, and not made high available. This cloud be on one of the cluster nodes (*) or on a stand-alone Hyper-V host. Having a physical domain controller is also a possibility. This helps avoid issues with AD in virtualized environments as many other services are very dependent on them and it’s good to have on one available all of the time and get them back on line a.s.a.p..

I’ll leave you with the fact that virtualizing domain controllers can be done but it pays to study up on how to do it well and test your assumptions in the lab. There is a lot of information on virtualizing domain controllers for a reason. Read it and process what you’ve learned from it. You might find that this CCV thingies is not the most complex subject to deal with.

(*) Please note that some cluster deployments like HCI based on S2D do not support running other (local) storage in addition to the boot OS and the S2D storage pool volumes.

My favorite deployment for VMs with Discrete Device Assignment for GPU

Introduction

Recently I had an interesting discussion on how to leverage Discrete Device Assignment (DDA) for GPU needs when it’s only needed for a certain number of virtual machines. Someone had read my blogs on leveraging DDA that made here optimistic and enthusiastic. But she noticed in the lab she could not leverage DDA on a VM running on a cluster and she could not use storage QoS policies on a stand-alone Hyper-V host with local storage. So, what could she do?

Well for one, her findings are correct. Microsoft did not enable DDA on clustered virtual machines. It doesn’t make sense as the GPU hardware is tied to the virtual machine and any high availability, both planned (live migration) or unplanned (failover) isn’t possible and available anyway. It just cannot be done. I hear you, when you say “but they pulled it off for SR-IOV for networking”. Sure, but please keep in mind that network cards with Ethernet and TCP/IP allows for different approaches than high end video.

My favorite deployment for VMs with Discrete Device Assignment for GPU

My favorite deployment for VMs with Discrete Device Assignment (DDA) for GPU leveraged SMB3 SOFS shares for the virtual hard disks and stand-alone Hyper-V hosts that are member servers in the domain. Let me explain why.

Based on what we discussed above we have some options. One work around is running the DDA virtual machines not high available on local storage on a cluster node. But that would mean you would have a few VMs on all the nodes and that all those nodes must have a DDA capable GPU. Or if you limit the number of nodes that have such a GPU you’ll have a few odd balls in your cluster. You’ll need to manage some extra complexity and must save guard against assigning a GPU via DDA that is already in use for RemoteFX. That cause all kinds of unpleasantness, nothing too deadly but not something you want to do if on your production VDI clusters for fun. It’s a bit like not running a domain controller on a CSV and not making it highly available. If that’s the only option you have you can do that, and I do when needed as Microsoft has improved a lot of things to make this a better and less risky experience. But I prefer to have either physical one or host it on a separate non-clustered Hyper-V host if that’s an option because not all storage solutions and environments have all capabilities needed to make that fool proof.

Also note that running other storage on a S2D node isn’t supported. You have your OS on the boot disks and the disks used in storage spaces. Odd ones out aren’t supposed to be there as S2D will try to recruit them. You can get do it when using traditional shared storage

What I also don’t like about that is that if the cluster storage is not SMB3 SOFS you don’t get the benefit of storage QoS policies in Windows Server 2016, that only works with CSV. So optionally you could leave the non-clustered VM on a CSV. But that’s perhaps a bit confusing and some people might think some forgot to make the machine high available etc.

My preferred setup to get high available storage for virtual machines with DDA needs that benefits from what storage QoS polies have to offer for VDI is to use standalone Hyper-V hosts that have DDA capable GPUs and leverage SMB3 SOFS shares for the virtual Machines.

clip_image002

The virtual machines cannot be high available anyway so you lose nothing there. The beauty is that in this case, as you leverage a Windows Server 2016 SOFS cluster for Hyper-V storage over SMB3 shares, you do get Storage QoS policies.

#On a SOFS node

Get-StorageQosPolicy -Name DedicatedTier1Policy | Get-StorageQosFlow | ft InitiatorName, *IOPS, Status, PolicyID, filePath -AutoSize

#Query for the VM disks on the Hyper-V node

Get-VM -Name DDAVMSOFSStorage -ComputerName RemoteFXHost | Get-VMHardDiskDrive |fl *

clip_image004

#We generate some IO and get some stats on a SOFS node

get-storageQosFlow

get-storageQoSVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\

get-storageQoSVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\ | fl

clip_image006

You can start out with one Hyper-V node and add more when needed, that scale out. Depending on the needs of the virtual machines and specs of the servers (Memory, CPU cores) and the capability and number of GPU in the video cards you get some scale up as well.

To learn more about DDA go here:  https://blog.workinghardinit.work/?s=DDA&submit=Search

To learn more about storage QoS policies go here:

Some more considerations

By going disaggregated. You can leverage a SOFS share for both virtual machines running on a Hyper-V cluster or on stand-alone (non-clustered) Hyper-V that are domain members. The SOFS cluster can be leveraging S2D, traditional storage spaces with shared SAS (JBODs) or even a FC, iSCSI or shared SAS SANS if that the only option you have. That’s all OK as long as it’s SOFS running on Windows Server 2016 and the Hyper-V hosts (stand alone or clustered) are a running 2016 as well (needed for Storage QoS policies and DDA). There is no need for the Hyper-V host to be part of a cluster to get the best results you need. If I use SOFS for both scenarios I can use the same storage array, but I don’t need to. I could also use separate storage arrays. If the Hyper-V cluster is leveraging CSV instead of SOFS I will need to use a separate one for SOFS as its ill advised to mix Hyper-V workloads with the SOFS role. Keep things easy, clear and supportable. I’ll borrow a picture I got from a Microsoft PM recently, do seek out the bad ideas.

clip_image008