Windows Server vNext Soft Restart – A way to speed up reboots? Not in Technical Preview 9841

As you probably know, I'm also playing around with and testing the Windows Server vNext Technical Preview, and one of the nice new features in there that I have my eye on is Soft Restart.

There is little information on this feature out there right now, but from the description, Soft Restart looks like a way to get faster Windows boot times by cutting out device and firmware initialization when it's not needed. That would be a great thing to have: with > 10Gbps live migration speeds, the boot time of our hardware-loaded (DRAC, NICs, HBAs, BMC, …) servers is the longest single step per node during Cluster-Aware Updating. It will be interesting to see if this is indeed what it's there for.

But let's find out if this is indeed what we think it is. First of all, the installation of this feature requires a restart, so keep this in mind.

There are two ways to kick it off that I know of, but there must be more … it would be a shame not to have this integrated as an option into Cluster-Aware Updating, for example.

Option 1: via shutdown

So let's try shutdown /r /soft /t 000. No joy; it doesn't make one bit of difference, and nothing is logged to indicate an issue.

Option 2: PowerShell via Restart-Computer -Soft

No joy here either …

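For reference, these are the two invocations as I tried them. The /soft switch and the -Soft parameter are what this Technical Preview build exposes, so treat the exact syntax as subject to change in later releases:

# Option 1: classic shutdown.exe with the new /soft switch, restarting immediately
shutdown /r /soft /t 000

# Option 2: the PowerShell route
Restart-Computer -Soft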

What could be the problem?

So I figured I needed enterprise-grade server hardware with some FC cards and lots of NICs and memory to notice the difference. On a VM it might do nothing, but I assure you it doesn't do anything on the PC-based home lab either. So I dragged a DELL PowerEdge R730 with exactly that into the game. Still no joy. Then I thought some more and decided it might integrate with the hardware's capabilities to do so, so I went and installed the latest and greatest DELL Server Manager software to see if that made a difference. But again, no joy.

It's probably just not lit up yet in this release, Technical Preview build 9841. For now I'll be content with the 28-30% faster reboots the DELL R730 UEFI brought us. I'd love to speed things up a bit more, as time is money and valuable, but we'll have to wait for the next code drop to see if and how it works …

Windows Server Technical Preview delivers integration services updates through Windows Update

Benefits of delivering updates to the integration services via Windows Update

In Windows Server vNext, aka the Technical Preview, the integration services are delivered through Windows Update (and as such through the well-known tools such as WSUS, …). This significantly reduces the operational burden of keeping them up to date. Many of us turned to PowerShell scripting to handle this task. So did I, and I still find myself tweaking the scripts once in a while for a condition I had not dealt with before, or just to get better feedback or reporting. Did I ever tell you the story about the cluster where a hundred VMs did not have a virtual DVD drive (they had been removed to improve performance)? That led to yet another improvement to my script: detect the absence of a virtual DVD drive. In this day and age, virtualization has scaled both up and out, with ever more virtual machines per host and in total. The process of having to load an ISO into a virtual DVD drive inside a virtual machine to upgrade the integration services seems arcane, and it's very timely that it has been replaced by an operational process more befitting a Cloud OS.

I had optimized this process with some PowerShell scripting, so it wasn't too painful anymore. The script upgrades all the VMs on the hosts and even puts them back in the state it found them in (Stopped, Saved, Running).

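To give you an idea of the approach, here is a minimal sketch of the state-restore logic (not the full script; $hostName and the actual upgrade step are placeholders), using the standard Hyper-V cmdlets:

# Remember the state each VM is in, service it, then put it back the way we found it
$vms = Get-VM -ComputerName $hostName

foreach ($vm in $vms) {
    $originalState = $vm.State

    # The guest has to be running before the integration services inside it can be upgraded
    if ($originalState -ne 'Running') { Start-VM -VM $vm }

    # ... mount vmguest.iso and run the integration services upgrade in the guest here ...

    # Return the VM to the state it was found in (Off/Stopped, Saved or Running)
    switch ($originalState) {
        'Off'     { Stop-VM -VM $vm -Force }
        'Saved'   { Save-VM -VM $vm }
        'Running' { }    # leave it as it was
    }
}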

I'm glad that it's now handled through Windows Update and is part of the other routine maintenance that's done on the guests anyway.

But it's not only good news for us "on-premises" system administrators and integrators. It's also important for service/cloud providers and (hosted) private cloud hosters. This change means that tenants have control over the updates to the integration services of their virtual machines. They update their Windows virtual machines with all updates during their normal patch cycles, and now this includes the integration services. This provides operational ease (a single method) and avoids some of the discussions about when to upgrade the integration services.

Legacy Operating Systems

Shortly after the release of the Windows Server Technical Preview, updates to the integration services for Windows guests began being distributed through Windows Update. This means that on that version the vmguest.iso is no longer needed and, as such, it's no longer included with Hyper-V. So if you run an unsupported (most often legacy) version of Windows, you'll need to grab the latest possible vmguest.iso from a W2K12R2 Hyper-V host, try to install that and see if it works.
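If you end up in that situation, a quick way to see which guests are lagging behind and to attach a vmguest.iso copied from a W2K12R2 host could look like this; the VM name and ISO path are just placeholders:

# Show the integration services version Hyper-V reports for each guest
Get-VM | Select-Object Name, State, IntegrationServicesVersion

# Attach the vmguest.iso taken from a W2K12R2 host to a legacy guest
Set-VMDvdDrive -VMName 'LegacyGuest01' -Path 'C:\ISO\vmguest.iso'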

What about Linux and FreeBSD?

Well, nothing has changed there; you can read about how that's taken care of here: Linux and FreeBSD Virtual Machines on Hyper-V.

Hyper-V Technical Preview Live Migration & Changing Static Memory Size

I have played with Hot Add & Remove of Static Memory in a Hyper-V vNext virtual machine before, and I love it. As I'm currently testing (actually, sometimes abusing) the Technical Preview a bit to see what breaks, I'm sometimes testing silly things. This is one of them.

I took a Technical Preview VM with 45GB of memory, running in a Technical Preview Hyper-V cluster, and live migrated it. I then tried to change the memory size up and down during the live migration to see what happens, or at least to confirm that nothing goes "BOINK". Well, not much happens: we get a notification that we're being silly. So no failed migrations, no crashed or messed-up VMs or, even worse, hosts.

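In case you want to repeat the silliness, the test boils down to something like this; the host and VM names are made up, and -AsJob is just one way to keep the migration running while you fiddle with the memory setting:

# Start the live migration in the background
$job = Move-VM -Name 'TPVM01' -DestinationHost 'TPHOST02' -AsJob

# While it's still migrating, try to resize the static memory up and down
Set-VMMemory -VMName 'TPVM01' -StartupBytes 48GB
Set-VMMemory -VMName 'TPVM01' -StartupBytes 40GB

# Wait for the migration to finish and see what it reports
Receive-Job -Job $job -Wait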

It's early days yet, but we're getting a head start, as there is a lot to test and that will only increase. The aim is to get a good understanding of the features, the capabilities and the behavior to make sure we can leverage our existing infrastructures and Software Assurance benefits as fast as possible. Rolling cluster upgrades should certainly help us do that faster, with more ease and less risk. What are your plans for vNext? Are you getting a feel for it yet, or are you waiting for a more recent test version?

Hardware maintenance, the unsung hero of IT or “what hero culture?”

How does one keep an IT infrastructure in top form? With care, knowledge, dedication and maintenance. For some this still comes as a surprise. To many, the job is done when a product or piece of software is acquired, sold or delivered. After all, what else is there to be done?

Lots. True analysis, design and architecture require a serious effort. Despite the glossy brochures, the world isn't as perfect and shiny as it should be. Experience and knowledge go a long way in making sure you build solid solutions that can be maintained with minimal impact on the services.

Maintenance must be one of the least appreciated areas of work, even though it is valuable and necessary. The things we do that management, not even IT management, knows about are numerous, let alone that they would understand what we do and why.

Take firmware upgrades, for example: switches, load balancers, servers … The right choice of solution and the right design mean you'll be able to do that maintenance without downtime or service impact.

Whose manager knows that even server PSUs need firmware upgrades? Do they realize how much downtime that takes, even for a server with redundant power supplies?

It takes up to 20-25 minutes per node. Yes! So you see that a 10Gbps live migration network has yet another benefit: it cuts down on the total time needed to complete this effort across a cluster. Combine it with Cluster-Aware Updating and it's fully automated. Just make sure the people in ops know it takes this long, or they might start troubleshooting something that's normal. So you want clusters, and you want independent redundant switches or MLAG, VLT, vPC, …

The stack in question is an older switch model, but it's the only stack still in use in a data center. At client sites I don't mind stacks that much; it's a different workload.

Think about your storage fabrics, load balancers, gateways … all redundant and independent, so that maintenance doesn't affect services.

If you do not have a solution and practices in place that keep your business running during maintenance, people might avoid doing it. As a result you might suffer downtime that gets classified as buggy software or unavoidable hardware failure. But there is another side to that coin, the good old saying "if it ain't broke, don't fix it". On top of that, even hardware maintenance requires care and needs a plan to deal with failure; it too has bugs and can go wrong.

There is a lot of noise about a "hero" culture in IT Ops and a "cowboy mentality" among system administrators. Partially this is supposed to be cultivated by the fact that they get rewarded for being a hero, or so I read. In my experience that's not really the case: you work at night or through the night, then have to show up at work anyway and explain what went wrong. No appreciation, money or anything. Basically you as an admin pay the price. There is no overtime pay, on-call remuneration or anything. Maybe it's different elsewhere, but I have not seen many "hero cultures" in real life in IT Ops (as said, we pay personally for our mistakes or misfortunes). Realistically, I have the perception that the "cowboy culture" is a lot more rampant at the white-collar management level. You know, when they decide to buy the latest and greatest solution du jour to fix something that isn't caused by the existing products or technology. When it blows up, it's an operational issue. Right?

Well, don't worry, the cloud will make it all go away! Cloud. Sure, cloud is big and getting bigger. It brings many benefits but also drawbacks, especially when done wrong. There are many factors to consider and it can't be done just like that. It needs the same care and effort in analysis, design, architecture and deployment as all other infrastructure. You see, it's not just operational ease where the benefit of cloud lies, but in the fact that, when done right, it allows for a whole different way of building and supporting services. That's where the real value is.

So yes, that's why we do architecture and why we design with a purpose: so we can schedule regular maintenance, and so we can minimize or even avoid any impact. On premises we build solutions that allow this to be done during office hours and that can survive even a failed firmware upgrade. In the cloud we try to protect against cloud provider failure. You might have noticed they too have issues.

Cowboys? Hero culture? Not me; site resilience engineering for the win!