Virtual Network Appliances I Use for Hyper-V Labs

When you build and maintain a test lab you’re always on the lookout for gear you can use, either hardware or virtual appliances. My main concerns are cost, how well it works on Hyper-V and the ability to mimic real-world environments. That’s a great help for educational purposes as well as for testing and as an aid to troubleshooting. One of the nice things virtualization, and now also cloud IaaS, offers is the ability to run virtual storage and network appliances that give us that real-world look and feel. Add to that ever more software-defined storage, networking and compute and we’re able to build very realistic labs. The limits we’re left with are time, money and space.

When building a lab some people tend to run into perceived limitations of their hypervisor. That’s to be expected, as for many that hypervisor is just something to get up and running quickly so they can get to work writing code, implementing a backup solution or whatever the workload at hand is all about. The tip here is not to give up too fast.

More recently I’m building and working on a new lab setup simulating different sites. I need to route between these isolated test networks and load balance traffic in a site-redundant manner. The idea was to mimic real life as well as we could. Add to that lab setup an Azure “site” and it’s fun all over. It’s all based on Hyper-V and Windows Server virtual machines, but some components are not. Windows NLB has had its best day and RRAS is limited in the abilities I need to test. They can and do work fine for certain scenarios, but not for all that I need to test. So I add virtual load balancers, virtual switches with the look and feel of physical ones, and the same for virtual firewalls.
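As a minimal sketch of how such isolated test networks can be carved out on a Hyper-V host (the switch names and the router VM name below are my own placeholders, not the actual lab configuration): one private vSwitch per simulated site, with the virtual router getting a vNIC in each site so all inter-site traffic has to pass through it.

# One private vSwitch per simulated site; VMs on these switches can only
# reach other sites through the virtual router/firewall appliances.
New-VMSwitch -Name "Site-A" -SwitchType Private
New-VMSwitch -Name "Site-B" -SwitchType Private

# Give the routing appliance a leg in each site (VM name is hypothetical).
Add-VMNetworkAdapter -VMName "VyOS01" -SwitchName "Site-A" -Name "Site-A"
Add-VMNetworkAdapter -VMName "VyOS01" -SwitchName "Site-B" -Name "Site-B"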

Now in real life you’ll be dealing with Link Aggregation Groups, Trunking, MLAG, routing, teaming … in short, the tools of the trade when doing networking. One side effect of this is that on a Hyper-V host you quickly run out of physical network ports to work with. That’s not a problem; in real life your firewall or load balancer does not have 48 ports either. Often you have 4 to 8 ports at your disposal, sometimes more, and depending on the complexity that’s either more than enough or not enough at all. Trunking and VLANs are the way we deal with this. In the Hyper-V GUI you will not find a way to define a trunk on a vNIC attached to a vSwitch, but this can be done via PowerShell. So please do not reject Hyper-V as not being up to the job. It is! Read about this in my blog post.
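As a hedged example of what that looks like (the VM name and VLAN IDs below are just placeholders), you can turn a vNIC into a trunk port with Set-VMNetworkAdapterVlan:

# Configure the vNIC of a virtual firewall as a trunk carrying a range of
# VLANs, with VLAN 10 as the native (untagged) VLAN. Names and IDs are examples.
Set-VMNetworkAdapterVlan -VMName "OPNsense01" -Trunk -AllowedVlanIdList "10-49" -NativeVlanId 10

# Verify the result
Get-VMNetworkAdapterVlan -VMName "OPNsense01"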

People often ask me what virtual network appliances I use for Hyper-V labs. This does vary over time, but there are some constants. In the lab I hate wasting time on time-bombed trials, so I avoid those in favor of either fully featured solutions or free open source alternatives. Smart vendors provide the easiest access possible to their solutions. They realize that easy access delivers the ability to learn and test every aspect of the products, which makes a huge difference in the success of their offerings in the real world. When it comes to load balancers I use the KEMP Virtual LoadMasters. You can read more about these in projects and lab testing in my blogs about the KEMP (Virtual) LoadMaster.

As an MVP I got one free license. Together with the ability to restore configurations, that lets me have a pseudo-permanent redundant load balancing setup. Only building labs for multi-site geo load balancing solutions requires starting from scratch every time. For routing I use VyOS; it works on both hardware and on a bunch of hypervisors with x64 virtual machines. When I need the look and feel of a firewall you’ll encounter in business I use OPNsense. It supports the synthetic vNICs with the enlightened Hyper-V drivers. Yup, the integration components are there. It doesn’t boot from UEFI, so no Generation 2 virtual machine support as of yet.

Another good one is IPFire. This one also does a nice job with the integration components.

I also have a DELL SonicWall in my home office where I have some ports to play with, but it tends to be leveraged more for the permanent parts of the lab. It’s a crucial & permanent component.

SonicWALL NSA 220 Wireless-N Appliance

Exchange 2016 On The Horizon

With Exchange 2016 on the horizon (RTM in Q4 2015) I’ve been prepping the lab infrastructure and dusting off some parts of the Exchange 2010/2013 lab deployments to make sure I’m ready to start testing a migration to Exchange 2016. While Office 365 offers great value for money, sometimes there is no option to switch over completely and a hybrid scenario is the way to go. The reasons can be regulations, politics, laws, etc. No matter what, we have to come up with a solution that satisfies all needs as well as possible. Even in 2015 or 2016 this will mean on-premises e-mail. This is not my “default” option when it comes to e-mail in 2015, but it’s still a valid option and choice. That way customers can get the best of both worlds and be compliant. Is this the least complex solution? No, but it gives them the compliance they need and/or want. It’s not my job to push companies 100% to the cloud. That’s for the CIO to decide and for cloud vendors to market/sell. I’m in the business of helping create the best possible solution to the challenge at hand.

Figure: Exchange 2016 Architecture © Microsoft

The labs were set up to test and prepare for production deployments. It all runs on Hyper-V and it has survived upgrades of the hypervisor (half of the VMs are even running on Windows Server 2016 hosts) and the conversion of the VHDs to VHDX. These labs have been kept around for testing and troubleshooting. They are fully up to date. It’s fun to see the old 2009 test mails still sitting in some mailboxes.
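For reference, that disk conversion is a one-liner per virtual disk; the path below is purely hypothetical and the VM has to be shut down first:

# Convert an old lab VHD to the VHDX format (example path, VM powered off).
Convert-VHD -Path "D:\VMs\EX2010-MBX1\Disk0.vhd" -DestinationPath "D:\VMs\EX2010-MBX1\Disk0.vhdx"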

Both Windows NLB and KEMP Technologies LoadMasters are used. Going forward we’ll certainly keep using the hardware load balancing solution. Oh, when it comes to load balancing, there is only the best possible solution for your needs in your environment. That will determine which of the various options you’ll use. Exchange 2016 will be very different from Exchange 2010 in regards to session affinity; affinity is no longer needed since Exchange 2013.

In case you’re wondering what LoadMaster you need, take a look at their sizing guides:

Another major change will be the networking. On Windows Server 2012 R2 we’ll go with a teamed 10Gbps NIC for all workloads, simplifying the setup. Storage wise, one change will be the use of ReFS, especially if we can do this on Storage Spaces. The data protection you get from that combination is just awesome. Disk wise, the IOPS requirements have dropped yet a little more, so that should be OK. Now, being a geek, I’m still tempted to leverage cheap, larger SSDs for flying performance. If possible at all I’d like to make it a self-contained solution, so no external storage from any type of SAN / centralized storage. Just local disks. I’m eyeing the DELL R730xd for that purpose. Ample storage, the ability to have 2 controllers and my experience with these has been outstanding.
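A hedged sketch of that converged networking piece, with adapter and switch names that are purely illustrative:

# Team two 10Gbps ports on Windows Server 2012 R2 and put one converged
# vSwitch on top for all workloads (names are examples, not the real setup).
New-NetLbfoTeam -Name "Team10G" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
New-VMSwitch -Name "vSwitch-Converged" -NetAdapterName "Team10G" -AllowManagementOS $true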

So no virtualization? Sure, where and when it makes sense and it fits in with the needs, wants and requirements of the business. You can virtualize Exchange and it is supported. It goes without saying (serious bragging alert here) that I can make Hyper-V scale and perform like a boss.

Putting Windows Server 2016 TPv3 To The Test

I’ve dedicated some time to investigating the new and improved features and capabilities ever since Technical Preview 1 (TPv1). We kept going with TPv2 and now TPv3. The proving grounds for putting Windows Server 2016 TPv3 to the test are up and running.

As usual I’ll be sharing some of the results and findings. I only use the public Technical Previews for this, so it’s public information you can read about and go test or find out about yourself.

So far things are going quite well. I’m learning a lot. Sure, I’m also hitting some issues left and right, but on the whole Windows Server 2016 is giving me good vibes.

Expertise, insights, knowledge and experience are hard won. They’re never free. So I test what I need to find out about, find interesting or think will be valuable in the future. Asking me to go and test things for you on demand isn’t really going to work. I have bills to pay and cannot spend time, effort & resources on all of the numerous roles and features available to us in this release. Trust me, I get enough offers to work for free or peanuts from both strangers and employers, so thanks for the offers, but I need no more 😉

Testing Virtual Machine Compute Resiliency in Windows Server 2016

No matter what high quality gear you use, how well you design your environment and how much redundancy you build in, you will see transient failures in your environment at one point in time. In combination with the push to ever more commodity hardware and the increased use of converged deployments leveraging Ethernet, transient failures have become a more frequent occurrence than they used to be.

Failover clustering by tradition reacts very “assertively” to failures in order to provide high to continuous availability to our virtual machines. That’s great, we want it to do that, but this binary approach comes at a cost under certain conditions. When reacting too fast and too proactively to transient failures we can actually get less high or continuous availability in certain scenarios than if the cluster had just evaluated the situation a bit more cautiously. It’s for this reason that Microsoft introduced increased “Virtual Machine Compute Resiliency” to deal with intra-cluster communication failures in a Windows Server 2016 cluster.

I have helped out a number of fellow MVPs over the past 6 months with this new feature and I dove back into my lab notes to blog about this and help you out with your own testing. The early work was done with Technical Preview v1. In that release it was disabled by default (the value for the cluster property “ResiliencyDefaultPeriod” was set to 0) and the keyword “Default” was used in the cluster property “ResiliencyLevel” for what is now called ‘IsolateOnSpecialHeartbeat’, which is no longer the default at installation. If that doesn’t confuse you yet, I’ll find another reason to tell you to move to Technical Preview v2. In TPv2 Virtual Machine Compute Resiliency is enabled and configured by default, but in TPv1 you had to enable and configure it yourself. I advise you to stop testing with v1 and move to v2 and future technical preview releases in order to test with the most recent bits and functionality.

Investigating the feature configuration

When testing new features in Windows Server Technical Preview Hyper-V you’re on your own once in a while, as much is not documented yet. Playing around with PowerShell helps you discover stuff. A Get-Cluster | fl * teaches us all kinds of cool stuff, such as these new cluster properties:

  • ResiliencyDefaultPeriod
  • QuarantineDuration
  • ResiliencyLevel
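If you just want to zoom in on those, a quick (and entirely unexciting) way to list only the resiliency-related properties is:

# Show only the compute resiliency related cluster properties.
Get-Cluster | Format-List ResiliencyDefaultPeriod, QuarantineDuration, ResiliencyLevel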

Here’s a screenshot of Windows Server 2016 TPv1 (please stop using this version and move to TPv2!)

Now when you’re running Windows Server 2016 TPv2 this feature is enabled by default (ResiliencyDefaultPeriod has been filled out as well as QuarantineDuration) and the resiliency level has been set to “AlwaysIsolate”.

After some lab work with this I figured out what we need to know to make VM Compute Resiliency work in our labs:

  • Make sure your cluster functional level is running at version 9
  • Make sure your VMs are at version 6.x
  • Make sure the operating system of the VMs is Windows Server Technical Preview v2 (again, move away from TPv1)
  • Enable Isolation/Quarantine via PowerShell:

(Get-Cluster).ResiliencyLevel                                # check the current value
(Get-Cluster).ResiliencyLevel = 'AlwaysIsolate'              # or the numeric value 2
(Get-Cluster).ResiliencyLevel                                # verify the change
(Get-Cluster).ResiliencyLevel = 'IsolateOnSpecialHeartbeat'  # or the numeric value 1
(Get-Cluster).ResiliencyLevel

Please note that all nodes need to be online to make this change in the technical preview. I got the two accepted values by trial and error and the blog by Subhasish Bhattacharya confirms these are the only two.

  • Set the timings to some not too high and not too low value to play with in the lab without having to wait too long before things are back to normal (the values I use in my current Technical Preview lab environment are not a recommendation whatsoever, they only facilitate my testing and learning and have nothing to do with any production environment). For lab testing I chose:

(Get-Cluster).ResiliencyDefaultPeriod = 60   Note that setting this to 0 reverts you back to pre-Windows Server 2016 behavior and actually disables this feature. The default is 240 seconds.

(Get-Cluster).QuarantineDuration = 300   The default is 7200 seconds, but I’m way too impatient in my lab for that, so I set the quarantine duration lower as I want to see the results of my experiments fast. Beware of just messing with this duration in production without thinking about it. Just saying!

Testing the feature and its behaviour

Then you’re ready to start abusing your cluster to demo Isolation mode and quarantine. I basically crash the Cluster service on one of the nodes in the cluster. Note that cleanly stopping the service is not good enough; that will nicely drain the node for you, which is not what we want to see here. Crash it or force stop it via Stop-Process -Name clussvc -Force.
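For a repeatable test run, a minimal sketch looks like this (the node name "NODE2" is an assumption, and I assume you run it from another node or a management box with the FailoverClusters module loaded):

# Crash the cluster service on the target node the hard way.
Invoke-Command -ComputerName "NODE2" -ScriptBlock { Stop-Process -Name clussvc -Force }

# Watch the node go Isolated and the version 6.x VMs go Unmonitored.
Get-ClusterNode | Select-Object Name, State
Get-ClusterGroup | Select-Object Name, State, OwnerNode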

So what do we see happen:

  • The node on which we crashed the cluster service experiences a “transient” intra-cluster communication failure. This node is placed into an Isolated state and removed from its active cluster membership.

  • The VMs running at version 6.2 go into the Unmonitored state. The other ones just fail over. Unmonitored means that the cluster is no longer actively managing the VM, but you can still look at the condition of the VM via PowerShell or Hyper-V Manager.

Based on the type of storage you’re using for your VMs, the story is different:

  1. File Storage backed (SMB3/SOFS): the VM continues to run in the Online state. This is possible because the SMB share itself has no dependency on the Hyper-V cluster. Pretty cool!
  2. Block Storage backed (FC / FCoE / iSCSI / Shared SAS / PCI RAID): the VMs go to Running-Critical and are then placed in the Paused-Critical state. As you have an intra-cluster communication failure (in our case losing the cluster service), the isolated node no longer has access to the Cluster Shared Volumes in the cluster and this is the only option there is.

  • If the isolated node doesn’t recover from this presumed transient failure it will, after the time specified in ResiliencyDefaultPeriod (default of 4 minutes: 240 s), go into a Down state. The VMs fail over to another node in the cluster. Normally during this experiment the cluster service will come back online automatically.
  • If a node does recover but goes into Isolated 3 times within 1 hour, it is placed into a Quarantine state for the time specified in QuarantineDuration (default two hours or 7200 s). The VMs running on this node are drained to another node in the cluster. So if you crash that service repeatedly (3 times within an hour) the Hyper-V node will go into “Quarantine” status for the time specified (in our lab 5 minutes, as we set it to 300 s). The VMs will be live migrated off even if the node is up and running when the cluster service comes up again.

You might notice that this screenshot is from a different lab cluster. Yes, it’s a TPv1 cluster, as for some reason the live migration part of Quarantine is broken in my TPv2 lab. It’s a clean install, completely greenfield. Probably a bug.

It’s the frequency of failures that determines that the node goes into quarantine for the amount of time specified. That’s a clear sign for you to investigate and make sure things are OK. The node is no longer allowed to join the cluster for a fixed time period (default: 2 hours). The reason for this is to prevent “flapping nodes” from negatively impacting other nodes and the overall cluster health. There is also a fixed (not configurable as far as I know) number of nodes that can be quarantined at any given time: 20% or only one node can be quarantined (whichever comes first; in the case of a 2, 3 or 4 node cluster it’s one node max that can be in quarantine).

If you want to get a quarantined node out of quarantine immediately you can rejoin it to the cluster via a single PowerShell command: Start-ClusterNode -CQ (CQ = ClearQuarantine). Handy in the lab or in real life when things have been fixed and you want that node back in action asap.
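A hedged one-liner, with a node name that is purely an example:

# Bring the quarantined node straight back into active cluster membership
# instead of waiting out QuarantineDuration.
Start-ClusterNode -Name "NODE2" -ClearQuarantine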

Conclusion

Now this sounds pretty good, doesn’t it? And it is, especially if you’re running your VMs on a SOFS share. Then the VMs will remain online during the Isolation / Unmonitored phase, but when you have “traditional” block level storage they won’t. They’ll go into Paused-Critical mode, as in that design you have lost access to the CSV. Now, if you ever needed yet another reason to move to a Scale-Out File Server & SMB 3 to deliver storage for your VMs, I have just given you one! Hey storage vendors … how is that full SMB 3 feature stack coming along on your storage arrays? Or do you really just want us to abstract you away behind a Windows SOFS cluster?

Subhasish Bhattacharya has blogged about this as well here. It’s a feature we’ll test at length to get a grip on the behavior so we know how the cluster nodes will behave under certain conditions. Trust, but verify is my mantra and it’s way better to figure out how a feature behaves in the lab than having to figure it out based on assumptions when you see it for the very first time in production. Just saying.