A highly redundant Application Delivery Controller Setup with KempTechnologies

Introduction

The goal was to make sure the KempTechnologies LoadMaster Application Delivery Controller was capable of handling the traffic to all load balanced virtual machines in a high volume data and compute environment. Needless to say, the solution had to be highly available.

A highly redundant Application Delivery Controller Setup with KempTechnologies

The environment offers rack and row as failure units in power, networking and compute. Hyper-V cluster nodes are spread across racks in different rows. Networking is highly to continuously available, allowing for planned and unplanned maintenance as well as the failure of switches. All racks have redundant PDUs that are remotely managed over Ethernet. There is a separate out of band network with remote access.

The 2 Kemp LoadMasters are mounted in different rows and different racks to spread the risk and maintain high availability. Eth0 & Eth2 are in an active-passive bond for a redundant management interface, and eth1 is used to provide a secondary backup link for HA. These use the switch independent redundant switches of the rack, which also uplink (VLT) to the Force10 switches (themselves spread across racks and rows). The two 10 Gbps ports are in an active-passive bond to trunked ports of the two redundant switch independent 10 Gbps switches in the rack, so we also have protection against port or cable failures.


A tip: use TRUNK for the port mode, not GENERAL, with DELL switches.

This design gives us a lot of capabilities. We have redundant networking for all networks and an active-passive pair of LoadMasters, which means:

  • Failover when the active one fails
  • Non-service-interrupting firmware upgrades
  • The rack is the failure domain. As each rack is in a different row we also mitigate “localized” issues (power, maintenance affecting the rack, …)

Combine this with the fact that these are bare metal LoadMasters (DELL R320 with iDRAC – see Remote Access to the KEMP R320 LoadMaster (DELL) via DRAC Adds Value) and we have out of band management even when we have network issues. The racks are provisioned with PDUs that are managed over Ethernet, so we can even cut the power remotely if needed to resolve issues.

Conclusion

The results are very good and we get “zero ping loss” failover between the LoadMaster Nodes during testing.

We have a solid, redundant Application Delivery Controller deployment that does not break the switch independent TOR setup that exists in all racks/rows. It’s active-passive at the controller level and active-passive at the network (bonding) level. If that is an issue, the TOR switches should be configured as an MLAG. That would enable LACP for the bonded interfaces. At the LoadMaster level these could be configured as a cluster to get an active-active setup, if the restrictions this imposes are not a concern in your environment.

Important Note:

Some high end switches, such as the Force10 series with VLT, support attaching single homed devices (devices not attached to both members of a VLT). While VLT and MLAG are very similar, MLAGs come with their own needs & restrictions. Not all switches that support MLAG can handle single homed devices. The obvious solution is not to attach single homed devices, but that is not always a possibility with certain devices. That means other solutions are needed, which could lead to a significant rise in the number of switches required, defeating the economics of affordable redundant TOR networking (cost of switches, power, rack space, operations, …), or to leveraging MSTP and configuring a dedicated MSTP network for a VLAN, which also might not always be possible or feasible to solve the issue. Those single homed devices might very well need to be in the same VLANs as the dual homed ones. Stacking would also solve the above issue, as the MLAG restrictions do not apply. I do not like stacking, however, as it breaks the switch independent redundant network design; even during planned maintenance, a firmware upgrade brings down the entire stack.

One thing that is missing is the ability to fail over when the network fails. There is no concept of a “protected” network. Such a feature could help mitigate issues: when a virtual service is down due to network problems, the LoadMaster could fail over to see if we have more success on the other node. For certain scenarios this could prevent long periods of downtime.

Accelerated Checkpoint merging with ReFS v2 in Windows Server 2016

Introduction

This blog post is a teaser where we show you some of the results we have seen with ReFS v2 in Windows 2016 (TPv4). In a previous blog post (Lightning Fast Fixed VHDX File Creation Speed With ReFS on Windows Server 2016) we demonstrated the very fast VHDX file creation capabilities we got with ReFS v2. Now we look at another benefit of ReFS v2 in a Hyper-V environment, thanks to a feature of ReFS v2 called block cloning: we get accelerated checkpoint merging with ReFS v2 in Windows 2016.

The Demo

For this short demo we have a virtual machine running Windows Server 2016. It resides on a CSV formatted with ReFS (64K allocation unit size). Inside the virtual machine there is a second data disk. Our VM, called CheckPointReFS, has this data volume formatted with ReFS (64K allocation unit size) and runs on the ReFS formatted CSV. The disks in this test are fixed sized VHDX files.
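For reference, preparing such a fixed size VHDX with a 64K ReFS volume from the host can be done with a handful of Hyper-V and Storage cmdlets. This is just a minimal sketch; the path, size and volume label are illustrative, not the exact values used in this demo.

# Create a fixed size VHDX, mount it, and format it with ReFS using a 64K allocation unit size
New-VHD -Path 'C:\ClusterStorage\Volume1\CheckPointReFS\Data.vhdx' -SizeBytes 100GB -Fixed |
    Mount-VHD -Passthru |
    Initialize-Disk -Passthru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem ReFS -AllocationUnitSize 65536 -NewFileSystemLabel 'Data' -Confirm:$false

# Detach it again so it can be attached to the virtual machine
Dismount-VHD -Path 'C:\ClusterStorage\Volume1\CheckPointReFS\Data.vhdx'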

On the data volume we have about 30GB worth of ISO files. We checkpoint the VM and then create a copy of those files on the data volume.
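On the host, that checkpoint is a single cmdlet; a hedged sketch with an illustrative checkpoint name (the copy itself is just a plain file copy of the ISO folder inside the guest):

# Checkpoint the test VM before duplicating the ISO files on its ReFS data volume
Checkpoint-VM -Name 'CheckPointReFS' -SnapshotName 'Before ISO copy'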


We then delete this checkpoint.


Via the events 19070 (start of a background disk merge) and 19080 (completion of a background disk merge) in the Microsoft-Windows-Hyper-V-VMMS/Admin logs we calculate the time this took: 5 seconds.
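Deleting the checkpoint and looking up those two events can be scripted roughly as follows; a minimal sketch, assuming the checkpoint name used above and the Microsoft-Windows-Hyper-V-VMMS-Admin event channel:

# Delete the checkpoint; Hyper-V kicks off a background merge of the differencing disks
Remove-VMSnapshot -VMName 'CheckPointReFS' -Name 'Before ISO copy'

# Grab the latest start (19070) and completion (19080) events of the background disk merge
$merge = Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
    Id      = 19070, 19080
} -MaxEvents 2 | Sort-Object TimeCreated

# The difference between both timestamps is how long the merge took
New-TimeSpan -Start $merge[0].TimeCreated -End $merge[1].TimeCreated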



There are moments you just have to say “WOW”. Really, this rocks and it’s amazing. So amazing I figured I had made a mistake, so I ran it again … 4 seconds. WOOHOO! What were the times you saw when you last deleted a large checkpoint?

I am really looking forward to doing more testing with the ReFS v2 capabilities with Hyper-V on Windows 2016.

Creating a bootable VHD or VHDX from an existing one

Creating a bootable VHD or VHDX from an existing one is a great capability to have. There are a couple of reasons why one might need or want to do this. In Windows 2012 (R2) this is even a part of normal live migration operations. Storage live migration, for example, is nothing but the live streaming of the data of your live virtual hard disk into a new VHD/VHDX. You have multiple options when it comes to creating a bootable VHD/VHDX from an existing one and they all serve their specific purposes, which might or might not overlap.
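As an illustration, such a storage live migration can be started from PowerShell with Move-VMStorage; a quick sketch where the VM name and destination path are made up:

# Live migrate the running VM's virtual hard disks and configuration to another location;
# under the hood the data of the live VHDX is streamed into a new copy at the destination
Move-VMStorage -VMName 'DemoVM' -DestinationStoragePath 'C:\ClusterStorage\Volume2\DemoVM'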

This is great stuff for doing migrations, reorganizing storage, defragmenting your internal dynamic VHDX structure, etc. But you’re not limited to those options. When you want to convert from VHD to VHDX you’ll leverage Convert-VHD. You can also create a new VHDX with an old one as the source with New-VHD. Great for all kinds of operations, including offline migration, updates, testing on exact copies of the original disk, etc. You might think it’s better to just copy the disk, but for a conversion that will not work, and it won’t deal with internal fragmentation, which can be important for performance testing when you’re migrating to new storage, a new cluster & Hyper-V version and such.
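Hedged examples of both (the paths are illustrative, the source VM must be shut down for the conversion, and for the “new VHDX from an old one” scenario I’m sketching a differencing disk that uses the old VHDX as its parent):

# Convert an existing VHD into a fixed size VHDX
Convert-VHD -Path 'D:\VMs\OldServer\OldServer.vhd' -DestinationPath 'D:\VMs\OldServer\OldServer.vhdx' -VHDType Fixed

# Create a new VHDX based on an existing one, here as a differencing disk,
# handy for testing against an exact copy without touching the original
New-VHD -Path 'D:\Test\OldServer-Test.vhdx' -ParentPath 'D:\VMs\OldServer\OldServer.vhdx' -Differencing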

Recently people asked me if this would work with their OS disk, the virtual disk that they boot from. Yes, that will work. Both New-VHD and Convert-VHD will create a fully bootable new virtual disk if the source virtual disk was bootable to begin with. No problem; they have to, if you think about it. Using Convert-VHD to move from VHD to VHDX, and even change the cluster sizes of the disk, would be no good if the VM doesn’t boot anymore. Likewise with New-VHD.

The only thing that needs some real tender loving care is when you convert a VM from generation 1 to generation 2. The script provided to do that by John Howard (MSFT) uses fully supported technologies. The script itself is not a supported product, but you’re not doing anything unsupported with it.

So, to all people needing to convert, defrag or move VMs to new virtual hard disks: do a few tests to verify your assumptions and go forward. Step into that bright new future you’ve been missing out on for the past 3 years.

CryptoWall 3.0 Strikes Too Close for Comfort

Instead of testing Windows Server 2016 TPv4 a bit more during “slow” hours, we got distracted from that a bit: CryptoWall 3.0 struck too close for comfort. Last week we, my team and I, had the distinct displeasure of having to tackle a “ransomware” infection inside a business network. Talk about petting a burning dog.

We were lucky on a few fronts. The anti malware tools caught the infection in the act and shut it down. We went from zero to 100 miles per hour and had the infected or suspect client systems ripped off the network and confiscated. We issue a brand new imaged PC in such incidents. No risks are taken there.

Then there was a pause … anything to be seen on the anti malware tools? Any issues being reported?  Tick tock … tick tock … while we were looking at the logs to see what we were dealing with. Wait Out …

Contact! The first reports came in about issues with opening files on the shares and soon the service desk found the dreaded images in subfolders on those shares.


Pucker time as we moved to prevent further damage and started a scan & search for more encrypted files and evidence of damage. I’m not going to go into detail about what, why, when and how. As in all fights, you have to fight as you are. No good wishing for better defenses, tools, skills or training. At that moment you do what you think you need to do to contain the situation, clean up, restore data and hope for the best.

What can I say? We got lucky. We did our best. I’d rather not have to do that again. We have multiple types of backup & restore capabilities and that was good. But you do not want to call all data lost beyond a point and start restoring dozens of terabytes of corporate data to a last known good state without any insight into the blast radius and fallout of that incident.

The good thing was our boss was on board to do what needed and could be done and let us work. We tried to protect our data while we started the cleanup and restores where needed. It could have been a lot uglier, costlier and potentially deadly. This time our data protection measures saved the day, and at least 2 copies of those were safe from infection. Early detection and response were key. The rest was luck.

CryptoWall moves fast. It attempts to find active command and control infrastructure immediately. As soon as it gets its public key from the command and control server, it starts using it to encrypt files. The private key is securely hidden behind “a pay wall” somewhere in a part of the internet you don’t want to know about. All that happens in seconds. Stopping that is hard. Being fast limits damage. Data recovery options are key. Every day people are being trapped by phishing e-mails with malicious attachments, drive-by downloads on infected websites or even advertisement networks.

Read more on CryptoWall 3.0 here https://www.sentinelone.com/blog/anatomy-of-cryptowall-3-0-a-look-inside-ransomwares-tactics/  Details on how to protect and detect depend on your anti malware solution. It’s very sobering, to say the least.

It makes me hate corporate apps that require outdated browsers even more. Especially since we’ve been able to avoid that till now. But I know all too well that forces are at work to introduce those downgrade browsers with “new” software. Insanity at its best.