Windows Server 2012 Cluster in a Box as a New Form Factor?

Let’s look at “Cluster in a Box” (CiB) as a building block or a form factor. Let’s say you’ve committed to building a private/hybrid cloud for your organization but you’re at the end of your hardware life cycle or you just don’t have the capacity right now to build it. What options do you have? Do you want to acquire storage, data connectivity network gear, servers, NICs, etc. or will you just buy CiB blocks to scale out as you go? Perhaps you’ll buy a Hyper-V Fast Track solution or, if you’re really big, one or more containers.

I do think that the modular principle throughout the data center is pretty cool. The industry has done a great job at this with servers and smaller components as well as with the modular containers by SUN, HP, DELL.

While I do like and admire the concept of the “shipping container form factor”, I do find it a couple of sizes too large to be practical for most of us. After all, let’s face it, we’re not all building public cloud service data centers. This means that between what we have seen today with server & storage modularity and the container form factor we’ve got a void. While some of these voids have been filled for specific applications like Exchange 2010 through custom-built solutions by some vendors, you cannot call this modular. It is a very application-specific solution. The other, more generic, solution that has existed for a while now is the hardware that vendors deliver with the Hyper-V Fast Track we’ve mentioned already. While these are nice, pre-configured solutions, they are, again, not very modular. It’s not a complete unit that just needs to be hooked up to the network and provisioned with power. The time is ripe, with the current state of Microsoft Windows Server 2012, to fill that void using the “Cluster in a Box” form factor. That would mean that in the future we could have the same benefits as the big players but at a size that’s fit for our purposes in the smaller data centers. This opens up a lot of scenarios for better efficiency.

What if the entire unit shipped to a customer contains everything packed away internally? That is servers, networking and storage. You just have to mount it in a rack, connect it to redundant power outlets and to redundant network paths. That’s it. Just power it up, fill out the wizard and be done with it. That’s all it takes to have a functional Hyper-V, Scale Out File Server, SQL Server cluster etc. With the capabilities delivered by Windows Server 2012 this could very well be a scenario that might evolve. It’s more than just a business in a box or a branch office in a box. It can also be more than the Scale Out File Server unit for a private cloud solution. It just might be the first step of a new form factor building block for medium to even some large enterprises. If the economies are too good to be ignored I think this might happen.

The reason I think that this concept will work is that we have virtual machine mobility now, so we no longer need to fear the isolation that silos might create. As a matter of fact this is a key element that might drive this. For the applications that are less suited for virtualization today we see two solutions. One is in the scalability of the Hyper-V platform with Windows Server 2012 and the other is the fact that the shared nothing approach is gaining popularity. It started with Exchange 2010 but is now also available with SQL Server 2012.

These clusters in a box can be made with existing servers (blades or not), storage and switches, but I think there will also be new designs that are purpose-built and not just existing hardware in a “rackable” box as in my drawings below. Those boxes might have some scale up capability or come in different sizes.

image

But scale out is the way that would make this work in the bigger environments, whatever the size of the Cluster in a Box.

image

Some SAN Storage Fun

At the end of this day I was doing some basic IO tests on some LUNs on one of the new Compellent SANs. It’s amazing what 10 SSDs can achieve … We can still beat them in certain scenarios but it takes 15 times more disks. But that’s not what this blog post is about. This is about goofing off after 20:00 following another long day in another very long week. It’s about kicking the tires of Windows and the SAN now that we can.

For fun I created a 300TB LUN on a DELL Compellent, thin provisioned of course, as I only have 250TB.

I then mounted it to a Windows 2008 R2 test server.

image

The documented limit of a volume in Windows Server 2008 R2 is 256TB when you use a 64K allocation unit size. So I tested this limit by trying to format the entire LUN and create a 300TB simple volume. I brought it online, initialized it as a GPT disk, created a simple volume with an allocation unit size of 64K and, well, that failed with the following error:

Failed Format300TB

There is nothing unexpected about this. This has to do with the maximum NTFS volume size supported on a GPT disk. It depends on the cluster size that is selected at the time of formatting. NTFS is currently limited to 2^32-1 allocation units. This yields a 256TB volume, using 64k clusters. However, this has only been tested to 16TB, or 17,592,186,040,320 bytes, using 4K cluster size. You can read up on this in Frequently asked questions about the GUID Partitioning Table disk architecture. The table below shows the NTFS limits based on cluster size.

image
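The arithmetic behind those limits is easy to check yourself. Here is a minimal sketch that assumes nothing more than the 2^32-1 allocation unit limit quoted above; the function names are my own and purely illustrative.

```python
# Maximum NTFS volume size per cluster size, assuming the limit of
# 2**32 - 1 allocation units (clusters) quoted in the text.
MAX_CLUSTERS = 2**32 - 1


def max_ntfs_volume_bytes(cluster_size_bytes: int) -> int:
    """Largest volume, in bytes, addressable with the given cluster size."""
    return MAX_CLUSTERS * cluster_size_bytes


def to_tb(size_bytes: int) -> float:
    """Convert bytes to TB, using binary units (1TB = 2**40 bytes)."""
    return size_bytes / 2**40


if __name__ == "__main__":
    for kib in (4, 8, 16, 32, 64):
        size = max_ntfs_volume_bytes(kib * 1024)
        print(f"{kib:>2}K clusters: {size:,} bytes (about {to_tb(size):.2f}TB)")
```

With 4K clusters this gives the 17,592,186,040,320 bytes mentioned above, and with 64K clusters it lands just under 256TB, which is why formatting the full 300TB LUN as one volume could never work.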

This was the first time I had the opportunity to test these limits. I formatted part of that LUN to a size close to the limit and then formatted the remainder as a second simple volume.

image

I still need to get a Windows Server 2012 test server hooked up to the SAN to see if anything has changed there. One thing is for sure, you could put at least three 64TB VHDX files on a single volume in Windows. Not too shabby. It’s more than enough to get just about any backup software into trouble. Be warned, MSFT tested and guarantees performance & behavior up to 64TB in Windows Server 2012, but beyond that you’d better do your own due diligence.

The next thing I’ll do when I have a Windows Server 2012 host hooked up is create a 64TB VHDX file and see if I can go beyond it before things break. Why? Well, because I can and I want to take the new SAN and Windows Server 2012 for a ride to see what boundaries we can push. The SANs are just being set up so now is the time to do some testing.

Windows Server 2012 with Hyper-V & The New VHDX Format Leads The Way

Introduction

Whether you realize it or not, our trusted old VHD format is getting a bit long in the tooth. As a matter of fact it has been around since the last century. It has served us well but now it needs a major overhaul to better serve us at present and to prepare us for the decades to come. We (at least in the environments I support) see a continuing demand for bigger virtual disks & ever better performance. This should be no surprise. Not only does the amount of data produced keep going up year after year but we’re virtualizing more very resource intensive workloads than ever. Think image intensive data that has to be processed by number crunching virtual machines or large databases like SQL Servers. Sure, 64 vCPUs and 1TB of memory are great and impressive but we also need loads of fast and ever more reliable storage. Trying to serve and support these needs with combined 2TB disks is very cumbersome (to be polite) and pass-through disks take away a lot of the flexibility & options the VHD format gives us. So here comes the new VHDX format. There is no back porting here, the only OS at the moment that supports VHDX is Windows Server 2012. The good news here is that we have in-box tools to convert between VHD & VHDX.

Bigger, Better & Faster

Size

The VHDX format supports up to 64TB now. Yes, that is 32 times more than the current VHD. As a matter of fact a lot of SANs still in use today won’t give you that size of LUN. Is there a need for this? Well, I work in some places with huge files in massive amounts, so I can use big LUNs and large data VHDX files. Concatenating disks is something I do not like to do. Come upgrade/maintenance/renewal time that one bites too much for comfort.

There are also some other virtual disk formats that need to wake up and break that 2TB size boundary. Especially when Microsoft states that 64TB is not a hard limitation of the file format. By that they mean they have room to increase it. Wow!

Protection Against Disk Corruption

The VHDX format also provides corruption protection during power failures for the VHDX files. This is done by a logging mechanism for the updates of the VHDX metadata structures. The logging mechanism is contained within the VHDX file, so you won’t have to worry about managing log files. The overhead is minimal, as they only log metadata such as block allocations and block state updates and NOT the actual data stored. So no, it has not become a database you need to manage, don’t worry. The protection works only for the VHDX file and not the data that is written to it. That job falls to NTFS or ReFS. What we discussed here was protection against VHDX file corruption.
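To make the idea concrete, here is a toy sketch of metadata logging. It is not the actual VHDX log layout; the class and method names are made up for illustration. It only shows the principle: record the intended metadata change first, apply it afterwards, and replay the log if a crash happened in between.

```python
class ToyVirtualDisk:
    """Toy model of metadata logging; NOT the real VHDX on-disk format."""

    def __init__(self):
        self.block_table = {}  # committed metadata: virtual block -> file offset
        self.log = []          # logged metadata updates not yet applied

    def log_allocation(self, block: int, offset: int) -> None:
        # Step 1: record the intended metadata change in the log first.
        self.log.append((block, offset))

    def apply_log(self) -> None:
        # Step 2: apply the logged changes to the metadata, then clear the log.
        for block, offset in self.log:
            self.block_table[block] = offset
        self.log.clear()

    def replay_after_crash(self) -> None:
        # If power was lost between step 1 and step 2, replaying the log
        # when the file is opened again brings the metadata up to date.
        self.apply_log()


if __name__ == "__main__":
    disk = ToyVirtualDisk()
    disk.log_allocation(block=0, offset=4096)
    # Simulated power failure here: the block table was never updated.
    disk.replay_after_crash()
    print(disk.block_table)
```

Note that only the allocation (metadata) goes through the log in this sketch, not the data itself, which matches the low overhead described above.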

The Need For Speed

With VHDX we also get larger block sizes up to 256MB for dynamic & differencing disks, meaning they perform better with workloads that allocate in larger chunks.

Modern Large Sector Disks

We get support to run VHDX on large sector disks without losing performance.

I refer you to KB articles Using Hyper-V with large sector drives on Windows Server 2008 and Windows Server 2008 R2 and Information about Microsoft support policy for large-sector drives in Windows.

As you can read there, the performance hit for both non-fixed VHDs and applications is pretty bad. The 512e (4K physical and 512-byte logical sector size) approach is bad due to the Read-Modify-Write (RMW) process overhead in dynamic & differencing disks. 4K native (4K logical sector size) just isn’t supported by Hyper-V before Windows Server 2012. The maximum logical & physical sector size is now 4KB and that means that we get a lot better performance when running applications that are designed to use 4KB workloads in Hyper-V 3.0. VHDX structures are aligned on MB boundaries, so the need for the RMW from the disk is eliminated if the physical sector size of the virtual disk is set to 4K.

image
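A small sketch of why alignment matters on a 4K physical sector disk. This is a simplified model under my own assumptions (the function names are illustrative, and real disks and drivers do more than this): a write that only partially covers a physical sector forces the disk to read the sector, modify it and write it back.

```python
PHYSICAL_SECTOR = 4096  # 4K physical sector size, as on 512e and 4K native disks


def physical_sectors_touched(offset: int, length: int,
                             physical: int = PHYSICAL_SECTOR) -> range:
    """Physical sector numbers covered by a write of `length` bytes at `offset`."""
    first = offset // physical
    last = (offset + length - 1) // physical
    return range(first, last + 1)


def needs_read_modify_write(offset: int, length: int,
                            physical: int = PHYSICAL_SECTOR) -> bool:
    """True if the write only partially covers at least one physical sector."""
    return offset % physical != 0 or length % physical != 0


if __name__ == "__main__":
    # A single 512-byte logical sector write (the 512e case) needs RMW.
    print(needs_read_modify_write(offset=512, length=512))
    # A 4K write on a 1MB boundary covers whole physical sectors: no RMW.
    print(needs_read_modify_write(offset=1024 * 1024, length=4096))
```

Structures that start on MB boundaries and IO sized in multiples of 4K always fall in the second category, which is the point of the VHDX alignment mentioned above.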

Storing Custom Metadata

We also get the ability to store custom metadata in the VHDX file for information we find relevant. This could be about what’s on there, OS version or patches applied. This custom data is stored using key/value pairs that support up to 1024 entries of 1MB. That should be adequate for a while.

VHDX Leverages Offloaded Data Transfer (ODX)

The virtual stack allows ODX requests from the guest to flow down all the way to the hardware and as such VHDX operations benefit from this as well. Examples of this are:

  • Creating VHDX files, even such large ones, has gotten a whole lot faster. Especially if you can offload this to the SAN. If your storage vendor supports ODX then you’re in VHDX creation speed heaven! As a bonus even VHD files created in Windows Server 2012 benefit from this technology.
  • On top of that Merge & Mirror operations are also offloaded to the hardware, which is great for merging snapshots or live storage migration.
  • In the future the virtual machines themselves might/will be able to pass through offload operations. This is hard core stuff and due to the file layout far from trivial.

Please note that this only works with SCSI attached VHDX files. IDE devices have no ODX support capabilities.

TRIM/UNMAP Support

With Windows Server 2012 / VHDX we get what is described in the documentation as “Efficiency in representing data (also known as ‘trim’), which results in smaller file size and allows the underlying physical storage device to reclaim unused space. (Trim requires physical disks directly attached to a virtual machine or SCSI disks in the VM, and trim-compatible hardware.)” It also requires Windows Server 2012 on hosts & guests.

It’s a major benefit in the “Stay Thin” philosophy associated with thin provisioning. No more running “sdelete” in your Windows VMs (tedious, slow, resource intensive) or installing an agent (less tedious) to support reclaiming space. This is important to many of us and this level of support and integration makes our lives a lot easier & speeds things up. So choose your storage wisely.

TRIM is the specification for this functionality by Technical Committee T13, which handles all standards for ATA interfaces. UNMAP is the Technical Committee T10 specification for this and is the full equivalent of TRIM but for SCSI disks. UNMAP is used to remove physical blocks from the storage allocation in thinly provisioned Storage Area Networks. My understanding is that what is used on the physical storage depends on what storage it is (SSD/SAS/SATA/NL-SAS or SAN with one or all of the above).

Basically VHDX disks report themselves as thin provisioning capable. That means that any deletes as well as defrag operations in the guests will send down “unmaps” to the VHDX file. These are used to ensure that block allocations within the VHDX file are freed up for subsequent allocations, and the same requests are forwarded to the physical hardware, which can reuse the space for its thin provisioning purposes. This means that a VHDX will only consume storage for really stored data & not for the entire size of the VHDX, even when it is a fixed one. You can see that not just the operating system but also the application/hypervisor that owns the file systems on which the VHDX lives needs to be TRIM/UNMAP aware to pull this off. It is worth noting that this means it only works on the SCSI attached storage in the virtual machine, not on IDE connected VHDX disks.
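As an illustration of the “Stay Thin” idea, here is a toy model of a thin provisioned disk. The class and its methods are hypothetical and have nothing to do with the real VHDX or SAN implementations; the sketch only shows that blocks consume space once written and give it back when an unmap arrives.

```python
class ThinDisk:
    """Toy thin provisioned disk: space is consumed only by written blocks."""

    def __init__(self, virtual_blocks: int):
        self.virtual_blocks = virtual_blocks  # the size the guest sees
        self.allocated = {}                   # block number -> data actually stored

    def write(self, block: int, data: bytes) -> None:
        if not 0 <= block < self.virtual_blocks:
            raise IndexError("block outside the virtual disk")
        self.allocated[block] = data          # allocate on first write

    def unmap(self, block: int) -> None:
        # The guest deleted the data: release the backing allocation.
        self.allocated.pop(block, None)

    def read(self, block: int) -> bytes:
        # Blocks without backing storage read back as zeroes in this model.
        return self.allocated.get(block, b"\x00")

    @property
    def consumed_blocks(self) -> int:
        return len(self.allocated)


if __name__ == "__main__":
    disk = ThinDisk(virtual_blocks=1_000_000)
    disk.write(1, b"a")
    disk.write(2, b"b")
    disk.unmap(1)  # a delete in the guest hands the space back
    print(disk.consumed_blocks, "of", disk.virtual_blocks, "blocks consumed")
```

Without the unmap call the deleted block would stay allocated forever, which is exactly the problem “sdelete” runs and reclaim agents were working around.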

Closing Thoughts On The Future Proof VHDX Format

For anyone interested in developing against the VHDX format, the specifications will be published. So that’s good news for ISVs, big and small. For all the reasons mentioned above I’m a fan of the VHDX format and it’s yet one more reason to go full speed ahead with testing Windows Server 2012 so we can move forward fast and reap the benefits of reliability & scalability without sacrificing performance.

Shared Nothing Live Migration White Board Time – Scenario I

The Problem

Let’s say you are very happy with your SAN. You just love the snapshots, the thin provisioning, deduplication, automatic storage tiering, replication, ODX and the SMI-S support. Life is good! But you have one annoying issue. For example, to get the really crazy IOPS for your SQL Server 2012 DAG nodes you would have to buy 72 SSDs to add to your tier 1 storage in that SAN. That’s a lot of money if you know the price range of those. But perhaps you don’t even have SSDs in your SAN. To get the required amount of IOPS from your SAN with SAS or NL-SAS disks in the second and third storage tiers respectively, you would need to buy a ridiculous amount of disks and, let’s face it, waste that capacity. Per IOPS that becomes a very expensive and unrealistic option.
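To get a feel for what “a ridiculous amount of disks” means, here is a back-of-the-envelope sketch. The per-disk IOPS figures are rough rule-of-thumb assumptions of mine, not measurements or vendor numbers, and the calculation ignores RAID write penalties, caching and controller limits.

```python
import math

# Assumed random IOPS per disk, for illustration only. Plug in your own numbers.
ASSUMED_IOPS_PER_DISK = {
    "SSD": 5000,
    "15K SAS": 180,
    "7.2K NL-SAS": 80,
}


def disks_needed(target_iops: int, iops_per_disk: int) -> int:
    """Minimum number of disks whose combined raw IOPS reach the target."""
    return math.ceil(target_iops / iops_per_disk)


if __name__ == "__main__":
    target = 250_000  # hypothetical IOPS target
    for disk_type, iops in ASSUMED_IOPS_PER_DISK.items():
        print(f"{disk_type}: {disks_needed(target, iops)} disks")
```

Even with generous assumptions the spindle counts run into the thousands, and most of the capacity you buy along with them goes unused.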

Some SSD only SAN vendors will happily sell you a SAN that addresses the high IOPS need to help out with that problem. After all that is their niche, their unique selling point: fixing IOPS bottlenecks of the big storage vendors where and when needed. This is a cheaper solution per IOPS than your standard SAN can deliver but it’s still a lot of money, especially if you need more than a couple of terabytes of storage. Granted, they might give you some extra SAN functionality you are used to, but you might not need that.

Yes, I know there are people who say that when you have such needs you also have the matching budgets. Maybe, but what if you don’t? Or what if you do but you can put 500.000 € towards another need or goal? Your competitive advantage for pricing your products and winning customers might come from that budget.

Creative Thinking or Nuts?

Let’s see if we can come up with a home grown solution based on Windows Server 2012 Hyper-V. If we can, this might solve your business need, save a ton of money and extend (or even save) the usefulness of your SAN in your environment. The latter is possible because you successfully eliminated the biggest disk IO from your SAN.

The Solution Scenario

So let’s build 3 Hyper-V hosts, non-clustered, each with its own local SAS based storage with commodity SSD drives. You can use either storage pools/spaces with a non-RAID SAS HBA or use a RAID SAS HBA with controller based virtual disks for this. If you’ve seen what Microsoft achieved with this during demos you know you can easily get to hundreds of thousands of IOPS. Let’s say you achieve half of what MSFT did in both IOPS and throughput. Let’s just put a number on it => that’s about 500.000 IOPS and 5GB/s. Now reduce that for overhead of virtualization, the position of the moon and the fact things turn out a bit less than expected. So let’s settle for 250.000 IOPS and 2.5GB/s. Anybody here who knows what these kinds of numbers would cost you with the big storage vendors’ SANs? Right, case closed. Don’t just look at the cost, put it into context and look at the value here. What does and can your SAN do and at what cost?
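The white board numbers above can be sanity checked with a few lines. This is only a sketch: the 0.5 derating factor mirrors the halving in the text, decimal units (1GB = 10^9 bytes) are my assumption, and the function names are illustrative.

```python
def derate(value: float, factor: float = 0.5) -> float:
    """Scale a best-case figure down by a safety factor."""
    return value * factor


def implied_io_size_bytes(throughput_bytes_per_s: float, iops: float) -> float:
    """Average IO size implied by a throughput and IOPS pair."""
    return throughput_bytes_per_s / iops


if __name__ == "__main__":
    iops = derate(500_000)   # 500.000 IOPS before overhead
    throughput = derate(5e9)  # 5GB/s before overhead
    size = implied_io_size_bytes(throughput, iops)
    print(f"{iops:,.0f} IOPS at {throughput / 1e9:.1f}GB/s "
          f"implies about {size / 1000:.0f}KB per IO")
```

The point of the last function is that IOPS and throughput are tied together by the IO size, so when you size this for your own workload, check both against the block sizes your application really uses.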

OK, we lose some performance due to the virtualization overhead. But let’s face it. We can use SR-IOV to get the very best network performance. We have hundreds of thousands of IOPS. All the cores on the hosts are dedicated to a single virtual machine running a SQL Server DAG node and, bar 4GB of RAM for the OS, we can give all the RAM in the hosts to the VM. This one VM to one host mapping delivers a tremendous amount of CPU, memory, network and storage capabilities to your SQL Server. This is because it gets exclusive use of the resources on the host, bar those that the host requires to function properly.

In this scenario it is the DAG that provides high availability to the SQL Server database. So we do not mind losing shared storage here.

image

Because we have virtualized the SQL Server you can leverage Shared Nothing Live Migration to move the virtual machines with SQL Server to the central storage of the SAN without down time if the horsepower is no longer needed. That means that you might migrate another application to those standalone Hyper-V hosts. That could be a disk IO intensive application that is perhaps load balanced in some way, so you can have multiple virtual machines mapped to the hosts (1 to 1, many to one). You could even automate all this and use the “Beast” as a dynamic resource based on temporal demands.

In the case of the SQL Server DAG you might opt to keep one DAG member on the SAN so it can be replicated and backed up via snapshot or whatever technology you are leveraging on that storage.

Extend to Other Use Cases

More scenarios are possible. You could build such a beast as a Scale Out File Server, or with PCI RAID/Shared SAS, if you need shared storage to build a Hyper-V cluster when your apps require it for high availability.

image

The latter looks a lot like a cluster in a box actually. I don’t think we’ll see a lot of iSCSI in cluster in a box scenarios. SAS might be the big winner here up to 4 nodes (without a “SAS switch”, which brings even “bigger” scenarios to life with zoning, high availability, active cables and up to 24Gbps of bandwidth per port).

Using a SOFS means that if you also use SMB 3.0 support with your central SAN you can leverage RDMA for shared nothing live migration, which could help out with potentially very large VHDs of your virtual SQL Servers.

Please note that the big game changer here compared to previous versions of Windows is Shared Nothing Live Migration. This means that now you have virtual machine mobility. High performance storage and the right connectivity (10Gbps, Teaming, possibly RDMA if using SMB 3.0 as source and target storage) mean we no longer mind having separate storage silos that much. This opens up new possibilities for alleviating IOPS issues. Just size this example to your scenarios & needs to think about what it can do for you.
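To size the connectivity for your own scenario, a rough estimate of how long moving the virtual disks takes can help. This is a minimal sketch; the 70% link efficiency is an arbitrary assumption of mine to account for protocol overhead and other traffic, and it ignores storage speed on both ends.

```python
def transfer_seconds(size_bytes: float, link_gbps: float,
                     efficiency: float = 0.7) -> float:
    """Estimated time to copy `size_bytes` over a link of `link_gbps` Gbps.

    `efficiency` is the assumed fraction of the raw link speed that is
    actually available for the copy.
    """
    usable_bits_per_s = link_gbps * 1e9 * efficiency
    return size_bytes * 8 / usable_bits_per_s


if __name__ == "__main__":
    vhd_size = 2 * 10**12  # a hypothetical 2TB virtual disk
    for gbps in (1, 10):
        minutes = transfer_seconds(vhd_size, gbps) / 60
        print(f"{gbps:>2}Gbps: about {minutes:.0f} minutes")
```

The virtual machine keeps running during the move, so this is about how long the migration occupies the network, not about down time.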

Disclaimer: This is white board thinking & design, not a formal solution. But I cannot ignore the potential and possibilities. And for the critics: no, this doesn’t mean that I’m saying modern SANs don’t have a place anymore. Far from it, they solve a lot of needs in a lot of scenarios, but they do have some (very expensive) pain points.