DELL PowerEdge R730 Improves Boot Times

The DELL generation 13 servers are blazingly fast and capable servers. That’s has been well documented by now and more and more people are experiencing it themselves. These are my current preferred servers due to the best value in the market for hard core, no nonsense, high performance virtualization with Hyper-V.

They also have better boot/reboot speeds than the previous generations with UEFI.  We noticed this during deployment and testing. So we decided to informally check how much things have improved.

Using the DELL DRAC8 We test the speed form Windows Server restart …

image

… over the various boot phases …

image

… to the visual appearance of the logon screen

image

So now let’s quickly compare this for a DELL PowerEdge R720 and a PowerEdge R730. Bothe with the same amount of memory, cards, controllers etc. None of these servers had VMS running or another workload at the time of restart.

For the R720 this gave us:

image

and the results for a Windows initiated server restart on a DELL PowerEdge 730 with EUFI boot is:

image

This was reproducible. So we can see that we EUFI boot times have decrease with about 30%. I like that. You might think this is not important but it adds up during trouble shooting or when doing Cluster Aware Updates of a large 16+ node cluster.

Now thing are beginning to look even better as vNext of Windows has this feature call “Soft Restart” which should help us cut down on boot times even more when possible. But that’s for another blog post.

Live Migration Speed Check List – Take It Easy To Speed It Up

When configuring live migrations it’s easy to go scrounge on all the features and capabilities we have in Windows Server 2012 R2.

There is no one stopping you configuring 50 simultaneous live migrations. When you have only one, two or even four 1Gbps NICs at your disposal,  you might stick to 1 or 2 VMs per available 1Gbps. But why limit yourself if you have one or multiple 10Gbps pipes or bigger ready to roll? Well let’s discuss a little what happens when you do a live migration on a Hyper-V cluster with CSV storage. Initiating a live migrations kicks of a slew of activities.

  1. First it is establish form where (aka the source host) to where we are migrating (aka the target host).
  2. Permissions are checked, are we allowed to do this?
  3. Do we have enough memory on the target to do this? If so allocate that memory.
  4. Set up a skeleton VM on the target host that is a perfect copy of the source VM’s  specifications and configure dependencies on the target host.
  5. Let’s see if we can get a network connection set up and running. If that works, we’re cool and can now transfer the memory.
  6. A bitmap is created to track the changes to the memory pages of the source VM’s pages. Each memory page is copied from the source host to the target host VM during which the memory page is marked clean.
  7. As long as the source VM is running memory is changing, which continues to be tracked in the bitmap and as such that page is mapped as dirty over there. In an iterative process this dirty memory is copied over again and so on. This continues until the remaining dirty memory is minimal. This will take longer if the VM is very memory intensive.
  8. The tiniest amount of not yet copied dirty memory is that part of a VMs state that is copied during “black out”. For this to happen the VM on the source host is paused, the remaining state is copied.
  9. A final check is done to confirm all is well and then the virtual machine is resumed on the target host.
  10. Any remains of the VM on the source host are cleaned up.

That’s actually a lot of work and as you can see copying the state is just part of the process. The more bandwidth & the lower the latency we throw at this part of the process becomes less of the total time spent during live migration.

If you can’t fill of just fill the bandwidth of your 10/40/46Gbps pipe or pipes & you operate at line speed, what’s left as overhead? Everything that’s not actual the copy of VM state. The trick is to keep the host busy so you minimize idle time of the network copies. I.e we want to fill up that bandwidth just right but  not go overboard otherwise  the work to manage a large number of multiple live migrations might actually slow you down. Compare it to juggling with balls. You might be very good and fast at it but when you have to many balls to attend to you’ll get into trouble because you have to spread you attention to wide, i.e. you’re doing more context switching that is optimal.

So tweaking the number of simultaneous live migrations to your environment is the last step in making sure a node is drained as fast as possible. Slowing things down can actually speed things up.  So when you get your 10Gbps or better pipes in production it pays of to test a bit and find the best settings for your environment.

Let’s recap all of the live migration optimization tips I have given over the years and add a final word of advice.  Those who have been reading my blog for a while know I enjoy testing to find what works best and I do tweak settings to get best performance and results. However you have to learn and accept that it makes no sense in real life to hunt for 1% or 2% reduction in live migration speeds. You’ll get one off  hiccups that slow you down more than that.

So what you need to do is tweak the things that matter the most and will get you 99% results?

  • Get the biggest pipe you need & can afford. Bigger pipes are always better than lots of aggregated smaller pipes when it come to low latency & high throughput.
  • Choose the best performance settings Hyper-V offers you. You can choose from TCP/IP,Compression, SMB. Ben Armstrong has a blog post on this Faster Live Migration–Which Option Should You Choose? I’d like to add that you can use NIC teaming for live migration as well and prior to Windows Server 2012 R2 that was the only way to aggregate bandwidth. Now you have more options. I prefer SMB but when I don’t have 10Gbps at my disposal I have found that compression really makes a difference. In my home  lab where I have only 1Gbps, the horror, it stopped me from going crazy Smile (being addicted to 10Gbps).

image

  • Optimize the power settings for your server BIOS if you want an extra speed & smoothness with 10Gbps (less so with 1Gbps). Look here An Early Look At Live Migration Over TCP/IP & Multichannel In Windows Server 2012 R2 Preview, the network traffic is a lot more stable, i.e. a flat line!  In Windows 2008 R2 this was a real need for 10Gbps or you’d be stuck at 16% max.
  • Enable Jumbo Frames for another 15-20%. Thanks to Multi Channel I can visualize this now. See also this blog post Live Migration Can Benefit From Jumbo Frames. The pictures say it all!
  • Figure out the best number of simultaneous live migrations in your environments. Well you just read this blog, so now you know.  Start at 4 and experiment upwards. Tune it back down if the speed deteriorates. The “best” number depends on your environment.

If you do these 5 things you’ll have really gotten the best performance out of your infrastructure that’s possible for live migration. Bar compression, which is not magic either but reducing the GB you need to transport at the cost of CPU cycles, you just cannot push more than 1.25GB/s trough a single 10Gbps pipe and so on. You might keep looking to grab another 1% or 2% improvement left and right  but might I suggest you have more pressing issues to attend to that, when fixed are a lot more rewarding? Knocking 1 or 2 seconds of a 100 second host evacuation is not going to matter, it’s a glitch. Stop, don’t over engineer it, don’t IBM it, just move on. If you don’t get top performance after tweaking these 5 settings you should look at all the moving parts involved between the host as the issue is there (drivers, firmware, cables, switch configurations, …) as you have a mistake or problem somewhere along the way.

Disk to Disk Backup Solution with Windows Server 2012 & Commodity DELL Hardware – Part II

As I blogged in a previous post we’ve been building a Disk2Disk based backup solution with commodity hardware as all the appliances on the market are either to small in capacity for our needs, ridiculously expensive or sometimes just suck or a combination of the above (Virtual Library Systems or Virtual Tape Libraries come to mind, one of my biggest technology mistakes ever, at least the ones I had and in my humble opinion Disappointed smile) .

Here’s a logical drawing of what we’re talking about. We are using just two backup media agent building blocks (server + storage)  in our setup for now so we can scale out.

image

Now in future post I hope to be discussing storage spaces & Windows deduplication thrown into the mix.

So what do we get?

Not to shabby …  > 1TB/Hour

image

To great …

image

In close up you are seeing just 2 Windows 2012 Hyper-V cluster nodes, each being backed up over a native LBFO team of 2*1Gbps NIC ports to one Windows Server 2012 Backup Media Agent with a 10Gbps pipe. Look at the max throughput we got  …

image

Sure this is under optimal conditions, but guess what? When doing backup from multiple hosts to dual backup media servers or more we’re getting very fast backups at very low cost compared to some other solutions out there. This is our backup beast Smile. More bandwidth needed at the backup media server? It has dual port 10Gbps that can be teamed and/or leverage SMB 3.0 multichannel. High volume hosts can use 10Gbps at the source as well.

Lessons learned

  • The Windows 2012 networking improvements rock. Upgrade and benefit from it! We’re seeing great results thanks to Multichannel leveraging RSS and in box NIC teaming (LBFO).
  • A couple of 1Gbps NICS teamed on Windows Server 2012 work really well. Don’t worry about not having 10Gbps on all your hosts.
  • Having 10Gbps on your backup media hosts (target) is great as you’ll be pushing a lot of data to them from multiple (source) hosts.
  • Make sure your backup software supports enough streams before it keels over under the load you’re pushing through. More streams means more concurrent files (read VHDs/VMs) and thus more throughput and allows multichannel to shine over RSS capable NICs.
  • Find the sweet sport for number of disks per node and total IOPS versus the throughput you can send to the backup media agents. 4 Nodes of 50TB might be better than 2 nodes of a 100TB. If you can, experiment a bit to find your optimal backup block size.
  • Isolate your backup network traffic from data traffic either physically or by other means (QOS) and don’t route it all over the place to end up where it needs to be.
  • We’re doing this using Dell PowerConnect 5424 (end of life) /5524 switches … no need for the real  high end very expensive gear to do this. The 10Gbps switch, well yes that’s always high end at the moment.
  • Use JBODS with SAS/Storage spaces & you’ll be fine. Select them carefully for performance. You can use bays like the MD3X00 if you want to replicates the backups somewhere otherwise MD12x0 will do or any other decent JBOD => even cheaper. You can also mix, some building blocks that can replicate & other on Storage Spaces /JBOS. Mix and match with different backup needs means you have flexibility. Note that at TechEd Europe (June 2012), in a session by DELL, they mentioned the need for a firmware update with the MD1200 to optimize performance with Storage Spaces.

It’s all about the money in a smart way!

As I said before, you will not get fired for:

  • Increasing backup throughput at least 4 fold (without dedupe)
  • Increasing backup capacity 3.5 fold (without deduplication)
  • Doing the above for 20% of systems that are replaced & new offerings with specialized appliances (even at hilarious discount rates). That’s CAPEX reduction.
  • This helps pay for the primary storage, DRC site & extra SAN for data replication in case of disaster
  • Make backups faster, more reliable & reduce OPEX (The difference for us is huge and important)
  • Putting an affordable scale up & scale out Disk2Disk backup solution into place to the business can safely handle future backup loads as very acceptable costs.
  • It’s a modular solution which we like. On top of that it’s about as zero vendor lock in as it gets. You can mix servers, bays, switches. Use what you like best for your needs. Only the bays have to remain the same within an individual “building block”.

Cost reduction is one thing but look at what we get whilst saving money… wow!

What am I missing?  Specialized dedupe. Yes, but we’re  going for the poor mans workaround there. More on that later.  As long as we get enough throughput that doesn’t hurt us. And give the cost we cannot justify it + it’s way to much vendor lock in. If you can’t get backups done without, sure you might need to walk that route. But for now I’m using a different path. Our money is better spend in different places.

Now how to get the same economic improvements from the backup software? Disk capacity licensing sucks. But we need a solution that can handle this throughput & load , is reliable, has great support & product information, get’s support for new release fast after RTM (come on CommVault, get a move on) and is simple to use ==> even more money and time saved.

Spin off huge file server project?

Why is support for new releases in backup software important. Because the lack of it is causing me delays. Delays cost me, time, money & opportunities. I’m really interested to covert our large LUN file servers to Windows Server 2012 Hyper-V virtual machines, which I now can rather smoothly thanks to those big VDHX sizes that are possible now and slash the backup times of those millions of small files to pieces by backing them up per VHDX over this setup. But I’ll have to wait and see when CommVault will support VHDX files and GPT disks in guests because they are not moving as fast as a leading (and high cost) product should. Altaro & Veeam have them beaten solid there and I don’t like to be held back.