The Microsoft Management Summit 2013

MMS 2013 is in Las Vegas, Nevada, USA

Time flies and it’s time to look ahead to 2013. My continuing investment in myself is part of that. Despite a lot of rumors about big changes to MMS (its future, location, timing etc.), things will go forward as they have in past years. That includes the location: as you have probably already heard, it’s back in Las Vegas, Nevada, USA. So after the, for many people, somewhat disconcerting announcement at MMS 2012 hinting at those changes, MMS 2013 will once again be held in Las Vegas. As before, it will focus on the entire System Center suite. That was confirmed recently by a mail from the MMS conference team and a TechNet blog post.

image

Recently it was announced that the MMS 2013 content survey is now open. They’re planning the content for the Microsoft Management Summit 2013 and they’d like to hear from us. Why? Well, the better they align the content of the conference with our needs, the better the experience will be. This means our return on investment will be bigger, which is always a good thing. So if you’re going or thinking of going, the MMS 2013 Content Survey is the place to voice your opinions on what the conference should look like content-wise. You have two more weeks to fill it out and then it’s scheduled to close.

Why Attend?

It’s great to have an event focused on managing, deploying and protecting the infrastructure we’ve spent so much time, effort and money building. This conference is dedicated to exactly that. Smaller in scale but very focused. All together in the same hotel/conference center for 5 long days living in System Center and nothing else. As the world’s top operators in this space are there, the networking opportunities are also excellent. I can still remember the amount of talking and discussing I did with my colleagues in 2012. That was stimulating.

It’s also the place to provide feedback to Microsoft about System Center. Things you like, don’t like, things that are missing etc. I most certainly have some feedback for them.

Will I attend?

I’ll most certainly try to attend, that’s for sure. So it’s time to fill out the request form and start cutting through the red tape. Let’s hope the economy doesn’t tank completely and that we can go. The chips might be down right now but let’s not cost cut ourselves out of skills, education, opportunities and a future. Remember, keep moving forward and don’t quit yet, you can always give up later ;-)

Hyper-V Guest Storage Performance: Above & Beyond 1 Million IOPS

Making a million IOPS possible in a Windows Server 2012 Hyper-V VM

A lot of you will have seen the demos of a Hyper-V guest with VHDX disks running on Windows Server 2012 doing a million IOPS. If you haven’t yet, take a look here. While some quickly dismissed this as “irrelevant boasting” without real life relevance, I respectfully disagree. This is smart future proofing by Microsoft. It provides a hypervisor that is ready for future hardware capabilities and capable of handling the most demanding workloads today & in the years to come. Sure, such a demo runs under ideal lab conditions and does not reflect the majority of real life environments, but it’s nice to see what a hypervisor is capable of if and when you might need it. Remember there was a day when 4GB was a lot of RAM and 2TB sounded gigantic. Also remember that some people have larger needs than others. Up to and including Windows Server 2008 R2 there were limitations in storage IO performance that would not allow for a million IOPS. These had to be addressed, or all the efforts to improve the capabilities and performance of storage, CPU, networking and memory would just hit those particular bottlenecks. So it addresses real needs and is indeed also smart future proofing.

Capabilities of virtual machine storage IO throughput in Windows 2008 R2

The capabilities listed below dictate the IO capabilities in virtual machines running on Windows Server 2008 R2 Hyper-V:

  1. There was only one IO channel per virtual SCSI controller.
  2. There was a queue depth of 256 per virtual SCSI controller, shared by all devices attached to that controller.
  3. There was one fixed vCPU (0) dedicated to handling IO.

image

The picture above illustrates these limits. You see two virtual SCSI controllers, each with 2 VHD virtual disks attached. The disks on a controller share that controller’s single channel.

These limits could become a bottleneck, but that was never too big a problem with a maximum of 4 vCPUs per virtual machine in Windows 2008 R2 Hyper-V. If needed, we could attach VHDs to different virtual SCSI controllers for the best possible performance in Windows Server 2008 R2 Hyper-V.

With 64 vCPUs and ever more demanding workloads these limitations would become a (serious) issue, so they needed to be addressed. If not, despite all other efforts on the 4 big resources (CPU, memory, storage and network) in Windows 2012, this would remain the limiting factor for IOPS inside a virtual machine on Windows 2012.

Windows Server 2012 improvements to virtual machine storage IO scaling

image

The picture above illustrates the improvements in Windows Server 2012 Hyper-V IO Scaling:

  1. There is now 1 channel per 16 vCPUs, per virtual SCSI device, per controller. That means you have 4 channels per VHDX attached to a virtual SCSI controller when you have 64 vCPUs in the virtual machine. Compared to before, this is a significant improvement and a much needed one now that virtual machines can have 64 vCPUs.
  2. IO interrupt handling is now distributed among all vCPUs and this process is NUMA aware. This is a huge improvement!
  3. There is now a queue depth of 256 per device attached to a virtual SCSI adapter, instead of per adapter. That’s another big improvement.
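To put some numbers on the difference between the two generations, here is a back-of-the-envelope model in Python. It is a sketch, not a measurement. The function names are made up for illustration, and rounding the channel count up (with a minimum of one channel for a VM with fewer than 16 vCPUs) is my assumption; the only figures taken from the lists above are the ratio of 1 channel per 16 vCPUs and the queue depth of 256.

```python
import math

QUEUE_DEPTH = 256       # queue depth quoted for both Windows versions
VCPUS_PER_CHANNEL = 16  # Windows Server 2012: 1 channel per 16 vCPUs


def channels_2008r2(controllers, disks_per_controller):
    """Windows Server 2008 R2: one channel per virtual SCSI controller."""
    return controllers


def channels_2012(vcpus, controllers, disks_per_controller):
    """Windows Server 2012: channels scale per device with the vCPU count.

    Assumption: the count is rounded up, so a small VM still gets one channel.
    """
    per_device = max(1, math.ceil(vcpus / VCPUS_PER_CHANNEL))
    return per_device * controllers * disks_per_controller


def outstanding_io_2008r2(controllers, disks_per_controller):
    """The queue depth of 256 is shared by all devices on a controller."""
    return QUEUE_DEPTH * controllers


def outstanding_io_2012(controllers, disks_per_controller):
    """The queue depth of 256 applies to every attached device."""
    return QUEUE_DEPTH * controllers * disks_per_controller


if __name__ == "__main__":
    # Two controllers with two virtual disks each, as in the 2008 R2 picture.
    print(channels_2008r2(2, 2), outstanding_io_2008r2(2, 2))
    print(channels_2012(64, 2, 2), outstanding_io_2012(2, 2))
```

For a 64 vCPU virtual machine with two controllers and two disks per controller, this model goes from 2 channels and 512 outstanding IOs to 16 channels and 1024 outstanding IOs.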

That, people, is how you get a virtual machine to handle a million IOPS. Nice! The questions or doubts about whether Hyper-V can deliver the capacity, throughput & performance have been wiped off the table, and yes, that includes virtual storage IOPS. You can now go straight to how it will address your business needs. In my experience it does so brilliantly and very cost effectively. Life might not be perfect but it is very good :-)

Quest Technical Experts Conference 2012 Europe Podcast

As the readers of my blog know, I was in Barcelona this week to attend the Technical Experts Conference Europe 2012 organized by Quest (now a part of DELL). My fellow MVPs, friends and colleagues Aidan Finn (@joe_elway), Carsten Rachfahl (@hypervserver) and Hans Vredevoort (@hvredevoort) and I presented 8 sessions in the Virtualization Track and did a Hyper-V Experts panel to share what we have learned and to help answer attendees’ questions on the new capabilities of Hyper-V in Windows Server 2012. It was both fun and interesting to do. We learned some more from each other and also from the questions of an alert audience, whom we enjoyed presenting to.

Mattias Sundling, a Technical Evangelist at Quest and the owner of the Virtualization Track at TEC 2012 Europe, did an audio podcast with all 4 of us MVPs in the “Virtual Machine” expertise who presented in that track. You can find that podcast here on the vKernel/DELL web site http://www.vkernel.com/podreader/items/tec-europe-2012-mvps-with-mattias-sundling or on YouTube by clicking on the screenshot below. Enjoy.

image

Windows Server 2012 64TB Volumes And The New Check Disk Approach

Introduction

In a previous post I mentioned the use of 64TB volumes in Windows Server 2012 in a supported scenario. That’s a lot of storage and data. There’s a cost side to all this and having all that data on one volume also incurs some risk. Windows 2012 tries to address the cost issue with commodity storage in combination with the excellent resilience of Storage Spaces to reduce both cost and risk. Apart from introducing ReFS, Microsoft also did some work on NTFS to help with reliability. We already discussed the use of the flush command in Windows Server 2012 64TB NTFS Volumes and the Flush Command. Now we’ll look at the new approach for detecting and repairing corruptions in NTFS, which optimizes uptime through on line repair and keeps off line repairs minimal and very short thanks to spot fixing.

On top of these improvements studying this process taught me two very interesting things:

  1. The snapshot size limit is also a reason why NTFS volumes are not bigger than 64TB. See the explanation below!
  2. Cluster Shared Volumes and CSVFS enable continuous availability even when spot fix is run! See below for more details.

So read on and find out why I’m not worried about the 50TB & 37TB LUNs we use for Disk2Disk backups.

Hitting the practical boundaries of check disk (CHKDSK)

While NTFS has been able to handle volumes up to 256TB in size, this was never used in real life because most people don’t have (or need) that amount of storage and because the supported limit was 16TB. With Windows 2012 this has become 64TB. That’s just about future proof enough for the time being I’d say ;-). In real life the practical volume size has been smaller than this due to a number of reasons.

There is the limitation of basic disks, which is solved with GPT, but that has its own requirements. Then there are the storage arrays, on which the biggest LUN you can create varies from 2TB to 16TB, 50TB or more depending on the type, brand and model. Another big concern was the potentially long CHKDSK execution time. Not that the volume size is the factor here; it’s the number of files on the volume that dictates how long CHKDSK will run. But volume size and number of files very often go hand in hand.

While Microsoft has been reducing the execution time of CHKDSK with every Windows release since Windows 2000, the improvements that could be made with the current approach have reached a practical limit. Faced with ever bigger volumes, a huge number of files and ever more “Always On” services requiring very high to continuous availability, things needed to change drastically.

A vastly improved check disk (CHKDSK) for a new era

This change came through a new approach for detecting and repairing corruptions in NTFS that consists of:

  1. Enhanced detection and handling of corruptions in NTFS via on-line repair
  2. A changed CHKDSK execution model that separates the analysis and repair phases
  3. File system health monitoring via Action Center and Server Manager

Enhanced NTFS Corruption Handling

NTFS now logs information on the nature of a detected corruption that cannot be repaired on line. This is maintained in new metadata files:

  • $Verify
  • $Corrupt

The new “Verification” component confirms the validity of a detected corruption to eliminate unnecessary CHKDSK runs due to a transient hiccup. There’s a service involved here called “Spot Verifier”:

image

The on-line repair capability that was introduced with the “Self-healing” feature in Vista, and was limited to Master File Table (MFT) related corruptions, has been greatly enhanced and extended. It can now handle a broader range of corruptions across multiple metadata files, which means nearly all of the most common corruptions can be fixed by an on-line repair.

The New CHKDSK Process & Phases

The phases are:

The analysis phase is performed online on a volume snapshot, so there is no down time for the services and users.

IMPORTANT NOTE: You read that right! The analysis phase is performed online on a volume snapshot. Now when you know that the maximum supported size of a Windows volume snapshot is 64TB, you also know that, apart from the stress & performance testing needed for 256TB LUNs, there is another limitation in play: the size of the snapshot needed to make the new chkdsk process work! If you have volumes bigger than 64TB, this process can and will use a hardware snapshot if there is a hardware VSS provider that supports snapshots bigger than 64TB. So this new chkdsk process in Windows Server 2012 will also work for volumes bigger than 64TB. But within the Microsoft Windows Server 2012 stack, 64TB is the top limit or you lose this new chkdsk functionality. Interesting stuff!

If a corruption is detected, there will be a first attempt at online self-healing via the self-healing API. If self-healing cannot repair the error, online verification (“Spot Verification”) kicks in to verify that the error is not a glitch. Any verified corruption that cannot be fixed on line is identified and logged to a new NTFS metadata file: $Corrupt. After this, the administrators are notified so that the volume can be taken offline at a time of their choosing to do the repairs.

image

The offline repair phase (spot fixing) only runs when all else has failed and it can’t be avoided. The volume can be taken offline, either manually or on a schedule, at a time the administrator chooses. Spot fix only repairs logged corruptions to minimize volume unavailability.

Cluster Shared Volumes bring us continuous availability in Windows Server 2012. In this case the process leverages clustering and CSVFS functionality to make sure you don’t have to bring the volume down; IO is just temporarily stalled:

  • The scan runs & adds detected corruptions to $Corrupt.
  • Every minute the cluster IsAlive check runs, which also …
  • … enumerates the $Corrupt system file via fsutil to identify corrupt files. If there are any, action is taken.
  • The CSV namespace is paused in CSVFS & the underlying NTFS is dismounted.
  • Spot fix runs for a maximum of 15 seconds. If needed, this repeats every 3 minutes.
  • If the corruption repair would take too long, it is marked to run at an off line moment instead of the above.
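The 15 second window and the 3 minute repeat interval mean that IO on a CSV is only ever stalled in short bursts. The small Python sketch below shows the arithmetic. The function name is hypothetical, and I am assuming the 3 minute interval is measured from the start of one pass to the start of the next, which the list above does not spell out.

```python
import math

SPOTFIX_WINDOW_S = 15   # maximum duration of one spot fix pass
RETRY_INTERVAL_S = 180  # a further pass follows every 3 minutes if needed


def spotfix_schedule(repair_seconds):
    """Return (passes, longest_stall_s, elapsed_s) for a given amount of repair work.

    Assumption: the 3 minute interval runs from the start of one pass
    to the start of the next.
    """
    if repair_seconds <= 0:
        return (0, 0, 0)
    passes = math.ceil(repair_seconds / SPOTFIX_WINDOW_S)
    # IO is never stalled for longer than one window at a time.
    longest_stall = min(repair_seconds, SPOTFIX_WINDOW_S)
    # The last pass only needs whatever work is left over.
    last_pass = repair_seconds - (passes - 1) * SPOTFIX_WINDOW_S
    elapsed = (passes - 1) * RETRY_INTERVAL_S + last_pass
    return (passes, longest_stall, elapsed)
```

Under these assumptions, 40 seconds of repair work is spread over 3 passes, no single stall lasts longer than 15 seconds, and the whole repair completes 370 seconds after the first pass starts.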

It normally takes no longer than a few seconds, often a lot less, to repair corruptions off line now. That is negligible on a modern physical server, which spends far longer running through its memory configuration, BIOS/UEFI and controller initialization at boot. Even on laptops and virtual machines this is very short and doesn’t really add much to the boot time, as you can see in the picture below. It’s often not even noticeable.

image

Using this new functionality

The user is notified via the Windows User Interface. The phases of repair are also displayed in the Action Center & Server Manager and the user can take appropriate action.

The chkdsk command line has had options added that leverage this model:

image

The fsutil repair command also has some new options:

image

You can also control the action via PowerShell with the storage cmdlet Repair-Volume. The action can be run as a job and the parameters -Scan, -SpotFix and -OfflineScanAndFix are pretty obvious by now. See http://technet.microsoft.com/en-us/library/hh848662.aspx
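As a quick reference, the Python sketch below maps each phase to the matching chkdsk switch and Repair-Volume parameter. It only builds the command strings and does not run anything. The helper names and the phase labels are mine, chosen for illustration. The commands themselves would have to be run from an elevated prompt on Windows Server 2012.

```python
# The phase names are labels chosen for this sketch; the values are the
# chkdsk switches and Repair-Volume parameters for each phase.
CHKDSK_SWITCHES = {
    "scan": "/scan",
    "spotfix": "/spotfix",
    "offlinescanandfix": "/offlinescanandfix",
}

REPAIR_VOLUME_SWITCHES = {
    "scan": "-Scan",
    "spotfix": "-SpotFix",
    "offlinescanandfix": "-OfflineScanAndFix",
}


def chkdsk_command(drive_letter, phase):
    """Build the chkdsk command line for one phase (hypothetical helper)."""
    return f"chkdsk {drive_letter}: {CHKDSK_SWITCHES[phase]}"


def repair_volume_command(drive_letter, phase, as_job=False):
    """Build the Repair-Volume command for one phase (hypothetical helper)."""
    cmd = f"Repair-Volume -DriveLetter {drive_letter} {REPAIR_VOLUME_SWITCHES[phase]}"
    # -AsJob runs the cmdlet as a PowerShell background job.
    return cmd + " -AsJob" if as_job else cmd


if __name__ == "__main__":
    for phase in ("scan", "spotfix", "offlinescanandfix"):
        print(chkdsk_command("D", phase))
        print(repair_volume_command("D", phase))
```

The online scan comes first. The spot fix only has something to do when the scan has logged corruptions, and the offline scan and fix is the equivalent of the old style full chkdsk run.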