KB 2636573: Guest Crashes with Win2008R2 RTM/SP1 STOP 0xD1 in storvsc!StorChannelVmbusCallback During Live Migration

The BSOD

I helped hunt down this bug and tested the private fix. Some months ago, during the summer of 2011, I was putting some new Hyper-V clusters under stress tests. You know, letting them work very hard for a longer period of time to see if anything falls off or goes “boink”. It all looked pretty robust and, after some tweaking, also very fast. Just when you’re about to declare “we’re all set”, you see a BSOD on one of the guests that’s being live migrated, happily announcing: “DRIVER_IRQL_NOT_LESS_OR_EQUAL STOP: 0x000000D1 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)”

Now that doesn’t make ME very happy however. So I investigate to see if there are any more VMs dropping dead during live migration but we don’t see any. Known issues like out of date versions of the integration tools or the like are not in play nor are any other possible suspects.

We throw the MEMORY.DMP file in the debugger and we come up with the following culprit:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

The driver probably at fault is storvsc.sys

Probably caused by : storvsc.sys ( storvsc!StorChannelVmbusCallback+2b8 )

Hmmmmmmm, we start searching the internet but we don’t find much. We also throw it on to Twitter to see if the community comes up with something. Meanwhile we keep looking and find this little blog post by a Microsoft support engineer Rob Scheepens:

http://blogs.technet.com/b/dip/archive/2011/10/21/win2008r2-rtm-stop-0xd1-in-storvsc-storchannelvmbuscallback-0x2b5.aspx

We pinged Rob and opened a case with MS support. That evening Hans Vredevoort (www.hyper-v.nu), who saw my tweet, mails me with the details of a fellow MVP in the USA having this same issue. We get in contact and, via both Microsoft & the Hyper-V community, we start hunting the cause of this bug. The progress on this issue can be read at the Microsoft blog above. You’ll notice that the fix is in the works now.

Hunting down the STOP error

What did we establish:

  • It only happens occasionally during a live migration and it is rather ad hoc: not every time, not after X amount of live migrations or X amount of up time.
  • At first it seems to happen only with guests running dynamically expanding VHDs attached to SCSI controllers in Hyper-V. But that’s not really the case, as I remember one case with a fixed VHD attached to a SCSI controller. In our case the VMs we could reproduce the issue with in a reasonable time were all SQL Server test and development guests running SQL Server 2008 where the dynamically expanding disks are used as “poor man’s thin provisioning”.
  • I have not heard of this on Windows 2008 hosts, only R2, but I have not tested this.

So it’s reproducible, but it takes intensive live migration activity. Meanwhile we received private instrumentation to install on both guests & hosts to collect “enriched” memory dumps when a guest experiences a BSOD. With PowerShell we have continuous live migration running to reproduce the issue. The fact that we can live migrate over 10Gbps does help. You can get lucky, but in reality it takes many hundreds of live migrations to reproduce it, and on some machines many thousands. No joke: in total we did 8000 live migrations to test the fix and about 12000 to reproduce the issue on several VMs with different configurations to send memory dumps to MSFT. So yes, you really need some PowerShell, and having a 10Gbps live migration network also helps.

All the collected MEMORY.DMP files from these live migration exercises were uploaded to Microsoft for analysis. That took a while, also because they had a boatload of live migrations to do and I don’t know if their test lab has 10 Gbps.

On Tuesday the 25th of October Microsoft contacts us with good news. They have root-caused the problem and a hotfix is in the works. You can download that here http://support.microsoft.com/kb/2636573

On Thursday the 27th of October we get access to a private fix and after installing this one we’ve been running thousands of live migrations without seeing the issue.
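To get a feel for why it takes thousands of live migrations both to reproduce a rare crash and to trust a fix, here is a minimal Python sketch. It assumes every live migration is an independent trial with the same crash probability, which is a simplification. The crash rates used (1 in 500 and 1 in 2000) are hypothetical values picked for illustration, not numbers measured in this case; only the 8000 clean migrations come from the testing described above.

```python
import math


def p_at_least_one_crash(p_per_migration: float, n_migrations: int) -> float:
    """Chance of at least one crash in n independent live migrations."""
    return 1.0 - (1.0 - p_per_migration) ** n_migrations


def migrations_needed(p_per_migration: float, confidence: float) -> int:
    """Smallest n for which P(at least one crash) reaches the given confidence."""
    n = math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_migration))
    # Guard against floating point rounding right at the boundary.
    while p_at_least_one_crash(p_per_migration, n) < confidence:
        n += 1
    return n


def p_clean_run_if_still_broken(p_per_migration: float, n_migrations: int) -> float:
    """Chance of n crash-free migrations if the bug were still present."""
    return (1.0 - p_per_migration) ** n_migrations


if __name__ == "__main__":
    # Hypothetical crash rates (assumptions for illustration only).
    for one_in in (500, 2000):
        p = 1.0 / one_in
        n95 = migrations_needed(p, 0.95)
        clean = p_clean_run_if_still_broken(p, 8000)
        print(f"crash rate 1 in {one_in}: {n95} migrations for a 95% chance "
              f"of a repro; chance of 8000 clean runs with the bug present: {clean:.2e}")
```

The point of the last function is that a long crash-free run only means something relative to how often the bug used to show up: the rarer the crash, the more clean migrations you need before you can believe the fix.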

The public release of this hotfix is currently planned (HTP11-12) under KB2636573.

The details for the curious

Root Cause?

The root cause can be summarized as follows: “StorVSP was modifying guest memory while the VM’s virtual motherboard was being powered off.” As a result, storvsc accesses a NULL pointer in a memory buffer that has already been freed. The result of this is a BSOD or STOP error in the virtual machine.

Only SCSI attached VHDs

OK, but why do we only see this with SCSI attached VHDs? The issue happens during power down of the virtual machine’s motherboard because there is a disk enumeration during the shutdown phase. And this enumeration only happens with SCSI disks.

Right! So the more VHDs we had attached to SCSI controllers in a virtual machine the higher the likelihood of this happening.

Why so much more likely with dynamically expanding VHDs

But still we saw this exponentially more with dynamically expanding disks. Why is that? Well, it’s not that dynamically expanding disks trigger disk enumeration more than fixed disks. However, it seems that any disk expansion, which causes write delays, can lead to a timing issue that will cause the disk enumeration to hit the issue described above. So this significantly increases the risk that the STOP error will happen, and it explains why the chance of this happening with fixed VHDs attached to SCSI controllers is significantly lower. This is in sync with what we saw. The virtual machines with a lot of dynamic disks attached to SCSI controllers that had a lot of activity (and thus potential for expanding) are the ones where we could reproduce this the fastest.

Conclusion

It can take some time to hunt down certain bugs, especially the rare ones that only happen every now and then so occurrences are few and far between. But when you put in some effort Microsoft helps out and works on a fix. And no, you don’t have to have the most expensive support contract for that to happen. As a matter of fact this call was logged under a free support call with the TechNet Plus subscription. And as it was a bug, they return it as unused.

I’m an MVP–What a Great Start Of 2012

Microsoft presented me with the 2012 Microsoft® MVP Award under the Virtual Machine expertise. If you’d like to know a bit more about the MVP Program and the Award you can take a look here http://mvp.support.microsoft.com/gp/aboutmvp

This is special to me, and I’m honored by it. It’s very nice to get such recognition both from your peers in the community and from Microsoft for sharing your experiences and knowledge for the greater good. This doesn’t mean I’m a “know all, end all” guru, far from it. No one knows everything or never makes mistakes. To me it does mean my peers think highly enough of me that they are willing to nominate me and serve as a reference for my skill set and contributions. That by itself is a huge compliment, but I’m also grateful to have the opportunities to learn a lot, and for that I owe some thanks. I learn a lot from participating in a worldwide community that shares experiences & knowledge. The amount of skills that these people bring to the table and the wealth of information that is shared by all is enormous. “The community” is a varied group of experts in their own areas of excellence.

  • Some are (sometimes long time) MVPs like Aidan Finn, Hans Vredevoort, Jaap Wesselius, Jetze Mellema, Kurt Roggen, Mike Resseler, Kristian Nese, Carsten Rachfahl.
  • Naturally there are the Microsoft employees, both locally and abroad, with whom I’ve had the pleasure of working on support & business cases and who’ve probably vouched for me when asked to do so.
  • Then there is the interaction with community members like Ronnie Isherwood, Jeff Wouters, Dave Stork, Peter Noorderijk, Maarten Wijsman, Rick Slager and my blog readers, and a lot of the people who follow me on Twitter (Ronny Pot, J. Wolfgang Goerlich, Kevin Ball, Kenneth, …) and so many others I’m probably forgetting to mention. Some of these I’ve had the privilege of meeting in real life and those occasions have always been both educational & fun. Sometimes these meetings turned into an international distributed testing/troubleshooting effort where we all learn something, like at TEC 2011.
  • On top of that I have the luck to work with some really nice people: colleagues (Tom, Peter, Karel, Ivan, Sabrina, Jeff: you rock, and thanks for sticking with us through all the sometimes challenging projects) as well as consultants and people I know at other companies that work for or with us.

Together we learn a lot through the need to answer sometimes complex questions and find solutions for the problems at hand. This makes for a great school and ongoing education, until the day arrives when you’re recognized as an expert while you realize more and more how much there is still to learn.

Anti Virus & Hyper-V Reloaded

The anti virus industry is both a blessing and a curse. They protect us from a whole lot of security threats and at the same time they make us pay dearly for their mistakes or failures. Apart from those issues themselves, this is aggravated by the fact that management does not see the protection it provides on a daily basis. Management only notices anti virus when things go wrong, when they lose productivity and money. And frankly when you consider scenarios like this one …

Hi boss, yes, I know we spent 1.5 million Euros on our virtualization projects and it’s fully redundant to protect our livelihood. Unfortunately the anti virus product crashed the clusters so we’re out of business for the next 24 hours, at least.

… I can’t blame them for being a bit grumpy about it.

Recently some colleagues & partners in IT got bitten once again by McAfee with one of their patches (8.8 Patch 1 and 8.7 Patch 5). These have caused a lot of BSOD reports and they put the CSVs on Hyper-V clusters into redirected mode (https://kc.mcafee.com/corporate/index?page=content&id=KB73596). Sigh. As you can read there, for the redirected mode issue they are telling us Microsoft will have to provide a hotfix. Now all anti virus vendors have their issues, but McAfee has had too many issues for too long now. I had hoped that Intel buying them would have helped with quality assurance, but it clearly did not. This only makes me hope that whatever protection against malware is going to be built into the hardware will be of a lot better quality, as we don’t need our hardware destroying our servers and client devices. We’re also not very happy with the prospect of rolling out firmware & BIOS updates at the rate and with the risk of current anti virus products.

Aidan Finn has written before about the balance between risk & high availability when it comes to putting anti virus on Hyper-V cluster hosts and I concur with his approach:

  • When you do it, pay attention to the exclusion & configuration requirements
  • Manage those hosts very carefully, don’t slap on just any updates/patches, and this includes anti virus products of course

I have a Masters in biology from the days before I went head over heels into the IT business. From that background I’ve taken my approach to defending against malware. You have to make a judgment call, weighing all the options with their pros and cons. Compare this to vaccines/inoculations to protect the majority of your population. You don’t have to get 100% complete coverage to be successful in containing an outbreak. Just a sufficiently large part, including your most vulnerable and most at risk population. Excluding the Hyper-V hosts from mandatory anti virus fits this bill. Will you have 100% success, always? Forget it. There is no such thing.

Experts2Experts Conference London (UK) 2011

I’m at the Experts2Experts Conference in London and I’m having a great time talking shop, tech & business with my fellow IT Pro colleagues from around Europe, among them Aidan Finn, Jeff Wouters, Carsten Rachfahl and Ronnie Isherwood.

It might be fun for Microsoft to join us for some of these lunch & dinner time discussions. It would provide them with great feedback, ideas, concerns. Very educational. While we’re discussing Citrix, VMware, Microsoft & ISV solutions (RES, Appsense) this is not a vendor centric conference. Sure we all work with these products but we’re discussing them from our point of view. The challenges, the issues, the successes & failures are discussed and mentioned.

There’s a high density of virtualization, private cloud, desktop virtualization (VDI, Terminal Servers, Application Virtualization, Client hosted virtual desktops etc.) expertise at the conference to make it interesting.

Tomorrow I’ll be sharing some musings on “High Performance & High availability Networks for Hyper-V Clusters” during my session.