DELL XPS Laptops and seemingly random Blue Screens of Death

Introduction

Over the last few years, I have endured blue screens on my personal travel laptop and on those of some buddies. The common factors were various models of DELL XPS laptops and seemingly random Blue Screens of Death (BSOD) with IRQL_NOT_LESS_OR_EQUAL. The problem and its cause are, however, not limited to these devices or this brand. The cause, I found, was very often the Intel RST driver.

Trust me, there is nothing I find more annoying than a BSOD while I’m on the move: working on a train or airplane, attending a conference, or finishing my slides in the speaker room of a conference. Heaven forbid it happens while I am presenting. It is always annoying, but in my home office I have other options. While I am traveling, I have none.

Diagnose the problem

To fix the issue we need to stop guessing and stop randomly upgrading or downgrading drivers in a trial-and-error fashion. Diagnose the problem properly. So what does a seasoned IT Pro do? The IT Pro copies the memory dump from the system and feeds it into WinDbg (download it here).

Be sure to work on a copy of the MEMORY.DMP if you run WinDbg on the problematic machine itself, to prevent any permissions issues. In WinDbg, open the “File” menu and choose the option “Open Crash Dump”.

Select your copy of the MEMORY.DMP file.
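Making that working copy is easy to script. The following is a minimal sketch, assuming Python is available on the machine; the function name copy_dump and the destination folder are made up for illustration, and reading the original dump normally requires an elevated prompt.

```python
import shutil
from pathlib import Path


def copy_dump(source, work_dir):
    """Copy a crash dump into work_dir and return the path of the copy.

    The original file is left untouched, so the debugger only ever
    opens the copy.
    """
    source = Path(source)
    work_dir = Path(work_dir)
    # Create the working folder if it does not exist yet.
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / source.name
    # copy2 also preserves timestamps, which is handy when you keep several dumps.
    shutil.copy2(source, target)
    return target
```

On a default installation the dump usually lives at C:\Windows\MEMORY.DMP, so a call could look like copy_dump(r"C:\Windows\MEMORY.DMP", r"C:\Temp\dumps"), where C:\Temp\dumps is only an example folder.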

Let it run and have some patience as it does its work. Pretty soon you’ll see output like below.

Probably caused by : iaStorAVC.sys ( iaStorAVC!Wcdl::Allocator::freeContiguous+20 )

When you run !analyze -v you’ll get extra details, but you already know what you need to know.
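If you collect dumps from several machines, it can help to pull the suspect out of the saved analysis text automatically. The sketch below is only an illustration and not part of WinDbg: it assumes you have the debugger output as plain text, and the names CAUSE_RE and find_suspect are my own.

```python
import re

# Matches the summary line WinDbg prints, for example:
#   Probably caused by : iaStorAVC.sys ( iaStorAVC!Wcdl::Allocator::freeContiguous+20 )
CAUSE_RE = re.compile(
    r"Probably caused by\s*:\s*(?P<module>\S+)\s*\(\s*(?P<symbol>.+?)\s*\)\s*$",
    re.MULTILINE,
)


def find_suspect(analysis_text):
    """Return (module, symbol) from WinDbg analysis text, or None if absent."""
    match = CAUSE_RE.search(analysis_text)
    if match is None:
        return None
    return match.group("module"), match.group("symbol")


sample = "Probably caused by : iaStorAVC.sys ( iaStorAVC!Wcdl::Allocator::freeContiguous+20 )"
suspect = find_suspect(sample)
```

Here suspect holds the driver file name and the symbol with its offset, which is enough to tally which driver shows up most often across a pile of dumps.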

The outcome over the past years, in most cases, pointed to iaStorAVC.sys more than anything else. This is the Intel Rapid Storage Technology (Intel RST) driver. Your mileage may vary, but as I have seen the same cause on multiple XPS systems, I figured I would blog it and save my future self (and maybe some of you) a ton of headaches. That said, when you get a BSOD you really must investigate the MEMORY.DMP file from that system yourself to see which driver is the culprit. But with an XPS, iaStorAVC.sys is a reasonable suspect. It seems to have been a common issue over the past few years.

Fine, now we know what to fix. How to fix it is what we’ll look at next. I just wrote this blog for my own reference.

Fixing iaStorAVC.sys related BSOD

You have three possible approaches to fixing iaStorAVC.sys related BSODs.

  1. Get an updated Intel Rapid Storage Technology driver (this post)
  2. Move from RAID to AHCI without reinstalling
  3. Move from AHCI to RAID without reinstalling (handy when newer RST drivers are available and you want to test your luck).

We will look at all 3 options, starting with option 1 below.

For these operations make sure you have a (local) account with admin rights. These procedures should work with BitLocker enabled (my laptop has it enabled), but make sure you have your recovery keys at hand somewhere. Also, signing in with a PIN won’t work in safe mode, so make sure you know your username and password. The Barney Bear essentials for sysadmins who’ve been around the block a few times.

Get an updated Intel Rapid Storage Technology driver

This is pretty straightforward. Making sure all drivers are up to date is the first response to driver issues. No need to explain this in detail, just install them. If DELLEMC has one on their support site, that’s great. If they don’t, head over to the Intel site and look there. DELLEMC can lag behind a few months, which is why neither SupportAssist nor you will find anything there. I have used the more recent Intel downloads with success in the past (https://downloadcenter.intel.com/download/28255/Intel-Rapid-Storage-Technology-Intel-RST-User-Interface-and-Driver?product=55005) to keep Intel RST up to date.

As RAID mode is how the systems ship by default, this is the easiest way to resolve the issue with DELL XPS Laptops and seemingly random Blue Screens of Death. Now, if this doesn’t fix the issue or there is no newer driver to be found, move on to option 2 and eventually maybe even option 3. Those are other blog posts.

KB 2636573: Guest Crashes with Win2008R2 RTM/SP1 STOP 0xD1 in storvsc!StorChannelVmbusCallback During Live Migration

The BSOD

I helped hunt down this bug and tested the private fix. Some months ago, during the summer of 2011, I was putting some new Hyper-V clusters under stress tests. You know, letting them work very hard for a longer period of time to see if anything falls off or goes “boink”. It all looked pretty robust and, after some tweaking, also very fast. Just when you’re about to declare “we’re all set”, you see a BSOD on one of the guests that’s being live migrated, happily announcing: “DRIVER_IRQL_NOT_LESS_OR_EQUAL STOP: 0x000000D1 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)”

Now that doesn’t make ME very happy however. So I investigate to see if there are any more VMs dropping dead during live migration but we don’t see any. Known issues like out of date versions of the integration tools or the like are not in play nor are any other possible suspects.

We throw the MEMORY.DMP file in the debugger and we come up with the following culprit:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

The driver probably at fault is storvsc.sys

Probably caused by : storvsc.sys ( storvsc!StorChannelVmbusCallback+2b8 )
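The 0xD1 in the STOP message and the (d1) in the debugger output are the same bug check code. As a small illustration, the sketch below splits a STOP line into its code and parameters. The names STOP_RE, BUGCHECK_NAMES and parse_stop are my own, and the lookup table only holds the two codes mentioned on this blog; it is in no way a complete list.

```python
import re

# Only the two bug check codes discussed here; many more exist.
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

# Captures the hexadecimal code and the parenthesised parameter list.
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop(line):
    """Return (code, name, parameters) from a 'STOP: 0x... (...)' line, or None."""
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    # The parameters are comma separated hexadecimal values.
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, BUGCHECK_NAMES.get(code, "UNKNOWN"), params


stop_line = ("STOP: 0x000000D1 (0x0000000000000000, 0x0000000000000000, "
             "0x0000000000000000, 0x0000000000000000)")
parsed = parse_stop(stop_line)
```

For the STOP message quoted earlier, parsed carries the code 0xD1, its symbolic name, and the four parameters as integers.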

Hmmmmmmm, we start searching the internet but we don’t find much. We also throw it on to Twitter to see if the community comes up with something. Meanwhile we keep looking and find this little blog post by Microsoft support engineer Rob Scheepens:

http://blogs.technet.com/b/dip/archive/2011/10/21/win2008r2-rtm-stop-0xd1-in-storvsc-storchannelvmbuscallback-0x2b5.aspx

We pinged Rob and opened a case with MS support. That evening Hans Vredevoort (www.hyper-v.nu), who saw my tweet, mails me with the details of a fellow MVP in the USA having this same issue. We get in contact, and via both Microsoft and the Hyper-V community we start hunting the cause of this bug. The progress on this issue can be read at the Microsoft blog above. You’ll notice that the fix is in the works now.

Hunting down the STOP error

What did we establish:

  • It only happens occasionally with a live migration and it is rather ad hoc: not every time, not after X amount of live migrations or X amount of up time.
  • It sometimes seems to happen only with guests running dynamically expanding VHDs attached to SCSI controllers in Hyper-V. But that’s not really the case, as I remember one with a fixed VHD attached to a SCSI controller. In our case the VMs we could reproduce the issue with in a reasonable time were all SQL Server test and development guests running SQL Server 2008, where the dynamically expanding disks are used as “poor man’s thin provisioning”.
  • I have not heard of this on Windows 2008 hosts, only R2, but I have not tested this.

So it’s reproducible, but it takes intensive live migration activity. Meanwhile we received private instrumentation to install on both guests and hosts to collect “enriched” memory dumps when a guest experiences a BSOD. With PowerShell we have continuous live migration running to reproduce the issue. The fact that we can live migrate over 10Gbps does help. You can get lucky, but in reality it takes many hundreds of live migrations to reproduce it; on some machines, many thousands. No joke: in total we did about 12000 live migrations to reproduce the issue on several VMs with different configurations to send memory dumps to MSFT, and 8000 more to test the fix. So yes, you really need some PowerShell, and having a 10Gbps live migration network also helps.
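We did this with PowerShell, and that script is not reproduced here. Purely to illustrate the shape of such a loop, here is a sketch in Python. The function stress_live_migration and the two callables it takes, migrate and guest_is_healthy, are hypothetical stand-ins for whatever actually moves the VM and checks on the guest in your environment.

```python
import itertools


def stress_live_migration(vm, nodes, migrate, guest_is_healthy, max_migrations):
    """Bounce vm between nodes until the guest crashes or the budget runs out.

    migrate(vm, node) and guest_is_healthy(vm) are hypothetical callables
    supplied by the caller. Returns (migrations_done, crashed).
    """
    # Cycle through the nodes forever: A, B, A, B, ...
    # Order the list so the first node is not the VM's current owner.
    targets = itertools.cycle(nodes)
    for done in range(1, max_migrations + 1):
        migrate(vm, next(targets))
        if not guest_is_healthy(vm):
            # Stop right away so the memory dump of this crash can be collected.
            return done, True
    return max_migrations, False
```

Stopping at the first unhealthy guest matters: the whole point of the exercise is to grab the MEMORY.DMP from that crash before anything else touches the VM.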

All the collected MEMORY.DMP files from these live migration exercises were uploaded to Microsoft for analysis. That took a while, also because they had a boatload of live migrations to do and I don’t know if their test lab has 10 Gbps.

On Tuesday the 25th of October Microsoft contacts us with good news. They have root-caused the problem and a hotfix is in the works. You can download that here http://support.microsoft.com/kb/2636573

On Thursday the 27th of October we get access to a private fix, and after installing this one we’ve been running thousands of live migrations without seeing the issue.

The public release of this hotfix is currently planned (HTP11-12) under KB2636573.

The details for the curious

Root Cause?

The root cause can be summarized as follows: “StorVSP was modifying guest memory while the VM’s virtual motherboard was being powered off.” As a result, storvsc accesses a NULL pointer in a memory buffer that has already been freed. The result is a BSOD or STOP error in the virtual machine.

Only SCSI attached VHDs

OK, but why do we only see this with SCSI attached VHDs? The issue happens during power down of the virtual machine’s motherboard because there is a disk enumeration during the shutdown phase. And this enumeration only happens with SCSI disks.

Right! So the more VHDs we had attached to SCSI controllers in a virtual machine the higher the likelihood of this happening.

Why so much more likely with dynamically expanding VHDs

But still, we saw this far more often with dynamically expanding disks. Why is that? Well, it’s not that dynamically expanding disks trigger disk enumeration more than fixed disks. However, it seems that any disk expansion, which causes write delays, can lead to a timing issue that will cause the disk enumeration to hit the issue described above. So this significantly increases the risk that the STOP error will happen, and it explains why the chance of this happening with fixed VHDs attached to SCSI controllers is significantly lower. This is in sync with what we saw. The virtual machines with a lot of dynamic disks attached to SCSI controllers that had a lot of activity (and thus potential for expanding) are the ones where we could reproduce this the fastest.

Conclusion

It can take some time to hunt down certain bugs, especially the rare ones where occurrences are few and far between. But when you put in some effort, Microsoft helps out and works on a fix. And no, you don’t have to have the most expensive support contract for that to happen. As a matter of fact, this call was logged under a free support call with the TechNet Plus subscription. And as it was a bug, they return it as unused.