Dell Compellent SCOS 6.7 ODX Bug Heads Up

UPDATE 3: Bad and disappointing news. After update 2 we’ve seen Dell change the CSTA (CoPilot Services Technical Alert) on the customer website to “will be fixed” in a future version. Now, according to the latest comment on this blog post, that would be in Q1 2017. Basically this is unacceptable and it’s a shame to see a SAN that was one of the best when it comes to Hyper-V support in Windows Server 2012 / 2012 R2 decline in this way. If 7.x is required for Windows Server 2016 support this is pretty bad, as it means early adopters are stuck or we’ll have to find and recommend another solution. This is not a good day for Dell storage.

UPDATE 2: As you can read in the comments below people are still having issues. Do NOT just update without checking everything.

UPDATE: This issue has been resolved in Storage Center 6.7.10 and 7.X

If you have 6.7.x below 6.7.10 it’s time to think about moving to 6.7.10!
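A minimal sketch of how you could check a list of arrays against that advice. The helper names are hypothetical, and the 6.7.10 threshold comes from this update only; since the later updates and the comments report that the fix did not hold, the fixed release is a parameter rather than a fact baked into the code.

```python
def parse_scos_version(text):
    """Turn a dotted version such as '6.7.5' or '6.7.11.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def below_fixed_release(installed, fixed="6.7.10"):
    """True if `installed` is on the same major.minor line as `fixed` and older than it.

    The default of 6.7.10 is an assumption taken from the update above.
    """
    inst = parse_scos_version(installed)
    fix = parse_scos_version(fixed)
    return inst[:2] == fix[:2] and inst < fix


if __name__ == "__main__":
    # Version strings below are ones mentioned by readers in the comments.
    for version in ("6.5.5", "6.7.2", "6.7.5", "6.7.11.4"):
        print(version, "below fixed release:", below_fixed_release(version))
```

Versions on another line, such as 6.6.x, are deliberately reported as not below the fixed release, because the advice above only speaks about 6.7.x.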

No vendor is exempt from errors, issues, mistakes and trouble with advanced features, and unfortunately Dell Compellent has issues with Windows Server 2012 (R2) ODX in the current release of SCOS 6.7. Bar a performance issue in a 6.4 version, they had a very good track record with regard to ODX, UNMAP, … so far. But no matter how good you are, bad things can happen.


I’ve had two people who were bitten by it contact me. The issue is described below.

In SCOS 6.7 an issue has been determined when the ODX driver in Windows Server 2012 requests an Extended Copy between a source volume which is unknown to the Storage Center and a volume which is presented from the Storage Center. When this occurs the Storage Center does not respond with the correct ODX failure code. This results in the Windows Server 2012 not correctly recognizing that the source volume is unknown to the Storage Center. Without the failure code Windows will continually retry the same request which will fail. Due to the large number of failed requests, MPIO will mark the path as down. Performing ODX operations between Storage Center volumes will work and is not exposed to this issue.

You might think that this is not a problem as you might only use Compellent storage, but think again. Local disks on the hosts where data is stored temporarily, and external storage you use to transport data in and out of your datacenter or copy backups to, are all use cases we can encounter. When ODX is enabled, which it is by default on Windows Server 2012 (R2), the file system will try to use it and, when that fails, fall back to normal (non-ODX) operations. All of this is normally transparent to the users. With this bug, however, MPIO will mark the Compellent path as down. Ouch. I will not risk that. Any IO between a non-Compellent LUN and a Compellent LUN might cause this to happen.

The only workaround for now is to disable ODX on all your hosts. To me that’s unacceptable and I will not be upgrading to 6.7 for now. We rely on ODX to gain performance benefits at both the physical and virtual layer. We even have our SMB 3 capable clients in the branch offices leverage ODX to avoid costly data copies to our clustered Transparent Failover File Servers.
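For reference, here is a rough sketch of what that workaround amounts to on a single host. It assumes the registry value Microsoft documents for toggling ODX (FilterSupportedFeaturesMode under HKLM\SYSTEM\CurrentControlSet\Control\FileSystem, where 1 disables ODX and 0 enables it). The function names are hypothetical, treating a missing value as "enabled" is my assumption, and you should test this in a lab before touching production hosts.

```python
import sys

# Registry location documented by Microsoft for ODX on Windows Server 2012 / 2012 R2.
ODX_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"
ODX_VALUE = "FilterSupportedFeaturesMode"


def odx_enabled(raw_value):
    """Interpret the registry value: 0 means ODX is enabled, 1 means disabled.

    A missing value (None) is treated as enabled, which is an assumption.
    """
    return raw_value in (None, 0)


def read_odx_value():
    """Return the current FilterSupportedFeaturesMode value, or None if it is not set."""
    import winreg  # Windows-only module, imported lazily
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ODX_KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, ODX_VALUE)
            return value
        except FileNotFoundError:
            return None


def disable_odx():
    """Set FilterSupportedFeaturesMode to 1; needs an elevated session."""
    import winreg
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ODX_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, ODX_VALUE, 0, winreg.REG_DWORD, 1)


if __name__ == "__main__" and sys.platform == "win32":
    # Read-only by default; call disable_odx() yourself once you are sure.
    print("ODX enabled:", odx_enabled(read_odx_value()))
```

The script only reports the current state when run; disabling is a separate, explicit call so that nothing changes on a host by accident.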

When a version arrives that fixes the issue I’ll test even more elaborately than before. We’ve come to pay attention to performance issues or data corruption with many vendors, models and releases, but this MPIO issue is a new one for me.

74 thoughts on “Dell Compellent SCOS 6.7 ODX Bug Heads Up”

    • I don’t, CoPilot should however be able to tell you very well which version as it’s an alert note on the customer support site.

  1. We have had issues since we upgraded from 6.5.5 to 6.7.2. The upgrade was requested by CoPilot because on 6.5.5 a controller can go into hibernate mode without failing over to the other one, losing the path to the SAN. We spent three months in support with CoPilot, who kept pointing to our server config or an MS MPIO problem, without success.

    The problem we have with ODX on this version is that when a backup is running it puts the VM in backup maintenance mode. When the backup finishes, maintenance mode is released and the OS wants to merge the VHDX and AVHDX files; the merge then fails and the complete cluster loses its connection to the SAN.

    • ODX and CSV backups have a long history of bugs in Windows, so make sure you have all relevant updates/hotfixes for Failover Clustering and Hyper-V (Windows Server 2012 – Windows Server 2012 R2).

      • Yes, everything is done and double-checked for the MS updates. CoPilot twice opened a case with MS to check our setup, and they were pointing to a network problem because everything was correctly configured. Great to know, because it is our first cluster and I was not confident while CoPilot was pointing to our config, but MS tech support tells us it is perfectly set up, following good practices and updates.

        I rebuilt a new 1-server cluster and the same problem happened. I was told by CoPilot support that the problem is in version 6.7.2 and that the one they tested on their side, 6.7.3, has the problem too. They also tested with a beta version of 7.x and the same problem appears while files are merging. Not good, no ETR for the moment.

        I was told to disable ODX and wait for a newer version in which the problem will be solved, but I don’t want to test a new firmware release with production on it. No way.

        I will test new firmware first, as this one has already messed things up here a lot.

          • The SAN was installed by the field tech with 6.5.5 and we jumped over 6.6.x to 6.7.2. No issue with 6.5.5. The weirdest thing I saw is that my second SSD drive was flashing really fast, and 4 other NL-SAS drives the same way, when the ODX MPIO error in Windows happened.

            As we are using tiering and the default profile, the SAN showed writes of 50mb/s on the SSD and a weird 200mb/s on NL-SAS. It is supposed to write only to SSD with our setup. No I/O on iSCSI, the controller or the CSV, but a lot of I/O between the SC4020 (SSD) and the SC200 (NL-SAS).

          • I was told by Co-Pilot that this issue affects 6.6.5 as well (that is our current installed version).

          • They didn’t say which versions. We have 6.6.5.19 and they said we are affected. We disabled ODX over the weekend as part of applying OS patches, but we don’t know for sure that the ODX issue is the source of our paths going down problems (it is Co-Pilot’s best theory for the moment). The issue is very sporadic in our environment, but usually it happens on a weekend for some reason. We will just have to wait and see…

          • We’re at 6.6.5.19 with ODX on … haven’t seen paths being lost … we’re FC, iSCSI only used for replications. SC40 controllers by the way.

          • We are iSCSI only and have SC8000s. Co-Pilot never clarified what the exact criteria for problems were, so it is conceivable that our setup is susceptible and yours isn’t.

            Lots of people in this discussion thread are saying that they aren’t having the issue, but it isn’t a consistently reproducible issue (assuming we are having the issue). We will go weeks without an incident and then suddenly some/all the paths to one controller will go down (and sometimes stay down until we fix it).

  2. We have been trying to start the migration to Compellent for more than 3 months, 3 arrays. My advice is to stay away from 6.7 for the moment; we had major bugs (even data corruption on 6.7.3 with thin import). We are on 6.7.5 atm and most of our problems are solved, but I need to investigate this ODX issue now.

    From what I understand if I copy a file from the C: (VMware drive) to a D: (Compellent LUN) it should drop the paths right?

    • I read it as any source volume used by a host that’s not a Compellent LUN and as such doesn’t have Compellent ODX support – meaning a Windows 2012 (R2) stack (hypervisor). The exact permutations of non-Compellent volume to Compellent volume I’m still trying to gather.

  3. Well on Windows 2012R2 SCOS 6.7.5 ODX enabled iSCSI
    Copy from a VMware drive to Compellent LUN > ODX not used, no problems
    Copy from a Compellent LUN to VMware drive > ODX not used, no problems
    Copy from a Compellent LUN to Compellent LUN > ODX used, no problems

    If you want me to test another scenario, tell me.

    • Some physical ones: host with local storage to and from a Compellent LUN on that host, UNC path to a volume on a non-Compellent LUN to and from a Compellent LUN + making sure the data copies are large enough to make sure ODX has work to do.

  4. UNC path EqualLogic To Compellent > ODX not used, no problems
    Compellent To UNC path EqualLogic > ODX not used, no problems

    Seriously, I don’t have physical Windows hosts with SAN connectivity anymore ><

  5. I was told by our account director at DellDirect that there will be no fix for ODX before summer, and that because our production can run without ODX we should disable it, start putting production on the SAN and wait for the next firmware to re-enable ODX. It took them 2 months to find the ODX problem, which appeared 12 hours after the firmware update and which I had been pointing out since the beginning.

    What do you think of disabling ODX? We have had the SAN for 4 months now and it was used for only 2-3 weeks without a bug.

    • Here we have escalated the problem, it’s confirmed to be present in the 6.7.5 and we are waiting for the confirmation of the ETA of the resolution. We will begin the Windows migration in April, if it’s not fixed by then we will disable ODX until further notice.

  6. Thank you so! much! for! this! Been banging my head against the wall for a week trying to figure out why I was having terrible throughput and MPIO error spam when copying between two separate Compellent systems (both 6.7.5.5). Disabling ODX fixed it immediately. You are my hero of the month!

  7. We are actually facing the same problem in SCOS 7.0.1.306. We have a case open, with investigation/testing under way at the moment to confirm this 100%.

    • Hi DennisB,

      Can you elaborate on your post please? Compellent is making 6.7.11 available for us after we discovered this issue in the 6.7.5 upgrade last weekend :(. They forgot to mention this minor glitch beforehand… Since they now advise us to upgrade to 6.7.11 I would think it was solved; the release notes state this as well. If you have other experience, please let me know.

      Rgrds,
      Niels

      • Hi,

        Please read the responses in the comments … lot of sharing going on.

        So 6.7.10 was supposed to resolve this but got pulled as it still has the issue.
        6.7.11 & 7.0.1.306 are reported to still have this issue. Some of the readers have shared their support case numbers that you can use to cross check with CoPilot and find out more.

      • If your experience is anything like ours, ODX is going to be the least of your headaches with 6.7.11. We’re going through major MPIO path failover events that bring one of the clusters to its knees. If you have a working version of the SAN firmware, stay with it as long as you can.

    • Are you serious !!

      I get spammed by one of the arrays with an enclosure error 150 times per day and CoPilot is not able to tell what is flagging the alerts (this is a different problem).

      They told me to do a BMC drain, remove the controller and battery for 10 minutes, and do the same with the other controller. If that does not help, do the update to the 6.7.10 release for key customers.

      Yeah! There is a GENERAL release 6.7.11, CoPilot tells me. We were on a key customer release of 6.7.2 that failed with MPIO issues.

      I explained to them why I want to do the update: the ODX issue and the possibility of no longer getting spammed by the enclosure problem.

      I called yesterday for the pre-check, called for the update today, and called again for the post-installation check. Nobody told me the ODX problem is not resolved in the new 6.7.11 version.

      I was supposed to re-enable ODX tonight. By chance I took a look at this site before doing anything, as I do not fully trust CoPilot and new firmware updates.

      Thanks to everyone, and continue to share.
      I will give a call to CoPilot tomorrow…

  8. I asked CoPilot and they tell me the ODX issue should be resolved.

    They tell me the ticket number above has been escalated because it looks like it is a Windows problem…

    • I can confirm the ODX problem with version 6.7.11.4.
      We again got 35000 MPIO errors in the hours after we re-enabled ODX.

      DO NOT re-enable ODX with this version if you already have the problem.

  9. I can also confirm the ODX problem still exists with 6.7.11.4. Today I reluctantly enabled ODX and initiated a manual phone home once MPIO errors started. A CoPilot escalation engineer has promised to get back to me tomorrow.

    • Have you gotten any news? I have been assigned to a new resolution manager but there is no news about ODX.

      I think Compellent has not correctly understood our problem when it comes to the resolution.

      The CoPilot technical service alert on the ODX resolution in 6.10.x says:

      Windows Server 2012 requests an Extended Copy between a source volume which is unknown to the Storage Center and a volume which is presented from the Storage Center.

      The ODX problem we encounter only happens on the SAN itself, with no other party or drive/volume involved.

      We are using the recommended storage hardware profile between tier 1 SSD and tier 3 NL-SAS. It should never write to the NL-SAS directly.

      When this ODX MPIO error happens I get writes on the SSD tier at 50Mb/sec but also writes on the NL-SAS drives at 200Mb/sec. No IO for iSCSI traffic, server or storage profile, but a lot of IO traffic on the SAS cable between the 2 tiers.

      • Yea, it’s very bizarre, and we have almost exactly the same problem. Our SSDs can no longer keep up with what the SAN is doing in the background, effectively overdriving them. We also don’t have any other volume sources outside of the volumes of the SAN, and I think these are two separate issues altogether. Disabling ODX doesn’t resolve anything for us, and we’re basically forced into engaging Microsoft Support as well.

        • You still have MPIO errors even if ODX is disabled on all cluster servers connected to the SAN?
          Which switch are you using? On what version of SCOS did this start happening for you?

          • Yeap, it happened to coincide with failed disks, and the array failed to rebuild. Which should really be only making the array slow, but it’s actually causing path issues, making the symptoms almost identical to ODX-related problems.

            It’s all on a pair of Nexus switches, and this is after 6.7.11 “upgrade”…

  10. Hi, as an update, I received this from CoPilot support and the Escalation Engineer.

    The fix will be released in 6.7.30 which is tentatively scheduled for release mid-January 2017.

    If this firmware finally repairs ODX, that makes 14 months without ODX functionality for us.

  11. We too have been bitten by this bug in their SCOS revision. According to official documentation in the Knowledge Base it is supposed to be resolved in 7.1.2, currently in “pre-release”. I have no timeframes but have worked with Co-Pilot on this issue. Affected versions are SCOS versions 6.7.5, 6.7.11, 6.7.20, 7.0 & 7.1.1 according to the official KB doc.

      • Appreciate the information. Nothing official but I am finding that 7.1.2 isn’t planned to be released until January – Mid 2017. I will be on the lookout for a fix for 6.7.X. I am still working with Co-Pilot to get better answers.

        • Discussed this with Copilot yesterday. 7.1.2 is available on request, but still isn’t GA. They can’t give a timeframe on GA as there have been lots of bugs (mostly around dedupe) that they need to make sure are ironed out first.
          So if you want it you can have it… but you might get other issues. Personally I can live without ODX so I’m choosing to wait for GA.

          • Yes. It depends on the situation. But this has been an issue for far too long. Also 7.1.2 is not a solution for people that are still on SC40 controllers. I’m not a happy camper.

  12. Supposedly 6.7.22 is a fix. It’s not GA though. Copilot needs to check your config before they release it to you.

    I’m busy downloading the ISO files and will be doing the update tomorrow on my SC4020.

    I’ll know soon enough if this update has fixed it for me. I had the issue where MPIO would get degraded almost daily after volume snapshots got corrupt while doing backups on a soon to be production Hyper-V cluster.

    • Let’s hope this works out for you and that a GA release of 6.7.x with a fix is out soon, as well as a Replay Manager that supports Windows Server 2016.

  13. CSTA was updated on November 23rd and it says 7.1.2 and 6.7.22 fix the issue. Does anyone have any experience with either of these updates?

    Any chance you will be doing a write-up on the deduplication/compression features released with SCOS 7?

    • Updates with info we already had … AFAIK these are not GA releases those might arrive in January 2017 or if we’re lucky in 2016. I’m not holding my breath. I’m still staying at pre-ODX issues firmware.

      • I updated to 6.7.22 on the recommendation of a Compellent Optimize rep. I don’t plan on re-enabling ODX on my Hyper-V hosts (Server 2012 R2) until I’m confident the issue is fixed this time.

  14. We’ve recently been hit by what appears to be the MPIO issue – Dell ProSupport has released version 7.1.3 to us as of earlier last week, confirming it as fixed, though looking at the long history of this issue I won’t hold my breath.

    • No. As you can read in the CoPilot Customer Technical Advisory, 6.7.5 is affected. It’s only fixed in the 6.7.22 release. An earlier attempt to fix it in 6.7.10 wasn’t successful. Do note that if you do not have the situation that causes the ODX issue to manifest itself you might not have problems in your environment.

  15. According to the Compellent Customer Portal, it appears that 6.7.22 is now generally available for the SC9000, SC8000, SC4020, and Series 40 only (no SC2xxx series). I updated my firmware to this version a few weeks ago, but I have been hesitant to enable ODX. I want to make sure the issue is really fixed this time.

    • Yes it is GA and earlier this week I have deployed it to 2 Compellent systems. We are now to start testing and I will share when I can.

  16. We recently upgraded all our arrays to 6.7.22 and we started seeing a lot of issues with Hyper-V; the performance of most of the DB VMs dropped to 1/3 of what it was before the upgrade. I don’t recommend anyone go with 6.7.22. After disabling ODX we did see everything return to normal, but I am in a bad situation: we can’t disable ODX in prod. Now the performance of all the arrays is very poor. I am still working on this and will let you know once I get a solution.

    • Sounds bad. Right now we’re only testing with lab clusters on separate SC40s with 6.7.22 – the original issue seems gone. We’ll need to focus on performance next it seems.

  17. I have not updated to 6.7.22; I am waiting for 6.7.30 and for more positive feedback here, and letting people share their experience. Murali said on Dec 17 that he has perf problems with 6.7.22. I will contact CoPilot next week.

      • So being bit by this in 6.22 I’ve heard from Dell/EMC over the weekend and they claim internally they feel 6.30 will not fix this troublesome bug. So be careful out there if it does hit GA and let it bake in test for months.

        • Couldn’t agree more… The problem first occurred in 6.7. I am running 6.6.13.5 on my production SAN and won’t be changing from this stable version until they have a confirmed fix out and we hear back from the community. My DR SAN is running 6.7.11.4, which has the ODX issue, but we turned off ODX on the Windows operating systems there. I cannot understand what is taking so long to get this right. Continuing to follow this issue closely.
