My favorite deployment for VMs with Discrete Device Assignment for GPU

Introduction

Recently I had an interesting discussion on how to leverage Discrete Device Assignment (DDA) for GPU needs when only a certain number of virtual machines require it. Someone had read my blogs on leveraging DDA, which made her optimistic and enthusiastic. But she noticed in the lab that she could not leverage DDA on a VM running on a cluster, and she could not use storage QoS policies on a stand-alone Hyper-V host with local storage. So, what could she do?

Well, for one, her findings are correct. Microsoft did not enable DDA on clustered virtual machines. It doesn’t make sense to do so, as the GPU hardware is tied to the virtual machine, so high availability, whether planned (live migration) or unplanned (failover), isn’t possible anyway. It just cannot be done. I hear you when you say “but they pulled it off for SR-IOV for networking”. Sure, but please keep in mind that network cards with Ethernet and TCP/IP allow for different approaches than high-end video.

My favorite deployment for VMs with Discrete Device Assignment for GPU

My favorite deployment for VMs with Discrete Device Assignment (DDA) for GPU leverages SMB3 SOFS shares for the virtual hard disks and stand-alone Hyper-V hosts that are member servers in the domain. Let me explain why.

Based on what we discussed above we have some options. One workaround is running the DDA virtual machines, not highly available, on local storage on a cluster node. But that would mean you would have a few VMs on all the nodes and that all those nodes must have a DDA capable GPU. Or, if you limit the number of nodes that have such a GPU, you’ll have a few oddballs in your cluster. You’ll need to manage some extra complexity and must safeguard against assigning a GPU via DDA that is already in use for RemoteFX. That causes all kinds of unpleasantness, nothing too deadly, but not something you want to do on your production VDI clusters for fun. It’s a bit like not running a domain controller on a CSV and not making it highly available. If that’s the only option you have, you can do that, and I do when needed, as Microsoft has improved a lot of things to make this a better and less risky experience. But I prefer to have either a physical one or host it on a separate non-clustered Hyper-V host if that’s an option, because not all storage solutions and environments have all the capabilities needed to make that foolproof.
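A quick way to safeguard against that clash is to check what RemoteFX has already claimed before you dismount a GPU for DDA. This is just a minimal sketch with the Hyper-V cmdlets; the *NVIDIA* wildcard is purely an illustration, match your own adapter name.

#On the Hyper-V node: see which GPUs RemoteFX has claimed
Get-VMRemoteFXPhysicalVideoAdapter | fl Name, Enabled

#If the GPU you want for DDA shows up as enabled for RemoteFX, take it out of the RemoteFX pool first
Get-VMRemoteFXPhysicalVideoAdapter -Name *NVIDIA* | Disable-VMRemoteFXPhysicalVideoAdapter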

Also note that running other storage on an S2D node isn’t supported. You have your OS on the boot disks and the disks used in Storage Spaces Direct; odd ones out aren’t supposed to be there, as S2D will try to recruit them. You can get away with it when using traditional shared storage.
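If you want to see which disks S2D will try to recruit on such a node, a quick look at what the host sees is enough. A minimal sketch; it only lists the disks and whether they are eligible for pooling.

#On the S2D node: any disk showing CanPool = True is a candidate S2D will want to claim
Get-PhysicalDisk | ft FriendlyName, SerialNumber, CanPool, Size, MediaType -AutoSize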

What I also don’t like about that is that, unless the cluster storage is SMB3 SOFS, storage QoS policies in Windows Server 2016 only work with CSV, so the VM sitting on local storage misses out on them. Optionally you could leave the non-clustered VM on a CSV, but that’s perhaps a bit confusing, and some people might think someone forgot to make the machine highly available.

My preferred setup to get highly available storage for virtual machines with DDA needs, while benefiting from what storage QoS policies have to offer for VDI, is to use stand-alone Hyper-V hosts that have DDA capable GPUs and leverage SMB3 SOFS shares for the virtual machines.
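For reference, handing the GPU to a VM on such a stand-alone host is only a handful of PowerShell lines. A minimal sketch; the VM name, the location path and the MMIO sizes below are purely illustrative, look up your own location path and follow your GPU vendor’s guidance for the MMIO space.

#On the stand-alone Hyper-V host: find the GPU and its instance
Get-PnpDevice -Class Display | ft FriendlyName, InstanceId -AutoSize

#The VM must be off; most GPUs also need guest controlled cache types and extra MMIO space
Set-VM -Name DDAVMSOFSStorage -AutomaticStopAction TurnOff -GuestControlledCacheTypes $true -LowMemoryMappedIoSpace 3GB -HighMemoryMappedIoSpace 33280MB

#Dismount the GPU from the host and assign it to the VM (example location path only)
$LocationPath = "PCIROOT(0)#PCI(0300)#PCI(0000)"
Dismount-VMHostAssignableDevice -LocationPath $LocationPath -Force
Add-VMAssignableDevice -LocationPath $LocationPath -VMName DDAVMSOFSStorage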


The virtual machines cannot be highly available anyway, so you lose nothing there. The beauty is that in this case, as you leverage a Windows Server 2016 SOFS cluster for Hyper-V storage over SMB3 shares, you do get Storage QoS policies.
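Creating such a policy and stamping it on the VM’s disks takes only a couple of lines. A minimal sketch; DedicatedTier1Policy, DDAVMSOFSStorage and RemoteFXHost are the names used in this demo, while the IOPS values and the SOFSCLUSTER name are plain placeholders to adjust to your environment.

#On a SOFS cluster node: create a dedicated policy (the IOPS values are just examples)
New-StorageQosPolicy -Name DedicatedTier1Policy -PolicyType Dedicated -MinimumIops 300 -MaximumIops 2000

#On the Hyper-V host: fetch the policy from the SOFS cluster and apply it to the VM's virtual hard disks
$Policy = Get-StorageQosPolicy -Name DedicatedTier1Policy -CimSession SOFSCLUSTER
Get-VM -Name DDAVMSOFSStorage -ComputerName RemoteFXHost | Get-VMHardDiskDrive | Set-VMHardDiskDrive -QoSPolicyID $Policy.PolicyId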

#On a SOFS node

Get-StorageQosPolicy -Name DedicatedTier1Policy | Get-StorageQosFlow | ft InitiatorName, *IOPS, Status, PolicyID, filePath -AutoSize

#Query for the VM disks on the Hyper-V node

Get-VM -Name DDAVMSOFSStorage -ComputerName RemoteFXHost | Get-VMHardDiskDrive | fl *


#We generate some IO and get some stats on a SOFS node

Get-StorageQosFlow

Get-StorageQosVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\

Get-StorageQosVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\ | fl


You can start out with one Hyper-V node and add more when needed; that’s your scale-out. Depending on the needs of the virtual machines, the specs of the servers (memory, CPU cores) and the capability and number of GPUs in the video cards, you get some scale-up as well.

To learn more about DDA go here:  https://blog.workinghardinit.work/?s=DDA&submit=Search

To learn more about storage QoS policies go here:

Some more considerations

By going disaggregated you can leverage a SOFS share for virtual machines running on a Hyper-V cluster as well as on stand-alone (non-clustered) Hyper-V hosts that are domain members. The SOFS cluster can be leveraging S2D, traditional storage spaces with shared SAS (JBODs) or even an FC, iSCSI or shared SAS SAN if that’s the only option you have. That’s all OK as long as it’s SOFS running on Windows Server 2016 and the Hyper-V hosts (stand-alone or clustered) are running Windows Server 2016 as well (needed for Storage QoS policies and DDA). There is no need for the Hyper-V host to be part of a cluster to get the results you need. If I use SOFS for both scenarios I can use the same storage array, but I don’t need to; I could also use separate storage arrays. If the Hyper-V cluster is leveraging CSV instead of SOFS, I will need to use a separate one for SOFS, as it’s ill advised to mix Hyper-V workloads with the SOFS role. Keep things easy, clear and supportable. I’ll borrow a picture I got from a Microsoft PM recently; it calls out the bad ideas to steer clear of.


5 thoughts on “My favorite deployment for VMs with Discrete Device Assignment for GPU”

  1. “If the Hyper-V cluster is leveraging CSV instead of SOFS I will need to use a separate one for SOFS as its ill advised to mix Hyper-V workloads with the SOFS role.”

    Are you saying here that you simply couldn’t, or at least shouldn’t, use a hyperconverged S2D cluster and simply put a stand-alone Hyper-V host for DDA next to it using the S2D storage? I do not really see why that would pose a potential problem, unless you would add a huge number of separate hosts all drawing IOPS from the S2D. Could you elaborate a bit more on that? You say it’s ill advised to mix Hyper-V with SOFS, but that’s exactly what hyperconverged S2D is, isn’t it?

    • I don’t make the rules. It is still ill advised, and Microsoft told me once again last week, and 15 minutes ago tonight: no, we do not support doing that. You do not run other workloads than Hyper-V on your Hyper-V cluster nodes. It works, and if the system has sufficient resources it won’t cause issues (I do it in the lab actually), but not in production. Now, is that a bit of a bummer for some scenarios? Yup, but I understand the reasoning behind that statement. What is supported is running a SOFS role in a guest cluster, but for Hyper-V workloads that might not be the best option, and host level backups of a guest SOFS cluster are unavailable, just saying. The discussions will continue at the MVP summit I’m sure. But for now, that’s how it is.

      In regards to your last statement: “You say it’s ill advised to mix Hyper-V with SOFS, but that’s exactly what hyperconverged S2D is, isn’t it?” No, it is not. There is no SOFS to be found in the hyper-converged S2D model. That’s S2D: Hyper-V running against CSVs created on the virtual disks from the S2D storage pool. No SOFS involved. When you put SOFS on S2D you have the above scenario (not supported) if you combine it with Hyper-V. If not, you have a SOFS cluster for converged scenarios, backup targets etc. But it’s not hyper-converged or HCI.

  2. Regarding this discussion. I understand now that SOFS is not supported on S2D. Is that only because of Hyper-V localhost shares, or just never? Consider this example: we are running RDS servers in an RDS farm as VMs on S2D. I would want User Profile Disks to be CA-shared. Then I have two options: build a guest SOFS on the S2D and share the UPDs from there, or host another CSV on S2D and use SOFS to share them that way. Since backing up is not working properly anyway, both options would be viable, and viable for using Storage Replica on top of that to get a backup in a different way. In short: is it not supported because of localhost sharing (as it always was before), or just not at all, for different reasons?
    I mean, in this scenario there is a SOFS on S2D, but it does not get accessed from a localhost, only from VMs running on the S2D. Is there perhaps any information available that you know of as to why, and/or what is and is not supported regarding the SOFS/S2D combination on 2016?

    • It’s a general statement that you cannot mix roles with Hyper-V; SOFS is another such role. They don’t want competition for resources between the hypervisor and other workloads. Sometimes we hear otherwise, but a check with the mothership always comes back with the answer: don’t. SOFS on S2D is supported, but without Hyper-V: the disaggregated model. S2D scenarios in themselves are limited in what is supported on them: SOFS, Hyper-V and SQL, but no mixing. Other things are not recommended / supported. They might work, or not, or stop working, etc. So I avoid that. In regards to UPD you’re right on the money.

