My favorite deployment for VMs with Discrete Device Assignment for GPU

Introduction

Recently I had an interesting discussion on how to leverage Discrete Device Assignment (DDA) for GPU needs when it’s only needed for a certain number of virtual machines. Someone had read my blogs on leveraging DDA, which made her optimistic and enthusiastic. But she noticed in the lab that she could not leverage DDA on a VM running on a cluster, and that she could not use storage QoS policies on a stand-alone Hyper-V host with local storage. So, what could she do?

Well, for one, her findings are correct. Microsoft did not enable DDA on clustered virtual machines. It doesn’t make sense, as the GPU hardware is tied to the virtual machine and high availability, both planned (live migration) and unplanned (failover), isn’t possible anyway. It just cannot be done. I hear you when you say “but they pulled it off for SR-IOV for networking”. Sure, but please keep in mind that network cards with Ethernet and TCP/IP allow for different approaches than high-end video.

My favorite deployment for VMs with Discrete Device Assignment for GPU

My favorite deployment for VMs with Discrete Device Assignment (DDA) for GPU leverages SMB3 SOFS shares for the virtual hard disks and stand-alone Hyper-V hosts that are member servers in the domain. Let me explain why.

Based on what we discussed above we have some options. One workaround is running the DDA virtual machines as non-highly-available VMs on local storage on a cluster node. But that would mean you would have a few VMs on all the nodes and that all those nodes must have a DDA capable GPU. Or, if you limit the number of nodes that have such a GPU, you’ll have a few oddballs in your cluster. You’ll need to manage some extra complexity and must safeguard against assigning a GPU via DDA that is already in use for RemoteFX. That causes all kinds of unpleasantness; nothing too deadly, but not something you want to do on your production VDI clusters for fun. It’s a bit like not running a domain controller on a CSV and not making it highly available. If that’s the only option you have, you can do that, and I do when needed, as Microsoft has improved a lot of things to make this a better and less risky experience. But I prefer to have either a physical one or to host it on a separate non-clustered Hyper-V host if that’s an option, because not all storage solutions and environments have all the capabilities needed to make that foolproof.
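For reference, this is roughly what handing a GPU to a VM with DDA looks like in PowerShell on a Windows Server 2016 host. Consider it a minimal sketch: the VM name (DDAVMSOFSStorage, used later in this post), the NVIDIA friendly name filter and the MMIO space sizes are assumptions you need to adapt to your own hardware; my earlier DDA posts cover the full procedure and the caveats.

#Prepare the VM (it must be shut down) with DDA-friendly settings
Set-VM -Name DDAVMSOFSStorage -AutomaticStopAction TurnOff
Set-VM -Name DDAVMSOFSStorage -GuestControlledCacheTypes $true -LowMemoryMappedIoSpace 512MB -HighMemoryMappedIoSpace 8GB

#Find the GPU and its location path (the friendly name filter is just an example)
$gpu = Get-PnpDevice | Where-Object { $_.Class -eq "Display" -and $_.FriendlyName -like "*NVIDIA*" }
$locationPath = ($gpu | Get-PnpDeviceProperty -KeyName DEVPKEY_Device_LocationPaths).Data[0]

#Disable the device on the host, dismount it and assign it to the VM
Disable-PnpDevice -InstanceId $gpu.InstanceId -Confirm:$false
Dismount-VMHostAssignableDevice -LocationPath $locationPath -Force
Add-VMAssignableDevice -LocationPath $locationPath -VMName DDAVMSOFSStorage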

Also note that running other storage on an S2D node isn’t supported. You have your OS on the boot disks and the disks used in Storage Spaces Direct; odd ones out aren’t supposed to be there, as S2D will try to recruit them. You can get away with it when using traditional shared storage.

What I also don’t like about that workaround is that with local storage you don’t get the benefit of storage QoS policies in Windows Server 2016; those only work with CSV or SMB3 SOFS storage. So optionally you could leave the non-clustered VM on a CSV. But that’s perhaps a bit confusing, and some people might think someone forgot to make the machine highly available.

My preferred setup to get highly available storage for virtual machines with DDA needs, while benefiting from what storage QoS policies have to offer for VDI, is to use stand-alone Hyper-V hosts that have DDA capable GPUs and leverage SMB3 SOFS shares for the virtual machines.

clip_image002

The virtual machines cannot be made highly available anyway, so you lose nothing there. The beauty is that in this case, as you leverage a Windows Server 2016 SOFS cluster for Hyper-V storage over SMB3 shares, you do get Storage QoS policies.

#On a SOFS node

Get-StorageQosPolicy -Name DedicatedTier1Policy | Get-StorageQosFlow | ft InitiatorName, *IOPS, Status, PolicyID, filePath -AutoSize

#Query for the VM disks on the Hyper-V node

Get-VM -Name DDAVMSOFSStorage -ComputerName RemoteFXHost | Get-VMHardDiskDrive |fl *

clip_image004

#We generate some IO and get some stats on a SOFS node

Get-StorageQosFlow

Get-StorageQosVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\

Get-StorageQosVolume -Mountpoint C:\ClusterStorage\SOFSDEMO\ | fl

clip_image006
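For completeness, creating such a policy and stamping it on the VM’s virtual hard disks only takes a couple of cmdlets. A minimal sketch, run from a management station, assuming a SOFS cluster named SOFSCLUSTER and the RemoteFXHost / DDAVMSOFSStorage names from the screenshots above; the IOPS values are just placeholders.

#Create a dedicated policy on the SOFS cluster
$policy = New-StorageQosPolicy -CimSession SOFSCLUSTER -Name DedicatedTier1Policy -PolicyType Dedicated -MinimumIops 500 -MaximumIops 5000

#Assign the policy to the VM's virtual hard disks on the Hyper-V host
Get-VM -Name DDAVMSOFSStorage -ComputerName RemoteFXHost | Get-VMHardDiskDrive | Set-VMHardDiskDrive -QoSPolicyID $policy.PolicyId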

You can start out with one Hyper-V node and add more when needed; that’s your scale-out. Depending on the needs of the virtual machines, the specs of the servers (memory, CPU cores) and the capability and number of GPUs in the video cards, you get some scale-up as well.

To learn more about DDA go here:  https://blog.workinghardinit.work/?s=DDA&submit=Search

To learn more about storage QoS policies go here:

Some more considerations

By going disaggregated you can leverage a SOFS share both for virtual machines running on a Hyper-V cluster and for those on stand-alone (non-clustered) Hyper-V hosts that are domain members. The SOFS cluster can be leveraging S2D, traditional Storage Spaces with shared SAS (JBODs) or even an FC, iSCSI or shared SAS SAN if that’s the only option you have. That’s all OK as long as it’s SOFS running on Windows Server 2016 and the Hyper-V hosts (stand-alone or clustered) are running Windows Server 2016 as well (needed for storage QoS policies and DDA). There is no need for the Hyper-V host to be part of a cluster to get the results you need. If I use SOFS for both scenarios I can use the same storage array, but I don’t need to; I could also use separate storage arrays. If the Hyper-V cluster is leveraging CSV instead of SOFS, I will need a separate array for SOFS, as it’s ill-advised to mix Hyper-V workloads with the SOFS role. Keep things easy, clear and supportable. I’ll borrow a picture I got from a Microsoft PM recently; it points out the bad ideas.

clip_image008
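As a side note, the share on the SOFS cluster that the stand-alone DDA hosts (and any Hyper-V cluster) consume is just a regular continuously available SMB share. A minimal sketch of creating one follows; the share name, path, SOFS client access point (ScopeName) and the DOMAIN\RemoteFXHost$ computer account are hypothetical, and as per the usual Hyper-V over SMB guidance the host computer accounts and the admins need Full Control on both the share and the file system.

#On the SOFS cluster, on the CSV path backing the share
New-SmbShare -Name DDAVMs -Path C:\ClusterStorage\SOFSDEMO\Shares\DDAVMs -ScopeName SOFS -ContinuouslyAvailable $true -FullAccess "DOMAIN\RemoteFXHost$", "DOMAIN\Domain Admins"
Set-SmbPathAcl -ShareName DDAVMs #mirror the share permissions onto the folder ACL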

Software-Defined Data Infrastructure Essentials

The last few months I spent some of my down time and commute time reading a book, a paper one actually: Greg Schulz’s “Software-Defined Data Infrastructure Essentials”. It is, as the subtitle states, about cloud, converged and virtual fundamental server storage I/O tradecraft.

It is not a book you’ll read to learn about a particular technology, product or vendor. It is a more holistic approach to educating people in today’s IT landscape: that vast area of expertise in which all the considerations around storage in a modern IT environment come together, where old and new, established and emerging ways of handling storage I/O for a variety of use cases meet and mix.

image

Reading the book helps you become more well versed in the subject and takes us out of our product- or problem-specific cocoons. That’s the main reason I’d recommend anyone to read it. I’m impressed by how well Greg managed to write a book on such a diverse subject that is accessible to all levels of expertise. The depth and the breadth of this subject make this quite a feat. On top of that, this book is usable and valuable to both novice and experienced professionals. I have said it before (on Twitter), but if I were teaching IT classes and needed to bring the students up to date on software-defined cloud data center data considerations, this would be the textbook. It acknowledges the diversity of solutions and architectures in the real world and doesn’t make bold marketing statements. Instead it focuses on what you need to know and consider when discussing and designing solutions. I wish many IT managers, consultants and analysts would attend my fictional class, but I’d settle for them reading this book and learning about a big part of what they need to manage. It would serve them well and help them understand concerns other involved parties might want to see addressed.

For me an extra benefit was that, while I enjoy talking shop with Greg, I only get those opportunities on rare occasions during conferences. As such, this book gave me some more time to read his views and insights. That’s the next best thing.

Testing Azure File Sync

Introduction

Since Azure File Sync is now out in public preview I can chat a bit about why I am interested in it. My interest is big enough for me to have tested it in private previews, even when managers weren’t really interested in doing so yet. It was too early for them, and it certainly didn’t help that, in many scenarios, there is “hardware” involved in some way or another. In a cloud-first world that’s bad by default, even if it’s only a hypervisor host and storage.

But part of my value is providing solutions that serve the clear and present needs of the business today and tomorrow, all without burdening them with a mortgage on their future capabilities to address the needs they’ll have then. In that aspect of my role I can and have to disagree with managers every now and then. Call it a perk or a burden, but I do tell them what they need to hear, not what they want to hear. So yes, I was and I am testing Azure File Sync. You have to remember that what’s too early in technology today is tomorrow’s old news. To paraphrase Ferris Bueller: technology moves pretty fast. If you don’t stop and look around once in a while, you could miss it.

If you need to get up to speed on Azure File Sync fast, take a look at https://azure.microsoft.com/nl-nl/resources/videos/azure-friday-hybrid-storage-with-azure-file-sync-langhout/.

clip_image002

I’m not going to discuss all of the current or announced capabilities of Azure File Sync here. There is plenty of info out there and the Program Managers are very responsive and passionate when it comes to answering your detailed questions. There’s plenty more info in the Microsoft Ignite 2017 sessions: https://myignite.microsoft.com/sessions/54963 and https://myignite.microsoft.com/sessions/54931.

What drove me to Testing Azure File Sync

I was in general unhappy with the StorSimple offering and other 3rd party offerings. There, I’ve said it. It was too much of a hit-or-miss point solution, and some of them are really odd ones out in an otherwise streamlined IT environment. I’ll keep it at that. While there are still other players in the market, we have another concern: on the cloud side of things we needed integration and capabilities a 3rd party just wouldn’t have the leverage with Microsoft to achieve. The reality is that unless you are talking about cloud in combination with huge amounts of money, you will not get their attention. It’s business, not personal. Let’s look at the major reasons for me to investigate Azure File Sync.

The need to access data from different locations in different ways

First of all, there is the need to access data from different locations in different ways, both on-premises and from Azure. In the cloud we access file shares via REST APIs, and that works fine for well controlled and secured application environments and services written in and for the modern workplace.

When we start involving human beings, that falls apart. The security model with roles and responsibilities doesn’t map well onto an access token you paste into a mapped drive of an Azure file share. Then there is the humongous number of applications that cannot speak REST. They most often speak NTFS and SMB, which are ruled by ACLs. They are children of a client/server world and as such, when designed and written well, they shine on-premises in a well-connected, low latency world. Yes, that very long tail of classical applications …

clip_image004

Windows applications and their long, long tail, by TeamRGE.com, as frequently found in RDS and Citrix VDI articles.

In cloud-based RDS solutions and IaaS we also use SMB and NTFS ACLs to consume data, so don’t assume everything in the cloud is pure REST and claims-based security.

I won’t even go into the need to have ACLs on Azure file shares, the options and need for extending AD to Azure over a VPN, ADFS, ADFS proxies, Azure AD Pass-through, Azure AD Sync, … the whole nine yards to cover all kinds of scenarios. Sure, I hear you: AD will go away, it doesn’t matter anymore. On a long enough time scale such statements are always true. But looking at Office 365’s success, I’m not howling with the crowd that e-mail is totally obsolete yet either. Long tails …

Anyway, when you need to consume the same data from different locations in different ways, it often leads to partial or complete copies of data that might change in different places and need to be protected in different places. That’s a lot of overhead to deal with, and consistency can become a serious issue. And with this we have touched on our second driver.

Backups, data protection & recovery

Yup, good old backups. You know, the thing too many of the mediocre IT managers out there think there is nothing to anymore, as they solved that problem over a decade ago. But time doesn’t stand still, plain and simple. I sometimes deal with largish file server clusters leveraging SMB3 (and ODX when available). The LUNs for those file shares are generally anything between 8 TB and 20 TB, totaling 100 TB and upwards per cluster. There are dozens of LUNs on any such cluster, with tens of thousands of folders and hundreds of millions of files. A DFS namespace helps us keep a unified namespace for the consumers. Since Windows Server 2012, ages ago, we have had the technology to make this a breeze: SMB3 with continuous availability, leveraging SMB3/ODX wherever we can and it makes sense. Thank you, Microsoft!

One of the selling points for Azure File Sync is dealing with storage capacity issues. That’s the least of my worries actually. We have never struggled with a lack of storage on file servers. For crying out loud, we solved that problem well over a decade ago. Yes, even on premises. So that’s not our main motivation here.

What is still not covered well enough for us is backups. That number of files, not just the data volume, is a challenge. Not so much the backup target storage and the infrastructure; basically, that can be built performant enough and reasonably cost-effective. The big challenge with those numbers is the time it takes to deal with backups.

We have had to revert to application-consistent storage snapshots, and replicas of those on campus and between cities, to help deal with the fact that “traditional backup” doesn’t really scale well to those volumes. That’s great, but when data is valuable the 3-2-1 rule is set in stone. So, you need more. A lot more actually, as today you might want to supplement it with extra rules to include air gapping and encryption. Got ransomware anyone?

Traditional backups with an agent on the physical host or in the guest are not cutting it. When you virtualize, you can leverage host-based backups of entire VMs, or split it up in host-based backups of the OS, the data virtual disks, etc. It’s a lot more effective to back up a number of 5/15/25 TB VHDX files than it is to do so with in-guest or in-server backups. Even more so when you combine this with modern change block tracking, synthetic full backups and the health check/repair capabilities in backup products.

The challenge was that with truly virtualized clusters, meaning shared VHDX, the backup story wasn’t a 1st class citizen. Which made people either not do it, or use iSCSI or vFC to the guest and deal with the backups as if they were physical machines, losing the benefits associated with host-based backups of VMs. Things have gotten better with Windows Server 2016, so we had two paths to investigate to deal with these challenges.

There are 2 approaches to data protection we are investigating

There are 2 approaches to data protection we are investigating, and the solution might be a combination of them.

Virtualize it all and then back up the VMs. In other words, leverage the power of host-level backups of virtual machines. Things have become a whole lot better for guest cluster backups with Windows Server 2016, but we are not quite there yet in order to call them 1st class citizens. The other concern is that the growth in data will still beat any progress we can make in speed and reliability with backups. Another concern is that people don’t want to deal with storage and are looking for storage as a service. The easiest and smartest way to achieve this is to look at public cloud offerings that deliver what’s needed and not take on the burden of managing 3rd parties to deliver this.

The first problem might be covered by Veeam Backup & Replication v10, which will allow for backing up file shares. As long as shared VHDX or VHD Sets are not full-blown, equal citizens in Hyper-V when it comes to backups, we’ll look at other solutions.

That’s fine, but what about pure volume and that 3-2-1 rule (multiple copies, different technologies, different locations) or better? The cost and overhead of a second site can be a challenge for some. Can Azure File Sync help there? It sure can. But the big bonus is that it is also a big step forward in helping with the need to access data from different locations in different ways.

The problem that we might still have is that we are dependent on the same type of solution if it’s geo-replicated. Application-consistent snapshots, storage array or Windows VSS based, that are geo-replicated are a very important part of our protection against data loss. That has become even more obvious with the ever-growing threat of ransomware. In this threat model you try to separate access to data and backups both logically and physically, as well as technology and operations wise. Full “air gapping” is the Valhalla here. Lacking that, you need at least serious delaying factors to prevent an attacker from gaining access to both the source data, the snapshots on Windows or on storage arrays, the backup targets and the replicas of all the above.

We could actually use Azure File Sync as a primary or secondary backup target and combine the best of both worlds. Interesting, huh!
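Azure file shares also support share snapshots, which are easy to take with the Azure Storage PowerShell module of that era. A minimal sketch, with a hypothetical storage account and share name; the Snapshot() call on the returned share object is what creates the read-only, point-in-time copy.

#Take a share snapshot of an Azure file share
$ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey "<storage account key>"
$share = Get-AzureStorageShare -Name "myfileshare" -Context $ctx
$snapshot = $share.Snapshot()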

The fact that we can have snapshots of Azure file shares is important to us for data recovery. It’s also important to have those replicated, in case the Azure West Europe datacenter goes south, so to speak. The next question is whether our needs and wants are covered today in the preview.

Are all our needs and wants covered today in the preview?

The question we need to answer is whether it covers the needs we have well enough or better. Today Azure File Sync isn’t generally available, and when it is, it will be v1. So, things can and will improve.

Let’s forget about LUN size limitations (5 TB) for the moment, or some of our other future capability and functionality requests on the wish list. I don’t need 64 TB or bigger per se, but I’d like to see at least 20 to 30 TB LUNs supported. I’m not sure whether Azure File Sync snapshots will be limited by VSS to 64 TB or whether they break through that barrier, like we can with certain hardware VSS providers.

Today we limit the size of LUNs for repair as well as data protection and recovery speeds, and to minimize the impact of a downed LUN. When the data is in the cloud and “only” a cache resides on premises, these concerns might subside more and more. At Ignite 2017 Microsoft mentioned sizes of +/- 100 TB, so there is an indication of that 64 TB barrier being broken somehow.

The one thing we are not getting yet with Azure File Sync in preview is multiple copies on two different technologies in two different locations (remember that 3-2-1 rule). You are and remain dependent on the Microsoft solution; maybe adherence to that rule is built into their technological design, maybe it isn’t. I don’t know right now. We have LRS for now, but not GRS. Do note that things change rather fast in the cloud, and what isn’t here today might be there tomorrow. GRS or ZRS is on the roadmap to be available when Azure File Sync goes into production.

clip_image006

Also, Microsoft realized that having snapshots in the same storage account was a risk. So, the backups will go to the Recovery Services vault in the future. It will also provide for long-term data retention. This is important for archiving and for authentic data that cannot be reproduced and as such has to be secured against tampering.

Maybe you don’t trust MSFT or the cloud 100%. If that’s the case and it’s a hard policy requirement of the business to have a protected data copy that is not dependent on the cloud provider, we could propose a solution at a certain cost. You can use a file share cluster or, cheaper, a single file server in the datacenter (one location only) where everything is pinned locally (no cloud tiering) and which is used as a backup source to a backup target on different storage. The challenge there, again, is that you still have to deal with that massive volume of files yourself. Whether that’s worth the price/benefit versus the risks is not for me to decide. Yes, no, maybe so: that’s not my call. I deal in technological solutions, and I buy and build technological results for the needs of the organization. I do this in the context and in the landscape in which that business operates. I do not buy or build services, let alone journeys. That’s the part I do, and I do it very well. In my opinion Azure File Sync will play a major part in all this.

Other things on the wish list are:

  • A global file lock mechanism and orchestration with an error handling model (Try, Catch, Finally), so automation in different sites can deal with locked files in the application logic, helping with the challenges of distributed access (read/write/modify) to data. I’m not asking to reinvent SharePoint here, don’t get me wrong, but this area remains a challenge, both for the users of data and for those trying to solve those challenges in code. There are just so many applications working in different ways that perfection in this arena will not be of this world, and any solution will require some tweaking and flexibility to cater to different needs. But I think an effort here could make a real value proposition.
  • Bringing ACLs to Azure file shares might enable new scenarios as well. It’s been on the wish list for a while and, while MSFT is looking into it, we don’t have it yet.
  • Support for data deduplication. Kind of speaks for itself; both customers and Microsoft would benefit.
  • I’d love to see support for ReFSv3 as well in time, because I’m still hoping Microsoft will bring the benefits of ReFSv3 to many more use cases than today. I think they got my feedback on that one loud and clear already.

So far for some of the musings that led us to testing Azure File Sync. We’ll share more and become a bit more technical as our journey progresses.

Missing Hyper-V Service Connection Point caused failed off-host backup proxy jobs

The issue

We have a largish Windows Server 2016 Hyper-V cluster (9 nodes) that is running as smooth as can be, but for one issue. The off-host backups with Veeam Backup & Replication v9.5 (based on transportable hardware snapshots) are failing. They only fail for the LUNs that are currently residing on a few of the nodes in that cluster. So when a CSV is owned by node 1 it will work; when it is owned by node 6 it will fail. In this case we had 3 nodes that had issues.

As said, everything else on these nodes, cluster-wise or Hyper-V-wise, was working 100% perfectly. As a matter of fact, it was the perfect Hyper-V cluster we’d all sign for, bar that one very annoying issue.

Finding the cause

When looking at the application log on the off-host backup proxy, it’s quite clear that there is an issue with the hardware VSS provider snapshots.

We get event ID 0, stating the snapshot is already mounted to a different server.

clip_image002

This is followed by event ID 12293, stating the import of the snapshot has failed.

clip_image004

When we check the SAN and monitor a problematic host in the cluster, we see that the snapshot was taken just fine. What was failing was the transport to the backup repository server. It also seemed like an attempt was made to mount the snapshot on the Hyper-V host itself, which also failed.

What was causing this? We dove into the Hyper-V and cluster logs and found nothing that could help us explain the above. We did find the old, very cryptic and almost undocumented error:

Event ID 12660 — Storage Initialization

Updated: April 7, 2009

Applies To: Windows Server 2008

This is preliminary documentation and subject to change.

clip_image005

This aspect refers to events relevant to the storage of the virtual machine that are caused by the storage configuration.

Event Details

Product: Windows Operating System
ID: 12660
Source: Microsoft-Windows-Hyper-V-VMMS
Version: 6.0
Symbolic Name: MSVM_VDEV_OPEN_STOR_VSP_FAILED
Message: Cannot open handle to Hyper-V storage provider.

Resolve

Reinstall Hyper-V

A possible security compromise has been created. Completely reimage the server (sometimes called a bare metal restoration), install a new operating system, and enable the Hyper-V role.

Verify

The virtual machine with the storage attached is able to launch successfully.

This doesn’t sound good, does it? Now you can web search this one and find very little information, or find people having serious issues with normal Hyper-V functions like starting a VM. Really bad stuff. But we could start, stop, restart, live migrate, storage live migrate, create checkpoints, etc. at will, without any issues or even so much as a hint of issues in the logs.

On top of that, Event ID 12660 did not occur during the backups. It happened when you opened Hyper-V Manager and looked at the settings of Hyper-V or a virtual machine. Everything else on these nodes, cluster-wise or Hyper-V-wise, was working 100% perfectly. Again, this is the perfectly behaving Hyper-V cluster we’d all sign for, if it didn’t have that very annoying issue with transportable snapshots on some of the nodes.

We extended our search outside of the Hyper-V cluster nodes and then we hit a clue. On the nodes that owned the LUNs being backed up and that showed the problematic transportable backup behavior, we noticed that the Hyper-V Service Connection Point (SCP) was missing.

clip_image006

We immediately checked the other nodes in the cluster having a backup issue. BINGO! That was the one and only common factor. The missing Hyper-V SCP.
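If you want to check this quickly across your nodes: the Hyper-V SCP is simply a serviceConnectionPoint object named “Microsoft Hyper-V” that lives directly under the computer account. A minimal sketch using the AD PowerShell module, with example node names; no output for a node means the SCP is missing.

#Check for the Hyper-V SCP under each node's computer account
Import-Module ActiveDirectory
foreach ($node in "Node1", "Node6") {
    $computerDN = (Get-ADComputer -Identity $node).DistinguishedName
    Get-ADObject -SearchBase $computerDN -SearchScope OneLevel -Filter { objectClass -eq "serviceConnectionPoint" -and Name -eq "Microsoft Hyper-V" }
}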

Fixing the issue

Now, you can create one manually, but that leaves you with missing security settings, and you can’t set those manually. The Hyper-V SCP is created and its attributes are populated on the fly when the server boots. So, it’s normal not to see one when a server is shut down.

The fastest way to solve the issue was to evacuate the problematic hosts, evict them from the cluster and remove them from the domain. For good measure, we reset the computer accounts in AD for those hosts, and if you want you can even remove the Hyper-V role. We then rejoined those nodes to the domain. If you removed the Hyper-V role, you now reinstall it. That already showed the SCP issue to be fixed in AD. We then added the hosts back to the cluster and they have been running smoothly ever since. The Event ID 12660 entries are gone, as are the VSS errors. It’s a perfect Hyper-V cluster now.
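In PowerShell terms the remediation per problematic node looks roughly like the outline below. This is a sketch with an example node name and credentials, not a copy-paste script; the steps run partly from a cluster node and partly on the evicted node itself, with the AD computer account reset done in between.

#Drain the node and evict it from the cluster
Suspend-ClusterNode -Name Node6 -Drain -Wait
Remove-ClusterNode -Name Node6 -Force
#On the node itself: leave the domain and reboot (optionally remove the Hyper-V role), reset the computer account in AD, then rejoin
Remove-Computer -UnjoinDomainCredential DOMAIN\Admin -Force -Restart
Add-Computer -DomainName domain.local -Credential DOMAIN\Admin -Restart
#Back on a cluster node: add it back to the cluster
Add-ClusterNode -Name Node6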

Root Cause?

We think that somewhere during the life cycle of the hosts the servers were renamed while still joined to the domain and with the Hyper-V role installed. This might have caused the issue. During a Cluster Operating System Rolling Upgrade with an in-place upgrade, we also sometimes see the need to remove and re-add the Hyper-V role. That might also have caused the issue. We are not 100% certain, but that’s the working theory and a point of attention for future operations.