Dell Storage Replay Manager 7.6.0.47 for Compellent 6.5

Recently as a DELL Compellent customer version 7.6.0.47 became available to us. I download it and found some welcome new capabilities in the release notes.

  • Support for vSphere 6
  • 2024 bit public key support for SSL/TLS
  • The ability to retry failed jobs (Microsoft Extensions Only)
  • The ability to modify a backup set (Microsoft Extensions Only)

The ability to retry failed jobs is handy. There might be a conflicting backup running via a 3rd party tool leveraging the hardware VSS provider. So the ability to retry can mitigate this. As we do multiple replays per day and have them scheduled recurrently we already mitigated the negative effects of this, but this only gibes us more options to deal with such situations. It’s good.

image

The ability to modify a backup set is one I love. It was just so annoying not to be able to do this before. A change in the environment meant having to create a new backup set. That also meant keeping around the old job for as long as you wanted to retain the replays associated with that job. Not the most optimal way of handling change I’d say, so this made me happy when I saw it.

image

Now I’d like DELL to invest a bit more in make restore of volume based replays of virtual machines easier. I actually like the volume based ones with Hyper-V as it’s one snapshot per CSV for all VMs and it doesn’t require all the VMs to reside on the host where we originally defined the backup set. Optimally you do run all the VMs on the node that own the CSV but otherwise it has less restrictions. I my humble opinion anything that restricts VM mobility is bad and goes against the grain of virtualization and dynamic optimization. I wonder if this has more to do with older CVS/Hyper-V versions, current limitations in Windows Server Hyper-V or CVS or a combination. This makes for a nice discussion, so if anyone from MSFT & the DELL Storage team responsible for Repay Manager wants to have one, just let me know Smile 

Last but not least I’d love DELL to communicate in Q4 of 2015 on how they will integrate their data protection offering in Compellent/Replay manager with Windows Server 2016 Backup changes and enhancements. That’s quite a change that’s happing for Hyper-V and it would be good for all to know what’s being done to leverage that. Another thing that is high on my priority for success is to enable leveraging replays with Live Volumes. For me that’s the biggest drawback to Live Volumes: having to chose between high/continuous availability and application consistent replays for data protection and other use cases).

I have some more things on my wish list but these are out of scope in regards to the subject of this blog post.

Fixing Two Small DELL Compellent Hardware Hiccups

Here’s two little tips to solve some small hardware issues you might run into with a Compellent SAN. But first, you’re never on your own with CoPilot support. They are just one phone call away so I suggest if you see these to minor issues you give them a call. I speak from experience that CoPilot rocks. They are really good and go the extra mile. Best storage support I have ever experienced.

Notes

  • Always notify CoPilot as they will see the alerts come in and will contact you for sure Smile. Afterwards they’ll almost certainly will do a quick health check for you. But even better during the entire process they keep an eye on things to make sure you SAN is doing just fine. And if you feel you’d like them to tackle this, they will send out an engineer I’m sure.
  • Note that we’re talking about the SC40 controllers & disk bays here. The newer genuine DELL hardware is better than the super micro ones.

The audible alert without any issues what so ever

We kept getting an audible alert after we had long solved any issues on one of the SANs. The system had been checked a couple of times and everything was in perfect working order. Except for that audible alarm that just didn’t want to quit. A low priority issue I know but every time we walk into the data center we were going “oh oh” for a false alert. That’s not the kind of conditioning you want. Alerts are only to be made when needed and than they do need to be acted upon!

Working on this with CoPilot support we got rid of it by reseating the upper I/O module. You can do this on the fly – without pulling SAS-cables out or so, they are redundant, as long as you do it one by one and the cabling is done right (they can verify that remotely for you if needed).

image

But we got lucky after the first one. After the “Swap Clear” was requested  every warning condition was cleared and we got rid of the audible alert beep!  Copilot was on the line with us and made sure all paths are up and running so no bad things could happen. That’s what you have a copilot for.

Front panel display dimming out on a Compellent Disk Bay

We have multiple Compellent SANs and on one of those we had a disk bay with a info panel that didn’t light up anymore. A silly issue but an annoying one as this one also show you the disk bay ID.

image

Do we really replace the disk bay to solve this one? As that light had come on and of a couple of time it could just be a bad contact so my colleague decided to take a look. First  he removed the protective cover and then, using some short & curved screw drivers, he took of the body part. The red arrow indicates the little latch that holds the small ribbon cable in place.

image

That was standing right open. After locking that down the info appeared again on the panel. The covers was screwed on again and voila. Solved.

FREE WHITE PAPER: Configuring a VEEAM Off Host Backup Proxy Server for backing up a Windows Server 2012 R2 Hyper-V cluster with a DELL Compellent SAN (Fiber Channel)

Whilst I’m attending TechEd North America 2014, being able to learn and network again with the community at large I think this is a good moment to share. So here’s a little contribution to that community: it’s a white paper on How to configure a VEEAM Off Host Backup Proxy server for backing up a Windows Server 2012 R2 Hyper-V cluster with a DELL Compellent SAN (Fiber Channel).

VEEAM Back & Replication is currently under and extensive test before we make the decision. So far it is going (very) well. And no, VEEAM or DELL did not sponsor this. It’s sharing with the community. A prosperous, successful community makes my professional live better to!

I have to applaud VEEAM for allowing such easy access to their software for trials, to their engineers for assistance and to their support forum and resources even without yet being a paying customer. This is how it should be: vendors having faith in their products both in quality and ease of use. It’s a refreshing experience as some vendors don’t want you to get your hands on new versions of their products even as a existing paying customer “because due to its complexity we might get the wrong impression”. It’s even near impossible with some to get a test license for the lab of the version you currently use with some of them. Not so with VEEAM and that’s great.

I hope you enjoy it. As you might realize I don’t have this kind of infrastructure in my home lab so some of the screenshots have been edited / blurred. I’m sure you can live with that. Otherwise feel free to provide me with the gear in a paid for data center.

Use Cases For Fluid Cache For SAN With DELL Compellent In High Performance Virtualization With Windows 2012 R2

Fluid Cache For SAN

At Dell World 2013 in Austin Texas I spent some time talking to engineers & managers about Fluid Cache For SAN. The demo in the keynote was enough to grab my distinct attention, especially as a Compellent customer.

What is it?

Dell already has Fluid Cache for DAS available in its PowerEdge servers. Now it’s time to bring this to their best SAN offering, the Compellent, and make Fluid Cache shared storage suitable for shared storage clustering. The way to do that cost effective and high performance is to build on the success of on board (local to the server) high performance storage and make that shared through software in a physical shared nothing replication/sync model. To make this happen they use a 10/40Gbps Ethernet solution leveraging RoCE (RDMA over Converged Ethernet). Yes that very technology I have been investing time & effort in for SMB Direct and which we leverage for CSV & Live Migration traffic and with SOFS in Windows 2012 R2.

Basically the super low latency an high throughput enable the memory to be synced across all nodes in a cluster and as such each node sees all the cluster memory. For redundancy you will need at least 3 nodes in a cluster. Dell will scale Fluid Cache For SAN to 128 nodes. Windows Server 2012 R2 can handle 64 nodes, which some think is ridiculously high, but then again, Dell aims even higher so it’s not as weird as you think. Some people have really huge computing needs. Just remember that 10 years ago you probably found that 16GB of RAM was extravagant.

Why this architecture?

Dell uses server based “shared nothing flash storage” & high speed low, latency synchronization to create a logical cluster wide shared pool of flash memory. This means the achieve stellar low latency as the flash storage is inside of the servers, close to the processors and as such delivers excellent performance for the workloads. Way better that “just” flash only SAN can. For data integrity they commit the data only when it written to the Express Flash drive(s) of one server and then also to another and verified. This needs to happen very fast and that where the RoCE network come sin to play. Later, at less speed critical times the data is pushed out to the Compellent SAN for storage. If that SAN is a flash based setup think about the capabilities this gives you in performance. Likewise data reads of the SAN that are highly active are pushed from the Compellent SAN and cached (also in multiple copies) on the Express Flash modules. While two servers with each a copy of the data on Express Flash modules would suffice DELL requires at least three. This is just a plain common sense N+1 redundancy design to have high availability even when a node fails. A cool think to note is that you can build larger clusters with 3 nodes each having one or more Express Flash modules and additional nodes don’t need it as long as they can read the cache of those 3. So the cost of this can be managed. The drawback is that you don’t read & write to a local Express Flash module on those extra node. If you want that you’ll need to put more $ on the table.

clip_image001

The thing to note here is that the Servers/SAN are connected over RoCE/RDMA. Well this look familiar. What technology can also leverages RDMA? SMB Direct in Windows Server 2012 R2! And where do we use this amongst other things? Storage IO in Scale Out File Server, CSV traffic, Live Migration …

The big benefit of this design it just it takes your SAN to the next level but also, if DELL does this right, they won’t break any of the good stuff like VSS aware snapshot with Replay Manager, Automatic Data Tiering, Live Volumes, Live Migration etc. A lot of the high IOPS/low latency solutions out their based on fast local flash break a lot of the good stuff and reduces centralized storage management. What if you can have your cookie and eat it to?

Demo Time at Dell World

Dell demonstrated an Oracle database load on an eight node cluster of PowerEdge R720 servers with Intel Xeon E5 processors, with Linux (no Windows Server 2012 R2 support yet Sad smile) These servers each used 350GB PCI-Express flash cards (“only” PCI-Express 2.0 capable by the way). This cluster, using a Compellent SAN, managed to get a result of more than 5 million IOPS at 6 millisecond response times, delivering 12,000 tps for 14,000 client connections. This was read only. If they dropped the Fluid Cache for SAN they  could “only” achieve 2,000 clients (6 times less clients due to 4 time less transactions and 99% slower responses). See this movie for more info: http://www.youtube.com/watch?v=uw7UHWWAtig and watch the keynote from Dell World 2013 here

clip_image003

Where would I use this?

Cost will determine use cases and this is unknown for now. We can only look at what Fluid Cache for DAS cost right now and speculate. I for one hope/bet on the fact that DELL won’t price itself out of the marked (they have a lot of competition from big & smalls players in a “good enough is good enough” world with a cloud mindset all around). So make it too expensive and we might be happy with “just” 500.000 IOPS at much less cost. It’s a fine line. Price it right, support it well and you might win the bulk of sales in the storage wars. Based on the DAS solution we’re looking at least 8000 $ per server (license is 3500 for DAS => see http://www.theregister.co.uk/2013/03/05/dell_fluid_cache_server_acceleration/  + cost of PCI-Express flash module (> 5000$ => see http://en.community.dell.com/techcenter/extras/chats/w/wiki/4480.3-5-2013-techchat-fluid-cache-for-das.aspx) &  yearly maintenance fee. Then we need to factor in the cost of the RDMA/RoCE capable NICs & the (dedicated) Force10 switches – 2 for redundancy  that are at least 10Gbps (S4810?) or probably 40Gbps (S6000?) & cabling. So this is not a cheap solution and you won’t just “throw it in” on a quiet afternoon to see what it does for you. Not that there will be a DIY “throw it on kit” I think, it’s a step above plug and play. If they keep it affordable and do some other things for Windows Server 201 R2 / Hyper-V they can be the absolute number one SAN vendor for any Microsoft customer. But that’s another blog topic.

Cost is indeed something that might make it a show stopper for us. I just can’t tell yet. One of the key factors is that if affordable it could give the point solutions we now see pop up more and more in storage. a run for their money. While cheap and workable in good enough is good enough scenarios it takes some of the centrally shared storage advantages away. But if we ever do a state full VDI project in an environment with high end physical desktops (500GB or more local storage, SSD disks, 8 core CPU, 8-32GB DDR3, dual or more screens) that run ArcGis, AutoCad, Visual Studio, SQL Server, Outlook with 5GB mailboxes, large documents & huge files (images) this might be the enabler we need to make VDI happen & works as desired with current all-purpose Compellent SANs. IIf the price is right it could enable VDI in now “NO GO” scenarios.  And those are plentiful, … Another use case I see is a virtualized SQL Server environment on Hyper-V with general purpose shared storage. We’re doing very well but the day might arrive that we need those IOPS in order to take it even further. Don’t laugh but realize how much IOPS an SSD delivers to a workstation today and that’s what your users expect & demand. Want to fail at VDI? Have it outperformed by a 4 year old physical PC where you slapped an SSD into.

Could it help in keeping excessive IOPS away from the SAN, making that capable of doing more over a longer life time? In other words can it play a part in the Storage QoS issue across server/cluster/storage system issue for non workload aware storage solutions?

So I might have some homework to do. For our next SQL Server cluster we’ll look at the next generation of servers & start counting our PCI Express slots. We now already consume 4 PCI-Express slots for 2*FC & 2*Dual Port 10Gbps) in our Hyper-V design. That’s another discussion, but they are built purposely for performance under any condition & to be highly redundant. A health check / improvement track by Microsoft for our SQL Server environment has proven this to be an outstanding setup (nice e-mail to see your bosses get by the way). I digress, free PCI-Slots should not be an issue, as we also don’t need the FC cards in the Fluid Cache Nodes. The storage IO uses the RoCE network, to which the Compellent SAN attaches.

Cost is very important in determining if we’ll ever get to deploy it. The cloud is here, and while that is far from cheap either, it’s a lot easier to sell than internal IT for various reasons. That’s just how the powers that be roll right now & how things are.

What we’ll get in our hands

There was a lot of love between Dell & Samsung at Dell World. Talking to Dell at the server/storage/networking boots I understood that Samsung is going to produce flash modules for this that support PCI-Express 3.0 and the industry backed NVM Express host interface for solid state drives which will reduce latency with 1/3 compared to now. As it seems they will produce higher capacity cards than what was used in the demos (800 GB and 1.6 TB). So capacity will increase & latency will drop even more. They leverage the Force10 10Gpbs or 40Gbps switches for the RoCE network. As Dell & Mellanox are cooperating heavily (Mellanox Collaborates with Dell to Deliver 10/40GbE Solution for Mainstream Servers and Networking Solutions) my bet is on Mellanox for the cards. Broadcom is not there yet for it to happen in time and Intel has no RoCE cards afaik. They seem to be playing the waiting game before they jump in.

Magic Ball Time, Speculation & Questions.

I’m not a DELL Server / Storage designer or architect, and those that are don’t tell me to plaster it all over the internet, so this really is magic ball time …

image

I’ll show my ignorance on what Samsung does under the hood when I hear that the next generation of DELL servers can have 6TB of RAM I can only speculate that with the advent of DDR4 in servers & ever dropping cost the path is open to leverage NV-RAM disk for the read/write cache in Fluid Cache for SAN as well a bit like what IDT did http://us.generation-nt.com/idt-announces-world-first-pci-express-gen-3-nvme-nv-dram-press-3732872.html. The persistence comes from writing the DRAM content to NAND at shutdown, can we do that fast enough at 1.6 TB sized caches? Can we fit enough of  those modules on a card? What would that do for IOPS & latency? Does that even make sense at this moment in time?

What if we could leverage the DDR4 dims in the server itself? This would perhaps cut costs and also save us some valuable PCI-Express 3.0 slots for our 10Gbps or better addiction Smile. Sure there is no persistence than but the content is distributed redundantly over the cluster anyway? Is that safe enough to make it feasible? What if we need to shut down the cluster? I guess it’s not that easy and perhaps we just need to make sure future motherboards have 8 or more PC-Express 3.0 slots & not worry about that. Or move to 40/100Gbps & have less need for NICs. Yeah that’s what was said of 10Gbps in the early days …

Support for Windows?

While it’s not there yet I have absolutely no doubt that they will bring it to Windows Server 2012 R2 and higher. Well Windows is a huge on premise market for native workloads like SQL Server, VDI and Hyper-V. The number of sales opportunities in the Microsoft ecosystem is growing (despite cloud) while others are stagnant or dropping. On top of that the low cost of Hyper-V leaves money to be spent on Fluid Cache for SAN. As Dell is in business to make money, they will not leave that big chunk of cash on the table.

When can we get our hands on this technology?

Timing wise that will be early to late Q2 in 2014, which is my best guestimate. Interesting times people, interesting times