A Reality Check on Disaster Recovery & Business Continuity

Introduction

Another blog post in “The Dilbert Life Series®” for those who don’t take everything personally. Every time business types start talking about business continuity, call it experience or cynicism, my bullshit & assumption sensors go into high alert mode. They tend to spend a certain (sometimes considerable) amount of money on connectivity, storage, CPUs at a remote site and 2,000 pages of documentation, and think that covers just about anything they’ll need. They’ll then ask you when the automatic or five-minute failover to the secondary site will be up and running. That’s when the time has come to subdue all those inflated expectations and reduce the expectation gap between business and IT as much as possible. It should never have come to that in the first place. But in this matter business people & analysts alike often read (or are fed) marchitecture documents and sales brochures that make it all sound very easy and quickly accomplished. They sometimes think that the good old IT department is saying “no” again just because they are negative people who aren’t team players and lack the necessary “can do” attitude in a world where their technology castle is falling down. Well, sorry to burst the bubble, but that’s not it. The world isn’t quite that black and white. You see, the techies have to make it work and they’re the ones who have to deal with reality. Combine the above with a weak and rather incompetent IT manager bending over to the business (i.e. promising them heaven on earth) to stay in their good graces, and a rude awakening becomes a certainty. Not that the realities are all that bad. Far from it, but the expectations can be so high and unrealistic that disappointment is unavoidable.

The typical flow of things

The business is under pressure from peers, top management, government & regulators to pay attention to disaster recovery. This inevitably leads to an interest in business continuity. Why? Well, we’re in a 24/7 economy, and your consumer right to buy a new coffee table online at 03:00 AM on a Sunday night is worth some effort. So if we can do it for furniture, we should certainly have it for more critical services. The business will hear about possible (technology) solutions and would like to see them implemented. Why wouldn’t they? It all sounds effective and logical. So why aren’t we all running off and doing it? Is it because IT is a bunch of lazy geeks playing FPS games online rather than working for their mythically high salaries? How hard can it be? It’s all over the press that IT is a commodity: easy, fast, dynamic and consumer driven, so “we” the consumers want our business continuity now! But hey, it costs money, time and a considerable, sustained effort, and we have to deal with less than optimal legacy applications (90% of what you’re running right now).

Realities & 24/7 standby personnel

The acronyms & buzzwords the business comes up with after attending some tech briefing by vendors Y & Z (those are a bit like infomercials, but without the limited value those might have 😏) can be quite entertaining. You could say these people at least pay attention to the consumerized business types. Well, actually they don’t, but they do smell money, and lots of it. Technically they are not lying. In a perfect world things might work like that … sort of, sometimes, and maybe even when you need it. But will it really work well and reliably? Sure, that’s not the vendor’s fault. He can’t help it that the cool “jump off a cliff” boots he sold you got you killed. Yes, they are designed for jumping off a cliff, but anything above one meter without other precautions and technologies might cause bodily harm or even death. Gravity and its effects, in combination with the complexity of your business, are beyond the scope of their product solutions and are entirely your responsibility. Will you be able to cover all those aspects?

Also, don’t forget the people factor. Do you have the right people & skill sets at your disposal 24/7 for that moment when disaster strikes? Remember, that could be on a hot summer night during a weekend, when they are enjoying a few glasses of wine at a BBQ party, and not at 10:15 AM on a Tuesday morning.

So what terminology flies around?

They hear about asynchronous or even synchronous replication of storage or applications. Sure, it can work within a data center, depending on how well it is designed and set up. It can even work between data centers, especially for applications like Exchange 2010. But let’s face it, the technical limitations and the lack of support for this in many legacy applications will hinder this considerably.

They hear of things like stretched clusters and synchronous storage replication. Sure, they’ll sell you all kinds of licensed features to make this work at the storage level, with a lot of small print. Sometimes even at the cost of losing the functionality that made the storage interesting in the first place. At the network level, anything below layer 3 probably suffers from too much optimism. Sure, stretched subnets seem nice, but … how reliable are these solutions in real life?

Consider the latency and less reliable connectivity. You can and will lose the link once in a while. With active-active or active-passive data centers that depend on each other, both become single points of failure. And then there are all the scenarios where only one part of the entire technology stack that makes everything work fails. What if the application clustering survives but the network, the storage or the database doesn’t? You’re toast anyway. Even worse, what if you get into a split-brain scenario and have two sides writing data? Recover from that one, my friend; there’s no merge process for that, only data recovery. What about live migration or vMotion (state, storage, shared nothing) across data centers to avoid an impending disaster? That’s a pipe dream at the moment, people. How long can you afford this to take, even if your link is 99.999% reliable? Chances are that in a crisis things need to happen fast to avoid disaster, and guess what: even in the same data center, during normal routine operations, we’re leveraging <1ms latency 10Gbps pipes for this. Are we going to get solutions that are affordable and robust? Yes, and I think the hypervisor vendors will help push the entire industry forward when I see what is happening in that space, but we’re not in Valhalla yet.
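To make the split-brain point concrete, here is a minimal sketch (my own illustration, not any vendor’s implementation) of why two interdependent sites without a third-party witness can both end up writing, while a strict majority-quorum rule keeps at most one side active:

```python
# Minimal sketch (illustrative only): a strict majority quorum decides
# which partition may keep accepting writes after a link failure.

def may_stay_active(votes_seen: int, total_votes: int) -> bool:
    """A node may keep accepting writes only if it sees a strict majority."""
    return votes_seen * 2 > total_votes

# Two sites, no witness (2 votes total): when the link drops, each side
# still sees only its own vote. Neither has a majority, so a strict
# quorum rule stops BOTH sides -- while a naive "my peer is gone, so I
# must be primary" policy would activate both and split the brain.
assert may_stay_active(1, 2) is False

# Two sites plus a witness (3 votes total): only the partition that can
# still reach the witness sees 2 of 3 votes and keeps writing.
assert may_stay_active(2, 3) is True   # site A: itself + the witness
assert may_stay_active(1, 3) is False  # site B: itself only
```

The trade-off is visible in the first assertion: without a tiebreaker, a safe rule halts both sides, which is exactly why two data centers that depend on each other become single points of failure.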

Our client server application has high availability capabilities

There are those “robust and highly available application architectures” (ahem) that only hold true if nothing ever goes wrong or happens to the rest of the universe. “Disasters” such as the server hosting the license dongle being rebooted for patching. Or, heaven forbid, your TCP/IP connection dropping some packets due to high-volume traffic. No, we can’t do QoS at the individual application level, and even if we could it wouldn’t help. If your line-of-business software can’t handle a WAN link without serious performance impact or errors due to a dropped packet, it was probably written and tested on <1ms latency networks against a database with only one active connection. It wasn’t designed, it was merely written. It’s not because software runs on an OS that can be made highly available and uses a database that can be clustered that the application has any high availability, let alone business continuity capabilities. Why would that application be happy switching over to another link? A link that is possibly further away, running on fewer resources and quite possibly against less capable storage? For your apps to work acceptably in such scenarios you would already have to redesign them.

You must also realize that a lot of acquired and home-grown software has IP addresses in configuration files instead of DNS names. Some even have IP addresses in code. Some abuse local hosts files to deal with hard-coded DNS names … There are tons of very bad practices out there running in production. And you want business continuity for that? Not just disaster recovery, to be clear, but business continuity, preferably without dropping one beat. Ever done any real software and infrastructure engineering in your lifetime? Keeping a business running often looks like a MacGyver episode. Lots of creativity, ingenuity, super glue, wire, duct tape and a Swiss army knife or multitool. This is still true today; it doesn’t sound cool to admit to it, but it needs to be said.
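If you want to know how deep that particular hole goes in your own shop, a quick audit helps. This is a rough sketch of my own (the file patterns and the helper name are purely illustrative, not a product): it walks a directory tree and flags hard-coded IPv4 literals in config files, exactly the landmines that blow up a failover to a different subnet:

```python
# Quick-and-dirty audit sketch (illustrative only): find hard-coded IPv4
# addresses in configuration files, so each hit can become a conversation
# about whether it should be a DNS name instead.
import re
from pathlib import Path

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_hardcoded_ips(root, patterns=("*.config", "*.ini", "*.xml")):
    """Return (path, line number, ip) for every IPv4 literal found."""
    hits = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue  # unreadable file, skip it
            for lineno, line in enumerate(text.splitlines(), 1):
                for ip in IPV4.findall(line):
                    hits.append((str(path), lineno, ip))
    return hits
```

It won’t catch addresses baked into compiled code, of course; for those you’re stuck grepping binaries or reading sources, which is rather the point of the rant above.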

We can make this work with the right methodologies and strict processes

Next time you think that, go to the top floor and jump off, adhering to the flight methodologies and strict processes that rule aerodynamics. After the loud thud of you hitting the deck, you’ll be nothing more than a pool of human waste. You cannot fly. On top of unrealistic scenarios, things change so fast that documentation and procedures are very often out of date as soon as they are written.

Next time some “consultants” drop in selling you products & processes with fancy acronyms, proclaiming that rigorous adherence to these will save the day, consider the following. They make a bold assumption, given that they don’t know even 10% of the apps and processes in your company. Even bolder because they ignore the fact that what they discover in interviews often barely scratches the surface. People can only tell you what they actually know or dare tell you. On top of that, any discovery they do with tools is rather incomplete. If the job consists of merely pushing processes and methodologies around without reality checks, you could be in for a big surprise. You need a holistic approach here, otherwise it’s make-believe. It’s a bit like paratrooper training for night drops over enemy strongholds, to attack those and bring ’em down. Only the training is done in a heated classroom during meetings and on a computer. They never put on all their gear, let alone jump out of an aircraft in the dead of night, regroup, hump all that gear to the rally points and engage the enemy in a training exercise. Well people, you’ll never be able to pull off business continuity in real life either if you don’t design and test properly and keep doing that. It’s fantasy land. Even in the best of circumstances no plan survives its first contact with the enemy, and basically you would be doing the equivalent of a trooper firing his rifle for the very first time at night during a real engagement. That’s assuming you didn’t break your neck during the drop, didn’t get lost, and managed to load the darn thing in the first place.

You’re a pain in the proverbial ass to work with

Am I being too negative? No, I’m being realistic. I know reality is a very unwelcome guest in fantasy land, as it tends to disturb the feel-good factor. Those pesky details are not just silly technological “manual labor” issues, people. They’ll kill your shiny plans and waste tremendous amounts of money and time.

We can have mission critical applications protected and provide both disaster recovery and business continuity. For that, the entire solution stack needs to be designed for it. While possible, this makes things expensive and is often only a dream for custom-written and a lot of off-the-shelf software. If you need business continuity, the applications need to be designed and written for it. If not, all the money and creativity in the world cannot guarantee you anything. At best they produce ugly and very expensive hacks around cheap, not highly available software that poses as “mission critical”.

Conclusion

Seriously people, business continuity can be a very costly and complex subject. You’ll need to think this through. When making assumptions, realize that you cannot go forward without confirming them. We operate by the mantra “assumptions are the mother of all fuckups”, which is nothing more than the age-old “trust but verify” in action. There are many things you can do for disaster recovery and business continuity. Do them with insight, know what you are getting into, and maybe forget about doing it without one second of interruption for your entire business.

Let’s say disaster strikes and the primary data center is destroyed. If you can restart and get running again with only a limited amount of work and productivity lost, you’re doing very well. Being down for only a couple of hours or days or even a week, will make you one of the top performers. Really! Try to get there first before thinking about continuous availability via disaster avoidance and automatic autonomous failovers.
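To put some (made-up, purely illustrative) numbers on that: RPO is how much data you can afford to lose, RTO is how long you can afford to be down, and with asynchronous replication the worst-case data loss delta is roughly one replication interval. A back-of-envelope sketch:

```python
# Back-of-envelope sketch (all numbers invented for illustration):
# check a simple DR design against recovery objectives.

def worst_case_data_loss_minutes(replication_interval_min: float) -> float:
    """With async replication, the last unsynced interval is lost."""
    return replication_interval_min

def meets_objectives(rpo_min, rto_min, replication_interval_min,
                     restart_time_min):
    """True if worst-case data loss fits the RPO and restart fits the RTO."""
    data_loss = worst_case_data_loss_minutes(replication_interval_min)
    return data_loss <= rpo_min and restart_time_min <= rto_min

# Replicating every 15 minutes and restarting elsewhere in 4 hours
# satisfies an RPO of 1 hour and an RTO of 8 hours -- a perfectly
# respectable result for most businesses.
assert meets_objectives(rpo_min=60, rto_min=480,
                        replication_interval_min=15, restart_time_min=240)

# It does not satisfy a "not one second of interruption" fantasy target.
assert not meets_objectives(rpo_min=0, rto_min=0,
                            replication_interval_min=15, restart_time_min=240)
```

The arithmetic is trivial on purpose: the hard part is getting the business to sign off on honest values for those four numbers before a disaster, not after.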

One approach to achieve this is what I call “Pandora’s Box”. If a company wants business continuity for its entire stack of operations, you’ll have to leave that box closed and replicate it entirely to another site. When you’re hit with a major, long-lasting disaster, you eat the downtime and the loss of a certain delta, and fire up the entire box in another location. That way you avoid trying to micromanage its content. You’ll fail at that anyway. For short-term disasters you have to eat the downtime. Deciding when to fail over is a hard decision. Also, don’t forget about the process in reverse order. That’s another part of the ball game.

It’s sad to see that more money is spent on consultants & advisers daydreaming than on realistic planning and mitigation. If you want to know why this is allowed to happen, there’s always my series on the do’s and don’ts when engaging consultants, Part I and Part II. FYI, the last guru I saw brought into a shop was “convinced” he could open Pandora’s Box and remain in control. He has left the building by now, and it wasn’t a pretty sight, but that’s another story.

The Right Stuff

You all probably know that to get a difficult job done well and fast, you need the right people in the right place at the right moment in time. Those people also need the right tools. This requires people who can think on their feet, people who are resourceful and who will always seek and find opportunities under adverse conditions.

The placement and timing of these resources and assets is more than just managing a table matching names to roles. It’s not enough to have the right resources and skill sets. You need to know who and what is available and what they can contribute. Management is often not very good at this, so in a crisis they need to let go and rely on their people. Free tip: you can’t start building a team when the crisis arrives 😉

It’s the boots on the ground that will have to deal with the issues at hand and take the decisions. In a crisis, time is of the essence. There is no place for too many layers of management, let alone micromanagement; only the ones with the right responsibilities, insight and knowledge are needed and helpful. The decisions become tactical and operational within the context of the situation at hand and in its relation to the entire environment. So they have to be made by people who preferably know the environment well and have a very good skill set, drive and motivation. Basically, this is what I refer to when I talk about the right stuff.

If you have ever worked or work in that sort of environment, you know what I’m talking about. The knowledge that no matter where you are going, for whatever reason, you’re doing so with a team of very skilled people who are the very best in the business, at the top of their game and ready to roll with any situation thrown at them. They are capable of reacting at a moment’s notice and focusing entirely on the job at hand. If you’re interested in building such a team, I suggest you select your team members very carefully. Head count doesn’t mean jack shit if they are the wrong people for the job and the team. Don’t ever lower the bar; it’s there for a very good reason.

This year I had the misfortune of having to respond to two major HVAC disasters, at night, on a weekend. I had the good fortune of having the right stuff at my disposal. There is no “on call”, there is no monetary compensation. This team is my crew and they are all volunteers who will do what is needed when it is needed. Why? Because they have professional pride and know that at these moments the very survival of the business they work for depends on them acting fast and correctly. To them it’s not about “somebody should do this” or “that’s not my job”. It’s not about “this should be taken care of” or “I never had a template telling me what to do”. No, they step forward and get it done. This weekend, from the very first alert, four people were mobilized in 30 minutes and acted at the speed of light. This led to the emergency shutdown of a data center in a city 60 kilometers away to prevent a catastrophic meltdown of millions of euros in hardware (not even trying to put a value on the data loss). Two people were acting remotely and two (including me) were heading over there to have boots on the ground. The reason for this is that “the away team” could deal with anything that couldn’t be handled remotely and coordinate with facility management. Having people on site is important to all involved (two is preferable for safety reasons) for assessing the situation and for the sake of speed. More people often means less efficiency, as numbers are not the same as capability.

So to my team: I’m proud of you. To quote Beckwith, “I’d rather go down the river with 7 studs than with 100 shitheads”. You all know you’ve got the right stuff. Be proud of that!

Data Protection & Disaster Recovery in Windows 8 Server Hyper-V 3.0

The news coming in from the Build Windows conference is awesome. The speculation of the last months is being validated by what is being told and on top of that more goodness is thrown at us Hyper-V techies.

On the data protection and disaster recovery front some great new weapons are at our disposal. Let’s take a look at some of them.

Live Migration & Storage Live Migration

Among the goodies are the improvements in Live Migration and the introduction of Storage Live Migration. Hyper-V 3.0 now supports multiple concurrent Live Migrations, which, combined with adequate bandwidth, will provide for fast evacuation of problematic hosts. Storage Live Migration means you can move a VM (configuration, VHD & snapshots) to different storage while the guest remains online, so users are not hindered by this. I’m trying to find out if they will support multiple networks / NICs with this.

Now, to make this shine even more, MSFT has another ace up its sleeve. You can do Live Migration and Storage Live Migration without the requirement of shared storage on the back end. This combination is a big one: it means “shared nothing” high availability. Even now, when prices for entry-level shared storage have plummeted, we see SMBs being wary of SAN technology. It’s foreign to them, and the fact that they haven’t yet gained any confidence with the technology makes them hesitant. The real or perceived complexity might also hold ’em back. For that segment of the market it is now possible to have high availability anyway with the Live Migration / Storage Live Migration combo. Add to this that Hyper-V now supports running virtual machines on a file share, and you can see the possibilities of NAS appliances in this segment of the market for achieving some very nice solutions.

Replication to complete the picture

To top this off, you have replication built in, meaning we have the possibility to provide reasonably fast disaster recovery. It might not be real-time data center failover, but a lot of clients don’t need that. However, they do need easy recoverability, and here it is. To give you even more options, especially if you only have one location, you can replicate to the cloud.

So now I start dreaming 🙂 We have shared nothing Live & Storage Live Migration, and we have replication. What could we achieve with this? Do synchronous replication locally over a 10Gbps link, for example, and use that to build something like continuous availability. There we go, we already have requirements for “Windows 8 Server R2”!

NIC Teaming in the OS

No more worries about third-party NIC teaming woes. It has arrived in the OS (finally!) and it will support load balancing & failover. I welcome this; again, it makes things a lot more feasible for the SMB shops.

IP Virtualization / Address Mobility

Another thing that will aid with any kind of off-site disaster recovery / high availability is IP address mobility. You have an IP for the hosting of the VM and one for internal use by the VM. That means you can migrate to other environments (cloud, remote site) with other addresses, as the VM can change the hosted IP address while the internal IP address remains the same. Just imagine the flexibility this gives us during maintenance, recovery and troubleshooting of network infrastructure issues, and all this without impacting the users who depend on the VM to get their job done.

Conclusion

Everything we described is out of the box with Windows 8 Server Hyper-V. To a lot of businesses this can mean a huge improvement in their current availability and disaster recovery situation. More than ever, there is no longer a reason for any company to go down or even out of business due to catastrophic data loss, as all this technology is available on site, in hybrid scenarios and in the cloud with the providers.

BriForum 2011 Europe Here I Come

As you might have read in a previous blog post and noticed in the sidebar, I’m off to London (UK) to attend BriForum 2011 Europe. It’s time to get away from the wide screens overlooking my ICT empire toys and broaden my horizons. For those who think the cloud is going to take away your job … think again, I’m getting busier than ever. The reality is that we just can’t push a button and have everything up and running in the cloud. Greenfield projects and startups might beat existing infrastructure & application architectures over the head with cloud and make those businesses run harder for their money, but they will run and compete. That race will produce a huge workload.

So I’m off to dive into some sessions on cloud, server & application virtualization, VDI … should make for some interesting days. I hope to be able to talk to lots of people with a variety of experiences to help find new or alternate ways to address some issues (or challenges) we need to tackle in the years ahead. Subjects like disaster recovery, business continuity, application-aware storage in a virtualized environment, geo-clustering, site recovery, … should give us ample to discuss. Give us a shout if you’re there. It’s also a nice opportunity to meet up with some fellow bloggers and Twitter acquaintances.

A colleague of mine is heading to Atlanta, USA to attend TechEd 2011 USA. So he’s crossing the big pond to get some brand new info on the latest and greatest in Microsoft technologies on the IT Pro side of the business.

So off to London I go, onwards & always forward in IT, as there is no turning back. I’ll keep you posted when I find the time to do so.