Developer Exchange Blog
Failure. On Purpose? Great idea!
Have you ever experienced a situation where you needed to recover from a failure, only to find that, despite your backup and continuity plans, the restoration you expected simply wasn't possible? I've seen this many times over the years. How could it be avoided?
You may or may not be aware of this, but Netflix has a Chaos Monkey! Its job is to randomly wreak havoc on their operation by killing instances of services in their architecture. And, yes, we're talking about production services.
Imagine that in your enterprise. Every day you'd know one thing for sure: something is going to break today. Can you imagine the tension of heading into work, knowing that some random failure is going to cause problems, possibly at the worst moment? In a business-critical environment, this would surely make many people dread coming to work.
How would you deal with this? What would you do?
Thinking about it systematically, the first thing I would do from an operations standpoint is build solid monitoring and diagnostic capabilities to keep me aware, at all times, of the health of my systems. I'd want to know immediately if something stopped or reached a critical condition (e.g., running out of storage, processors maxed out). Then I'd develop a troubleshooting roadmap that divides and conquers enterprise services, so I could isolate where problems were and weren't. Finally, I'd have very clear procedures for restarting machines and services once issues are identified. A common-sense approach, right?
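To make that first step concrete, here's a minimal monitoring sketch in Python. Everything in it is illustrative: the 90% disk threshold, the check names, and the stand-in service probes are assumptions, not anything Netflix or the post prescribes. The idea is simply a loop of named health checks that reports which ones are unhealthy.

```python
import shutil

# Illustrative threshold, not a value from the post: alert past 90% full.
DISK_USAGE_LIMIT = 0.90

def check_disk(path="/"):
    """Return (healthy, detail) for the volume containing `path`."""
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    return fraction_used < DISK_USAGE_LIMIT, f"{fraction_used:.0%} used"

def check_service(ping):
    """Probe one dependency; `ping` is any zero-argument callable that
    raises on failure (an HTTP health endpoint, a DB ping, etc.)."""
    try:
        ping()
        return True, "ok"
    except Exception as exc:
        return False, str(exc)

def run_checks(checks):
    """Run every named check and return only the unhealthy ones."""
    failures = {}
    for name, check in checks.items():
        healthy, detail = check()
        if not healthy:
            failures[name] = detail
    return failures

# Example run: one real disk check plus two hypothetical service probes.
checks = {
    "disk /": check_disk,
    "billing-db": lambda: check_service(lambda: None),   # pretend: always up
    "mail-relay": lambda: check_service(lambda: 1 / 0),  # pretend: always down
}
print(run_checks(checks))
```

In a real deployment, `run_checks` would run on a schedule and page someone instead of printing, but the divide-and-conquer shape is the same: named checks, one verdict each, failures isolated by name.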
Now think about this from a developer's perspective. Whether you have a service-oriented architecture, a more traditional one, or, like most, a nasty blend, as an architect or developer you'd worry about what would happen to your processes and transactions if the systems and services you depend on were frequently unavailable. You'd be wise to build in independence everywhere you could, and to handle the unavailability of critical components and services as gracefully as possible, shielding the customer or user from the transient failures you know and expect to happen. Systems and services would become more and more autonomous.
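One common way to shield users from transient failures is to retry with backoff and then fall back to a degraded answer. The sketch below is an assumption about how you might do this, not anything from the post; the function name, the retry counts, and the "recommendations" example are all invented for illustration.

```python
import random
import time

def call_with_fallback(operation, fallback, attempts=3, base_delay=0.1):
    """Try `operation` a few times with jittered exponential backoff;
    if every attempt fails, return `fallback()` rather than surface
    the transient error to the user. Defaults are illustrative."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                break
            # Back off before retrying; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return fallback()

# A pretend flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky_recommendations():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return ["personalized", "titles"]

result = call_with_fallback(flaky_recommendations,
                            fallback=lambda: ["generic", "top-10"])
print(result)  # recovers on the third attempt without bothering the user

# If the dependency never comes back, the user still gets *something*:
degraded = call_with_fallback(lambda: 1 / 0, fallback=lambda: ["generic", "top-10"])
print(degraded)
```

The design choice here is the one the paragraph argues for: the caller stays independent of the dependency's momentary health, and the failure mode is a degraded experience rather than an error page.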
The truth of the matter is that these are things we should be doing all the time, because, whether you acknowledge it or not, the Chaos Monkey can strike anyone, whether he's intentionally allowed to run amok or just sneaks in without permission.
This is not just true for software services. It's equally true for restoring data from backups and for recovering from hardware failures: disk drives, network devices and routes, power supplies, entire machines, and more.
Here's a quote from a blog post by Jeff Atwood about this counterintuitive approach to high service availability:
"When you work with the Chaos Monkey, you quickly learn that everything happens for a reason. Except for those things which happen randomly. And that's why, even though it sounds crazy, the best way to avoid failure is to fail constantly."
Netflix came to this realization as they began to rely on Amazon Web Services (AWS), which meant that they had dependencies on environments that they couldn't necessarily control. At the end of the day, their approach to robustness has made their systems better for it. So, apparently failure on purpose really is a good idea.