You've just spent the last two months on a marathon of a project. You haven't gotten good sleep in weeks. Your team finally made it to production two days ago! You had your doubts. The test environment had experienced occasional hiccups and moments of lagging performance, and a couple of servers had suffered unexplained, "un-reproducible" errors.
You'd expressed some concerns about stability. Those concerns were brushed off. "It's just a configuration issue. The production servers are built for performance. It's environmental. Servers rarely go down in production." You'd heard it all before. The business needed the features. Delaying the release was not an option (management had performance incentives based on the release date). Everything would be just fine.
For two days, everyone held their breath. You're still trying to catch up from the last two months of sleep-deprived nights. It's 3 a.m., and your phone is blowing up. Servers are down in production, and the customer website is completely unavailable. Within 30 minutes of rebooting the troubled servers, their CPUs spike to 100% and they become completely unresponsive, causing a cascading failure across the numerous other services that depend on them. You wipe the sleep from your eyes, put on your favorite fuzzy gray slippers, and head to your office (your home office, which is nothing more than a small desk, a phone, and an internet connection in the darkest corner of your storage closet).
You dial in. The call is abuzz with frenzied chatter. Nobody has any clue about the root cause. An emergency is declared! The entire production release is rolled back, along with the numerous dependent systems that need to be rolled back with it. It's an incredible effort requiring many people.
Management is shocked! How could this happen again? This had just happened three months ago. But it had been fixed: a resiliency checklist had been created, and development teams were required to fill it out every release. This would ensure they incorporated resiliency into their designs. A new process created, another problem solved. End of story! Or so we thought.
Getting software to production is no easy task. Apparently, keeping software in production isn't always so easy either! While the reasons for failure are endless and varied, I'd like to discuss a few of them in the sections below.
While you'd think this should be apparent, I'm continually amazed by the surprisingly large number of developers and software leaders who subscribe to this rather unscientific theory and practice.
The act of adopting a new process doesn't mean that progress has actually been made. For a process to result in progress, it needs to make people want to fundamentally change the way they behave. Creating yet another checklist doesn't solve a problem; it just creates more overhead.
So often, well-intentioned folks try to fix problems, but instead of attacking the problem, they attack the symptoms. To many, it seemed the problem was that developers weren't thinking enough about resiliency. The real problem, however, was that the development processes weren't designed in a way that caused the software engineers to think about resiliency.
I recently attended a wonderful talk by Casey Rosenthal, a software engineering manager at Netflix. Special thanks to him for the insightful talk, and to David Hussman of DevJam for providing the meeting space and getting the event organized.
Casey was talking about Chaos Monkey, a software tool developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services deployments. It simulates service failures by shutting down virtual machines running within Auto Scaling Groups.
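To make the idea concrete, here's a minimal toy sketch of what Chaos Monkey does conceptually. This is not Netflix's actual implementation (which targets real cloud instances); the class and function names are my own illustrative assumptions, with the auto-scaling group reduced to an in-memory list:

```python
import random

class AutoScalingGroup:
    """Toy stand-in for a cloud auto-scaling group (illustrative only)."""
    def __init__(self, name, desired_capacity):
        self.name = name
        self.desired_capacity = desired_capacity
        self.instances = [f"{name}-i-{n}" for n in range(desired_capacity)]

    def heal(self):
        # A real auto-scaler launches replacements until the group
        # is back at its desired capacity.
        while len(self.instances) < self.desired_capacity:
            self.instances.append(f"{self.name}-i-new{len(self.instances)}")

def chaos_strike(group, rng=random):
    """Terminate one randomly chosen instance, as Chaos Monkey would."""
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim

group = AutoScalingGroup("web", desired_capacity=3)
killed = chaos_strike(group)
group.heal()  # a resilient service recovers on its own
print(f"killed {killed}; capacity restored to {len(group.instances)}")
```

The point of the exercise is the `heal` step: the experiment only "passes" if the group restores itself to full capacity without a human intervening.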
At Netflix, the development teams that own a service are also on-call for production failures of that service. This has a couple of beneficial side effects.
Benefit #1 - The people who support the service in production are the ones intimately familiar with it. (At many organizations I've been a part of, the production support teams call in the developers only as a last resort.)
Benefit #2 - It creates a strong incentive for the development team to build resiliency into their service, because they'll be the ones taking the 3 a.m. phone call when the service goes down and can't recover on its own.
Casey mentioned that part of his team's mission involves randomly killing off services in production during business hours to verify that each service knows how to recover from the failure. There are a few benefits to this strategy.
Benefit #1 - Since the production failures are being caused during business hours, developers are already in the office instead of taking the call at 3 a.m.
Benefit #2 - Hoping that your service won't go down is no longer realistic. Developers are assured that any and every given service will fail.
Benefit #3 - Developers are conditioned to think about resiliency proactively instead of reactively. They know for a fact that their services will go down at some point, because of the continuous service "failures" injected randomly into the production environment.
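The "business hours only" discipline above can be sketched as a simple scheduling guard. This is an assumption about how such a scheduler might be written, not a description of Netflix's tooling; the function name and the 9-to-5 weekday window are mine:

```python
from datetime import datetime

def in_business_hours(now=None):
    """Return True only on weekdays between 9:00 and 17:00 local time.

    A chaos scheduler can consult a guard like this so that induced
    failures happen while engineers are at their desks, not at 3 a.m.
    (Illustrative sketch; the hours chosen here are an assumption.)
    """
    now = now or datetime.now()
    return now.weekday() < 5 and 9 <= now.hour < 17

# Only unleash chaos when the team is around to respond.
if in_business_hours(datetime(2024, 3, 6, 10, 30)):  # a Wednesday morning
    print("chaos experiment allowed")
```

In practice you'd also want an opt-out list and a kill switch, but the core idea is just this: never schedule an experiment at a time when nobody is awake to learn from it.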
Netflix has created an environment where resiliency is built into the process from the beginning instead of bolted on as an afterthought once development is finished. Companies looking to improve their resiliency processes should examine the Netflix example and incorporate these lessons into their own resiliency strategies.
Are you interested in increasing the efficiency of your software development and deployment processes? Learn the practices large software companies use to produce reliable software systems with DevOps and site reliability engineering in this article.