You've just spent the last two months on a marathon of a project. You haven't gotten good sleep in weeks. Your team finally made it to production two days ago! You had your doubts. The test environment had experienced occasional hiccups and moments of lagging performance, and a couple of servers had suffered unexplained, "un-reproducible" errors.
You'd expressed some concerns about stability. Those concerns were brushed off. "It's just a configuration issue. The production servers are built for performance. It's environmental. Servers rarely go down in production." You'd heard it all before. The business needed the features. Delaying the release was not an option (management had performance incentives based on the release date). Everything would be just fine.
For two days, everyone held their breath. You're still trying to catch up from the last two months of sleep-deprived nights. It's 3 a.m., and your phone is blowing up. Servers are down in production, and the customer website is completely unavailable. Within 30 minutes of rebooting the troubled servers, their CPUs spike to 100% and they become completely unresponsive, causing a cascading failure across the numerous other services that depend on them. You wipe the sleep from your eyes, put on your favorite fuzzy gray slippers, and head to your office (your home office, which is nothing more than a small desk, a phone, and an internet connection in the darkest corner of your storage closet).
You dial in. The call is abuzz with frenzied chatter. Nobody has any clue about the root cause. An emergency is declared! The entire production release is rolled back, along with the numerous dependent systems that need to be rolled back with it. It's an incredible effort requiring many people.
Management is shocked! How could this happen again? This had just happened three months ago. But it had been fixed: a resiliency checklist had been created, and development teams were required to fill it out every release. This would ensure they incorporated resiliency into their designs. A new process created, another problem solved. End of story! Or so we thought.
Getting software to production is no easy task. Apparently, keeping software in production isn't always so easy either! While the reasons for failure are endless and varied, I'd like to discuss a few of them in the sections below.
While you'd think this should be apparent, I'm continually amazed by the surprisingly large number of developers and software leaders who subscribe to this rather unscientific theory and practice.
The act of adopting a new process doesn't mean that progress has actually been made. For a process to result in progress, it needs to make people want to fundamentally change the way they behave. Creating yet another checklist doesn't solve a problem; it just creates more overhead.
So often, well-intentioned folks try to fix problems, but instead of attacking the problem, they attack the symptoms. To many, it seemed the problem was that developers weren't thinking enough about resiliency. The real problem, however, was that the development processes weren't designed in a way that caused the software engineers to think about resiliency.
I recently attended a wonderful talk by Casey Rosenthal, a software engineering manager at Netflix. Special thanks to him for the insightful talk, and to David Hussman of DevJam for providing the meeting space and getting the event organized.
Casey was talking about Chaos Monkey, a software tool developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services deployments. It simulates service failures by shutting down virtual machines running within Auto Scaling Groups.
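To make the idea concrete, here's a minimal toy sketch of what Chaos Monkey does conceptually. This is not Netflix's actual implementation (which targets real cloud instances); the class and function names are my own illustrative assumptions, with the auto-scaling group reduced to an in-memory list:

```python
import random

class AutoScalingGroup:
    """Toy stand-in for a cloud auto-scaling group (illustrative only)."""
    def __init__(self, name, desired_capacity):
        self.name = name
        self.desired_capacity = desired_capacity
        self.instances = [f"{name}-i-{n}" for n in range(desired_capacity)]

    def heal(self):
        # A real auto-scaler launches replacements until the group
        # is back at its desired capacity.
        while len(self.instances) < self.desired_capacity:
            self.instances.append(f"{self.name}-i-new{len(self.instances)}")

def chaos_strike(group, rng=random):
    """Terminate one randomly chosen instance, as Chaos Monkey would."""
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim

group = AutoScalingGroup("web", desired_capacity=3)
killed = chaos_strike(group)
group.heal()  # a resilient service recovers on its own
print(f"killed {killed}; capacity restored to {len(group.instances)}")
```

The point of the exercise is the `heal` step: the experiment only "passes" if the group restores itself to full capacity without a human intervening.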
At Netflix, the development teams that own a service are also on-call for production failures of that service. This has a couple of beneficial side effects.
Benefit #1 - The people who support the service in production are the ones intimately familiar with it. (At many organizations I've been a part of, the production support teams call in the developers only as a last resort.)
Benefit #2 - It creates a strong incentive for the development team to build resiliency into their service, because they'll be the ones taking the 3 a.m. phone call when the service goes down and can't recover on its own.
Casey mentioned that part of his team's mission involves randomly killing off services in production during business hours to verify that each service knows how to recover from the failure. There are a few benefits to this strategy.
Benefit #1 - Since the production failures are being caused during business hours, developers are already in the office instead of taking the call at 3 a.m.
Benefit #2 - Hoping that your service won't go down is no longer realistic. Developers are assured that any and every given service will fail.
Benefit #3 - Developers are conditioned to think about resiliency proactively instead of reactively. They know for a fact that their services will go down at some point, because of the continuous service "failures" injected randomly into the production environment.
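The "business hours only" discipline above can be sketched as a simple scheduling guard. This is an assumption about how such a scheduler might be written, not a description of Netflix's tooling; the function name and the 9-to-5 weekday window are mine:

```python
from datetime import datetime

def in_business_hours(now=None):
    """Return True only on weekdays between 9:00 and 17:00 local time.

    A chaos scheduler can consult a guard like this so that induced
    failures happen while engineers are at their desks, not at 3 a.m.
    (Illustrative sketch; the hours chosen here are an assumption.)
    """
    now = now or datetime.now()
    return now.weekday() < 5 and 9 <= now.hour < 17

# Only unleash chaos when the team is around to respond.
if in_business_hours(datetime(2024, 3, 6, 10, 30)):  # a Wednesday morning
    print("chaos experiment allowed")
```

In practice you'd also want an opt-out list and a kill switch, but the core idea is just this: never schedule an experiment at a time when nobody is awake to learn from it.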
Netflix has created an environment where resiliency is built into the process from the beginning instead of bolted on as an afterthought once development is finished. Companies looking to improve their resiliency processes should examine the Netflix example and incorporate these lessons into their own resiliency strategies.
Are you interested in increasing the efficiency of your software development and deployment processes? Learn the practices large software companies use to produce reliable software systems with DevOps and site reliability engineering in this article.