Site Reliability Engineering and Dev Ops

Site Reliability Engineering for Software Applications

The evolution of technology frameworks has made it continually easier to create new business functionality. As software developers are able to develop functionality at an ever-increasing rate, it becomes practical to deploy at a much more frequent pace. Ten years ago, the typical software organization might only have a couple of releases annually. In many cases, one or two systems engineers could be reliably counted on to execute the internal deployments for the various development and QA environments, as well as the occasional deployments required for the production environment.

As new technology frameworks made it practical to produce significant business functionality in a matter of days and weeks, instead of months and years, the need for deployments increased significantly.

Technology Dictates Development Methodology (not the other way around)

The conventional wisdom says that the Agile development methodology is what drove the adoption of shorter development cycles and more frequent deployments. However, Agile was merely the reaction to an ever-changing technology environment that created a need for shorter development cycles.

Don't be Married to a Particular Development Methodology

When Agile was first introduced, Agile Scrum was the preferred development methodology. While there was an awareness of Kanban, Scrum was clearly the favorite in many organizations. Once again, it was technology that was driving this decision. While technology frameworks were vastly improved, they hadn't improved to the point where it made logical sense to conduct daily (or multiple daily) deployments to production. However, that is no longer true! The technology frameworks available today make it reasonable, logical, and possible to have continuous daily deployments to production.

Once again, this technology change has also facilitated a shift in development methodologies. Methodologies like Kanban are better suited to continuously cranking out new functionality, which is then continuously deployed to production environments. There are no sprint boundaries that can arbitrarily delay the development and deployment of a particular feature. Teams simply work the highest priority tasks.

Deployment / Monitoring Architecture Cost Increases in Significance

The cost to develop new features continues to decrease significantly, while the number of deployments has increased significantly. Thus, the cost of deployments has increased, both in total and as a percentage of the overall cost (deployment vs. development). This cost increase has driven the need for significant improvements in deployment and monitoring technologies and architecture.

The Market Always Responds To a Need

As is always the case, when a need becomes significant, the market develops solutions. Along came technologies like Chef, Puppet, Monit, ELK, Jenkins, Docker, Ansible, Splunk, AWS, etc. These tools and technologies helped facilitate increasingly complex deployment architectures. With the explosion of public APIs, there arose a need to scale up for potentially global traffic volumes, sometimes with thousands of servers serving requests, each potentially running a different version of the software. This has created a need for an entirely new job description for those in charge of deployments.

Site Reliability Engineer / Dev Ops Engineer is Born

The practice of systems engineering used to require a basic knowledge of some scripting language and the ability to navigate a particular deployment tool of choice. The evolution to dev ops and site reliability engineering has created a need for a whole new category of technology worker. It is no longer sufficient to have rudimentary scripting skills and the ability to navigate deployment tools. Dev ops / site reliability engineers are not second-class citizens in the software engineering space. An elite site reliability engineer is every bit as valuable as an elite software engineer. You may even argue that SRE vs. software engineer is a distinction without a difference, and that they are one and the same.

Site Reliability Engineering as a Practice

The rest of the concepts in this article are summarized ideas from the book Site Reliability Engineering: How Google Runs Production Systems.

Site Reliability Engineering Principles

The main principles of site reliability engineering are outlined as follows:

Eliminate toil - toil is any task that is manual, repetitive, automatable, tactical, devoid of enduring value, and that grows linearly (O(n)) with service growth. Any time you encounter a task that fits this description, you need to automate it or figure out how to eliminate it. These tasks are time-wasters and efficiency-killers.
Define Service Level Objectives (SLO) - when defining SLOs, you need to figure out what is acceptable. SLOs that are too aggressive can be overkill and extremely cost-ineffective. SLOs that are not aggressive enough will cause consumer dissatisfaction and lost business. In summary, the SLO should be defined at the point where additional improvement is unlikely to be noticed by the consumer.
Embrace Risk - Define a risk tolerance and an error budget. Plan to spend it all, but no more. If you've defined an SLO of 99.99% uptime, your service can be down for about 52.56 minutes a year (0.01% of the 525,600 minutes in a year), or about 4.38 minutes a month. Your error budget would be 4.38 minutes a month. Under an error budget, once you've exceeded 4.38 minutes of downtime in a month, you'd no longer be able to conduct deployments or other activities that might cause downtime. The main benefit of an error budget is that it provides a common incentive for product development and site reliability engineering to find the right balance between innovation and reliability. Additionally, while you don't want to go over 4.38 minutes of downtime in a month, you do want to use up the full amount. If you had a month with zero downtime, that probably indicates that you weren't being innovative enough and weren't taking enough risk.
Monitoring Distributed Systems - One of the key aspects of site reliability engineering is the monitoring of production systems. While there are all sorts of tools that can be used, there are four critical metrics that must be collected. See the section on Four Most Important Monitoring Metrics below.
Automation - There are a number of benefits to automated processes. They are more consistent and eliminate the element of human error. They provide a platform that can be extended to automate other tasks and systems. They allow systems to be changed or repaired much faster. They save time (the one resource that you can't make more of).
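
The error budget arithmetic from the Embrace Risk principle can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, a 365-day year is assumed, and a month is taken to be one twelfth of a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def error_budget_minutes(slo_percent: float, period_minutes: float) -> float:
    """Return the downtime allowed in a period for a given availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return period_minutes * (100 - slo_percent) / 100


def deployments_allowed(budget_minutes: float, downtime_so_far: float) -> bool:
    """Risky changes are allowed only while some of the budget remains."""
    return downtime_so_far < budget_minutes


yearly = error_budget_minutes(99.99, MINUTES_PER_YEAR)
monthly = error_budget_minutes(99.99, MINUTES_PER_YEAR / 12)
print(f"yearly budget:  {yearly:.2f} minutes")   # 52.56
print(f"monthly budget: {monthly:.2f} minutes")  # 4.38
print(deployments_allowed(monthly, downtime_so_far=3.0))  # True
print(deployments_allowed(monthly, downtime_so_far=5.0))  # False
```

A check like deployments_allowed could gate a release pipeline, so that the decision to pause deployments comes from the agreed budget rather than from a negotiation between teams.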

Four Most Important Monitoring Metrics

While there are endless metrics that can be monitored in production systems, the four described below are some of the most important for indicating the health state of your production system.

Latency - the amount of time required to process a particular request.
Traffic - the number of requests in a given time period.
Errors - the number of requests that are not returning the proper data.
Saturation - the utilization level of constrained resources (CPU, I/O, bandwidth, memory).
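
To make the four metrics concrete, here is a small sketch that reduces one monitoring window to a value for each. The Request record, the summarize_window function, and the sample numbers are all hypothetical; in practice these values would come from your monitoring tooling, and saturation would be tracked for several resources rather than one.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float  # time taken to process the request
    ok: bool            # False when the request did not return the proper data


def summarize_window(requests, window_seconds, used, capacity):
    """Reduce one monitoring window to the four metrics.

    `used` and `capacity` describe a single constrained resource
    (for example memory in MB).
    """
    durations = [r.duration_ms for r in requests]
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_median_ms": median(durations) if durations else 0.0,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests) if requests else 0.0,
        "saturation": used / capacity,
    }


# Hypothetical sample: five requests observed over a 10-second window.
sample = [
    Request(100.0, True),
    Request(120.0, True),
    Request(80.0, True),
    Request(400.0, False),
    Request(90.0, True),
]
metrics = summarize_window(sample, window_seconds=10, used=6144, capacity=8192)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

For this sample the median latency is 100 ms, traffic is 0.5 requests per second, the error rate is 0.2, and saturation is 0.75. The single slow, failed request barely moves the median, which is one reason to also watch a high percentile or the maximum.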

Site Reliability Engineer's Work Allocation

A dev ops / site reliability engineer should spend 50% of their time working on engineering tasks. Put another way, they should spend half of their time trying to work themselves out of a job. If they are getting requests from other teams that exceed 50% of their time, they need to push back. If they stop taking requests from development teams once they've used up their daily or weekly allowance of time, the development teams will be incentivized to automate what they can, and will be discouraged from saddling dev ops engineers with toil (see definition above).
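
The push-back rule can be expressed as a simple allowance check. The sketch below is only an illustration: the function name, the 40-hour week, and the hour figures are assumptions, and the 50% cap is the one described above.

```python
def can_accept_request(toil_hours_logged, request_hours,
                       weekly_hours=40.0, toil_cap=0.5):
    """Return True if taking the request keeps the week's toil within the cap."""
    toil_allowance = weekly_hours * toil_cap  # 20 hours with the defaults
    return toil_hours_logged + request_hours <= toil_allowance


print(can_accept_request(toil_hours_logged=12, request_hours=4))  # True: 16 <= 20
print(can_accept_request(toil_hours_logged=18, request_hours=4))  # False: 22 > 20
```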

SRE Practices

Some of the necessary practices of site reliability engineering are outlined below. The following are tactical measures that will help keep your environments up and running consistently.

Monitoring - if you don't monitor, you won't have the ability to detect error situations prior to them becoming catastrophic.
Incident Response - You need an incident response strategy and plan. This will likely involve on-call personnel. Keep a list of outages and incidents.
Postmortem analysis - You need to conduct postmortem analysis on all incidents. Analyze root causes and look for trends.
Testing - When you see error/incident trends, you need to develop a testing strategy (preferably automated) to prevent these types of incidents.
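
Looking for trends across postmortems can start with something as simple as counting root causes in the incident list. The following sketch assumes each incident has already been labeled with a root cause during its postmortem; the Incident record, the recurring_causes helper, and the sample incidents are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    root_cause: str  # label assigned during the postmortem


def recurring_causes(incidents, min_count=2):
    """Return (root_cause, count) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter(i.root_cause for i in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Hypothetical incident log.
log = [
    Incident("Checkout errors", "bad config push"),
    Incident("API unreachable", "expired certificate"),
    Incident("Search timeouts", "bad config push"),
    Incident("Log ingestion stalled", "disk full"),
    Incident("Login failures", "bad config push"),
    Incident("Webhook failures", "expired certificate"),
]
for cause, count in recurring_causes(log):
    print(f"{cause}: {count} incidents")
```

Here the recurring causes are bad config pushes (3 incidents) and expired certificates (2 incidents), which would be the first candidates for automated tests or checks.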

Summary

While this is not an exhaustive list of tools and practices associated with site reliability engineering, hopefully it spurred some thoughts and ideas that you can leverage to improve the health of your deployment environments and infrastructure. If you'd like a much more in-depth look into these practices, they can be found in Site Reliability Engineering: How Google Runs Production Systems.

Are you interested in learning the Netflix approach to creating resilient software? Check out this article.