The evolution of technology frameworks has made it continually easier to create new business functionality. As software developers are able to build functionality at an ever-increasing rate, it becomes practical to deploy at a much more frequent pace. Ten years ago, the typical software organization might ship only a couple of releases annually. In many cases, one or two systems engineers could be relied upon to execute deployments to the various development and QA environments, as well as the occasional deployment to production.
As new technology frameworks made it practical to produce significant business functionality in a matter of days and weeks instead of months and years, the need for deployments grew accordingly.
The conventional wisdom says that the Agile development methodology is what drove the adoption of shorter development cycles and more frequent deployments. However, Agile was merely the reaction to an ever-changing technology environment that created a need for shorter development cycles.
When Agile was first introduced, Scrum was the preferred development methodology. While there was an awareness of Kanban, Scrum was clearly the favorite in many organizations. Once again, technology was driving this decision. While technology frameworks had vastly improved, they hadn't improved to the point where it made sense to conduct daily (or multiple daily) deployments to production. However, that is no longer true! The technology frameworks available today make continuous daily deployments to production reasonable, logical, and possible.
Once again, this technology change has facilitated a shift in development methodologies. Methodologies like Kanban are better suited to continuously producing new functionality that is continuously deployed to production. There are no sprint boundaries to arbitrarily delay the development and deployment of a particular feature; teams simply work the highest-priority tasks.
The cost to develop new features continues to fall, while the number of deployments has risen sharply. As a result, the cost of deployments has increased, both in total and as a percentage of overall cost (deployment vs. development). This increase has driven the need for significant improvements in deployment and monitoring technologies and architecture.
As is often the case when a need becomes significant, the market developed solutions. Along came technologies like Chef, Puppet, Monit, ELK, Jenkins, Docker, Ansible, Splunk, AWS, and others. These tools and technologies helped teams manage increasingly complex deployment architectures. With the explosion of public APIs came the need to scale to potentially global traffic volumes, with sometimes thousands of servers serving requests, each potentially running a different version of the software. This created the need for an entirely new job description for those in charge of deployments.
The practice of systems engineering used to require a basic knowledge of some scripting language and the ability to navigate a particular deployment tool of choice. The evolution to DevOps and site reliability engineering has created a whole new category of technology worker. It is no longer sufficient to have rudimentary scripting skills and familiarity with deployment tools. DevOps / site reliability engineers are not second-class citizens in the software engineering space. An elite site reliability engineer is every bit as valuable as an elite software engineer. You could even argue that the distinction between an SRE and a software engineer is a distinction without a difference: they are one and the same.
The rest of the concepts in this article are summarized ideas from the book Site Reliability Engineering: How Google Runs Production Systems.
The main principles of site reliability engineering are outlined as follows:
While there are endless metrics that can be monitored in production systems, the four described below are among the most important indicators of the health of your production system.
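Assuming the four metrics in question are the SRE book's "golden signals" (latency, traffic, errors, and saturation), a minimal in-memory tracker might look like the sketch below. The `GoldenSignals` class and its `capacity_rps` parameter are illustrative inventions, not part of any particular monitoring library; real systems would export these values to a tool like Prometheus or Splunk instead.

```python
import statistics

class GoldenSignals:
    """Minimal in-memory tracker for the four golden signals:
    latency, traffic, errors, and saturation."""

    def __init__(self, capacity_rps):
        self.capacity_rps = capacity_rps  # assumed max requests/sec the service can serve
        self.latencies_ms = []            # latency: time taken to serve each request
        self.request_count = 0            # traffic: total requests observed
        self.error_count = 0              # errors: requests that failed

    def record(self, latency_ms, is_error=False):
        self.request_count += 1
        self.latencies_ms.append(latency_ms)
        if is_error:
            self.error_count += 1

    def snapshot(self, window_seconds):
        """Summarize the four signals over the observation window."""
        rps = self.request_count / window_seconds
        return {
            "latency_p50_ms": statistics.median(self.latencies_ms),
            "traffic_rps": rps,
            "error_rate": self.error_count / self.request_count,
            "saturation": rps / self.capacity_rps,  # fraction of capacity in use
        }

signals = GoldenSignals(capacity_rps=100)
for ms in (12, 15, 20, 250):
    # For illustration, treat anything slower than 200 ms as a failed request.
    signals.record(ms, is_error=(ms > 200))
print(signals.snapshot(window_seconds=1))
```

The point of the sketch is simply that all four signals can be derived from per-request data plus one capacity assumption; alerting thresholds would then be set on each field of the snapshot.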
A DevOps / site reliability engineer should spend 50% of their time working on engineering tasks. Said another way, they should spend half of their time trying to work themselves out of a job. If requests from other teams exceed 50% of their time, they need to push back. If they stop taking requests from development teams once they've used up their daily or weekly allowance of time, the development teams will be incentivized to automate what they can, and discouraged from saddling DevOps engineers with toil (see definition above).
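The 50% cap can be treated as a simple budget calculation. The `remaining_ops_budget` helper below is hypothetical, purely to make the arithmetic concrete:

```python
def remaining_ops_budget(total_hours, ops_hours, cap=0.5):
    """Hours of operational work (tickets, interrupts, toil) an engineer can
    still accept before exceeding the cap; a negative result means the cap
    is already blown and new requests should be pushed back."""
    return total_hours * cap - ops_hours

# A 40-hour week with 14 hours already spent on requests from other teams
# leaves 6 hours of operational budget before the 50% line is crossed.
print(remaining_ops_budget(40, 14))
```

Tracking this number per day or per week gives the team an objective trigger for saying "no" rather than relying on gut feel.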
Some of the necessary practices of site reliability engineering are outlined below. The following are tactical measures that will help keep your environments up and running consistently.
While this is not an exhaustive list of the tools and practices associated with site reliability engineering, hopefully it has spurred some thoughts and ideas that you can leverage to improve the health of your deployment environments and infrastructure. If you'd like a much more in-depth look into these practices, they can be found in Site Reliability Engineering: How Google Runs Production Systems.
Are you interested in learning the Netflix approach to creating resilient software? Check out this article.