The History of Site Reliability Engineering (SRE)
January 2020
by Sylvia Fronczak.
Sylvia is a software developer who has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.
Site reliability engineering (SRE) uses techniques and approaches from software engineering to tackle reliability problems with a team’s operations and a site’s infrastructure.
Knowing the history of SRE and understanding which problems it solves ensures that you can make it work for your organization. And as with the spread and adoption of agile and DevOps, that history guides you, so you know you’re making choices for the right reasons and with the right goals in mind.
In this post, we’ll look at the history of SRE in the industry. We’ll explore what problems caused people to combine engineering principles with operations and system administration.
Let’s kick it off with a look at how all this started.
In the Beginning
To start, let’s talk about software reliability engineering over the last 50 years. In that span, practitioners put a great deal of analysis and thought into software reliability. The literature covers many of the terms and topics that present-day site reliability engineering uses, such as mean time to recovery, using metrics to analyze and predict failures, and creating fault tolerance through application redundancy.
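Metrics like these are simple to compute once you record when failures start and end. Here’s a minimal sketch of mean time to recovery (MTTR) and mean time between failures (MTBF); the incident timestamps are hypothetical, invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure_time, recovery_time) pairs.
incidents = [
    (datetime(2020, 1, 3, 9, 0), datetime(2020, 1, 3, 9, 45)),
    (datetime(2020, 1, 10, 14, 0), datetime(2020, 1, 10, 14, 20)),
    (datetime(2020, 1, 21, 2, 0), datetime(2020, 1, 21, 3, 0)),
]

def mttr(incidents):
    """Mean time to recovery: the average outage duration."""
    outages = [end - start for start, end in incidents]
    return sum(outages, timedelta()) / len(outages)

def mtbf(incidents, period):
    """Mean time between failures: uptime in the period per failure."""
    downtime = sum((end - start for start, end in incidents), timedelta())
    return (period - downtime) / len(incidents)

print(mttr(incidents))                          # → 0:41:40
print(mtbf(incidents, timedelta(days=30)))      # → 9 days, 23:18:20
```

Tracking these numbers over time is what lets you say whether reliability is actually improving, rather than just guessing.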
During this period, people learned that software can and will fail. It also became clear that people can take steps to reduce the occurrence of failure, or at least reduce the impact of failure.
But much of the literature still looked at software as one black box, irrespective of its ecosystem. Experts divided hardware and software into two different worlds. Keeping an application running wasn’t the same as building it in the first place.
For example, back when I began writing software, the organizations I worked with kept application developers and operations separate. The developers worked on products and features, while the sysadmins worked on hardware, configuration, and monitoring, much of which was manual.
Now, don’t think that this was unusual. Many companies took on this model. And no one questioned it for a long time.
However, companies like Google saw that there could be a better way. But what problem did they notice that drove them to try something different?
The Problem Worth Solving
In an application life cycle, people first wrote the application, putting forth a lot of engineering discipline and thought. And then they pushed it into production, whether all at once or iteratively. Then, once it was live and in production, they put it in maintenance mode. They didn’t actively work on it unless users wanted more features or unless there was a problem. In fact, they put very little engineering effort into running the application for the years that it provided value.
Though people looked at design and engineering principles during the development of the software, they didn’t take the same amount of care in keeping the software up and running. Instead, they hired operations teams to restart the server occasionally, fiddle with the load balancing, deploy a patch or an upgrade, or manually intervene when a problem occurred.
But they didn’t look for engineering solutions to these menial maintenance tasks. They didn’t even see this as a problem. And they repeatedly completed these same manual tasks on every one of their applications and systems.
When new applications were deployed to production, a new application maintenance and operations team stood ready to take on the additional responsibility.
So why was this a bad thing? Mainly it had to do with costs.
Costs of the Old Way: Devs vs. Sysadmins
First, because of the way manual processes ruled in this paradigm, the sysadmin team grew over time to maintain additional systems and additional traffic. Therefore, the size of the sysadmin team increased with the number of applications, the complexity of the systems, and the success of the application due to the increased traffic.
So a company like Google, which continually adds applications and traffic, required a larger and larger operations staff to keep it all running. In fact, Google isn’t alone. You’ve probably worked in places where a new team formed for each application that went live.
Second, because the development team and the sysadmin team were separate, a divide began to form between them. The devs wanted to ship new features, while the sysadmins wanted to reduce the amount of change happening to the system. The sysadmins and the rest of the operations team felt good when few changes to the application took place. So they wanted fewer new features, fewer config changes, and less complexity overall.
Why did the sysadmins resist new features? Because changes like these introduce the potential for something to break. And the sysadmins didn’t want to be up all night dealing with it.
On the other hand, developers wanted to push new features for the end user. Change is what they lived for!
So they were two sides of the same coin. Both wanted to improve the life of the customer. The operations team wanted to do that through improved reliability. The development team wanted to do that through new features and applications.
As you might expect, the two paradigms conflicted and slowed each other down. The operations teams worked to slow change, while the development team worked to speed up change.
How do we get out of this cycle? How do we get features out quickly while also keeping reliability and stability at acceptable levels?
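One way later SRE practice answers this question is the error budget: pick a reliability target (an SLO) and treat the gap between it and perfection as a budget that feature releases are allowed to spend. Here’s a minimal sketch; the 99.9% target and the request counts are illustrative assumptions, not figures from any real system:

```python
# Error-budget sketch. The SLO and traffic numbers below are
# hypothetical, chosen only to make the arithmetic concrete.
SLO = 0.999                       # target availability (99.9%)
total_requests = 10_000_000       # requests served this quarter
failed_requests = 6_200           # requests that failed

budget = total_requests * (1 - SLO)    # errors we may "spend": ~10,000
remaining = budget - failed_requests   # ~3,800 still available

if remaining > 0:
    print(f"{remaining:.0f} errors left in budget: keep shipping features")
else:
    print("Budget exhausted: slow releases, focus on reliability")
```

The point of the mechanism is that it turns the devs-versus-sysadmins argument into arithmetic: as long as budget remains, change is welcome; once it’s spent, stability takes priority.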
The Path Forward
The journey that Google and other companies took to adopt and create SRE wasn’t always smooth. It also wasn’t a large, coordinated, heavily structured plan that management laid out. It grew from engineers solving engineering problems. And all of this reportedly began back in 2003, when Ben Treynor Sloss coined the term SRE at Google.
So how did Google make it work?
To begin with, people there decided to look at operations and system administration in a different way. They looked at it through the eyes of an engineer.
In addition to hiring software engineers for the operations team, Google looked for sysadmins who understood the internal workings of operations while still having a solid level of engineering skills. Combined, these people were well equipped to improve operations through engineering practices that reduced toil and improved reliability.
Why did the folks at Google do this?
When you look at teams and operations through the eyes of an engineer, you see opportunities to automate and streamline processes. You begin to notice common problems and find elegant solutions that you can use in multiple places. And you believe that engineering skills can solve complex software problems.
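Seen through that engineering lens, even a classic piece of toil—a human noticing a service is down and restarting it—becomes a small program. Here’s a minimal sketch; the health-check URL and the systemd service name are hypothetical stand-ins, not a real Google or Enov8 tool:

```python
import subprocess
import time
import urllib.request

# Hypothetical endpoints: replace with your service's real health
# check and restart mechanism.
HEALTH_URL = "http://localhost:8080/healthz"
RESTART_CMD = ["systemctl", "restart", "myapp.service"]

def healthy(url, timeout=5):
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def watch(url=HEALTH_URL, cmd=RESTART_CMD, interval=30):
    """Poll the health check and restart the service when it fails."""
    while True:
        if not healthy(url):
            subprocess.run(cmd, check=False)
        time.sleep(interval)
```

A script like this replaces a recurring manual chore once, and then works for every service that exposes the same kind of health check—exactly the leverage the engineering mindset is after.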
And then Google iterated.
The team members at Google didn’t look at failures as being a drawback of SRE. Instead, they were opportunities to improve and learn. And over time, Google team members came up with a wealth of knowledge that we can now find in their SRE handbook and workbook.
What does all this mean for you?
The next step involves your path forward. How will you look at costs and benefits with your existing teams and infrastructure? And how will you begin your journey toward SRE?
Key Takeaways
The history of reliability engineering spans many decades. People have spent a lot of time making systems more reliable. However, using manual processes around maintaining reliability resulted in rising costs for operating the systems in production. In an effort to find a better way, Google team members took it upon themselves to combine engineering skills with system administration. They did all this to reduce application operation costs—both direct and indirect.
Even if your path to SRE takes a different route from the ones that Google staffers took, you can be sure that you’ll still end up in the right place. Why? You’ll be looking at problems from a software engineer’s perspective and from a system administrator’s perspective. Combining the two will provide maintainable and reliable systems for your customers. You’ll increase automation to save time, money, and hassles. And you’ll find that problems that occur on your journey will provide chances to learn and grow.
Next Steps – Enhance your IT Environment Resilience
Want to see how you can uplift your IT & Test Environment Resilience? Why not ask us about our IT & Test Environment Management solution?
Helping you manage your Production & Non-Production Environments through System Modelling, Planning & Coordination, Booking & Contention Management, Service Support, Runsheeting, DevOps Automation and centralized Status Accounting & Reporting.
Innovate with Enov8, the IT Environment & Data Company.
Specializing in the Governance, Operation & Orchestration of your IT systems and data.
Delivering outcomes like:
- Improved visibility of your IT Fabric,
- Streamlined Delivery of IT Projects,
- Operational Standardization,
- Security & Availability,
- DevOps / DataOps Automation,
- Real-Time insights supporting decision making & continuous optimization.
Enov8's key solutions include:
- Environment Manager for IT & Test Environment Management.
- Release Manager for Enterprise Release Management & Implementation Planning.
- Data Compliance Suite (DCS) for Test Data Management, including Data/Risk Profiling /Discovery, Automated Remediation & Compliance Validation.
- VirtualizeMe (vME) for Database Cloning & Rapid Provisioning.
Other SRE Reading
Interested in reading more about SRE? Why not start here:
Enov8 Blog: Observability a Foundation for SRE
Enov8 Blog: SRE top 10 Best Practices
Enov8 Blog: DevOps versus SRE – Friends or Foe?