SRE vs Disaster Recovery – (Friends or Foe)

08
JUNE, 2020by Eric Boersma

Site reliability engineering versus disaster recovery: Every company needs a disaster recovery plan. This is just a simple fact of life. Your company needs to know how to recover when something breaks or you can’t get access to something you need.

In larger, more advanced tech companies, disaster recovery sometimes bumps against environment management teams. This is the sort of thing that can lead to conflict between the teams. For instance, when the disaster recovery team bumps against the site reliability engineering team, which team takes responsibility for what?

In moments like these, it can feel like disaster recovery and site reliability engineering are foes. In some companies, they stay that way. But in my experience, the healthiest companies don’t think of disaster recovery or site reliability engineering as being foes at all. They find that they work best together, each working toward a goal of making the business better.

In this post, we’ll talk about how that happens and how you can make it happen at your company.

 

What Is Disaster Recovery?

Before we get started, it’s good to understand what we’re talking about. Disaster recovery (DR) is the process of planning and executing in scenarios where there’s been a “disaster.”

Originally, this term meant something significant, like a flood or a fire or an earthquake. Businesses would make plans about how to handle these major events. Then, when they did happen, they’d put those plans into action and follow them to get the business back to normal.

In today’s climates, “disaster” can mean something a little less significant. A disaster in today’s tech world can be as small as a critical server suffering a hardware failure. However, it’s still up to the business to have plans in place for events like that and put them into action when they happen.

What Is Site Reliability Engineering?

Site reliability engineering (SRE) can feel a little difficult to define. This is because, as a concept, it’s still pretty young. People are still figuring out the best ways to do it.

Basically, it’s a concept where companies build software to automate building and scaling their technical infrastructure. Instead of systems administrators building and managing individual servers by hand, site reliability engineering expects automation. Effective SRE teams define their environments in code, then use tools to automate their deployment and manage their state. Really effective teams build that automation directly into their release management process. They also use SRE principles to build out test environments, too.

How Can SRE and DR Work Together?

At first glance, it can seem like DR and SRE will always be at odds. When you really break it down, DR is concerned with what happens when things fall down. SRE is concerned with making sure that things keep standing up. Those don’t feel like two groups that will ever overlap, and at some companies, they don’t.

But if you reframe how you think of DR and SRE, it makes a little more sense. At a higher level, both teams are concerned with how to keep the company in an operational state. When you think about it that way, SRE and DR are actually close friends. They’re both working for the same goal, and they have a lot to offer one another.

Configuration Management: SRE Helping DR

At their core, SRE and DR are both focused on defining a valid state for the business, then taking steps until the business achieves that state. In SRE, this is known as configuration management.

An effective configuration management system goes a long way toward simplifying DR plans. What happens when that server crashes and can’t come back up? In a world dominated by system administrators, someone’s pager goes off. They have to drive to the data center and begin the arduous process of bringing a new server back up. That’s disaster recovery.

On a team with quality SRE practices, what happens when that server crashes? The automated server management tool provisions a new server on the cloud server provider. Because its entire configuration is well-managed, it already knows where all the other services on the network run. Nobody’s pager goes off. The entire system is down less than ten minutes.

This kind of process scales out, too. The best configuration management tools monitor the status of entire data centers. If Amazon or Microsoft or Google have a data center that goes offline, your configuration management software will automatically switch to the failover data center. There’s no “disaster” anymore. Things are down for a little while, but the business moves back to a working state as quickly as possible.

Outage Rehearsal: DR Helping SRE

One of the major flaws of most disaster recovery plans is that they’re never put into practice. Obviously, for the business, this seems like good news. Nobody wants to deal with a disaster! But it’s only good news as long as there’s never a disaster.

A DR plan that hasn’t been put into practice isn’t an effective plan. If you haven’t actually tested bringing the system back up, you can’t know it’ll work.

Obviously, that kind of outage is difficult to test. Companies are scared of bringing down critical services, even on purpose. That’s understandable!

However, at companies with quality SRE practices, bringing down critical systems carries far less risk. In the most mature environments, it’s possible to set up entire copies of your production environment. You don’t have to bring down servers that are serving customers.

The goal of bringing these services down is to allow your teams, both SRE and DR, to practice identifying and restoring those services. People don’t learn how to remediate outages by reading slide decks and watching videos. They learn by doing.

An effective SRE implementation means that it’s possible to turn off critical parts of the infrastructure easily. Then, it’s up to the team to identify the outage and figure out how to remediate the problem. This kind of real-world practice is invaluable in an actual emergency.

Savvy readers will notice that this sounds a lot like chaos engineering. They drive at similar aims. The goal is to build resilient, robust systems that work even when things are falling apart. Both work by breaking things on purpose, then figuring out how to fix them so you know better next time. The bonus here is that an effective DR team will make things a lot easier for the SRE team. They’ll be able to guide the effort to fix things through their pre-existing plans.

SRE and DR Don’t Have to Be at Odds

Like we’ve mentioned, SRE and DR teams wind up at odds in a lot of organizations. This can happen for many reasons, and sometimes those reasons can’t be overcome. That’s the unfortunate reality of working in business; no organization is perfect.

However, if you’re striving to make your organization as effective as possible, you should aim to help SRE and DR teams work together. As we’ve shown here, it’s entirely possible for these teams to serve the same purpose. In the best environments, they help each other work more effectively at their ultimate goals.

The key to the effort is to make sure that each team focuses on their ultimate goal: a healthy, effective business that’s operating efficiently. When that’s the case, everyone wins, and SRE and DR teams are friends to the end.

Eric Boersma

This post was written by Eric Boersma. Eric is a software developer and development manager who’s done everything from IT security in pharmaceuticals to writing intelligence software for the US government to building international development teams for non-profits. He loves to talk about the things he’s learned along the way, and he enjoys listening to and learning from others as well.

Relevant Articles

What makes a Good Deployment Manager?

What makes a Good Deployment Manager?

Deployment management is a critical aspect of the software development process. It involves the planning, coordination, and execution of the deployment of software applications to various environments, such as production, testing, and development. The deployment...

DevOps vs SRE: How Do They Differ?

DevOps vs SRE: How Do They Differ?

Nowadays, there’s a lack of clarity about the difference between site reliability engineering (SRE) and development and operations (DevOps). There’s definitely an overlap between the roles, even though there are clear distinctions. Where DevOps focuses on automation...

Self-Healing Data: The Power of Enov8 VME

Self-Healing Data: The Power of Enov8 VME

Introduction In the interconnected world of applications and data, maintaining system resilience and operational efficiency is no small feat. As businesses increasingly rely on complex IT environments, disruptions caused by data issues or application failures can lead...

What is Data Lineage? An Explanation and Example

What is Data Lineage? An Explanation and Example

In today’s data-driven world, understanding the origins and transformations of data is critical for effective management, analysis, and decision-making. Data lineage plays a vital role in this process, providing insights into data’s lifecycle and ensuring data...

What is Data Fabrication? A Testing-Focused Explanation

What is Data Fabrication? A Testing-Focused Explanation

In today’s post, we’ll answer what looks like a simple question: what is data fabrication? That’s such an unimposing question, but it contains a lot for us to unpack. Isn’t data fabrication a bad thing? The answer is actually no, not in this context. And...

Technology Roadmapping

Technology Roadmapping

In today's rapidly evolving digital landscape, businesses must plan carefully to stay ahead of technological shifts. A Technology Roadmap is a critical tool for organizations looking to make informed decisions about their technological investments and align their IT...