5 Red Flags Deployment Management Is Failing

20 May, 2019

by Mark Henke

It’s a great step when teams deliberately manage their deployments instead of treating them as second-class citizens to writing code. But there are many pitfalls to managing deployments effectively. Many things lurk, waiting to trip us up. We want to start with a guiding light: to automate our deployments completely from a push of code up to production. Every step we take toward automation makes deployment management easier. We’ll reduce human error and save loads of time. We’ll allow our customers to trust us. In the spirit of complete and utter automation, I give you five red flags that we can use to find and root out obstacles to our deployment management.

1. Manual Steps and Approvals

Our first red flag is a fairly obvious one to those familiar with continuous delivery: manual steps and approvals. Every one of these steps is a spot of darkness against the guiding light of automation.

Manual Steps

Manual steps are clear obstacles to automated deployments. They remain a source of human error and slowdowns. It’s common when practicing deployment management to automate some steps, but it’s easy for a team to lose steam and give up on automating all of them. How hard these steps are to automate can depend on the maturity of your organization, and it can be especially tricky when they require tooling or infrastructure that isn’t yet in place.

External Approvals

Many manual steps come in the form of supervisor or compliance approvals. The most insidious of these is when someone outside of the software team must approve deployments. Oftentimes the person approving has no grasp of what’s going to production, which makes such approvals mere illusions of safety. These can be tricky to root out because they’re outside the direct influence of the team.

Dealing With Manual and External Steps

With perseverance and data to make our case, we can drive out manual steps and external approvals. Management loves talking about money, and you can show how these approvals cost your organization. First off, they increase lead time significantly; the cost of an approval handoff is one of the largest costs across an enterprise. Additionally, human error is still in effect. If you show how defects can still escape into production with approvals in place, you weaken the reason for their existence. For manual steps, forecast the reduction in configuration errors and show how much time you can save per deployment by automating them. You can even strengthen your case by eliminating the next red flag we’ll discuss.

2. High Error Rates per Deployment

A high error rate in a team’s application is another red flag that many may consider obvious. This can include actual error responses in the application or defects that break an application’s service level agreements or objectives. High error rates indicate that a team needs to build more quality into its deployment pipeline. This could mean automating certain manual steps, as we discussed above. It often means adding more tests to the deployment pipeline.

One of the more counterintuitive ways to deal with this is to “slow down” per user story and bake in more testing. It could also mean putting in more resiliency-focused code or focusing more on edge cases. Practices like behavior-driven development really help bake this quality in to keep your error rates per deployment low.

Ensure you diligently measure this metric when managing deployments.
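
To make this concrete, here is a minimal Python sketch of one way a team might track errors per deployment from its own records. The DeploymentRecord structure and the numbers are purely illustrative assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class DeploymentRecord:
    """One production deployment and the defects traced back to it (hypothetical structure)."""
    version: str
    changes_shipped: int  # stories or changes included in this deployment
    defects_found: int    # defects or SLA breaches attributed to it

def error_rate(records: list[DeploymentRecord]) -> float:
    """Defects per deployment across the window being reviewed."""
    if not records:
        return 0.0
    return sum(r.defects_found for r in records) / len(records)

# Example data, purely illustrative.
history = [
    DeploymentRecord("1.4.0", changes_shipped=5, defects_found=2),
    DeploymentRecord("1.4.1", changes_shipped=3, defects_found=0),
    DeploymentRecord("1.5.0", changes_shipped=8, defects_found=3),
]

print(f"Errors per deployment: {error_rate(history):.2f}")  # 1.67
```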

3. Not Deploying on Certain Days

Now we come to a more subtle indication that you can improve your deployment management: not deploying on certain days. This red flag signals that your deployments are unreliable. One common example is “Please don’t deploy anything on Fridays.” This is common because people assume a few things. First, they assume there’s a high enough chance that something will go wrong during your deployment. Second, they believe that your deployment pipeline can’t automatically recover when something does go wrong. Third, the team members themselves may say this because they’ll have to support the deployment over the weekend. This last one may be an indication that the team doesn’t have the tooling or training to quickly diagnose, roll back, or fix defects.

There are a few ways to deal with this, beyond reducing the other red flags listed here.

Separating Deployment From Release

When teams are just starting to manage deployments, it can be easy to think that deploying software to production is the same thing as releasing it to customers. However, these concepts are separate. Releasing software exposes it to your customers. Deploying gets a new version of your application into production. Deployment tests things like “Am I connecting to the right databases or web services?” and “Do I have enough memory to run this service?” Releasing allows us to answer questions like “Will this feature make more money?” or “Do the customers like the new layout?” You can see that deployment answers “Am I building things right?” whereas release answers “Am I building the right thing?”

Separating these two will reduce the risk of causing problems for your customers on certain days. We can shift control over releasing changes to our business stakeholders while we continue to manage just the deployment. The main way to do this is to build in release toggles that let us turn on code that has been deployed but sits inert. We can also evolve this into canary releases that expose features to subsets of customers. This lets us change how our system operates in a low-risk, controlled way.
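
As a rough sketch of the release-toggle idea (not any particular toggle framework), the deployed code checks a toggle before exposing new behavior, and a hash of the customer ID gives a stable canary bucket. The toggle name, percentage, and checkout function below are all hypothetical.

```python
import hashlib

# Hypothetical toggle state; in practice this would come from a config
# service or database that business stakeholders can change without a deploy.
TOGGLES = {
    "new_checkout_flow": {"enabled": True, "canary_percent": 10},
}

def is_released(feature: str, customer_id: str) -> bool:
    """Return True if this customer should see the feature.

    The code is already deployed either way; the toggle only controls release.
    A stable hash of the customer ID gives a consistent canary bucket.
    """
    toggle = TOGGLES.get(feature)
    if not toggle or not toggle["enabled"]:
        return False
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < toggle["canary_percent"]

def checkout(customer_id: str) -> str:
    if is_released("new_checkout_flow", customer_id):
        return "new checkout flow"  # deployed and released to a canary subset
    return "old checkout flow"      # deployed code stays inert for everyone else

print(checkout("customer-42"))
```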

Zero-Downtime Deployments

Another key way to deal with resistance to deploying on certain days is to ensure your deployments cause no downtime for your customers. This is common practice at large software companies such as Amazon and Google. With no downtime, your customers will only see changes when a team chooses to release features or when something goes wrong during deployment. We can achieve zero downtime by practicing blue/green deployments. This pattern lets us deploy our new software alongside our old, then switch traffic to the new software once we validate that it’s ready.
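
Here is a simplified Python sketch of that flow, under some assumptions: the blue and green URLs, the /health endpoint, and the switch_traffic stand-in are all hypothetical. In practice the switch would be a DNS change or load balancer update on your platform.

```python
import urllib.request

# Hypothetical environment URLs; in practice "blue" and "green" are two
# identical production stacks sitting behind a router or load balancer.
BLUE = "https://blue.example.com"
GREEN = "https://green.example.com"

def is_healthy(base_url: str) -> bool:
    """Validate the idle environment before any customer traffic reaches it."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def switch_traffic(target_url: str) -> None:
    """Stand-in for the real traffic switch (a DNS change, load balancer
    target swap, or router update -- whatever your platform provides)."""
    print(f"Routing production traffic to {target_url}")

def blue_green_deploy(new_env: str, current_env: str) -> None:
    if is_healthy(new_env):
        switch_traffic(new_env)  # customers see zero downtime
    else:
        # The old environment is untouched, so there's nothing to roll back.
        print(f"Validation failed; traffic stays on {current_env}.")

blue_green_deploy(new_env=GREEN, current_env=BLUE)
```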

4. A High Mean Time to Recovery

Despite our best efforts, things go wrong sometimes. It’s prudent for us to measure how quickly we can get back to a working state when that happens. If we don’t, people start distrusting our deployments and create pressure for us to deploy less frequently. Mean time to recovery (MTTR) is a popular measurement of the average time between when an incident starts causing problems and when that incident is resolved. Measuring this alongside error rate per deployment allows us to constantly review and improve our deployment management practices.
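
As a small illustration, a team could compute MTTR from its own incident log along these lines; the incident timestamps below are made up.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: when each incident began causing problems
# and when service was restored.
incidents = [
    (datetime(2019, 5, 2, 14, 0), datetime(2019, 5, 2, 14, 45)),
    (datetime(2019, 5, 9, 9, 30), datetime(2019, 5, 9, 9, 50)),
    (datetime(2019, 5, 16, 22, 10), datetime(2019, 5, 17, 0, 10)),
]

def mean_time_to_recovery(incident_log) -> timedelta:
    """Average time from the start of an incident to its resolution."""
    durations = [resolved - started for started, resolved in incident_log]
    return sum(durations, timedelta()) / len(durations)

print(f"MTTR: {mean_time_to_recovery(incidents)}")  # 1:01:40 for these three incidents
```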

One way to deal with a high MTTR is to automate away any manual steps in rolling back a deployment. It’s common for teams to home in on the manual steps required for successful deployments but to ignore automating the steps the system needs to recover when those steps fail. I encourage every team to think about both the success and failure of every step and how to script away human interaction.
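
A minimal sketch of that idea, assuming hypothetical deploy.sh and verify.sh scripts: every step has a scripted failure path, so recovery doesn’t depend on a human following a runbook.

```python
import subprocess

def run_step(name: str, command: list[str]) -> None:
    """Run one pipeline step; raise if it fails so the pipeline can react."""
    print(f"Running: {name}")
    subprocess.run(command, check=True)

def deploy_with_rollback(new_version: str, previous_version: str) -> None:
    """Hypothetical deploy wrapper: the commands are placeholders for whatever
    your platform uses; the point is that recovery is scripted, not manual."""
    try:
        run_step("deploy", ["./deploy.sh", new_version])
        run_step("verify", ["./verify.sh", new_version])
    except subprocess.CalledProcessError:
        print(f"Deployment of {new_version} failed; rolling back automatically.")
        run_step("rollback", ["./deploy.sh", previous_version])
        run_step("verify rollback", ["./verify.sh", previous_version])
        raise  # still fail the pipeline so the team is alerted

if __name__ == "__main__":
    deploy_with_rollback("1.5.0", "1.4.1")
```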

But in order to automate failures, we need a way to automatically know that something has failed. This brings us to our final red flag.

5. Unverified Deployments

A team can get so caught up in automating its deployments that it doesn’t bother to learn whether a deployment has actually succeeded. For many teams, the limit of checking a deployment is to hit the home page of their web application. We need to automate not only our deployment steps but also how we verify a deployment was successful. Every time a team promotes software to a new environment, it should check that the promotion succeeded. This will also help eliminate the red flag of external approvals. Below are some strategies we can apply to verify deployments.

Health Checks

The simple health check can go a long way toward verifying deployment success. Just like a person checking the home page, we can check a specific heartbeat URL or do a simple GET request on our service. This tells us that, at the very least, the application started up and is running.
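
A health check can be as small as this Python sketch. The heartbeat URL is a hypothetical example, and the non-zero exit code is what lets a pipeline stage fail automatically.

```python
import sys
import urllib.request

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the heartbeat URL responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical heartbeat URL; most services expose something similar.
    healthy = check_health("https://myservice.example.com/health")
    print("healthy" if healthy else "unhealthy")
    sys.exit(0 if healthy else 1)  # non-zero exit fails the pipeline stage
```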

Smoke Tests

Smoke tests are more comprehensive than health checks but also more complex. They run through some of the system’s scenarios, ensuring that the system is not only running but also has the correct high-level functionality. Be careful, though: you don’t want to pollute your database with unwanted information, so it’s good to have a way to set up and clean up after these smoke tests.
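
Here is a sketch of a smoke test with setup and cleanup, assuming a hypothetical orders API; the base URL, endpoints, and fields are illustrative only.

```python
import json
import urllib.request

BASE = "https://myservice.example.com"  # hypothetical service under test

def get_json(path: str) -> dict:
    with urllib.request.urlopen(f"{BASE}{path}", timeout=10) as resp:
        return json.load(resp)

def post_json(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def delete(path: str) -> None:
    req = urllib.request.Request(f"{BASE}{path}", method="DELETE")
    urllib.request.urlopen(req, timeout=10)

def smoke_test_orders() -> None:
    """Walk one high-level scenario end to end, then clean up after ourselves
    so the smoke test doesn't pollute the environment's data."""
    order = post_json("/orders", {"sku": "SMOKE-TEST-ITEM", "quantity": 1})
    try:
        fetched = get_json(f"/orders/{order['id']}")
        assert fetched["sku"] == "SMOKE-TEST-ITEM", "order round-trip failed"
        print("smoke test passed")
    finally:
        delete(f"/orders/{order['id']}")  # clean up the test data

if __name__ == "__main__":
    smoke_test_orders()
```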

Contract Tests

Contract tests are like smoke tests for your downstream systems. We want to ensure that we’re connecting to our dependencies correctly. We also want to ensure that those dependencies are upholding their end of our contract. Running some simple tests against their systems in each environment allows us to verify the contract is intact and that we have configured our properties correctly.
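
A contract check can be as simple as calling the dependency in each environment and verifying the shape of the response. The pricing endpoint and expected fields below are hypothetical assumptions for the sake of the sketch.

```python
import json
import urllib.request

# Hypothetical downstream dependency and the response shape we expect it to honor.
DEPENDENCY_URL = "https://pricing.example.com/api/prices/SKU-123"
EXPECTED_FIELDS = {"sku": str, "amount": float, "currency": str}

def check_contract(url: str, expected_fields: dict) -> list[str]:
    """Call the dependency in this environment and report any contract gaps."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    problems = []
    for field, expected_type in expected_fields.items():
        if field not in body:
            problems.append(f"missing field: {field}")
        elif not isinstance(body[field], expected_type):
            problems.append(
                f"{field} is {type(body[field]).__name__}, expected {expected_type.__name__}"
            )
    return problems

if __name__ == "__main__":
    issues = check_contract(DEPENDENCY_URL, EXPECTED_FIELDS)
    if issues:
        raise SystemExit("Contract broken: " + "; ".join(issues))
    print("Contract intact and configuration correct for this environment.")
```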

There are many more ways to verify deployments, but these are some of the most frequently used.

Relentless Automation

The goal is clear to us: fully automated, self-recovering deployment pipelines that can deploy on every push. But with our day-in, day-out workload, it’s easy to become desensitized to the numerous obstacles to that goal. I recommend reviewing this list every few retrospectives to see whether any of these red flags remain in your deployment pipeline. If you find them, root them out. Be relentless in your pursuit of automation. The investment will definitely pay off, possibly faster than you think.

Mark Henke

This post was written by Mark Henke. Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.
