IT Environments – Top 5 Deployment Metrics

3 February 2019

By Christian Meléndez.

Preamble: Following on from our previous blog on Enterprise Release Management benchmarking, Christian talks about the “sister discipline” of deployment management and, more specifically, the top five metrics you can use to better understand and optimize your operations and streamline application lead times and project delivery.

Introduction

It’s time to talk about the top five deployment metrics that you can use to make wise decisions when improving the lead time of an application. Mapping your value stream is an excellent way to identify where you’re having problems shipping application changes on time. Deployment metrics will tell you when, and how much time, you need to spend improving the deployment pipeline’s quality and efficiency.

When talking about deployments, the key metrics relate to frequency and how much time deployments take. But quality is also an important metric. How many deployments fail? Ideally, none. How much time does it take to recover when there’s a failure? How many deployments are on schedule?

There are many reasons these metrics could look bad, like spending too much time on manual tasks. But without this information, it’s hard to know where to focus your efforts.
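To make these metrics concrete, here’s a minimal Python sketch that computes all five from a list of deployment records. The record fields and the sample values are illustrative assumptions, not the output of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Deployment:
    started: datetime        # when the deployment began
    minutes: float           # how long it took to finish
    succeeded: bool          # did it complete without failing?
    on_schedule: bool        # did it ship on the planned date?
    recovery_minutes: float  # time to recover if it caused downtime (0 if none)

def summarize(deployments: list[Deployment]) -> None:
    n = len(deployments)
    failed = sum(not d.succeeded for d in deployments)
    downtimes = [d.recovery_minutes for d in deployments if d.recovery_minutes > 0]
    print(f"Deployments:          {n}")
    print(f"Avg time per deploy:  {sum(d.minutes for d in deployments) / n:.1f} min")
    print(f"Failed deployments:   {failed} ({failed / n:.0%})")
    if downtimes:
        print(f"Mean time to recover: {sum(downtimes) / len(downtimes):.1f} min")
    print(f"On schedule:          {sum(d.on_schedule for d in deployments) / n:.0%}")

# Two illustrative records, just to show the shape of the data.
summarize([
    Deployment(datetime(2019, 2, 1, 10, 0), 12.0, True, True, 0.0),
    Deployment(datetime(2019, 2, 2, 15, 0), 45.0, False, False, 20.0),
])
```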

The Number of Deployments

Just because the current trend is to make several deployments a day doesn’t mean that you must focus on deploying changes even if you don’t need to. The number of deployments you execute says a lot about how frequently you’re changing the application. The more frequently you deploy, the smaller the changes, and the smaller the risk they carry. It’s very different to do one big deployment to change the database engine as opposed to changing the database engine little by little, for example by using feature flags.

The best approach to deployment frequency resonates with the saying “if it hurts, do it more often.” Martin Fowler wrote about this saying, explaining that frequency reduces the difficulty of activities. And we all know that deployments can be both difficult and risky. Most of the time when there’s downtime, it’s because a deployment just occurred. Fearing such downtimes, we decide to accumulate changes or set tight schedules for deploying.

Increasing the number of deployments helps you improve them until they become dull and deterministic.
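As a sketch of the feature-flag idea, the following Python snippet routes a small, configurable share of users to a new database engine so that each deployment only widens the rollout instead of switching everything at once. The flag logic and the two query functions are hypothetical stand-ins, not any specific feature-flag library.

```python
import hashlib

ROLLOUT_PERCENT = 10  # raised a little with each deployment

def old_engine_query(user_id: str) -> str:
    return f"orders for {user_id} (old engine)"  # hypothetical existing path

def new_engine_query(user_id: str) -> str:
    return f"orders for {user_id} (new engine)"  # hypothetical new path

def use_new_engine(user_id: str) -> bool:
    # Hash the user id so the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

def fetch_orders(user_id: str) -> str:
    if use_new_engine(user_id):
        return new_engine_query(user_id)
    return old_engine_query(user_id)

print(fetch_orders("user-42"))
```

Hashing the user id keeps each user’s experience stable while the rollout percentage grows deployment by deployment.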

The Time for Each Deployment

If you want to deploy frequently, deployments shouldn’t take too much time to finish. The amount of time a deployment takes is a sign of how complex or simple the steps to deploy a change are. Maybe you need to update a lot of servers, and you’re updating them one by one. Or maybe there are many dependencies you need to take care of to finish the deployment; for example, you need to update the database or update the HTTP endpoint for the services that depend on the application you deploy.

Time is also a sign of how many manual processes still exist in your deployment pipeline. The more manual labor exists, the riskier a deployment becomes. And that risk only increases when there’s pressure to ship an application change. Everything that can go wrong will go wrong. This is especially true when someone is following a manual or trusting their ability to remember every step. Just one mistake in a deployment, and chances are high that you’ll need to start over.

So is the answer to improving deployment time getting rid of manual labor with automation? Sometimes the effort is worth it. Other times, the problem might be a tightly coupled architecture where every fix could generate a new issue. An application should be deployable independently, without having to coordinate too many things.

A deployment that takes a few minutes or less will also allow you to increase the frequency of deployments.
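One way to find out where deployment time goes is to time every step. Here’s a minimal Python sketch, with sleep calls standing in for the real steps, that records each step’s duration so the slowest (and often manual) ones surface first.

```python
import time
from contextlib import contextmanager

durations: dict[str, float] = {}

@contextmanager
def timed(step: str):
    start = time.monotonic()
    try:
        yield
    finally:
        durations[step] = time.monotonic() - start

# The sleep calls stand in for real steps: build, copy artifacts, restart, etc.
with timed("build artifact"):
    time.sleep(0.1)
with timed("update servers one by one"):
    time.sleep(0.3)

# Slowest steps first: these are the first candidates for automation.
for step, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{step}: {secs:.1f}s")
```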

The Number of Failed Deployments

Now let’s say that you’ve been able to (or at least are able to) deploy more frequently and deployments are fast. How many of those deployments fail? Hopefully not many. But if you know that the failure rate has increased, you can decide to slow down a little and focus your energy on improving deployment quality.

Correlate the number of errors in the application with deployments. The next time a downtime occurs or the application’s availability degrades, start by asking whether it began because a deployment just happened.

When the deployment failure rate increases, maybe you need to increase code coverage, add more integration tests, add performance tests, or implement canary releases. But what I’ve seen happen most often is that the quality of deployments decreases because production deployments are treated differently than deployments to other environments. Differences in the way you deploy your application across environments will cause you problems. Deploy the same way everywhere, using the same scripts, recipes, or playbooks.

A low failure rate will increase the team’s confidence and tell you that you’re on the right path to improving the lead time of your deployments.
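Here’s a small sketch of that correlation in Python: flag any application error that occurs within a window after a deployment. The timestamps and the 30-minute window are illustrative assumptions; in practice, both lists would come from your pipeline and your logs.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # illustrative correlation window

deploys = [datetime(2019, 2, 1, 10, 0), datetime(2019, 2, 1, 16, 0)]
errors = [datetime(2019, 2, 1, 10, 12), datetime(2019, 2, 1, 14, 5)]

def after_deploy(error: datetime) -> bool:
    # True if the error happened within WINDOW after any deployment.
    return any(timedelta(0) <= error - d <= WINDOW for d in deploys)

suspect = [e for e in errors if after_deploy(e)]
print(f"{len(suspect)} of {len(errors)} errors happened shortly after a deploy")
```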

The Mean Time to Recover After a Downtime

Every time a deployment happens, there’s always the possibility of causing downtime in the application. The reasons behind a downtime vary, like missing test cases or a typo in the number of servers that have to be updated with the new code. Having 100 percent availability for a service is expensive, and it isn’t worth it. Instead of trying to avoid any downtime at all, focus on recovering as fast as you can when something goes wrong.

Start by setting a baseline to define how the system looks when it’s healthy. Then take action every time something gets outside the parameters of that “normal” state. For example, you might define a service level objective (SLO) of latency under 500ms. If, after a deployment, the system’s latency is above the SLO, you might want to roll back immediately.

Ideally, if you’ve improved the deployment frequency and deployments don’t take too much time to finish, the mean time to recover should be small. It’s just another deployment, ordinary business by this point. It could be as simple as changing the artifact version to the previous one. Remember, the time it takes to deploy a change indicates how hard or easy it is to make a change in the application.

The best way to keep the mean time to recover small is by practicing the rollback mechanism. When everything is under version control, it gets easier: it’s just about going back to the previous version and deploying again.
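A minimal sketch of that SLO check in Python, assuming hypothetical deploy() and p95_latency_ms() stand-ins for your own pipeline and monitoring calls:

```python
SLO_LATENCY_MS = 500  # the example SLO from above

def deploy(version: str) -> None:
    print(f"deploying {version}")  # stand-in for your real deployment step

def p95_latency_ms() -> float:
    return 620.0  # stand-in for a query to your monitoring system

def deploy_with_rollback(new_version: str, previous_version: str) -> None:
    deploy(new_version)
    if p95_latency_ms() > SLO_LATENCY_MS:
        # Recovery is just another deployment: back to the known-good version.
        deploy(previous_version)

deploy_with_rollback("app-1.4.0", "app-1.3.9")
```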

The Number of Deployments on Schedule

It doesn’t matter that you can ship application changes rapidly if they’re not live in production on time. The number of on-schedule deployments is a good metric to motivate the team to improve the deployment pipeline. Anyone interested in understanding why you need to introduce automation or infrastructure as code will also find this metric helpful.

I’ve worked on projects where we needed to change many things in every deployment. A few of the changes were done manually. There were even times when unexpected problems arose in a testing environment, and these problems had a significant impact on the testing phase of the application. We lacked homogeneous environments, so something could always go wrong when pushing the application changes to the next environment.

Deployments that happen on schedule are a sign that your deployment pipeline is in good condition: you deploy frequently, it takes just minutes, and, more importantly, the quality is steady.
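Measuring this one is straightforward if each deployment records both its planned and its actual date. A minimal Python sketch, with illustrative dates:

```python
from datetime import date

# (scheduled, actual) pairs; the dates are illustrative.
deployments = [
    (date(2019, 1, 7), date(2019, 1, 7)),
    (date(2019, 1, 14), date(2019, 1, 16)),  # shipped two days late
    (date(2019, 1, 21), date(2019, 1, 21)),
]

on_time = sum(actual <= scheduled for scheduled, actual in deployments)
print(f"On schedule: {on_time}/{len(deployments)} ({on_time / len(deployments):.0%})")
```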

How Are Your Deployment Metrics Doing?

Before you start using these metrics to identify waste and make deployments more efficient, you need to know your current state. What are the current numbers for these metrics in your workloads? If you don’t know your current state, you’ll end up improving processes that don’t add value.

Using these deployment metrics will help you identify where there’s waste in your value stream map. You’ll have a clear view of where to direct your efforts to improve speed and quality when shipping new features.
Author: Christian Meléndez. Christian is a technologist who started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.
