Top five Cloud Metrics for Value-Based Practices – Enov8
Preamble
These days, deploying our services to the cloud just makes sense. Deploying to the cloud means we’re letting someone else handle low-level infrastructure costs, which gives us incredible flexibility. And with such flexibility comes a potentially overwhelming set of options on the best ways we can monitor our systems and take advantage of how easily we can scale our services.
But luckily, we can limit our options based on two key factors: what decisions we’ll make to better our system and how customer-focused they are. With that in mind, I’ll share what I think are the five most important cloud metrics.
Metrics Are for Decisions
Here, I want to revisit some of my points from my “Top 5 DevOps Metrics” post on the Enov8 blog. And I’ll start by reemphasizing what I said there: metrics are useless on their own. If someone comes to you asking for a dashboard or to track some data, it’s fair to ask, “How will you use that data to make decisions?” If data doesn’t help you decide what actions to take, it’s not data worth collecting; the information will simply be clutter. And we eliminate clutter from our minds so we can concentrate on the data that will guide our decisions.
I want to be clear about what I mean when I say decision here. I don’t mean you need to know exactly what specific code you will hit or what report you will send out. What I mean is that you should have a decision-making system in place that you trigger when a metric reaches some threshold.
For example, let’s say you want to be able to notify a customer when your service is unstable. So you ask your team to create a process where they send an email to all customers. Then once the situation settles, they’ll send another email saying everything is A-OK. If the instability is extensive, you may decide to have phone calls with your preferred customers as well. Now you can build a metric that monitors when availability dips below 99 percent, and at that point, you’d send the email.
Customers First, Then Everything Follows
Yes, we need to know what decisions our metrics will support. But we need more than that, too. For any metric we use, we should be able to point back to how it helps our customers. After all, our customers are our reason for existing!
Top Five Metrics
The top five customer-focused, decision-encouraging metrics for cloud systems are as follows:
- Service availability
- Reliability
- Incident rate
- Throughput
- Service response time
Service Availability
Also known as uptime, service availability is how often the service is able to receive requests from users and/or consuming applications. This is usually measured in “9s.” For example, a service that is available 99 percent of the time in a year has two 9s of availability. The gold standard for this is five 9s, or 99.999-percent uptime per year. Achieving this, however, is usually very expensive.
What Decisions It Supports
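To make availability targets easier to reason about, it can help to turn "9s" into a yearly downtime budget and to check a measured availability against a warning level and a critical level. The following is a minimal sketch, not a prescribed implementation: the function names and the default threshold values (99.9 and 99.0 percent) are assumptions chosen for illustration, and the right values depend on what downtime costs your organization.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability_percent):
    """Seconds of downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def classify_availability(measured_percent,
                          warning_percent=99.9,
                          critical_percent=99.0):
    """Map a measured availability to 'ok', 'warning', or 'critical'.

    The default thresholds are illustrative assumptions only.
    """
    if measured_percent < critical_percent:
        return "critical"
    if measured_percent < warning_percent:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
    print(classify_availability(99.5))
```

Under these assumptions, two 9s allows roughly 3.65 days of downtime per year, while five 9s allows a little over five minutes, which is one way to see why the gold standard is so expensive.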
The threshold at which you should trigger some decisions depends on the cost of downtime for your organization. If your system goes down, roughly how much money will the organization lose? Knowing that will help determine what we are willing to invest in availability in order to avoid those costs. It may help to have two thresholds: one that acts as a warning that something may be going on with a service, and a second one that signals critical instability.
With these two thresholds, you can set up some decisions. If you hit the warning threshold, you may want the team to put a card at the top of their backlog to investigate the problem during this iteration or the next one. If you hit the critical threshold, you may want the team to stop what they are doing and swarm on investigating and resolving the problem.
How It Relates to the Customer
The service availability metric has some clear connections to customer satisfaction. If customers cannot use your application, they will go elsewhere. If your application is flaky, they will not trust you enough to use it consistently. So having a highly available system builds warm customer rapport.
Reliability
While availability tells you that a service is up, it may still have problems. When those problems occur, reliability measures how quickly we get the system back to a usable place. Reliability measures both short-term and long-term problem elimination.
Short-term reliability is measured by mean time to recovery. This is how long it takes for the team or system to overcome some problem. For example, if the cache fills up too much and causes product searches to take twice as long as normal, the mean time to recovery would be how long it takes to flush the cache.
Long-term reliability is measured by mean time to repair. This is how long it takes to permanently root out a recurring issue. So going back to our cache example, this would be how long it takes for the developers to implement a feature where the cache expires before it fills up.
What Decisions It Supports
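Both reliability measures are averages over incident durations, so they can be computed from the same incident records. Here is a rough sketch, assuming each incident is stored with three timestamps; the field names `detected`, `recovered`, and `repaired` and the sample data are hypothetical, not taken from any particular incident tracker.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_duration(intervals):
    """Average length of (start, end) datetime pairs, as a timedelta."""
    seconds = [(end - start).total_seconds() for start, end in intervals]
    return timedelta(seconds=mean(seconds))


# Hypothetical incident records for the cache example.
incidents = [
    {"detected": datetime(2024, 1, 1, 10, 0),
     "recovered": datetime(2024, 1, 1, 10, 30),   # cache flushed
     "repaired": datetime(2024, 1, 3, 10, 0)},    # expiry feature shipped
    {"detected": datetime(2024, 1, 5, 9, 0),
     "recovered": datetime(2024, 1, 5, 9, 10),
     "repaired": datetime(2024, 1, 6, 9, 0)},
]

# Short-term: how long until the system is usable again.
mean_time_to_recovery = mean_duration(
    [(i["detected"], i["recovered"]) for i in incidents])

# Long-term: how long until the underlying issue is permanently fixed.
mean_time_to_repair = mean_duration(
    [(i["detected"], i["repaired"]) for i in incidents])

print(mean_time_to_recovery, mean_time_to_repair)
```

Measuring both from the moment of detection is a design choice made here for simplicity; you may prefer to start the clock when the incident is first reported.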
I strongly recommend mapping a value stream for handling production incidents. It’s good to look at how an incident moves from being reported by a user or the system to being resolved and finally being permanently repaired. Doing so will help you track and investigate the waste that inflates these metrics, just like you do with your development value stream.
How It Relates to the Customer
As with availability, customers trust a system that responds the way they expect it to. The fewer nasty surprises in your systems, the more customers will trust you. They’ll also be more likely to stick around and keep using your service in the future. Additionally, an unreliable system will cause bugs that may lose customer transactions, which loses you money. This is especially true if you have to compensate the customer for the inconvenience. This applies to internal customers too, since ultimately the consuming application is serving real customers somewhere.
Even for purely internal apps, such as a timesheet application, high reliability means higher employee morale and employees wasting less time trying to contact the help desk and figure out what’s going on.
Incident Rate
Reliability only gives a portion of the picture of customer trust. The other side of this is incident rates. This metric shows how frequently an incident occurs. You can measure this via your error tracker or even through a customer support tool. The incident rate plus reliability will give you a good picture of how often your system does what is expected.
What Decisions It Supports
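As a small illustration of measuring incident rate and grading severity by frequency, the sketch below counts incidents per day for each error name and maps the rate to a severity level. The function names, the sample error names, and the cutoffs of one and five incidents per day are all assumptions for illustration; a real error tracker will export its own fields and you would pick cutoffs that suit your system.

```python
from collections import Counter


def incident_rates(error_names, days_observed):
    """Incidents per day for each distinct error name."""
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    counts = Counter(error_names)
    return {name: count / days_observed for name, count in counts.items()}


def severity(rate_per_day, medium_at=1.0, high_at=5.0):
    """Grade a rate as 'low', 'medium', or 'high' (illustrative cutoffs)."""
    if rate_per_day >= high_at:
        return "high"
    if rate_per_day >= medium_at:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Hypothetical two-week export: one entry per recorded incident.
    log = ["timeout"] * 70 + ["cache_full"] * 14 + ["bad_input"] * 2
    for name, rate in incident_rates(log, days_observed=14).items():
        print(f"{name}: {rate:.2f}/day -> {severity(rate)}")
```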
The incidents that pop up in your system vary widely, but you probably want a system in place similar to what we discussed with availability. With a good error tracking tool, you can monitor the severity of different errors and warnings. You can also measure severity based on how frequently an error or incident occurs. Medium-severity incidents may trigger an investigation for the next iteration. High-severity errors can trigger an immediate triage from one or more of your developers.
How It Relates to the Customer
The factors here are a lot like the factors for reliability and service availability. There’s not much more to add on this front. Use all three of these metrics to get a sense of how much a customer may trust your application when it is running in the cloud.
Throughput
We shift gears a bit with this next metric, moving away from avoiding problems toward providing maximum service to our customers. Throughput lets us look at how many customer requests we can handle at a time. This is often measured in transactions per second. Slower applications can adjust the time unit as necessary.
What Decisions It Supports
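Throughput itself is simple arithmetic: completed transactions divided by the length of the measurement window. A minimal sketch follows, with hypothetical function names and made-up numbers, that also compares the result against an assumed demand figure.

```python
def throughput_per_second(completed_transactions, window_seconds):
    """Transactions completed per second over a measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return completed_transactions / window_seconds


def below_demand(measured_tps, demand_tps):
    """True when measured throughput falls short of expected demand."""
    return measured_tps < demand_tps


if __name__ == "__main__":
    # Hypothetical numbers: 1,200 transactions completed in one minute,
    # against an anticipated demand of 25 transactions per second.
    tps = throughput_per_second(1200, 60)
    print(tps, below_demand(tps, demand_tps=25.0))
```

For slower applications, the same calculation works with a window measured in minutes or hours; only the unit of the result changes.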
Set up your throughput thresholds based upon current and anticipated customer demand. When your throughput goes below that demand, it makes sense to trigger some sort of investigation with the development team. A strong development team will dedicate a portion of their work per iteration to technical debt. It can make sense to put these investigations into the technical debt backlog.
How It Relates to the Customer
The throughput you need is directly connected to how many customers you serve and how quickly you serve them. The more customers likely to hit your service at a time, the more throughput you need to handle it. If you don’t handle and anticipate the rate of customer transactions, you run the risk of increasing incident rates as your system buckles under pressure.
Service Response Time
Our final metric is how responsive our system or service is in the cloud. How fast do customers or consumers receive responses to their requests? This can be measured as latency, in milliseconds per request. For web apps, this is very simple. For apps with asynchronous processing, it may take a bit more elbow grease to instrument the requests.
What Decisions It Supports
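The decisions in this section come down to comparing a measured latency with an SLA limit, plus an early warning at 80 percent of that limit. One way this could look is sketched below; the function name, the status labels, and the 1,000 ms SLA in the usage lines are assumptions for illustration.

```python
def check_latency(latency_ms, sla_ms, warning_fraction=0.8):
    """Classify a response time against an SLA limit.

    Returns 'breach' above the SLA, 'warning' at or above the early
    warning level (80 percent of the SLA by default), otherwise 'ok'.
    """
    if latency_ms > sla_ms:
        return "breach"
    if latency_ms >= sla_ms * warning_fraction:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical SLA of 1,000 ms per request.
    for latency in (500, 800, 1200):
        print(latency, check_latency(latency, sla_ms=1000))
```

Treating a response that lands exactly on the SLA limit as a warning rather than a breach is a choice made for this sketch; adjust the comparison to match how your SLA is worded.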
Based on your application, you can establish certain response-time service-level agreements (SLAs) for your customers. When latency hits that threshold, just like with throughput, you can toss an investigation or “fix it” card into the team’s technical debt backlog. I recommend ensuring these SLAs are actually made visible to the customer and known in advance. Otherwise, you may find yourself scrambling over “false alarms”: slower requests that don’t actually break any SLAs.
A more advanced decision-making system would proactively stop SLA breakages by having an early warning threshold. One way you can achieve this is by making visible the moments when your response times are 80 percent of the SLA time.
How It Relates to the Customer
Anything that takes more than a second on the web is noticeable to people. When customers notice slowness, they get frustrated. This affects the trust and loyalty you retain with them, just like reliability. You want to present a responsive, accommodating experience to your customers so they will keep coming back.
An Umbrella for a Rainy Day
When we run in the cloud, sometimes we get a downpour of scaling and production issues. Similar to the metrics in my “Top 5 DevOps Metrics” post, knowing the decisions we will make will give us an umbrella against this rain. Our umbrella will be strongest when we focus on our customers. Drawing from this, we can derive a set of strong metrics that ensure our services stay dry on rainy days.