The time it takes engineering teams to recover from an incident is getting worse every year, even though spending on observability tools has reached record highs. This should worry everyone in this field.
The Logz.io Observability Pulse tracked the share of teams reporting a mean time to resolution (MTTR) of more than one hour: 47% in 2021, 64% in 2022, 74% in 2023 and 82% in 2024. That is four years in a row of going backwards. Over the same period, the average number of tools used by a team rose to eight or nine different platforms.
The answer from the industry has always been the same: More — more tools, more dashboards, more signals. The working assumption is that the problem is visibility; that if engineers could see more of what was going on, they would be able to fix problems faster.
The data suggests this assumption is wrong. This article argues the opposite: Beyond a certain threshold, excess observability data creates cognitive overload that hinders root cause analysis (RCA) and increases MTTR. More signals can, surprisingly, mean less clarity.
I’ve watched this play out firsthand.
At a previous job, while on call for a large cloud provider, I worked an incident in which a customer was seeing packet drops during a performance test. The war room got going quickly. We looked at all our dashboards, which showed average CPU, per-node CPU, soft IRQ and memory use across the fleet. Everything seemed fine. There was no problem anywhere. We spent a long time in that space, methodically going through the stack, sure that the answer was in there somewhere if we just looked harder.
However, it wasn’t. We finally ran a packet capture, a simple, old-school diagnostic technique, and found the real problem right away. Bursty traffic had pushed SNAT port usage past its limit. There were too many concurrent connections. The solution was simple: Add nodes.
A metric for SNAT port usage existed. It just wasn’t on any of our dashboards. We had built a complicated observability system that covered almost everything, yet it confidently steered us away from the one thing that really mattered. Not only did the complexity not help, it also wasted our time by sending us after a complicated answer instead of the simple one.
We Built More; Things Got Slower
The Grafana Labs Observability Survey 2025 found that companies now use an average of eight different observability tools, down from nine the year before. Teams are starting to consolidate, but not because they are rethinking what observability is really for. They are doing it because of cost. Nearly 74% of businesses say that cost is the most important factor when choosing tools.
The CNCF’s own research backs this up: 72% of teams use up to nine different observability tools, and tool sprawl is the most common operational problem mentioned by half of all respondents. This is a neutral finding from a foundation that has no vendor interest in the answer.
Meanwhile, 39% of engineering teams say that complexity and operational overhead are their biggest problems. Not a lack of information. Not a shortage of tools. Complexity. The thing that was supposed to fix the problem has become the problem.
One engineering team documented this well: They audited their setup and found 47 dashboards, each of which made sense on its own. During incidents, engineers opened them at random and looked at panels that told different stories. They deleted 28 without a plan for archiving. Within two weeks, clarity had improved.
We have confused data coverage with operational clarity. They feel the same at purchase time. At 2 a.m., they feel very different.
There Is a Cognitive Wall
Gary Klein’s research on how experts make decisions under pressure found that experienced professionals don’t solve problems by looking at all the information they have. They match patterns. They look for a shape they know, do a quick mental simulation and then do something. The skill is knowing what to let go of.
An engineer can’t synthesize eight dashboards across four platforms when an incident happens at 2 a.m. They look for a signal that resembles one they have seen before. Every extra chart and every alert that isn’t directly related to the main issue is friction. In a high-cardinality environment, there’s a lot of friction.
Studies on MTTR reduction showed that three things consistently speed up recovery: Fast and accurate detection, low-cardinality instrumentation and easy-to-follow diagnostic paths. More data does not always help. There is a sweet spot in the number of metrics, beyond which adding more signals starts to slow down resolution instead of speeding it up. This is what the Google SRE Book has said since 2016. Most people in the industry have ignored it.
AI Is Repeating the Same Mistake
AI-assisted observability is the current industry answer to cognitive overload. If engineers cannot handle the volume of data, train a model to find the important signals. It’s a reasonable bet.
However, the DORA 2024 report found something uncomfortable: A 25% rise in AI adoption was associated with a 1.5% drop in delivery throughput and a 7.2% drop in delivery stability. The mechanism: AI speeds up code production, which raises deployment risk, which leads to more incidents, which consume the time that was saved.
This is the same failure mode as tool proliferation, one level up. More capability was added to solve a human problem, and the added complexity made the problem worse.
Measure Where the Engineer’s Attention Actually Goes
If the industry has been measuring the wrong thing, such as tool adoption instead of incident effectiveness, the solution is to measure something more honest.
Treat the on-call engineer’s attention as a signal. What did they actually open during the incident? What did they look at for more than 30 seconds? What did they act on? What did they completely miss?
It’s not hard: Incident timeline data, tooling interaction logs and one question after the incident — What actually helped you diagnose this? — give you a surprisingly clear picture over time. You could call it incident attention mapping. The data is like an audit of your observability stack done by the people who need it to work.
Most teams would find that a small number of signals do most of the work on most incidents: a few dashboards, two or three types of alerts and one recurring log query pattern. Alongside them sits a long tail of tools that were built for completeness, are checked from time to time out of habit and have never once helped solve a real problem.
That tail isn’t neutral. It costs money in infrastructure, and it costs time when an engineer under pressure has to rule it out before finding what they really need. If an engineer doesn’t open a tool during a real incident, it’s not an observability asset. It’s organizational debt that comes with a bill every month.
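The attention mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `IncidentRecord` schema, the tool names and the sample incidents are all hypothetical, and it assumes you can already export which tools were opened per incident and collect the answers to the post-incident question.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    """One incident's attention data (hypothetical schema, not tied to any product)."""
    incident_id: str
    # Dashboards or tools the on-call engineer opened, e.g. from interaction logs.
    opened: set = field(default_factory=set)
    # Answers to "What actually helped you diagnose this?"
    helped: set = field(default_factory=set)


def attention_map(inventory, incidents):
    """Count, per tool in the inventory, how often it was opened and how often it helped."""
    opened = Counter()
    helped = Counter()
    for incident in incidents:
        opened.update(incident.opened)
        helped.update(incident.helped)
    return {
        tool: {"opened": opened[tool], "helped": helped[tool]}
        for tool in sorted(inventory)
    }


def never_opened(report):
    """Tools that no engineer opened during any recorded incident."""
    return [tool for tool, counts in report.items() if counts["opened"] == 0]


def opened_but_never_helped(report):
    """Tools that drew attention during incidents but were never credited with helping."""
    return [
        tool
        for tool, counts in report.items()
        if counts["opened"] > 0 and counts["helped"] == 0
    ]


# Made-up sample data, for illustration only.
inventory = {"fleet-cpu", "fleet-memory", "snat-ports", "latency-heatmap"}
incidents = [
    IncidentRecord("INC-1", opened={"fleet-cpu", "fleet-memory"}, helped=set()),
    IncidentRecord("INC-2", opened={"fleet-cpu", "snat-ports"}, helped={"snat-ports"}),
]

report = attention_map(inventory, incidents)
print("never opened:", never_opened(report))
print("opened but never helped:", opened_but_never_helped(report))
```

The two lists separate the long tail into tools nobody reaches for and tools that absorb attention without paying it back. A real audit would need many more incidents than this before either list justifies deleting anything.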
What Teams Should Do Instead
The goal isn’t to use fewer instruments; it’s to use them on purpose. Three changes consistently improve response time:
- First, make dashboards that focus on incidents, not coverage. Every panel should answer a specific question that an engineer would ask during a real incident. Get rid of the panel if you can’t name the question.
- Second, put signal ahead of volume. One alert that fires correctly and points directly to the problem is worth more than 50 alerts that need to be interpreted. Check your alert-to-action ratio often. If engineers are often ignoring or silencing alerts, that’s the system telling you something.
- Third, do attention audits after big incidents. Talk to your engineers about what they opened, what worked and what didn’t. If you do this consistently for three months, you’ll have a better idea of how valuable your observability stack really is than any vendor dashboard can show you.
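The alert-to-action ratio from the second change can be computed with very little code. The sketch below assumes one particular definition, the fraction of an alert's firings that led to an engineer acting on it, and a hypothetical event format of `(alert_name, outcome)` pairs. The alert names, the outcome labels and the 0.5 threshold are illustrative assumptions, not values taken from any survey or tool.

```python
from collections import defaultdict


def alert_to_action_ratio(events):
    """Return, per alert, the fraction of firings that led to action.

    `events` is an iterable of (alert_name, outcome) pairs, where outcome is
    one of "acted", "ignored" or "silenced" (a hypothetical labeling scheme).
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for name, outcome in events:
        fired[name] += 1
        if outcome == "acted":
            acted[name] += 1
    return {name: acted[name] / fired[name] for name in fired}


def noisy_alerts(ratios, threshold=0.5):
    """Alerts acted on less often than the threshold (0.5 is an arbitrary default)."""
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)


# Made-up sample data, for illustration only.
events = [
    ("snat_port_exhaustion", "acted"),
    ("cpu_high", "ignored"),
    ("cpu_high", "silenced"),
    ("cpu_high", "acted"),
    ("cpu_high", "ignored"),
]

ratios = alert_to_action_ratio(events)
print(ratios)
print("review candidates:", noisy_alerts(ratios))
```

An alert that lands on the review list is not automatically wrong. It is simply the first place to ask whether the alert still answers a question someone would ask during a real incident.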
The Reframe
A system with good observability isn’t one that lets you see everything. It’s one that lets an engineer who gets a call at 2 a.m. understand what’s wrong in the first three minutes and take the right first step. Anything that helps you clear that bar is worth keeping. Anything that adds noise should be cut, no matter how complete it makes your coverage look on a vendor dashboard.
The way to tell which is which isn’t to ask your vendors. It’s to watch your engineers. Let the people who are fixing things show you what is actually useful by how they act.
The measure of a good observability system isn’t how much it can show you. It is how quickly it helps someone make a confident choice when they are under pressure. Start there and work your way back.
The tool count going from nine to eight is a start. The next step is finding out which eight — and being honest about what the answer reveals.
