When it comes to discussing how AI will impact the future of software development and IT management, most vendors insist that it’s important to keep the human in the loop. They are afraid to publicly acknowledge what’s always been true: machines are better than humans at many things, and that list continues to grow.
When it comes to observability and IT operations, our goal should be to get humans out of the loop as much as possible. In a world run by software, where SLAs guarantee 99.999% uptime, a single incident in a month that requires human intervention is enough downtime to violate your commitments to your customers.
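To put a number on that claim: five nines over a 30-day month leaves roughly 26 seconds of allowable downtime, which is less time than it takes a person to acknowledge a page. The short sketch below works through the arithmetic. The function name and the 30-day period are illustrative assumptions, not part of any particular SLA.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).

def downtime_budget_seconds(availability_pct: float, period_days: float = 30) -> float:
    """Return the seconds of downtime allowed in the period at the given availability."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_pct / 100)


# Compare a few common targets over an assumed 30-day month.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_seconds(target)
    print(f"{target}% uptime allows about {budget:.1f} seconds of downtime per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why response times measured in human minutes stop fitting at the top end.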
With the rise of distributed, complex applications exacerbated by more and more AI-generated code, it is difficult for traditional IT operations to keep up across troubleshooting, provisioning infrastructure, performance monitoring, security, etc. Developers now deploy code 70% faster using AI, according to the latest DORA report — you can’t have enough SREs and security personnel to keep up with this kind of activity. The companies that succeed will leverage technology to get humans out of the way as quickly as possible.
Failed Promises of AIOps
It is worth stating the obvious: there’s a difference between where things are today and where they are heading. As someone who has been in IT management for decades, I have seen all the buzzwords, along with both the marketing nonsense and the genuine progress.
The term ‘AIOps’ was coined by Gartner in 2016 to address the increasing complexity and data volume in IT environments, aiming to automate processes such as event correlation, anomaly detection and causality determination. But as it turned out, many of the vendors who claimed to offer AIOps were nothing more than empty shells when you looked under the hood.
It was essentially the same AI/ML that had been used for a decade beforehand, branded in a new way and making outsized claims that didn’t map to reality. We saw tech giants acquire point solutions and then bundle them under the ‘AIOps’ category because it was trendy to do so and they had nowhere better to put them.
But I would argue that many of the companies that continued to trumpet these claims without actually delivering on the promises eventually suffered a hit to their reputation. We still aren’t at that stage where machines are acting autonomously to solve complex problems across performance, reliability and security. But they are doing more than ever before, and there’s every reason to believe we will reach that goal in the future.
The Future of Autonomous Reliability
Autonomous reliability platforms of the future will not only surface actionable insights, but they will also be competent enough to make autonomous decisions without human intervention. And why should this be impossible? If planes can mostly fly themselves, why can’t IT management become autonomous?
The decades-long trend of collecting more and more data in the name of observability isn’t delivering autonomous service reliability. Nor is feeding that data into a machine and hoping that answers magically appear. Machines trained on yesterday’s patterns might explain what went wrong in the past, but they can’t make the real-time decisions required to keep systems running, especially in dynamic, cloud-native environments.
What we need is a paradigm shift from data collection to causal understanding. By capturing causal knowledge as part of an ontology, we can reason about cause and effect in complex, ever-changing systems. That is the key to moving beyond reactive alerting into autonomous reliability. Unlike the growing industry trend of offloading alerts to LLMs, causal reasoning gives us the context and clarity needed to take real control.
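As a rough illustration of what reasoning over captured causal knowledge can look like, here is a toy sketch. It is not any vendor’s implementation: the service names, the `CAUSES` mapping and both functions are hypothetical, and a real ontology would be far richer. The sketch stores cause-and-effect relationships as a directed graph and asks which causes could account for every symptom currently observed.

```python
from collections import deque

# Hypothetical causal knowledge: each cause maps to the effects it directly produces.
CAUSES = {
    "db_connection_pool_exhausted": ["db_query_latency_high"],
    "db_query_latency_high": ["checkout_latency_high", "search_latency_high"],
    "checkout_cpu_throttled": ["checkout_latency_high"],
    "checkout_latency_high": ["frontend_error_rate_high"],
}


def downstream_effects(cause: str) -> set[str]:
    """Collect every effect reachable from `cause` by following causal edges."""
    seen: set[str] = set()
    queue = deque([cause])
    while queue:
        node = queue.popleft()
        for effect in CAUSES.get(node, []):
            if effect not in seen:
                seen.add(effect)
                queue.append(effect)
    return seen


def explaining_causes(observed: set[str]) -> list[str]:
    """Return the causes whose downstream effects cover all observed symptoms."""
    return sorted(c for c in CAUSES if observed <= downstream_effects(c))


# Two symptoms fire at once; only a database-side cause can explain both.
symptoms = {"checkout_latency_high", "search_latency_high"}
print(explaining_causes(symptoms))
```

In this toy graph, CPU throttling on the checkout service is ruled out because it cannot account for the search latency, even though it would explain the checkout alert on its own. Narrowing the candidates this way comes from the encoded cause-and-effect structure, not from pattern matching on past incidents.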