You built the agent. It works in testing. Then it hits production and starts giving wrong answers, timing out or burning through your token budget, and you have no idea why. This is when developers discover that print statements and log files weren’t designed for this. LLM applications fail in ways that traditional tooling can’t see. A hallucination doesn’t throw an exception. A slow retrieval step doesn’t show up in CPU metrics. A prompt that worked yesterday silently degrades today. The fix is observability, and the standard for doing it right is OpenTelemetry (OTel). What OpenTelemetry Actually Is OTel isn’t a monitoring product; it’s a vendor-neutral specification under the CNCF that defines a standard way to collect…
Author: drweb
Most enterprise AI projects start with retrieval. You connect Jira, Confluence, SharePoint, and Slack. Maybe a few internal databases nobody has touched in five years. You tune embeddings, optimize chunking, wire up a vector database, and convince yourself you’ve built an AI-powered knowledge system. Then the model server crashes. And suddenly, you discover the uncomfortable truth about enterprise AI: The hard part was never retrieval. It was infrastructure. For the past two years, the industry has treated LLM deployment like a feature integration problem. In reality, it is rapidly becoming a platform engineering problem, one involving GPU orchestration, scaling economics, governance boundaries, workload…
Over the past year, AI has fundamentally changed how software is written. Infrastructure code is no exception. Tasks that once required deep familiarity with tools, syntax, and workflows can now be handled through natural language. Engineers are no longer starting from a blank file. In many cases, reviewing and modifying code generated for them has become the norm. At a high level, this looks like progress, and in many ways, it is. Teams can move faster, the barrier to entry is lower, and experimentation is easier. But there is a growing gap that many organizations are only beginning to recognize: AI…
Several years ago, the observability community reached what felt like a consensus around the three pillars: logs, metrics and traces. Instrument everything, ship it all to a central platform and you will finally understand what your system is doing. It’s a tidy framework. Yet it turns out to be incomplete in ways that only become obvious once you’re actually trying to debug a production incident with it. This article isn’t an argument against logs, metrics and traces; you need all three. However, there’s a growing set of failure modes in modern distributed systems that the three-pillar model struggles to explain, and understanding why is the first step toward building observability that…
Installing and configuring cloud-init on Ubuntu 26.04 makes it much easier to automate server setup, especially when working with cloud VPS systems, virtual machines, and home lab deployments. If you’ve ever installed a fresh Ubuntu server and spent the next 20 minutes creating users, installing packages, configuring SSH keys, and adjusting networking manually, cloud-init can save you a lot of repetitive work. Cloud-init is the initialization service used by most cloud platforms and virtual machine images. It reads configuration data during the first boot of a Linux system and automatically applies settings like hostname changes, user creation, SSH configuration, package…
Enterprise AI’s Hidden Problem Is Organizational Amnesia: Why the smartest model in the room still fails when the company cannot remember what it knows. AI does not fail only when it forgets. It also fails when the company remembers badly. The first time an enterprise AI system gives a wrong answer, people usually blame the model. They say the model hallucinated. They say the prompt was weak. They say the vendor overpromised. They say the technology is not mature enough yet. Sometimes that…
Harness today added two tools to track and analyze the impact code generated by artificial intelligence (AI) coding tools is having on application development. An AI Development Lifecycle (DLC) tool installs agent software on a developer’s machine to track adoption, sessions, and the code created across every coding agent, while a Cloud & AI Cost Management tool has been extended to track spending on AI infrastructure. Trevor Stuart, general manager and senior vice president for Harness, said the goal is to provide DevOps teams with greater insights into how much real productivity is being gained following the adoption of AI coding tools. For…
AI is making it easier for SaaS companies to build integrations. Give a coding agent decent API docs, some context about the systems involved, and a clear prompt, and it can get surprisingly far. It can write the logic faster than most teams could a year or two ago, saving time, reducing repetitive work, and making it easier to respond to customer requests. Faster creation can make the overall problem look simpler than it really is. The code may come together more quickly, but the operational burden underneath it doesn’t go away. Building Isn’t the Hard Part Building the integration itself isn’t the…
The Emerging Popularity of Python in Game Tech Python is among the most popular programming languages right now, and for good reason. It has user-friendly syntax, can manage complex prompts with less code, and can manage huge datasets in real time. The practicalities of Python extend to virtually all sectors, especially modern gaming technology. Gaming ecosystems are expanding at a tremendous rate globally, from online casinos in the Philippines to blockbuster US video game developers. Python is flexible, scalable, and can be seamlessly integrated into cloud-based systems. One of its greatest perks is the ability to manage huge pools of…
The amount of time it takes engineering teams to get back to work after an incident is getting worse every year, even though spending on observability tools has reached record highs. This should worry everyone in this field. The Logz.io Observability Pulse tracked the share of teams with a mean time to resolution (MTTR) of more than one hour: 47% in 2021, 64% in 2022, 74% in 2023 and 82% in 2024. That is four years in a row of going backwards. Over the same period, the average number of tools used by a team rose to eight or nine different platforms. The answer from the industry has always been the…
