The free-wheeling days of cloud spend are over. Scarred by sky-high monthly bills and embarrassing outages, many companies now closely monitor their infrastructure costs to prevent waste, avoid performance issues and squeeze out every possible bit of savings. When it comes to new capacity, finance leaders are increasingly asking: Do we really need to scale up, or did we just become less efficient?

Metrics, logs, traces and profiles hold the information required to run more cost-effective and resilient infrastructure. Historically, however, telemetry data was analyzed in silos over fixed periods. Generating collective insights required a lot of manual work from specialists, and until recently, profiles remained an outlier, preventing organizations from achieving 'last mile' optimization.

But with the continued uptick in adoption of eBPF, along with other innovations, companies are now able to combine profiles and traces to track CPU and memory usage in real time across their cloud footprint. Instead of static insights, teams gain a dynamic window into application behavior that lets them troubleshoot issues more quickly. It is the difference between knowing a span took 400 ms and understanding the specific line of code that caused it.
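One concrete mechanism for joining profiles to traces, sketched here in Go rather than as any specific product's implementation, is the standard library's pprof labels: samples collected while a labeled block runs carry key/value pairs, so a profiler that understands labels can join CPU samples back to the span that was executing. The trace and span IDs below are hypothetical placeholders.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// spanLabelDemo attaches hypothetical trace and span IDs as pprof labels.
// Any CPU profile samples collected inside the labeled block carry these
// labels, so they can later be joined back to the trace span. It returns
// the span_id label as seen from inside the block.
func spanLabelDemo() string {
	var spanID string
	labels := pprof.Labels("trace_id", "4bf92f3577b34da6", "span_id", "00f067aa0ba902b7")
	pprof.Do(context.Background(), labels, func(ctx context.Context) {
		// The real work of the span would run here; profile samples
		// taken now are tagged with both labels.
		spanID, _ = pprof.Label(ctx, "span_id")
	})
	return spanID
}

func main() {
	fmt.Println("profile samples are tagged with span:", spanLabelDemo())
}
```

In practice, tracing instrumentation applies these labels automatically around each span, which is what makes the "which line of code made this span slow" question answerable.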

AI is poised to extend these capabilities and help companies proactively address problems before they impact the business. These capabilities have never been more important. Without the ability to investigate Kubernetes clusters quickly and painlessly at the code level, as well as get broad overviews of activity across their entire environment, enterprises risk losing control of their infrastructure and becoming the next cloud cautionary tale.

The Power of Profiles   

Integrating profiles and traces gives organizations the ability to pinpoint the specific code causing downstream performance issues.  

For example, many of us have been in the scenario where, after a long delay in matching with a rideshare, we try out a competing service. With the ability to deeply mine profile data, the provider can pinpoint the code running during that specific period and identify the exact reason for the delay. Another example: a retailer preparing for Cyber Monday can simulate the expected web traffic to ensure its systems will handle the spike.

When combined with AI, these capabilities deliver even greater productivity gains, including:

  • Alerts: AI can detect anomalies in this data and explain the likely cause of an issue, helping internal experts resolve problems more quickly.
  • Intelligence: Infrastructure teams are inundated with signals from metrics, logs, traces, profiles, load tests and more. AI can quickly explain what they are looking at and recommend next steps.
  • Automation: While a long-term goal for many businesses, AI will eventually be able to detect issues before they occur and flag them to internal experts, who can quickly implement a fix.

It is this type of continual optimization that helps satisfy both customers and finance managers.  

Profiles: The Secret to Preparedness and Agility  

When it comes to the cloud, even the most mature companies struggle to optimize performance during the most critical times. For example, Netflix drew intense backlash when streaming issues impacted the marquee fight between Jake Paul and Mike Tyson.  

When problems occur, it is often a scramble to figure out why, and the information teams need to really fix them is trapped in profiles and traces. When this data is unified and combined with AI, infrastructure teams have a digital agent at their disposal that guides them to specific issues and provides actionable advice on next steps.

But most businesses want to avoid this situation entirely and use the same data to prepare in advance, eliminating issues before they arise. For example, before shipping a change through a CI/CD pipeline, internal teams can analyze profiles and traces under simulated conditions to determine how the change will affect key metrics, helping them prepare for any disruption.

Meanwhile, leadership teams increasingly want high-level metrics from across their whole cloud footprint. By merging tracing and profiling data, companies can identify these larger areas of potential inefficiency, such as whether a specific API request consumes too many resources.

For example, at Grafana, integrating these data streams helped us cut API calls to object storage by a factor of three.
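Attributing resource usage to a specific API endpoint can again be sketched with Go's pprof labels; this is a minimal illustration of the general technique, not the implementation behind the figure above. The `labelHandler` middleware and the `/checkout` route are assumptions for the example.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime/pprof"
)

// labelHandler wraps an http.Handler so that CPU profile samples taken
// while serving a request carry the route name as a pprof label. A
// profiler that understands labels can then break down CPU time per
// endpoint, answering "which API request uses too many resources?".
func labelHandler(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		pprof.Do(r.Context(), pprof.Labels("endpoint", route), func(ctx context.Context) {
			next.ServeHTTP(w, r.WithContext(ctx))
		})
	})
}

// observedEndpoint exercises the middleware with a test request and
// returns the label visible inside the wrapped handler.
func observedEndpoint(route string) string {
	var seen string
	h := labelHandler(route, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen, _ = pprof.Label(r.Context(), "endpoint")
	}))
	req := httptest.NewRequest("GET", route, nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println("samples inside handler are labeled:", observedEndpoint("/checkout"))
}
```

Once samples are labeled this way, the same profile data that debugs a single slow request can be rolled up into the fleet-wide, per-endpoint views leadership teams ask for.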

eBPF: The New Application Video Camera  

In the past, infrastructure teams would manually log on to servers, grab profile data and save it to their desktops before analyzing the assets and sharing the results within the organization. This was time-consuming, and made harder by the many competing demands on cloud specialists today.

At the same time, companies want to spare developers from having to build these capabilities into the code itself. That is why many are turning to eBPF. The technology functions like a video camera within each application, continually reporting to users what is happening inside.

With AI, this data is analyzed automatically, and key information is quickly surfaced to experts. With eBPF, organizations can profile applications more ubiquitously without burdening developer teams with collecting and analyzing the data.

In today’s economic climate, cloud resources are tracked more closely than ever. As budgets grow tighter, IT teams are increasingly told to find savings in existing usage to fund new workloads. Unifying profiling data with the rest of the telemetry information can lead to quick and significant savings.  

This helps the organization offset the investment in new observability tools, which in turn helps them continually optimize resources. It is a rare win-win for both IT and finance professionals.  

