What’s observability? And why should I care if I’ve got AI?

By Sammy Zoghlami, SVP EMEA at Nutanix.

Technology was supposed to make everything easier. Faster decisions, smarter systems, leaner operations. But for many leaders, the reality looks very different: rising costs, swelling cyber risk, and a tangle of legacy and multi-cloud complexity that’s only getting harder to manage. AI is now promising to help solve this — but in truth, AI has its work cut out.

Even the most advanced projects are feeling the strain. Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027, derailed by spiralling costs, poor ROI, or inadequate risk controls. It's not that AI cannot deliver value; it's that the foundations beneath it were not built for what is coming.

The truth is, much of the frustration around AI isn't really about AI at all. It's about the underlying systems, and an organisation's ability to actually see what is happening across them. For all the hype around intelligent agents and autonomous workflows, success still depends on something far less glamorous: the performance, visibility, and resilience of the platforms those models run on. When infrastructure cannot keep pace, costs rise, performance dips, and complexity multiplies. That is where the cracks begin to show.

Three recurring pain points keep surfacing.

AI workloads are starving for data

Modern models devour data. Training runs and retrieval-augmented generation (RAG) pipelines depend on high-throughput access to files, objects, and vector data spread across hybrid environments. Yet traditional storage systems were not built for this pace. I/O bottlenecks throttle performance, GPUs sit idle waiting for data, and every wasted second becomes wasted compute. For many enterprises, storage has become a tax on AI progress.
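
One simple, tool-agnostic way to see whether this is happening is to instrument the training loop itself and split each step into time spent waiting for data versus time spent computing. A minimal Python sketch, where data_loader and train_step are hypothetical stand-ins for your own pipeline:

```python
import time

def measure_data_starvation(data_loader, train_step):
    """Split each iteration into 'waiting for data' vs 'computing'.

    data_loader: any iterable yielding batches (hypothetical stand-in
                 for a real file/object/vector-store pipeline).
    train_step:  a callable that consumes one batch (the GPU work).
    """
    wait_time, compute_time = 0.0, 0.0
    it = iter(data_loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time blocked on storage / I/O
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # time the accelerator is busy
        t2 = time.perf_counter()
        wait_time += t1 - t0
        compute_time += t2 - t1
    total = max(wait_time + compute_time, 1e-9)
    print(f"waiting on data: {100 * wait_time / total:.1f}% of wall time")
    print(f"computing:       {100 * compute_time / total:.1f}% of wall time")
```

If the waiting share dominates, buying faster GPUs will not help; the data path is the constraint.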

Observability isn’t observability if it stops at infrastructure metrics

Most organisations can see CPU load, disk IOPS, and network latency, but that's only half the picture. True observability means correlating those infrastructure signals with model behaviour: accuracy, drift, throughput, error rates, even cost per inference. When data, compute, and models are scattered across clouds, this end-to-end view disappears. Teams end up reacting to symptoms, such as slower queries and rising bills, without understanding root causes. Observability, in simple terms, should answer one question: what's happening, why, and what should we do about it?
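
In practice, that correlation means putting infrastructure and model metrics on the same timeline and asking whether they move together. A deliberately simplified Python sketch, assuming both signals are already exported as time-aligned samples (the metric names and values are illustrative, not from any particular monitoring product):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical, time-aligned samples: one infrastructure signal and
# one model-level signal per monitoring interval.
gpu_wait_pct   = [12, 15, 14, 41, 38, 44, 13]          # % of step time GPUs sit idle
p95_latency_ms = [210, 220, 215, 480, 460, 500, 225]   # inference p95 latency

# A strong correlation suggests the latency regression is infrastructure-
# driven (data starvation), not a model or query-mix problem.
r = correlation(gpu_wait_pct, p95_latency_ms)
print(f"correlation(gpu_wait, p95_latency) = {r:.2f}")

# The same join works for cost: divide spend by requests served per
# interval to get cost per inference, then correlate it with the signals
# above to see which layer is actually driving the bill.
```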

Fragility is a hidden threat

AI workloads are notoriously unforgiving. A single node failure, power fluctuation, or regional outage can derail production workflows, interrupt inference pipelines, and erode business confidence. Many enterprises still rely on manual failovers or untested disaster-recovery plans. True resilience means cross-region redundancy, automated recovery, and continuous validation, because in AI, uptime equals trust.

These three issues are driving the cancellations, overruns, and disappointments Gartner warns about. And they’re why performance and resilience, the two least glamorous parts of the stack, have suddenly become the most strategic.

So, what does good look like?

It starts with recognising that performance is a by-product of smarter architecture, not necessarily better hardware. The best AI systems are fed by storage that can keep up: fast, scalable, and intelligent enough to balance cost with speed. When training workloads or RAG pipelines hit the accelerator, data needs to move just as quickly. AI-optimised, tiered storage architectures do exactly that, feeding GPUs at line speed while still providing the durability and auditability needed for compliance.
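
A common building block behind that claim is simple pipelining: stage the next batches from the slower, cheaper tier while the accelerator is still busy with the current one. A rough, generic Python sketch of the idea, not a description of any specific product:

```python
import queue
import threading

def prefetched(batches, depth=4):
    """Wrap any batch iterator so fetching overlaps with compute.

    batches: iterable pulling data from tiered storage (hypothetical).
    depth:   how many batches to stage ahead in fast local memory.
    """
    q = queue.Queue(maxsize=depth)
    _done = object()

    def producer():
        for b in batches:
            q.put(b)          # blocks if the consumer falls behind
        q.put(_done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _done:
            break
        yield item

# Usage (illustrative): as long as the prefetch depth covers the fetch
# latency of the slower tier, the training loop never waits on cold storage.
# for batch in prefetched(read_from_object_store()):   # hypothetical reader
#     train_step(batch)
```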

But performance on its own isn't enough. Without visibility, even the best-engineered systems are flying blind. Observability needs to go beyond dashboards and alerts. It has to connect the dots between infrastructure health and model behaviour: the ability to see how a GPU spike in one region affects inference latency elsewhere, or how network congestion is degrading model accuracy. When you can see everything, from data to compute to model performance, you can tune it, fix it, and ultimately trust it.

And then there's resilience, the quiet hero of AI at scale. The more distributed AI becomes, the more fragile it gets. Models are trained across regions, data flows across clouds, and a single outage can ripple through everything. The answer isn't redundancy for redundancy's sake; it is resilience by design: dynamic workload migration, self-healing infrastructure, and continuous validation of failover processes. That's what turns AI from an experimental tool into an operational asset.
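
Stripped to its essentials, that design reduces to a few unglamorous mechanisms: health-checked endpoints in more than one region, automatic rerouting when one fails, and a scheduled drill that proves the failover path actually works. A simplified Python sketch, where the region names and the check_health probe are placeholders rather than a real API:

```python
REGIONS = ["eu-west", "eu-central", "us-east"]   # illustrative region names

def check_health(region: str) -> bool:
    """Placeholder for a real probe (error rate, latency, queue depth)."""
    return True

def pick_region(preferred: str) -> str:
    """Route to the preferred region; automatically fail over to any
    other region that passes its health check."""
    for region in [preferred] + [r for r in REGIONS if r != preferred]:
        if check_health(region):
            return region
    raise RuntimeError("no healthy region available")

def failover_drill(primary: str) -> str:
    """Continuous validation: pretend the primary is down and confirm
    requests would still land somewhere healthy. Run this on a schedule,
    not just after an incident."""
    fallbacks = [r for r in REGIONS if r != primary and check_health(r)]
    if not fallbacks:
        raise RuntimeError(f"failover from {primary} would fail")
    return fallbacks[0]
```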

In truth, performance, observability, and resilience are inseparable. Without one, the other falters. Together, they define how prepared an organisation really is for AI at scale, not just for the pilot phase, but for the everyday reality of running critical workloads in production.

AI success depends on seeing your infrastructure as part of the intelligence. Leaders should start by asking tough questions about visibility and control. Can your teams trace data flows across every cloud? Do you know, in real time, how infrastructure decisions affect model performance? And are your recovery processes tested for when (not if) something fails?

The answers shape competitive advantage. The organisations that treat infrastructure as a living system, continuously tuned, instrumented, and stress-tested, will be the ones that turn AI into a reliable engine of productivity. Because the future of AI isn't just about creating smarter models; it's about creating value. And without smarter systems underneath those models, there really is no point.
