The conversation around AI has finally matured. After three years of chasing model performance, enterprises are asking a harder question: are people actually using this?
The answer is forcing a reckoning.
According to McKinsey's 2025 Global Survey on AI, companies that measure AI's real value, not just its technical performance, are pulling ahead. Yet the gap between deployment and adoption remains stark. Gartner reports that 40% of enterprise applications will feature AI agents by 2026, but adoption rates tell a different story. Most organizations remain trapped between the promise of AI and the reality of low adoption.
The culprit isn't the models. It's measurement.
This essay explores why 2026 is the year enterprises stop measuring whether their AI works and start measuring whether their users trust it.
The Observability Trap
Enterprise leaders believed the problem was predictable. Models were not accurate enough. Systems were not fast enough. Infrastructure was not robust enough.
So they invested heavily in observability.
Observability tools flood dashboards with data: API latency, token throughput, error rates, model uptime. Datadog, New Relic, Dynatrace, and purpose-built LLM monitoring platforms like LangSmith and Arize now dominate the AI infrastructure stack. By these measures, most enterprise AI systems are performing well.
But here is what observability cannot tell you. It cannot answer whether a user actually accomplished their goal. It cannot reveal whether an employee will use the AI tomorrow. It cannot detect the moment a user gives up. It cannot capture the difference between a fast wrong answer and a slow right one.
Observability measures the system. It does not measure adoption.
Bessemer Venture Partners' 2025 State of AI report identifies a critical bottleneck: enterprise evaluation. The report notes that most companies still lack frameworks to assess whether their AI performs in their real-world contexts, not just in public benchmarks. Companies chase leaderboard scores when they should be asking, "Does this work for our users in our domain?"
This distinction matters. Wharton's 2025 AI Adoption Report shows that 82% of enterprises now use AI weekly, and 46% use it daily. Yet the 72% of enterprises that formally measure ROI focus on behavioral outcomes, not infrastructure health. Three out of four see positive returns. Those who rely on technical metrics alone do not.
The pattern is clear. Observability is table stakes. It is not a differentiator.
Why the MIT Study Reveals a Learning Gap
In late 2025, MIT researchers released a finding that should reshape how enterprises think about AI. Despite $30–40 billion poured into GenAI, 95% of organizations report zero measurable ROI. Yet 80% have deployed AI somewhere.
The paradox reveals the core issue: adoption has stalled not because models lack capability, but because systems lack feedback mechanisms.
The research identifies a "learning gap" as the defining barrier. Many enterprise AI systems today do not retain feedback, adapt to context, or improve over time. An employee uses an internal copilot, finds it unhelpful, and abandons it. The system does not learn. The company does not know why adoption failed. The AI remains unchanged.
This is not an AI problem. This is an architecture problem.
Compare this to how web analytics transformed digital products. Before Google Analytics, website teams measured server logs and infrastructure metrics. They had no visibility into what visitors actually did, where they dropped off, or why they did not return. Analytics changed that. It moved measurement from infrastructure to behavior.
Enterprise AI is at that inflection point now.
The organizations crossing into the 5% of AI initiatives delivering real value share a pattern: they treat AI like a product, not an experiment. They close feedback loops. They measure user behavior, not just system performance. They iterate based on what real users tell them, explicitly and implicitly.
The Three Measurement Gaps Blocking Adoption
Most enterprises today measure AI success across three dimensions. All three are insufficient alone.
Gap 1: Technical Performance vs. User Success
A support chatbot responds in 200 milliseconds with grammatically perfect language. Technical metrics show success.
The customer reads the response, finds it irrelevant to their question, and requests a human agent. User success: failed.
Observability tools will not flag this failure. They will show that latency was acceptable, that the model inference completed, that the API response was clean. A behavioral analytics platform would show that task completion failed, that the user expressed frustration through follow-up clarifications, and that they abandoned the conversation.
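To make the contrast concrete, here is a minimal sketch of how a behavioral layer might label a single conversation from its event log. It is a crude heuristic under stated assumptions, not any vendor's API; the event kinds, field names, and frustration phrases are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical event record for one conversation. Field names are
# illustrative, not taken from any particular analytics product.
@dataclass
class ConversationEvent:
    kind: str   # e.g. "user_message", "ai_response", "handoff_request"
    text: str = ""

FRUSTRATION_PHRASES = ("not what i asked", "doesn't help", "talk to a human")

def label_outcome(events: list[ConversationEvent]) -> str:
    """Label a conversation with a behavioral outcome, not a system status."""
    user_turns = 0
    for event in events:
        if event.kind == "handoff_request":
            return "escalated"          # user asked for a human agent
        if event.kind == "user_message":
            user_turns += 1
            text = event.text.lower()
            if any(phrase in text for phrase in FRUSTRATION_PHRASES):
                return "frustrated"     # explicit frustration signal
    if user_turns >= 4:
        return "struggled"              # repeated clarifications, task likely incomplete
    return "resolved"
```

A production system would add sentiment scoring and abandonment timing, but even this rough labeling surfaces a failure that a 200-millisecond, error-free API trace hides completely.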
The difference is material. A fast-but-wrong system is worse than a slow-but-helpful one. Yet technical metrics reward speed, not fidelity to the user's intent.
Gap 2: Adoption Rates vs. Task Completion Rates
An enterprise deploys an internal AI assistant. Ninety percent of employees log in. Adoption looks strong.
Sixty days later, only 12% remain active users. Adoption metrics said success. Retention data reveals failure.
Observability platforms measure logins. Behavioral analytics platforms measure whether employees return because the tool delivers value. The second metric predicts long-term ROI. The first does not.
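The gap is easy to express once usage is tracked per employee over time. Below is a small sketch, assuming a hypothetical usage log with invented names and counts, that contrasts the activation number (ever logged in) with the retention number (still active after 60 days):

```python
from datetime import date, timedelta

# Hypothetical usage log: employee id -> days on which they used the assistant.
usage_log: dict[str, list[date]] = {
    "emp-001": [date(2025, 1, 6), date(2025, 3, 10)],
    "emp-002": [date(2025, 1, 7)],   # logged in during launch week, never returned
}

launch = date(2025, 1, 6)
headcount = 1_000

activated = len(usage_log)  # ever logged in: the "90% adopted" headline number
retained = sum(
    1
    for days in usage_log.values()
    if any(day >= launch + timedelta(days=60) for day in days)
)

print(f"Activation rate: {activated / headcount:.0%}")
print(f"60-day retention: {retained / headcount:.0%}")  # the number that predicts ROI
```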
According to Wharton's 2025 study, enterprises measuring adoption by behavioral retention and task completion rates see measurable returns. Those measuring by activation and login rates see stalled pilots.
Gap 3: Feature Output vs. Intent Resolution
An AI system generates 10,000 support ticket suggestions per month. Output metrics suggest productivity.
Investigation reveals that only 3,000 of those suggestions are actually used by support teams. The rest are ignored or overridden. Output is high. Adoption is low.
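The corrective metric is trivial to compute once each suggestion is tracked to its fate. A sketch with the counts from the example above:

```python
suggestions_generated = 10_000  # what an output dashboard celebrates
suggestions_applied = 3_000     # accepted or acted on by the support team

acceptance_rate = suggestions_applied / suggestions_generated
print(f"Suggestion acceptance rate: {acceptance_rate:.0%}")  # 30%: the adoption signal
```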
This distinction matters because it separates what the AI produces from what the organization values. Bessemer's 2025 State of AI report emphasizes that companies building private evaluation frameworks (frameworks grounded in business outcomes, not benchmarks) are seeing 10X better deployment success than those chasing model performance alone.
The real metric is whether the AI's output changes how work gets done.
What Intent Resolution Looks Like in Practice
In 2026, the question guiding AI measurement is not "Did the model respond?" but "Did the user accomplish what they came for?"
This shift reframes the entire analytics layer.
Intent resolution asks: of all the conversations a user has with an AI system, what percentage leads to a successful outcome? By successful, we mean the user either completed their task, got the information they needed, or resolved their problem, all without escalating to human support or abandoning the system.
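Expressed over the kind of behavioral outcome labels sketched earlier, the metric is a simple ratio. A minimal example, assuming conversations have already been labeled (the labels and counts here are invented):

```python
from collections import Counter

# Hypothetical outcome labels produced by a behavioral analytics layer.
outcomes = ["resolved", "resolved", "escalated", "abandoned", "resolved", "struggled"]

counts = Counter(outcomes)
total = len(outcomes)

print(f"Intent resolution rate: {counts['resolved'] / total:.0%}")  # 50% in this toy sample
print(f"Escalation rate: {counts['escalated'] / total:.0%}")
print(f"Abandonment rate: {counts['abandoned'] / total:.0%}")
```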
This metric cuts across product categories. A customer-facing chatbot, an internal copilot, a content generation tool: all can measure intent resolution. And all can use this metric to iterate.
One manufacturing company illustrates the point. Their support team assumed that 70% of tickets were technical system failures. Investigation revealed something different: 70% of "error" tickets were users providing unclear instructions. The AI was working correctly. The users simply did not know how to ask effectively.
By instrumenting conversations with behavioral analytics, the company identified this pattern and deployed real-time prompt suggestions. Support tickets dropped 32%. Intent resolution improved.
This is what happens when measurement moves from infrastructure to user behavior.
The Feedback Loop as Competitive Advantage
Bessemer's State of AI report identifies private evaluation frameworks as the next critical infrastructure layer. But evaluation is not a one-time practice. It is a feedback loop.
The companies outpacing competitors by 10X in how quickly they improve their AI capabilities share a pattern: they close the loop between user feedback and product iteration. They measure continuously. They act rapidly.
Here's what that loop looks like, with a sketch of the underlying rollup after the steps:
Week 1: Deploy an AI assistant to a department. Instrument it to capture user behavior, frustration signals, and conversation outcomes.
Week 2: Analyze the data. Identify where users drop off. Notice that questions about compliance are being answered poorly. Sentiment analysis reveals frustration clusters around specific topics.
Week 3: Adjust the system. Improve training data for compliance queries. Deploy targeted guidance based on the patterns discovered.
Week 4: Measure again. Task completion for compliance questions rises from 31% to 58%. User sentiment improves. Adoption accelerates.
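The rollup behind that cadence can be as simple as grouping conversations by week and topic, computing completion rates, and flagging the topics that need work. A minimal sketch, with invented counts echoing the compliance numbers above:

```python
# Hypothetical weekly rollup: (week, topic) -> (tasks completed, total conversations)
weekly_rollup = {
    ("week-2", "compliance"): (31, 100),
    ("week-2", "benefits"): (74, 100),
    ("week-4", "compliance"): (58, 100),
    ("week-4", "benefits"): (77, 100),
}

for (week, topic), (completed, total) in sorted(weekly_rollup.items()):
    rate = completed / total
    flag = "  <- prioritize next iteration" if rate < 0.5 else ""
    print(f"{week}  {topic:<12}{rate:.0%}{flag}")
```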
Organizations operating on this cadence gain visibility that observability-only teams lack. They do not wait nine months for a pilot to finish. They iterate in weeks. They treat user feedback as product intelligence.
The MIT research highlighted a clear divide: enterprises that spend nine or more months converting pilots to production see lower success rates than mid-market firms that reach scale in 90 days. The difference is not model quality. It is feedback velocity.
Trust Cannot Be Measured by Uptime
A related insight from 2025 research reshapes how we think about AI reliability. According to the BIT (Behavioural Insights Team) 2025 study, organizations that provided real-time feedback on AI performance saw significantly higher trust in the AI system, even when the AI was not perfect.
In other words, users trust systems they can see improving in response to their input.
Gartner's prediction that 40% of enterprise applications will feature AI agents by 2026 comes with a caveat: agents that lack user trust will not scale. And trust is built through transparency, not uptime.
An agent that runs 24/7 but never shows it learned from user feedback erodes trust. An agent that occasionally pauses, incorporates user input visibly, and demonstrates improvement builds trust.
This has profound implications for measurement. In 2026, CISOs and innovation leaders should measure not just whether AI systems are reliable in technical terms, but whether they demonstrate reliability in behavioral terms. That means showing users that their feedback changes outcomes.
The measurement shift is from "Is the system up?" to "Is the system demonstrating that it improves based on how we use it?"
Why Measurement Matters for Agentic AI
Gartner's forecast that 40% of enterprise applications will feature AI agents by 2026 matters because agents are not assistants. Assistants advise. Agents act.
An agent that makes a bad decision costs more than an assistant that gives bad advice. Users need to trust agents with higher stakes.
And as Bessemer's report notes, that trust is built through clear evaluation frameworks, continuous feedback loops, and visible improvement. Without these, agentic AI adoption will hit a ceiling.
According to the MIT research, the companies scaling agentic AI successfully today are those treating agents like products: instrumenting them with behavioral analytics, measuring user outcomes, and iterating rapidly based on feedback.
The measurement gap widens with agent autonomy. The higher the stakes, the more measurement matters.
The Measurement Shift Is Underway
The change is already happening in leading organizations. In Wharton's 2025 study, companies formally measuring AI ROI cite behavioral analytics and user outcome metrics as their primary tools. In Bessemer's analysis, companies building private evaluation frameworks integrated into production systems are 10X faster at moving from pilots to scale.
McKinsey's 2025 Global Survey on AI confirms that organizations measuring real value, not just adoption vanity metrics, are pulling ahead.
But the majority of enterprises remain stuck in observability-only measurement. They have dashboards full of infrastructure data and no visibility into whether their users are succeeding.
In 2026, this gap will become untenable. As AI moves from pilot to production, as autonomous agents proliferate, and as adoption becomes mission-critical, measurement will separate the winners from the laggards.
The shift is from "Is the AI working?" to "Are our users succeeding?" That simple reframing changes which metrics matter, which tools organizations need, and ultimately which companies will turn AI from promise into practice.
Conclusion
Observability was the foundation. It solved a real problem: ensuring AI infrastructure could be trusted to run reliably at enterprise scale.
But observability was never enough. A system can be fast, accurate by benchmark standards, and fully instrumented, and still fail to drive adoption.
The 95% of organizations seeing zero ROI from AI are not failing because their models are bad or their infrastructure is weak. They are failing because they cannot see whether their users are winning.
In 2026, the competitive advantage will go to those who measure user intent, close feedback loops, and iterate continuously. Those who show users that their feedback matters. Those who treat AI like a product.
The 5% of organizations successfully delivering AI value today are already here. The rest have until 2026 to catch up.
Seeing this in practice with Nebuly
Many enterprises now use Nebuly as their user analytics layer for GenAI. Nebuly analyzes every interaction with internal copilots and customer-facing assistants so teams can see where intent is resolved, where users get stuck, and where frustration silently kills adoption. Product, AI, and CX leaders use these signals to prioritize improvements based on real behavior instead of assumptions.
If your AI roadmap still runs on latency charts and token counts, adding a user analytics layer is often the fastest way to understand what is actually happening in your copilots and agents.
If you want to see how this looks on top of your own GenAI or agentic AI products, you can explore Nebuly in action. Book a demo with us to understand what your users are really trying to do and where adoption gets blocked.