Nebuly is the user analytics platform for GenAI products. We help companies see how people actually use their AI — what works, what fails, and how to improve it.
January 13, 2026

Measuring what matters in 2026: user intent as the new GenAI KPI

Enterprises have shipped copilots and agents, but 95% still see little ROI from GenAI. This article explains why user intent and downstream behavior signals are becoming the core KPIs in 2026, and how leaders can use user analytics and feedback loops to turn AI from a deployed feature into a product people actually trust and rely on.

TL;DR

→ 95% of organizations see zero ROI from AI investments despite deploying widely. The MIT-led State of AI in Business 2025 Report shows this is not a model problem, it's a measurement problem.

→ Technical observability (latency, uptime, token efficiency) cannot tell you if users succeed, if they trust the system, or if they'll return. This gap between infrastructure metrics and adoption metrics defines the 2026 divide.

→ The winners in 2026 will be those who measure user intent, conversation success, and behavioral signals—and close feedback loops fast enough to improve continuously.

The conversation around AI has finally matured. After three years of chasing model performance, enterprises are asking a harder question: are people actually using this?
The answer is forcing a reckoning.

According to McKinsey's 2025 Global Survey on AI, companies that measure AI's real value, not just its technical performance, are pulling ahead. Yet the gap between deployment and adoption remains stark. Gartner reports that 40% of enterprise applications will feature AI agents by 2026, but adoption rates tell a different story. Most organizations remain trapped between the promise of AI and the reality of low adoption.​
The culprit isn't the models. It's measurement.
This essay explores why 2026 is the year enterprises stop measuring whether their AI works and start measuring whether their users trust it.

The Observability Trap

Enterprise leaders believed the problem was a predictable, technical one. Models were not accurate enough. Systems were not fast enough. Infrastructure was not robust enough.
So they invested heavily in observability.

Observability tools flood dashboards with data: API latency, token throughput, error rates, model uptime. Datadog, New Relic, Dynatrace, and purpose-built LLM monitoring platforms like LangSmith and Arize now dominate the AI infrastructure stack. By these measures, most enterprise AI systems are performing well.​
But here is what observability cannot tell you. It cannot answer whether a user actually accomplished their goal. It cannot reveal whether an employee will use the AI tomorrow. It cannot detect the moment a user gives up. It cannot capture the difference between a fast wrong answer and a slow right one.

Observability measures the system. It does not measure adoption.
Bessemer Venture Partners' 2025 State of AI report identifies a critical bottleneck: enterprise evaluation. The report notes that most companies still lack frameworks to assess whether their AI performs in their real-world contexts, not just in public benchmarks. Companies chase leaderboard scores when they should be asking, "Does this work for our users in our domain?"​

This distinction matters. Wharton's 2025 AI Adoption Report shows that 82% of enterprises now use AI weekly, and 46% use it daily. Yet the 72% that formally measure ROI are those who focus on behavioral outcomes, not infrastructure health. Three out of four see positive returns. Those who rely on technical metrics alone do not.
The pattern is clear. Observability is table stakes. It is not a differentiator.

Why the MIT Study Reveals a Learning Gap

In late 2025, MIT researchers released a finding that should reshape how enterprises think about AI. Despite $30–40 billion poured into GenAI, 95% of organizations report zero measurable ROI. Yet 80% have deployed AI somewhere.​
The paradox reveals the core issue: adoption has stalled not because models lack capability, but because systems lack feedback mechanisms.

The research identifies a "learning gap" as the defining barrier. Many enterprise AI systems today do not retain feedback, adapt to context, or improve over time. An employee uses an internal copilot, finds it unhelpful, and abandons it. The system does not learn. The company does not know why adoption failed. The AI remains unchanged.​
This is not an AI problem. This is an architecture problem.

Compare this to how web analytics transformed digital products. Before Google Analytics, website teams measured server logs and infrastructure metrics. They had no visibility into what visitors actually did, where they dropped off, or why they did not return. Analytics changed that. It moved measurement from infrastructure to behavior.​
Enterprise AI is at that inflection point now.

The organizations crossing into the 5% of AI initiatives delivering real value share a pattern: they treat AI like a product, not an experiment. They close feedback loops. They measure user behavior, not just system performance. They iterate based on what real users tell them, explicitly and implicitly.​

The Three Measurement Gaps Blocking Adoption

Most enterprises today measure AI success across three dimensions. All three are insufficient alone.

Gap 1: Technical Performance vs. User Success

A support chatbot responds in 200 milliseconds with grammatically perfect language. Technical metrics show success.
The customer reads the response, finds it irrelevant to their question, and requests a human agent. User success: failed.

Observability tools will not flag this failure. They will show that latency was acceptable, that the model inference completed, that the API response was clean. A behavioral analytics platform would show that task completion failed, that the user expressed frustration through follow-up clarifications, and that they abandoned the conversation.​
The difference is material. A fast-but-wrong system is worse than a slow-but-helpful one. Yet technical metrics incentivize speed over fidelity to user intent.
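As a concrete illustration, here is a minimal sketch in Python (field names are hypothetical, not a specific product's schema) of how the same conversation can pass every technical check and still fail behaviorally:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    latency_ms: float          # technical: time to respond
    error: bool                # technical: did the API call fail?
    escalated_to_human: bool   # behavioral: user asked for an agent
    abandoned: bool            # behavioral: user left mid-task
    clarification_turns: int   # behavioral: "that's not what I meant" follow-ups

def technically_ok(c: Conversation, latency_budget_ms: float = 500) -> bool:
    """What observability dashboards report: fast and error-free."""
    return not c.error and c.latency_ms <= latency_budget_ms

def user_succeeded(c: Conversation, max_clarifications: int = 2) -> bool:
    """What behavioral analytics reports: did the user actually get through?"""
    return (not c.escalated_to_human
            and not c.abandoned
            and c.clarification_turns <= max_clarifications)

# The support-chatbot example above: 200 ms, clean response,
# but the customer gives up and asks for a human.
chat = Conversation(latency_ms=200, error=False,
                    escalated_to_human=True, abandoned=False,
                    clarification_turns=3)

print(technically_ok(chat))   # True  -> the dashboard is green
print(user_succeeded(chat))   # False -> the user failed
```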

Gap 2: Adoption Rates vs. Task Completion Rates

An enterprise deploys an internal AI assistant. Ninety percent of employees log in. Adoption looks strong.
Sixty days later, only 12% remain active users. Adoption metrics said success. Retention data reveals failure.

Observability platforms measure logins. Behavioral analytics platforms measure whether employees return because the tool delivers value. The second metric predicts long-term ROI. The first does not.​
According to Wharton's 2025 study, enterprises measuring adoption by behavioral retention and task completion rates see measurable returns. Those measuring by activation and login rates see stalled pilots.​
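To make the distinction concrete, here is an illustrative Python sketch (the log format is assumed, not a specific product API) comparing activation with 60-day retention:

```python
from datetime import date, timedelta

launch = date(2025, 11, 1)
cutoff = launch + timedelta(days=60)

# Hypothetical usage log: employee id -> dates the assistant was used.
# In practice this would come from your own event store.
usage_log = {
    "emp_001": [date(2025, 11, 3)],
    "emp_002": [date(2025, 11, 4)],
    "emp_003": [date(2025, 11, 5), date(2026, 1, 10)],
    "emp_004": [date(2025, 11, 6)],
}
total_employees = 4

# "Adoption": logged in at least once after launch.
activation_rate = sum(bool(days) for days in usage_log.values()) / total_employees

# "Retention": still active 60+ days after launch.
retention_rate = sum(
    any(day >= cutoff for day in days) for days in usage_log.values()
) / total_employees

print(f"activation:       {activation_rate:.0%}")   # 100%
print(f"60-day retention: {retention_rate:.0%}")    # 25%
```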

Gap 3: Feature Output vs. Intent Resolution

An AI system generates 10,000 support ticket suggestions per month. Output metrics suggest productivity.
Investigation reveals that only 3,000 of those suggestions are actually used by support teams. The rest are ignored or overridden. Output is high. Adoption is low.

This distinction matters because it separates what the AI produces from what the organization values. Bessemer's 2025 State of AI report emphasizes that companies building private evaluation frameworks (frameworks grounded in business outcomes, not benchmarks) are seeing 10X better deployment success than those chasing model performance alone.
The real metric is whether the AI's output changes how work gets done.
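The arithmetic is simple, as a short Python sketch using the hypothetical numbers above (10,000 suggestions generated, 3,000 actually used) makes clear:

```python
# Output metric: how much the AI produces.
suggestions_generated = 10_000

# Adoption metric: how much of that output actually changes work.
suggestions_used = 3_000   # accepted by support agents; the rest ignored or overridden

acceptance_rate = suggestions_used / suggestions_generated
print(f"output per month: {suggestions_generated}")
print(f"acceptance rate:  {acceptance_rate:.0%}")   # 30% -> low adoption despite high output
```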

What Intent Resolution Looks Like in Practice

In 2026, the question guiding AI measurement is not "Did the model respond?" but "Did the user accomplish what they came for?"
This shift reframes the entire analytics layer.

Intent resolution asks: of all the conversations a user has with an AI system, what percentage leads to a successful outcome? By successful, we mean the user either completed their task, got the information they needed, or resolved their problem, all without escalating to human support or abandoning the system.
This metric cuts across product categories. A customer-facing chatbot, an internal copilot, a content generation tool: all can measure intent resolution. And all can use this metric to iterate.
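A minimal sketch of the metric itself, assuming a simplified conversation record with hypothetical outcome flags:

```python
def intent_resolution(conversations: list[dict]) -> float:
    """Share of conversations where the user's goal was met without
    escalating to a human or abandoning the session.

    Each record is assumed to carry two boolean flags:
      - 'goal_met': task completed, info found, or problem resolved
      - 'escalated_or_abandoned': user bailed out of the AI flow
    """
    if not conversations:
        return 0.0
    resolved = sum(
        c["goal_met"] and not c["escalated_or_abandoned"] for c in conversations
    )
    return resolved / len(conversations)

sample = [
    {"goal_met": True,  "escalated_or_abandoned": False},
    {"goal_met": False, "escalated_or_abandoned": True},
    {"goal_met": True,  "escalated_or_abandoned": False},
]
print(f"intent resolution: {intent_resolution(sample):.0%}")  # 67%
```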

One manufacturing company illustrates the point. Their support team assumed that 70% of tickets were technical system failures. Investigation revealed something different: 70% of "error" tickets were users providing unclear instructions. The AI was working correctly. The users simply did not know how to ask effectively.​
By instrumenting conversations with behavioral analytics, the company identified this pattern and deployed real-time prompt suggestions. Support tickets dropped 32%. Intent resolution improved.​

This is what happens when measurement moves from infrastructure to user behavior.

The Feedback Loop as Competitive Advantage

Bessemer's State of AI report identifies private evaluation frameworks as the next critical infrastructure layer. But evaluation is not a one-time practice. It is a feedback loop.​
The companies outpacing competitors by 10X in how quickly they improve their AI capabilities share a pattern: they close the loop between user feedback and product iteration. They measure continuously. They act rapidly.​

Here's what that loop looks like:
Week 1: Deploy an AI assistant to a department. Instrument it to capture user behavior, frustration signals, and conversation outcomes.
Week 2: Analyze the data. Identify where users drop off. Notice that questions about compliance are being answered poorly. Sentiment analysis reveals frustration clusters around specific topics.
Week 3: Adjust the system. Improve training data for compliance queries. Deploy targeted guidance based on the patterns discovered.
Week 4: Measure again. Task completion for compliance questions rises from 31% to 58%. User sentiment improves. Adoption accelerates.
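A rough sketch of the week-4 measurement step, assuming each conversation is tagged with a topic and a completion flag (field names and numbers are illustrative):

```python
from collections import defaultdict

def completion_by_topic(conversations: list[dict]) -> dict[str, float]:
    """Task completion rate per topic for one measurement window."""
    totals, completed = defaultdict(int), defaultdict(int)
    for c in conversations:
        totals[c["topic"]] += 1
        completed[c["topic"]] += c["task_completed"]
    return {topic: completed[topic] / totals[topic] for topic in totals}

# Compare two windows, e.g. before and after improving compliance answers.
week_1 = [{"topic": "compliance", "task_completed": i < 31} for i in range(100)]
week_4 = [{"topic": "compliance", "task_completed": i < 58} for i in range(100)]

before = completion_by_topic(week_1)["compliance"]
after = completion_by_topic(week_4)["compliance"]
print(f"compliance completion: {before:.0%} -> {after:.0%}")  # 31% -> 58%
```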

Organizations operating on this cadence gain visibility that observability-only teams lack. They do not wait nine months for a pilot to finish. They iterate in weeks. They treat user feedback as product intelligence.​
The MIT research highlighted a clear divide: enterprises spending 9+ months moving from pilot to production see lower success rates than mid-market firms that reach scale in 90 days. The difference is not model quality. It is feedback velocity.

Trust Cannot Be Measured by Uptime

A related insight from 2025 research reshapes how we think about AI reliability. According to the BIT (Behavioural Insights Team) 2025 study, organizations that provided real-time feedback on AI performance saw significantly higher trust in the AI system, even when the AI was not perfect.​
In other words, users trust systems they can see improving in response to their input.

Gartner's prediction that 40% of enterprise applications will feature AI agents by 2026 comes with a caveat: agents that lack user trust will not scale. And trust is built through transparency, not uptime.​
An agent that runs 24/7 but never shows it learned from user feedback erodes trust. An agent that occasionally pauses, incorporates user input visibly, and demonstrates improvement builds trust.

This has profound implications for measurement. In 2026, CISOs and innovation leaders should measure not just whether AI systems are reliable in technical terms, but whether they demonstrate reliability in behavioral terms. That means showing users that their feedback changes outcomes.​
The measurement shift is from "Is the system up?" to "Is the system demonstrating that it improves based on how we use it?"

Why Measurement Matters for Agentic AI

Gartner's forecast that 40% of enterprise applications will feature AI agents by 2026 matters because agents are not assistants. Assistants advise. Agents act.​
An agent that makes a bad decision costs more than an assistant that gives bad advice. Users need to trust agents with higher stakes.

And as Bessemer's report notes, that trust is built through clear evaluation frameworks, continuous feedback loops, and visible improvement. Without these, agentic AI adoption will hit a ceiling.​
According to the MIT research, the companies scaling agentic AI successfully today are those treating agents like products: instrumenting them with behavioral analytics, measuring user outcomes, and iterating rapidly based on feedback.​
The measurement gap widens with agent autonomy. The higher the stakes, the more measurement matters.

Four measurement layers, what each captures, why it matters, and an example metric for each:

Technical Performance. What it measures: system health, latency, uptime, token efficiency. Why it matters: table stakes; you cannot ignore infrastructure. Example metric: API response time under 200 ms.

Conversation Quality. What it measures: intent resolution, first-turn accuracy, conversation completion rate. Why it matters: shows whether the AI's answers are relevant to what users actually need. Example metric: 65% of conversations result in task completion.

User Experience. What it measures: frustration signals, sentiment, drop-off points, session depth. Why it matters: reveals why users abandon the AI or return to it. Example metric: frustration detection declining 15% week-over-week.

Business Outcomes. What it measures: adoption rate, return-user percentage, correlation with business metrics. Why it matters: proves ROI and justifies investment. Example metric: 40% of users active after 60 days; support tickets down 25%.
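If you wanted to track one metric from each layer side by side, a minimal sketch (made-up numbers, hypothetical names) might look like this:

```python
from dataclasses import dataclass

@dataclass
class AiMetricsSnapshot:
    # Technical performance (table stakes)
    p95_latency_ms: float
    # Conversation quality
    intent_resolution: float      # share of conversations resolving user intent
    # User experience
    frustration_rate: float       # share of sessions with frustration signals
    # Business outcomes
    sixty_day_retention: float    # share of users still active after 60 days

    def report(self) -> str:
        return (
            f"p95 latency:        {self.p95_latency_ms:.0f} ms\n"
            f"intent resolution:  {self.intent_resolution:.0%}\n"
            f"frustration rate:   {self.frustration_rate:.0%}\n"
            f"60-day retention:   {self.sixty_day_retention:.0%}"
        )

# Illustrative numbers only.
print(AiMetricsSnapshot(180, 0.65, 0.12, 0.40).report())
```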

The Measurement Shift Is Underway

The change is already happening in leading organizations. In Wharton's 2025 study, companies formally measuring AI ROI cite behavioral analytics and user outcome metrics as their primary tools. In Bessemer's analysis, companies building private evaluation frameworks integrated into production systems are 10X faster at moving from pilots to scale.​
McKinsey's 2025 Global Survey on AI confirms that organizations measuring real value, not just adoption vanity metrics, are pulling ahead.​

But the majority of enterprises remain stuck in observability-only measurement. They have dashboards full of infrastructure data and no visibility into whether their users are succeeding.​
In 2026, this gap will become untenable. As AI moves from pilot to production, as autonomous agents proliferate, as adoption becomes mission-critical, measurement will determine winners and laggards.

The shift is from "Is the AI working?" to "Are our users succeeding?" That simple reframing changes which metrics matter, which tools organizations need, and ultimately which companies will turn AI from promise into practice.​

Conclusion

Observability was the foundation. It solved a real problem: ensuring AI infrastructure could be trusted to run reliably at enterprise scale.​
But observability was never enough. A system can be fast, accurate by benchmark standards, and fully instrumented, and still fail to drive adoption.​

The 95% of organizations seeing zero ROI from AI are not failing because their models are bad or their infrastructure is weak. They are failing because they cannot see whether their users are winning.​
In 2026, the competitive advantage will go to those who measure user intent, close feedback loops, and iterate continuously. Those who show users that their feedback matters. Those who treat AI like a product.​
The 5% of organizations successfully delivering AI value today are already here. The rest have 2026 to catch up.

Seeing this in practice with Nebuly

Many enterprises now use Nebuly as their user analytics layer for GenAI. Nebuly analyzes every interaction with internal copilots and customer-facing assistants so teams can see where intent is resolved, where users get stuck, and where frustration silently kills adoption. Product, AI, and CX leaders use these signals to prioritize improvements based on real behavior instead of assumptions.

If your AI roadmap still runs on latency charts and token counts, adding a user analytics layer is often the fastest way to understand what is actually happening in your copilots and agents.

If you want to see how this looks on top of your own GenAI or agentic AI products, you can explore Nebuly in action. Book a demo with us to understand what your users are really trying to do and where adoption gets blocked.

Frequently Asked Questions (FAQs)

What is the difference between observability and user analytics?

Observability measures whether your system is running properly (latency, uptime, error rates, token efficiency). It tells you if the AI works from a technical perspective. User analytics measure whether your users are succeeding (task completion, intent resolution, conversation outcomes, frustration signals). They tell you if the AI delivers value to people. Both are necessary. Most enterprises have observability. Few have adoption analytics. That gap is costing them ROI.

Why do 95% of AI organizations see zero ROI?

According to MIT's State of AI in Business 2025 Report, the core reason is a learning gap. Most enterprise AI systems do not retain feedback, adapt to context, or improve over time. Users find the AI unhelpful, abandon it, and move on. The company never learns why. The system never improves. Without feedback loops connecting user behavior back to product iteration, adoption stalls. The 5% succeeding are those closing this loop rapidly.

What does intent resolution actually measure?

Intent resolution measures the percentage of conversations where the user accomplishes their actual goal. Did they get the information they needed? Did they complete their task? Did they solve their problem? Or did they abandon the conversation, escalate to a human, or try again with a rephrased prompt? Intent resolution cuts through vanity metrics (like "conversations completed" without knowing if they succeeded) and gets at the core question: is the AI useful?

How quickly can AI teams improve if they have the right feedback loop?

Organizations with structured feedback loops can outpace competitors by 10X in how quickly they improve their AI capabilities. Instead of waiting nine months to see whether a pilot works, leading teams iterate weekly. They deploy. They measure user behavior. They adjust. They measure again. This velocity is now a competitive moat. Companies trapped in slow pilot cycles lose market position to faster learners.

What role do private evaluation frameworks play in 2026 AI success?

Bessemer's 2025 State of AI report identifies private evaluation frameworks as critical infrastructure. Public benchmarks like MMLU do not reflect real-world workflows, compliance constraints, or domain-specific success. Leading enterprises are building evaluation frameworks tailored to their use cases and data. They measure accuracy, hallucination risk, compliance, and customer satisfaction together. This grounded approach enables smarter procurement decisions, faster deployment, and measurable ROI. Companies building this infrastructure first will scale AI faster than those chasing benchmark scores.

How does user trust relate to AI measurement?

Users trust systems they can see improving. When organizations provide real-time feedback showing how AI performance changed in response to user input, trust increases significantly. This challenges the assumption that trust requires perfect reliability. Instead, visible improvement builds trust faster than perfect consistency. This has major implications for how enterprises should design measurement systems: show users that their feedback matters and that the AI is getting better because of it.
