Nebuly is the user analytics platform for GenAI products. We help companies see how people actually use their AI — what works, what fails, and how to improve it.
December 11, 2025

Why "Tokens Per Second" is a vanity metric for Agentic AI

A fast agent that fails is worse than a slow one that succeeds. As enterprises move from chatbots to autonomous agents, system metrics like latency are becoming irrelevant. Learn the new framework for measuring Agentic AI success.

TL;DR

The Speed Trap: Many teams measure AI agent performance by speed (latency, tokens per second), but a fast wrong answer or a misused tool is worse than a slightly slower correct solution.

Autonomy Over Speed: New agentic AI metrics like Autonomy Rate and Goal Completion Rate focus on outcomes, not just output speed. These metrics emphasize success and self-sufficiency over vanity throughput.

Experience and Value: Outcome-based metrics better capture user experience. An agent that completes tasks satisfies users and drives ROI, whereas raw speed often correlates poorly with success.

Learning from Failures: Observability tools (like Nebuly) surface where agents stall, loop, or hallucinate. By measuring failures, teams can shift from system metrics to user outcome metrics that truly matter.

Speed ≠ Success in Autonomous AI Agents

Enterprise adoption of large language model (LLM) agents has exploded. Over 70% of organizations were using generative AI by mid-2024, and Deloitte forecasts that a quarter of companies using GenAI will have deployed intelligent agents in 2025.

With this surge, it’s tempting to measure success by technical performance: how fast the model responds or how many tokens it generates per second. After all, lower latency feels like progress. Vendors often tout figures like "X tokens per second" as a bragging right. For instance, OpenAI’s optimized GPT-4o model outputs ~110 tokens per second, about 3× faster than the original GPT-4 Turbo. Some state-of-the-art models even exceed 150 tokens/sec in benchmarks. Latency keeps improving too, with 2025 tests showing new models delivering the first token in just 0.345 seconds.

But here’s the rub: speed is a vanity metric for agentic AI.

A lightning-fast agent that spits out an answer in half a second is useless if that answer is wrong or the agent fails the task. Conversely, a slightly slower agent that actually solves the user’s problem is far more valuable. In other words, volume and speed mean nothing without value. As one analysis put it, the dotcom era chased eyeballs (volume) and crashed; AI chasing tokens is headed for the same cliff because "volume metrics mean nothing without value metrics."

Consider a simple example: Two AI agents are given a complex task that involves using tools (e.g., searching information or running code) to get an answer.

  • Agent A responds in 2 seconds. It picked the wrong tool and returned incorrect information.
  • Agent B takes 5 seconds. It uses the correct reasoning chain and produces a correct, actionable answer.

From a pure latency perspective, Agent A "wins." Yet from any meaningful perspective, Agent B is the winner. Speed without accuracy is a false victory.

Real-world data bears this out. Early experiments with autonomous AI frameworks showed that speed alone didn't make them successful. Carnegie Mellon researchers tested leading AI agents on real work tasks, and the best agent completed only 24% of them. Shaving 300 ms off response time does nothing to fix a scenario where the agent’s plan collapses roughly three times out of four.

Yet many organizations are stuck in the "Old Way" of measurement, focusing on system-centric metrics. For agentic AI, these metrics only tell part of the story. An agent might achieve high throughput by blazing through steps quickly, but if it picks the wrong actions, the speed means nothing. Teams often see high usage numbers and assume success, while missing that many of those sessions ended in user frustration.

The business impact of this metrics mistake is significant. Enterprises are pouring money into generative AI—global spending is forecast to jump 76% from 2024 to 2025—but many aren't seeing returns. Gartner analysts found that up to 85% of AI projects fail to scale beyond pilot phases. McKinsey likewise reported that 80% of organizations using AI have seen no material impact on their bottom line.

The lesson is clear: Speed is not success. We need to shift our definition of "good performance" from system metrics to outcome metrics.

Old Way vs New Way: Measuring AI Agent Success

It’s time to replace vanity metrics with actionable metrics that reflect what we truly care about: successful autonomous outcomes.

Traditional Metric (System-Focused) → New Metric (Outcome-Focused)

  • Tokens per second (throughput): how fast the model generates text. Useful for infrastructure, useless for user success. → Goal Completion Rate: the percentage of user goals the AI successfully completes from start to finish.
  • Latency: how quickly the agent produces the first token. → Autonomy Rate: the percentage of sessions the agent completes without human intervention.
  • Interaction count: the number of turns exchanged, often misread as "engagement." → User Success Rate: the share of sessions that end in success from the user's perspective (e.g., a correct answer).
  • Static model accuracy: performance on benchmarks (MMLU, HumanEval). → Loop & Error Rate: how often the agent gets stuck, loops on the same tool, or hallucinates an action.

In the "Old Way," a team might proudly report "Our AI responds in under 1 second." In the "New Way," what matters is "Our AI successfully completed the request 90% of the time without human help."

Enterprises are beginning to recognize that shift. Gartner analysts predict that by 2026, success will be measured more by user outcome and trust metrics than by technical benchmarks. Forward-thinking teams, such as the one behind Microsoft’s Copilot, are already adopting metrics like resolution rate and fallback rate.

Solution 1: Autonomy Rate – Measure True Self-Sufficiency

One of the most telling metrics for an agentic AI system is Autonomy Rate. This metric answers the question: "How often can the AI agent handle tasks fully on its own?"

A high autonomy rate means the agent isn't just fast—it’s capable. It handles the inputs, does the reasoning, and produces the correct outcome. A low autonomy rate means that most sessions eventually need a person to intervene.

Autonomy Rate is crucial for business value. If your AI sales assistant only handles 30% of inquiries without escalation, that limits its ROI. If it can handle 80%, that is a massive efficiency gain. However, autonomy isn't just a cost metric—it’s a user experience metric. When users know the AI can handle their request end-to-end, it builds trust.

To improve autonomy rate, teams need to identify why an agent needed help. Did it encounter a query it didn’t understand? Did it choose an incorrect tool? These are actionable insights.
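To make this concrete, here is a minimal Python sketch of how a team might compute autonomy rate and break down why humans had to step in. The `Session` fields and reason labels are illustrative assumptions, not Nebuly's schema or any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class Session:
    session_id: str
    goal_reached: bool                       # did the agent reach the user's goal?
    human_intervened: bool                   # did a person step in at any point?
    abandoned: bool                          # did the user give up mid-session?
    intervention_reason: str | None = None   # e.g. "wrong_tool", "unclear_query" (hypothetical labels)

def autonomy_rate(sessions: list[Session]) -> float:
    """Share of sessions the agent completed fully on its own."""
    if not sessions:
        return 0.0
    autonomous = [
        s for s in sessions
        if s.goal_reached and not s.human_intervened and not s.abandoned
    ]
    return len(autonomous) / len(sessions)

def intervention_breakdown(sessions: list[Session]) -> dict[str, int]:
    """Count why humans had to step in -- the actionable part."""
    reasons: dict[str, int] = {}
    for s in sessions:
        if s.human_intervened and s.intervention_reason:
            reasons[s.intervention_reason] = reasons.get(s.intervention_reason, 0) + 1
    return reasons
```

The breakdown is the part worth reviewing weekly: a falling autonomy rate matters less than knowing which reason is growing.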

Solution 2: Goal Completion Rate – Focus on Outcomes, Not Outputs

If autonomy rate measures the AI’s independence, Goal Completion Rate (GCR) measures the AI’s effectiveness. This metric asks: "Did the agent ultimately achieve the user’s goal?"

GCR is essentially the success rate of the agent. If an AI agent in a marketing tool is asked to draft a social media post, goal completion means the post was generated to the user's satisfaction. High GCR means users consistently get what they came for.

Tracking goal completion forces teams to define what success means for their specific use case. It might be easy to count a response as "completed" if the AI responded at all, but that’s not enough. Was the answer correct? By explicitly measuring goal completion, you align the team on the true north: user outcome.
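One way to make that definition explicit is to give every use case its own success check, so that "completed" always means something concrete. Below is a hedged sketch, assuming made-up use cases and session fields rather than any real product schema.

```python
from typing import Callable

# Hypothetical per-use-case success checks: each use case states what
# "goal completed" means for it. The keys and fields are illustrative only.
SUCCESS_CHECKS: dict[str, Callable[[dict], bool]] = {
    "draft_social_post": lambda s: s["draft_accepted_by_user"],
    "answer_product_question": lambda s: s["answer_marked_correct"],
    "book_meeting": lambda s: s["calendar_event_created"],
}

def goal_completion_rate(sessions: list[dict]) -> float:
    """GCR = completed goals / attempted goals, using use-case-specific checks."""
    if not sessions:
        return 0.0
    completed = 0
    for s in sessions:
        check = SUCCESS_CHECKS.get(s["use_case"])
        if check is not None and check(s):
            completed += 1
    return completed / len(sessions)
```

The point is not the code but the discipline: if a use case cannot state its success check, its goal completion rate cannot be trusted.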

There is evidence that organizations adopting GCR see more meaningful progress. Reports note that teams tracking goal completion iteratively raise their chatbot's success rate on key intents, rather than just making it talk more fluently.

Solution 3: Analyzing Failures – Learning from Stalls, Loops, and Hallucinations

No AI agent is perfect. What separates successful deployments from failed ones is how well the team understands the failure modes.

This is where conversational AI analytics tools like Nebuly come into play. By instrumenting the agent’s decision process, we can uncover exactly where it faltered:

  • Stalling: The agent repeats the same prompt or action without moving forward. This often indicates it is stuck or confused.
  • Tool misuse (Looping): The agent might call the wrong external tool, get an error, and then try the same thing over and over. Analytics can flag these loops by showing a high number of repeated tool calls.
  • Hallucinated tool results: In some cases, agents claim to have done something they didn’t. One dramatic example noted in an AI forum showed an agent hallucinating an entire API response to make it look like a step succeeded.

By surfacing these failure patterns, teams can prioritize fixes. Is the agent often looping when using Tool X? Maybe the integration is faulty. Does the agent hallucinate facts in a certain domain? Maybe it needs better Retrieval Augmented Generation (RAG) context.
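As a rough illustration of what "flagging loops" can mean in practice, the sketch below scans an agent trace for runs of identical tool calls. The trace format and threshold are assumptions, not a specific framework's API; an analytics platform does the equivalent automatically across all sessions.

```python
from itertools import groupby

def detect_tool_loops(trace: list[dict], threshold: int = 3) -> list[dict]:
    """
    Flag runs where the agent called the same tool with the same arguments
    `threshold` or more times in a row -- a common looping signature.
    `trace` is assumed to be a list of steps like {"tool": "search", "args": {...}}.
    """
    flagged = []
    keyed = [(step["tool"], repr(step.get("args"))) for step in trace]
    for key, group in groupby(keyed):
        run_length = len(list(group))
        if run_length >= threshold:
            flagged.append({"tool": key[0], "args": key[1], "repeats": run_length})
    return flagged
```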

Nebuly’s analytics platform helps teams map out where and why agents fail. It can show that 15% of user sessions end without goal completion, and then drill down to reveal that 40% of those failures involved the agent using an outdated knowledge source. These insights are golden for improving the system.

You cannot manage what you cannot measure. If you are ready to see what your users are actually telling you, we are here to help. Book a demo to see your data clearly.

Can an AI agent be slow and still successful?

Yes. If an agent takes a bit longer but completes the task correctly, users often don’t mind the wait. Enterprise users value accuracy and reliability over sheer speed. A slightly slower agent that always finds the right answer will beat a fast agent that is hit-or-miss.

Why is “tokens per second” so popular if it’s misleading?

It is an easy number to measure. Hardware vendors use it to benchmark infrastructure speed. It is not useless—it matters for costs and load balancing—but it becomes a vanity metric when organizations treat it as the primary definition of product success.

How do I calculate Autonomy Rate?

Autonomy Rate is typically calculated as: (# of sessions completed by AI alone) / (# of total sessions). "Completed by AI alone" means no human stepped in, the user didn’t abandon the session, and no human overrides were required.
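For example, if an agent handled 1,000 sessions and 820 of them reached the user's goal with no human handoff, no mid-session abandonment, and no override, the autonomy rate is 820 / 1,000 = 82%.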

What is an example of implicit user feedback?

Users rarely click "thumbs down," but their behavior reveals failure. If a user rephrases their question three times, that is implicit negative feedback. If they ask "Are you sure?", that signals a lack of trust. Nebuly captures these behavioral signals automatically.
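As a toy example of how such signals can be pulled out of a transcript, the heuristic below counts rephrasings and distrust phrases in one session. The similarity threshold and phrase list are arbitrary illustrations, not how Nebuly detects these signals.

```python
from difflib import SequenceMatcher

# Phrases that often signal distrust or frustration (illustrative list).
DISTRUST_PHRASES = ("are you sure", "that's wrong", "that is not what i asked")

def implicit_negative_signals(user_messages: list[str]) -> dict[str, int]:
    """Rough heuristic: count rephrasings and distrust phrases in one session."""
    rephrasings = 0
    for prev, curr in zip(user_messages, user_messages[1:]):
        # High textual similarity between consecutive user turns suggests
        # the user is re-asking the same thing.
        if SequenceMatcher(None, prev.lower(), curr.lower()).ratio() > 0.7:
            rephrasings += 1
    distrust = sum(
        any(p in m.lower() for p in DISTRUST_PHRASES) for m in user_messages
    )
    return {"rephrasings": rephrasings, "distrust_phrases": distrust}
```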

Is measuring "user satisfaction" too subjective?

It can be harder to measure than latency, but it is not less important. Latency is easy to measure with precision, but a precise measure of the wrong thing (speed) is less valuable than an approximate measure of the right thing (did the user get value?).
