Nebuly is the user analytics platform for GenAI products. We help companies see how people actually use their AI — what works, what fails, and how to improve it.
December 4, 2025

From “internet teams” to AI everywhere: Why companies will use multiple models (and how to measure them)

How enterprises move from central AI teams to AI in every department, using multiple models and unified user analytics to measure adoption, trust and ROI.

TL;DR

→ In the 1990s, companies had dedicated “internet teams.” Today, many have centralized AI teams. But just as the internet became part of every department, AI is on track to be embedded everywhere.

→ Enterprises won’t standardize on one AI model. A typical large company already uses ~11 different generative AI models across functions, a number expected to grow ~50% by 2027. Each department will choose the AI that fits its needs.

→ Broad AI adoption is driving this multi-model approach. 78% of organizations now use AI in at least one business unit (up from 55% a year earlier). Generative AI costs have plummeted 280× since 2022, making it feasible for teams to spin up custom models and apps.

→ The challenge: fragmentation. When every team runs its own AI assistant or model, companies lose the unified view of what’s working. Critical questions like “Which AI tools deliver value?” or “Where do users get frustrated?” become hard to answer without cross-model analytics.

→ Past technology waves taught us the solution: unified analytics. Just as Google Analytics provides a single source of truth for website behavior, enterprises now need a “conversation intelligence” layer to track AI usage and user outcomes across all models.

When the internet first entered enterprises, many companies set up dedicated “internet teams” to build and run the corporate website. This specialized group picked the tech stack, managed content, and decided what went online. Over time, that model disappeared as internet technology became integral to every department’s work: marketing ran online campaigns, sales managed web leads, HR recruited via digital platforms. The responsibility for “internet” dissolved into the fabric of each function. Generative AI is following the same trajectory.

Today, in most organizations AI is still concentrated in a central team (often under IT or engineering). This central AI group chooses models, manages integrations, and tracks technical performance. Centralizing made sense in AI’s early stages for a few reasons: early deployments required niche machine learning skills, companies needed tight risk controls around sensitive data, and initial projects were experimental pilots easier to manage in one place. In other words, just as early web projects lived with the “internet team,” early AI efforts have been siloed with an “AI team.”

But we’re already seeing the shift. In the near future, AI will be embedded in every department’s tools and processes. Signs of this are emerging today: marketing teams spinning up AI-driven content generators, HR teams deploying onboarding copilots for new hires, operations managers using conversational assistants for routine processes, and sales reps relying on AI for research and lead qualification. Instead of a distant AI expert team handling all use cases, each function will integrate generative AI into its daily workflows. The central AI team won’t disappear – it will provide governance, best practices, and technical oversight – but ownership of AI’s day-to-day use and improvement will shift to the teams that use AI every day. This mirrors how IT departments now set web infrastructure standards while marketing, sales, and others run their own web initiatives. The chatbot or AI assistant may “sit” with an AI team today, but its future is shared across the organization.

One company, many AIs (the multi-model enterprise)

As AI decentralizes, enterprises will not standardize on a single model or vendor for all use cases. Just as individual consumers mix and match AI tools for different tasks (one might use OpenAI’s GPT-4 for creative writing, Anthropic’s Claude for brainstorming legal language, Google’s Bard/Gemini for multimodal research, etc.), companies will do the same across their departments. Each team will gravitate to the model or AI service that best fits its domain and requirements.

For example, an HR department might adopt Google’s Gemini assistant for its strong multimedia and translation capabilities in training modules, the legal team might prefer Anthropic’s Claude for contract analysis due to its focus on compliance and careful reasoning, and the marketing group might lean on OpenAI’s GPT-4 for creative campaign content generation. These choices depend on each team’s specific use cases, data sensitivity, and success metrics. The result is an organization running a mosaic of AI assistants – each tuned to its function’s needs.

Several forces are driving this multi-model landscape:

Broad AI adoption across functions

Companies are embracing AI at an unprecedented rate, making it natural for multiple solutions to sprout. According to McKinsey’s latest global survey, 78% of organizations report using AI in at least one business function, up from just 55% a year earlier. Generative AI in particular saw usage nearly double within 10 months to 71% of businesses. As marketing, customer support, finance, and other teams all build AI into their workflows, they demand tools tailored to their context. Each team, being closest to its domain challenges, wants more control and customization over “its” AI.

Falling costs and more model options

The cost of running advanced AI models has plummeted. Since late 2022, the price of using a GPT-3.5 level model dropped from about $20 to just $0.07 per million tokens – a 280× reduction in under two years. At the same time, open-source and “open-weight” models have rapidly improved, in some cases approaching the performance of closed APIs. In fact, the performance gap between the top AI models is shrinking year over year. Smaller models (with far fewer parameters) can now achieve tasks that two years ago required giant 500B+ parameter models. This means teams no longer need a big budget or big infrastructure to leverage AI; many can fine-tune modest models on their own data or build custom AI apps cheaply. According to an AWS generative AI adoption study, 58% of businesses plan to customize existing models and 55% plan to build AI applications using fine-tuned models trained on proprietary data. In practice, one department might spin up a fine-tuned open-source model for its needs while another buys access to a premium API – whatever gets the best results. Once teams start experimenting independently, the only way to know which AI solutions work best is to measure how people actually use them. Adoption and outcome data become the benchmarks for comparing value across these different approaches.
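
To make that cost shift concrete, here is a rough back-of-the-envelope comparison in Python using the per-million-token prices cited above; the workload figures (query volume and tokens per query) are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope cost comparison using the article's per-million-token
# prices ($20 in late 2022 vs. $0.07 today for a GPT-3.5-class model) and a
# hypothetical departmental workload. Workload numbers are assumptions only.

QUERIES_PER_MONTH = 50_000   # assumed: one department's assistant
TOKENS_PER_QUERY = 1_500     # assumed: prompt + response, on average

def monthly_cost(price_per_million_tokens: float) -> float:
    total_tokens = QUERIES_PER_MONTH * TOKENS_PER_QUERY
    return total_tokens / 1_000_000 * price_per_million_tokens

cost_2022 = monthly_cost(20.00)  # ≈ $1,500 / month
cost_now = monthly_cost(0.07)    # ≈ $5.25 / month
print(f"2022: ${cost_2022:,.2f}  now: ${cost_now:,.2f}  reduction: {cost_2022 / cost_now:.0f}x")
```

At that price point, a single team's assistant moves from a budget line item to pocket change, which is exactly why departments can now experiment independently.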

Organizational changes supporting distributed AI

Companies are formally reorganizing to support AI everywhere. The same AWS study found that 60% of companies have already appointed a Chief AI Officer (CAIO) and another 26% plan to do so within a year. These roles, along with hiring of generative AI specialists in various departments, create a structure for more distributed AI ownership. Cross-functional teams (product managers, engineers, and domain experts together) are emerging to build AI features within departments. These groups iterate quickly and tailor AI to on-the-ground needs. For example, a product team in customer support might develop a custom AI agent to help resolve tickets faster, while a marketing content team fine-tunes an AI writer on brand-specific data. This distributed experimentation is healthy for innovation, but without coordination it can also lead to duplicated efforts and siloed learnings. Each team might be solving similar problems in parallel without sharing insights. A common measurement system – focusing on user adoption and outcomes – is needed so teams can learn from each other’s successes and failures. In essence, if each team is “doing its own AI thing,” a unified way to track and compare their AI’s performance is critical for the organization to maximize value and avoid reinventing wheels.

The net result of these trends will be AI embedded in every corner of the organization, with the central AI group acting more as a platform steward or center of excellence than a sole owner. This diffusion of AI brings huge opportunities for innovation and productivity – but also a new challenge: fragmentation of data and insight. When every department runs its own assistant or model, how can the company as a whole answer basic questions like “Which AI tools are actually working for us?” or “Where are users getting frustrated or dropping off?” or “What’s the ROI of our various generative AI initiatives?” Without a unified view, those answers remain elusive.

The fragmentation problem: Why we need cross-model analytics

When AI ownership spreads out, it becomes harder to see the full picture. In the early centralized model, the AI team could track usage and outcomes for the one or two chatbots or models it deployed. In a decentralized scenario, you might have five, ten, or more different AI systems in play across the business. Each likely comes with its own usage logs or vendor dashboard, but none of them alone shows how AI is performing enterprise-wide. Important patterns only emerge when you look across all AI tools in aggregate. This is why tracking adoption and user behavior across departments is the only way to truly spot where AI is thriving and where it’s underused (or causing issues).

Imagine trying to improve customer experience when your support team’s virtual agent uses one model, your website’s sales chatbot uses another, and internal teams use a mix of other AI assistants. You might see that one assistant is answering thousands of queries a week while another in a different department barely gets any use – a sign that one team’s tool provides more value or is easier to use than another’s. Or you might find that in a certain workflow, users consistently abandon an AI assistant halfway through the task, indicating friction or dissatisfaction. These insights are only visible if you have cross-model, cross-department analytics tying together all usage data.

Right now, most organizations lack this unified view. Each AI vendor might provide some metrics in its own silo, but there’s no “common dashboard” for all AI interactions happening in the business. The risk is flying blind: improving or troubleshooting one AI tool in isolation while missing bigger wins or failures elsewhere. A unified analytics layer for AI usage would serve as the shared source of truth that keeps everyone aligned. It would let a company answer questions like:

- What are users trying to do with our AI assistants? (e.g. Are employees mostly asking the HR bot policy questions? Are customers using the support bot for troubleshooting, account info, or something else?)

- Where do they succeed or fail? (Which requests are fulfilled well by the AI versus where does it often fall short, causing users to give up or escalate to a human?)

- Which features or use cases create real value? (Maybe the marketing content generator saves dozens of hours on blog drafts – indicated by high adoption – whereas a legal document summarizer is rarely used, indicating low value or poor UX.)

- Where do drop-offs and frustrations occur? (Do users tend to rephrase questions repeatedly, suggesting the AI didn’t get it right? Are there common points in a conversation or process where they get frustrated and abandon the AI?)

With multiple AI systems in play, having this bird’s-eye view is critical. In one real example, a Nebuly client in the automotive industry found that usage of their generative AI assistant varied widely by region. Initially, leadership assumed that some regions just weren’t as interested in the tool. But by digging into the conversational analytics, they discovered a specific problem: the assistant’s performance for non-English queries was poor, leading to high failure rates for those users.

In other words, employees in Latin American offices were asking questions, getting bad answers, and understandably using the tool less. The fix was to improve the assistant’s multilingual support – which removed a major blocker to adoption. System-level technical metrics alone would not have uncovered this issue; only by pairing technical data with user-centric analytics did the team get the full picture of why one deployment succeeded while another struggled. This kind of insight – understanding why one team’s AI is delivering value while another’s isn’t – is exactly what cross-model user analytics provides.

Perhaps most importantly, a unified adoption dataset allows the business to compare the effectiveness and ROI of each AI tool. For example, if Marketing’s content AI is used 10× more often than Sales’ lead-qualifying AI, that tells you where value is being realized (and maybe which team might need help improving their solution). Or if one department’s custom fine-tuned model drives measurable productivity gains, while another team’s off-the-shelf AI sees low engagement, you can make informed decisions about where to invest further, which approaches to replicate or scale back, or whether to consolidate solutions. Once each department starts shaping its own AI, adoption data becomes the benchmark for judging which approaches actually deliver value. Without that data, every team is just guessing or relying on anecdotal feedback.

In short, a cross-model analytics layer turns a potential fragmentation nightmare into an opportunity. It gives you a way to oversee and optimize AI at the portfolio level across all the different models and assistants your company uses. Instead of siloed views and guesswork, you get a holistic understanding of AI’s impact on your business. Without it, you’re essentially in the dark – you might improve some AI tool in isolation while missing a chance to replicate its success elsewhere (or failing to notice a high-risk failure in another corner). With unified analytics, you gain the visibility to govern AI use company-wide: guiding best practices, reallocating resources to high-impact projects, and ensuring every AI initiative aligns with business outcomes.

Lessons from web, BI, and CRM analytics

If this need for a unified view sounds conceptually familiar, that’s because we’ve seen this movie before. Every time businesses adopted a transformative technology at scale, they eventually needed a unified analytics or management layer to make sense of it across the organization. Consider a few analogies:

Web Analytics (e.g. Google Analytics)

In the early days of websites, companies often had very basic metrics – perhaps a hit counter on each page or server logs analyzed in silos. Marketing had its web stats, product teams had theirs, and there was no easy way to get a cohesive view of user behavior online. The introduction of platforms like Google Analytics changed the game by providing one unified view of user behavior across an entire website (and across marketing channels driving traffic to that site). Suddenly, everyone could agree on a single source of truth for web metrics – which campaigns bring in traffic, how users navigate pages, where they drop off in a funnel. This unified approach was crucial for the web to mature into a core business tool. Today, using a web analytics solution is just part of doing business online; it’s hard to imagine running a major website without tracking user traffic and conversions (indeed, over 80% of websites use Google Analytics or similar tools for tracking). We need the same for AI assistants: a single lens on how users interact with all our different AI interfaces, not just siloed stats for each.

Business Intelligence (BI tools like Tableau, Power BI)

Large enterprises have dozens of databases and software systems – finance, supply chain, HR, sales, marketing, support, etc. In the past, each might produce separate reports, making it hard for leadership to get the full picture. Modern BI platforms aggregate data from across these silos, enabling cross-functional dashboards and analysis. This centralization of reporting means a company can correlate metrics that would be impossible to see in isolation (for example, seeing how customer service response time impacts customer satisfaction scores or revenue retention). In the same way, GenAI usage insights must be unified so companies can correlate AI usage with business metrics. For instance, does increased use of an internal AI knowledge assistant correspond to faster project delivery or fewer support tickets? Does the customer support chatbot deflect a meaningful percentage of inquiries from call centers (and thereby save costs)? Only a unified data approach lets you tie those threads together.
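
As a minimal sketch of what that correlation work can look like, assuming you can export weekly assistant usage and support-ticket counts (the file and column names below are hypothetical placeholders):

```python
# Minimal sketch: correlating weekly usage of an internal AI assistant with
# support-ticket volume. Assumes both series can be exported to CSV; file and
# column names are hypothetical placeholders.
import pandas as pd

usage = pd.read_csv("assistant_usage_weekly.csv")    # columns: week, ai_sessions
tickets = pd.read_csv("support_tickets_weekly.csv")  # columns: week, ticket_count

df = usage.merge(tickets, on="week")
corr = df["ai_sessions"].corr(df["ticket_count"])    # Pearson correlation
print(f"Correlation between assistant usage and ticket volume: {corr:.2f}")
# A consistently negative correlation hints (but does not prove) that the
# assistant is deflecting tickets; confirm with a before/after or holdout test.
```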

Customer Relationship Management (CRM systems like Salesforce)

Before CRMs, customer interactions were scattered. Sales might track leads in spreadsheets, support had a separate ticketing system, marketing had an email list, etc. Salesforce and its peers created a single system of record for all customer interactions across departments. This not only improved internal efficiency but provided management with a holistic view of the customer journey. Similarly, as every department starts interacting with users (whether customers or employees or partners) via AI, those AI interactions become part of the overall user experience and journey. We’ll need a central record or analytics for those AI-driven interactions. For example, how many times did a customer attempt self-service with the chatbot before contacting a human support agent? What kinds of questions are employees asking an internal HR AI assistant, and are they getting answers or escalating to managers? These are new interaction points that should feed into our analytics and CRM thinking, just as web clicks or support calls do.

In all these cases – web analytics, BI, CRM – the pattern was the same: early fragmentation and siloed efforts eventually gave way to unified platforms and “single source of truth” approaches as usage scaled. Generative AI inside the enterprise is reaching that inflection point now. Companies that pioneered some AI pilots in one team are now scaling AI across the org, and they’re running into the limits of siloed monitoring and ad-hoc measurement. It’s time to bring AI usage data together. In fact, Nebuly’s vision is that a unified analytics layer for AI will become as essential as web analytics is today. You wouldn’t run a mission-critical website without Google Analytics or an equivalent in place; likewise, in a few years it will be unthinkable to deploy dozens of enterprise AI and assistant tools without proper user analytics and feedback loops to understand how they’re performing.

Beyond observability: Tracking user behavior in conversations

Up to now, many AI teams have relied on technical observability tools or the models’ own logs to monitor their systems. These are valuable for what they do – ensuring the technical system is functioning correctly. For example, observability dashboards will track metrics like response latency, error rates, throughput, and infrastructure usage. This kind of monitoring is essential, especially early on, to debug and scale the system. If your chatbot’s response times spike or an API call fails, you need to know immediately.

However, once the AI assistant is live with real users, those system metrics cannot tell you whether the AI is actually helping people or driving business outcomes. A model could be fast, stable, and error-free and still fail to deliver value if, say, users don’t find its answers helpful or stop using it after a few tries. Technical metrics alone miss the human side of the equation.

This is where tracking user behavior in conversations becomes critical. It captures the interaction from the user’s perspective: Are users engaging with the AI or abandoning it? What are they asking for? Do they have to rephrase or repeat questions (a sign they didn’t get what they needed)? How often do they give up on the AI and switch to another channel (like calling support or asking a colleague)? Traditional product and web/app analytics tools, which excel at tracking clicks, page views, and form submissions, provide very little insight here; they were not built for the nuance of natural language dialogues. A conversation isn’t a series of discrete events like page loads – it’s a back-and-forth with context, intent, and sometimes ambiguity.

Crucially, conversational AI introduces a whole new layer of behavioral data that standard analytics tools don’t capture. Measuring where in a dialogue a user gets frustrated, for example, requires understanding the content and flow of the conversation, not just an isolated UI event. If a user asks a question, gets an answer, then asks the same question slightly differently two more times, a human can infer “the first answer didn’t satisfy them.” But a generic event logger might just count three queries. If you optimize only for system metrics, you might end up making the AI respond a few milliseconds faster while missing the fact that users are unhappy with the answers. Conversely, if you only look at high-level business outcomes (e.g., support ticket volume), you might not realize an AI tool is underperforming until it’s reflected in those outcomes – by which time you may have lost user trust or wasted effort.
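
To illustrate that rephrase signal, here is a toy Python heuristic that flags consecutive user messages that look like restatements of the same question; the lexical similarity measure and threshold are simplifying assumptions (production systems typically rely on embeddings and tuned thresholds):

```python
# Toy heuristic for the "user rephrased the same question" signal described
# above: flag consecutive user messages that are lexically very similar.
# The similarity measure and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def looks_like_rephrase(prev_msg: str, next_msg: str, threshold: float = 0.6) -> bool:
    ratio = SequenceMatcher(None, prev_msg.lower(), next_msg.lower()).ratio()
    return ratio >= threshold

def count_rephrases(user_messages: list[str]) -> int:
    return sum(
        looks_like_rephrase(a, b)
        for a, b in zip(user_messages, user_messages[1:])
    )

session = [
    "How do I reset my VPN password?",
    "How can I reset the VPN password?",   # likely a rephrase -> dissatisfaction signal
    "password reset for VPN not working",
]
print(count_rephrases(session))  # e.g. 1 or 2, depending on the threshold
```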

To truly measure success in GenAI deployments, organizations need to track a broader set of metrics that link to user experience and business value. One helpful framework is to think of metrics in three layers (a short worked sketch follows the list):

1. Behavioral signals:

How users interact with the AI day-to-day. This includes adoption rates (how many people are using it, how often), engagement patterns (e.g. average conversation length, number of turns per session), retry or rephrase rates (do users need to ask multiple times to get a good answer?), and drop-off points (where in a conversation or task users give up). You can also gather implicit satisfaction cues – for instance, if a user abruptly ends the session or continually rephrases a query, that implies frustration. These metrics reveal whether people are actually embracing the tool and where they encounter friction or confusion.

2. Operational signals:

How AI is affecting core processes. For example, is the AI actually resolving issues or just handing them off? Metrics here could be things like self-service success rate (what percent of inquiries the chatbot resolves without human handoff), average handling time (if AI is involved in a workflow, does it speed it up?), or internal metrics like faster project completion when using an AI copilot. These connect AI usage to efficiency improvements in the business.

3. Financial or business outcomes:

The highest-level results influenced by AI. This includes direct outcomes like revenue generated or costs saved due to AI, as well as broader KPIs like customer satisfaction scores or employee productivity measures. Ultimately, these tell you if the AI investment is paying off in tangible terms.
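
Here is the promised sketch of the three layers, computed from per-session records; the field names (abandoned, escalated_to_human, minutes_saved) and the cost-per-minute figure are hypothetical assumptions:

```python
# Minimal sketch of the three metric layers, computed from per-session records.
# Field names and the cost figure are hypothetical; real deployments define
# their own equivalents.
sessions = [
    {"user": "a", "turns": 6, "abandoned": False, "escalated_to_human": False, "minutes_saved": 4},
    {"user": "b", "turns": 2, "abandoned": True,  "escalated_to_human": True,  "minutes_saved": 0},
    {"user": "c", "turns": 5, "abandoned": False, "escalated_to_human": False, "minutes_saved": 7},
]
n = len(sessions)

# 1. Behavioral: are people sticking with the assistant?
drop_off_rate = sum(s["abandoned"] for s in sessions) / n

# 2. Operational: is the assistant resolving work without a human handoff?
self_service_rate = sum(not s["escalated_to_human"] for s in sessions) / n

# 3. Financial: rough value of the time saved (assumed $1.50 per agent-minute).
cost_per_agent_minute = 1.50
estimated_savings = sum(s["minutes_saved"] for s in sessions) * cost_per_agent_minute

print(drop_off_rate, self_service_rate, estimated_savings)
```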

Most observability tools focus on technical and perhaps some operational metrics. Traditional product analytics focus mainly on the very top of the funnel (behavioral events in a UI, but not the content or success of those events).

Conversational AI spans all three layers: you need to understand the interaction, link it to process outcomes, and ultimately see the business impact. Without all three layers, it’s easy to optimize for the wrong thing – like boosting usage numbers without improving outcomes, or improving model accuracy without improving user satisfaction.

So how can teams get these missing AI user insights? This is an emerging area, and it requires purpose-built solutions. Companies need to instrument their AI assistants in a way that captures conversational interactions similar to how we instrument websites and mobile apps for user actions. This might involve logging each turn of a conversation (user question and AI answer), noting user reactions (did the user ask a follow-up? rephrase? click a thumbs-down button if provided?), and tagging outcomes (was the issue resolved or did it escalate? what was the user’s sentiment or feedback?). Doing this at scale and making sense of the data isn’t trivial – but that’s exactly the challenge that new platforms like Nebuly are tackling.
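
As a rough sketch of that kind of instrumentation (field names are illustrative assumptions, not any vendor’s schema), each conversational turn can be logged as a structured event much like a web-analytics event:

```python
# Illustrative per-turn conversation event, analogous to a web-analytics event.
# The field names are assumptions for this sketch, not any vendor's schema.
import json, time, uuid

def log_turn(assistant_id: str, session_id: str, user_message: str,
             ai_response: str, *, user_rephrased: bool = False,
             thumbs: str | None = None, escalated: bool = False) -> dict:
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "assistant_id": assistant_id,          # which bot/model served the turn
        "session_id": session_id,
        "user_message": user_message,
        "ai_response": ai_response,
        "signals": {
            "user_rephrased": user_rephrased,  # implicit dissatisfaction cue
            "explicit_feedback": thumbs,       # "up" / "down" / None
            "escalated_to_human": escalated,   # outcome tag
        },
    }
    print(json.dumps(event))  # in practice: ship to your analytics pipeline
    return event
```

Because the same event shape can be emitted by every assistant regardless of the underlying model, it becomes the raw material for the cross-model view discussed above.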

A unified analytics platform for GenAI (how Nebuly helps)

Nebuly is building what we like to call the “user intelligence layer” for generative AI products. In essence, it’s a user analytics platform for LLMs – a centralized solution to capture and analyze how users interact with any AI assistant or agent, regardless of which model or vendor is behind it. Think of it as analogous to Google Analytics, but for AI-driven conversations. Just as GA tracks events on your website or app, Nebuly tracks events in a conversation: user queries, AI responses, follow-up actions, and implicit feedback signals. The goal is to turn all that unstructured conversational data into actionable insights that product owners and business stakeholders can use.

What does this look like in practice? Nebuly connects to your AI applications (whether it’s a customer support chatbot, an internal copilot integrated in your software, a voice assistant, etc.) and automatically logs the interactions. It then provides analytics and dashboards on top of this data, so you can see things like:

- Usage and adoption: e.g. how many users engage with each assistant daily/weekly, how many total queries, what the user retention looks like (do people come back and continue using it?).

- Conversation patterns and outcomes: e.g. the most common intents or questions users have, the average dialogue length, where users tend to drop off in a conversation, and what percentage of conversations are successful (however success is defined for that use case).

- Implicit feedback and friction: e.g. how often users rephrase their queries (which might indicate the AI didn’t satisfy them initially), how often they switch to a human channel or escalate, sentiment analysis of user messages, or detection of frustration signals like repeated “help” requests.

- Comparative performance: if you have multiple bots or multiple versions of a bot, Nebuly can compare them side by side. For instance, if one team is A/B testing two different models or prompt strategies, the platform can show which variant yields higher user satisfaction or task completion rates.

- Trend and ROI analysis: tying usage metrics to business outcomes. For example, correlating a rise in chatbot self-service resolution with a drop in live agent support tickets, or measuring how much time an internal AI tool is saving employees on certain tasks (time that can be quantified in cost savings).

Crucially, Nebuly is model-agnostic and cross-functional. Your HR bot might be built on OpenAI, your finance analysis tool might run an open-source LLM on-prem, and your customer chatbot might use Google’s API – Nebuly aggregates their analytics into one place. This provides the common language of adoption data we discussed earlier: a way for different teams and stakeholders to align on what’s working and what’s not. The platform is designed to deliver insights that are useful not just for engineers, but for product managers, UX designers, and business leaders as well. In fact, Nebuly emphasizes cross-team value: it’s one platform delivering relevant insights for product, marketing, customer experience, and compliance teams, not only for AI developers.

For example:

- Product teams can get faster feedback on new AI features by seeing how users actually interact with them and where they get stuck, enabling a more rapid iteration cycle based on real behavior.

- Marketing teams (in a customer-facing context) can gain a deep understanding of user intent and pain points from analyzing chatbot conversations, informing content strategy or identifying gaps in self-service resources.

- Customer experience or success teams can identify where users get frustrated with AI help and prioritize those areas for improvement or additional training content.

- Compliance and risk officers can even use the analytics to spot potential issues – e.g. if users are frequently asking an AI about something that might lead to inappropriate advice, or if employees are entering sensitive data despite policies, those patterns can be caught.

By contrast, traditional observability tools are built mainly for engineering metrics and don’t offer this kind of rich user-centric insight. And traditional product analytics (web/app analytics) can’t parse an AI conversation meaningfully – a chat isn’t a series of button clicks or page loads. Even product analytics vendors have recognized this gap: for instance, Pendo (a product analytics company) recently introduced a beta feature called Agent Analytics to measure AI agent interactions, an acknowledgment that new approaches are needed. However, these retrofits are often limited in depth because they’re bolted onto tools not originally designed for conversational data. Nebuly, by contrast, is purpose-built from the ground up for conversational AI analytics.

The table below highlights how a GenAI user analytics platform like Nebuly differs from traditional analytics tools:

| Tool | Primary focus | What it tracks | Limitation for GenAI |
|------|---------------|----------------|----------------------|
| Google Analytics | Web traffic | Page views, sessions, traffic sources, conversions | Cannot analyze conversation content or user intent |
| Amplitude | Product analytics (apps) | Events, funnels, retention, user cohorts | Event-based model misses unstructured conversation data |
| Mixpanel | Behavioral analytics | User actions, conversion funnels, A/B tests | Tracks clicks and events, not natural language interactions |
| Hotjar | Qualitative insights | Heatmaps, session recordings, on-page surveys | Visual-based analysis doesn’t apply to chat interfaces |
| Pendo | Product experience | Feature adoption, user paths, in-app guides | Recently added “agent analytics”, but not conversation-native |
| Nebuly | GenAI user analytics | Conversation intent, user satisfaction signals, topics, friction points, outcomes | None; purpose-built for conversational AI data |

By deploying a platform like Nebuly, companies essentially gain Google-Analytics-like visibility into their AI usage. You can pinpoint which assistant is delivering high ROI and which ones are underperforming. You can measure employee engagement with a new internal AI tool and identify if additional training or UX improvements are needed – for instance, in one global manufacturer’s rollout of an internal coding assistant, Nebuly’s analytics revealed that many engineers weren’t phrasing queries in a way the model could handle, leading to low success rates. The fix was to provide a short prompt training and examples to those teams, after which usage and success metrics climbed significantly (resulting in measurable productivity gains). Without user analytics, the company might have assumed the model itself was insufficient, when in fact the issue was a solvable user education gap.

Nebuly also helps enforce governance and compliance in this multi-model environment. A central AI governance team can see, for example, if a certain department’s AI usage spikes unexpectedly (maybe indicating an unmanaged “shadow AI” tool being used), or if certain types of queries (like requests involving sensitive data or policy-related questions) are causing problems across different bots. They can then step in to investigate or set organization-wide guidelines. Essentially, Nebuly closes the feedback loop for AI deployments: it provides the data to answer “Is our AI actually helping users? Where and how should we improve it?” Instead of guessing or relying only on anecdotal reports, teams get concrete evidence from user behavior.

In effect, Nebuly is positioning itself as the category-defining GenAI user analytics platform – purpose-built to handle the complexity of conversational data at scale. It gives enterprises a way to turn raw conversational logs into structured insights, revealing user behavior patterns that purely technical metrics would never show. As one Nebuly message puts it: If you want to deliver business value in the age of AI, you need to understand your users – and that now means understanding how users interact with AI systems, not just with traditional software.

Looking ahead: AI as the new everyday tool

The trend is clear: Just as the internet moved from a specialist project to a pervasive utility in business, AI is moving from a siloed capability to everyday infrastructure. In a few years, every department will take for granted that some form of AI assistant or generative tool helps power its work – analogous to how every department today relies on software and connectivity. We’re nearing a future where saying “we have an AI team that handles all our AI” will sound as outdated as having an “internet team” does now.

To make this transition successful (and not chaotic), companies will need a combination of robust technical monitoring, clear adoption metrics, and a business value framework. Technical monitoring ensures the models run reliably, securely, and within guardrails. But adoption metrics and user analytics ensure the AI is actually delivering measurable impact, not just being available. Governance will also evolve: early on, centralized teams set guardrails; over time, those become organization-wide policies with each department accountable for using AI responsibly and effectively. A unified analytics layer becomes the compass that guides that responsibility – showing where to course-correct and where to double down.

In summary, as generative AI gets embedded across every department, organizations will inevitably juggle multiple models and AI tools. It’s a positive development – allowing each team to leverage the AI that fits best – but it comes with the challenge of fragmentation. The solution is to zoom out and treat AI usage and behavior data as a first-class domain to be measured and optimized, just like web traffic or sales pipelines. By implementing cross-model user analytics and sharing those insights widely, companies can ensure all their AI efforts are rowing in the same direction. They’ll be able to compare apples-to-apples across tools, learn quickly what works (and what doesn’t), and make informed decisions to drive maximum value from AI.

The businesses that master this – that turn AI user data into actionable insight – will have a massive advantage. They’ll improve their AI systems faster (because they actually know what’s happening in those systems), deliver better user experiences, and ultimately achieve stronger ROI on AI investments. In the end, successful AI adoption isn’t just about model performance or technical feats; it’s about user acceptance, effective usage, and real outcomes. As we enter a world of AI-in-every-department, understanding your users’ behavior with those AI systems isn’t a “nice-to-have” – it’s the key to delivering business value in the age of AI.

Frequently asked questions (FAQs)

Do I need a purpose-built GenAI analytics tool to track user behavior?

Yes. Traditional product analytics tools like Google Analytics, Amplitude, and Mixpanel were designed for click-based interfaces, not conversational AI. They cannot analyze the content of conversations, detect user intent, or measure satisfaction from dialogue patterns. A purpose-built GenAI analytics platform is necessary to understand what users actually want in a chat and whether they succeed.

Can I use Google Analytics to track my AI chatbot?

Google Analytics can track basic metrics like how many users access your chatbot page and maybe referral sources, but it cannot analyze conversation content or user intent. GA (even GA4) isn’t built to interpret chat interactions – it will treat a chat like a black box. It also struggles to attribute conversational flows to outcomes, since AI chats don’t have the traditional page view funnel that GA expects.

What is the difference between product analytics and GenAI user analytics?

Product analytics tracks discrete user actions in a graphical interface – clicks, page views, button presses, etc. GenAI user analytics tracks conversational interactions – it looks at the dialogue between user and AI. Product analytics can tell you what users clicked and in what sequence. GenAI analytics tells you what users asked, whether the AI’s answer was helpful, if the user had to rephrase, and if their problem was solved. In short, product analytics answers “What did the user do in the app?” GenAI analytics answers “What did the user try to achieve with the AI, and did they succeed?”

Can Mixpanel or Amplitude track chatbot conversations?

Not in a meaningful way. You could jury-rig Mixpanel or Amplitude to log an event when a chat starts or when a user clicks a suggested prompt, but these tools won’t capture the content of messages or the flow of a conversation. They treat everything as generic events. They can’t tell you if a user’s question was answered satisfactorily or if they got frustrated. So you’d get surface-level metrics (like “User opened chat window”), but miss the deeper insight (like “User asked about billing, got a confusing answer, and gave up”).

What can Hotjar tell me about my chatbot users?

Hotjar and similar tools provide heatmaps and session recordings, which are great for visual interfaces (like where users scroll or click on a webpage). But for a chatbot, there isn’t a visual navigation to heatmap – the interaction is through messages. Hotjar might help you see if users click on the chat widget or how they scroll on the page around it, but it won’t show the content of the conversation or user sentiment. At best, you could use Hotjar surveys to ask users how the chatbot experience was, but you’re relying on users to explicitly provide feedback, which many won’t.

Why can’t I just add chatbot events to my existing analytics?

You can track a few high-level events (like “Chat session started” or “User clicked suggested answer”) in existing tools, but this approach misses the most important insights. Conversational AI requires understanding the *content* and *context* of interactions. Generic analytics events won’t capture if a user said “I need help resetting my password” and the bot misunderstood them. To get value, you need to analyze the language, the intent behind it, and the outcome (did the password get reset or was the user frustrated?). Traditional analytics lacks the natural language processing component needed to do that analysis.
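
The contrast is easiest to see side by side: a minimal sketch of what a generic analytics event captures versus the conversation-aware record you actually need (field names are illustrative assumptions):

```python
# What a generic analytics event typically captures vs. the conversation-aware
# record needed for GenAI insight. Field names are illustrative assumptions.

generic_event = {
    "event": "chat_message_sent",   # tells you *that* something happened
    "user_id": "u_123",
}

conversation_record = {
    "user_id": "u_123",
    "user_message": "I need help resetting my password",
    "detected_intent": "password_reset",  # requires language understanding
    "ai_response_helpful": False,         # inferred from the follow-up turns
    "outcome": "escalated_to_human",      # did the task actually get done?
}
```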

What metrics should I track for my GenAI chatbot?

Key metrics for a GenAI chatbot fall into a few categories: (1) Adoption – number of active users, usage frequency, repeat usage rate; (2) Engagement – average session length (in turns or time), number of messages per session, drop-off rate (% of sessions where user quits early); (3) Success/Outcome – task completion rate (did the user achieve what they wanted, e.g. got an answer or completed a form via the bot), self-service rate (% of conversations not needing human escalation), and resolution time; (4) User Satisfaction – which can be measured via explicit ratings (thumbs up/down, CSAT after chat) and implicit signals (rehit rate, frustration signals like repeated questions or negative sentiment in user messages). These metrics together tell you if your chatbot is being used, how it’s being used, and how well it’s meeting user needs.
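
As a small illustration of the adoption bucket, weekly active users and repeat-usage rate might be computed from raw session logs roughly like this (the log format and field names are assumptions):

```python
# Sketch: adoption metrics from raw session logs.
# Assumed fields per session: user_id, week (ISO week label).
from collections import defaultdict

sessions = [
    {"user_id": "a", "week": "2025-W40"},
    {"user_id": "a", "week": "2025-W41"},
    {"user_id": "b", "week": "2025-W41"},
]

# Weekly active users
weekly_active = defaultdict(set)
for s in sessions:
    weekly_active[s["week"]].add(s["user_id"])
users_per_week = {week: len(users) for week, users in weekly_active.items()}

# Repeat usage: share of users seen in more than one week
weeks_per_user = defaultdict(set)
for s in sessions:
    weeks_per_user[s["user_id"]].add(s["week"])
repeat_rate = sum(len(w) > 1 for w in weeks_per_user.values()) / len(weeks_per_user)

print(users_per_week, repeat_rate)  # {'2025-W40': 1, '2025-W41': 2} 0.5
```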

How does Nebuly detect user satisfaction without explicit feedback?

Nebuly analyzes implicit signals in the conversation. For example, if a user asks a question and then immediately rephrases it two or three times, that suggests the earlier answers weren’t helpful – a sign of dissatisfaction. If a user spends a while interacting and then suddenly types “agent” or “help” to get a human, that indicates the AI failed them on that task. On the positive side, if a user’s follow-up questions become more detailed or they say “Thanks, that helps,” those are signs of satisfaction. Nebuly’s platform looks at patterns like these (including conversational sentiment analysis and linguistics cues) to infer a satisfaction score or outcome without needing every user to explicitly rate the chat. Essentially, it’s like reading the room in a conversation – detecting if it’s going well or poorly from the dialogue itself.
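
For intuition only, here is a deliberately simplified toy scorer built on the signals described above (rephrasing, escalation keywords, explicit thanks). It is not Nebuly’s actual model, which combines many more cues, including sentiment analysis:

```python
# Deliberately simplified toy satisfaction scorer based on the signals
# described above. Not Nebuly's actual model; weights are arbitrary.
ESCALATION_WORDS = {"agent", "human", "help"}
POSITIVE_PHRASES = ("thanks", "that helps", "perfect")

def implicit_satisfaction(user_messages: list[str], rephrase_count: int) -> float:
    score = 0.5                                    # start neutral
    score -= 0.15 * rephrase_count                 # each rephrase suggests a miss
    last = user_messages[-1].lower() if user_messages else ""
    if any(w in last.split() for w in ESCALATION_WORDS):
        score -= 0.3                               # user bailed out to a human
    if any(p in last for p in POSITIVE_PHRASES):
        score += 0.3                               # explicit positive closing
    return max(0.0, min(1.0, score))

print(implicit_satisfaction(["How do I reset my password?", "Thanks, that helps!"], 0))  # ~0.8
```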

Is Pendo suitable for tracking AI agent behavior?

Pendo recently launched an “Agent Analytics” feature to start measuring AI agent performance, which is a step in the right direction. If you’re already a Pendo user, it can capture some basic metrics about AI interactions (like number of prompts, maybe outcome tags). However, Pendo’s core strength is still traditional product analytics, and its AI analytics capabilities are in early stages. It may not automatically interpret conversation flows or provide rich intent and satisfaction analysis. In short, it’s a partial solution – better than nothing, but not as deep as a specialized GenAI analytics tool. Teams highly invested in Pendo might experiment with it, but those who need robust conversation insights often turn to purpose-built platforms like Nebuly that were designed specifically for AI usage data.

How do I prove ROI from my GenAI investment?

Proving ROI from GenAI requires connecting usage to tangible outcomes. First, define what success looks like for your AI use case (e.g. faster support resolution, reduced content creation costs, increased sales conversions via better recommendations). Then, instrument your AI solution to track those outcomes – both baseline (before AI) and after AI deployment. With a user analytics platform, you can quantify things like: AI handled 500 customer queries this week that would have otherwise gone to support reps (saving X hours of support time), or the marketing AI produced content that helped drive Y% more traffic (which is worth Z in lead value). You should also track avoided costs – for instance, if AI automation allowed you to handle a surge in inquiries without hiring temp staff, that cost saving counts toward ROI. Essentially, use the data to tell a story: we invested $A in this AI, and we’re seeing $B in either increased revenue or decreased costs (or both) as a result. Over time, the trend of those metrics (hopefully improvement) will bolster the ROI case. A platform like Nebuly can help by attributing outcomes to AI usage – e.g., showing that power users of an internal AI tool complete tasks 30% faster than those not using it, which you can then translate into productivity gain dollars.
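
A worked example of that kind of ROI story, with every input a hypothetical assumption for illustration:

```python
# Worked ROI sketch using the kind of figures mentioned above.
# All inputs are hypothetical assumptions for illustration.
deflected_queries_per_week = 500   # queries the AI handled instead of support reps
minutes_per_query = 8              # assumed average handling time saved per query
loaded_cost_per_hour = 45.0        # assumed fully loaded support cost per hour

weekly_support_savings = deflected_queries_per_week * minutes_per_query / 60 * loaded_cost_per_hour
annual_support_savings = weekly_support_savings * 52

annual_ai_spend = 60_000.0         # assumed: licenses + inference + maintenance
roi = (annual_support_savings - annual_ai_spend) / annual_ai_spend

print(f"Annual savings: ${annual_support_savings:,.0f}  ROI: {roi:.0%}")
# 500 * 8 / 60 * 45 = $3,000/week ≈ $156,000/year -> ROI ≈ 160% on a $60k spend
```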

What is the difference between LLM observability and GenAI user analytics?

LLM observability tools (like Helicone, LangSmith, Langfuse, etc.) are focused on the system side of an AI application. They track things like API calls, latency, token usage, error rates, model performance stats, and provide traces of how prompts flow through your system (especially if you have complex chains or agents). They’re great for developers to debug and monitor the health and cost of the AI service. GenAI user analytics, on the other hand, is focused on the human side: understanding the user’s experience and behavior. It tracks what users are asking, whether their needs are met, how they interact with the assistant, and where they struggle. In an ideal setup, you use both: observability ensures your AI service is technically sound and cost-effective, while user analytics ensures it’s useful, usable, and delivering value to people.

Can Helicone track user behavior in my chatbot?

Helicone is an LLM observability tool that provides some user-level metrics, such as the number of active users and session length, and it can log prompts/responses with user IDs. It’s useful for seeing usage volume and capturing conversation data primarily for debugging and performance. However, Helicone by itself won’t analyze that conversation data for you – it won’t categorize user intents or tell you if the user was frustrated. It’s more about capturing data than interpreting it. You might get explicit feedback if you code that in (like thumbs up/down captured in metadata), but you won’t get the richer implicit analytics. Helicone is excellent for engineering visibility (it even lets you store and query chat logs), but you’d complement it with an analytics layer (like Nebuly) to derive insights about user experience and satisfaction from those logs.

Is LangSmith suitable for understanding chatbot user behavior?

LangSmith (from LangChain) is geared towards developers building LLM applications. It gives detailed traces of how prompts are processed, how different chains or tools are invoked, and allows logging of results and feedback for each run. It’s great for debugging complex agent behaviors or fine-tuning your prompt chains. However, LangSmith doesn’t come with out-of-the-box analytics for user experience. It’s not going to automatically tell you “Users seem unhappy with answers about topic X” or “Conversation length correlates with success in this use case.” You could use LangSmith’s logged data to manually analyze some patterns, but it’s not a plug-and-play solution for product insights. It shines more in the development and QA phase rather than ongoing user behavior monitoring.

What can Langfuse tell me about my chatbot users?

Langfuse is another observability tool for LLM apps, focusing on capturing conversation sessions and providing an interface to review them. It can show you the structure of conversations (the sequence of messages) and metadata like which model was used, latency, etc. This is useful for debugging and ensuring the conversation flow is as expected. But similar to others in this category, Langfuse doesn’t inherently analyze user sentiment or success – it’s more like a recording and debugging system. Think of it as a flight recorder; it records everything that happened in a conversation. To get insights like “where do users struggle?” or “what intents are most common?”, you would have to manually sift through those records or export them for analysis in another tool. It’s aimed at engineers maintaining the system, rather than product managers looking for usage trends.

Do I need both observability and user analytics for my GenAI application?

In most cases, yes – they serve different but complementary purposes. Observability (tools like Helicone, LangSmith, Langfuse, etc.) is crucial for your engineering team to ensure the AI system is functional, efficient, and within cost and compliance limits. It answers “Is the system working correctly?” User analytics (tools like Nebuly) is crucial for your product and business teams to ensure the AI system is useful, usable, and delivering value. It answers “Are people using it effectively and getting what they need?” If you only do observability, you might have a technically sound system that no one likes or uses. If you only do user analytics, you might identify a problem but not know where in the system or model it’s originating. Using both means you can monitor the health of the AI (uptime, errors, performance) and the health of the user experience (adoption, satisfaction, outcomes) together. Many successful AI product teams integrate both types of tools into their workflows.
