Opinion: Why Manual AEO Tracking Falls Apart

Manual AEO tracking breaks down because small-sample spreadsheets measure model variance instead of real visibility, while reliable tracking requires scaled sampling, Share of Voice, platform separation, fan-out visibility, and citation-level mapping.

Mina
AEO enthusiast

Product Mechanics

A common question in marketing circles right now: can you actually track brand mentions in AI answers manually, or is it a waste of time?

The typical setup looks reasonable. Twenty prompts, run daily across ChatGPT, Claude, and Perplexity, results logged in a spreadsheet. Two months in, the data is unusable. Because different runs give different answers, there is no way to tell whether a week-over-week change reflects real movement or random variance.

The reasons for this failure are worth walking through, because they map almost exactly onto what Operyn was built to fix.

The variance is the platform, not the method

The noise in that spreadsheet is not a methodology problem; it's a property of the systems being measured. LLM outputs are non-deterministic by design: the same prompt returns different answers on different runs, so a handful of samples per prompt will never settle into a stable read. To see past the variance, each prompt needs to be run dozens or hundreds of times across multiple models, with the results aggregated statistically into a single rate. Perplexity makes this worse by pulling fresh citations on every query, so part of what gets measured is the platform's own drift rather than the brand's position.

A binary "did my brand get mentioned today" log is therefore measuring weather, not climate. Getting to signal requires infrastructure to average out the noise. And that is not a spreadsheet job.

Take a B2B SaaS brand running 25 queries across topics like "best CRM for startups" and "alternatives to HubSpot." Operyn runs each query at scale across ChatGPT and Gemini and rolls the results into mention rate, citation rate, and share of voice over a configurable window. After a month, the dashboard might show, say, 540 responses with a stable mention rate of 47%. That number means something because the sample size is large enough to dampen variance. A spreadsheet logging one run per day across the same 25 queries would have produced a noisy ribbon between 20% and 70% with no way to know which week was real movement.
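To make the sample-size point concrete, here is a back-of-envelope simulation, not anything from Operyn's pipeline, just a coin-flip model that assumes the brand's true mention probability for a query is 47% and compares one-run-per-day logging with aggregation over hundreds of runs:

```python
# Back-of-envelope illustration (not Operyn's internals): why one run per day
# reads as noise while hundreds of aggregated runs read as a stable rate.
import random

random.seed(7)
TRUE_RATE = 0.47  # assumed "true" mention probability for one query

def observed_mention_rate(samples: int) -> float:
    """Simulate `samples` independent runs of one prompt and return the observed rate."""
    hits = sum(random.random() < TRUE_RATE for _ in range(samples))
    return hits / samples

# Eight "weeks" of single daily runs: 7 samples each, so the weekly rate swings
# well above and below the true 47% purely by chance.
weekly_single_runs = [observed_mention_rate(7) for _ in range(8)]
print([round(r, 2) for r in weekly_single_runs])

# One aggregated window of 540 sampled responses: the observed rate lands close
# to the true rate, which is what makes week-over-week comparison meaningful.
print(round(observed_mention_rate(540), 3))
```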

Share of voice beats raw mentions

The strongest single argument against raw mention counting is mathematical. The mention count for any single brand in any single query is a small number, and small numbers are dominated by variance. Share of voice within a query, meaning how often a brand is one of the brands the answer surfaces, is far more stable because it's normalized against the answer set instead of against zero.

This is how the Operyn Competition view works. The dashboard ranks a brand against its competitors on the same set of queries, expresses the result as a share of voice percentage, and shows the mention and citation counts behind it. A 30% share against a leader at 50% and a third player at 20% is a workable competitive picture. A standalone mention count for one brand tells you almost nothing without a denominator to compare it against.
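The arithmetic behind that is simple enough to sketch. This is a minimal illustration with hypothetical brand names and an assumed data shape, not Operyn's schema: count how often each brand appears across the sampled answers for a query set, then normalize against the total so every brand gets a denominator.

```python
# Minimal share-of-voice sketch (assumed structure, hypothetical brands).
from collections import Counter

# Each entry is the set of brands one sampled answer surfaced for a query.
sampled_answers = [
    {"BrandA", "BrandB"},
    {"BrandB"},
    {"BrandA", "BrandB", "BrandC"},
    {"BrandB", "BrandC"},
]

mentions = Counter(brand for answer in sampled_answers for brand in answer)
total = sum(mentions.values())

# One common definition: a brand's mentions over all brand mentions in the same
# query set, so the shares sum to 100% and stabilize as sampling scales up.
share_of_voice = {brand: count / total for brand, count in mentions.items()}
print({brand: round(share, 2) for brand, share in share_of_voice.items()})
```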

The Topic Battlegrounds view takes the same idea deeper: which competitor wins which topic. The same brand might be losing on one topic, holding even on another, and winning on a third, all within the same week. That is where decisions get made about what to defend and where to attack. A spreadsheet of raw mentions cannot produce this view.

The platforms are different experiments

Aggregating ChatGPT, Gemini, Claude, and Perplexity into one number hides the actual signal because the platforms are not measuring the same thing. From our own observation, and with the caveat that this behavior shifts with each model update, ChatGPT tends to lean on sources like Wikipedia, G2, and Forbes. Perplexity weights Reddit heavily. Different citation behavior produces different brand positions on different platforms.

Operyn separates them. The Brand Visibility view breaks the score out by platform, so a strong position on one model and a weak position on another show up as two distinct numbers rather than getting averaged into a blended figure that hides where the actual work needs to happen.

The same split shows up at the citation source level. Operyn lets you drill into a single URL and see which queries it ranks for, which models cite it, and at what rate. A high-citation-rate page that only ranks for two queries is a different signal than a low-citation-rate page that ranks for fifty, and that kind of detail is not reconstructable by hand.
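For readers who want to picture the underlying rollup, here is a hypothetical sketch, with assumed field names rather than Operyn's API, of grouping logged citations by URL and reporting which queries and platforms each one shows up for, and at what rate:

```python
# Hypothetical citation-level rollup (field names are assumptions, not Operyn's API).
from collections import defaultdict

# Each record: one sampled answer that cited a URL for a given query on a platform.
citation_log = [
    {"url": "https://example.com/crm-guide", "query": "best CRM for startups", "platform": "chatgpt"},
    {"url": "https://example.com/crm-guide", "query": "alternatives to HubSpot", "platform": "perplexity"},
    {"url": "https://example.com/pricing",   "query": "best CRM for startups", "platform": "chatgpt"},
]
total_answers = 540  # total sampled answers in the window

by_url = defaultdict(lambda: {"queries": set(), "platforms": set(), "citations": 0})
for row in citation_log:
    entry = by_url[row["url"]]
    entry["queries"].add(row["query"])
    entry["platforms"].add(row["platform"])
    entry["citations"] += 1

# Citation rate per URL, plus the query and platform coverage behind it.
for url, entry in by_url.items():
    rate = entry["citations"] / total_answers
    print(url, sorted(entry["queries"]), sorted(entry["platforms"]), f"{rate:.1%}")
```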

What manual gets you, and what it doesn't

The honest version is that manual tracking is not worthless. It's useful for spot checks: did the model say something strange about us this week, or did a new competitor start showing up? Useful, but limited.

What manual cannot do is resolve week-over-week brand share movement at meaningful resolution, or track 200+ queries across multiple platforms with enough samples per query to be statistically meaningful.

It also cannot surface fan-out queries, the sub-questions models generate behind the scenes that never appear in a prompt list. A single user query about a category might trigger the model to spin up ten variant searches on its own, refined by budget, use case, or comparison set. Those queries were not written by anyone on the marketing team; the model generated them, and they only become visible if something is logging at the model layer. Operyn surfaces these directly: the Fan-out view sits inside the query detail page and lists every sub-query a model generated in the background, along with which platform produced it, so the long tail of unseen searches becomes part of the dataset rather than a blind spot.

The last thing manual cannot build is a citation map across topics. Operyn's narrative summary view shows where citations cluster, which sources dominate which topic, and how flow shifts week over week. That informs which third-party domains to pitch, which to displace, and where existing strength is enough to ignore.

The actual case

A useful framing here is that a tool buys scale and a cleaner UI, not better data quality, unless the manual work was being done wrong in the first place.

That is fair, but most manual work is being done wrong: binary tracking, single runs, blended platforms, no share of voice, no fan-out, no citation source breakdown. The result is two months of effort and a spreadsheet nobody can trust.

The point of Operyn is that the actual fixes (sampling at depth, normalizing to share of voice, separating platforms, and tracking citations to specific URLs) are not optional features but the floor. Without them, what gets collected is not AEO tracking but anecdotes.

For anyone already invested in a spreadsheet of messy data: the methodology can be improved by hand. But the ceiling on careful manual work is a directional read across six to eight weeks, while the floor on Operyn is daily resolution at statistical scale across every brand and topic worth caring about.

That is the trade, and it's worth knowing before the third month of the spreadsheet.
