Metric Deep Dive · 2 of 3

LLM Authority Score

How often, and how prominently, AI models feature your brand when users ask real category questions.


What is LLM Authority Score?

Every time someone asks ChatGPT "what's the best CRM?", the model generates a response. That response contains brand names. LLM Authority is the measurement of how often your brand appears in those responses, and how prominently you're positioned when you do appear.

Think of it as AI share of voice. If we asked an LLM "best CRM?" a hundred times, how often would your brand surface, and on average what position in the list would you occupy? A brand that appears in 80 of 100 responses at average position 1.5 is dominating. A brand that appears in 8 of 100 at average position 7 is an afterthought, even when it's mentioned.

Quick example

Ask Claude "best affordable SEO tool for small businesses" a handful of times. You'll get different answers each time. Across those answers:

  • Ahrefs shows up in most responses, often near the top, with a positive framing. That's high Authority.
  • A niche competitor might show up in one or two responses, near the bottom, with a neutral tone. That's low Authority.
  • A brand that never surfaces unless you name it directly in your question has zero organic Authority.

The final number is a score from 0 to 100. Zero means the model never organically mentions your brand in response to the queries that matter to your business. A hundred means you appear first in every response to every relevant query. Real brands land somewhere in between, and that "somewhere" is the diagnosis.


Why LLM Authority Is the Closest-to-User Metric

Of our three visibility metrics, Authority is the one that most closely mirrors what an actual user sees on their screen. LBA (Latent Brand Association) is about what the model believes internally. Top of Mind (TOM) is about whether the model recalls you at all. Authority is the output, the thing that actually shows up in AI responses to real questions.

That makes it the metric that most directly translates to business outcomes. If your Authority is high across commercial-intent queries, users get pointed toward you in the conversations they're already having with AI. If your Authority is low on those same queries, users never see you, no matter how good your website is.

The layering

LBA and Top of Mind are internal measurements of what the model knows. Authority is the external measurement of what the model says. Strong LBA and TOM usually produce strong Authority, but not always, and the gaps are diagnostic. A brand with decent Authority but weak LBA is propped up by retrieval, which is fragile. A brand with strong LBA but weak Authority has recall memory that isn't translating into output, usually because competitors are more strongly positioned in the prompt contexts that matter.

One important subtlety: AI responses are probabilistic. Ask the same question twice and you'll get two different answers, with different brands in different orders. Any single response is noise. Authority only becomes meaningful across many responses, which is why our methodology runs each prompt multiple times and aggregates.


How We Measure It

Measuring Authority honestly requires a six-step pipeline.

Step 1: Source prompts from your real demand data

Authority only means something if we're testing it on prompts that matter to your business. We don't make up prompts or use generic templates. During onboarding, we pull search-volume data from Keywords Everywhere for your category and auto-generate a prompt set of 40 to 70 queries tagged by intent:

  • Discovery: "What are the best CRMs for small businesses?"
  • Comparison: "Best alternatives to Salesforce" (anchored to the category or a competitor, never to your own brand; see the self-referential exclusion below)
  • Problem / how-to: "How do I track sales pipelines?"
  • Transactional: "Cheapest CRM with a free tier"

Each prompt carries its real monthly search volume from the keyword data. High-volume queries get weighted more heavily in the aggregate, because they represent more user intent.
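
To make the shape of that prompt set concrete, here is a minimal sketch in Python. The field names and the specific volumes are illustrative, not our production schema.

from dataclasses import dataclass

@dataclass
class Prompt:
    text: str            # the query sent to the model, verbatim
    intent: str          # "discovery" | "comparison" | "problem" | "transactional"
    monthly_volume: int  # monthly search volume from the keyword data

# Illustrative entries only; real prompt sets are generated per category.
prompt_set = [
    Prompt("What are the best CRMs for small businesses?", "discovery", 9900),
    Prompt("Best alternatives to Salesforce", "comparison", 4400),
    Prompt("How do I track sales pipelines?", "problem", 1300),
    Prompt("Cheapest CRM with a free tier", "transactional", 720),
]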

Step 2: Run every prompt in two modes

Modern AI assistants can answer questions two fundamentally different ways:

  • Recall mode (no web access): the model answers from training data alone.
  • Retrieval mode (web search on): the model fetches live web content and synthesizes the answer.

Both happen in real usage. Users sometimes chat with the AI freely (mostly recall). They sometimes ask questions that trigger web search automatically (mostly retrieval). A brand's Authority can be very different across the two modes, and the difference is itself a diagnostic signal we call the fragility gap (more on this below).
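
A minimal sketch of how the two-mode run plan can be organized. ask_model is a placeholder for whatever client call actually issues the prompt (not a specific vendor API), and the iteration count is illustrative:

from enum import Enum

class Mode(Enum):
    RECALL = "recall"        # web access off: answer from training data alone
    RETRIEVAL = "retrieval"  # web access on: answer synthesized from live search

ITERATIONS = 5  # illustrative; each prompt runs several times because responses are probabilistic

def collect_responses(prompt: str, ask_model) -> dict:
    # ask_model(prompt, web_search=bool) -> str is a stand-in, not a specific vendor API
    return {
        mode: [ask_model(prompt, web_search=(mode is Mode.RETRIEVAL))
               for _ in range(ITERATIONS)]
        for mode in Mode
    }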

Step 3: Extract brand mentions with position

For every response we get back, a cheap extraction model parses out:

  • Every brand or product name mentioned in the answer
  • The order each brand first appears (position 1 = first mentioned, 2 = second, and so on)
  • The sentiment toward each mention (positive, neutral, negative)
  • Whether your specific brand is among them, including product-line variants ("Ahrefs Webmaster Tools" matches "Ahrefs")

The extraction has to be careful about one particular trap: in retrieval mode, AI responses cite sources inline ("according to TechRadar..."). We strip those citations before extraction so citation sources don't get mistaken for recommended brands.
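
A simplified sketch of those two mechanics, stripping inline citations and matching product-line variants back to the parent brand. The real pipeline uses an extraction model; the regular expression below is an assumption for illustration only.

import re

CITATION_PATTERN = re.compile(r"\baccording to [^,.)]+", re.IGNORECASE)

def strip_citations(response: str) -> str:
    # Remove "according to <source>" fragments so cited sources aren't counted as brands
    return CITATION_PATTERN.sub("", response)

def matches_brand(mention: str, brand: str) -> bool:
    # Treat product-line variants ("Ahrefs Webmaster Tools") as hits for the parent brand ("Ahrefs")
    return brand.lower() in mention.lower()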

Step 4: Score frequency and prominence

For each prompt:

  • Frequency = how often your brand appeared across iterations (0% to 100%)
  • Prominence = how high up you were ranked when you did appear

Position 1 gets a prominence score of 1.0. Position 2 gets 0.63. Position 5 gets 0.39. Position 10 gets about 0.29. The decay is logarithmic, because real user attention drops off quickly past the top of any list, AI-generated or otherwise.

Multiply them together: authority = frequency × prominence. A brand mentioned 100% of the time at position 1 gets 1.00. A brand mentioned 50% of the time at position 5 gets 0.50 × 0.39 ≈ 0.20.
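
The same arithmetic in a few lines of Python, assuming the log2 position decay described above (the real prominence averages over every appearance; this sketch uses a single representative position):

import math

def prominence(position: int) -> float:
    # Log decay: position 1 -> 1.00, 2 -> 0.63, 5 -> 0.39, 10 -> 0.29
    return 1.0 / math.log2(position + 1)

def prompt_authority(frequency: float, position: int) -> float:
    return frequency * prominence(position)

prompt_authority(1.0, 1)  # 1.00: mentioned every time, always first
prompt_authority(0.5, 5)  # ~0.19: mentioned half the time at position 5 (the 0.50 × 0.39 example)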

Step 5: Aggregate across prompts, modes, and models

We compute the prompt-level scores, then aggregate in this order:

  1. Weighted mean across all prompts within a single mode for a single model (weights = log-scaled search volume × intent multiplier)
  2. 50/50 mean of recall and retrieval, giving the combined per-model score
  3. Mean across the three models (ChatGPT, Claude, Gemini), giving the headline LLM Authority Score

All intermediate numbers are visible in the dashboard detail view, so when a score is low, you can see exactly which mode, which model, or which intent is pulling it down.
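
In code, that aggregation order looks roughly like this. The dictionary layout and score shapes are illustrative, not the production schema:

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def aggregate(per_prompt, weights, models=("chatgpt", "claude", "gemini")):
    # per_prompt[model][mode] -> list of prompt-level authorities (0..1), one per prompt
    # weights -> list of prompt weights (log-scaled volume x intent multiplier), same order
    per_model = {}
    for model in models:
        recall = weighted_mean(per_prompt[model]["recall"], weights)
        retrieval = weighted_mean(per_prompt[model]["retrieval"], weights)
        per_model[model] = 100 * (recall + retrieval) / 2   # 50/50 blend of modes, 0-100 scale
    headline = sum(per_model.values()) / len(per_model)     # mean across the three models
    return per_model, headline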

Step 6: Calibrate so category leaders hit the low 90s

One more step before the number is displayed. The raw frequency × prominence formula compresses everyone downward: a category leader appearing in 80% of runs at average position 2-3 maxes out around 57-58 raw, not 100. The structural reason is that even iconic brands don't sit at position 1 on every single run, so prominence multiplies below 1.

We calibrate the raw score so the empirical top-tier brands in a category land where they intuitively should — in the low 90s, not the high 50s. The calibration is simple: subtract a small confidence floor (to zero out brands that were mentioned once in passing) and multiply by a scale factor anchored to the observed ceiling. A brand the model never mentions stays at 0. A brand mentioned in 80% of runs at position 2-3 climbs to ~92.

Both the raw and calibrated values are stored, so if we retune the calibration later it's a math change, not a re-run.
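
A sketch of the calibration itself, with assumed constants; the floor and ceiling values below are illustrative, not the production numbers:

RAW_FLOOR = 5        # illustrative confidence floor: zeroes out one-off passing mentions
RAW_CEILING = 58     # illustrative raw ceiling observed for a category leader
TARGET_CEILING = 92  # where that leader should land on the displayed scale

def calibrate(raw: float) -> float:
    # Map raw frequency x prominence (0-100) onto the displayed 0-100 scale
    scale = TARGET_CEILING / (RAW_CEILING - RAW_FLOOR)
    return min(100.0, max(0.0, raw - RAW_FLOOR) * scale)

calibrate(0)   # 0.0  -> a never-mentioned brand stays at zero
calibrate(58)  # 92.0 -> a raw high-50s leader lands in the low 90s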


One Methodology Decision Worth Explaining: No Self-Referential Prompts

We explicitly exclude any prompt that names the user's brand from Authority scoring.

This sounds minor. It changes the numbers dramatically.

Here's why. A prompt like "DomCop vs SpamZilla for expired domain research" forces the model to talk about DomCop. The brand is guaranteed to be mentioned. That's not Authority, that's just inclusion. If we counted these prompts, a brand's score would be inflated by queries that name it, and brands with little organic visibility would look more visible than they really are.
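
The exclusion itself is a simple filter. A minimal sketch, assuming the brand's product-line variants are known up front:

def is_self_referential(prompt: str, brand: str, variants: tuple = ()) -> bool:
    # True if the prompt names the user's own brand or one of its product-line variants
    text = prompt.lower()
    return any(name.lower() in text for name in (brand, *variants))

def organic_prompts(prompts: list, brand: str) -> list:
    # Only prompts that never name the brand feed the Authority score
    return [p for p in prompts if not is_self_referential(p, brand)]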

What we learned from a made-up brand

During methodology testing we created an entirely fictional brand, "AcmeWidgetsXYZ", and scored it. With self-referential prompts included (prompts like "AcmeWidgetsXYZ vs DigiKey"), the fake brand scored 16 out of 100, suggesting it had real Authority. After we excluded self-referential prompts, the score correctly dropped to 0, exactly where a nonexistent brand belongs.

The lesson: self-referential prompts look informative, but they're measuring a brand's ability to force its own name into a query it authored, not its ability to earn mentions organically.

Here's what the exclusion does to other brands we tested:

Brand                 | With self-referential prompts | Without (honest Authority) | Delta
Ahrefs                | 88                            | 86                         | −2 (barely moves, earned authority)
DomCop                | 48                            | 37                         | −11 (artificial lift from named queries)
AcmeWidgetsXYZ (fake) | 16                            | 0                          | −16 (collapses to true value)

A brand with strong genuine Authority doesn't need to be named in prompts to show up. A brand whose score depends on being named has a visibility problem that Authority needs to expose, not hide.

If you want to know how you rank specifically vs a named competitor (a legitimate question), we surface that in a separate Competitor Comparison view, where it can be tracked cleanly without contaminating the main Authority number.


Three Authority Patterns We've Observed

During methodology testing we measured three brands: two real ones (Ahrefs and DomCop) and our made-up AcmeWidgetsXYZ. Each produced a recognizably different Authority pattern. Your brand will fall into one of these, or into one of the expected-but-not-yet-observed patterns noted at the end.

Pattern 1 · Recall-Led Leader
Combined: 80–90

Example: Ahrefs (ahrefs.com)

Combined: 82 · Recall: 86 · Retrieval: 78 · Fragility gap: +8

Ahrefs dominates its category in both modes, but dominates harder in recall. Across 10 organic SEO prompts in our test, Ahrefs appeared in 9 of 10 recall responses, usually at position 1 or 2. In retrieval mode it still appeared in 8 of 10, but the specialized competitors the web pulls up (BrightEdge, Conductor, and smaller players) pushed Ahrefs down a notch on several queries.

A positive fragility gap (recall higher than retrieval) means the brand has a training-data moat. The model remembers them as the category default from years of authoritative coverage. Current web content is slowly catching competitors up, but the moat still holds.

What to do if you're in this pattern
  • Protect the moat. Your position is strong but not permanent. Competitors that invest heavily in authoritative coverage over the next 12 to 18 months will erode your retrieval advantage first, then your recall advantage.
  • Monitor the gap closing. If your retrieval score drops relative to recall, a new entrant is gaining ground. That's a leading indicator, not a lagging one.
  • For new product launches, push hard for independent reviews and comparisons. Your recall authority won't automatically extend to new SKUs. Each one needs its own coverage.
  • Watch transactional-intent queries carefully. Even dominant brands often score poorly on "cheapest" or "best free tier" prompts because those favor specialized positioning. If you have a free or entry tier, make sure content explicitly frames it as such.

Pattern 2 · Retrieval-Led Challenger
Combined: 30–55

Example: DomCop (domcop.com)

Combined: 42 · Recall: 37 · Retrieval: 47 · Fragility gap: −10

DomCop is a niche expired-domain marketplace. The pattern is the inverse of Ahrefs: the model has modest recall of them, but when web search is enabled, retrieval pulls them into far more responses. Recall frequency across 10 prompts was 40%; retrieval frequency jumped to 100%. The brand is better known on the current web than in model training data.

A negative fragility gap (retrieval higher than recall) is the signature of an emerging brand, or one whose PR and content push is more recent than the last major model training cycle. The good news: you're on the web. The bad news: if a model answers in recall mode, you often don't surface.

For DomCop specifically, we also saw a positioning issue within retrieval: the brand appears in most responses but consistently at positions 5 through 9, not near the top. The model treats DomCop as a meta-tool, not as the primary "marketplace" answer. So Authority for DomCop is mentioned-everywhere-but-never-first, a distinct sub-pattern worth flagging.

What to do if you're in this pattern
  • Your short-term lever is retrieval. Content that web search will rank well (fresh articles, comparisons, well-cited pages) will keep you surfacing in retrieval. Focus there first.
  • Your long-term lever is recall. That means the same authoritative-coverage playbook as LBA: press mentions, Wikipedia if you qualify, podcast appearances with transcripts, industry listings. Anything that makes it into the next model training cycle.
  • If you're mentioned everywhere but positioned low (like DomCop), your problem is framing, not visibility. The model has you categorized wrong. Seed content that frames you in the exact category you want to own, using the same phrasing users use in their queries.
  • Expect LBA-driven recall gains to take 12 to 24 months. Retrieval wins come faster (weeks to months as search systems re-index your content). Plan both tracks simultaneously.

Pattern 3 · Floor (The Brand Is Effectively Invisible)
Combined: 0–10

Example: AcmeWidgetsXYZ, a made-up brand we tested to validate the methodology's floor

Combined: 0 · Recall: 0 · Retrieval: 0 · Fragility gap: 0

AcmeWidgetsXYZ doesn't exist. We invented it to see what the methodology would report for a genuinely unknown brand. Across 10 organic prompts in our test set, zero mentioned it in recall mode, and zero mentioned it in retrieval mode (web search correctly found nothing to surface). The Authority Score was 0, which is exactly right.

A zero-in-both-modes score with a zero gap is the floor signature. It means the model doesn't know you, and current web content doesn't know you either. This is where every brand starts. It's also where brands land after a name change or rebrand, or after years of coasting on a domain with no fresh authoritative coverage.

It's worth distinguishing from Pattern 2 (retrieval-led). A Pattern 2 brand has at least some retrieval authority, because there's content on the web to pull. A floor brand has neither training-data authority nor web-search authority, and no single action will fix that in the short term.

What to do if you're in this pattern
  • Don't chase Authority directly. Build the inputs first. Your score is low because there's nothing for the model to find. Generating search-optimized content on your own site won't help if nobody credible is linking to or citing you.
  • Quick wins for retrieval: Wikipedia (if you qualify), Crunchbase, LinkedIn company page, industry listings, a few press mentions. These are the baseline sources that need to exist before any retrieval lifting happens.
  • Medium-term: founder podcast appearances, opinion pieces under the founder's byline on industry sites, press coverage of product launches, comparison articles written by independent third parties.
  • Long-term: the recall lift from LBA takes 12 to 24 months tied to model training cycles. There's no shortcut.
  • While you build, focus on the retrieval side of Authority. Fresh, well-cited content gives you retrieval visibility faster than recall-side LBA work, which follows training cycles.

Three more patterns the methodology will catch

We haven't yet seen these in testing, but the methodology is designed to surface them. Brief descriptions:

Expected Pattern       | Signature                                                                                                                                                  | Typical example
Category Ruler         | High in both modes (85+), gap near zero. Iconic brand with matched training-data and current-web presence.                                                | Nike, Shopify, HubSpot in their primary categories
Oscillating Challenger | Mid-range combined score (40–60) but high intra-run variance: the brand appears strongly in some iterations and not at all in others.                     | A strong-but-newer competitor that hasn't stabilized in the model's weights yet
Shadow Authority       | High frequency (80%+) but low prominence (average position 7+). Mentioned a lot but never at the top; frustrating because the numbers look okay on paper. | A category-adjacent tool the model always lists but never centers

How Your Authority Score Is Calculated

Walk-through of the full formula.

Per-prompt score

For each organic prompt (a prompt that does NOT name your brand):

Per-prompt formula
frequency_p  = (iterations where your brand appeared) / (total iterations)
prominence_p = mean across appearances of 1 / log2(position + 1)
authority_p  = frequency_p × prominence_p

Position scores: position 1 = 1.00, position 2 = 0.63, position 3 = 0.50, position 5 = 0.39, position 10 = 0.29. Log decay reflects how user attention actually drops off through a list.
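
The full per-prompt computation, written out under the formula above. The input is one entry per iteration: the brand's first-mention position, or None when it didn't appear. A sketch, not the production code:

import math

def per_prompt_authority(positions):
    # positions: e.g. [1, 2, 2, 3, None] for five iterations
    appearances = [p for p in positions if p is not None]
    if not appearances:
        return 0.0
    frequency = len(appearances) / len(positions)
    prominence = sum(1.0 / math.log2(p + 1) for p in appearances) / len(appearances)
    return frequency * prominence

per_prompt_authority([1, 2, 2, 3, None])  # ~0.55: appeared in 4 of 5 runs, near the top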

Weighting across prompts

Not every prompt deserves equal weight. Prompts with higher search volumes represent more user intent. Prompts in discovery intent are more important for top-of-funnel visibility than problem-solving prompts.

Prompt weighting
weight_p = log(1 + search_volume_p) × intent_multiplier

Intent multipliers:
  discovery       1.0
  comparison      0.9   (category-level only; self-referential excluded)
  transactional   0.9
  problem         0.6

Mode-level score

Within each mode (recall or retrieval), take the weighted mean of prompt-level authorities:

Mode-level score
authority_mode = weighted_mean(authority_p) × 100
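
Both the weighting and the mode-level mean in code. The intent multipliers are the ones listed above; everything else is a sketch:

import math

INTENT_MULTIPLIER = {"discovery": 1.0, "comparison": 0.9, "transactional": 0.9, "problem": 0.6}

def prompt_weight(search_volume: int, intent: str) -> float:
    return math.log(1 + search_volume) * INTENT_MULTIPLIER[intent]

def mode_score(prompt_authorities, weights) -> float:
    # Weighted mean of per-prompt authorities (0..1), reported on a 0-100 scale
    return 100 * sum(a * w for a, w in zip(prompt_authorities, weights)) / sum(weights)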

Combined per-model, then across models

Equal-weight blend of recall and retrieval for each of the three models, then mean across models for the headline number:

Combined hierarchy
authority_per_model = (recall_score + retrieval_score) / 2

authority_overall   = mean(authority_chatgpt, authority_claude, authority_gemini)

The dashboard shows the overall score as the headline, with per-model and per-mode numbers visible one click away. When the overall score is low, those breakdowns tell you exactly which part of the visibility stack is failing.


The Fragility Gap: Why Recall vs Retrieval Matters

The single most useful diagnostic in Authority scoring isn't the headline number. It's the gap between the recall score and the retrieval score.

Gap pattern                       | What it means                                                                                                                                                   | What to do
Recall > Retrieval (positive gap) | Training-data authority outpaces current web footprint. You're coasting on legacy coverage; competitors publishing fresh content are gaining retrieval ground. | Invest in current authoritative coverage to keep retrieval from decaying further. Monitor for gap expansion.
Retrieval > Recall (negative gap) | Current web knows you, training data lags. You're an emerging brand whose coverage hasn't landed in model training cycles yet.                                 | Continue retrieval-facing work while building the authoritative-mention pipeline that will feed next-cycle recall.
Both high, gap near zero          | Category Ruler. Matched training and current-web authority.                                                                                                    | Protect. Monitor competitors' retrieval growth for early signs of erosion.
Both low, gap near zero           | Floor. Neither path knows you.                                                                                                                                  | Build from zero. Both tracks in parallel.

Why this gap is the real signal

A headline score of 42 could mean many things. Recall 37 / Retrieval 47 tells you you're a retrieval-led challenger whose training-data presence hasn't caught up yet. Recall 50 / Retrieval 34 would tell the opposite story: a brand losing its retrieval position as newer competitors out-publish it. Two brands with the same headline score can have completely different strategic pictures.
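
For readers who want the table above as logic, here is a small classifier over the recall/retrieval pair. The thresholds are illustrative, not the production cutoffs:

def gap_pattern(recall: float, retrieval: float, high=75, low=15, near_zero=5) -> str:
    gap = recall - retrieval
    if recall >= high and retrieval >= high and abs(gap) <= near_zero:
        return "category ruler"
    if recall <= low and retrieval <= low and abs(gap) <= near_zero:
        return "floor"
    return "recall-led (positive gap)" if gap > 0 else "retrieval-led (negative gap)"

gap_pattern(86, 78)  # recall-led, the Ahrefs example
gap_pattern(37, 47)  # retrieval-led, the DomCop example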


The Universal Playbook for Improving Authority

Authority draws from multiple inputs (LBA, TOM, web content). The levers to improve it, regardless of which pattern you're in, all come down to six categories.

1. Authoritative category coverage

Articles on domains that AI training pipelines actually crawl, framing your brand in the exact category phrasing users query for. "Best CRM for small businesses" repeated across 100 sources moves your Authority more than one article on a higher-authority domain. Density beats prestige for this metric.

2. Fresh, well-cited retrieval content

For retrieval-mode Authority, keep publishing. Every month without fresh content is a month competitors publish and displace you in web-search results. Dates matter, citations matter, structured formats help.

3. Category-phrase consistency

Use the same category words across every property you control. If users search "backlink analysis tools", your content needs to say "backlink analysis" verbatim, not "link intelligence platform" or "referring-domain research suite". The model learns the phrase your customers use.

4. Named-product content density

When the model has specific product names associated with your brand, it uses them in responses. Write content where your products show up by name. Get external content to do the same. This converts category awareness into product awareness, which is a much higher-value Authority position.

5. Structured data on your own site

Product schema, organization schema, FAQ schema. Retrieval systems use these to structure what they surface. Free to implement, compounds over years.
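
As a starting point, here is a minimal Organization block in schema.org JSON-LD, generated with Python for illustration; every value is a placeholder to swap for your own:

import json

organization_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Brand",                      # placeholder
    "url": "https://www.example.com",             # placeholder
    "sameAs": [                                   # profiles retrieval systems can cross-reference
        "https://www.crunchbase.com/organization/example-brand",
        "https://www.linkedin.com/company/example-brand",
    ],
}

print(json.dumps(organization_schema, indent=2))  # embed in a <script type="application/ld+json"> tag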

6. Patience and consistency

Authority moves on two timescales: retrieval in weeks to months as web search systems re-index your content, recall in 12 to 24 months tied to major model training cycles. Both require steady effort. Spiky content campaigns underperform steady drip-feed coverage.


What Comes Next

Authority is the closest-to-user metric in the three-metric stack, but it sits on top of two upstream measurements:

  • Latent Brand Association is the deepest layer, what the model believes about your brand. Weak LBA caps your recall Authority.
  • Top of Mind is the recall-only measurement of whether the model surfaces you at all in its first-choice responses. Weak TOM shows up as low recall Authority.

A low Authority Score is a symptom. The deep-dive pages for LBA and TOM (linked from your dashboard) tell you which upstream factor is driving the problem.

← Latent Brand Association (metric 1) · Top of Mind (metric 3) → · Back to all three metrics