phidea
Reference · page 1 / 6

1. Instrumentation — measure your citation share


This page is the technical companion to the plain "what the data says" page. It explains how to actually instrument citation-share measurement on a weekly cadence — the same probe pattern Phidea uses to populate the observation tool.

The core idea

Run the same buyer-shaped query against multiple LLMs, multiple times, weekly, and capture three things per response:

  1. First-named carrier: the carrier the answer names first (or carriers, if several are named together at the top of the answer).
  2. Citation hosts: the unique hostnames the LLM cites as sources.
  3. Optional: the full answer text, for offline review and entity extraction.

Aggregate across runs to compute modal-carrier-with-count (e.g., "USAA 4/5 on Perplexity") and unique-hostname citations. Deltas week-over-week tell you whether your carrier is gaining or losing ground.
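As a rough sketch of the aggregation step, the snippet below counts how often a carrier is named first and compares two weeks. The record shape and the names `ProbeRecord`, `carrierShare`, `formatShare`, and `shareDelta` are assumptions made for illustration, not Phidea's actual schema.

```typescript
// Hypothetical record for one probe run; field names are illustrative.
interface ProbeRecord {
  query: string;
  llm: string;
  run: number;
  firstCarrier: string | null; // null when no known carrier was named
  citationHosts: string[];
  answerText?: string; // optional, kept for offline review
}

interface Share {
  hits: number;
  runs: number;
}

// Count how many runs named `carrier` first, e.g. 4 of 5.
function carrierShare(records: ProbeRecord[], carrier: string): Share {
  const hits = records.filter((r) => r.firstCarrier === carrier).length;
  return { hits, runs: records.length };
}

// Render the "USAA 4/5 on Perplexity" style summary.
function formatShare(carrier: string, share: Share, llmLabel: string): string {
  return `${carrier} ${share.hits}/${share.runs} on ${llmLabel}`;
}

// Week-over-week change in share, as a fraction of runs.
function shareDelta(lastWeek: ProbeRecord[], thisWeek: ProbeRecord[], carrier: string): number {
  const frac = (s: Share) => (s.runs === 0 ? 0 : s.hits / s.runs);
  return frac(carrierShare(thisWeek, carrier)) - frac(carrierShare(lastWeek, carrier));
}
```

A negative delta means the carrier was named first in a smaller fraction of runs than the week before.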

Anatomy of one probe run

Pseudocode for the inner loop:

```
for query in queries:
  for llm in llms:
    all_citations = []
    for run in 1..N:
      response = llm.query(query)
      first_carrier = extract_first_named_carrier(response.text, carrier_pattern_list)
      citations = response.grounding_urls or response.citations
      all_citations.extend(citations)
      record(query, llm, run, first_carrier, citations)
    # after all N runs for this (query, llm) pair
    modal = compute_modal(records[query][llm])
    write_observation(query, llm, modal, unique(all_citations))
```

The Phidea production version is in `scripts/observation-probe.ts` in the public repository. It calls the LLM provider APIs directly (not via aggregation services), captures grounding metadata, and writes both raw JSON and a structured `Observation[]` ready to commit.

Why it's not just "ask ChatGPT once"

Three reasons the Phidea probe pattern beats ad-hoc spot-checking:

1. LLM responses are non-deterministic. Even with temperature 0, the same prompt to the same model in the same hour produces different first-named-carriers ~30% of the time. Five runs per LLM is the empirical floor for stability. We've tested 3-run and 5-run cadences; 5-run produces meaningfully tighter modal counts.

2. Cross-LLM is non-redundant. Perplexity (sonar-pro with web search) and Gemini (2.5-pro with google_search grounding) read different citation surfaces. A carrier that wins on Perplexity but not Gemini is a carrier with strong comparison-site presence and weak Google-indexed presence — a different position than winning on both. The Phidea observation tool labels findings "clear" only when ≥3/5 modal holds on both LLMs.

3. Time stability is its own measurement. A single-day result can be a single-day editorial-state accident. A finding that holds across multiple weeks is meaningfully different from one that flips. Phidea's time-stability retests are the dataset that revealed which round-2 findings hold and which drifted.
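The "clear" rule from point 2 can be sketched as a small predicate. This assumes that "holds on both LLMs" means the same carrier is modal on both with a count of at least 3; `ModalResult` and `isClearFinding` are hypothetical names, and the real observation tool may apply the rule differently.

```typescript
interface ModalResult {
  value: string | null; // modal first-named carrier, or null if none
  count: number; // how many runs named it first
}

// True only when both LLMs agree on the modal carrier and each
// reaches the minimum count (3 of 5 by default).
function isClearFinding(a: ModalResult, b: ModalResult, minCount = 3): boolean {
  if (a.value === null || b.value === null) return false;
  return a.value === b.value && a.count >= minCount && b.count >= minCount;
}
```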

The provider APIs

For the two LLMs Phidea probes against today (Perplexity + Gemini, both with web grounding), here's what to call:

Perplexity Sonar Pro

```http
POST https://api.perplexity.ai/chat/completions
Authorization: Bearer <PERPLEXITY_API_KEY>
Content-Type: application/json

{
  "model": "sonar-pro",
  "messages": [{ "role": "user", "content": "<your buyer query>" }],
  "max_tokens": 1200
}
```

Returns:

  • `choices[0].message.content`: the answer text
  • `citations`: array of cited URLs
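A minimal TypeScript sketch of the same call, split into a request builder and a response parser so the parsing can be exercised without network access. The helper names `buildSonarRequest` and `parseSonarResponse` are hypothetical, and the response type covers only the two fields listed above.

```typescript
// Only the fields this page relies on; the real response carries more.
interface SonarResponse {
  choices?: Array<{ message?: { content?: string } }>;
  citations?: string[];
}

// Build the URL and fetch options for one Sonar Pro query.
function buildSonarRequest(query: string, apiKey: string) {
  return {
    url: "https://api.perplexity.ai/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "sonar-pro",
        messages: [{ role: "user", content: query }],
        max_tokens: 1200,
      }),
    },
  };
}

// Pull out the answer text and cited URLs, tolerating missing fields.
function parseSonarResponse(json: SonarResponse): { text: string; citations: string[] } {
  return {
    text: json.choices?.[0]?.message?.content ?? "",
    citations: json.citations ?? [],
  };
}

// Usage (requires a runtime with a global fetch):
//   const { url, init } = buildSonarRequest("<your buyer query>", apiKey)
//   const parsed = parseSonarResponse(await (await fetch(url, init)).json())
```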

Gemini 2.5 Pro with grounding

```http
POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent?key=<GEMINI_API_KEY>
Content-Type: application/json

{
  "contents": [{ "role": "user", "parts": [{ "text": "<your buyer query>" }] }],
  "tools": [{ "google_search": {} }]
}
```

Returns:

  • `candidates[0].content.parts[].text`: the answer text
  • `candidates[0].groundingMetadata.groundingChunks[].web.uri`: cited URLs
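The two response paths above can be read with a small parser like the one below. This is a sketch: `parseGeminiResponse` is a hypothetical helper, the type models only the fields named on this page, and joining the `parts` text with no separator is an assumption.

```typescript
// Only the fields this page relies on; the real response carries more.
interface GeminiResponse {
  candidates?: Array<{
    content?: { parts?: Array<{ text?: string }> };
    groundingMetadata?: {
      groundingChunks?: Array<{ web?: { uri?: string } }>;
    };
  }>;
}

// Concatenate the text parts and collect grounding URLs from the first candidate.
function parseGeminiResponse(json: GeminiResponse): { text: string; citations: string[] } {
  const candidate = json.candidates?.[0];
  const text = (candidate?.content?.parts ?? []).map((p) => p.text ?? "").join("");
  const citations = (candidate?.groundingMetadata?.groundingChunks ?? [])
    .map((chunk) => chunk.web?.uri)
    .filter((uri): uri is string => typeof uri === "string"); // drop chunks without a web URI
  return { text, citations };
}
```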

Both APIs work for production probes today (verified 2026-05-04 in the Phidea probe runs). Rate limits are generous; back off and retry on 429/503 responses.
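One way to handle 429/503 is exponential backoff around the request. The sketch below is an assumption about how that could look, not the production probe's retry logic: the 1-second base delay, 30-second cap, and 4 retries are arbitrary choices, and `withBackoff` and `backoffDelayMs` are hypothetical names.

```typescript
// Status codes worth retrying, per the note above.
const RETRYABLE_STATUS = new Set([429, 503]);

// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Call `call` until it returns a non-retryable status or retries run out.
// `sleep` is injectable so the logic can be tested without real waiting.
async function withBackoff<T extends { status: number }>(
  call: () => Promise<T>,
  maxRetries = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    if (!RETRYABLE_STATUS.has(res.status) || attempt >= maxRetries) return res;
    await sleep(backoffDelayMs(attempt));
  }
}

// Usage: const res = await withBackoff(() => fetch(url, init))
```

If retries are exhausted, the last response is returned as-is, so the caller still has to check its status.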

First-named-carrier extraction

A regex pattern over a maintained carrier list works surprisingly well. The Phidea pattern library lives at `scripts/configs/_carriers.ts` and tracks ~40 home carriers, ~40 auto carriers, ~25 commercial-cyber carriers, ~25 commercial-lines carriers, ~30 life carriers.

Pseudocode:

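```
// Return the carrier whose name appears earliest in the answer text,
// or null when no carrier from the list is mentioned.
function firstCarrier(text: string, patterns: string[]): string | null {
  let earliest: string | null = null
  let earliestIdx = Infinity
  for (const c of patterns) {
    const idx = text.indexOf(c) // case-sensitive substring match
    if (idx !== -1 && idx < earliestIdx) {
      earliest = c
      earliestIdx = idx
    }
  }
  return earliest
}
```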

Edge cases:

  • The carrier name appears in the response but as part of a larger phrase. "Compared to Allstate, Acme Insurance offers..." — Allstate gets returned even though Acme is the recommendation. Mitigation: skip carriers that appear after "compared to," "instead of," "unlike," or in similar comparison constructs.
  • Holding-company vs subsidiary names. "Liberty Mutual" wins a query; the response actually says "Safeco" (a Liberty Mutual subsidiary). Mitigation: maintain a holding→subsidiary alias table; merge or distinguish based on what your question is actually testing.
  • Comparison sites named first instead of carriers. "NerdWallet recommends..." — NerdWallet is not a carrier. Mitigation: don't include comparison-site names in the carrier pattern list. Phidea's library is carriers-only.
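The first two mitigations can be sketched as an extended extractor. Everything here is illustrative: `firstRecommendedCarrier`, the cue list, the 20-character look-behind window, and the one-entry alias table are assumptions, not the contents of Phidea's pattern library.

```typescript
// Hypothetical alias table: subsidiary name -> holding-company name.
const CARRIER_ALIASES: Record<string, string> = { Safeco: "Liberty Mutual" };

// Phrases that introduce a carrier as a comparison point, not a recommendation.
const COMPARISON_CUES = ["compared to", "instead of", "unlike"];

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Earliest carrier mention that is not directly preceded by a comparison cue.
function firstRecommendedCarrier(
  text: string,
  carriers: string[],
  mergeAliases = true,
): string | null {
  let best: { name: string; idx: number } | null = null;
  for (const name of carriers) {
    const re = new RegExp(`\\b${escapeRegExp(name)}\\b`, "gi");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      // Look at the few characters just before the mention.
      const before = text.slice(Math.max(0, m.index - 20), m.index).toLowerCase().trimEnd();
      if (COMPARISON_CUES.some((cue) => before.endsWith(cue))) continue;
      if (best === null || m.index < best.idx) best = { name, idx: m.index };
      break; // the first usable mention of this carrier is enough
    }
  }
  if (best === null) return null;
  // Optionally report the holding company instead of the subsidiary.
  return mergeAliases ? CARRIER_ALIASES[best.name] ?? best.name : best.name;
}
```

Whether to merge subsidiaries into the holding company depends on the question being tested, which is why `mergeAliases` is a parameter rather than fixed behavior.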

Build a regression-test suite over real probe responses as you tune the extractor. Each new false positive becomes a test case.
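A regression suite for the extractor can be as small as a table of cases and a loop. The harness below is a sketch with hypothetical names (`RegressionCase`, `runRegression`), and the sample texts are made up for illustration rather than taken from real probe responses.

```typescript
type Extractor = (text: string, carriers: string[]) => string | null;

interface RegressionCase {
  name: string;
  text: string; // in practice, a stored probe response
  carriers: string[];
  expected: string | null;
}

// Run every case and return a human-readable line per failure.
function runRegression(extract: Extractor, cases: RegressionCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const got = extract(c.text, c.carriers);
    if (got !== c.expected) {
      failures.push(`${c.name}: expected ${c.expected}, got ${got}`);
    }
  }
  return failures;
}

// Naive earliest-mention extractor, used here only to show a failing case.
const naiveExtractor: Extractor = (text, carriers) => {
  let earliest: string | null = null;
  let earliestIdx = Infinity;
  for (const c of carriers) {
    const idx = text.indexOf(c);
    if (idx !== -1 && idx < earliestIdx) {
      earliest = c;
      earliestIdx = idx;
    }
  }
  return earliest;
};

const sampleCases: RegressionCase[] = [
  {
    name: "plain mention",
    text: "USAA is the top pick here.",
    carriers: ["USAA", "Allstate"],
    expected: "USAA",
  },
  {
    name: "comparison construct",
    text: "Compared to Allstate, Acme Insurance offers lower premiums.",
    carriers: ["Allstate", "Acme Insurance"],
    expected: "Acme Insurance",
  },
];
```

Running `runRegression(naiveExtractor, sampleCases)` reports the comparison-construct case as a failure, which is the signal to tune the extractor and keep the case in the suite.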

Modal computation

Trivial when you have it right:

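```
// Most frequent first-named carrier across runs, with its count.
// Runs where no carrier was extracted (null) are ignored.
function modal(carriers: Array<string | null>): { value: string | null; count: number } {
  const valid = carriers.filter((c): c is string => Boolean(c))
  if (valid.length === 0) return { value: null, count: 0 }
  const counts = new Map<string, number>()
  for (const c of valid) counts.set(c, (counts.get(c) || 0) + 1)
  // sort() is stable, so carriers with equal counts keep first-seen order
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1])
  return { value: sorted[0][0], count: sorted[0][1] }
}
```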

Tie-breaking on equal counts: pick the alphabetically-first or the one that appeared first chronologically. Doesn't matter operationally; document your choice and stick with it.
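If you choose the alphabetical rule, it is worth making it explicit in the comparator rather than relying on insertion order. A minimal sketch, with the hypothetical name `modalAlphabeticalTieBreak`:

```typescript
// Modal carrier with an explicit tie-break: highest count first,
// then alphabetically-first name among equal counts.
function modalAlphabeticalTieBreak(
  carriers: Array<string | null>,
): { value: string | null; count: number } {
  const counts = new Map<string, number>();
  for (const c of carriers) {
    if (c !== null && c !== "") counts.set(c, (counts.get(c) ?? 0) + 1);
  }
  const sorted = Array.from(counts.entries()).sort((a, b) => {
    if (b[1] !== a[1]) return b[1] - a[1]; // higher count wins
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0; // then plain string order
  });
  if (sorted.length === 0) return { value: null, count: 0 };
  return { value: sorted[0][0], count: sorted[0][1] };
}
```

The comparison uses plain string ordering rather than locale-aware collation, so results do not vary with the machine's locale.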

Citation-host extraction

Reduce raw URLs to unique hostnames so the data is human-readable:

```
function citationHosts(allCitations: string[]): string[] {
  const set = new Set<string>()
  for (const url of allCitations) {
    try {
      const u = new URL(url)
      // strip a leading "www." but keep any other subdomain
      set.add(u.hostname.replace(/^www\./, ""))
    } catch {
      /* skip malformed */
    }
  }
  return [...set].sort()
}
```

Strip `www.` consistently; keep the rest of the host, because subdomains matter: `thezebra.com` and `auto.thezebra.com` are different surfaces. For per-vertical analysis you might also want to bucket by TLD or by domain category (comparison-site vs carrier-direct vs trade-press), but the raw host list is the right base layer.
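Bucketing by domain category can sit on top of the raw host list as a lookup. The sketch below assumes a hand-maintained table; the entries in `HOST_CATEGORIES` and the fallback from a subdomain to its parent domain are illustrative choices, not Phidea's classification.

```typescript
type HostCategory = "comparison-site" | "carrier-direct" | "trade-press" | "other";

// Hypothetical lookup table; extend it with the hosts you actually see.
const HOST_CATEGORIES: Record<string, HostCategory> = {
  "thezebra.com": "comparison-site",
  "nerdwallet.com": "comparison-site",
  "usaa.com": "carrier-direct",
  "insurancejournal.com": "trade-press",
};

// Look up the full host first, then progressively shorter parent domains.
function categorizeHost(host: string): HostCategory {
  const labels = host.toLowerCase().split(".");
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i).join(".");
    if (Object.prototype.hasOwnProperty.call(HOST_CATEGORIES, candidate)) {
      return HOST_CATEGORIES[candidate];
    }
  }
  return "other";
}

// Group hosts by category while leaving the host strings untouched.
function bucketHosts(hosts: string[]): Record<HostCategory, string[]> {
  const buckets: Record<HostCategory, string[]> = {
    "comparison-site": [],
    "carrier-direct": [],
    "trade-press": [],
    other: [],
  };
  for (const h of hosts) buckets[categorizeHost(h)].push(h);
  return buckets;
}
```

The buckets keep the full hostnames, so `auto.thezebra.com` still appears as its own surface inside the comparison-site group.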

Cadence

Phidea's cadence as of mid-2026:

  • Weekly probes for active levers (e.g., the multi-query observation tool's price-anchor / bundled / cyber probes).
  • Monthly probes for validated-study confirmations (the home + auto ablation suites).
  • Ad-hoc probes when an essay or pitch needs fresh data.

Weekly is the right floor. Daily produces noise (you'd run too many calls without learning much). Monthly misses real shifts (the Coalition-cyber drift happened in 8 days; a monthly cadence would have missed the early signal).

What to do with the output

Three downstream uses, in order of operational impact:

1. Drift alerts. A flipped modal carrier on a watched query should trigger a Slack message or PagerDuty alert. If your competitor is suddenly winning a query you used to win, you want to know within 24 hours, not 30 days.

2. Citation-host trend graphs. Track the unique-host count per query week over week. A query whose citation-host set is shifting (new sites entering, old ones falling out) is a query whose retrieval is unstable; treat it as a "watch closer" candidate.

3. Per-host carrier-mention monitoring. Once you know which 5-15 hosts an LLM cites for your top queries, set up monitoring on those hosts directly: when the host updates, when carriers are added/removed from its lists. The carriers a comparison site lists this week predict the carriers your LLMs will name next month.
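The first two uses reduce to comparing this week's observation with last week's. A minimal sketch follows; `WeeklyObservation`, `driftAlert`, and `hostChurn` are hypothetical names, and delivery to Slack or PagerDuty is deliberately left out, since the function only builds the message text.

```typescript
// Hypothetical weekly summary for one (query, LLM) pair.
interface WeeklyObservation {
  query: string;
  llm: string;
  modalCarrier: string | null;
  modalCount: number;
  runs: number;
  hosts: string[];
}

// Use 1: return an alert message when the modal carrier flipped, else null.
function driftAlert(prev: WeeklyObservation, curr: WeeklyObservation): string | null {
  if (prev.modalCarrier === curr.modalCarrier) return null;
  const fmt = (o: WeeklyObservation) => `${o.modalCarrier ?? "none"} ${o.modalCount}/${o.runs}`;
  return `Modal carrier flipped on "${curr.query}" (${curr.llm}): ${fmt(prev)} -> ${fmt(curr)}`;
}

// Use 2: which citation hosts entered or dropped out since last week.
function hostChurn(prev: string[], curr: string[]): { entered: string[]; dropped: string[] } {
  const prevSet = new Set(prev);
  const currSet = new Set(curr);
  return {
    entered: curr.filter((h) => !prevSet.has(h)).sort(),
    dropped: prev.filter((h) => !currSet.has(h)).sort(),
  };
}
```

A query with a non-empty `entered` or `dropped` list several weeks in a row is a natural "watch closer" candidate.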

What to NOT spend instrumentation budget on

  • Custom LLM hosting. Use the provider APIs. Self-hosted LLMs don't have grounding (no citations), which makes them useless for citation-share measurement.
  • Aggregation-service wrappers (Helicone, Langfuse, etc.). Useful for production AI features; overhead for the probe pattern. Direct API calls + a small TS script is the right shape.
  • Long-form transcript storage. Keep the answer text for 30 days for spot-checking; archive to JSON; don't build a searchable transcript database. The first-named-carrier + citation-host record is the durable signal.

The shortest path to instrumentation. Clone the Phidea observation-probe, point it at your top 10 buyer queries, drop your carrier names into the pattern library, run it weekly, and watch the modal column. Most of the value of this whole guide is downstream of running this probe consistently.