When One Benchmark Failed: How a 37% Citation Error Rate Changed Our View of Claude Opus 4.5

How a research team that trusted a single benchmark discovered widespread citation problems

In spring 2024 our applied-NLP research team supported product decisions at a mid-stage startup. We relied heavily on Perplexity Sonar Pro (version 1.2.4) as our primary external benchmark for model selection. Sonar Pro's dashboard reported strong "citation accuracy" and informed the vendor comparisons that drove procurement and prompt design. On 2024-05-10 we ran a sanity check on a targeted sample of 600 real user queries representative of our domain.

To our surprise, Sonar Pro's exported results showed a 37% citation error rate across that sample. That single number forced a re-evaluation of every decision we'd made using Sonar Pro as the gospel truth. Around the same time we retested Anthropic Opus 4.5 (released variant v4.5-202404) against our own rubric and found a negative AA-Omniscience Index while the FACTS score sat at 51.3. Those conflicting signals - decent FACTS but a negative AA-Omniscience Index - made it clear that no single benchmark, metric, or vendor view would answer our operational questions.

The Benchmark Overconfidence Problem: Why a 37% citation error breaks downstream trust

What does a 37% citation error rate actually mean in practice? It means that more than one in three model citations was wrong, irrelevant, or unverifiable under our verification procedure. For a product that auto-generates answers with citations, that rate translates directly into user mistrust, legal risk, and poor product metrics like task completion and trust retention.

We had compounded the problem by treating Sonar Pro's outputs as ground truth. Our procurement playbook used Sonar Pro rankings to select models. Our prompt templates assumed citations were verified by the model or external tool. When Opus 4.5 later returned a negative AA-Omniscience Index in our independent tests - a measure we use to capture overconfident assertions relative to verifiable facts - it became clear why Sonar Pro's citation metric diverged from our manual checks.

Two methodological problems caused the divergence:

    Sonar Pro's citation extractor counted any URL or source-like token as a "citation" without a strict verification check. That inflated citation coverage while masking false-positives.
    Opus 4.5's FACTS score (51.3) captured statement-level factual precision but did not penalize confident assertions that had no verifiable source. The AA-Omniscience Index penalizes those and can go negative when confident hallucinations outnumber cautious truthful answers.
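To see how an index of this kind can go negative, here is a minimal sketch. The scoring rule (+1 for a correct answer, -1 for a confidently wrong answer, 0 for an abstention, averaged over all items) is an assumption made for illustration; it is not the exact formula behind the AA-Omniscience Index or our rubric, and the function name `epistemic_index` is hypothetical.

```python
from collections import Counter

ALLOWED = {"correct", "wrong", "abstain"}

def epistemic_index(outcomes):
    """Average of +1 (correct), -1 (confidently wrong), 0 (abstained).

    Assumed scoring rule, for illustration only.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    unknown = set(counts) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["correct"] - counts["wrong"]) / len(outcomes)

# Synthetic example: a model that answers everything.
bold = ["correct"] * 45 + ["wrong"] * 55
# Synthetic example: a model that abstains when unsure.
cautious = ["correct"] * 40 + ["wrong"] * 10 + ["abstain"] * 50

print(epistemic_index(bold))      # -0.1
print(epistemic_index(cautious))  # 0.3
```

Under this assumed rule the bold model gets more answers right in absolute terms (45 versus 40) yet scores below zero, because its confident errors outnumber its correct answers.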

Designing a multi-metric evaluation: combining citation accuracy, FACTS, and AA-Omniscience

We needed a practical, repeatable evaluation that a product team could run every week. Our objective: measure both surface-level correctness (FACTS) and epistemic behavior (AA-Omniscience), plus an independent citation verification metric that mimics user expectations.

Key design choices we made:

    Use three orthogonal metrics rather than one. FACTS for factual precision, AA-Omniscience for overconfidence/hallucination tendency, and Citation Precision for verifiable source quality.
    Sample from live queries. We assembled 1,200 unique queries from production logs (period: 2024-02-01 to 2024-04-30) to reduce distribution mismatch between benchmark and product use.
    Enforce manual verification for citations. Each citation extract had to be checked by a human rater following a strict protocol: reachable link, content matches claim, and provenance timestamp within acceptable range.
    Report interrater reliability and p-values. We required Cohen's kappa > 0.7 for agreement on binary judgments and used bootstrap resampling to compute confidence intervals for metrics.
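The three-part citation protocol can be recorded as a small data structure so that Citation Precision is computed the same way every cycle. This is a sketch under stated assumptions: the `CitationCheck` schema, the function names, and the date window are hypothetical, and the judgments themselves still come from human raters.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CitationCheck:
    """One rater's judgment of a single citation (hypothetical schema)."""
    reachable: bool        # the link resolved when the rater opened it
    supports_claim: bool   # the page content matches the cited claim
    source_date: date      # provenance timestamp of the source

def is_verified(check, earliest, latest):
    """A citation passes only if all three protocol checks pass."""
    in_range = earliest <= check.source_date <= latest
    return check.reachable and check.supports_claim and in_range

def citation_precision(checks, earliest, latest):
    """Share of citations that pass manual verification."""
    if not checks:
        raise ValueError("need at least one citation check")
    passed = sum(is_verified(c, earliest, latest) for c in checks)
    return passed / len(checks)

# Assumed acceptance window; the real range depends on the domain.
window = (date(2023, 1, 1), date(2024, 4, 30))
checks = [
    CitationCheck(True, True, date(2024, 3, 2)),    # passes
    CitationCheck(True, True, date(2023, 11, 15)),  # passes
    CitationCheck(False, True, date(2024, 1, 10)),  # dead link
    CitationCheck(True, True, date(2019, 6, 1)),    # stale source
]
print(citation_precision(checks, *window))  # 0.5
```

Citation error rate is then simply one minus this precision.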

This approach addresses two failure modes: false citation positives from naive extractors, and false confidence reported by FACTS when unsupported assertions exist.

Running the 90-day evaluation: tests, data curation, and interrater checks

We executed the plan across a 90-day period from 2024-05-10 to 2024-08-07. The test consisted of three phases: pilot, scale test, and stress scenarios.

Pilot (2024-05-10 to 2024-05-20)

We validated the rubric on a 200-query pilot. Two raters scored each output for: factual correctness (binary), citation verifiability (binary), and confidence level (0-1 scale mapped to AA-Omniscience). Cohen's kappa for binary labels hit 0.79, which met our threshold.
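For readers who want to reproduce the agreement check, Cohen's kappa for two raters can be computed directly from the label lists. The sketch below uses the standard definition (observed agreement corrected for chance agreement); the labels are synthetic and the function name is our own.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)

# Synthetic binary labels (1 = citation verifiable, 0 = not).
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(f"{cohens_kappa(rater_a, rater_b):.2f}")  # 0.60
```

In this toy example the raters agree on 80% of items, yet kappa is only 0.60 once chance agreement is removed, which would fall short of a 0.7 threshold.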

Scale test (2024-05-21 to 2024-07-10)

We expanded to 1,000 queries. Raters worked in rotating pairs to avoid bias. For each query we captured outputs from three models: Perplexity Sonar Pro (1.2.4), Anthropic Opus 4.5 (v4.5-202404), and a control GPT-4 variant (gpt-4o-mini-2024-03). Each model's output included the model-provided citations when available.

Stress scenarios (2024-07-11 to 2024-08-07)

We curated 200 adversarial queries: time-sensitive facts, obscure citations, and multi-hop reasoning. These exposed cases where citation extractors and FACTS diverged from our human judgments.

Implementation notes that mattered:

    We normalized prompts to remove one-off noise but preserved structural variety: follow-ups, clarification-seeking, and multi-turn context.
    We logged model tokens, latency, and citation text.
    We did not accept vendor "trusted sources" flags as proof without manual verification.
    All statistical tests used two-sided bootstrap confidence intervals. Differences were considered meaningful only when 95% CI did not overlap.
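The bootstrap rule in the last note can be sketched in a few lines. This is a percentile bootstrap over per-item 0/1 error flags, which is one common variant and an assumption about the exact procedure; the flags below are synthetic, built to match two error rates on 1,000 items, and the function names are hypothetical.

```python
import random

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Two-sided percentile bootstrap CI for the mean of per-item values."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    n = len(values)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def intervals_overlap(a, b):
    """True if two (lower, upper) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Synthetic flags: 1 = citation error, 0 = verified citation.
flags_a = [1] * 370 + [0] * 630   # 37.0% error rate
flags_b = [1] * 214 + [0] * 786   # 21.4% error rate

ci_a = bootstrap_ci(flags_a)
ci_b = bootstrap_ci(flags_b)
print("model A 95% CI:", ci_a)
print("model B 95% CI:", ci_b)
print("meaningful difference:", not intervals_overlap(ci_a, ci_b))
```

Requiring non-overlapping intervals is a conservative rule: two models can differ significantly even when their individual intervals overlap slightly, so it errs toward declaring "no difference".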

From a single benchmark to a metric matrix: measurable outcomes after 90 days

Here are the headline numbers from the 1,200-query corpus across the three model conditions.

| Metric | Perplexity Sonar Pro (1.2.4) | Opus 4.5 (v4.5-202404) | GPT-4o-mini (control) |
| --- | --- | --- | --- |
| Citation Error Rate (manual verification) | 37.0% | 29.6% | 21.4% |
| FACTS Score (precision on factual claims) | 54.1 | 51.3 | 62.8 |
| AA-Omniscience Index (epistemic penalty; higher is better) | -0.05 | -0.28 | 0.12 |
| Average Latency (ms) | 480 | 520 | 430 |

Cohen's kappa (citation verification): 0.78

Interpretation of the numbers:

    Perplexity Sonar Pro's 37% citation error rate matched the exported figure that triggered our audit. That was higher than Sonar's dashboard implied because their extractor counted many placeholder tokens as valid citations.
    Opus 4.5's FACTS at 51.3 suggests it produced a decent amount of accurate factual content. But the AA-Omniscience Index at -0.28 reveals a systematic tendency toward confidently stated, unsupported claims. That combination is dangerous in customer-facing features because a user sees plausible facts paired with wrong or unverified citations.
    GPT-4o-mini scored best on both citation precision and AA-Omniscience. That made it the least risky in our use case, despite not being the top performer on raw FACTS in some narrow categories.

We also found patterns by query class. For time-sensitive queries (news, court rulings), citation error rate increased to 48% for Sonar Pro and 42% for Opus 4.5. The worst errors were stale citations pointing to archived pages that did not contain the claimed sentence. For obscure-source queries (industry white papers, niche blogs), citation precision fell below 60% across the board.
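A per-class breakdown like this needs only a tag on each query and a grouped average. The sketch below assumes each annotated record is a `(query_class, is_error)` pair; the class names and counts are synthetic and the function name is hypothetical.

```python
from collections import defaultdict

def error_rate_by_class(records):
    """Citation error rate per query class from (query_class, is_error) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for query_class, is_error in records:
        totals[query_class] += 1
        errors[query_class] += bool(is_error)
    return {cls: errors[cls] / totals[cls] for cls in totals}

# Synthetic records: 25 time-sensitive queries with 12 errors,
# 25 general queries with 8 errors.
records = (
    [("time_sensitive", True)] * 12 + [("time_sensitive", False)] * 13
    + [("general", True)] * 8 + [("general", False)] * 17
)
print(error_rate_by_class(records))
# {'time_sensitive': 0.48, 'general': 0.32}
```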

5 critical evaluation lessons that changed our procurement and product guards

Several lessons from this work became mandatory in the team's playbook.

One metric can be misleading

FACTS, citation precision, and AA-Omniscience each tell a part of the story. Treating any single number as a decision rule is asking the model to be an oracle. That's a poor risk model for production features.

Metric definitions matter

Sonar Pro's "citation" definition included any URL-like token. When you compare vendor benchmarks, verify their labeling rules. Two providers can report 80% "coverage" and still differ on actual verifiability.
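The effect of a loose "citation" definition is easy to reproduce. The sketch below contrasts a deliberately naive extractor with a stricter well-formedness filter. Both patterns are our own illustrations and are not Sonar Pro's actual extractor; note that even the stricter filter only checks URL shape and is no substitute for a human confirming that the page supports the claim.

```python
import re
from urllib.parse import urlparse

# Naive rule (illustrative): anything URL-like or bracketed counts.
NAIVE = re.compile(
    r"https?://\S*|www\.\S*|\[(?:\d+|source|citation needed)\]",
    re.IGNORECASE,
)

def naive_citations(text):
    """Every URL-like or source-like token, with no checks."""
    return NAIVE.findall(text)

def well_formed_citations(text):
    """Keep only tokens that parse as http(s) URLs with a dotted host."""
    kept = []
    for token in naive_citations(text):
        candidate = token.rstrip(".,;)")
        if not candidate.startswith(("http://", "https://")):
            continue
        parsed = urlparse(candidate)
        if "." in parsed.netloc:
            kept.append(candidate)
    return kept

answer = (
    "Revenue grew 12% [1]. See https://example.com/report.pdf and "
    "http://localhost for details [source]. More at www.placeholder"
)
print(len(naive_citations(answer)))        # 5
print(well_formed_citations(answer))       # ['https://example.com/report.pdf']
```

On this made-up answer the naive rule reports five "citations" while only one token is even a plausible public URL, which is the kind of gap that lets two vendors report the same coverage number with very different verifiability.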

Epistemic measures detect dangerous behavior

AA-Omniscience-like indices expose confident hallucinations. You can have mid-range FACTS but a negative AA-Omniscience Index, which flags the model as risky for unsupervised answers.

Human-in-the-loop checks are non-negotiable

Automated extractors miss context and misattribute claims. For production trust, add sampling-based human verification and monitor drift over time.

Run domain-specific stress tests regularly

Models degrade or change behavior with minor version updates. Weekly synthetic stress tests for time-sensitive and multi-hop queries caught regressions sooner than monthly audits.

How your team can replicate a robust LLM evaluation without overconfidence

Below is a practical checklist and a simple protocol you can adopt. Think of this as building a multi-tool rather than wielding a single hammer.

Minimum dataset and tooling

    Sample size: at least 1,000 queries from production logs per evaluation cycle.
    Annotation team: 3 raters per item for arbitration. Aim for pairwise Cohen's kappa > 0.7 (Cohen's kappa is defined for two raters at a time).
    Metrics to compute: FACTS (precision), Citation Precision (manual verifiability), and AA-Omniscience (epistemic penalty based on confidence vs verifiable evidence).
    Statistical checks: bootstrap 95% confidence intervals for each metric and pairwise difference tests.

Step-by-step evaluation protocol

1. Curate 1,000 queries representing typical and edge-case behavior - include 15% time-sensitive and 15% obscure-source items.
2. For each model: record the full output, any model-provided citations, and model-reported confidence if available.
3. Annotate: three raters independently mark factual correctness and verify each citation per a strict rubric.
4. Compute metrics and confidence intervals.
5. Flag models with citation error rate > 20% or AA-Omniscience Index < 0 for manual review before any rollout.
6. Run adversarial tests monthly and compare deltas to detect drift.
7. Require a pre-deployment check when a vendor changes model versions.
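The flagging rule in the protocol is simple enough to encode as a gate in an evaluation script. A minimal sketch follows; the thresholds come from the protocol above, the metric values are the headline numbers reported earlier, and the `ModelResult` class and function name are hypothetical.

```python
from dataclasses import dataclass

MAX_CITATION_ERROR_RATE = 0.20  # flag if the error rate exceeds this
MIN_AA_INDEX = 0.0              # flag if the index falls below this

@dataclass
class ModelResult:
    name: str
    citation_error_rate: float  # fraction in [0, 1]
    aa_index: float

def needs_manual_review(result):
    """True if either threshold in the protocol is breached."""
    return (
        result.citation_error_rate > MAX_CITATION_ERROR_RATE
        or result.aa_index < MIN_AA_INDEX
    )

results = [
    ModelResult("Perplexity Sonar Pro (1.2.4)", 0.370, -0.05),
    ModelResult("Opus 4.5 (v4.5-202404)", 0.296, -0.28),
    ModelResult("GPT-4o-mini (control)", 0.214, 0.12),
]
flagged = [r.name for r in results if needs_manual_review(r)]
print(flagged)
```

With these numbers all three models are flagged: even the control exceeds the 20% citation error threshold, so each would go to manual review before rollout.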

Analogy: treat model evaluation like vehicle inspections. You would not only check top speed; you verify brakes, fuel economy, and emissions. A car that looks fast but has bad brakes is not one you'd put on a highway with passengers. The same applies to models: FACTS is speed, AA-Omniscience is brakes, and Citation Precision is maintenance history.

Finally, be transparent with stakeholders about uncertainty. When vendor dashboards show a single number, ask for the definition, sampling method, and edge-case examples. Conflicting data rarely means someone is lying; more often it means the measurement questions differ. If Sonar Pro counts any URL as a citation and your users need verifiable claims, those two things will never align.

Our 90-day experiment did one practical thing: it stopped us from using a single benchmark as a procurement crutch. We switched to a metric matrix, added sampling-based human verification, and required an AA-Omniscience threshold for production. The result: fewer surprise regressions, clearer vendor conversations, and a concrete policy for model upgrades.