200+ sources poll on schedule — HF leaderboards, vendor system cards, audited harnesses, prose papers.
Sources register on first detection; no manual onboarding.
The pipeline that produces the frontier — five ingestion + scoring stages, a five-tier verification ladder, and a per-axis manifesto that defines what counts.
200+ sources poll on schedule — HF leaderboards, vendor system cards, audited harnesses, prose papers.
Sources register on first detection; no manual onboarding.
Raw rows fold into model_benchmarks. Model aliases collapse onto a single canonical_entity; benchmark aliases collapse onto a canonical benchmark.
ON CONFLICT DO UPDATE: most-recent-wins at the DB layer.
Per-axis manifestos define the composite basket. axis_score is a reliability-weighted percentile across the basket — each benchmark scored by within-benchmark rank, shrunk toward the field median for thin samples. Two bases, toggleable everywhere: Authority (provenance-ranked pick, default) and Max (highest reported).
Eligibility: min benchmarks measured, recency window, harness filter.
Each score carries a verification_level (independently verified › aggregator › vendor first-party › vendor cross-reference › unverified). A vendor re-citing a number it didn't produce ranks below first-party; agreement_stddev surfaces within-model disagreement.
Below-threshold confidence → review queue, not publication.
Frontier band = top of the per-axis distribution. Tier counts, sibling axes, and the front-page hero render directly from the materialized view — no editorial overlay.
Methodology label + reference URL ride on every visible row.
Third-party harness ran the model. Replicable from the leaderboard's published prompts + scoring code.
e.g. HELM · LiveBench · SWE-Bench Verified · LMArena evals
Aggregator surfaced the score — model card, leaderboard table, Papers with Code entry. Re-attests a primary source.
e.g. HuggingFace · Papers with Code · model registries
Vendor's own system card / blog / model card. Authoritative on the model but not independently replicated.
e.g. OpenAI system cards · Anthropic model cards · vendor blogs
A vendor re-states a benchmark number it did not produce — re-citing a rival or an aggregator. DEMOTED below first-party attribution when a headline score is selected (mig 156).
e.g. launch-post comparison tables · a vendor blog quoting a competitor's eval
Unverified — surfaced through prose extraction (research paper, conference talk, technical brief) without re-run by an aggregator. The lowest tier.
e.g. arXiv abstracts · conference papers · community evals