Our Methodology

How Climatefacts.ai analyses and scores information, in full. Every prompt, formula, indicator, and quality signal the platform uses is published below — and the numbers come live from the API, so what you see is what the platform is doing right now. This page exists so our methodology can be objectively reviewed and challenged.

Self-audit: where we are vs. where we claim

In May 2026 we commissioned an external analytical audit of every component, route, and scoring pipeline. The audit re-graded the platform against the same rubric our own engineering team uses. The gap between self-claimed and audited scores is published here because trust infrastructure must model the honesty it asks of others.

Live composite (backend-driven)

Live composite from calibration, source tiers, embeddings, coverage, and provenance data

Last audited (End2End, 2026-06-14)

3.55/5

Original E2E audit benchmark. The live composite (above) is now driven by backend data and updates on every page load; it replaces the previously hardcoded 4.78 self-claim and the stale 2026-05-27 audit date.

What the audit found

Reliability Tiering

Claimed: 4.4-4.8. Audited: 3.5 → 4.6 (2026-05-26)

✅ DB-backed source credibility tiers + 3-axis (editorial/factcheck/transparency) wired into compute_weighted_score. Mig 045 fence guarantees no NULL axes.

Calibration Math

Claimed: 4.6-4.7. Audited: 2.8 → 3.4 (2026-05-27)

✅ Calibration fence — min_labels=50, sub-threshold fits stamped 'preview' (commit 5dc7b12). Awaiting label volume.

Hallucination Detection

Claimed: 4.8. Audited: 3.2 → 4.3 (2026-05-27)

✅ spaCy NER model now downloaded in API Dockerfile so PERSON/ORG/GPE/LOC entity grounding runs in prod (was regex-fallback).

Sustainability Composite

Claimed: 4.8. Audited: 3.3

Integrate ND-GAIN; widen confidence band on mixed-year inputs. Pending.

Claim Density Honesty

Claimed: 4.6. Audited: 4.6 → 4.8 (2026-05-25)

✅ Claim-density factor + Limited Evidence badge — '90% credibility with 1 claim' no longer possible.

Deep Search Relevance

Claimed: none. Audited: 3.8 → 4.5 (2026-05-25)

✅ Min semantic similarity 0.55 + tightened overlap guardrail (0.25). Fixed Slovenian-noise-for-India-query trust bug.

Multi-claim Extraction Yield

Claimed: none. Audited: 2.2 claims/article → target 3-8 (2026-05-27)

✅ Prompt v1.1 explicitly targets 3-8 claims (was 'up to N'). Primary DeepSeek extractor now uses the registered prompt template — previously diverged from secondary Anthropic.

External Citation Credibility

Claimed: none. Audited: n/a → 4.5 (2026-05-27)

✅ Perplexity deep-search citations now annotated with tier + 0-100 credibility score via source_credibility_tiers lookup.

Source Stamping at Ingest

Claimed: none. Audited: 0% → 100% (2026-05-27)

✅ New article ingest stamps articles.source_credibility_score via source_tier_service (was hardcoded 50 / NULL across the corpus).

Bias Auditor

Claimed: none. Audited: missing → 4.5 (2026-05-27)

✅ Chi-squared bias auditor live at /api/methodology/bias-audit — Cramér's V + critical-value gate at α=0.05 (commit 5dc7b12).

Provenance Ledger

Claimed: none. Audited: empty → backfilled (2026-05-27)

✅ Article-enrichment path now writes claim_provenance for every LLM call (commit 5dc7b12).

Premium Gating

Claimed: none. Audited: ungated → Standard+ (2026-05-27)

✅ /companies/{ticker}/analyze-report + /research/upload now require Standard+ subscription (document_ingestion premium feature).

Last audited composite (2026-05-27): ~3.55/5 by the End2End audit, up from 3.05 the day before. That wave closed the three biggest residual gaps: multi-claim extraction yield (prompt v1.1 targets 3-8 claims instead of "up to N"), spaCy NER entity grounding (the model is now downloaded in the API Dockerfile rather than silently degrading to regex), and external citation credibility (Perplexity URLs are annotated with a tier). Trust work has continued since (see the "Since the audit" card below), but those gains are not yet reflected in a re-graded score, so the audited figure stands until the next audit. Calibration label volume, ND-GAIN integration, and full transition-risk scoring remain the highest-leverage open items; see End2End Audit Benchmark 2026-05-27 for the file-level evidence per fix.

Publishing this gap is our strongest trust signal. It inverts every greenwashing pattern: we show the gap, we name the fixes, and we give a date by which we commit to closing it. This page will be updated as each axis improves.

Recent platform updates (May – June 2026)

The 2026-05-27 audit wave closed the Honest-Gap-Audit v2 plus the End2End audit's Section I priority list: multi-claim yield, entity grounding NER, external citation credibility, source stamping at ingest, and premium gating on heavy LLM endpoints. Work has continued since: the "Since the audit" card below lists what shipped through 2026-06-08 (semantic-search embeddings, source-health monitoring, the credibility-tier completion, drift honesty), ahead of the next re-score. Every change has file-level evidence in git and corresponding tests.

Since the audit (May 28 – Jun 8, 2026)

Semantic search resurrected — GX10 bge-m3 embedding write path (the corpus was ~0/666 embedded)
Source-health canary — daily feed probe; a feed auto-disables after 5 consecutive failures
Credibility tiers completed — a migration version-prefix collision had silently dropped ~55 climate-journalism tier seeds; fixed (mig 066). source_credibility_tiers now 164 rows in prod
Drift detector honesty — thin windows report 'insufficient_data' (neutral) instead of a fake-green 'stable'
SBTi validated-target detection fixed — 9 → ~3,900 companies
Company head-to-head compare — size-independent ambition leader; comparisons are now saveable
LLM cost telemetry — cloud-vs-GX10 spend is now visible
Billing / subscription routes aligned to the DB schema (paid paths were 500-ing)
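
The source-health canary rule in the list above (auto-disable after 5 consecutive failures) can be sketched as a small state update. This is a minimal illustration, not the platform's code: the FeedHealth class and record_probe function are hypothetical names, and it assumes that a successful probe resets the failure counter.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 5  # threshold stated on this page


@dataclass
class FeedHealth:
    """Hypothetical per-feed health record."""
    url: str
    consecutive_failures: int = 0
    enabled: bool = True


def record_probe(feed: FeedHealth, ok: bool) -> FeedHealth:
    """Update a feed after one daily probe."""
    if ok:
        # Assumption: any success resets the streak.
        feed.consecutive_failures = 0
    else:
        feed.consecutive_failures += 1
        if feed.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            feed.enabled = False
    return feed
```

Under this sketch, four failures followed by a success leave the feed enabled; only an unbroken run of five disables it.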

Truth-engine scoring

Claim-density factor (Slice 4a) — 1/1 verified no longer = 8/8 verified
Limited Evidence badge below 3 claims
3-axis source scoring (editorial / factcheck / transparency) wired into credibility math (Polish wave 2)
Multi-claim extraction prompt v1.1 — targets 3-8 claims explicitly (was 'up to N'; lifted from 2.2 avg)
DeepSeek primary extractor now uses the registered prompt template (parity with Anthropic secondary)
spaCy NER model downloaded in API Dockerfile — entity grounding runs at semantic level, not regex fallback
Ingest stamps articles.source_credibility_score via tier service (was hardcoded 50 across whole corpus)
External Perplexity citations annotated with tier + credibility score chip
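
The claim-density factor and Limited Evidence badge from the first two items above can be sketched as follows. The page does not publish the exact formula here, so the saturating factor n / (n + k), the constant DENSITY_K, and all names are assumptions chosen only to show the behaviour: a single verified claim must not score like eight verified claims.

```python
from dataclasses import dataclass

LIMITED_EVIDENCE_THRESHOLD = 3  # from this page: badge shown below 3 claims
DENSITY_K = 2.0                 # hypothetical smoothing constant


@dataclass
class CredibilityResult:
    score: float            # 0-100, density-adjusted
    limited_evidence: bool  # True when the badge would be shown


def density_adjusted_credibility(verified: int, total: int) -> CredibilityResult:
    """Shrink the raw verified ratio when few claims were extracted."""
    if total <= 0:
        return CredibilityResult(score=0.0, limited_evidence=True)
    raw_ratio = verified / total
    # Assumed density factor: approaches 1 as the number of claims grows.
    density_factor = total / (total + DENSITY_K)
    return CredibilityResult(
        score=round(100 * raw_ratio * density_factor, 1),
        limited_evidence=total < LIMITED_EVIDENCE_THRESHOLD,
    )
```

With these assumed constants, 1/1 verified scores 33.3 with the badge, while 8/8 verified scores 80.0 without it.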

Retrieval honesty

Deep-search min semantic similarity 0.55 + min FTS rank 0.01
Relevance guardrail tightened — overlap ≥ 0.25 OR rel ≥ 0.5
Full-text fetch pre-pass before claim extraction (Slice 4b)
Link-rot detection — nightly HEAD probe (Slice 5a + Mig 046)
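
The thresholds in the first two items above can be read as a two-step filter. The sketch below is an assumption about how they combine (a retrieval floor per channel, then the overlap-or-relevance guardrail); the Candidate fields and function names are hypothetical, and only the four numeric thresholds come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_SEMANTIC_SIMILARITY = 0.55
MIN_FTS_RANK = 0.01
MIN_TOKEN_OVERLAP = 0.25
MIN_RELEVANCE = 0.5


@dataclass
class Candidate:
    url: str
    semantic_similarity: Optional[float]  # None if not a vector hit
    fts_rank: Optional[float]             # None if not a full-text hit
    token_overlap: float
    relevance: float


def passes_retrieval_floor(c: Candidate) -> bool:
    """A hit must clear the floor of the channel that produced it."""
    vector_ok = (c.semantic_similarity is not None
                 and c.semantic_similarity >= MIN_SEMANTIC_SIMILARITY)
    fts_ok = c.fts_rank is not None and c.fts_rank >= MIN_FTS_RANK
    return vector_ok or fts_ok


def passes_guardrail(c: Candidate) -> bool:
    """Guardrail as stated: overlap >= 0.25 OR rel >= 0.5."""
    return c.token_overlap >= MIN_TOKEN_OVERLAP or c.relevance >= MIN_RELEVANCE


def filter_candidates(candidates: List[Candidate]) -> List[Candidate]:
    return [c for c in candidates
            if passes_retrieval_floor(c) and passes_guardrail(c)]
```

A result that is semantically close but shares almost no terms with the query and has low relevance is dropped by the guardrail, which is the kind of off-topic noise the fix above targets.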

Save & explore

Polymorphic /api/user/saved — 8 item types (Slice 3)
My Saves page surfaces everything
Scenario explorer — IPCC AR6 interpolation with 'not simulation' disclaimer

Document analysis

/api/research/upload — PDF / DOCX / TXT up to 25 MiB (Deferred #11)
/api/companies/{ticker}/analyze-report — full sustainability report → claims (Deferred #12)
Research feed — subscribe-to-topic + CrossRef poller (Deferred #13)

Agentic chat

15 single-sourced agentic skills (was 11) — backend ↔ frontend pin tests guarantee parity
save_item / subscribe_research_topic / explore_scenario / analyze_corporate_report added
Deep-search inline follow-up chat (Slice 6)

Infrastructure

Local GX10 LLM routing — flip CLILENS_ENRICHMENT_PROVIDER=local-gx10 once hardware serves
Auto-fallback to DeepSeek if GX10 unreachable
Cloud Scheduler crons: cn-link-check + cn-research-poll + cn-aoi-poll provisioned (mig 046 + 047)
Migration runner @notolerate directive — broken migrations now fail loud
claim_provenance ledger now written from article enrichment path (was empty in prod)
Chi-squared bias auditor live at /api/methodology/bias-audit — Cramér's V + critical-value gate at α=0.05
Calibration refit default min_labels bumped 5→50 — production-grade Platt fits only
Admin endpoints (link-check, research-poll) accept SCHEDULER_SECRET as fallback header for Cloud Scheduler
Premium gating on /companies/{ticker}/analyze-report + /research/upload — Standard+ document_ingestion feature
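
The chi-squared bias auditor mentioned above combines three standard pieces: the chi-squared statistic on a contingency table, Cramér's V as an effect size, and a critical-value gate at α = 0.05. The sketch below shows those pieces in plain Python. What the real auditor tabulates is not stated here, so the example table (for instance, verdict counts by source group) and the function names are assumptions; the critical values are the standard table entries.

```python
import math
from typing import Dict, List

# Standard chi-squared critical values at alpha = 0.05, keyed by degrees of freedom.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def bias_audit(table: List[List[int]]) -> Dict[str, object]:
    """Chi-squared test of independence plus Cramér's V.

    Assumes at least a 2x2 table with no empty row or column.
    """
    rows, cols = len(table), len(table[0])
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (rows - 1) * (cols - 1)
    if dof not in CRITICAL_005:
        raise ValueError(f"no critical value tabulated for dof={dof}")

    cramers_v = math.sqrt(chi2 / (n * (min(rows, cols) - 1)))
    return {
        "chi2": chi2,
        "dof": dof,
        "cramers_v": cramers_v,
        "significant": chi2 > CRITICAL_005[dof],
    }
```

Reporting Cramér's V next to the gate matters because a very large sample can make a tiny association statistically significant; V indicates how strong the association is.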

Persona surfaces

Dashboard — Persona Lens (6 personas) + Analytics & Exports tile
Map country panel — 3-axis chips per source (editorial / factcheck / transparency)
SourceProfileCard — numeric 0-100 scores per axis below qualitative labels
Export tiles wired to logged-in saves (articles / companies / countries / searches)

Full commit ledger in docs/improvementplans/. Each item also has corresponding pytest coverage in tests/backend/ and tests/scripts/.

How verification works

Every URL the platform analyses runs through five stages. Each stage emits a versioned audit record so a displayed score can be traced back to the exact prompt, model, retrieval strategy, and source articles that produced it.

1. Article ingestion & extraction
   Title, author, publish date, source, language, and full body are extracted. The fetcher validates URLs against SSRF blocklists and re-validates after every redirect hop.
2. Claim extraction
   A versioned LLM prompt identifies factual claims. The prompt name + version + content-fingerprint are recorded on every output (see Models & Prompts below).
3. Evidence retrieval
   Hybrid retrieval combines internal corpus (FTS + HNSW vector search + knowledge graph), external web search via Perplexity (when configured), and weather context. Retrieval strategy is recorded per call.
4. Multi-LLM verification
   The primary model's claims are cross-checked against a secondary LLM. Token-level Jaccard similarity yields an agreement score; large disagreements downgrade confidence.
5. Hallucination grounding
   A separate hallucination check compares the synthesised answer against the retrieved articles. Entity overlap + statistic verification + LLM grounding feed into the final risk score, which is calibrated against ground-truth labels.
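
The agreement score in stage 4 can be sketched with a token-level Jaccard similarity: the size of the intersection of the two token sets divided by the size of their union. The tokenizer, the disagreement threshold, and the penalty below are assumptions for illustration; the page only states that Jaccard similarity is used and that large disagreements downgrade confidence.

```python
import re
from typing import Set

DISAGREEMENT_THRESHOLD = 0.5  # hypothetical cut-off for a "large" disagreement
DISAGREEMENT_PENALTY = 0.5    # hypothetical multiplier applied to confidence


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_agreement(primary: str, secondary: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two outputs' token sets."""
    a, b = tokenize(primary), tokenize(secondary)
    if not a and not b:
        return 1.0  # two empty outputs trivially agree
    return len(a & b) / len(a | b)


def adjust_confidence(confidence: float, agreement: float) -> float:
    """Downgrade confidence when the two models disagree strongly."""
    if agreement < DISAGREEMENT_THRESHOLD:
        return confidence * DISAGREEMENT_PENALTY
    return confidence
```

Jaccard on tokens is cheap and order-insensitive, but it treats paraphrases as disagreement, which is one reason to use it as a signal that lowers confidence rather than as a verdict.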

What we surface

Versioned prompt + fingerprint per LLM call
Source articles that fed each output
Reliability + agreement + hallucination scores
Calibration metrics tied to reviewer labels
Drift verdicts on source mix and prompts

Known limits

LLMs occasionally misread subtle scientific nuance
Calibration requires labelled reviews to accumulate
Paywalled sources are not retrievable
Predictive claims are flagged but not adjudicated

Models & versioned prompts

Every LLM call goes through a registered, fingerprinted prompt. The fingerprint is a SHA-256 prefix of template + system content, so two prompts with the same fingerprint can be treated as byte-identical (barring a hash collision, which a truncated prefix makes unlikely rather than impossible). Drift detection (below) watches the distribution of these fingerprints over time.
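
A fingerprint of this kind can be sketched with Python's hashlib. The prefix length, the separator byte between template and system content, and the function name are assumptions; the page states only that the fingerprint is a SHA-256 prefix over both parts.

```python
import hashlib

FINGERPRINT_LENGTH = 12  # hypothetical number of hex characters kept


def prompt_fingerprint(template: str, system: str) -> str:
    """Short, stable fingerprint of a prompt's template and system content."""
    digest = hashlib.sha256()
    digest.update(template.encode("utf-8"))
    # Assumed separator so ("ab", "c") and ("a", "bc") hash differently.
    digest.update(b"\x00")
    digest.update(system.encode("utf-8"))
    return digest.hexdigest()[:FINGERPRINT_LENGTH]
```

Any edit to either part, even a single whitespace change, produces a different fingerprint, which is what lets the stored value tie an output to the exact prompt text.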

Sustainability score formula

Country sustainability scores are a weighted combination of Bayesian-normalised indicators. Weights of missing components redistribute across the available subset; the confidence band widens when fewer indicators contribute.
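
The weight redistribution and band widening described above can be sketched as follows. The indicator names, the weights, and the band constants are hypothetical; the live formula is served by the API. Dividing by the weight that is actually present is equivalent to redistributing the missing components' weights proportionally across the available ones.

```python
from typing import Dict, Optional, Tuple

BASE_HALF_WIDTH = 5.0    # hypothetical band half-width at full coverage
EXTRA_HALF_WIDTH = 15.0  # hypothetical extra half-width at zero coverage


def composite_score(
    values: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[Tuple[float, float, float]]:
    """Return (score, band_low, band_high) on a 0-100 scale, or None.

    Assumes each value is already normalised to 0-100 and weights are positive.
    """
    available = {k: v for k, v in values.items()
                 if v is not None and k in weights}
    if not available:
        return None
    present_weight = sum(weights[k] for k in available)
    score = sum(weights[k] * v for k, v in available.items()) / present_weight
    # Fewer contributing indicators means lower coverage and a wider band.
    coverage = present_weight / sum(weights.values())
    half_width = BASE_HALF_WIDTH + (1.0 - coverage) * EXTRA_HALF_WIDTH
    return score, max(0.0, score - half_width), min(100.0, score + half_width)
```

For example, with hypothetical weights of 0.5, 0.3, and 0.2, dropping the 0.2 component rescales the other two to 0.625 and 0.375 and widens the band from ±5 to ±8 under these constants.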

Indicator catalogue

Every climate indicator the platform stores, with its authoritative source. Indicators flow into country_indicators from per-source adapters (Climate TRACE, Our World in Data, Climate Action Tracker) and feed the sustainability formula above.

Calibration

Brier score, Expected Calibration Error, and Platt scaling for each calibratable signal. Lower is better for both metrics: a perfect forecaster has a Brier score of 0, and a well-calibrated one has an ECE close to 0. When labels are sparse, the metrics show 'awaiting reviews'; calibration data accumulates as reviewers grade analyses.
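
The two metrics can be computed in a few lines. This is a generic sketch using the textbook definitions, with equal-width bins for ECE; the bin count and function names are assumptions, not the platform's settings.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean |confidence - accuracy| over equal-width bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_confidence - accuracy)
    return ece
```

The two metrics measure different things: predicting 0.5 for a signal that is right half the time gives an ECE of 0 but a Brier score of 0.25, because Brier also rewards sharp predictions.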

Hallucination rates

Per-extraction-method, per-model, and per-source hallucination scores over the last 30 days. Each LLM output is checked against its retrieved sources; the resulting risk score is recorded in claim_provenance and aggregated here.

Drift detection

KL-divergence between the recent 7-day window and the prior 30-day baseline, computed independently for the article source mix and the prompt-fingerprint distribution. A 'significant' verdict signals a meaningful shift — operators investigate.
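
The drift check can be sketched as a KL divergence between two count distributions, for example articles per source in the recent window versus the baseline. The additive smoothing, the 'significant' threshold, and the minimum window size below are assumptions; the verdict strings are the ones used on this page.

```python
import math
from typing import Dict

MIN_WINDOW_SIZE = 30   # hypothetical minimum observations per window
SIGNIFICANT_KL = 0.1   # hypothetical threshold for a 'significant' verdict


def kl_divergence(recent: Dict[str, int], baseline: Dict[str, int],
                  smoothing: float = 1.0) -> float:
    """KL(recent || baseline) over the union of categories.

    Additive smoothing keeps every probability non-zero, so a category
    that appears in only one window does not produce an infinite value.
    """
    keys = set(recent) | set(baseline)
    recent_total = sum(recent.values()) + smoothing * len(keys)
    baseline_total = sum(baseline.values()) + smoothing * len(keys)
    kl = 0.0
    for key in keys:
        p = (recent.get(key, 0) + smoothing) / recent_total
        q = (baseline.get(key, 0) + smoothing) / baseline_total
        kl += p * math.log(p / q)
    return kl


def drift_verdict(recent: Dict[str, int], baseline: Dict[str, int]) -> str:
    """Return 'insufficient_data', 'stable', or 'significant'."""
    if (sum(recent.values()) < MIN_WINDOW_SIZE
            or sum(baseline.values()) < MIN_WINDOW_SIZE):
        return "insufficient_data"
    return "significant" if kl_divergence(recent, baseline) > SIGNIFICANT_KL else "stable"
```

Checking the window size first is what prevents a near-empty week from being reported as 'stable'.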

Verdict labels

How the platform classifies an analysed claim once evidence is gathered.

Verified

Multiple credible sources confirm with high confidence.

Partially true

Some evidence supports, some contradicts, or context is needed.

Disputed / false

Scientific consensus contradicts the claim.

Unverified

Insufficient evidence to make a determination.

Feedback & corrections

We want this methodology to be challenged. If you spot a weakness in a prompt, disagree with an indicator weight, find a calibration label we got wrong, or have a primary source you think we should add — tell us, and we'll respond.

Methodology suggestions

methodology@climatefacts.ai

Corrections

corrections@climatefacts.ai

New data sources

research@climatefacts.ai

Reviewer-graded calibration labels can also be submitted by authorised partners via POST /api/methodology/calibration/labels — contact us for credentials.

Privacy, terms & GDPR

Required reading for EU users and enterprise customers. The full documents are version-controlled in the repository; older versions remain reachable by git SHA.

Corporate climate claim verification

The /companies surface verifies corporate climate claims against the public disclosure ledger (CDP / SBTi / Net Zero Tracker). Verdicts are deterministic and unit-tested — no LLM is in the verdict path.

Each claim is routed through a rule set pinned by tests/api/test_company_routes.py. The taxonomy is fixed:

flagged — offset-based "climate neutral" phrasings (ECGT Article 4 prohibition, effective 27 Sept 2026)
verified — net-zero claims supported by SBTi validation in the company's disclosure context
disputed — net-zero claims without SBTi evidence (fail-safe default: absent confirmation → disputed)
partially_true — emissions-reduction claims that require cross-referencing the Scope 1/2/3 rows on the company's profile
unverified — claims that don't match a routing rule (fallback bucket)
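
The taxonomy above can be read as an ordered rule set with a fallback. The sketch below is illustrative only: the keyword patterns and the classify_claim signature are assumptions, and the authoritative rules are the ones pinned by the test file named above. It does keep the two properties the page states: no LLM is involved, and a net-zero claim without SBTi confirmation defaults to disputed.

```python
import re

NEUTRALITY = re.compile(r"\b(climate|carbon)[- ]neutral")
NET_ZERO = re.compile(r"\bnet[- ]zero\b")
REDUCTION = re.compile(r"\b(reduc|cut|lower)")


def classify_claim(text: str, sbti_validated: bool) -> str:
    """Route a corporate climate claim to one of the five verdicts."""
    t = text.lower()
    # Rule 1: offset-based neutrality phrasing is flagged.
    if NEUTRALITY.search(t) and "offset" in t:
        return "flagged"
    # Rule 2: net-zero claims depend on SBTi evidence; absence means disputed.
    if NET_ZERO.search(t):
        return "verified" if sbti_validated else "disputed"
    # Rule 3: emissions-reduction claims need Scope 1/2/3 cross-referencing.
    if "emission" in t and REDUCTION.search(t):
        return "partially_true"
    # Fallback bucket.
    return "unverified"
```

Because the rules are plain functions of the claim text and the disclosure context, the same input always yields the same verdict, which is what makes the path unit-testable.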

Seed data covers ~17 well-known public companies across tech, consumer goods, industrials, and oil & gas — illustrative of both SBTi-validated and unvalidated cohorts. Once the CDP / SBTi / NZT adapters run, fresher data idempotently overwrites the seed rows.

Methodology snapshot generated live from GET /api/methodology and related endpoints. To pin a snapshot for audit, request the bundle directly and attach the response to your record.
