Our Methodology

How Climatefacts.ai analyses and scores information, in full. Every prompt, formula, indicator, and quality signal the platform uses is published below — and the numbers come live from the API, so what you see is what the platform is doing right now. This page exists so our methodology can be objectively reviewed and challenged.

Self-audit: where we are vs. where we claim

In May 2026 we commissioned an external analytical audit of every component, route, and scoring pipeline. The audit re-graded the platform against the same rubric our own engineering team uses. The gap between self-claimed and audited scores is published here because trust infrastructure must model the honesty it asks of others.

Live composite (backend-driven)
Live composite from calibration, source tiers, embeddings, coverage, and provenance data
Last audited (End2End, 2026-06-14)
3.55/5
Original E2E audit benchmark. The live composite is now computed from backend data and refreshes on every page load; it replaces the previously hardcoded 4.78 self-claim and the stale 2026-05-27 audit date.

What the audit found

Reliability Tiering
Claimed: 4.4-4.8
Audited: 3.5 → 4.6 (2026-05-26)
✅ DB-backed source credibility tiers + 3-axis (editorial/factcheck/transparency) wired into compute_weighted_score. Mig 045 fence guarantees no NULL axes.
Calibration Math
Claimed: 4.6-4.7
Audited: 2.8 → 3.4 (2026-05-27)
✅ Calibration fence — min_labels=50, sub-threshold fits stamped 'preview' (commit 5dc7b12). Awaiting label volume.
Hallucination Detection
Claimed: 4.8
Audited: 3.2 → 4.3 (2026-05-27)
✅ spaCy NER model now downloaded in API Dockerfile so PERSON/ORG/GPE/LOC entity grounding runs in prod (was regex-fallback).
Sustainability Composite
Claimed: 4.8
Audited: 3.3
Integrate ND-GAIN; widen confidence band on mixed-year inputs. Pending.
Claim Density Honesty
Claimed: 4.6
Audited: 4.6 → 4.8 (2026-05-25)
✅ Claim-density factor + Limited Evidence badge — '90% credibility with 1 claim' no longer possible.
Deep Search Relevance
Claimed: (not stated)
Audited: 3.8 → 4.5 (2026-05-25)
✅ Min semantic similarity 0.55 + tightened overlap guardrail (0.25). Fixed Slovenian-noise-for-India-query trust bug.
Multi-claim Extraction Yield
Claimed: (not stated)
Audited: 2.2 claims/article → target 3-8 (2026-05-27)
✅ Prompt v1.1 explicitly targets 3-8 claims (was 'up to N'). Primary DeepSeek extractor now uses the registered prompt template — previously diverged from secondary Anthropic.
External Citation Credibility
Claimed: (not stated)
Audited: n/a → 4.5 (2026-05-27)
✅ Perplexity deep-search citations now annotated with tier + 0-100 credibility score via source_credibility_tiers lookup.
Source Stamping at Ingest
Claimed: (not stated)
Audited: 0% → 100% (2026-05-27)
✅ New article ingest stamps articles.source_credibility_score via source_tier_service (was hardcoded 50 / NULL across the corpus).
Bias Auditor
Claimed: (not stated)
Audited: missing → 4.5 (2026-05-27)
✅ Chi-squared bias auditor live at /api/methodology/bias-audit — Cramér's V + critical-value gate at α=0.05 (commit 5dc7b12).
Provenance Ledger
Claimed: (not stated)
Audited: empty → backfilled (2026-05-27)
✅ Article-enrichment path now writes claim_provenance for every LLM call (commit 5dc7b12).
Premium Gating
Claimed: (not stated)
Audited: ungated → Standard+ (2026-05-27)
✅ /companies/{ticker}/analyze-report + /research/upload now require Standard+ subscription (document_ingestion premium feature).
Last audited composite (2026-05-27): ~3.55/5 by the End2End audit, up from 3.05 the day before. That wave closed the three biggest residual gaps: multi-claim extraction yield (prompt v1.1 targets 3-8 instead of "up to N"), spaCy NER entity grounding (now downloaded in the API Dockerfile rather than silently degrading to regex), and external citation credibility (Perplexity URLs annotated with tier). Trust work has continued since — see the "Since the audit" card below — but those gains are not yet reflected in a re-graded score, so the audited figure stands until the next audit. Calibration label volume + ND-GAIN integration + full transition-risk scoring remain the highest-leverage open items — see End2End Audit Benchmark 2026-05-27 for the file-level evidence per fix.

Publishing this gap is our strongest trust signal. It inverts every greenwashing pattern: we show the gap, we name the fixes, and we give a date by which we commit to closing it. This page will be updated as each axis improves.

Recent platform updates (May – June 2026)

The 2026-05-27 audit wave closed the Honest-Gap-Audit v2 plus the End2End audit's Section I priority list — multi-claim yield, entity grounding NER, external citation credibility, source stamping at ingest, and premium gating on heavy LLM endpoints. Work has continued since: the Since the audit card below lists what shipped through 2026-06-08 (semantic-search embeddings, source-health monitoring, the credibility-tier completion, drift honesty) — ahead of the next re-score. Every change has file-level evidence in git + corresponding tests.

Since the audit (May 28 – Jun 8, 2026)

  • Semantic search resurrected — GX10 bge-m3 embedding write path (the corpus was ~0/666 embedded)
  • Source-health canary — daily feed probe; a feed auto-disables after 5 consecutive failures
  • Credibility tiers completed — a migration version-prefix collision had silently dropped ~55 climate-journalism tier seeds; fixed (mig 066). source_credibility_tiers now 164 rows in prod
  • Drift detector honesty — thin windows report 'insufficient_data' (neutral) instead of a fake-green 'stable'
  • SBTi validated-target detection fixed — 9 → ~3,900 companies
  • Company head-to-head compare — size-independent ambition leader; comparisons are now saveable
  • LLM cost telemetry — cloud-vs-GX10 spend is now visible
  • Billing / subscription routes aligned to the DB schema (paid paths were 500-ing)

Truth-engine scoring

  • Claim-density factor (Slice 4a) — 1/1 verified no longer = 8/8 verified
  • Limited Evidence badge below 3 claims
  • 3-axis source scoring (editorial / factcheck / transparency) wired into credibility math (Polish wave 2)
  • Multi-claim extraction prompt v1.1 — targets 3-8 claims explicitly (was 'up to N'; lifted from 2.2 avg)
  • DeepSeek primary extractor now uses the registered prompt template (parity with Anthropic secondary)
  • spaCy NER model downloaded in API Dockerfile — entity grounding runs at semantic level, not regex fallback
  • Ingest stamps articles.source_credibility_score via tier service (was hardcoded 50 across whole corpus)
  • External Perplexity citations annotated with tier + credibility score chip
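
The claim-density idea in the list above can be illustrated with a minimal sketch. The platform's actual formula is not reproduced here: the linear ramp, the saturation point of 8 claims, and the names `density_factor` and `score_article` are all assumptions made for illustration. Only the "badge below 3 claims" threshold comes from the text.

```python
from dataclasses import dataclass

LIMITED_EVIDENCE_THRESHOLD = 3  # from the text: badge shown below 3 claims
SATURATION_CLAIMS = 8           # assumption: density factor reaches 1.0 at 8 claims


@dataclass
class ArticleScore:
    credibility: float       # 0-100
    limited_evidence: bool   # True when the Limited Evidence badge would show


def density_factor(n_claims: int, saturation: int = SATURATION_CLAIMS) -> float:
    """Hypothetical linear ramp: 0 claims -> 0.0, `saturation` or more -> 1.0."""
    if n_claims <= 0:
        return 0.0
    return min(1.0, n_claims / saturation)


def score_article(verified: int, total: int) -> ArticleScore:
    """Scale the verified ratio by the density factor so thin evidence scores lower."""
    if total <= 0:
        return ArticleScore(credibility=0.0, limited_evidence=True)
    ratio = verified / total
    credibility = 100.0 * ratio * density_factor(total)
    return ArticleScore(round(credibility, 1), total < LIMITED_EVIDENCE_THRESHOLD)
```

Under these assumptions `score_article(1, 1)` scores well below `score_article(8, 8)` even though both have a 100% verified ratio, and only the former carries the badge.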

Retrieval honesty

  • Deep-search min semantic similarity 0.55 + min FTS rank 0.01
  • Relevance guardrail tightened — overlap ≥ 0.25 OR rel ≥ 0.5
  • Full-text fetch pre-pass before claim extraction (Slice 4b)
  • Link-rot detection — nightly HEAD probe (Slice 5a + Mig 046)
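
As a rough sketch, the thresholds above could be applied as a two-step filter. The threshold values are taken from the list; how they combine (either retrieval path may admit a candidate, then the guardrail applies) and the `Candidate` fields are assumptions, not the platform's actual code.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SEMANTIC_SIMILARITY = 0.55  # from the text
MIN_FTS_RANK = 0.01             # from the text
MIN_OVERLAP = 0.25              # from the text
MIN_RELEVANCE = 0.5             # from the text


@dataclass
class Candidate:
    # Hypothetical shape of one retrieved article.
    semantic_similarity: Optional[float]  # None if not found via vector search
    fts_rank: Optional[float]             # None if not found via full-text search
    overlap: float                        # query/article term overlap, 0-1
    relevance: float                      # relevance score, 0-1


def passes_retrieval_floor(c: Candidate) -> bool:
    """Assumption: a candidate needs to clear the floor on at least one path."""
    semantic_ok = (c.semantic_similarity is not None
                   and c.semantic_similarity >= MIN_SEMANTIC_SIMILARITY)
    fts_ok = c.fts_rank is not None and c.fts_rank >= MIN_FTS_RANK
    return semantic_ok or fts_ok


def passes_guardrail(c: Candidate) -> bool:
    """Guardrail as stated: overlap >= 0.25 OR rel >= 0.5."""
    return c.overlap >= MIN_OVERLAP or c.relevance >= MIN_RELEVANCE


def keep(c: Candidate) -> bool:
    return passes_retrieval_floor(c) and passes_guardrail(c)
```

A candidate with high similarity but neither enough overlap nor enough relevance is dropped, which is the kind of off-topic hit the guardrail is meant to catch.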

Save & explore

  • Polymorphic /api/user/saved — 8 item types (Slice 3)
  • My Saves page surfaces everything
  • Scenario explorer — IPCC AR6 interpolation with 'not simulation' disclaimer

Document analysis

  • /api/research/upload — PDF / DOCX / TXT up to 25 MiB (Deferred #11)
  • /api/companies/{ticker}/analyze-report — full sustainability report → claims (Deferred #12)
  • Research feed — subscribe-to-topic + CrossRef poller (Deferred #13)

Agentic chat

  • 15 single-sourced agentic skills (was 11) — backend ↔ frontend pin tests guarantee parity
  • save_item / subscribe_research_topic / explore_scenario / analyze_corporate_report added
  • Deep-search inline follow-up chat (Slice 6)

Infrastructure

  • Local GX10 LLM routing — flip CLILENS_ENRICHMENT_PROVIDER=local-gx10 once hardware serves
  • Auto-fallback to DeepSeek if GX10 unreachable
  • Cloud Scheduler crons: cn-link-check + cn-research-poll + cn-aoi-poll provisioned (mig 046 + 047)
  • Migration runner @notolerate directive — broken migrations now fail loud
  • claim_provenance ledger now written from article enrichment path (was empty in prod)
  • Chi-squared bias auditor live at /api/methodology/bias-audit — Cramér's V + critical-value gate at α=0.05
  • Calibration refit default min_labels bumped 5→50 — production-grade Platt fits only
  • Admin endpoints (link-check, research-poll) accept SCHEDULER_SECRET as fallback header for Cloud Scheduler
  • Premium gating on /companies/{ticker}/analyze-report + /research/upload — Standard+ document_ingestion feature
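
The chi-squared bias audit mentioned above can be sketched in plain Python. This is an illustrative version, not the code behind /api/methodology/bias-audit: the contingency table layout (for example, rows as source tiers and columns as verdicts) is an assumption, and the critical values are hardcoded for up to 6 degrees of freedom rather than computed. It also assumes every row and column total is non-zero.

```python
import math

# Upper-tail chi-squared critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def chi_squared(table):
    """Return (statistic, degrees of freedom, sample size) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, n


def bias_audit(table):
    """Chi-squared test plus Cramér's V effect size, gated at alpha = 0.05."""
    stat, dof, n = chi_squared(table)
    k = min(len(table), len(table[0])) - 1
    cramers_v = math.sqrt(stat / (n * k))
    return {
        "chi2": stat,
        "dof": dof,
        "cramers_v": cramers_v,
        "significant": stat > CHI2_CRITICAL_05[dof],
    }
```

For the table `[[30, 10], [10, 30]]` every expected count is 20, so the statistic is 20 with 1 degree of freedom and Cramér's V is 0.5. Reporting V next to the significance gate matters because a large sample can make a negligible association significant.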

Persona surfaces

  • Dashboard — Persona Lens (6 personas) + Analytics & Exports tile
  • Map country panel — 3-axis chips per source (editorial / factcheck / transparency)
  • SourceProfileCard — numeric 0-100 scores per axis below qualitative labels
  • Export tiles wired to logged-in saves (articles / companies / countries / searches)

Full commit ledger in docs/improvementplans/. Each item also has corresponding pytest coverage in tests/backend/ and tests/scripts/.

How verification works

Every URL the platform analyses runs through five stages. Each stage emits a versioned audit record so a displayed score can be traced back to the exact prompt, model, retrieval strategy, and source articles that produced it.

  1. Article ingestion & extraction

    Title, author, publish date, source, language, and full body are extracted. The fetcher validates URLs against SSRF blocklists and re-validates after every redirect hop.

  2. Claim extraction

    A versioned LLM prompt identifies factual claims. The prompt name + version + content-fingerprint are recorded on every output (see Models & Prompts below).

  3. Evidence retrieval

    Hybrid retrieval combines internal corpus (FTS + HNSW vector search + knowledge graph), external web search via Perplexity (when configured), and weather context. Retrieval strategy is recorded per call.

  4. Multi-LLM verification

    The primary model's claims are cross-checked against a secondary LLM. Token-level Jaccard similarity yields an agreement score; large disagreements downgrade confidence.

  5. Hallucination grounding

    A separate hallucination check compares the synthesised answer against the retrieved articles. Entity overlap + statistic verification + LLM grounding feed into the final risk score, which is calibrated against ground-truth labels.
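
The agreement score in stage 4 can be illustrated with a small sketch of token-level Jaccard similarity. The tokenizer, the 0.5 disagreement threshold, and the 0.5 penalty are assumptions chosen for the example; the text only states that large disagreements downgrade confidence.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase alphanumeric tokens (a simple, assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_agreement(primary: str, secondary: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two models' token sets."""
    a, b = tokenize(primary), tokenize(secondary)
    if not a and not b:
        return 1.0  # two empty outputs trivially agree
    return len(a & b) / len(a | b)


def adjust_confidence(confidence: float, agreement: float,
                      min_agreement: float = 0.5, penalty: float = 0.5) -> float:
    """Hypothetical downgrade: scale confidence when agreement is too low."""
    return confidence * penalty if agreement < min_agreement else confidence
```

For instance, `jaccard_agreement("a b c", "b c d")` shares 2 of 4 distinct tokens and so returns 0.5. Token overlap is cheap to compute but blind to paraphrase, so it is best read as a coarse disagreement signal.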

What we surface
  • Versioned prompt + fingerprint per LLM call
  • Source articles that fed each output
  • Reliability + agreement + hallucination scores
  • Calibration metrics tied to reviewer labels
  • Drift verdicts on source mix and prompts
Known limits
  • LLMs occasionally misread subtle scientific nuance
  • Calibration requires labelled reviews to accumulate
  • Paywalled sources are not retrievable
  • Predictive claims are flagged but not adjudicated

Models & versioned prompts

Every LLM call goes through a registered, fingerprinted prompt. The fingerprint is a SHA-256 prefix of template + system content; barring a hash collision, two prompts with the same fingerprint are byte-identical. Drift detection (below) watches the distribution of these fingerprints over time.
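
A fingerprint of this kind can be sketched with Python's standard `hashlib`. The 12-character prefix length and the NUL separator between template and system content are assumptions; the platform's exact concatenation and prefix length are not stated here.

```python
import hashlib


def prompt_fingerprint(template: str, system: str, prefix_len: int = 12) -> str:
    """SHA-256 over template + system content, truncated to a short hex prefix.

    The NUL separator (an assumption) keeps ("ab", "c") and ("a", "bc")
    from hashing to the same value.
    """
    payload = template.encode("utf-8") + b"\x00" + system.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:prefix_len]


fp_v1 = prompt_fingerprint("Extract 3-8 claims from: {article}", "You are a fact-checker.")
fp_v2 = prompt_fingerprint("Extract up to N claims from: {article}", "You are a fact-checker.")
```

Here `fp_v1` and `fp_v2` differ because the template text changed, while re-hashing the same inputs always yields the same prefix.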

Sustainability score formula

Country sustainability scores are a weighted combination of Bayesian-normalised indicators. Weights of missing components redistribute across the available subset; the confidence band widens when fewer indicators contribute.
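
The redistribution rule can be illustrated as follows. This sketch assumes indicator values are already normalised to 0-100, and the indicator names, weights, and the band-widening rule (base half-width divided by the share of weight present) are invented for the example; the live weights come from the API.

```python
from typing import Dict, Optional, Tuple


def composite_score(values: Dict[str, Optional[float]],
                    weights: Dict[str, float],
                    base_half_width: float = 5.0) -> Tuple[float, float]:
    """Weighted mean over available indicators, plus a confidence half-width.

    Missing indicators (None or absent) drop out and the remaining weights
    are renormalised. The half-width rule is a hypothetical stand-in.
    """
    available = {k: v for k, v in values.items() if k in weights and v is not None}
    if not available:
        raise ValueError("no indicators available")
    present_weight = sum(weights[k] for k in available)
    score = sum(weights[k] / present_weight * v for k, v in available.items())
    coverage = present_weight / sum(weights.values())
    return score, base_half_width / coverage


weights = {"emissions": 0.5, "renewables": 0.3, "policy": 0.2}
full = composite_score({"emissions": 60, "renewables": 80, "policy": 40}, weights)
partial = composite_score({"emissions": 60, "renewables": 80, "policy": None}, weights)
```

With all three indicators the score is 62 with a half-width of 5. Dropping `policy` renormalises the remaining weights to 0.625 and 0.375, giving 67.5 with a wider half-width of 6.25.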

Indicator catalogue

Every climate indicator the platform stores, with its authoritative source. Indicators flow into country_indicators from per-source adapters (Climate TRACE, Our World in Data, Climate Action Tracker) and feed the sustainability formula above.

Calibration

Brier score, Expected Calibration Error (ECE), and Platt scaling for each calibratable signal. Lower is better for both metrics: ECE near 0 means stated confidence matches observed accuracy, and Brier near 0 means the predictions are both calibrated and decisive. When labels are sparse, the metrics show 'awaiting reviews'; calibration data accumulates as reviewers grade analyses.
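
The two metrics can be computed as in the sketch below, using the standard definitions with equal-width bins for ECE. Platt scaling is omitted. The `fit_status` helper is hypothetical: the 50-label floor and the 'preview' and 'awaiting reviews' labels appear on this page, but the 'production' label and the exact rule combining them are assumptions.

```python
MIN_LABELS = 50  # from the text: fits below this label count are stamped 'preview'


def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(avg_conf - accuracy)
    return ece


def fit_status(n_labels: int) -> str:
    """Hypothetical gate on label volume."""
    if n_labels == 0:
        return "awaiting reviews"
    return "preview" if n_labels < MIN_LABELS else "production"
```

For predictions `[0.9, 0.9, 0.1, 0.1]` against labels `[1, 1, 0, 0]`, the Brier score is 0.01 and the ECE is 0.1: each bin is off by 0.1 because the model is slightly under-confident.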

Hallucination rates

Per-extraction-method, per-model, and per-source hallucination scores over the last 30 days. Each LLM output is checked against its retrieved sources; the resulting risk score is recorded in claim_provenance and aggregated here.

Drift detection

KL-divergence between the recent 7-day window and the prior 30-day baseline, computed independently for the article source mix and the prompt-fingerprint distribution. A 'significant' verdict signals a meaningful shift — operators investigate.
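
A drift check along these lines might look like the sketch below. The verdict names 'significant', 'stable', and 'insufficient_data' appear on this page; the direction KL(recent ‖ baseline), the additive smoothing, the 30-sample minimum, and the 0.1 threshold are assumptions for illustration.

```python
import math
from collections import Counter


def distribution(items, support, epsilon=1e-6):
    """Smoothed relative frequencies over a shared support (avoids log of zero)."""
    counts = Counter(items)
    total = len(items) + epsilon * len(support)
    return {k: (counts[k] + epsilon) / total for k in support}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def drift_verdict(recent, baseline, min_samples=30, threshold=0.1):
    """Compare the recent window to the baseline; thin windows stay neutral."""
    if len(recent) < min_samples or len(baseline) < min_samples:
        return "insufficient_data"
    support = set(recent) | set(baseline)
    kl = kl_divergence(distribution(recent, support), distribution(baseline, support))
    return "significant" if kl > threshold else "stable"
```

The same function works for either input: pass lists of source names for the source-mix check, or lists of prompt fingerprints for the prompt check.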

Verdict labels

How the platform classifies an analysed claim once evidence is gathered.

Verified
Multiple credible sources confirm with high confidence.
Partially true
Some evidence supports, some contradicts, or context is needed.
Disputed / false
Scientific consensus contradicts the claim.
Unverified
Insufficient evidence to make a determination.

Feedback & corrections

We want this methodology to be challenged. If you spot a weakness in a prompt, disagree with an indicator weight, find a calibration label we got wrong, or have a primary source you think we should add — tell us, and we'll respond.

Reviewer-graded calibration labels can also be submitted by authorised partners via POST /api/methodology/calibration/labels — contact us for credentials.

Privacy, terms & GDPR

Required reading for EU users and enterprise customers. The full documents are version-controlled in the repository; older versions remain reachable by git SHA.

Corporate climate claim verification

The /companies surface verifies corporate climate claims against the public disclosure ledger (CDP / SBTi / Net Zero Tracker). Verdicts are deterministic and unit-tested — no LLM is in the verdict path.

Each claim is routed through a rule set pinned by tests/api/test_company_routes.py. The taxonomy is fixed:

  • flagged — offset-based "climate neutral" phrasings (ECGT Article 4 prohibition, effective 27 Sept 2026)
  • verified — net-zero claims supported by SBTi validation in the company's disclosure context
  • disputed — net-zero claims without SBTi evidence (fail-safe default: absent confirmation → disputed)
  • partially_true — emissions-reduction claims that require cross-referencing the Scope 1/2/3 rows on the company's profile
  • unverified — claims that don't match a routing rule (fallback bucket)
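
The taxonomy above can be sketched as a small deterministic router. This is not the rule set pinned by tests/api/test_company_routes.py: the regular expressions, the rule order, and the `sbti_validated` flag are simplified assumptions that only mirror the five buckets as described.

```python
import re

# Hypothetical phrase patterns; the real rules may be broader or stricter.
CLIMATE_NEUTRAL = re.compile(r"\bclimate[- ]neutral\b", re.IGNORECASE)
NET_ZERO = re.compile(r"\bnet[- ]zero\b", re.IGNORECASE)
REDUCTION = re.compile(
    r"\b(reduc\w*|cut\w*)\b.*\bemissions?\b|\bemissions?\b.*\b(reduc\w*|cut\w*)\b",
    re.IGNORECASE,
)


def route_claim(claim: str, sbti_validated: bool) -> str:
    """Map a corporate climate claim to a verdict without any LLM call."""
    if CLIMATE_NEUTRAL.search(claim):
        return "flagged"
    if NET_ZERO.search(claim):
        # Fail-safe default: absent confirmation means disputed.
        return "verified" if sbti_validated else "disputed"
    if REDUCTION.search(claim):
        return "partially_true"
    return "unverified"
```

Because the router is a pure function of the claim text and the disclosure flag, the same input always yields the same verdict, which is what makes the rules unit-testable.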

Seed data covers ~17 well-known public companies across tech, consumer goods, industrials, and oil & gas — illustrative of both SBTi-validated and unvalidated cohorts. Once the CDP / SBTi / NZT adapters run, fresher data idempotently overwrites the seed rows.
