v0.4 · May 2026

Project Armour

A deterministic verification layer for AI-generated financial analysis. No generative AI in the audit loop.

7,372 audited claims · 30 sector-diversified earnings releases · 4 frontier LLMs · zero LLMs in the audit loop

The problem

Generative models are now routine producers of investment summaries, earnings-call recaps, and research notes. Their outputs are also routinely wrong in ways that are difficult to detect through conventional review: invented numbers, inverted directional claims, misattributed quotes, and plausible-sounding inferences with no source backing.

The standard mitigations fail in characteristic ways:

The missing primitive is a non-generative verifier: a system that answers, for each claim in an LLM output, what specific source content backs this assertion and how confident the match is, without itself being capable of hallucination.

What Project Armour does

Zero generative AI in the audit loop. All verification is deterministic: regex extraction, arithmetic recompute, period alignment, scope fingerprinting, and token/embedding anchoring. The audit cannot fabricate an explanation because it cannot generate text.
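Two of the checks named above, regex extraction and arithmetic recompute, can be illustrated with a minimal sketch. This is not Armour's implementation: the pattern, the `Number` type, the function names, and the 0.1-point tolerance are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: matches amounts such as "$4.2 billion", "1,250", or "12.5%".
NUMBER_RE = re.compile(
    r"(?P<currency>\$)?(?P<value>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<unit>%|billion|million|thousand)?",
    re.IGNORECASE,
)

SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


@dataclass(frozen=True)
class Number:
    value: float  # scaled to base units (e.g. dollars), or percentage points
    kind: str     # "currency", "percent", or "plain"


def extract_numbers(text: str) -> list[Number]:
    """Extract numeric mentions in reading order, using only the regex above."""
    found = []
    for match in NUMBER_RE.finditer(text):
        raw = float(match.group("value").replace(",", ""))
        unit = (match.group("unit") or "").lower()
        if unit == "%":
            found.append(Number(raw, "percent"))
        else:
            kind = "currency" if match.group("currency") else "plain"
            found.append(Number(raw * SCALE.get(unit, 1.0), kind))
    return found


def recompute_growth_pct(current: float, prior: float) -> float:
    """Recompute period-over-period growth, in percentage points."""
    return (current - prior) / prior * 100.0


def within_tolerance(claimed: float, recomputed: float, abs_tol: float = 0.1) -> bool:
    """Assumed tolerance: the claim may differ by at most abs_tol points."""
    return abs(claimed - recomputed) <= abs_tol


source = "Revenue rose to $4.2 billion from $4.0 billion, up 5%."
current, prior, claimed_growth = extract_numbers(source)
ok = within_tolerance(claimed_growth.value,
                      recompute_growth_pct(current.value, prior.value))
```

Every step is a pure function of its input, so the same text yields the same extracted numbers and the same check result on every run.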

Given a source document (earnings press release, transcript, 8-K) and an LLM-generated output, Armour:

  1. Parses the source into a structured fact store: sentences, numbers with units and periods, named entities, table values.
  2. Extracts every claim from the LLM output and classifies it by type and by origin (restatement, derived, inference, fabrication).
  3. Anchors each claim to the best-matching source sentence via token-set overlap and optional sentence embeddings.
  4. Runs per-claim checks: numeric tolerance, arithmetic recompute, period alignment, direction sign-check, scope fingerprint, entity presence.
  5. Assigns a graduated verdict and produces a claim-by-claim ledger with source quotes.
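Step 3 can be sketched with a plain Jaccard score over token sets. The tokenizer, the `anchor` function, and the 0.3 threshold below are assumptions for illustration only; the optional embedding path is omitted, and Armour's actual scoring may differ.

```python
import re
from typing import Optional


def tokens(text: str) -> frozenset[str]:
    """Lowercased word and number tokens; decimals such as 4.2 stay intact."""
    return frozenset(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Token-set overlap: |A intersect B| / |A union B|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def anchor(claim: str, source_sentences: list[str],
           min_score: float = 0.3) -> tuple[Optional[int], float]:
    """Return (index of best-matching sentence, score), or (None, score)
    when nothing clears the assumed threshold."""
    claim_tokens = tokens(claim)
    best_index, best_score = None, 0.0
    for i, sentence in enumerate(source_sentences):
        score = jaccard(claim_tokens, tokens(sentence))
        # Strict '>' keeps the earliest sentence on ties, so the result
        # does not depend on anything but the input order.
        if score > best_score:
            best_index, best_score = i, score
    if best_score < min_score:
        return None, best_score
    return best_index, best_score


sources = [
    "Revenue was $4.2 billion in the third quarter.",
    "The board declared a dividend of $0.50 per share.",
]
index, score = anchor("Third quarter revenue was $4.2 billion.", sources)
```

A claim with no sentence above the threshold comes back with `None`, which is the kind of signal a later stage could map to an unanchored verdict.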

Verdict ladder

Verdict                 Meaning
pass                    Restates a fact present in the source.
derived_pass            Derived from the source via verifiable arithmetic.
minor_drift             Grounded but contains small wording or rounding drift.
inference_unanchored    Reasonable inference; no specific source sentence anchors it.
inference_unverified    Inference whose factual basis could not be deterministically verified.
fail                    Conflicts with or misstates source material.
fabrication             Asserts entities, numbers, or events absent from the source.

The ledger is not a summary score. It is a citable, per-claim record that tells a compliance reviewer exactly which source sentence was checked, what the verdict was, and why. Identical input produces an identical ledger, every time.
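The "identical input, identical ledger" property can be made checkable by serializing entries canonically and hashing the result. The sketch below is an assumption about how such a record might look; the `LedgerEntry` fields and function names are hypothetical and do not describe Armour's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical fields, chosen to mirror the description above.
    claim: str
    verdict: str
    source_quote: Optional[str]  # None when no sentence anchors the claim
    reason: str


def serialize_ledger(entries: list[LedgerEntry]) -> str:
    """Canonical JSON: sorted keys and fixed separators, so equal ledgers
    always produce byte-identical text."""
    return json.dumps(
        [asdict(e) for e in entries],
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )


def ledger_digest(entries: list[LedgerEntry]) -> str:
    """SHA-256 of the canonical form; two runs can be compared by digest."""
    return hashlib.sha256(serialize_ledger(entries).encode("utf-8")).hexdigest()


ledger = [
    LedgerEntry(
        claim="Third quarter revenue was $4.2 billion.",
        verdict="pass",
        source_quote="Revenue was $4.2 billion in the third quarter.",
        reason="numeric match within tolerance",
    ),
    LedgerEntry(
        claim="Management expects margin expansion next year.",
        verdict="inference_unanchored",
        source_quote=None,
        reason="no source sentence above the anchoring threshold",
    ),
]
digest = ledger_digest(ledger)
```

Comparing digests across runs is a cheap regression check: any change in a verdict, quote, or reason changes the hash.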

v0.4 headline results

203 (model, ticker, task) cells across 30 sector-diversified US large-cap earnings press releases, audited under the compliance profile. Accuracy is defined as (pass + derived_pass + minor_drift) / total claims. The fabrication rate is the share of claims with no traceable anchor in the source document.
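The accuracy definition above is a ratio over verdict counts. A minimal sketch follows; the function names are hypothetical, and the fabrication rate is assumed here to be the share of claims carrying the fabrication verdict.

```python
from collections import Counter
from typing import Iterable

# Verdicts counted as accurate, per the definition above.
GROUNDED = ("pass", "derived_pass", "minor_drift")


def accuracy(verdicts: Iterable[str]) -> float:
    """(pass + derived_pass + minor_drift) / total claims."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no claims to score")
    return sum(counts[v] for v in GROUNDED) / total


def fabrication_rate(verdicts: Iterable[str]) -> float:
    """Assumed definition: fabrication verdicts / total claims."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no claims to score")
    return counts["fabrication"] / total


# Ten illustrative verdicts (made-up data, not drawn from the evaluation set).
example = (
    ["pass"] * 5
    + ["derived_pass", "minor_drift", "inference_unanchored", "fail", "fabrication"]
)
example_accuracy = accuracy(example)
example_fabrication = fabrication_rate(example)
```

Note that the two inference verdicts and fail count against accuracy without counting as fabrication, so the two rates do not sum to 100%.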

Model                                  Claims   Accuracy   Accuracy 95% CI   Fabrication   Fabrication 95% CI
GPT-5.5 Pro                            742      74.3%      69.4–78.6%        5.1%          3.5–6.9%
OpenAI o3                              2,038    71.5%      68.1–74.7%        5.3%          3.9–6.9%
Claude Sonnet 4.6                      2,372    59.6%      56.9–62.3%        8.5%          6.7–10.4%
Claude Opus 4.7 (extended thinking)    2,220    60.0%      57.0–62.9%        10.6%         8.7–12.7%

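CIs are 95% cluster bootstrap intervals (resampling cells with replacement, 5,000 draws, seed 20260506). The OpenAI-vs-Anthropic accuracy gap is robust: the OpenAI lower bounds (68.1% and 69.4%) sit above the Anthropic upper bounds (62.3% and 62.9%). For fabrication, the relevant comparison is OpenAI upper bounds against Anthropic lower bounds: both OpenAI upper bounds (6.9%) sit below the Opus lower bound (8.7%), but they marginally overlap the Sonnet lower bound (6.7%), so the OpenAI-vs-Sonnet fabrication gap is not cleanly separated at 95% confidence. The Opus-vs-Sonnet fabrication difference is directional (10.6% vs 8.5%), but the CIs overlap at 95% confidence. The full methodology and a matched-subset robustness check are in WHITEPAPER.md.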
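A cluster bootstrap of this kind resamples whole cells rather than individual claims, so within-cell correlation is preserved. The sketch below uses the draw count and seed quoted above; the percentile method, the `(hits, total)` cell representation, and the function name are assumptions, and evaluation/bootstrap_ci.py may differ in detail.

```python
import random


def cluster_bootstrap_ci(
    cells: list[tuple[int, int]],
    draws: int = 5000,
    seed: int = 20260506,
    level: float = 0.95,
) -> tuple[float, float]:
    """Percentile CI for a pooled rate.

    Each cell is (hits, total) for one (model, ticker, task) combination,
    and is assumed to contain at least one claim. Cells are resampled
    with replacement; claims inside a cell always travel together.
    """
    rng = random.Random(seed)  # fixed seed: the interval is reproducible
    n = len(cells)
    stats = []
    for _ in range(draws):
        sample = [cells[rng.randrange(n)] for _ in range(n)]
        hits = sum(h for h, _ in sample)
        total = sum(t for _, t in sample)
        stats.append(hits / total)
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * draws)]
    upper = stats[min(draws - 1, int((1.0 - alpha) * draws))]
    return lower, upper


# Illustrative cells only (made-up counts, not the committed artefacts).
cells = [(8, 10), (6, 10), (7, 10), (9, 10), (5, 10)]
lower, upper = cluster_bootstrap_ci(cells)
```

Because the generator is seeded, rerunning the function on the same cells returns the same interval, which is what makes an exact-match reproduction check possible.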

The model marketed as the premium reasoning option, Claude Opus 4.7 with extended thinking, produced the highest fabrication point estimate in the panel (10.6%). The "use a better model" advice does not survive contact with this dataset.

Try it without API keys

The demo command audits a bundled fictional earnings release against a bundled LLM-style output, producing five of the seven verdict types in the ladder above. No OpenAI or Anthropic key is required.

pip install -e .
armour demo

This writes demo_output/audit.json and demo_output/audit.html to your working directory. Open the HTML file for the per-claim source-trace ledger.

The five demo claims cover one faithful restatement (pass), one verified arithmetic derivation (derived_pass), one unanchored inference (inference_unanchored), one contradiction of the source (fail), and one assertion absent from the source entirely (fabrication).

Reproducibility

The v0.4 headline numbers are fully reproducible from the committed evaluation artefacts in batch_results_v2/. No API keys, no network access, no LLM inference required.

pip install -e ".[dev]"

# Reproduce headline table from committed artefacts
python evaluation/reproduce_v04.py

# Reproduce 95% confidence intervals (5,000 draws, seed 20260506)
python evaluation/bootstrap_ci.py

# Full test suite (44 tests, all offline)
pytest

Both scripts exit printing PASS when the computed numbers match the stored artefacts exactly. See evaluation/README.md for full methodology and metric definitions.

GitHub Actions CI (Python 3.10 and 3.11) runs all four steps on every push. The CI badge is the live verification status of the committed artefacts.

What Armour does not guarantee

Deterministic does not mean infallible. Armour checks consistency with the supplied source document. It does not verify truth in the world.

Further reading

Source code, committed evaluation artefacts, and the full test suite are available at github.com/C-Murdoch/projectarmour.