Project Armour — Deterministic Verification for AI-Generated Financial Analysis

The problem

Generative models are now routine producers of investment summaries, earnings-call recaps, and research notes. Their outputs are also routinely wrong in ways that are difficult to detect through conventional review: invented numbers, inverted directional claims, misattributed quotes, and plausible-sounding inferences with no source backing.

The standard mitigations fail in characteristic ways:

Use a better model. Reasoning models editorialise more, not less. In the v0.4 evaluation, the model marketed as the premium reasoning option produced the highest fabrication rate in the panel.
Use an LLM as a judge. A judge model shares much of its training distribution and many of its failure modes with the model under test, so their errors are correlated and the judge can agree with the wrong answer.
Use RAG. Retrieval grounds the input. It does not constrain the output. The model remains free to interpolate, smooth, and embellish.

The missing primitive is a non-generative verifier: a system that answers, for each claim in an LLM output, which specific source content backs the assertion and how strong the match is, without itself being capable of hallucination.

What Project Armour does

Zero generative AI in the audit loop. All verification is deterministic: regex extraction, arithmetic recompute, period alignment, scope fingerprinting, and token/embedding anchoring. The audit cannot fabricate an explanation because it cannot generate text.
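As a rough illustration of what "regex extraction" and "arithmetic recompute" can look like in practice, here is a minimal sketch. It is not Armour's implementation: the pattern, the `Amount` type, the function names, and the 0.5% relative tolerance are all assumptions chosen for the example, and the pattern deliberately ignores thousands separators, currencies other than USD, and reporting periods.

```python
import re
from typing import NamedTuple


class Amount(NamedTuple):
    value: float  # normalised: dollars for USD, percentage points for %
    unit: str     # "USD" or "%"


# Hypothetical pattern covering "$4.2 billion", "$310 million", "$5" and "12.5%".
_SCALE = {"million": 1e6, "billion": 1e9}
_PATTERN = re.compile(
    r"\$(?P<usd>\d+(?:\.\d+)?)\s*(?P<scale>million|billion)?"
    r"|(?P<pct>\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_amounts(text: str) -> list[Amount]:
    """Return every dollar amount and percentage found in text, in order."""
    found = []
    for m in _PATTERN.finditer(text):
        if m.group("usd") is not None:
            scale = _SCALE.get((m.group("scale") or "").lower(), 1.0)
            found.append(Amount(float(m.group("usd")) * scale, "USD"))
        else:
            found.append(Amount(float(m.group("pct")), "%"))
    return found


def recompute_growth_pct(current: float, prior: float) -> float:
    """Recompute period-over-period growth, in percent, from two source values."""
    return (current - prior) / prior * 100.0


def within_tolerance(claimed: float, source: float, rel_tol: float = 0.005) -> bool:
    """True when claimed is within rel_tol (assumed 0.5%) of the source value."""
    if source == 0:
        return claimed == 0
    return abs(claimed - source) / abs(source) <= rel_tol
```

Under these assumptions, a claim of "12% growth" would be checked by recomputing growth from the two revenue figures in the source and comparing the result with `within_tolerance`. Every step is a pure function of its inputs, which is what makes the check repeatable.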

Given a source document (earnings press release, transcript, 8-K) and an LLM-generated output, Armour:

Parses the source into a structured fact store: sentences, numbers with units and periods, named entities, table values.
Extracts every claim from the LLM output and classifies it by type and by origin (restatement, derived, inference, fabrication).
Anchors each claim to the best-matching source sentence via token-set overlap and optional sentence embeddings.
Runs per-claim checks: numeric tolerance, arithmetic recompute, period alignment, direction sign-check, scope fingerprint, entity presence.
Assigns a graduated verdict and produces a claim-by-claim ledger with source quotes.
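The anchoring step can be pictured with a small token-set overlap sketch. The tokeniser, the use of Jaccard similarity as the overlap measure, and the function names below are assumptions made for illustration. The source only states that anchoring uses token-set overlap with optional sentence embeddings, and embeddings are left out here.

```python
import re

# Hypothetical tokeniser: lowercase words and numbers, keeping decimals such as "4.2".
_TOKEN = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")


def tokens(text: str) -> frozenset[str]:
    """Lowercase the text and return its set of word and number tokens."""
    return frozenset(_TOKEN.findall(text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def anchor(claim: str, source_sentences: list[str]) -> tuple[int, float]:
    """Return (index, score) of the best-matching source sentence.

    Ties go to the earliest sentence, so the same input always
    yields the same anchor.
    """
    if not source_sentences:
        raise ValueError("source_sentences must not be empty")
    claim_tokens = tokens(claim)
    scores = [jaccard(claim_tokens, tokens(s)) for s in source_sentences]
    best = max(range(len(scores)), key=lambda i: (scores[i], -i))
    return best, scores[best]


if __name__ == "__main__":
    source = [
        "Revenue rose 12% to $4.2 billion in the third quarter.",
        "The company repurchased 3 million shares.",
    ]
    print(anchor("Third-quarter revenue rose 12% to $4.2 billion.", source))
```

A low best score is itself informative: in a design like this, a claim whose best anchor falls below some threshold would be a candidate for one of the inference or fabrication verdicts rather than a pass.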

Verdict ladder

Verdict	Meaning
pass	Restates a fact present in the source.
derived_pass	Derived from the source via verifiable arithmetic.
minor_drift	Grounded but contains small wording or rounding drift.
inference_unanchored	Reasonable inference; no specific source sentence anchors it.
inference_unverified	Inference whose factual basis could not be deterministically verified.
fail	Conflicts with or misstates source material.
fabrication	Asserts entities, numbers, or events absent from the source.

The ledger is not a summary score. It is a citable, per-claim record that tells a compliance reviewer exactly which source sentence was checked, what the verdict was, and why. Identical input produces an identical ledger, every time.

v0.4 headline results

203 (model, ticker, task) cells across 30 sector-diversified US large-cap earnings press releases, audited under the compliance profile. Accuracy is defined as (pass + derived_pass + minor_drift) / total claims. Fabrication is the share of claims with no traceable anchor in the source document.
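The accuracy definition above reduces to a few lines of arithmetic over a list of verdicts. The sketch below assumes verdicts arrive as plain strings named as in the verdict ladder, and it assumes the fabrication rate is the share of claims carrying the fabrication verdict. Both function names are illustrative rather than part of Armour's API.

```python
from collections import Counter

# Verdicts counted as accurate, per the definition in the text.
GROUNDED = ("pass", "derived_pass", "minor_drift")


def accuracy(verdicts: list[str]) -> float:
    """(pass + derived_pass + minor_drift) / total claims."""
    if not verdicts:
        raise ValueError("no claims to score")
    counts = Counter(verdicts)
    return sum(counts[v] for v in GROUNDED) / len(verdicts)


def fabrication_rate(verdicts: list[str]) -> float:
    """Share of claims with the 'fabrication' verdict (an assumed mapping)."""
    if not verdicts:
        raise ValueError("no claims to score")
    return Counter(verdicts)["fabrication"] / len(verdicts)
```

For a ledger with one claim each of pass, derived_pass, inference_unanchored, fail, and fabrication, this gives an accuracy of 2/5 and a fabrication rate of 1/5. Inference verdicts count against accuracy without counting as fabrication.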

Model	Claims	Accuracy	Accuracy 95% CI	Fabrication	Fabrication 95% CI
GPT-5.5 Pro	742	74.3%	69.4–78.6%	5.1%	3.5–6.9%
OpenAI o3	2,038	71.5%	68.1–74.7%	5.3%	3.9–6.9%
Claude Sonnet 4.6	2,372	59.6%	56.9–62.3%	8.5%	6.7–10.4%
Claude Opus 4.7 (extended thinking)	2,220	60.0%	57.0–62.9%	10.6%	8.7–12.7%

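CIs are 95% cluster bootstrap intervals (resampling cells with replacement, 5,000 draws, seed 20260506). The OpenAI-vs-Anthropic accuracy gap is robust: the OpenAI lower bounds (68.1% and 69.4%) sit above the Anthropic upper bounds (62.3% and 62.9%). For fabrication, where lower is better, the relevant comparison is the OpenAI upper bounds (6.9% for both models) against the Anthropic lower bounds: Opus (8.7%) clears them, while Sonnet (6.7%) overlaps by a narrow margin, so the OpenAI-vs-Sonnet fabrication gap is directional rather than cleanly separated at 95% confidence. The Opus-vs-Sonnet fabrication difference is also directional (10.6% vs 8.5%), with overlapping CIs. Full methodology and a matched-subset robustness check are in WHITEPAPER.md.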
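For readers unfamiliar with the method, a cluster bootstrap over cells can be sketched as follows. This is not the code in evaluation/bootstrap_ci.py: the `(hits, total)` cell representation, the pooled-ratio statistic, and the percentile interval are assumptions. Only the resampling of whole cells with replacement, the 5,000 draws, and the seed come from the description above.

```python
import random


def cluster_bootstrap_ci(
    cells: list[tuple[int, int]],
    draws: int = 5000,
    seed: int = 20260506,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Percentile CI for a pooled rate, resampling whole cells with replacement.

    Each cell is (hits, total): for accuracy, hits is the number of claims
    with a grounded verdict and total is the number of claims in that cell.
    """
    if not cells or any(total <= 0 for _, total in cells):
        raise ValueError("every cell needs at least one claim")
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(cells)
    stats = []
    for _ in range(draws):
        sample = [cells[rng.randrange(n)] for _ in range(n)]
        hits = sum(h for h, _ in sample)
        total = sum(t for _, t in sample)
        stats.append(hits / total)
    stats.sort()
    lo_idx = int(draws * alpha / 2)
    hi_idx = draws - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]
```

Resampling cells rather than individual claims respects the fact that claims from the same (model, ticker, task) output are not independent, which generally widens the interval compared with a naive claim-level bootstrap.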

Claude Opus 4.7 with extended thinking, the model marketed as the premium reasoning option, produced the highest fabrication rate in the panel (10.6%). The "use a better model" advice does not survive contact with this dataset.

Try it without API keys

The demo command audits a bundled fictional earnings release against a bundled LLM-style output, producing five of the seven verdict types in the ladder. No OpenAI or Anthropic key is required.

pip install -e .
armour demo

This writes demo_output/audit.json and demo_output/audit.html to your working directory. Open the HTML file for the per-claim source-trace ledger.

The five demo claims cover one faithful restatement (pass), one verified arithmetic derivation (derived_pass), one unanchored inference (inference_unanchored), one contradiction of the source (fail), and one assertion absent from the source entirely (fabrication).

Reproducibility

The v0.4 headline numbers are fully reproducible from the committed evaluation artefacts in batch_results_v2/. No API keys, no network access, no LLM inference required.

pip install -e ".[dev]"

# Reproduce headline table from committed artefacts
python evaluation/reproduce_v04.py

# Reproduce 95% confidence intervals (5,000 draws, seed 20260506)
python evaluation/bootstrap_ci.py

# Full test suite (44 tests, all offline)
pytest

Both scripts print PASS on exit when the computed numbers match the stored artefacts exactly. See evaluation/README.md for full methodology and metric definitions.

GitHub Actions CI (Python 3.10 and 3.11) runs all four steps on every push. The CI badge is the live verification status of the committed artefacts.

What Armour does not guarantee

Deterministic does not mean infallible. Armour checks consistency with the supplied source document. It does not verify truth in the world.

Source consistency, not world truth. If the source document contains an error, Armour will pass a claim that correctly restates that error. An Armour pass means "consistent with the source", not "factually correct about the company".
Anchoring errors are deterministic. Where the anchoring or classification logic makes an error, that error is deterministic and inspectable in the claim ledger. It is not random and it cannot be hidden.
Sentence-level granularity. Claims that synthesise information across multiple non-adjacent source sentences may anchor weakly. Arithmetic recompute partially mitigates this for numeric claims.
Inference requires human review. Unanchored inferences are flagged, not condemned. A qualified reviewer adjudicates whether the inference is sound.
Finance domain scope. The v0.4 evaluation covers US large-cap earnings press releases. Performance on 10-K filings, broker notes, transcripts, or non-English sources has not been characterised at this scale.
Human sign-off is required. Armour produces a ledger. A qualified human makes the compliance determination.
