The problem
Generative models are now routine producers of investment summaries, earnings-call recaps, and research notes. Their outputs are also routinely wrong in ways that are difficult to detect through conventional review: invented numbers, inverted directional claims, misattributed quotes, and plausible-sounding inferences with no source backing.
The standard mitigations fail in characteristic ways:
- Use a better model. Reasoning models editorialise more, not less. The v0.4 evaluation shows the marketed premium model produces the highest fabrication rate in the panel.
- Use an LLM as a judge. A judge model shares the same training distribution and failure modes as the model under test. Errors are correlated. The judge agrees with the wrong answer.
- Use RAG. Retrieval grounds the input. It does not constrain the output. The model remains free to interpolate, smooth, and embellish.
The missing primitive is a non-generative verifier: a system that reports, for each claim in an LLM output, which specific source content backs the assertion and how strong the match is, without itself being capable of hallucination.
What Project Armour does
Zero generative AI in the audit loop. All verification is deterministic: regex extraction, arithmetic recompute, period alignment, scope fingerprinting, and token/embedding anchoring. The audit cannot fabricate an explanation because it cannot generate text.
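To make "deterministic" concrete, here is a minimal sketch of regex extraction followed by an arithmetic recompute. The pattern, the function names, and the 0.1 percentage-point tolerance are assumptions chosen for illustration; they are not taken from Armour's code.

```python
import re

# Hypothetical pattern: dollar amounts such as "$4.2 billion" or "$950 million".
MONEY = re.compile(r"\$(\d+(?:\.\d+)?)\s*(million|billion)", re.IGNORECASE)
SCALE = {"million": 1e6, "billion": 1e9}


def extract_amounts(text: str) -> list[float]:
    """Return every dollar amount in `text`, converted to plain dollars."""
    return [float(num) * SCALE[unit.lower()] for num, unit in MONEY.findall(text)]


def growth_pct(current: float, prior: float) -> float:
    """Percentage change from `prior` to `current`."""
    return (current - prior) / prior * 100.0


def recompute_matches(claimed_pct: float, current: float, prior: float,
                      tol: float = 0.1) -> bool:
    """True when the claimed growth is within `tol` points of the recomputed value."""
    return abs(growth_pct(current, prior) - claimed_pct) <= tol


# Illustrative source sentence, not from a real filing.
source = "Revenue was $4.2 billion, compared with $4.0 billion a year earlier."
current, prior = extract_amounts(source)

print(recompute_matches(5.0, current, prior))  # claim under test: "revenue grew 5%"
print(recompute_matches(8.0, current, prior))  # claim under test: "revenue grew 8%"
```

Every step is a pure function of the input text, so the same source and claim always produce the same result.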
Given a source document (earnings press release, transcript, 8-K) and an LLM-generated output, Armour:
- Parses the source into a structured fact store: sentences, numbers with units and periods, named entities, table values.
- Extracts every claim from the LLM output and classifies it by type and by origin (restatement, derived, inference, fabrication).
- Anchors each claim to the best-matching source sentence via token-set overlap and optional sentence embeddings.
- Runs per-claim checks: numeric tolerance, arithmetic recompute, period alignment, direction sign-check, scope fingerprint, entity presence.
- Assigns a graduated verdict and produces a claim-by-claim ledger with source quotes.
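The anchoring step above can be sketched with a simple token-set overlap. The source does not say which overlap measure Armour uses, so Jaccard similarity is an assumption here, as are the tokeniser and the function names.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word and number tokens; decimals such as 4.2 stay whole."""
    return set(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))


def overlap(claim: str, sentence: str) -> float:
    """Jaccard similarity of the two token sets, from 0.0 to 1.0."""
    a, b = tokens(claim), tokens(sentence)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def anchor(claim: str, sentences: list[str]) -> tuple[int, float]:
    """Index and score of the best-matching source sentence.

    Ties resolve to the earliest sentence, which keeps the result deterministic.
    """
    scores = [overlap(claim, s) for s in sentences]
    best = max(range(len(sentences)), key=lambda i: (scores[i], -i))
    return best, scores[best]


# Illustrative sentences, not from a real filing.
sentences = [
    "Revenue rose 5% to $4.2 billion in the third quarter.",
    "The board declared a dividend of $0.50 per share.",
]
idx, score = anchor("Third-quarter revenue rose 5% to $4.2 billion.", sentences)
print(idx, round(score, 2))
```

A low best score is itself useful information: a claim whose strongest anchor is weak is a candidate for the unanchored or fabrication end of the ladder.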
Verdict ladder
| Verdict | Meaning |
|---|---|
| pass | Restates a fact present in the source. |
| derived_pass | Derived from the source via verifiable arithmetic. |
| minor_drift | Grounded but contains small wording or rounding drift. |
| inference_unanchored | Reasonable inference; no specific source sentence anchors it. |
| inference_unverified | Inference whose factual basis could not be deterministically verified. |
| fail | Conflicts with or misstates source material. |
| fabrication | Asserts entities, numbers, or events absent from the source. |
The ledger is not a summary score. It is a citable, per-claim record that tells a compliance reviewer exactly which source sentence was checked, what the verdict was, and why. Identical input produces an identical ledger, every time.
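A per-claim ledger of this kind might be modelled as follows. The verdict names come from the table above; the `LedgerEntry` field names and the JSON layout are hypothetical and may not match the real `audit.json`.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    DERIVED_PASS = "derived_pass"
    MINOR_DRIFT = "minor_drift"
    INFERENCE_UNANCHORED = "inference_unanchored"
    INFERENCE_UNVERIFIED = "inference_unverified"
    FAIL = "fail"
    FABRICATION = "fabrication"


@dataclass(frozen=True)
class LedgerEntry:
    # Field names are hypothetical, chosen for this sketch.
    claim: str
    verdict: Verdict
    source_quote: Optional[str]  # None when no source sentence anchors the claim
    reason: str


def render_ledger(entries: list[LedgerEntry]) -> str:
    """Serialise with sorted keys so identical input yields identical text."""
    rows = [
        {"claim": e.claim, "verdict": e.verdict.value,
         "source_quote": e.source_quote, "reason": e.reason}
        for e in entries
    ]
    return json.dumps(rows, sort_keys=True, indent=2)


# Illustrative entries, not real audit output.
ledger = [
    LedgerEntry("Revenue rose 5% to $4.2 billion.", Verdict.PASS,
                "Revenue rose 5% to $4.2 billion in the third quarter.",
                "numeric values and direction match the anchored sentence"),
    LedgerEntry("The company announced a new CFO.", Verdict.FABRICATION,
                None, "no source sentence mentions this event"),
]
print(render_ledger(ledger))
```

Because nothing in the rendering depends on time, randomness, or dictionary ordering, two runs over the same entries produce byte-identical output.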
v0.4 headline results
203 (model, ticker, task) cells across 30 sector-diversified US large-cap earnings press releases, audited under the compliance profile. Accuracy is defined as (pass + derived_pass + minor_drift) / total claims. Fabrication is the share of claims with no traceable anchor in the source document.
| Model | Claims | Accuracy | Accuracy 95% CI | Fabrication | Fabrication 95% CI |
|---|---|---|---|---|---|
| GPT-5.5 Pro | 742 | 74.3% | 69.4–78.6% | 5.1% | 3.5–6.9% |
| OpenAI o3 | 2,038 | 71.5% | 68.1–74.7% | 5.3% | 3.9–6.9% |
| Claude Sonnet 4.6 | 2,372 | 59.6% | 56.9–62.3% | 8.5% | 6.7–10.4% |
| Claude Opus 4.7 (extended thinking) | 2,220 | 60.0% | 57.0–62.9% | 10.6% | 8.7–12.7% |
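
The two headline metrics can be computed from a list of per-claim verdicts as sketched below. Treating the fabrication rate as the share of claims carrying the `fabrication` verdict is an assumption about how the definition maps onto the ladder; the function names and the sample verdicts are illustrative only.

```python
from collections import Counter

# Verdicts that count towards accuracy, per the definition in this section.
GROUNDED = ("pass", "derived_pass", "minor_drift")


def accuracy(verdicts: list[str]) -> float:
    """(pass + derived_pass + minor_drift) / total claims."""
    counts = Counter(verdicts)
    return sum(counts[v] for v in GROUNDED) / len(verdicts)


def fabrication_rate(verdicts: list[str]) -> float:
    """Share of claims with the `fabrication` verdict (assumed mapping)."""
    return Counter(verdicts)["fabrication"] / len(verdicts)


# Synthetic example: ten claims, not real evaluation data.
verdicts = ["pass"] * 6 + ["derived_pass", "minor_drift", "fail", "fabrication"]
print(accuracy(verdicts), fabrication_rate(verdicts))
```

The two rates do not sum to one: inferences and `fail` verdicts count against accuracy without counting as fabrication.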
CIs are 95% cluster bootstrap intervals (resampling cells with replacement, 5,000 draws, seed 20260506). The OpenAI-vs-Anthropic accuracy gap is robust: the OpenAI lower bounds (68.1% and 69.4%) sit well above the Anthropic upper bounds (62.3% and 62.9%). The fabrication gap is clear against Opus, whose lower bound (8.7%) exceeds both OpenAI upper bounds (6.9%), but the Sonnet interval (6.7–10.4%) marginally overlaps the OpenAI intervals. The Opus-vs-Sonnet fabrication difference is directional (10.6% vs 8.5%), but the CIs overlap at 95% confidence. Full methodology and a matched-subset robustness check are in WHITEPAPER.md.
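For readers unfamiliar with the technique, a cluster bootstrap over cells can be sketched as follows. This is not the code in `evaluation/bootstrap_ci.py`: the percentile method, the pooled-rate statistic, and the function name are assumptions, and the cell counts are synthetic.

```python
import random


def cluster_bootstrap_ci(cells: list[tuple[int, int]], draws: int = 5000,
                         seed: int = 20260506,
                         alpha: float = 0.05) -> tuple[float, float]:
    """Percentile CI for a pooled rate, resampling whole cells with replacement.

    `cells` holds one (hits, total_claims) pair per evaluation cell; every
    cell is assumed to contain at least one claim.
    """
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    stats = []
    for _ in range(draws):
        sample = rng.choices(cells, k=len(cells))  # resample cells, not claims
        hits = sum(h for h, _n in sample)
        total = sum(n for _h, n in sample)
        stats.append(hits / total)
    stats.sort()
    lower = stats[int((alpha / 2) * draws)]
    upper = stats[int((1 - alpha / 2) * draws) - 1]
    return lower, upper


# Synthetic cells: (grounded claims, total claims). Not real evaluation data.
cells = [(8, 10), (5, 12), (9, 9), (3, 7), (6, 8), (10, 14)]
low, high = cluster_bootstrap_ci(cells)
print(f"{low:.3f} to {high:.3f}")
```

Resampling cells rather than individual claims respects the fact that claims from the same model output are not independent, which is why the intervals are wider than a naive per-claim bootstrap would give.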
The model marketed as the premium reasoning option, Claude Opus 4.7 with extended thinking, produced the highest fabrication rate in the panel (10.6%). The "use a better model" advice does not survive contact with this dataset.
Try it without API keys
The demo command audits a bundled fictional earnings release against a bundled LLM-style output, producing five of the seven verdict types in the ladder. No OpenAI or Anthropic key required.
```bash
pip install -e .
armour demo
```
This writes demo_output/audit.json and demo_output/audit.html to your working directory. Open the HTML file for the per-claim source-trace ledger.
The five demo claims cover one faithful restatement (pass), one verified arithmetic derivation (derived_pass), one unanchored inference (inference_unanchored), one contradiction of the source (fail), and one assertion absent from the source entirely (fabrication).
Reproducibility
The v0.4 headline numbers are fully reproducible from the committed evaluation artefacts in batch_results_v2/. No API keys, no network access, no LLM inference required.
```bash
pip install -e ".[dev]"

# Reproduce headline table from committed artefacts
python evaluation/reproduce_v04.py

# Reproduce 95% confidence intervals (5,000 draws, seed 20260506)
python evaluation/bootstrap_ci.py

# Full test suite (44 tests, all offline)
pytest
```
Each of the two scripts prints PASS when its computed numbers match the stored artefacts exactly. See evaluation/README.md for full methodology and metric definitions.
GitHub Actions CI (Python 3.10 and 3.11) runs all four steps on every push. The CI badge is the live verification status of the committed artefacts.
What Armour does not guarantee
Deterministic does not mean infallible. Armour checks consistency with the supplied source document. It does not verify truth in the world.
- Source consistency, not world truth. If the source document contains an error, Armour will pass a claim that correctly restates that error. An Armour pass means "consistent with the source", not "factually correct about the company".
- Anchoring errors are deterministic. Where the anchoring or classification logic makes an error, that error is deterministic and inspectable in the claim ledger. It is not random and it cannot be hidden.
- Sentence-level granularity. Claims that synthesise information across multiple non-adjacent source sentences may anchor weakly. Arithmetic recompute partially mitigates this for numeric claims.
- Inference requires human review. Unanchored inferences are flagged, not condemned. A qualified reviewer adjudicates whether the inference is sound.
- Finance domain scope. The v0.4 evaluation covers US large-cap earnings press releases. Performance on 10-K filings, broker notes, transcripts, or non-English sources has not been characterised at this scale.
- Human sign-off is required. Armour produces a ledger. A qualified human makes the compliance determination.
Further reading
- WHITEPAPER.md — full methodology, per-task breakdown, failure-mode signatures per model, limitations, and Appendix B reproducibility commands.
- evaluation/README.md — metric definitions, script usage, and verification instructions.
- README.md — install, quick-start CLI usage, verdict ladder, weight profiles, and repository layout.
Source code, committed evaluation artefacts, and the full test suite are available at github.com/C-Murdoch/projectarmour.