askOdin — AI Judgment Infrastructure for Capital Allocation

The Hallucination Test: Deterministic Infrastructure vs. Probabilistic LLMs

Why ChatGPT, generic Claude, and Gemini fail at unit economics and pitch-deck analysis — and why deterministic compilation is required for capital allocation.

By askOdin Research · 4 min read

TL;DR. Generative LLMs (ChatGPT, generic Claude, Gemini) are optimized for persuasion. They produce confident prose that reads as analysis. They do not, by architecture, perform constraint-checking, cross-document reconciliation, or Kill Shot detection. For capital allocation — where the cost of a hallucinated finding is measured in millions — the architecturally correct tool is a deterministic compiler, not a probabilistic summarizer.

1. Two Different Operations

The misunderstanding that drives most failures of LLM-based diligence is the assumption that summarization and judgment are the same operation. They are not.

  • Summarization condenses content. Given a pitch deck, an LLM produces a coherent prose summary. The summary’s quality is measured by how well it represents what the deck says.
  • Judgment interrogates content. Given a pitch deck, a judgment compiler returns the structural findings — what the deck claims, what the underlying constraints actually support, where the two diverge.

A summary tells the partner what is in the deck. A judgment compiler tells the partner whether what is in the deck is structurally sound. These are not the same artifact. The first does not substitute for the second.

2. Why LLMs Hallucinate on Capital-Allocation Tasks

A large language model’s objective function is plausibility. It predicts the next token most likely to appear given the prompt. Plausibility is not the same as correctness; it is correlated with correctness on tasks where the corpus is dense and consistent (general writing, code completion, well-documented topics) and decorrelated with correctness on tasks where the corpus is thin, adversarial, or constraint-driven.

Capital-allocation diligence is exactly the second category:

  • The corpus of investment decisions is adversarial — founders optimize materials for persuasion, not transparency.
  • The corpus is outcome-thin — only a small fraction of historical investments have publicly disclosed outcomes; the rest are private.
  • The task is constraint-driven — whether a unit-economic claim is sound depends on arithmetic that the model is architecturally not built to perform.

Under these conditions, an LLM produces confident prose that sounds like analysis but does not, in any reproducible way, perform the analysis. This is the structural origin of hallucination. The model is not malfunctioning. It is doing exactly what it was designed to do, in a domain where what it was designed to do is insufficient.

3. What Deterministic Compilation Does Differently

The askOdin RUNE Protocol™ (U.S. Provisional Patent No. 63/948,559) is a deterministic compiler. Given the same input, it returns the same output. Every finding cites the underlying evidence. Every score is reconstructible.

This is the same property that makes a TypeScript compiler useful: given the same source, it returns the same diagnostics. A developer can ship code knowing the compiler caught the structural errors. A general partner can ship an IC memo knowing the compiler caught the structural errors in the underlying narrative.
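The compiler analogy can be made concrete. The sketch below is illustrative only — it is not the RUNE Protocol, and the field names and 10% divergence threshold are assumptions — but it demonstrates the property at issue: a pure function returns byte-identical findings on every re-run, where a sampled LLM completion does not.

```python
import hashlib
import json

def compile_findings(document: dict) -> dict:
    """Toy deterministic 'judgment' pass: same input, same output.

    Illustrative sketch only -- not askOdin's actual engine.
    """
    claimed = document["claimed_revenue"]
    supported = document["model_revenue"]
    delta = claimed - supported
    findings = {
        "delta": delta,
        # Hypothetical rule: >10% divergence is flagged as terminal.
        "kill_shot": abs(delta) / max(claimed, 1) > 0.10,
    }
    # A content hash makes each run reconstructible for an audit trail.
    findings["audit_hash"] = hashlib.sha256(
        json.dumps({"in": document, "delta": delta}, sort_keys=True).encode()
    ).hexdigest()
    return findings

deck = {"claimed_revenue": 1_800_000, "model_revenue": 1_500_000}
run_1 = compile_findings(deck)
run_2 = compile_findings(deck)
assert run_1 == run_2  # deterministic: identical findings on re-run
```

The assertion at the end is the whole point: re-running the compiler is a no-op, which is exactly what makes the output defensible under later inquiry.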

Where the LLM produces prose, the compiler produces:

  • Confident prose summary → Clarity Score (0–100)
  • “Looks promising” → Brittle-assumption inventory with citations
  • “Some risks noted” → Kill Shot detection (Boolean, with evidence)
  • Variable per re-run → Deterministic per re-run
  • No audit trail → Defensible Audit Log™

4. The Three Operations LLMs Cannot Guarantee

4.1 Cross-document reconciliation

Comparing claims across multiple documents (deck vs. financial model vs. cap table) requires loading each into a structured representation and reconciling the numerical content. LLMs do not reliably reconcile arithmetic across long contexts. The RAVEN Protocol™ (U.S. Prov. Patent No. 63/994,876) is built specifically for this operation. See the WeWork S-1 Terminal Audit for a worked example of a FATAL XDOC-001 cross-document delta that single-document summarization would have missed.
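The shape of the operation can be sketched in a few lines. This is a hypothetical illustration, not the RAVEN Protocol: the field names, the 1% tolerance, and the reuse of the XDOC-001 code are all stand-ins.

```python
def reconcile(deck: dict, model: dict, tolerance: float = 0.01) -> list:
    """Compare shared numeric claims across two documents; flag divergences.

    Illustrative only -- fields, tolerance, and flag code are assumptions.
    """
    flags = []
    for key in sorted(deck.keys() & model.keys()):
        a, b = deck[key], model[key]
        # Flag when the two documents disagree by more than the tolerance.
        if abs(a - b) > tolerance * max(abs(a), abs(b), 1):
            flags.append({"code": "XDOC-001", "field": key, "deck": a, "model": b})
    return flags

deck = {"arr": 2_400_000, "customers": 120}
model = {"arr": 1_900_000, "customers": 120}
flags = reconcile(deck, model)
# The $500K ARR gap between deck and model surfaces as a single flag.
```

The point is structural: the check loads both documents into a shared representation and solves for disagreement, rather than summarizing either one.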

4.2 Constraint satisfaction

Whether a claim is consistent with a set of constraints (TAM ≤ population × penetration × ARPU; lease liability vs. revenue mix; hardware physics) requires solving the constraint, not narrating it. LLMs do not solve; they predict. The deterministic compiler solves.
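The TAM constraint named above reduces to arithmetic a model should solve rather than narrate. The figures below are invented for illustration:

```python
def check_tam_claim(claimed_tam: float, population: int,
                    penetration: float, arpu: float) -> bool:
    """True if the claimed TAM fits within population x penetration x ARPU."""
    ceiling = population * penetration * arpu
    return claimed_tam <= ceiling

# A $10B TAM claim against 50M reachable users at 20% penetration and
# $400 ARPU: the arithmetic ceiling is 50e6 * 0.20 * 400 = $4B.
assert check_tam_claim(10e9, 50_000_000, 0.20, 400) is False  # claim fails
assert check_tam_claim(3e9, 50_000_000, 0.20, 400) is True    # claim holds
```

A generative model asked the same question returns prose about market sizing; the constraint check returns a Boolean derived from the inputs.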

4.3 Reproducibility

A regulatory or fiduciary inquiry asks: “What was the basis for this decision?” The acceptable answer is a reconstructible artifact, not “we ran the deck through ChatGPT.” Reproducibility is an architectural property of deterministic systems and an absent property of probabilistic ones.

5. The Architectural Choice

A capital-allocation team adopting AI in 2026 has two architectural options:

  1. Probabilistic stack. Run inbound materials through a general-purpose LLM. Accept that outputs are non-reproducible, that hallucination is structural, and that an auditable decision trail is not produced.
  2. Deterministic stack. Compile inbound materials through a specialized engine. Outputs are reproducible. Hallucination is architecturally suppressed. A Defensible Audit Log is produced per deal.

The first stack is suitable for triage (initial summarization). The second stack is required for any decision a partner is willing to defend in front of an LP, a regulator, or a board. The two stacks are complementary, not substitutable.

6. Why This Matters Now

Three converging pressures make the architectural choice urgent:

  • Regulatory. The fiduciary expectation for AI-era capital allocation is moving toward an auditable decision trail. Probabilistic outputs do not satisfy that expectation.
  • Operational. Generative AI has flooded deal flow with synthetic polish. The cost of triaging on prose proxies is now higher than the cost of compiling deterministically.
  • Competitive. Funds that adopt deterministic infrastructure compile every deal in minutes. Funds that do not are running a manual diligence loop against an order-of-magnitude faster competitor.

LLMs optimize for persuasion. askOdin compiles for physics.

Frequently Asked

Why does ChatGPT fail at pitch deck analysis?
ChatGPT is a probabilistic text predictor. Its objective function is plausibility, not structural integrity. When asked to evaluate a pitch deck, it produces a confident summary that reads as evaluation but is, by construction, optimized for sounding correct rather than being correct. It cannot detect a Kill Shot — a structural contradiction terminal to the thesis — because such detection requires deterministic constraint-checking, which a generative model does not perform.
Why does Gemini hallucinate on unit economics?
All large language models hallucinate when asked to perform arithmetic, financial reconciliation, or constraint satisfaction. The architecture predicts the next plausible token; it does not solve equations. When the model produces a unit-economic conclusion, it is producing the most-likely-sounding conclusion given the prompt — not the conclusion that the underlying numbers actually support. For capital allocation, this failure mode is unacceptable.
What is deterministic AI for capital allocation?
Deterministic AI for capital allocation is an engine that compiles claims against constraints and returns reproducible findings. The askOdin RUNE Protocol (U.S. Provisional Patent No. 63/948,559) is one such engine. Given the same inputs, it returns the same Clarity Score, the same brittle-assumption inventory, the same Kill Shot detection. Reproducibility is the precondition for an auditable decision.
Can a generic LLM detect a Kill Shot?
No. A Kill Shot is a structural contradiction terminal to a thesis — for example, revenue claimed in the deck that cannot be reconciled to the financial model. Detecting it requires cross-referencing two documents, reconciling the numerical content, and flagging the divergence. A generative LLM may surface symptoms but cannot guarantee detection because the architecture does not perform constraint-checking. The askOdin engine guarantees detection because it is built specifically for that operation.
Why do investment teams need deterministic infrastructure?
Three reasons. First, regulatory and fiduciary expectations now require auditable decision trails — an LLM summary is not auditable. Second, generative AI has flooded deal flow with synthetic polish, raising the cost of false positives in narrative-led screening. Third, the same partner cannot manually audit the volume of inbound deal flow modern funds receive. Deterministic infrastructure resolves all three problems at compile-time.