Why Chained Evaluation?

Legal reasoning is not a multiple-choice test. It is a sequence of dependencies where errors propagate.

LegalChain uses chained evaluation: each step's output becomes the next step's input. This differs from typical benchmarks, which score questions in isolation. Chaining reflects how legal reasoning actually works: errors in one part of an analysis propagate through the rest.
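The chained loop can be sketched in a few lines of Python. This is a minimal, illustrative harness; `StepResult` and `run_chain` are hypothetical names, not the LegalChain API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    output: str   # text passed forward to the next step
    correct: bool # whether this step's output was judged correct

def run_chain(steps: list[Callable[[str], StepResult]],
              initial_input: str) -> list[StepResult]:
    """Run each step on the previous step's output, recording results."""
    results = []
    current = initial_input
    for step in steps:
        result = step(current)
        results.append(result)
        current = result.output  # each output becomes the next input
    return results
```

The key design point is that `current` is threaded through the loop, so a flawed output at step k is the literal input at step k+1; nothing re-anchors the chain to ground truth.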

The Problem with Independent Evaluation

Most AI benchmarks present isolated tasks: "What is the capital of France?" or "Summarize this document." Each question is scored independently. Getting Q1 wrong has no effect on Q2.

This works for factual retrieval. It fails completely for legal reasoning.

Example: The IRAC Cascade

Consider a Fourth Amendment analysis:
1. Issue: Was it a "search"?
2. Rule: The Katz reasonable-expectation-of-privacy test.
3. Analysis: Application of the Katz test to the facts.
4. Conclusion: Whether the Fourth Amendment was violated.

If Step 1 is wrong (e.g., framing the facts as a Fifth Amendment issue), every downstream step is irrelevant, even if "correctly" generated.
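One hypothetical way to score such a cascade is strict prefix credit: a step earns credit only if every upstream step was also correct. The rule below is an illustrative sketch, not LegalChain's actual metric:

```python
def chain_score(step_correct: list[bool]) -> float:
    """Fraction of steps credited under strict prefix scoring:
    credit stops at the first incorrect step."""
    score = 0
    for ok in step_correct:
        if not ok:
            break  # everything downstream of an error earns nothing
        score += 1
    return score / len(step_correct)
```

Under this rule, a chain that misidentifies the issue at step 1 scores zero regardless of how polished the later steps are, which matches the IRAC intuition above.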

"Error propagation is not a bug—it is the core feature."

Cascading Failures

In LegalChain, we explicitly model this dependency. If S3 (Authority Check) fails to validate a case, S6 (Synthesis) must proceed without it—or hallucinate.
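This dependency structure can be made explicit as a small graph. The step names (S1, S2, S3, S6, S8) follow the text, but `DEPENDS_ON` and `degraded_steps` are hypothetical sketches of the idea:

```python
# Which steps each step depends on (assumed edges, following the text).
DEPENDS_ON = {
    "S2": ["S1"],
    "S6": ["S2", "S3"],
    "S8": ["S6"],
}

def degraded_steps(failed: set[str]) -> set[str]:
    """Return every step that transitively depends on a failed step."""
    degraded = set(failed)
    changed = True
    while changed:
        changed = False
        for step, deps in DEPENDS_ON.items():
            if step not in degraded and any(d in degraded for d in deps):
                degraded.add(step)
                changed = True
    return degraded
```

For example, a failure at S3 degrades S6 and, through it, S8, while S1 and S2 are unaffected; this is exactly the propagation path the evaluation is designed to surface.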

Figure 1. The Propagation of Error.
S1 (Bad Issue ID) → S2 (Irrelevant Rule): wrong context
S2 (Irrelevant Rule) → S6 (Hallucinated Analysis): flawed premise
S6 (Hallucinated Analysis) → S8 (Integrity Failure): fabrication detected at the integrity gate

What Chaining Reveals

Robustness: Does the model recover from upstream errors? If S2 provides a slightly wrong rule, can S6 still produce a reasonable analysis?

Consistency: Does the model maintain consistent positions throughout the chain?
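A robustness metric along these lines might measure a recovery rate: among chains where an upstream step erred, how often did the downstream step still produce an acceptable output? The function below is an illustrative sketch, not a metric defined by LegalChain:

```python
def recovery_rate(outcomes: list[tuple[bool, bool]]) -> float:
    """Each tuple is (upstream_correct, downstream_acceptable).
    Returns the fraction of downstream successes among chains
    where the upstream step was wrong."""
    errored = [down for up, down in outcomes if not up]
    if not errored:
        return 0.0  # no upstream errors observed, nothing to recover from
    return sum(errored) / len(errored)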