Why Chained Evaluation?
Legal reasoning is not a multiple-choice test. It is a sequence of dependencies where errors propagate.
LegalChain uses chained evaluation: each step's output becomes the next step's input. This differs from typical benchmarks, which score independent questions in isolation, and it mirrors how legal analysis actually unfolds, with an error in one step carrying through every step that follows.
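The chaining mechanism can be sketched as a simple pipeline. This is a minimal illustration, not the actual LegalChain harness; the step names and toy logic are assumptions for the example.

```python
# Minimal sketch of chained evaluation: each step consumes the previous
# step's output. Step functions here are toy stand-ins, not real models.

def run_chain(steps, initial_input):
    """Run steps in order, feeding each output into the next step."""
    context = initial_input
    transcript = []
    for step in steps:
        context = step(context)
        transcript.append(context)
    return transcript

# Toy steps: note how a wrong issue at step 1 would flow downstream.
identify_issue = lambda facts: ("Fourth Amendment search"
                                if "surveillance" in facts
                                else "unknown issue")
state_rule = lambda issue: f"Rule for: {issue}"
apply_rule = lambda rule: f"Analysis under ({rule})"

transcript = run_chain([identify_issue, state_rule, apply_rule],
                       "warrantless surveillance of a phone booth")
```

Because every step reads only the previous output, grading the final entry of `transcript` implicitly grades the whole chain.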
The Problem with Independent Evaluation
Most AI benchmarks present isolated tasks: "What is the capital of France?" or "Summarize this document." Each question is scored independently. Getting Q1 wrong has no effect on Q2.
This works for factual retrieval. It fails completely for legal reasoning.
Example: The IRAC Cascade
Consider a Fourth Amendment analysis:
1. Issue: Was the government's conduct a "search"?
2. Rule: The Katz "reasonable expectation of privacy" test.
3. Analysis: Applying the Katz test to the facts.
4. Conclusion: Whether the Fourth Amendment was violated.
If Step 1 is wrong (e.g., framing it as a Fifth Amendment issue), Steps 2 through 4 are built on the wrong foundation and are legally irrelevant, even if each is "correctly" generated in isolation.
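The cascade above can be made concrete with a toy chain. The issue-to-rule mapping here is a deliberately simplified assumption for illustration, not how LegalChain selects rules.

```python
# Hypothetical illustration of the IRAC cascade: a misidentified issue at
# Step 1 yields later steps that are well-formed but answer the wrong
# legal question.

def irac_chain(issue):
    rules = {
        "search": "Katz reasonable-expectation-of-privacy test",
        "self-incrimination": "Miranda warnings requirement",
    }
    rule = rules[issue]
    analysis = f"Apply the {rule} to the facts"
    return issue, rule, analysis

correct = irac_chain("search")              # Fourth Amendment framing
wrong = irac_chain("self-incrimination")    # misidentified as Fifth Amendment

# The wrong chain's Steps 2-3 are internally coherent yet never touch Katz.
```

Scoring each step independently would credit the wrong chain's fluent rule statement and analysis; chained scoring exposes that they are irrelevant.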
Cascading Failures
In LegalChain, we model this dependency explicitly. If S3 (Authority Check) fails to validate a cited case, S6 (Synthesis) must either proceed without that authority or hallucinate support for it.
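The S3-to-S6 dependency can be sketched as follows. The function names, the validity set, and the fallback behavior are illustrative assumptions, not the real pipeline.

```python
# Sketch of a cascading failure: if the Authority Check step filters out
# every cited case, the Synthesis step loses its support.

def authority_check(cases, valid_db):
    """Keep only cases that can be validated against a known database."""
    return [c for c in cases if c in valid_db]

def synthesis(validated):
    """Produce a synthesis; degrade explicitly when no authority survives."""
    if not validated:
        return "Synthesis without supporting authority"
    return "Synthesis citing: " + ", ".join(validated)

valid_db = {"Katz v. United States"}
out_ok = synthesis(authority_check(["Katz v. United States"], valid_db))
out_fail = synthesis(authority_check(["Made-Up v. Case"], valid_db))
```

A model under evaluation faces the same fork as `synthesis` here, except that instead of degrading explicitly it may fabricate the missing citation, which is exactly the failure mode chaining is designed to surface.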
What Chaining Reveals
Robustness: Does the model recover from upstream errors? If S2 provides a slightly wrong rule, can S6 still produce a reasonable analysis?
Consistency: Does the model maintain consistent positions throughout the chain, or does its synthesis contradict a rule or holding it stated in an earlier step?
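A robustness probe of the kind described above can be sketched by perturbing an upstream output and measuring downstream drift. The synthesis stand-in and the token-overlap metric are assumptions for illustration, not LegalChain's actual scoring.

```python
# Hypothetical upstream-error probe: feed the synthesis step a slightly
# wrong rule (as if S2 had erred) and measure how far its output drifts
# from the clean run.

def synthesize(rule: str, facts: str) -> str:
    # Stand-in for a model's downstream synthesis step.
    return f"Applying the {rule} to {facts}."

def overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; 1.0 means identical."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

facts = "a warrantless wiretap of a public phone booth"
clean = synthesize("Katz reasonable-expectation-of-privacy test", facts)
perturbed = synthesize("Katz subjective-expectation test", facts)

robustness = overlap(clean, perturbed)  # closer to 1.0 = more robust
```

With a real model in place of `synthesize`, a high overlap under small rule perturbations indicates the model can recover from upstream errors rather than amplify them.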