Chained Evaluation

Measuring why a model with 90% per-step accuracy can fail 65% of ten-step professional workflows.

[Chart: Multiplicative Degradation — 90% per-step accuracy at S1 compounds to 73% by S3 and 35% by S10]

The Success Probability

P(Success) = P(S1) × P(S2) × ... × P(S10)

The core insight of chained evaluation is that errors compound multiplicatively. A model that achieves impressive scores on independent tasks collapses quickly when those tasks become dependent, because a single failure anywhere in the chain sinks the whole run.

The Multiplicative Reality

Legal research is not a series of independent tasks; it is a dependent workflow. LegalChain operationalizes this by evaluating models through a ten-step chain of dependent tasks.

Chain Length    Completion Rate (@ 90% per-step accuracy)
1 step          90.0%
3 steps         72.9%
5 steps         59.0%
10 steps        34.9%
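The completion rates above follow directly from the success-probability formula: with a uniform 90% per-step accuracy, an n-step dependent chain completes with probability 0.9^n. A minimal check:

```python
def chain_completion_rate(p: float, n: int) -> float:
    """Probability of completing an n-step dependent chain
    when each step independently succeeds with probability p."""
    return p ** n

# Reproduce the table: per-step accuracy of 90% across chain lengths.
for n in (1, 3, 5, 10):
    print(f"{n:>2} steps: {chain_completion_rate(0.9, n):.1%}")
# →  1 steps: 90.0%
#    3 steps: 72.9%
#    5 steps: 59.0%
#   10 steps: 34.9%
```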

Cascade Failures

In a parallel benchmark (like LegalBench or CaseHOLD), an error at Step 1 is just one point lost. In LegalChain, an error at Step 1 corrupts the entire chain.
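The scoring difference can be sketched with a simplified model (hypothetical functions for illustration, not LegalChain's or LegalBench's actual scorers): a parallel benchmark averages per-step correctness, while a chained benchmark credits nothing past the first error.

```python
def score_parallel(step_correct: list[bool]) -> float:
    # Parallel scoring: each step is graded independently,
    # so one wrong step costs exactly one point.
    return sum(step_correct) / len(step_correct)

def score_chained(step_correct: list[bool]) -> float:
    # Chained scoring: an error corrupts every downstream step,
    # so credit stops at the first failure.
    for i, ok in enumerate(step_correct):
        if not ok:
            return i / len(step_correct)
    return 1.0

# A run where only Step 1 fails: parallel scoring barely notices,
# chained scoring collapses to zero.
run = [False] + [True] * 9
print(score_parallel(run))  # → 0.9
print(score_chained(run))   # → 0.0
```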

01. Propagation

If the model identifies the wrong case in S1, all downstream facts, distinctions, and synthesis are grounded in false authority.

02. The "Fabrication Gate"

If the model hallucinates a citation at step 8, the integrity score is set to zero and flagged, regardless of prior reasoning. The run continues to completion for full diagnostics.
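The fabrication gate can be sketched as follows — a minimal scorer assuming a hypothetical `StepResult`/`score_run` shape, not LegalChain's actual implementation. Every step is still scored for diagnostics, but one fabricated citation zeroes integrity and flags the run:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    correct: bool
    fabricated_citation: bool = False  # hallucinated authority detected

def score_run(steps: list[StepResult]) -> dict:
    # The run executes to completion for full diagnostics, but a single
    # fabricated citation sets integrity to zero regardless of prior reasoning.
    flagged = any(s.fabricated_citation for s in steps)
    return {
        "steps_scored": len(steps),  # all steps scored, even after the gate trips
        "step_accuracy": sum(s.correct for s in steps) / len(steps),
        "integrity": 0.0 if flagged else 1.0,
        "flagged": flagged,
    }

# Nine correct steps with a hallucinated citation at step 8:
steps = ([StepResult(True)] * 7
         + [StepResult(True, fabricated_citation=True)]
         + [StepResult(True)] * 2)
result = score_run(steps)
print(result["integrity"], result["flagged"])  # → 0.0 True
```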

Why We Chain

Chained evaluation provides a more accurate proxy for professional utility. A legal brief is only as strong as its weakest link. By forcing models to maintain coherent reasoning across ten dependent steps, we identify the exact point where "apparent competence" diverges from "actual reliability."