Chained Evaluation
Measuring why a model with 90% per-step accuracy can still fail roughly 65% of ten-step professional workflows.
The Success Probability
The core insight of chained evaluation is multiplicative: if each step succeeds independently with probability p, a chain of n dependent steps completes with probability p^n. A model that achieves impressive scores on independent tasks therefore collapses quickly when those tasks are made dependent.
The Multiplicative Reality
Legal research is not a series of independent tasks; it is a dependent workflow. LegalChain operationalizes this by evaluating models through a ten-step chain of dependent tasks.
| Chain Length | Completion Rate (at 90% per-step accuracy) |
|---|---|
| 1 Step | 90.0% |
| 3 Steps | 72.9% |
| 5 Steps | 59.0% |
| 10 Steps | 34.9% |
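The table's figures are simply the per-step accuracy raised to the chain length. A minimal sketch (illustrative only, not part of the LegalChain harness):

```python
def chain_completion_rate(p: float, n: int) -> float:
    """Probability that a chain of n dependent steps completes when each
    step independently succeeds with probability p: p ** n."""
    return p ** n

# Reproduce the table at 90% per-step accuracy.
for n in (1, 3, 5, 10):
    print(f"{n:>2} steps: {chain_completion_rate(0.9, n):.1%}")
# →  1 steps: 90.0%
#    3 steps: 72.9%
#    5 steps: 59.0%
#   10 steps: 34.9%
```

Note how little room the exponent leaves: even 99% per-step accuracy yields only about a 90% ten-step completion rate.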
Cascade Failures
In a parallel benchmark (like LegalBench or CaseHOLD), an error at Step 1 is just one point lost. In LegalChain, an error at Step 1 corrupts the entire chain.
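The scoring difference can be sketched as follows (assumed scoring rules for illustration, not the benchmarks' actual harnesses):

```python
def parallel_score(steps_correct: list[bool]) -> float:
    """Parallel-style benchmark: each step is scored independently."""
    return sum(steps_correct) / len(steps_correct)

def chain_completes(steps_correct: list[bool]) -> bool:
    """Chained benchmark: one wrong step corrupts everything downstream."""
    return all(steps_correct)

# A single error at Step 1 of a 10-step run:
run = [False] + [True] * 9
print(parallel_score(run))   # 0.9  -- looks like a strong model
print(chain_completes(run))  # False -- the whole workflow fails
```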
01. Propagation
If the model identifies the wrong case at Step 1, all downstream facts, distinctions, and synthesis are grounded in false authority.
02. The "Fabrication Gate"
If the model hallucinates a citation at Step 8, the integrity score is set to zero and the run is flagged, regardless of prior reasoning. The run continues to completion for full diagnostics.
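A hedged sketch of how such a gate might be scored; the names here (`score_run`, `is_fabricated`) are illustrative assumptions, not LegalChain's actual API:

```python
from typing import Callable

def score_run(step_outputs: list[str],
              is_fabricated: Callable[[str], bool]) -> tuple[float, bool, list]:
    """Score a chained run with a fabrication gate: a hallucinated citation
    at any step zeroes the integrity score and flags the run, but the loop
    still visits every step so full diagnostics are recorded."""
    integrity, flagged, diagnostics = 1.0, False, []
    for i, out in enumerate(step_outputs, start=1):
        if is_fabricated(out):
            integrity = 0.0   # hard zero, regardless of prior reasoning
            flagged = True    # flag, but do not abort the run
        diagnostics.append((i, out))
    return integrity, flagged, diagnostics

# A fabricated citation at Step 8 of a ten-step run:
integrity, flagged, diag = score_run(
    [f"cite-{i}" for i in range(1, 8)] + ["FAKE", "cite-9", "cite-10"],
    lambda out: out == "FAKE",
)
print(integrity, flagged, len(diag))  # 0.0 True 10
```

The design choice worth noting is that the gate zeroes the score without short-circuiting the loop, so step-by-step diagnostics survive the failure.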
Why We Chain
Chained evaluation provides a more accurate proxy for professional utility. A legal brief is only as strong as its weakest link. By forcing models to maintain coherent reasoning across ten dependent steps, we identify the exact point where "apparent competence" diverges from "actual reliability."