The Hard Gate
Why we reject "partial credit" for dangerous errors and apply zero scores when integrity checks fail.
Most benchmarks treat errors as additive penalties: a wrong answer subtracts points from a total. LegalChain operates differently. We identify classes of errors—fabrication, incoherence, silence—that render a legal document professionally unusable. When these errors occur, the integrity score is set to zero and the violations are flagged in scoring details. The run continues to completion, ensuring complete diagnostic data.
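To make the contrast with additive penalties concrete, here is a minimal sketch of the two scoring styles. The names `IntegrityResult`, `gated_integrity`, and `additive_integrity`, as well as the 0.1 penalty, are illustrative assumptions and not LegalChain's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class IntegrityResult:
    # Hypothetical result record: a score plus the violations behind it.
    score: float
    violations: list[str] = field(default_factory=list)


def additive_integrity(raw_score: float, violations: list[str],
                       penalty: float = 0.1) -> float:
    """Conventional style: each violation subtracts a fixed penalty."""
    return max(0.0, raw_score - penalty * len(violations))


def gated_integrity(raw_score: float, violations: list[str]) -> IntegrityResult:
    """Hard-gate style: any violation zeroes the score but keeps the details."""
    if violations:
        return IntegrityResult(score=0.0, violations=list(violations))
    return IntegrityResult(score=raw_score)


# The caller keeps processing after a zero; nothing here aborts the run.
result = gated_integrity(0.9, ["fabricated_citation"])
```

Under the additive rule the same response would keep most of its score; under the gate it drops to zero while the violation list remains available for diagnostics.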
Defense in Depth
We employ a series of integrity gates that every response must pass before it is even considered for quality scoring.
⚠ The Fabrication Gate (d8)
Policy: Any citation to a non-existent case results in a zero integrity score and is flagged in scoring details.
Rationale: Under ABA Model Rule 3.3, a lawyer must not knowingly make a false statement of fact or law to a tribunal. A brief with one fake citation is not "mostly good"; it is a malpractice liability. The run continues to completion for full diagnostic capture.
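A rough sketch of how such a check might look, assuming citations have already been extracted and normalized into identifiers and that a set of verified cases is available. The function name, the identifiers, and the return shape are hypothetical.

```python
def fabrication_gate(cited_ids: list[str], verified_ids: set[str]) -> dict:
    """Flag every cited identifier that is absent from the verified set.

    One unverifiable citation is enough to fail the gate; the full list
    of offenders is returned so it can be recorded in scoring details.
    """
    flagged = [c for c in cited_ids if c not in verified_ids]
    return {"passed": not flagged, "flagged": flagged}


# Placeholder identifiers, not real citations.
verified = {"case-001", "case-002", "case-003"}
outcome = fabrication_gate(["case-001", "case-999"], verified)
```

Here `outcome["passed"]` is false because `case-999` is not in the verified set, even though the other citation is valid.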
⚠ The Structure Gate
Policy: Failure to produce an IRAC structure (Issue, Rule, Application, Conclusion) results in a zero structure score.
Rationale: Legal reasoning requires structure. A stream-of-consciousness essay is not a legal memo, even if it touches on the right topics. It is functionally useless to a practitioner.
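One simple way to approximate this check is to look for the four IRAC section labels at the start of a line and confirm that they appear in order. This is only a sketch under that assumption; the actual detection method is not described here, and `structure_gate` is a hypothetical name.

```python
import re

IRAC_SECTIONS = ("Issue", "Rule", "Application", "Conclusion")


def structure_gate(text: str) -> dict:
    """Check that all four IRAC labels start a line and appear in order."""
    positions = []
    missing = []
    for name in IRAC_SECTIONS:
        # Assumption: each section opens with its label at the start of a line.
        match = re.search(rf"^[ \t]*{name}\b", text,
                          flags=re.IGNORECASE | re.MULTILINE)
        if match is None:
            missing.append(name)
        else:
            positions.append(match.start())
    in_order = positions == sorted(positions)
    return {"passed": not missing and in_order,
            "missing": missing,
            "in_order": in_order}


memo = "Issue: ...\nRule: ...\nApplication: ...\nConclusion: ..."
essay = "The court might weigh several factors before reaching a view."
```

With these inputs, `structure_gate(memo)` passes and `structure_gate(essay)` reports all four sections as missing.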
⚠ The Coverage Gate
Policy: Failure to cite any authority from the provided ResearchPack results in a zero consistency score.
Rationale: If the model ignores the provided evidence (the RAG context), it is effectively not performing the task. It is hallucinating generic law rather than analyzing the specific case.
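As a sketch, the coverage check reduces to a set intersection between what the response cites and what the ResearchPack supplied. The identifiers and the `coverage_gate` name are assumptions made for illustration.

```python
def coverage_gate(cited_ids: list[str], research_pack_ids: set[str]) -> dict:
    """Pass only if at least one cited authority comes from the ResearchPack."""
    used = sorted(set(cited_ids) & research_pack_ids)
    return {"passed": bool(used), "used": used}


# Placeholder identifiers standing in for ResearchPack authorities.
pack = {"auth-01", "auth-02", "auth-03"}
grounded = coverage_gate(["auth-02", "other-77"], pack)
ungrounded = coverage_gate(["other-77", "other-78"], pack)
```

The second response cites authorities, but none from the pack, so it fails the gate.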
Why Not Partial Credit?
"The 80% of the brief that is correct is worthless without trust in the 20% that is wrong."
In an educational setting, getting 4/5 answers right is a B. In a legal setting, getting 4/5 citations right and fabricating the fifth is a sanctionable offense.
We measure Completion Rate (percentage of queries that survive all gates) as a primary metric alongside Quality Score. A model with high quality but low completion is dangerous. A model with high completion but low quality is safe but weak. You need both.
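Completion Rate as defined above can be computed directly from per-query gate outcomes. The record shape below (a `gates` mapping from gate name to pass/fail) is a hypothetical layout chosen for the example.

```python
def completion_rate(results: list[dict]) -> float:
    """Percentage of queries whose response passed every gate."""
    if not results:
        return 0.0
    survived = sum(1 for r in results if all(r["gates"].values()))
    return 100.0 * survived / len(results)


# Four hypothetical queries; one fails the fabrication gate.
results = [
    {"gates": {"fabrication": True, "structure": True, "coverage": True}},
    {"gates": {"fabrication": True, "structure": True, "coverage": True}},
    {"gates": {"fabrication": False, "structure": True, "coverage": True}},
    {"gates": {"fabrication": True, "structure": True, "coverage": True}},
]
rate = completion_rate(results)
```

Reporting this figure next to the Quality Score keeps a model from hiding a low survival rate behind strong scores on the responses that did pass.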