The Automated Expert
How we use calibrated LLM judges to evaluate nuance, logic, and reasoning quality at scale.
Some questions have binary answers: "Did it contain a citation?" But legal reasoning is qualitative: "Is this argument persuasive?" To evaluate these dimensions without the prohibitive cost of human experts, LegalChain uses calibrated LLM judges running on strict, versioned rubrics.
The Judicial Process
The judge does not "vibe check." It follows a structured deliberation process defined by our prompt engineering.
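To make "structured deliberation" concrete, here is a minimal sketch of what such a judge prompt might look like. The template, dimension names, and JSON schema are illustrative assumptions, not the actual LegalChain prompt.

```python
# Hypothetical judge prompt skeleton (NOT the real LegalChain prompt).
# It forces the judge to reason per dimension before scoring, rather
# than emitting a single holistic "vibe" number.
JUDGE_PROMPT_V1 = """\
You are evaluating a legal answer against the IRAC rubric.
For each dimension, FIRST write two sentences of reasoning,
THEN give a score from 0.0 to 1.0.

1. ISSUE: Did the answer identify the legal question presented?
2. RULE: Is the stated legal principle accurate to the cited authority?
3. APPLICATION: Is the rule applied thoughtfully to the specific facts?
4. CONCLUSION: Does the conclusion follow from the premises?

Return only JSON:
{"issue": <score>, "rule": <score>, "application": <score>, "conclusion": <score>}
"""

# The versioned name (V1) matters: any edit to this string would ship
# under a new version, per the calibration rules below.
```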
The IRAC Rubric
We evaluate legal reasoning using the standard law school format: Issue, Rule, Application, Conclusion.
Application
Weight: 40%. The core of legal reasoning. Does the model thoughtfully apply the rule to the specific facts, or is it mechanical and superficial?
Rule Statement
Weight: 25%. Accuracy of the legal principle. Does the model correctly synthesize the holding from the cited case?
Conclusion
Weight: 20%. Logical soundness. Does the conclusion follow necessarily from the premises established in the application?
Issue
Weight: 15%. Problem identification. Did the model correctly spot the legal question presented by the prompt?
Calibration & Rigor
We do not blindly trust the LLM judge. To ensure reliability:
- Temperature Zero: All judge calls use temperature=0 to minimize randomness.
- Hashed Prompts: Evaluation prompts are immutable; each is identified by a content hash, and any change to the judging criteria bumps the version number.
- Human Baseline: We calibrate the judge against a set of human-annotated samples. We require a Pearson correlation > 0.85 to certify a judge configuration.
What Judges DO NOT Do
We restrict the judge to qualitative assessment only.
- ❌ Did it cite real cases? (Handled by Deterministic Gate)
- ❌ Did it follow IRAC format? (Handled by Structure Check)
- ✔ Is the argument logical? (THIS is the judge's job)
This separation of powers ensures that hard constraints (citations, format) are enforced by code, while the LLM focuses entirely on softer reasoning skills.
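This separation of powers amounts to a simple pipeline: deterministic gates run first in code, and the judge is only invoked on responses that pass them. A minimal sketch, where the gate and judge callables are hypothetical stand-ins rather than real LegalChain APIs:

```python
from typing import Callable, Optional

def evaluate(answer: str,
             gates: list[Callable[[str], bool]],
             judge: Callable[[str], float]) -> Optional[float]:
    """Hard constraints are enforced by code; the judge scores only survivors."""
    for gate in gates:          # e.g. citation check, IRAC structure check
        if not gate(answer):
            return None         # hard failure: the judge never sees it
    return judge(answer)        # qualitative score: the judge's only job
```

Keeping the gates outside the judge means a hallucinated citation can never be "argued past" by persuasive prose, and the judge's calibration data stays focused on reasoning quality alone.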