The Automated Expert
How we use calibrated LLM judges to evaluate nuance, logic, and reasoning quality at scale.
Some questions have binary answers: "Did it contain a citation?" But legal reasoning is qualitative: "Is this argument persuasive?" To evaluate these dimensions without the prohibitive cost of human experts, LegalChain uses calibrated LLM judges running on strict, versioned rubrics.
The Judicial Process
The judge does not "vibe check." It follows a structured deliberation process defined by our prompt engineering.
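To make "structured deliberation" concrete, here is a minimal sketch of what such a judge prompt might look like. The template, dimension names, and JSON schema are illustrative assumptions, not the actual LegalChain prompt.

```python
# Hypothetical judge prompt skeleton (NOT the real LegalChain prompt).
# It forces the judge to reason per dimension before scoring, rather
# than emitting a single holistic "vibe" number.
JUDGE_PROMPT_V1 = """\
You are evaluating a legal answer against the IRAC rubric.
For each dimension, FIRST write two sentences of reasoning,
THEN give a score from 0.0 to 1.0.

1. ISSUE: Did the answer identify the legal question presented?
2. RULE: Is the stated legal principle accurate to the cited authority?
3. APPLICATION: Is the rule applied thoughtfully to the specific facts?
4. CONCLUSION: Does the conclusion follow from the premises?

Return only JSON:
{"issue": <score>, "rule": <score>, "application": <score>, "conclusion": <score>}
"""

# The versioned name (V1) matters: any edit to this string would ship
# under a new version, per the calibration rules below.
```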
The IRAC Rubric
We evaluate legal reasoning using the standard law school format: Issue, Rule, Application, Conclusion.
Application
Weight: 40%. The core of legal reasoning. Does the model thoughtfully apply the rule to the specific facts, or is it mechanical and superficial?
Rule Statement
Weight: 25%. Accuracy of the legal principle. Does the model correctly synthesize the holding from the cited case?
Conclusion
Weight: 20%. Logical soundness. Does the conclusion follow necessarily from the premises established in the application?
Issue
Weight: 15%. Problem identification. Did the model correctly spot the legal question presented by the prompt?
Calibration & Rigor
We do not blindly trust the LLM judge. To ensure reliability:
- Temperature Zero: All judge calls use temperature=0 to minimize randomness.
- Hashed Prompts: Evaluation prompts are immutable; each is identified by a content hash, and any change to the judging criteria bumps the version number.
- Human Baseline: We calibrate the judge against a set of human-annotated samples. We require a Pearson correlation > 0.85 to certify a judge configuration.
What Judges DO NOT Do
We restrict the judge to qualitative assessment only.
- ❌ Did it cite real cases? (Handled by Deterministic Gate)
- ❌ Did it follow IRAC format? (Handled by Structure Check)
- ✔ Is the argument logical? (THIS is the judge's job)
This separation of powers ensures that hard constraints (citations, format) are enforced by code, while the LLM focuses entirely on softer reasoning skills.
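This separation of powers amounts to a simple pipeline: deterministic gates run first in code, and the judge is only invoked on responses that pass them. A minimal sketch, where the gate and judge callables are hypothetical stand-ins rather than real LegalChain APIs:

```python
from typing import Callable, Optional

def evaluate(answer: str,
             gates: list[Callable[[str], bool]],
             judge: Callable[[str], float]) -> Optional[float]:
    """Hard constraints are enforced by code; the judge scores only survivors."""
    for gate in gates:          # e.g. citation check, IRAC structure check
        if not gate(answer):
            return None         # hard failure: the judge never sees it
    return judge(answer)        # qualitative score: the judge's only job
```

Keeping the gates outside the judge means a hallucinated citation can never be "argued past" by persuasive prose, and the judge's calibration data stays focused on reasoning quality alone.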