No Hidden Variables
Why we freeze scoring constants and compute 75% of our metrics with code, not judges.
If the score changes, the model must have changed. That is the golden rule of benchmarking. But with non-deterministic aggregators and "vibe-based" judging, scores often drift due to randomness in the evaluation process itself. LegalChain maximizes metric stability by relying on deterministic, audit-trail-verified scoring logic for all structural and citation components.
The Formula is Fixed
When ranking precedents for a ResearchPack or verifying a citation, we use frozen code paths. No LLM "decides" whether a case is relevant; a formula does.
Frozen Constants
Every weight, every threshold, and every penalty is explicitly versioned. We do not tweak these "under the hood."
Signal Weights
| Signal | Weight |
| --- | --- |
| "followed" | 1.0 |
| "cites" | 0.7 |
| "explained" | 0.5 |
| "overruled" | 0.0 |
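As a minimal sketch of what "frozen" means in practice, the weights above could live in a versioned constant that deterministic scoring reads from. The function name and the fail-loud behavior on unknown signals are illustrative assumptions, not the actual LegalChain implementation:

```python
# Frozen, versioned signal weights (values mirror the table above).
# Hypothetical sketch; names are illustrative, not the real code path.
SIGNAL_WEIGHTS_V1 = {
    "followed": 1.0,
    "cites": 0.7,
    "explained": 0.5,
    "overruled": 0.0,
}

def precedent_score(signals: list[str]) -> float:
    """Sum the frozen weight of each treatment signal.

    Unknown signals raise instead of guessing, so introducing a new
    signal type forces an explicit constants-version bump.
    """
    total = 0.0
    for signal in signals:
        if signal not in SIGNAL_WEIGHTS_V1:
            raise KeyError(f"unversioned signal: {signal!r}")
        total += SIGNAL_WEIGHTS_V1[signal]
    return total
```

Because the weights are data rather than model output, the same inputs always produce the same score, and any change is visible in version control.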
Budget Limits
| Limit | Value |
| --- | --- |
| Max Anchor Chars | 80,000 |
| Max Pack Chars | 120,000 |
| Top-K Authorities | 12 |
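The budget limits can be enforced the same way: deterministic truncation against frozen constants, so pack assembly never depends on a model's judgment of what to cut. This is a hedged sketch under assumed names (`assemble_pack` and its inputs are illustrative), not the actual pipeline:

```python
# Frozen budget constants (values mirror the table above).
MAX_ANCHOR_CHARS = 80_000
MAX_PACK_CHARS = 120_000
TOP_K_AUTHORITIES = 12

def assemble_pack(anchor: str, authorities: list[str]) -> str:
    """Deterministically cap a pack at the frozen budgets.

    Hypothetical sketch: truncate the anchor, keep at most the top-K
    authority texts, and stop appending once the pack budget is spent.
    """
    pack = anchor[:MAX_ANCHOR_CHARS]
    for text in authorities[:TOP_K_AUTHORITIES]:
        remaining = MAX_PACK_CHARS - len(pack)
        if remaining <= 0:
            break
        pack += text[:remaining]
    return pack
```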
What IS Judged?
We reserve LLM judging only for qualities that cannot be measured via code:
- Quality of Reasoning: Is the argument logical and persuasive?
- Doctrinal Nuance: Does the analysis capture the subtlety of a holding?
- Completeness: Did the model address all relevant factors?
For everything else—Did it cite real cases? Did it follow IRAC structure? Did it find the right precedent?—we use code.
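Those code-based checks are simple string and lookup operations. A minimal sketch of two of them, with an in-memory set standing in for a real citation index and illustrative IRAC headings (all names here are assumptions, not LegalChain's actual checkers):

```python
import re

# Stand-in for a real citation index; entries are illustrative.
KNOWN_CITATIONS = {"410 U.S. 113", "347 U.S. 483"}
CITATION_RE = re.compile(r"\d+ U\.S\. \d+")
IRAC_HEADINGS = ("Issue", "Rule", "Application", "Conclusion")

def citations_real(answer: str) -> bool:
    """True if the answer cites at least one case and every cited
    reporter string resolves in the index (no hallucinated cases)."""
    found = CITATION_RE.findall(answer)
    return bool(found) and all(c in KNOWN_CITATIONS for c in found)

def follows_irac(answer: str) -> bool:
    """True if all four IRAC headings appear, in order."""
    pos = -1
    for heading in IRAC_HEADINGS:
        pos = answer.find(heading, pos + 1)
        if pos == -1:
            return False
    return True
```

Both checks are binary and reproducible: the same answer always passes or fails, with no judge variance.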