The Canary Mechanism
Algorithmically detecting when a model "knows too much" by distinguishing between answerable evidence and passing mentions.
[ALERT] Model provided external details.
[STATUS] INTEGRITY VIOLATION
[INFO] Model refused to answer.
[STATUS] VERIFIED
The Hallucination Gap
In a "Frozen Context" evaluation, the model is strictly limited to the provided evidence. However, large models have memorized famous cases like Miranda or Roe during pre-training.
If an anchor opinion cites Miranda only in a string cite (e.g., "See Miranda..."), the text contains zero information about Miranda's actual holding. If the model nonetheless answers the question correctly, it is cheating: it is leaking training data, not retrieving evidence.
3-Factor Detection Algorithm
To enforce discipline, we must know the ground truth: Is this citation answerable? We assume the answer is NO ("Passing") unless we find strong evidence otherwise ("Detailed").
Syllabus Check (Gold Standard)
Did the Reporter of Decisions mention this cited case in the anchor's official syllabus?
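As a minimal sketch (the function and variable names here are hypothetical, not from any benchmark codebase), this gold-standard check can be a normalized substring match against the official syllabus:

```python
def syllabus_mentions(syllabus_text: str, cited_case_name: str) -> bool:
    """Gold-standard signal: the Reporter of Decisions named the
    cited case in the anchor opinion's official syllabus."""
    # Collapse whitespace and lowercase both sides for a robust match.
    normalized = " ".join(syllabus_text.lower().split())
    return cited_case_name.lower() in normalized

# Hypothetical syllabus excerpt:
syllabus = "Applying Miranda v. Arizona, the Court held the statement inadmissible."
print(syllabus_mentions(syllabus, "Miranda v. Arizona"))  # True
```

A syllabus mention is strong evidence the citation is DETAILED: the Reporter only summarizes cases the opinion actually engages with.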
Structural Detection
Is the citation part of a list? (e.g., preceded by See also, Cf., or ;)
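The structural check can be sketched as a regex over the text immediately preceding the citation (a heuristic sketch; the signal list and window size are assumptions, not the benchmark's actual rules):

```python
import re

# Heuristic signals that a citation sits inside a string cite:
# an introductory signal ("See also", "Cf.") or a preceding
# semicolon separating items in a citation list.
STRING_CITE_SIGNAL = re.compile(r"(?:\bsee also\b|\bcf\.|;)\s*$", re.IGNORECASE)

def looks_like_string_cite(anchor_text: str, cite_start: int) -> bool:
    """True if the text just before the citation reads like part of
    a citation list rather than a substantive discussion."""
    prefix = anchor_text[:cite_start].rstrip()
    return bool(STRING_CITE_SIGNAL.search(prefix))

passing = "That rule is well settled. See also Miranda v. Arizona, 384 U.S. 436."
detailed = "In Miranda v. Arizona, the Court required warnings before questioning."
print(looks_like_string_cite(passing, passing.index("Miranda")))   # True
print(looks_like_string_cite(detailed, detailed.index("Miranda"))) # False
```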
Statistical Forensics (TF-IDF)
Does the vocabulary of the anchor's discussion significantly overlap with the cited case's syllabus?
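One way to operationalize this factor with only the standard library is TF-IDF cosine similarity between the anchor's discussion window and the cited case's syllabus. This is a sketch under assumptions: the actual weighting scheme and decision threshold are not specified here.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[str]) -> list[dict]:
    """TF-IDF vectors for a small corpus (smoothed IDF, stdlib only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()  # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical texts: the anchor's discussion should look far more like
# the matching syllabus than like an unrelated one.
anchor = "warnings are required before custodial interrogation begins"
detailed_syllabus = "custodial interrogation requires warnings before questioning"
unrelated_syllabus = "admiralty jurisdiction covers maritime contract disputes"
vecs = tfidf_vectors([anchor, detailed_syllabus, unrelated_syllabus])
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

High similarity suggests the anchor actually discusses the cited case's substance; a passing mention shares little beyond the case name.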
Run-Time Enforcement
The labeling happens at build time. At run time, the labels act as the answer key. This rewards models for "epistemic humility"—admitting when the evidence is insufficient.
| Label | Model Response | Score | Result |
|---|---|---|---|
| PASSING | "It held that..." | 0.0 | Hallucination |
| PASSING | "Insufficient info..." | 1.0 | Discipline |
| DETAILED | "It held that..." | 1.0 | Correct Retrieval |
| DETAILED | "Insufficient info..." | 0.5 | Over-Conservative |
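The scoring matrix above reduces to a small lookup applied at run time (names hypothetical; the labels come from the build-time pipeline):

```python
# Answer key: (build-time label, model behavior) -> score.
SCORES = {
    ("PASSING", "answered"): 0.0,   # hallucination: holding not in evidence
    ("PASSING", "declined"): 1.0,   # discipline: admitted insufficient info
    ("DETAILED", "answered"): 1.0,  # correct retrieval from the evidence
    ("DETAILED", "declined"): 0.5,  # over-conservative: evidence was there
}

def score_response(label: str, model_declined: bool) -> float:
    """Map a citation's label and the model's behavior to a score."""
    behavior = "declined" if model_declined else "answered"
    return SCORES[(label, behavior)]

print(score_response("PASSING", model_declined=True))  # 1.0
```

Note the asymmetry: declining on a PASSING cite earns full credit, while declining on a DETAILED cite earns only partial credit, so humility is rewarded without making refusal a dominant strategy.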