The Canary Mechanism

Algorithmically detecting when a model "knows too much" by distinguishing between answerable evidence and passing mentions.

canary_protocol.py — Active Monitor

USER:

"What did the court hold in Miranda v. Arizona?"

Failed

Scenario A: Hallucination

> Miranda established that suspects must be informed of rights...

[ALERT] Citation classified as "PASSING".
[ALERT] Model provided external details.
[STATUS] INTEGRITY VIOLATION

Passed

Scenario B: Discipline

> The text only cites Miranda in passing. No holding provided.

[INFO] Citation classified as "PASSING".
[INFO] Model refused to answer.
[STATUS] VERIFIED

The Hallucination Gap

In a "Frozen Context" evaluation, the model is strictly limited to the provided evidence. However, large models have memorized famous cases like Miranda or Roe during pre-training.

If an anchor opinion cites Miranda only in a string cite (e.g., "See Miranda..."), the text contains zero information about Miranda's actual holding. If the model answers the question correctly, it is cheating. It is leaking training data.

3-Factor Detection Algorithm

To enforce discipline, we must know the ground truth: Is this citation answerable? We assume the answer is NO ("Passing") unless we find strong evidence otherwise ("Detailed").

Factor

Syllabus Check (Gold Standard)

Did the Reporter of Decisions mention this cited case in the anchor's official syllabus?

100% Confidence

Factor

Structural Detection

Is the citation part of a list? (e.g., preceded by See also, Cf., or ;)

~90% Confidence

Factor

Statistical Forensics (TF-IDF)

Does the anchor's discussion vocabulary strictly overlap with the cited case's syllabus?

Fallback

Run-Time Enforcement

The labeling happens at build time. At run time, the labels act as the answer key. This rewards models for "epistemic humility"—admitting when the evidence is insufficient.

Label	Model Response	Score	Result
PASSING	"It held that..."	0.0	Hallucination
PASSING	"Insufficient info..."	1.0	Discipline
DETAILED	"It held that..."	1.0	Correct Retrieval
DETAILED	"Insufficient info..."	0.5	Over-Conservative