Closed-Book vs Open-Book
Step variants that measure parametric knowledge versus retrieval-augmented capability using the same ground truth.
Some steps run in both a closed-book (CB) variant and an open-book, retrieval-augmented (RAG) variant. CB tests what the model knows from pretraining alone; RAG tests whether the model can effectively use provided evidence. Both variants are scored against the same ground truth; only the input conditions differ.
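As a rough illustration of "same ground truth, different input conditions", the sketch below builds both prompts from one item. The `Item` type, the `build_variants` helper, and the prompt wording are all hypothetical, chosen only for this example; the actual step definitions may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One evaluation item (hypothetical structure, for illustration only)."""
    question: str
    evidence: list[str]   # passages supplied only in the RAG variant
    ground_truth: str     # shared by both variants


def build_variants(item: Item) -> dict[str, str]:
    """Return one prompt per variant; only the input conditions differ."""
    # Closed-book: the question alone, so the model must rely on pretraining.
    cb_prompt = f"Question: {item.question}\nAnswer:"

    # RAG: the same question, preceded by the provided evidence.
    context = "\n".join(f"- {passage}" for passage in item.evidence)
    rag_prompt = f"Evidence:\n{context}\n\nQuestion: {item.question}\nAnswer:"

    return {"CB": cb_prompt, "RAG": rag_prompt}
```

Both outputs would then be graded against the single `item.ground_truth`, which is what makes the two scores directly comparable.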
The gap between CB and RAG scores is diagnostic. A high RAG score with a low CB score may indicate "copying" from the provided evidence rather than reasoning. Similar scores in both may suggest genuine understanding, or data contamination (which the Canary mechanism detects).
Measuring both modes helps reveal whether a model is reasoning from evidence or merely pattern-matching against its pretraining data. The delta is often more informative than either score alone.
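A minimal sketch of how the CB/RAG delta might be computed and bucketed is shown here. The `StepScores` type, the `read_gap` function, the label strings, and the 0.15 margin are assumptions made for this example, not part of any defined scoring scheme; scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepScores:
    """Scores for one step, both graded against the same ground truth."""
    cb: float   # closed-book score, assumed to be in [0, 1]
    rag: float  # retrieval-augmented score, assumed to be in [0, 1]

    @property
    def delta(self) -> float:
        # Positive delta: the provided evidence raised the score.
        return self.rag - self.cb


def read_gap(scores: StepScores, margin: float = 0.15) -> str:
    """Bucket a CB/RAG pair into a coarse label (margin is illustrative)."""
    if scores.delta >= margin:
        # High RAG, low CB: may indicate copying rather than reasoning.
        return "rag_much_higher"
    if abs(scores.delta) < margin:
        # Similar scores: genuine understanding, or possible contamination.
        return "similar"
    # Not discussed in the text above; included only so every input gets a label.
    return "cb_higher"


print(read_gap(StepScores(cb=0.30, rag=0.80)))  # rag_much_higher
print(read_gap(StepScores(cb=0.78, rag=0.80)))  # similar
```

The label is only a prompt for further inspection: a "similar" result, for instance, cannot by itself distinguish understanding from contamination.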