The Sealed Room
Guaranteeing long-term benchmark stability. Evaluation Units are self-contained, air-gapped test environments that bundle a task with its specific research evidence.
Population
27,412
Atomic Units
Architecture
JSONL
Self-Contained
Security
SEALED
SHA-256 Verified
The "Briefcase" Model
Should the runner query a live database, or should every authority be packed inside the test case? We chose the latter: the Option C principle.
[ SEPARATION OF CONVENTION ]
Reproducibility
inputs are preserved. Even if the database changes, the EU remains identical.
Portability
No "database setup" required. Air-gapped execution possible on any node.
Anatomy of an EU
// eu.json (The Truth)
{
"unit_id": "scotus_1402",
"question": "Does 42 USC...",
"rubric": ["cite_authority", "logic"],
"ground_truth": ["120 U.S. 45"]
}
// research_pack.json (The Evidence)
{
"task_context": "...",
"authorities": [
{ "cite": "120 U.S. 45", "text": "..." }
]
}