The Sealed Room

Guaranteeing long-term benchmark stability. Evaluation Units (EUs) are self-contained, air-gapped test environments, each bundling a task with its specific research evidence.

Population: 27,412 Atomic Units
Architecture: JSONL, Self-Contained
Security: SEALED, SHA-256 Verified

The "Briefcase" Model

Should the runner query a live database, or should every authority be packed inside the test case? We chose the latter: the Option C principle.

[ SEPARATION OF CONVENTION ]
Reproducibility

Inputs are preserved. Even if the database changes, the EU remains identical.

Portability

No "database setup" required. Air-gapped execution possible on any node.
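
One way to picture the reproducibility claim is a content digest over the unit and its evidence. The sketch below is an illustration under assumptions, not the project's actual sealing procedure: the function names `seal` and `verify_seal` are hypothetical, and the choice to hash a canonical JSON encoding of both parts together is assumed, since the page only states that units are SHA-256 verified.

```python
import hashlib
import json


def seal(eu: dict, research_pack: dict) -> str:
    """Return a SHA-256 hex digest over a canonical encoding of both parts.

    Assumption: the seal covers the EU and its research pack together.
    """
    # sort_keys plus fixed separators give the same bytes for the same content,
    # regardless of the key order in which the dicts were built.
    canonical = json.dumps(
        {"eu": eu, "research_pack": research_pack},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify_seal(eu: dict, research_pack: dict, expected_digest: str) -> bool:
    """Check that the unit still matches the digest recorded when it was sealed."""
    return seal(eu, research_pack) == expected_digest
```

Because the digest depends only on the packed content, a change in an upstream database cannot alter it, while any edit to the packed text makes `verify_seal` return False.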

Anatomy of an EU
// eu.json (The Truth)
{
  "unit_id": "scotus_1402",
  "question": "Does 42 USC...",
  "rubric": ["cite_authority", "logic"],
  "ground_truth": ["120 U.S. 45"]
}
// research_pack.json (The Evidence)
{
  "task_context": "...",
  "authorities": [
    { "cite": "120 U.S. 45", "text": "..." }
  ]
}
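
Given the two structures above, a runner could confirm that a unit is truly self-contained before executing it. The following is a minimal sketch under assumptions: the helper name `missing_authorities` is hypothetical, and it assumes that every citation in `ground_truth` should appear as a `cite` entry in the pack's `authorities` list, which the example suggests but the page does not state as a rule.

```python
def missing_authorities(eu: dict, research_pack: dict) -> list[str]:
    """Return ground-truth citations that are not packed in the research pack."""
    packed = {entry["cite"] for entry in research_pack.get("authorities", [])}
    return [cite for cite in eu.get("ground_truth", []) if cite not in packed]


# Sample data mirrors the snippet above; elided values are kept as "...".
eu = {
    "unit_id": "scotus_1402",
    "question": "Does 42 USC...",
    "rubric": ["cite_authority", "logic"],
    "ground_truth": ["120 U.S. 45"],
}
research_pack = {
    "task_context": "...",
    "authorities": [{"cite": "120 U.S. 45", "text": "..."}],
}

# An empty list means no lookup outside the unit is needed.
gaps = missing_authorities(eu, research_pack)
```

A non-empty result would indicate that the unit depends on an authority that was never packed, which is the situation the "Briefcase" model is meant to rule out.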