The Sealed Room

Guaranteeing long-term benchmark stability. Evaluation Units (EUs) are self-contained, air-gapped test environments, each bundling a task with its specific research evidence.

Population: 27,412 atomic units

Architecture: JSONL, self-contained

Security: sealed, SHA-256 verified
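
How the seal is computed is not specified here. As a minimal sketch, assuming the digest is taken over a canonical JSON encoding of the unit (sorted keys, compact separators, UTF-8), sealing and verification might look like the following. The names `seal` and `verify` are hypothetical, and the tampered citation exists only to show a failed check.

```python
import hashlib
import hmac
import json


def seal(unit: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of `unit`.

    Assumption: canonical form = sorted keys, compact separators, UTF-8.
    """
    canonical = json.dumps(
        unit, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(unit: dict, expected_digest: str) -> bool:
    """Check a unit against a previously recorded digest."""
    return hmac.compare_digest(seal(unit), expected_digest)


unit = {"unit_id": "scotus_1402", "ground_truth": ["120 U.S. 45"]}
digest = seal(unit)

# Any change to the content, however small, yields a different digest.
tampered = dict(unit, ground_truth=["120 U.S. 46"])

print(verify(unit, digest))      # True
print(verify(tampered, digest))  # False
```

Sorting keys before hashing means two units with the same content but different key order produce the same digest, which is one reasonable way to keep the seal stable across serializers.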

The "Briefcase" Model

Should the runner query a live database, or should every authority be packed inside the test case? We chose the latter: the Option C principle.

[ SEPARATION OF CONCERNS ]

Reproducibility

Inputs are preserved. Even if the upstream database changes, the EU remains identical.

Portability

No "database setup" required. Air-gapped execution possible on any node.
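
To illustrate why no database is needed, here is a minimal sketch that reads units from a JSONL source using only the Python standard library. It assumes one unit per line. The `iter_units` helper and the inline sample are hypothetical, not the project's actual loader or file layout.

```python
import io
import json
from typing import Iterator, TextIO


def iter_units(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed unit per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# In practice the stream would be a local file, e.g. open(path, encoding="utf-8").
sample = io.StringIO(
    '{"unit_id": "scotus_1402", "ground_truth": ["120 U.S. 45"]}\n'
    "\n"
)
units = list(iter_units(sample))
print(units[0]["unit_id"])  # scotus_1402
```

Because the reader touches nothing but a local stream, the same code runs unchanged on a node with no network access.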

Anatomy of an EU

// eu.json (The Truth)

{
  "unit_id": "scotus_1402",
  "question": "Does 42 USC...",
  "rubric": ["cite_authority", "logic"],
  "ground_truth": ["120 U.S. 45"]
}

// research_pack.json (The Evidence)

{
  "task_context": "...",
  "authorities": [
    { "cite": "120 U.S. 45", "text": "..." }
  ]
}
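
One consequence of the two-file layout is that self-containment can be checked mechanically: every citation in `ground_truth` should also appear among the packed `authorities`. The sketch below assumes exact string matching on `cite`. The `missing_authorities` helper is hypothetical, and elided values from the example above are kept as-is.

```python
def missing_authorities(eu: dict, research_pack: dict) -> list:
    """Return ground-truth citations that are absent from the research pack."""
    packed = {a["cite"] for a in research_pack.get("authorities", [])}
    return [c for c in eu.get("ground_truth", []) if c not in packed]


eu = {
    "unit_id": "scotus_1402",
    "question": "Does 42 USC...",
    "rubric": ["cite_authority", "logic"],
    "ground_truth": ["120 U.S. 45"],
}
research_pack = {
    "task_context": "...",
    "authorities": [{"cite": "120 U.S. 45", "text": "..."}],
}

print(missing_authorities(eu, research_pack))  # []
```

An empty result means the unit can be graded from its own contents. A non-empty result lists the citations a runner would otherwise have to fetch from outside.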