Design

L10 Atomic evaluates each skill independently (no error propagation), producing per-skill scores and a final summary.

Scope

Skills Included

L10 Atomic evaluates all 7 core legal reasoning skills independently: S1 (Known Authority), S2 (Unknown Authority), S3 (Validate Authority), S4 (Fact Extraction), S5 (Distinguish), S6 (IRAC Synthesis), S7 (Citation Integrity).

Modalities

  • S5:cb - Closed-book (metadata + extracted facts only)
  • S5:rag - RAG-enhanced (includes citing opinion text)
  • All other skills run in a single modality

Gates

S7 (Citation Integrity) functions as a hard gate in L10 Agentic, voiding S6 output if fabrications are detected. In L10 Atomic, S7 is evaluated independently and does not void other skill scores.

Scoring Methods

Skill            Method                   Score              Correct when
S1               Exact match              1.0 or 0.0         All fields match
S2               Ranked retrieval         MRR (0.0-1.0)      hit@10
S3               Exact + partial credit   1.0, 0.5, or 0.0   is_overruled + year match
S4               Weighted fields          0.5 per field      Both fields match
S5:cb / S5:rag   Binary                   1.0 or 0.0         agrees == edge.agree
S6               Rubric-based             0.0-1.0            weighted score >= 0.5
S7               Binary                   1.0 or 0.0         all_valid == True

Detailed Scoring Rules

S1: Known Authority Retrieval

All three fields must match exactly: us_cite, case_name, term.

Normalization: Citations are canonicalized before comparison (spaces → underscores, periods removed, lowercased)
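
A minimal sketch of this normalization and the resulting all-or-nothing score; the helper names (normalize_cite, score_s1) are illustrative, not the benchmark's actual code.

# Illustrative sketch only: canonicalize the citation, then require all three
# S1 fields to match for a score of 1.0.
def normalize_cite(cite: str) -> str:
    return cite.strip().lower().replace(".", "").replace(" ", "_")

def score_s1(pred: dict, truth: dict) -> float:
    match = (
        normalize_cite(pred["us_cite"]) == normalize_cite(truth["us_cite"])
        and pred["case_name"].strip().lower() == truth["case_name"].strip().lower()
        and int(pred["term"]) == int(truth["term"])
    )
    return 1.0 if match else 0.0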

S2: Unknown Authority Retrieval

Uses ranked retrieval metrics. Primary score is MRR (Mean Reciprocal Rank).

  • MRR: 1/(rank of first correct result), or 0 if not found
  • hit@k: True if ground truth appears in top k results
  • Additional metrics: hit@1, hit@5, hit@10, hit@20, rank
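
A per-instance sketch of these metrics, assuming results is the model's ranked list of citations and truth is the ground-truth cite, both already canonicalized; the skill-level MRR is the mean of reciprocal_rank over all scored instances.

# Per-instance ranked-retrieval metrics (sketch).
def reciprocal_rank(results: list[str], truth: str) -> float:
    for rank, cite in enumerate(results, start=1):
        if cite == truth:
            return 1.0 / rank
    return 0.0

def hit_at_k(results: list[str], truth: str, k: int) -> bool:
    return truth in results[:k]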

S3: Validate Authority

Condition                                            Score   Correct
Not overruled, model says not overruled              1.0     True
Overruled, model says overruled with correct year    1.0     True
Overruled, model says overruled with wrong year      0.5     False
Overruled, model says not overruled                  0.0     False
Not overruled, model says overruled                  0.0     False
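
A sketch of this partial-credit rule, assuming pred and truth follow the S3 output schema given later in this document; the function name score_s3 is illustrative.

# Partial-credit rule for S3 (sketch).
def score_s3(pred: dict, truth: dict) -> float:
    if not truth["is_overruled"]:
        return 1.0 if not pred["is_overruled"] else 0.0
    if not pred["is_overruled"]:
        return 0.0
    # Case is overruled and the model says so: full credit only with the right year.
    return 1.0 if pred["year_overruled"] == truth["year_overruled"] else 0.5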

S4: Fact Extraction

Two fields evaluated independently, 0.5 weight each:

  • disposition - Must match closed enum exactly
  • party_winning - "petitioner", "respondent", or "unclear"
Disposition enum: stay granted, affirmed, reversed, reversed and remanded, vacated and remanded, affirmed and reversed in part, affirmed and vacated in part, affirmed and reversed in part and remanded, vacated, petition denied, certification
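
A minimal scoring sketch under the assumption that both fields are compared as exact, case-insensitive strings against the ground truth; the helper name score_s4 is illustrative.

# S4 scoring sketch: each field contributes 0.5 when it matches the ground truth.
def score_s4(pred: dict, truth: dict) -> float:
    score = 0.0
    if pred["disposition"].strip().lower() == truth["disposition"].strip().lower():
        score += 0.5
    if pred["party_winning"].strip().lower() == truth["party_winning"].strip().lower():
        score += 0.5
    return score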

S5: Distinguish

Binary classification: does the citing case agree with or distinguish from the cited case?

  • S5:cb (Closed-Book) - Uses only metadata + S4 extracted facts
  • S5:rag (RAG-Enhanced) - Adds citing opinion text
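
Scoring is identical in both modalities; a minimal sketch, assuming the edge record carries the agree label from scotus_shepards_sample.csv.

# S5 scoring sketch: correct when the predicted label matches the edge label.
def score_s5(pred: dict, edge: dict) -> float:
    return 1.0 if bool(pred["agrees"]) == bool(edge["agree"]) else 0.0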

S6: IRAC Synthesis

Rubric-based scoring with weighted components:

Component     Weight   Criteria
Issue         20%      Clear, correctly framed legal question
Rule          25%      Accurate statement of legal rule from case
Application   35%      Logical application with citation support
Conclusion    20%      Consistent with analysis, cites outcome
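
A sketch of how the weighted components combine, assuming each component has already been graded on a 0.0-1.0 scale against the rubric; the weight table and function name are illustrative.

# S6 rubric combination (sketch); component scores in [0, 1] are assumed to
# come from a separate rubric grader.
S6_WEIGHTS = {"issue": 0.20, "rule": 0.25, "application": 0.35, "conclusion": 0.20}

def score_s6(component_scores: dict) -> float:
    # Counted as correct when the weighted score is >= 0.5 (see Scoring Methods).
    return sum(w * component_scores[name] for name, w in S6_WEIGHTS.items())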

S7: Citation Integrity

Every citation must resolve to a verifiable real case; a single fabricated citation results in failure (score 0.0).

Verification: Citations extracted using eyecite library
Ground truth: Cross-referenced against SCDB and fake_cases.csv
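
A sketch of this verification pass using eyecite's get_citations; building the known_real_cites set from SCDB (with fake_cases.csv entries excluded) and matching on the raw citation string are assumptions, not the benchmark's exact procedure.

# S7 verification sketch: extract citations with eyecite and flag any that are
# not in the set of known real cites.
from eyecite import get_citations

def score_s7(answer_text: str, known_real_cites: set[str]) -> dict:
    found = []
    for citation in get_citations(answer_text):
        cite = citation.matched_text()
        found.append({"cite": cite, "exists": cite in known_real_cites})
    return {"citations_found": found, "all_valid": all(c["exists"] for c in found)}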

Dataset Schema

Ground Truth Sources

scdb_sample.csv
  • S1: usCite, caseName, term
  • S4: caseDisposition, partyWinning
scotus_shepards_sample.csv
  • S2: citing_case_us_cite
  • S5: agree (edge label)
scotus_overruled_db.csv
  • S3: overruling_case_name, year_overruled
fake_cases.csv
  • S7: Known fabricated citations

Output Schemas

// S1 Output
{ "us_cite": string, "case_name": string, "term": int }

// S2 Output
{ "citing_cases": [{ "us_cite": string, "case_name": string }] }

// S3 Output
{ "is_overruled": bool, "overruling_case": string|null, "year_overruled": int|null }

// S4 Output
{ "disposition": string, "party_winning": string, "holding_summary": string }

// S5 Output
{ "agrees": bool, "reasoning": string }

// S6 Output
{ "issue": string, "rule": string, "application": string, "conclusion": string }

// S7 Output
{ "citations_found": [{ "cite": string, "exists": bool }], "all_valid": bool }

Aggregation & Reporting

Per-Skill Metrics

Each skill reports accuracy as: correct / total for instances with status="OK".

Coverage Reporting

S5:rag coverage = instances with citing text / total instances. Reported separately from accuracy.
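
A sketch of both computations, using illustrative per-instance field names (status, correct, has_citing_text) that are not specified by this document.

# Per-skill accuracy over status == "OK" instances, plus S5:rag coverage (sketch).
def skill_accuracy(instances: list[dict]) -> float:
    ok = [i for i in instances if i["status"] == "OK"]
    return sum(i["correct"] for i in ok) / len(ok) if ok else 0.0

def s5_rag_coverage(instances: list[dict]) -> float:
    with_text = sum(1 for i in instances if i.get("has_citing_text"))
    return with_text / len(instances) if instances else 0.0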

Versioning Policy

Breaking Changes (Major Version)

  • Changes to scoring formulas
  • Adding/removing skills from evaluation
  • Changes to ground truth sources

Compatible Changes (Minor Version)

  • Adding new metrics without changing existing ones
  • Expanding dataset coverage
  • Improving documentation

Run Artifacts

Each benchmark run produces: manifest with dataset hashes, per-instance step traces, aggregated scores, and model metadata.
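
A sketch of how the manifest's dataset hashes might be produced; the SHA-256 choice, field layout, and placeholder model metadata are assumptions.

# Run-manifest sketch: hash each ground-truth file so a run can be reproduced
# against the exact datasets it was scored on.
import hashlib
import json
from pathlib import Path

DATASETS = ["scdb_sample.csv", "scotus_shepards_sample.csv",
            "scotus_overruled_db.csv", "fake_cases.csv"]

manifest = {
    "dataset_hashes": {p: hashlib.sha256(Path(p).read_bytes()).hexdigest()
                       for p in DATASETS},
    "model": {"name": "<model-id>", "version": "<model-version>"},
}
print(json.dumps(manifest, indent=2))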