Design
L10 Atomic evaluates each skill independently (no error propagation), producing per-skill scores and a final summary.
Skills Included
L10 Atomic evaluates all 7 core legal reasoning skills independently: S1 (Known Authority), S2 (Unknown Authority), S3 (Validate Authority), S4 (Fact Extraction), S5 (Distinguish), S6 (IRAC Synthesis), S7 (Citation Integrity).
Modalities
- S5:cb - Closed-book (metadata + extracted facts only)
- S5:rag - RAG-enhanced (includes citing opinion text)
- All other skills run in a single modality
Gates
S7 (Citation Integrity) functions as a hard gate in L10 Agentic, voiding S6 output if fabrications are detected. In L10 Atomic, S7 is evaluated independently and does not void other skill scores.
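The gating behavior described above can be sketched as follows. This is a minimal illustration; the function name, mode strings, and score-dict shape are assumptions, not the benchmark's actual API.

```python
def apply_citation_gate(scores: dict, mode: str) -> dict:
    """In agentic mode, an S7 failure (fabricated citation) voids S6.

    In atomic mode, S7 is scored independently and other skills keep
    their scores.
    """
    if mode == "agentic" and scores.get("S7") == 0.0:
        voided = dict(scores)
        voided["S6"] = 0.0  # IRAC synthesis is voided by fabrication
        return voided
    return scores
```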
| Step | Method | Score | Correct |
|---|---|---|---|
| S1 | Exact match | 1.0 or 0.0 | All fields match |
| S2 | Ranked retrieval | MRR (0.0-1.0) | hit@10 |
| S3 | Exact + partial credit | 1.0, 0.5, or 0.0 | is_overruled + year match |
| S4 | Weighted fields | 0.5 per field | Both fields match |
| S5:cb / S5:rag | Binary | 1.0 or 0.0 | agrees == edge.agree |
| S6 | Rubric-based | 0.0-1.0 weighted | score >= 0.5 |
| S7 | Binary | 1.0 or 0.0 | all_valid == True |
S1: Known Authority Retrieval
All three fields must match exactly: us_cite, case_name, term.
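The all-fields exact match can be sketched as below (function name and dict shape are illustrative, following the S1 output schema shown later):

```python
def score_known_authority(gold: dict, pred: dict) -> float:
    """1.0 only if us_cite, case_name, and term all match exactly."""
    fields = ("us_cite", "case_name", "term")
    return 1.0 if all(pred.get(f) == gold[f] for f in fields) else 0.0
```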
S2: Unknown Authority Retrieval
Uses ranked retrieval metrics. Primary score is MRR (Mean Reciprocal Rank).
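A minimal sketch of the two metrics (function names and list-of-strings inputs are assumptions; the real harness presumably compares structured citation records):

```python
def mean_reciprocal_rank(ranked_results: list, gold: list) -> float:
    """Average of 1/rank of the first correct hit per instance; 0 if absent."""
    total = 0.0
    for results, answer in zip(ranked_results, gold):
        for rank, cite in enumerate(results, start=1):
            if cite == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

def hit_at_10(results: list, answer: str) -> bool:
    """Correctness criterion: gold citation appears in the top 10."""
    return answer in results[:10]
```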
S3: Validate Authority
| Condition | Score | Correct |
|---|---|---|
| Not overruled, model says not overruled | 1.0 | True |
| Overruled, model says overruled with correct year | 1.0 | True |
| Overruled, model says overruled with wrong year | 0.5 | False |
| Overruled, model says not overruled | 0.0 | False |
| Not overruled, model says overruled | 0.0 | False |
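The table above maps directly to a scoring function. A minimal sketch (names and tuple return are illustrative):

```python
def score_validate(gold_overruled: bool, gold_year,
                   pred_overruled: bool, pred_year):
    """Return (score, correct) per the S3 partial-credit table."""
    if not gold_overruled:
        # Not overruled: full credit only for saying so.
        return (1.0, True) if not pred_overruled else (0.0, False)
    if not pred_overruled:
        # Missed an overruling entirely.
        return (0.0, False)
    # Overruled and model agrees: year decides full vs. partial credit.
    return (1.0, True) if pred_year == gold_year else (0.5, False)
```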
S4: Fact Extraction
Two fields evaluated independently, 0.5 weight each:
- disposition - Must match closed enum exactly
- party_winning - "petitioner", "respondent", or "unclear"
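The per-field weighting can be sketched as below. The disposition values shown are an illustrative subset, not the actual closed enum:

```python
PARTIES = {"petitioner", "respondent", "unclear"}

def score_fact_extraction(gold: dict, pred: dict) -> float:
    """Each of the two fields contributes 0.5; both must match for 1.0."""
    score = 0.0
    if pred.get("disposition") == gold["disposition"]:
        score += 0.5
    if pred.get("party_winning") == gold["party_winning"]:
        score += 0.5
    return score
```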
S5: Distinguish
Binary classification: does the citing case agree with or distinguish from the cited case?
S6: IRAC Synthesis
Rubric-based scoring with weighted components:
| Component | Weight | Criteria |
|---|---|---|
| Issue | 20% | Clear, correctly framed legal question |
| Rule | 25% | Accurate statement of legal rule from case |
| Application | 35% | Logical application with citation support |
| Conclusion | 20% | Consistent with analysis, cites outcome |
S7: Citation Integrity
All citations must be verifiable real cases. A single fabricated citation results in failure.
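The all-or-nothing check can be sketched over the S7 output shape (`citations_found` entries with an `exists` flag; the function name is illustrative):

```python
def score_citation_integrity(citations_found: list) -> tuple:
    """1.0 only if every extracted citation resolves to a real case."""
    all_valid = all(c["exists"] for c in citations_found)
    return (1.0 if all_valid else 0.0), all_valid
```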
Citations are extracted and verified using the eyecite library.
Dataset Schema
Ground Truth Sources
- S1: usCite, caseName, term
- S2: citing_case_us_cite
- S3: overruling_case_name, year_overruled
- S4: caseDisposition, partyWinning
- S5: agree (edge label)
- S7: Known fabricated citations
Output Schemas
// S1 Output
{ "us_cite": string, "case_name": string, "term": int }
// S2 Output
{ "citing_cases": [{ "us_cite": string, "case_name": string }] }
// S3 Output
{ "is_overruled": bool, "overruling_case": string|null, "year_overruled": int|null }
// S4 Output
{ "disposition": string, "party_winning": string, "holding_summary": string }
// S5 Output
{ "agrees": bool, "reasoning": string }
// S6 Output
{ "issue": string, "rule": string, "application": string, "conclusion": string }
// S7 Output
{ "citations_found": [{ "cite": string, "exists": bool }], "all_valid": bool }
Aggregation & Reporting
Per-Skill Metrics
Each skill reports accuracy as: correct / total for instances with status="OK".
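A minimal sketch of that aggregation, assuming per-instance records with `status` and `correct` fields (names are illustrative):

```python
def skill_accuracy(instances: list) -> float:
    """Accuracy = correct / total, over instances with status == "OK"."""
    ok = [i for i in instances if i["status"] == "OK"]
    if not ok:
        return 0.0
    return sum(1 for i in ok if i["correct"]) / len(ok)
```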
Coverage Reporting
S5:rag coverage = instances with citing text / total instances. Reported separately from accuracy.
Versioning Policy
Breaking Changes (Major Version)
- Changes to scoring formulas
- Adding/removing skills from evaluation
- Changes to ground truth sources
Compatible Changes (Minor Version)
- Adding new metrics without changing existing ones
- Expanding dataset coverage
- Improving documentation
Run Artifacts
Each benchmark run produces: manifest with dataset hashes, per-instance step traces, aggregated scores, and model metadata.
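The dataset hashes in the manifest can be computed as below. This is a sketch assuming SHA-256 over the raw file bytes; the actual hash algorithm is not specified here.

```python
import hashlib

def dataset_hash(path: str) -> str:
    """SHA-256 hex digest of a dataset file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```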