Design
L10 Atomic evaluates each skill independently (no error propagation), producing per-skill scores and a final summary.
Skills Included
L10 Atomic evaluates all 7 core legal reasoning skills independently: S1 (Known Authority), S2 (Unknown Authority), S3 (Validate Authority), S4 (Fact Extraction), S5 (Distinguish), S6 (IRAC Synthesis), S7 (Citation Integrity).
Modalities
- S5:cb - Closed-book (metadata + extracted facts only)
- S5:rag - RAG-enhanced (includes citing opinion text)
- All other skills run in a single modality
Gates
S7 (Citation Integrity) functions as a hard gate in L10 Agentic, voiding S6 output if fabrications are detected. In L10 Atomic, S7 is evaluated independently and does not void other skill scores.
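The gating behavior described above can be sketched as follows. This is a minimal illustration; the function name, mode strings, and score-dict shape are assumptions, not the benchmark's actual API.

```python
def apply_citation_gate(scores: dict, mode: str) -> dict:
    """In agentic mode, an S7 failure (fabricated citation) voids S6.

    In atomic mode, S7 is scored independently and other skills keep
    their scores.
    """
    if mode == "agentic" and scores.get("S7") == 0.0:
        voided = dict(scores)
        voided["S6"] = 0.0  # IRAC synthesis is voided by fabrication
        return voided
    return scores
```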
| Step | Method | Score | Correct |
|---|---|---|---|
| S1 | Exact match | 1.0 or 0.0 | All fields match |
| S2 | Ranked retrieval | MRR (0.0-1.0) | hit@10 |
| S3 | Exact + partial credit | 1.0, 0.5, or 0.0 | is_overruled + year match |
| S4 | Weighted fields | 0.5 per field | Both fields match |
| S5:cb / S5:rag | Binary | 1.0 or 0.0 | agrees == edge.agree |
| S6 | Rubric-based | 0.0-1.0 weighted | score >= 0.5 |
| S7 | Binary | 1.0 or 0.0 | all_valid == True |
S1: Known Authority Retrieval
All three fields must match exactly: us_cite, case_name, term.
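The all-fields exact match can be sketched as below (function name and dict shape are illustrative, following the S1 output schema shown later):

```python
def score_known_authority(gold: dict, pred: dict) -> float:
    """1.0 only if us_cite, case_name, and term all match exactly."""
    fields = ("us_cite", "case_name", "term")
    return 1.0 if all(pred.get(f) == gold[f] for f in fields) else 0.0
```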
S2: Unknown Authority Retrieval
Uses ranked retrieval metrics. Primary score is MRR (Mean Reciprocal Rank).
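A minimal sketch of the two metrics (function names and list-of-strings inputs are assumptions; the real harness presumably compares structured citation records):

```python
def mean_reciprocal_rank(ranked_results: list, gold: list) -> float:
    """Average of 1/rank of the first correct hit per instance; 0 if absent."""
    total = 0.0
    for results, answer in zip(ranked_results, gold):
        for rank, cite in enumerate(results, start=1):
            if cite == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

def hit_at_10(results: list, answer: str) -> bool:
    """Correctness criterion: gold citation appears in the top 10."""
    return answer in results[:10]
```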
S3: Validate Authority
| Condition | Score | Correct |
|---|---|---|
| Not overruled, model says not overruled | 1.0 | True |
| Overruled, model says overruled with correct year | 1.0 | True |
| Overruled, model says overruled with wrong year | 0.5 | False |
| Overruled, model says not overruled | 0.0 | False |
| Not overruled, model says overruled | 0.0 | False |
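The table above maps directly to a scoring function. A minimal sketch (names and tuple return are illustrative):

```python
def score_validate(gold_overruled: bool, gold_year,
                   pred_overruled: bool, pred_year):
    """Return (score, correct) per the S3 partial-credit table."""
    if not gold_overruled:
        # Not overruled: full credit only for saying so.
        return (1.0, True) if not pred_overruled else (0.0, False)
    if not pred_overruled:
        # Missed an overruling entirely.
        return (0.0, False)
    # Overruled and model agrees: year decides full vs. partial credit.
    return (1.0, True) if pred_year == gold_year else (0.5, False)
```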
S4: Fact Extraction
Two fields evaluated independently, 0.5 weight each:
- disposition - Must match closed enum exactly
- party_winning - "petitioner", "respondent", or "unclear"
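The per-field weighting can be sketched as below. The disposition values shown are an illustrative subset, not the actual closed enum:

```python
PARTIES = {"petitioner", "respondent", "unclear"}

def score_fact_extraction(gold: dict, pred: dict) -> float:
    """Each of the two fields contributes 0.5; both must match for 1.0."""
    score = 0.0
    if pred.get("disposition") == gold["disposition"]:
        score += 0.5
    if pred.get("party_winning") == gold["party_winning"]:
        score += 0.5
    return score
```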
S5: Distinguish
Binary classification: does the citing case agree with or distinguish from the cited case?
S6: IRAC Synthesis
Rubric-based scoring with weighted components:
| Component | Weight | Criteria |
|---|---|---|
| Issue | 20% | Clear, correctly framed legal question |
| Rule | 25% | Accurate statement of legal rule from case |
| Application | 35% | Logical application with citation support |
| Conclusion | 20% | Consistent with analysis, cites outcome |
S7: Citation Integrity
All citations must be verifiable real cases. A single fabricated citation results in failure.
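The all-or-nothing check can be sketched over the S7 output shape (`citations_found` entries with an `exists` flag; the function name is illustrative):

```python
def score_citation_integrity(citations_found: list) -> tuple:
    """1.0 only if every extracted citation resolves to a real case."""
    all_valid = all(c["exists"] for c in citations_found)
    return (1.0 if all_valid else 0.0), all_valid
```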
Citations are extracted and verified using the eyecite library.
Dataset Schema
Ground Truth Sources
- S1: usCite, caseName, term
- S2: citing_case_us_cite
- S3: overruling_case_name, year_overruled
- S4: caseDisposition, partyWinning
- S5: agree (edge label)
- S7: Known fabricated citations
Output Schemas
// S1 Output
{ "us_cite": string, "case_name": string, "term": int }
// S2 Output
{ "citing_cases": [{ "us_cite": string, "case_name": string }] }
// S3 Output
{ "is_overruled": bool, "overruling_case": string|null, "year_overruled": int|null }
// S4 Output
{ "disposition": string, "party_winning": string, "holding_summary": string }
// S5 Output
{ "agrees": bool, "reasoning": string }
// S6 Output
{ "issue": string, "rule": string, "application": string, "conclusion": string }
// S7 Output
{ "citations_found": [{ "cite": string, "exists": bool }], "all_valid": bool }
Aggregation & Reporting
Per-Skill Metrics
Each skill reports accuracy as: correct / total for instances with status="OK".
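A minimal sketch of that aggregation, assuming per-instance records with `status` and `correct` fields (names are illustrative):

```python
def skill_accuracy(instances: list) -> float:
    """Accuracy = correct / total, over instances with status == "OK"."""
    ok = [i for i in instances if i["status"] == "OK"]
    if not ok:
        return 0.0
    return sum(1 for i in ok if i["correct"]) / len(ok)
```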
Coverage Reporting
S5:rag coverage = instances with citing text / total instances. Reported separately from accuracy.
Versioning Policy
Breaking Changes (Major Version)
- Changes to scoring formulas
- Adding/removing skills from evaluation
- Changes to ground truth sources
Compatible Changes (Minor Version)
- Adding new metrics without changing existing ones
- Expanding dataset coverage
- Improving documentation
Run Artifacts
Each benchmark run produces: manifest with dataset hashes, per-instance step traces, aggregated scores, and model metadata.
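The dataset hashes in the manifest can be computed as below. This is a sketch assuming SHA-256 over the raw file bytes; the actual hash algorithm is not specified here.

```python
import hashlib

def dataset_hash(path: str) -> str:
    """SHA-256 hex digest of a dataset file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```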