L7
L7 is an atomic benchmark: each skill is evaluated independently, with no chaining and no stateful carry-over, so per-skill scores isolate exactly where a model struggles, free of error propagation.
| Mode | Skills | Best for |
|---|---|---|
| Atomic (stateless) | 7 independent skills | Diagnosis + baseline skill profiling |
The 7 skills
- S1 (Known Authority Retrieval): Given a citation or case name, return canonical details.
- S2 (Unknown Authority Retrieval): Given a case, predict which later cases cite it.
- S3 (Validate Authority): Determine whether an authority remains good law.
- S4 (Fact Extraction): Extract the holding/disposition and key facts from an opinion.
- S5 (Distinguish): Compare cases and decide whether they meaningfully differ.
- S6 (IRAC Synthesis): Write a structured legal analysis under a rubric.
- S7 (Citation Integrity): Binary check; a single fabricated authority fails the skill (evaluated independently in L7).
Scoring summary
| Skill | Method | Signal |
|---|---|---|
| S1 | Exact match | 0/1 |
| S2 | Ranked retrieval | MRR / hit@k |
| S3 | Exact + partial credit | 0 / 0.5 / 1 |
| S4 | Weighted fields | 0–1 |
| S5 | Binary (two modalities) | 0/1 |
| S6 | Rubric-based | 0–1 |
| S7 | Binary integrity check | pass/fail |
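Two rows of the table can be illustrated with short scorers: the S2 ranked-retrieval signals (MRR, hit@k) and the S4 weighted-field score. The input shapes (ranked lists of case IDs, per-field weights) are assumptions for the sketch, not the benchmark's actual data model.

```python
def mrr(ranked: list[str], gold: set[str]) -> float:
    """S2-style signal: reciprocal rank of the first relevant hit (0 if none)."""
    for rank, case_id in enumerate(ranked, start=1):
        if case_id in gold:
            return 1.0 / rank
    return 0.0

def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1 if any of the top-k predictions is relevant, else 0."""
    return 1.0 if any(c in gold for c in ranked[:k]) else 0.0

def weighted_fields(correct: dict[str, bool], weights: dict[str, float]) -> float:
    """S4-style signal: weighted share of correctly extracted fields, in [0, 1]."""
    total = sum(weights.values())
    earned = sum(w for field, w in weights.items() if correct.get(field, False))
    return earned / total if total else 0.0
```

MRR rewards placing a true citing case near the top of the list, while the weighted-field score lets important fields (e.g. the holding) count for more than minor ones.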
Modalities
S5:cb runs closed-book (metadata + extracted facts only); S5:rag runs RAG-enhanced (includes additional opinion text).
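The difference between the two S5 modalities amounts to what goes into the comparison payload. A sketch, assuming hypothetical field names (`metadata`, `facts`, `opinion_text`) for illustration:

```python
def s5_input(case: dict, modality: str) -> dict:
    """Build the per-case payload for an S5 comparison.

    'cb' (closed-book) keeps metadata and extracted facts only;
    'rag' additionally includes retrieved opinion text.
    """
    payload = {"metadata": case["metadata"], "facts": case["facts"]}
    if modality == "rag":
        payload["opinion_text"] = case.get("opinion_text", "")
    return payload
```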
When to use L7
- Isolate which skills a model struggles with (without chain-level confounds)
- Track improvements/regressions per capability over time
- Complement chained benchmarks (AG8/AG10) during iteration
Results
Leaderboard: pick an L7 run spec (when available) to compare models.
How it works
Methodology: atomic skills, scoring, and integrity policies.