L7

L7 is an atomic benchmark: each skill is evaluated independently, with no chaining and no stateful carry-over, so the per-skill scores isolate where a model struggles without error propagation.

Mode: Atomic (stateless)
Skills: 7 independent skills
Best for: Diagnosis + baseline skill profiling

The 7 skills

S1. Known Authority Retrieval: Given a citation or case name, return canonical details.
S2. Unknown Authority Retrieval: Given a case, predict which later cases cite it.
S3. Validate Authority: Determine whether authority remains good law.
S4. Fact Extraction: Extract holding/disposition and key facts from an opinion.
S5. Distinguish: Compare cases and decide whether they meaningfully differ.
S6. IRAC Synthesis: Write a structured legal analysis under a rubric.
S7. Citation Integrity: Binary check; a single fabricated authority fails the skill (evaluated independently in L7).
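The S7 pass/fail rule can be sketched as a membership check against a verified corpus. This is an illustrative sketch, not L7's actual harness; the function name and the list/set interface are assumptions:

```python
def citation_integrity(cited_authorities, verified_corpus):
    """Binary integrity signal (S7-style): pass only if every cited
    authority resolves against the verified corpus. A single
    fabricated (unresolvable) citation fails the whole skill.
    Note: an answer citing no authorities trivially passes here."""
    return all(c in verified_corpus for c in cited_authorities)
```

For example, with a corpus containing only "Smith v. Jones", citing ["Smith v. Jones"] passes, while citing ["Smith v. Jones", "Fake v. Case"] fails (the case names are hypothetical).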

Scoring summary

Skill  Method                   Signal
S1     Exact match              0/1
S2     Ranked retrieval         MRR / hit@k
S3     Exact + partial credit   0 / 0.5 / 1
S4     Weighted fields          0–1
S5     Binary (two modalities)  0/1
S6     Rubric-based             0–1
S7     Binary integrity check   pass/fail
Modalities
S5:cb runs closed-book (metadata + extracted facts only). S5:rag runs RAG-enhanced (includes additional opinion text).
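The S2 signal in the table above uses standard ranked-retrieval metrics, which can be sketched as follows. The function names and the ranked-list-of-ids interface are illustrative assumptions, not L7's actual implementation:

```python
def mrr(ranked_ids, gold_ids):
    """Reciprocal rank for one query: 1/rank of the first
    gold (relevant) id in the ranking, 0.0 if none appears.
    Averaging this over all queries gives MRR."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def hit_at_k(ranked_ids, gold_ids, k):
    """hit@k: 1.0 if any gold id appears in the top k results."""
    return 1.0 if any(d in gold_ids for d in ranked_ids[:k]) else 0.0
```

For a prediction ["a", "b", "c"] with gold {"b"}, this gives mrr = 0.5, hit@1 = 0.0, and hit@3 = 1.0.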

When to use L7

  • Isolate which skills a model struggles with (without chain-level confounds)
  • Track improvements/regressions per capability over time
  • Complement chained benchmarks (AG8/AG10) during iteration

Results (Leaderboard): Pick an L7 run spec (when available) to compare models.
How it works (Methodology): Atomic skills, scoring, and integrity policies.