Data Inventory & Lineage
LegalChain evaluates AI against the largest empirically grounded legal corpus ever assembled: 27,733 Supreme Court opinions, more than 43,000 lower federal court decisions, and 5.7 million precedential relationships.
Why Domain Provenance Matters
A benchmark is only as credible as its underlying data. LegalChain draws exclusively from publicly available, academically validated sources. We do not use proprietary annotations, crowd-sourced labels, or synthetic training data.
Verifiability
Any researcher can reconstruct our corpus from original sources. We publish exact source files and transformation scripts.
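One way to make a reconstruction checkable is to compare each downloaded source file against a published checksum. The sketch below assumes this approach and is not LegalChain's actual tooling: the manifest format (relative path mapped to SHA-256 digest) and the `verify_manifest` helper are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    `manifest` maps a relative file path to its expected SHA-256 hex digest.
    This format is an assumption made for illustration only.
    """
    mismatches = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(relative_path)
    return mismatches
```

An empty return value means every listed file is present and byte-identical to the version the manifest describes.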
Authority
Every source is peer-reviewed or published by an established academic institution, such as the SCDB (Washington University) or the Caselaw Access Project (Harvard Law School Library).
Stability
Academic datasets are released as versioned snapshots, whereas commercial databases are updated in place and drift without preserving prior states.
Primary Data Sources
The LegalChain corpus is built from four primary, authoritative sources.
Supreme Court Database (SCDB)
The foundation of the anchor corpus. It covers 29,021 cases (1791–2021) with 55 standardized variables, including decision date, voting alignment, and case name.
Caselaw Access Project (CAP)
The first complete digitization of American case law. We extract 43,043 federal appellate and district court opinions cited directly by the Supreme Court.
Fowler Authority Scores
Eigenvector centrality measures for 27,846 cases, providing a continuous importance score (0.0–1.0) rather than binary classifications.
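To illustrate how an eigenvector-based score can yield a continuous value between 0.0 and 1.0, here is a minimal sketch using a HITS-style hub/authority iteration on a toy citation graph. Whether this matches how the published scores were computed is an assumption. The function name, the iteration count, and the max-scaling step are illustrative choices.

```python
def authority_scores(citations, iterations=50):
    """Compute HITS-style authority scores scaled to the range [0.0, 1.0].

    `citations` is a list of (citing_case, cited_case) pairs.
    Scaling by the maximum is an illustrative choice, not a documented one.
    """
    if not citations:
        return {}
    nodes = sorted({case for edge in citations for case in edge})
    hub = {case: 1.0 for case in nodes}
    auth = {case: 1.0 for case in nodes}
    for _ in range(iterations):
        # A case's authority is the sum of the hub scores of cases citing it.
        new_auth = {case: 0.0 for case in nodes}
        for citing, cited in citations:
            new_auth[cited] += hub[citing]
        # A case's hub score is the sum of the authority of the cases it cites.
        new_hub = {case: 0.0 for case in nodes}
        for citing, cited in citations:
            new_hub[citing] += new_auth[cited]
        # Rescale so the top score is 1.0 and values stay bounded.
        auth_max = max(new_auth.values()) or 1.0
        hub_max = max(new_hub.values()) or 1.0
        auth = {case: value / auth_max for case, value in new_auth.items()}
        hub = {case: value / hub_max for case, value in new_hub.items()}
    return auth


# Toy graph: B cites A, C cites A, C cites B.
scores = authority_scores([("B", "A"), ("C", "A"), ("C", "B")])
```

In this toy graph the most-cited case `A` receives the top score, `B` lands in between, and `C`, which nobody cites, scores zero.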
Shepard's Treatment Citations
5.7 million citation edges labeled with treatment semantics (Followed, Overruled, Distinguished) to evaluate relational reasoning.
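A labeled citation edge can be modeled as a small record carrying the citing case, the cited case, and the treatment. The sketch below is one possible representation, not the project's actual schema: the class names, field names, and case IDs are hypothetical, and only the three treatment labels named above are included.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Treatment(Enum):
    FOLLOWED = "followed"
    OVERRULED = "overruled"
    DISTINGUISHED = "distinguished"


@dataclass(frozen=True)
class CitationEdge:
    citing_case: str  # hypothetical case identifier
    cited_case: str   # hypothetical case identifier
    treatment: Treatment


def treatment_counts(case_id, edges):
    """Count how later cases treated `case_id`, keyed by Treatment."""
    return Counter(e.treatment for e in edges if e.cited_case == case_id)


edges = [
    CitationEdge("case-2", "case-1", Treatment.FOLLOWED),
    CitationEdge("case-3", "case-1", Treatment.DISTINGUISHED),
    CitationEdge("case-4", "case-1", Treatment.FOLLOWED),
    CitationEdge("case-4", "case-2", Treatment.OVERRULED),
]
counts = treatment_counts("case-1", edges)
```

Keeping the treatment as an enum rather than a free-text string makes it harder for an unrecognized label to slip into the ground truth unnoticed.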
Derived Artifacts
Raw data is transformed into resolution maps and ground-truth vectors used for evaluation scoring.
| Artifact / Map | Coverage | Status |
|---|---|---|
| SCOTUS-to-SCOTUS | 323,404 citations | Verified |
| SCOTUS-to-CAP | 55,534 citations | Verified |
| Precedent Validity Map | 288 overrulings | Annotated |
| Hallucination Scopes | 1,000 fake cases | Synthetic |
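The sketch below shows how a resolution map and a validity map might be consulted during scoring. It rests on assumptions: the map layouts, the case IDs, and the helper names are hypothetical, and the two entries are toy data (Brown v. Board of Education, 347 U.S. 483, and Plessy v. Ferguson, 163 U.S. 537), not an excerpt from the artifacts listed above.

```python
import re
from typing import Optional

# Toy resolution map: normalized reporter citation -> hypothetical case ID.
RESOLUTION_MAP = {
    "347 u.s. 483": "case-brown",
    "163 u.s. 537": "case-plessy",
}

# Toy validity map: overruled case ID -> year in which it was overruled.
VALIDITY_MAP = {"case-plessy": 1954}


def normalize(citation: str) -> str:
    """Lowercase the citation and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", citation.strip().lower())


def resolve(citation: str) -> Optional[str]:
    """Return the case ID for a citation, or None if it does not resolve."""
    return RESOLUTION_MAP.get(normalize(citation))


def is_good_law(case_id: str, as_of_year: int) -> bool:
    """Return True if the case had not yet been overruled in `as_of_year`."""
    overruled_in = VALIDITY_MAP.get(case_id)
    return overruled_in is None or as_of_year < overruled_in
```

A citation that fails to resolve is a candidate fabrication, while a citation that resolves to an overruled case is real but no longer valid authority. The two failure modes can therefore be scored separately.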
Corpus Constraints
To ensure high precision and avoid model leakage, we apply strict exclusion rules.
No Westlaw or Lexis editorial annotations, which would restrict redistribution.
A fixed temporal cutoff prevents recent cases from leaking into model evaluation.
No reliance on Mechanical Turk or other low-expertise labels for legal ground truth.
Limited to federal case law for high-precision evaluation.
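The temporal and jurisdictional rules above can be expressed as a simple predicate over case records. This is a minimal sketch under stated assumptions: the `CaseRecord` fields and the `"federal"` jurisdiction label are hypothetical, and the cutoff date is a parameter because the actual cutoff is not stated here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseRecord:
    case_id: str       # hypothetical identifier
    decided: date      # decision date
    jurisdiction: str  # hypothetical label, e.g. "federal" or "state"


def within_constraints(record: CaseRecord, cutoff: date) -> bool:
    """Keep only federal cases decided on or before the cutoff date."""
    return record.jurisdiction == "federal" and record.decided <= cutoff


def apply_constraints(records, cutoff: date):
    """Return the records that satisfy both exclusion rules."""
    return [r for r in records if within_constraints(r, cutoff)]
```

Applying the filter once, at corpus build time, keeps the cutoff in a single place instead of scattering date checks across evaluation code.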
Storage & Stability
The total LegalChain runtime inventory occupies **3.7 GB**. Measured against the raw 24 GB CAP corpus, that is a reduction of roughly 85% (3.7 / 24 ≈ 0.15 retained), achieved by extracting only the relevant precedential anchors.