Data Inventory & Lineage

LegalChain evaluates AI against the largest empirically grounded legal corpus ever assembled: 27,733 Supreme Court opinions, 43,000+ lower federal court decisions, and 5.7 million precedential relationships.

~70k Verified Opinions

5.7M Citations Mapped

100% Academically Sourced

Why Domain Provenance Matters

A benchmark is only as credible as its underlying data. LegalChain draws exclusively from publicly available, academically validated sources. We do not use proprietary annotations, crowd-sourced labels, or synthetic training data.

Verifiability

Any researcher can reconstruct our corpus from original sources. We publish exact source files and transformation scripts.

Authority

Every source is peer-reviewed or published by an established academic institution, such as the SCDB (Washington University in St. Louis) or the Harvard Law School Library.

Stability

Academic datasets are released under version control, so each release stays fixed and citable, unlike commercial databases that change without preserving prior state.

Primary Data Sources

The LegalChain corpus is synthesized from four primary authoritative libraries.

Supreme Court Database (SCDB)

Washington University in St. Louis / Harold J. Spaeth

The foundation of the anchor corpus. It covers 29,021 cases (1791–2021) with 55 standardized variables, including decision date, voting alignment, and case name.

Caselaw Access Project (CAP)

Harvard Law School Library

The first complete digitization of American case law. We extract 43,043 federal appellate and district court opinions cited directly by the Supreme Court.

NW Fowler Authority Scores

Eigenvector centrality measures for 27,846 cases, providing a continuous importance score (0.0–1.0) rather than binary classifications.
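The text above does not say how the scores are computed. As a rough illustration only, the sketch below uses a HITS-style authority iteration, one common eigenvector-based measure for citation networks. Whether it matches the published scores is an assumption, and the function name, edge format, and case IDs are hypothetical. Scores are rescaled so the most-cited-by-good-hubs case receives 1.0.

```python
def authority_scores(edges, iterations=50):
    """HITS-style authority scores, rescaled so the top case scores 1.0.

    `edges` is a list of (citing_case, cited_case) pairs (hypothetical format).
    """
    nodes = sorted({case for edge in edges for case in edge})
    if not nodes:
        return {}
    hub = {case: 1.0 for case in nodes}
    auth = {case: 1.0 for case in nodes}
    for _ in range(iterations):
        # Authority: summed hub score of every case that cites this one.
        new_auth = {case: 0.0 for case in nodes}
        for citing, cited in edges:
            new_auth[cited] += hub[citing]
        # Hub: summed authority of every case this one cites.
        new_hub = {case: 0.0 for case in nodes}
        for citing, cited in edges:
            new_hub[citing] += new_auth[cited]
        # Rescale by the maximum so values stay in the 0.0-1.0 range.
        auth_max = max(new_auth.values()) or 1.0
        hub_max = max(new_hub.values()) or 1.0
        auth = {case: value / auth_max for case, value in new_auth.items()}
        hub = {case: value / hub_max for case, value in new_hub.items()}
    return auth


# Toy graph: case "A" is cited three times, "B" once, the rest never.
edges = [("C", "A"), ("C", "B"), ("D", "A"), ("E", "A")]
scores = authority_scores(edges)
```

In this toy graph, "A" ends up at the top of the scale, "B" lands strictly between 0 and 1, and the never-cited cases score 0, which is the kind of continuous ranking a binary "landmark / not landmark" label cannot express.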

TR Shepard's Treatment Citations

5.7 million citation edges labeled with treatment semantics (Followed, Overruled, Distinguished) to evaluate relational reasoning.
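A treatment-labeled edge can be pictured as a small typed record. The sketch below is illustrative, not the project's schema: the class names, field names, and case IDs are hypothetical, and only the three treatment labels named above are modeled.

```python
from dataclasses import dataclass
from enum import Enum


class Treatment(Enum):
    # Only the three labels mentioned in the text; the real label set may be larger.
    FOLLOWED = "Followed"
    OVERRULED = "Overruled"
    DISTINGUISHED = "Distinguished"


@dataclass(frozen=True)
class CitationEdge:
    citing_case: str   # hypothetical case identifier
    cited_case: str    # hypothetical case identifier
    treatment: Treatment


def overruled_cases(edges):
    """Return the set of cited cases that at least one later case overruled."""
    return {e.cited_case for e in edges if e.treatment is Treatment.OVERRULED}


edges = [
    CitationEdge("case_3", "case_1", Treatment.FOLLOWED),
    CitationEdge("case_4", "case_1", Treatment.OVERRULED),
    CitationEdge("case_4", "case_2", Treatment.DISTINGUISHED),
]
no_longer_good_law = overruled_cases(edges)
```

Keeping the treatment on the edge rather than on the case means the same precedent can be followed by one opinion and overruled by another, which is what a relational-reasoning question needs to distinguish.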

Derived Artifacts

Raw data is transformed into resolution maps and ground-truth vectors used for evaluation scoring.

Artifact / Map	Coverage	Status
SCOTUS-to-SCOTUS	323,404 citations	Verified
SCOTUS-to-CAP	55,534 citations	Verified
Precedent Validity Map	288 overrulings	Annotated
Hallucination Scopes	1,000 fake cases	Synthetic

Corpus Constraints

To ensure high precision and avoid model leakage, we apply strict exclusion rules.

✕ No Proprietary Data

No Westlaw or Lexis editorial annotations that would break redistribution rights.

✕ No Post-2021 Data

A fixed temporal cutoff keeps cases decided after 2021 from leaking into model evaluation.

✕ No Crowd-Sourcing

Zero reliance on Mechanical Turk or other low-expertise labels for legal ground truth.

✕ No State Court Data

Limited to federal case law for high-precision evaluation.
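Two of the rules above, the temporal cutoff and the federal-only scope, are mechanical enough to express as a filter. The following is a minimal sketch under stated assumptions: the record fields, the `"federal"` / `"state"` values, and the reading of the cutoff as "decision year no later than 2021" are all hypothetical, not taken from the published scripts.

```python
from dataclasses import dataclass

CUTOFF_YEAR = 2021  # assumed reading of the "No Post-2021 Data" rule


@dataclass(frozen=True)
class CaseRecord:
    case_id: str        # hypothetical identifier
    decision_year: int
    court_system: str   # assumed values: "federal" or "state"


def is_admissible(record):
    """Apply the temporal-cutoff and federal-only exclusion rules."""
    return record.decision_year <= CUTOFF_YEAR and record.court_system == "federal"


candidates = [
    CaseRecord("case_a", 1954, "federal"),
    CaseRecord("case_b", 2023, "federal"),  # excluded: decided after the cutoff
    CaseRecord("case_c", 1998, "state"),    # excluded: state court
]
corpus = [record for record in candidates if is_admissible(record)]
```

The proprietary-data and crowd-sourcing rules concern where labels come from, so they are better enforced at the source-selection stage than by a per-record predicate like this one.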

Storage & Stability

The total LegalChain runtime inventory occupies **3.7 GB**. This is roughly an 85% reduction from the raw 24 GB CAP corpus (1 − 3.7/24 ≈ 0.85), achieved by extracting only the relevant precedential anchors.

SEALED_BUILD_MANIFEST.JSON INTEGRITY_VERIFIED

Case_metadata.parquet SHA256: 4e9d7...d3c8

Citation_inventory.parquet SHA256: f1a2b...9e0c

Appellate_text_v1.bin SHA256: a8b7c...2d1e
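A reader who wants to check a download against the sealed manifest could do something along these lines. This is a sketch only: the manifest layout (a flat JSON object mapping file name to a full hex digest) and the function names are assumptions, since the actual manifest format is not shown here.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, data_dir):
    """Return names of files that are missing or whose digest does not match.

    Assumes the manifest is a flat JSON object: {"file name": "hex digest"}.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    failures = []
    for name, expected in manifest.items():
        target = Path(data_dir) / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_manifest` would mean every listed artifact is present and byte-identical to the sealed build; any other result names the files to re-download or rebuild.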