Citation Inventory

Stage 1 performs a deterministic scan of Supreme Court majority opinions. It decomposes legal text into atomic citation occurrences (S1), governed by a strict I/O contract.

Anchor Specs: 27,733 Majority Opinions
Extraction: 378,938 Raw Occurrences
Unique Set: ~67k Verified Cites

Extraction Methodology

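Unlike naive approaches that deduplicate cases immediately, Stage 1 preserves every occurrence. This retains two signals that deduplication would discard: a Frequency Signal (how often an opinion cites a given case) and a Position Signal (where in the opinion each citation appears). Together they help distinguish cert-defining citations from boilerplate footnotes.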
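To make the two signals concrete, here is a minimal sketch of how they might be derived from preserved occurrences. The sample data, the `opinion_length` value, and the choice of "relative offset of the first occurrence" as the position measure are all assumptions for illustration, not the pipeline's documented definitions.

```python
from collections import Counter

# Hypothetical occurrences for one anchor opinion: (normalized_cite, start_offset).
occurrences = [
    ("347 U.S. 483", 120),
    ("410 U.S. 113", 900),
    ("347 U.S. 483", 2400),
    ("347 U.S. 483", 7800),
]
opinion_length = 10_000  # assumed character length of the opinion text

# Frequency signal: how many times each cite occurs in this opinion.
frequency = Counter(cite for cite, _ in occurrences)

# Position signal (one possible definition): relative offset of the first
# occurrence, where 0.0 is the start of the opinion and 1.0 is the end.
first_position = {}
for cite, start in occurrences:
    first_position.setdefault(cite, start / opinion_length)
```

Deduplicating before this step would collapse the three occurrences of the first cite into one row, and both signals would be lost.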

[ PIPELINE FLOW ]
01 load_anchor(caseId)
02 scan_patterns(regex)
03 record_offsets()

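The three steps above could be sketched roughly as follows. Only the function names come from the flow; the signatures, the in-memory `corpus` dictionary, the `US_REPORTS` pattern, and the `"U.S. Reports"` value for `cite_type` are assumptions, since the actual patterns and I/O contract are not shown here.

```python
import re

# Assumed pattern for U.S. Reports cites such as "347 U.S. 483" or
# "347 U. S. 483"; the pipeline's real patterns are not shown in this document.
US_REPORTS = re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b")

def load_anchor(case_id, corpus):
    """Step 01: fetch the majority-opinion text for one anchor case."""
    return corpus[case_id]

def scan_patterns(text, pattern=US_REPORTS):
    """Step 02: return an iterator over every match, without deduplicating."""
    return pattern.finditer(text)

def record_offsets(case_id, matches):
    """Step 03: turn matches into rows carrying character offsets."""
    return [
        {
            "anchor_caseId": case_id,
            "cite_type": "U.S. Reports",  # assumed label
            "normalized_cite": f"{m.group(1)} U.S. {m.group(2)}",
            "start": m.start(),
            "end": m.end(),
        }
        for m in matches
    ]

# Hypothetical one-opinion corpus for demonstration.
corpus = {"demo-1": "See Brown, 347 U.S. 483; accord 347 U. S. 483, 495."}
text = load_anchor("demo-1", corpus)
rows = record_offsets("demo-1", scan_patterns(text))
```

Both spellings in the sample text normalize to the same `normalized_cite`, yet they remain two separate rows with their own offsets, which is the behavior Stage 1 relies on.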
Parquet Schema: citation_inventory

Column            Type     Role
anchor_caseId     string   Primary PK
cite_type         string   Reporter
normalized_cite   string   Joint-Key
start / end       int      Spatial

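One way to mirror this schema in code is a small typed record that checks the offset pair before rows are written. This is a sketch only: the `CitationOccurrence` class and its validation rule are hypothetical, and it assumes `start` and `end` are stored as two separate integer columns with `end` exclusive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CitationOccurrence:
    """Hypothetical in-memory row matching the citation_inventory columns."""
    anchor_caseId: str    # opinion in which the citation occurs
    cite_type: str        # reporter family
    normalized_cite: str  # canonical cite string used for joins
    start: int            # offset where the match begins
    end: int              # offset just past the match (assumed exclusive)

    def __post_init__(self):
        # An occurrence must cover at least one character of the opinion.
        if not (0 <= self.start < self.end):
            raise ValueError(f"invalid offsets: start={self.start}, end={self.end}")

row = CitationOccurrence("demo-1", "U.S. Reports", "347 U.S. 483", 11, 23)
```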
Coverage Stats
U.S. Reports 85.4%
Federal Reporters 10.1%