Citation Inventory
Stage 1 performs a deterministic scan of Supreme Court majority opinions. It decomposes legal text into atomic citation occurrences (S1), governed by a strict I/O contract.
| Stage | Count | Unit |
|---|---|---|
| Anchor Specs | 27,733 | Majority Opinions |
| Extraction | 378,938 | Raw Occurrences |
| Unique Set | ~67k | Verified Cites |
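To put the counts above in proportion, the short calculation below derives two ratios from them. It is only arithmetic on the published figures; the unique count is stated as approximately 67k, so the second ratio is approximate as well, and the variable names are illustrative.

```python
# Figures copied from the summary counts; the unique count is approximate (~67k).
majority_opinions = 27_733
raw_occurrences = 378_938
unique_cites_approx = 67_000

# Average number of citation occurrences found in one majority opinion.
occurrences_per_opinion = raw_occurrences / majority_opinions

# Average number of times a distinct cite recurs across the whole corpus.
occurrences_per_unique_cite = raw_occurrences / unique_cites_approx

print(f"{occurrences_per_opinion:.1f} occurrences per opinion")
print(f"{occurrences_per_unique_cite:.1f} occurrences per unique cite (approx.)")
```

Roughly 13.7 occurrences per opinion and about 5.7 occurrences per unique cite.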
Extraction Methodology
Unlike naive approaches that deduplicate cases immediately, Stage 1 preserves every occurrence. Doing so retains both a Frequency Signal and a Position Signal, which together distinguish cert-defining citations from boilerplate footnotes.
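A minimal sketch of why early deduplication loses information. It assumes that the frequency signal is the per-opinion count of a cite and that the position signal is the list of character offsets at which it appears; the sample records and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical occurrence records for one opinion:
# (normalized_cite, start offset in the opinion text).
occurrences = [
    ("347 U.S. 483", 120),
    ("410 U.S. 113", 450),
    ("347 U.S. 483", 2300),
    ("347 U.S. 483", 5100),
]

# Frequency signal (assumed definition): how often each cite occurs.
frequency = Counter(cite for cite, _ in occurrences)

# Position signal (assumed definition): offsets of each cite, in document order.
positions = defaultdict(list)
for cite, start in sorted(occurrences, key=lambda occ: occ[1]):
    positions[cite].append(start)

# Deduplicating up front keeps only the set of cites and drops both signals.
deduplicated = {cite for cite, _ in occurrences}

print(frequency["347 U.S. 483"], positions["347 U.S. 483"], len(deduplicated))
```

With every occurrence kept, a cite that appears three times across the opinion can be told apart from one that appears once, which the deduplicated set alone cannot express.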
Pipeline Flow
1. load_anchor(caseId)
2. scan_patterns(regex)
3. record_offsets()
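The three steps can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function signatures, the in-memory `corpus` dictionary, the two regex patterns, and the normalization templates are all hypothetical, and real reporter coverage would need many more citation forms.

```python
import re
from typing import Iterator, NamedTuple


class Occurrence(NamedTuple):
    anchor_caseId: str
    cite_type: str
    normalized_cite: str
    start: int
    end: int


# Illustrative patterns only (assumption): each cite type maps to a regex and
# a template that rebuilds a normalized form from the captured groups.
PATTERNS = {
    "us_reports": (
        re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b"),
        "{0} U.S. {1}",
    ),
    "federal_reporter": (
        re.compile(r"\b(\d{1,4})\s+F\.\s?(2d|3d|4th)\s+(\d{1,4})\b"),
        "{0} F.{1} {2}",
    ),
}


def load_anchor(case_id: str, corpus: dict) -> str:
    """Step 1: fetch the majority-opinion text for one anchor case."""
    return corpus[case_id]


def scan_patterns(text: str) -> Iterator[tuple]:
    """Step 2: yield (cite_type, template, match) for every regex hit."""
    for cite_type, (pattern, template) in PATTERNS.items():
        for match in pattern.finditer(text):
            yield cite_type, template, match


def record_offsets(case_id: str, hits) -> list:
    """Step 3: build one row per hit, keeping duplicates, sorted by offset."""
    rows = [
        Occurrence(case_id, cite_type, template.format(*m.groups()),
                   m.start(), m.end())
        for cite_type, template, m in hits
    ]
    # Sorting by position keeps the output order independent of pattern order.
    return sorted(rows, key=lambda row: (row.start, row.end))


def run_stage1(case_id: str, corpus: dict) -> list:
    text = load_anchor(case_id, corpus)
    return record_offsets(case_id, scan_patterns(text))


# Hypothetical one-document corpus for demonstration.
corpus = {
    "case-001": "See Brown, 347 U. S. 483 (1954); cf. 123 F.3d 456. "
                "Brown, 347 U.S. 483, controls."
}
rows = run_stage1("case-001", corpus)
for row in rows:
    print(row)
```

Both spellings of the U.S. Reports cite normalize to the same string while remaining separate rows, so the repeated citation is counted twice rather than collapsed.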
Parquet Schema: citation_inventory
| Column | Type | Role |
|---|---|---|
| anchor_caseId | string | Primary key |
| cite_type | string | Reporter |
| normalized_cite | string | Join key |
| start / end | int | Spatial |
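One way to enforce a contract like this before rows are written is a small validator. The sketch below assumes that `start / end` denotes two separate integer columns, and that offsets are non-negative with `start < end`; neither constraint is stated in the schema. `SCHEMA` and `validate_row` are hypothetical names.

```python
# Column names and types taken from the schema table; start and end are
# treated as two separate integer columns (assumption).
SCHEMA = {
    "anchor_caseId": str,
    "cite_type": str,
    "normalized_cite": str,
    "start": int,
    "end": int,
}


def validate_row(row: dict) -> None:
    """Raise ValueError if a row does not match the assumed contract."""
    missing = SCHEMA.keys() - row.keys()
    extra = row.keys() - SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected in SCHEMA.items():
        # Exact type check so that bool is not accepted as int.
        if type(row[column]) is not expected:
            raise ValueError(f"{column} must be {expected.__name__}")
    # Assumed offset rule: non-negative, and the span is not empty.
    if not 0 <= row["start"] < row["end"]:
        raise ValueError("offsets must satisfy 0 <= start < end")


good_row = {
    "anchor_caseId": "case-001",
    "cite_type": "us_reports",
    "normalized_cite": "347 U.S. 483",
    "start": 11,
    "end": 24,
}
validate_row(good_row)  # passes silently
```

Rejecting malformed rows at write time keeps downstream stages from having to re-check types or offsets.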
Coverage Stats
| Reporter | Coverage |
|---|---|
| U.S. Reports | 85.4% |
| Federal Reporters | 10.1% |