CAP Byte Index

The CAP Byte Index enables O(1) random access to 1.4 GB of JSONL data. Stage 3 reduces case extraction time from 500ms to 1ms by precomputing the exact byte offset and length of every authority.

Metric	Value	Detail
Latency Gain	500x	500ms -> 1ms
Complexity	O(1)	Constant Time
Overhead	624 KB	In-Memory Index

The Performance Problem

CAP text is stored in massive JSONL bundles. Finding a specific case with a standard linear scan is O(n): each lookup may read gigabytes of text, which makes construction of the 27k Research Packs prohibitively slow (10+ hours).

[ EXTRACTION TRACE ]

01. INDEX LOOKUP

Locate `cap_id` in Parquet Index

02. FILE SEEK

f.seek(523_847_621)

03. EXACT READ

f.read(12_847)  # bytes
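The three steps in the trace can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function name `extract_case`, the record fields `id` and `name`, and the use of a plain dict in place of the loaded Parquet index are all assumptions made for the example.

```python
import json
import os
import tempfile


def extract_case(jsonl_path, index, cap_id):
    """Return the parsed record for cap_id using an (offset, length) index.

    `index` is a hypothetical stand-in for the Parquet index: a dict that
    maps cap_id -> (byte_offset, byte_length).
    """
    offset, length = index[cap_id]        # 01. index lookup
    with open(jsonl_path, "rb") as f:     # binary mode: offsets count bytes
        f.seek(offset)                    # 02. file seek
        raw = f.read(length)              # 03. exact read
    return json.loads(raw)


# Tiny demonstration on a two-record JSONL file (made-up records).
records = [
    {"id": 1403610, "name": "Case A"},
    {"id": 1046253, "name": "Case B"},
]
demo_index = {}
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    for rec in records:
        line = json.dumps(rec).encode("utf-8")
        demo_index[rec["id"]] = (tmp.tell(), len(line))
        tmp.write(line + b"\n")
    demo_path = tmp.name

case = extract_case(demo_path, demo_index, 1046253)
os.remove(demo_path)
```

The file is opened in binary mode because the index stores byte positions; in text mode, multi-byte UTF-8 characters would make character counts and byte counts diverge.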

Index Mapping Example

CAP ID	Byte Offset	Length
1403610	523,847,621	12,847
1046253	892,104,502	9,112
1098412	1,102,543,892	15,403
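A mapping like the one above could be produced by a single pass over the JSONL bundle that records where each line starts and how long it is. The sketch below assumes one JSON record per line and an identifier stored under a field named `id`; both the field name and the function `build_byte_index` are hypothetical, and the source does not describe how the real index is built.

```python
import io
import json


def build_byte_index(stream, id_field="id"):
    """Scan a binary JSONL stream once and map record id -> (offset, length).

    Offsets and lengths are measured in bytes, excluding the line terminator.
    """
    index = {}
    offset = 0
    for line in stream:                       # binary lines keep their "\n"
        payload = line.rstrip(b"\r\n")
        if payload:                           # skip blank lines
            record_id = json.loads(payload)[id_field]
            index[record_id] = (offset, len(payload))
        offset += len(line)                   # advance by the full raw line
    return index


# In-memory sample; the second record contains a two-byte UTF-8 character.
sample = io.BytesIO(
    b'{"id": 1, "text": "a"}\n'
    b'{"id": 2, "text": "caf\xc3\xa9"}\n'
)
index = build_byte_index(sample)
```

The one-time O(n) scan is what pays for every later O(1) lookup: each entry needs only the identifier, the offset, and the length.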

Build Manifest Excerpt

{
  "cap_sha256": "a1b2c3...",
  "index_sha256": "g7h8i9...",
  "strategy": "constant_time"
}
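Recording both hashes matters because byte offsets are only valid for the exact file they were computed against. One way such a manifest might be checked before use is sketched below; the helper names `sha256_of_file` and `manifest_matches` are assumptions, and the demonstration files are throwaway stand-ins rather than real CAP data.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large bundles never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_matches(manifest, cap_path, index_path):
    """True only if both files still match the hashes in the manifest."""
    return (
        manifest["cap_sha256"] == sha256_of_file(cap_path)
        and manifest["index_sha256"] == sha256_of_file(index_path)
    )


# Demonstration with two small temporary files.
workdir = tempfile.mkdtemp()
cap_path = os.path.join(workdir, "cap.jsonl")
index_path = os.path.join(workdir, "index.bin")
with open(cap_path, "wb") as f:
    f.write(b'{"id": 1}\n')
with open(index_path, "wb") as f:
    f.write(b"placeholder index bytes")

manifest = {
    "cap_sha256": sha256_of_file(cap_path),
    "index_sha256": sha256_of_file(index_path),
    "strategy": "constant_time",
}
ok = manifest_matches(manifest, cap_path, index_path)

# Any change to the data file invalidates the index built against it.
with open(cap_path, "ab") as f:
    f.write(b'{"id": 2}\n')
tampered = manifest_matches(manifest, cap_path, index_path)

os.remove(cap_path)
os.remove(index_path)
os.rmdir(workdir)
```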

Build-Time Utility

Byte Indexing is exclusive to Stage 4A (Build-Time). Runners receive fully extracted Research Packs, requiring no runtime lookups.