Architecture
codexPDF is a contract-first facts engine for PDF documents.
Boundary
- Read-only extraction + render. Codex never produces new PDF bytes; scripts/produce_surface_audit.py enforces this on every CI run.
- No customer policy / rule adjudication. Codex emits detection signals; pass/fail belongs to Lint.
- No display / viewer presentation. PNG renders are byte-accurate source-of-truth for Lens to display, not a viewer in themselves.
- Consumer-agnostic output: same JSON contract regardless of caller.
Pipeline
1. Input PDF bytes are loaded by the extractor layer (PyMuPDF for the fast path, pikepdf for slower per-object inspection).
2. Domain extractors populate CodexDocument fields: pages, boxes, fonts, images, color spaces (with Separation tint transforms evaluated at t=1.0 so spot inks land on the right swatch), OCG / layers, annotations, transparency, trapping, form XObjects. Optional AI signal extractors (1.10.0+) populate detected_language, detected_logos, detected_symbols, detected_barcodes, and document_classification; see policies.md.
3. Output is serialized as JSON against the published schemas in schemas/v1/. Each section (document, color, geom) versions independently and reports its schema_version inline.
4. Render endpoints rasterize pages, separations, TAC heatmaps, and OCG-isolated layers via Ghostscript + PyMuPDF.
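To illustrate why evaluating a Separation tint transform at t=1.0 yields the solid-ink swatch, here is a minimal sketch of a PDF Type 2 (exponential interpolation) function, which maps a tint t to C0 + t^N (C1 - C0). This is not codex's implementation: the function name and the ink values are hypothetical, and real tint transforms may use other PDF function types.

```python
from typing import Sequence


def eval_exponential_tint(
    c0: Sequence[float], c1: Sequence[float], n: float, t: float
) -> list[float]:
    """Evaluate a PDF Type 2 function, C0 + t**N * (C1 - C0), per component."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("tint must lie in [0, 1]")
    return [a + (t ** n) * (b - a) for a, b in zip(c0, c1)]


# Hypothetical spot ink whose alternate color space is CMYK.
C0 = [0.0, 0.0, 0.0, 0.0]  # alternate-space color at tint 0 (no ink)
C1 = [0.0, 0.5, 1.0, 0.0]  # alternate-space color at tint 1 (solid ink)

solid = eval_exponential_tint(C0, C1, n=1.0, t=1.0)  # equals C1
half = eval_exponential_tint(C0, C1, n=1.0, t=0.5)   # halfway between C0 and C1
```

At t=1.0 the exponential term is 1, so the result is C1, the full-strength color of the ink in its alternate space.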
Sparse extraction (1.18.0+)
When the caller sends X-Codex-Fields, the pipeline runs in sparse
mode: only the extractors needed for the requested fields execute.
The PyMuPDF structure pass (step 1) always runs. Heavier pikepdf
passes (color world, OCGs, forms, content-stream signals) and the AI
signal lane are each skipped unless a requested field depends on them.
See docs/contract.md
and docs/unified-extraction.md
for the full field→extractor mapping and HTTP example.
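The sparse-mode rule above amounts to a dependency lookup: start from the pass that always runs, then add only the passes that the requested fields need. The sketch below assumes a comma-separated header value and uses a made-up mapping and made-up pass names; the real mapping is the one in the docs referenced above.

```python
from typing import Optional

# Hypothetical field -> extractor-pass mapping, for illustration only.
FIELD_DEPENDENCIES: dict[str, set[str]] = {
    "pages": set(),  # already covered by the structure pass
    "fonts": set(),
    "color_spaces": {"pikepdf_color"},
    "layers": {"pikepdf_ocg"},
    "detected_language": {"ai_signals"},
}

# The structure pass runs in every mode.
ALWAYS_RUN = {"pymupdf_structure"}


def plan_passes(header_value: Optional[str]) -> set[str]:
    """Return the extractor passes needed for an X-Codex-Fields value."""
    if not header_value:
        # No header: full extraction, so every known pass runs.
        return ALWAYS_RUN | set().union(*FIELD_DEPENDENCIES.values())
    fields = [f.strip() for f in header_value.split(",") if f.strip()]
    unknown = [f for f in fields if f not in FIELD_DEPENDENCIES]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    needed = set(ALWAYS_RUN)
    for field in fields:
        needed |= FIELD_DEPENDENCIES[field]
    return needed
```

With this mapping, `plan_passes("pages, fonts")` schedules only the structure pass, while adding `color_spaces` pulls in the hypothetical pikepdf color pass as well.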
Primary contract
- Runtime model: codex_pdf.models.v1.CodexDocument
- Document schema: schemas/v1/codex-document.schema.json
- Section versions: codex_pdf.color.COLOR_SCHEMA_VERSION, codex_pdf.geom.GEOM_SCHEMA_VERSION
- Live manifest: GET /v1/contract
Deployed surface
In production, codex runs as three services sharing one
content-addressed cache namespace
(codex:{VERSION}:{kind}:{pdf_sha}:{args_sha}), so a VERSION
bump invalidates every tier atomically. The full deployed map —
URLs, account / service IDs, and the version-bump checklist —
lives in CLAUDE.md.
- codex-pdf API (Railway): FastAPI under gunicorn + uvicorn workers. Bearer + internal token auth. Backed by Redis for cache and blob storage.
- codex-speculator (Railway sidecar): a Redis-Streams consumer. POST /v1/probe and the blob-store put both XADD a sha onto the codex:speculate stream; the speculator runs Phase 1 + Phase 2 ahead of the next request so /v1/extract lands warm. Idempotent: a cache-hit short-circuit collapses replays to a single Redis GET.
- codex-edge (Cloudflare Worker + KV): drop-in DNS-level replacement that captures every probe / extract SSE frame and replays from KV on the next hash-keyed request. Multipart uploads bypass to origin. ctx.waitUntil keeps the Worker alive long enough to persist every frame before the response stream closes.
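The shared cache namespace can be sketched as a key builder. Because VERSION is part of every key, a bump changes all keys at once, which is what makes the invalidation atomic across tiers. The helper name, the use of SHA-256, and the canonical-JSON encoding of the arguments are assumptions for illustration, not the documented implementation.

```python
import hashlib
import json


def cache_key(version: str, kind: str, pdf_bytes: bytes, args: dict) -> str:
    """Build a key shaped like codex:{VERSION}:{kind}:{pdf_sha}:{args_sha}."""
    pdf_sha = hashlib.sha256(pdf_bytes).hexdigest()
    # Canonical JSON so that argument order does not change the hash
    # (an assumption; the real args encoding is not specified here).
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    args_sha = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"codex:{version}:{kind}:{pdf_sha}:{args_sha}"


pdf = b"%PDF-1.7 example bytes"
k1 = cache_key("1.18.0", "extract", pdf, {"dpi": 150, "page": 1})
k2 = cache_key("1.18.0", "extract", pdf, {"page": 1, "dpi": 150})  # same key as k1
k3 = cache_key("1.18.1", "extract", pdf, {"dpi": 150, "page": 1})  # new version, new key
```

Identical inputs always produce the same key, so a replayed request resolves to one lookup, and no entry written under an older version can be hit after a bump.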
Optional retention layer
Codex 1.8+ adds an opt-in persistence branch on POST /v1/extract.
When the caller sends retain_for_training=true and the deploy is
wired to an S3-compatible bucket, the PDF + extract + a small
metadata object land under a hive-partitioned key
({prefix}/tenant=…/dt=…/sha256=…/). Default behaviour is
unchanged — bytes leave memory the moment the response ships. The
production deployment uses Cloudflare R2 with a 90-day bucket
lifecycle; see docs/deploy.md for the env contract
and CLAUDE.md for the live bucket layout.
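A hive-partitioned key of the shape shown above could be assembled as follows. This is a sketch under assumptions: the ISO date format for dt, the SHA-256 digest of the PDF bytes, and the three object names are all hypothetical, and the live bucket layout is documented separately.

```python
import hashlib
from datetime import date


def retention_prefix(prefix: str, tenant: str, day: date, pdf_bytes: bytes) -> str:
    """Build {prefix}/tenant=.../dt=.../sha256=.../ for one retained PDF."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    root = prefix.rstrip("/")
    return f"{root}/tenant={tenant}/dt={day.isoformat()}/sha256={digest}/"


base = retention_prefix("training", "acme", date(2024, 1, 15), b"%PDF-1.7 example")
# Hypothetical names for the PDF, the extract, and the metadata object.
objects = [base + name for name in ("source.pdf", "extract.json", "meta.json")]
```

Partitioning by tenant and date in the key lets downstream tooling prune by either dimension without listing the whole bucket.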
Consumer relationship
Downstream engines (lint-pdf, lens-pdf, marketing demos)
treat codex output as the source of truth for document facts and
keep any product-specific behaviour in adapter layers. New
products map to one owner per capability — see the “Service
boundary” and “Offshoot rule” sections of
CLAUDE.md.
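As a rough illustration of the adapter idea, the sketch below shows a consumer reading codex facts and applying its own policy, so the pass/fail decision stays outside codex. The rule, the LintFinding type, and the shape of the fonts entries are hypothetical; only the existence of a fonts field comes from the pipeline description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LintFinding:
    rule: str
    passed: bool
    detail: str


def max_fonts_rule(codex_doc: dict, max_fonts: int) -> LintFinding:
    """Apply a product-specific threshold to facts reported by codex."""
    count = len(codex_doc.get("fonts", []))
    return LintFinding(
        rule="max-fonts",
        passed=count <= max_fonts,
        detail=f"{count} fonts, limit {max_fonts}",
    )


# Hypothetical extract payload with two fonts.
doc = {"fonts": [{"name": "Helvetica"}, {"name": "Times-Roman"}]}
finding = max_fonts_rule(doc, max_fonts=1)
```

The same facts can feed a different threshold in another product without any change on the codex side.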