codex-pdf contract

codex-pdf is the read-only PDF facts + render service for the Think Neverland tooling family (lint-pdf, lens-pdf, the marketing demos, and the upcoming Forge producers). This document is the canonical pointer for every contract surface codex exposes, the versioning policy that governs each section, and the read-only invariants the producer-surface audit enforces.

Contract endpoints (HTTP)

The service mounts at the configured base URL (Railway service domain or custom apex). Auth modes are documented in docs/deploy.md; the table below uses the bearer mode for examples.

Endpoint	Section	Owner	Notes
`GET /healthz`, `GET /v1/healthz`	meta	render	unauthed liveness; carries `version` and `cache_backend`
`GET /v1/version`	meta	render	bare `{version}`
`GET /v1/contract`	meta	render	endpoint inventory + `section_schema_versions`
`GET /v1/schema/{name}`	document	extract	JSON schemas served from `schemas/v1/<name>.schema.json`
`POST /v1/extract`, `POST /extract`	document	extract	multipart PDF or JSON `{url, pdf_sha256}` → CodexDocument; supports sparse projection via `X-Codex-Fields` header (§ below)
`POST /v1/assets`	document	extract	Standard data request (1.20.0+): ingest bytes or `{sha256}`; cache hit → `complete` inline, miss → `processing` + background speculator pull. See data-requests.md
`GET /v1/assets/{pdf_hash}`	document	extract	poll an ingested asset → `complete` (+ document) / `processing` / `unknown`
`GET /v1/assets/{pdf_hash}/signals/{kind}`	document	extract	cached AI signal by hash (alias of `/v1/documents/{pdf_hash}/signals/{kind}`)
`POST /v1/probe`	document	extract	two-event SSE stream: `probe-min` (instant) + `probe-std` (after secondary parse)
`POST /v1/extract/stream`	document	extract	SSE stream of `phase-1` + `phase-2` extract events; `?granular=1` adds per-section progress
`POST /v1/render/page`	document	render	PNG raster
`POST /v1/render/separations`	document	render	tiffsep channel manifest
`POST /v1/render/heatmap`	document	render	TAC heatmap PNG + per-run header
`POST /v1/render/layer`	document	render	OCG-toggled layer raster
`POST /v1/sample/color`	document	render	per-pixel sRGB sample
`POST /v1/sample/density`	document	render	per-channel density sample
`POST /v1/walk/content-stream`	document	extract	content-stream signals JSON
`POST /v1/walk/type4`	document	eval	Type-4 PostScript evaluator
`POST /v1/color/resolve`	color	color	host → codex → pantone → curated → hash resolver
`POST /v1/color/match-pantone`	color	color	nearest-Pantone search via ΔE2000
`GET /v1/color/inkbook`	color	color	curated + Pantone catalogue manifest
`POST /v1/geom/tile`	geom	geom	imposition tile-grid layout
`POST /v1/geom/intersect`	geom	geom	polygon Boolean intersection
`POST /v1/geom/union`	geom	geom	polygon Boolean union
`POST /v1/geom/difference`	geom	geom	polygon Boolean difference
`POST /v1/geom/offset`	geom	geom	polygon inset / outset by signed distance
`POST /v1/color/neutral-density`	color	color	per-channel neutral density sample
`POST /v1/retention/delete`	retention	extract	erase persisted PDF + extract + meta for an `sha256` from R2 (only meaningful when retention is configured — `CLAUDE.md` deployed surface §4)
`GET /metrics`	meta	render	Prometheus metrics (when prometheus-client installed)

Schema sections + versioning

Each section under codex versions independently of the top-level codex-document schema. The contract endpoint exposes the per- section versions in a single map so SDK consumers can pin against exactly the surface they validate.

Section	Version constant	Current value	Bump policy
document (codex-document)	embedded in `/v1/contract.schema_version`	`1.3.0`	additive bumps remain `1.x`; breaking changes go to `2.0.0`
color	`codex_pdf.color.COLOR_SCHEMA_VERSION`	`1.0.0`	bump on any change to `/v1/color/*` request/response shapes
geom	`codex_pdf.geom.GEOM_SCHEMA_VERSION`	`1.0.0`	bump on any change to `/v1/geom/*` request/response shapes

Sample contract response:

{
  "contract_name": "codex-document",
  "schema_version": "1.3.0",
  "package_version": "1.15.0",
  "schema_id": "https://schemas.thinkneverland.com/codex-pdf/v1/codex-document.schema.json",
  "endpoints": [
    "POST /v1/extract",
    "POST /v1/probe",
    "POST /v1/assets",
    "GET /v1/assets/{pdf_hash}",
    "GET /v1/assets/{pdf_hash}/signals/{kind}",
    "GET /v1/documents/{pdf_hash}/signals/{kind}",
    "..."
  ],
  "section_schema_versions": {
    "color": "1.0.0",
    "geom": "1.0.0"
  },
  "ai_model_versions": {
    "language":       { "model": "claude-haiku-4-5",  "prompt": "lang-1",   "schema": "1.0.0" },
    "logos":          { "model": "claude-sonnet-4-6", "prompt": "logos-1",  "schema": "1.0.0" },
    "symbols":        { "model": "claude-sonnet-4-6", "prompt": "symbols-1","schema": "1.0.0" },
    "barcodes":       { "model": "pyzbar+pylibdmtx",  "prompt": "n/a",      "schema": "1.0.0" },
    "classification": { "model": "claude-haiku-4-5",  "prompt": "class-1",  "schema": "1.0.0" },
    "spell":          { "model": "claude-haiku-4-5",  "prompt": "spell-1",  "schema": "1.0.0" }
  }
}

ai_model_versions was added in 1.13.0 (AI Signal Phase 4). It mirrors codex_pdf.ai.versions.AI_MODEL_VERSIONS so SDK consumers can pin against the exact extractor that produced a signal. Bump the per-kind prompt constant whenever the system prompt changes so consumers can invalidate stale caches deliberately.

Every per-section response also carries schema_version inline (e.g. ColorResolveResponse.schema_version) so a consumer that hits the surface without first calling /v1/contract still has the information it needs to pick a validator.

Sparse field projection (1.18.0+)

POST /v1/extract accepts an optional X-Codex-Fields request header containing a comma-separated list of CodexDocument field names. When present the server runs only the extractors required for those fields and returns only those fields in the response, reducing both latency and payload size.

POST /v1/extract
Content-Type: multipart/form-data
X-Codex-Fields: detected_barcodes, color_spaces

Supported field names

Field name	Maps to	Notes
`pdf_version`, `is_encrypted`, `is_linearized`, `conformance`, `info`, `xmp`, `trapped_flag`, `trap_evidence`	fitz structure pass	Always fast; included for completeness
`pages`	fitz structure pass	Core page geometry; AI sub-fields filtered per request
`fonts`	fitz fonts pass
`images`	fitz images pass
`annotations`	fitz annotations pass
`output_intents`	pikepdf color world
`color_spaces` / `spot_colors` (alias)	pikepdf color world	`spot_colors` resolves to `color_spaces`
`icc_profiles`	pikepdf color world
`ocgs`	pikepdf OCG pass
`form_xobjects`	pikepdf forms pass
`analysis`	pikepdf content-stream signals	Expensive; skipped when not requested
`document_classification`	AI classification
`detected_language`	AI language signal	Per-page; fitz pages always included
`detected_barcodes`	AI barcodes signal	CPU-only (pyzbar + pylibdmtx); no Claude calls
`detected_logos`	AI logos signal
`detected_symbols`	AI symbols signal
`spell_candidates`	AI spell signal
`trap_zone_candidates`	AI trap-zones signal
`summary`	derived	Built from whatever was collected; no extra extractor

Response shape

The response is a filtered CodexDocument containing only the requested fields. The following metadata keys are always included regardless of the field filter:

schema_version, codex_version, document_id, source
pdf_sha256, extraction_warnings, stage_durations_ms
preflight_reports, conformance_verdicts, ai_status

Page objects (pages[*]) always contain core geometry keys; page sub-fields (detected_barcodes, etc.) are stripped unless explicitly requested.

Omitting X-Codex-Fields returns the full document — identical to pre-1.18.0 behaviour, no breaking change.

TypeScript client

// Full extract (unchanged behaviour)
const doc = await client.extract(pdfBuffer);

// Sparse — only barcode + colour data
const sparse = await client.extract(pdfBuffer, {
  fields: ["detected_barcodes", "color_spaces"],
});

Read-only invariants

Codex never produces new PDF bytes. The invariant is enforced by scripts/produce_surface_audit.py, which fails CI when:

pikepdf.new() is invoked anywhere.
Any Pdf.save(...) call appears outside the documented apply_ocg_overrides allowlist (a transient in-memory PDF fed straight to Ghostscript with the requested OCG override applied; bytes are never returned to a caller).
A producer package is imported (pypdf, pdfrw, reportlab, fpdf, fpdf2, pdfkit, borb).
A Ghostscript invocation passes a PDF-writer device (-sDEVICE=pdfwrite, pdfimage8, pdfimage24, pdfimage32).
mutool {clean,create,merge}, qpdf write modes, or cpdf is invoked.
A b"%PDF-" literal is concatenated into output (read-only sniffs like raw[:5] == b"%PDF-" are explicitly allowed).
pikepdf / pymupdf / fitz is imported outside the allowlist (codex_pdf.extract.*, codex_pdf.render.*, codex_pdf.preflight_ingest.adapters, codex_pdf.eval.ps_type4, codex_pdf.api.{main,url_ingest}, codex_pdf.parity, codex_pdf.cli).

The audit emits a JSON report (reports/audit/produce_surface.json) on every CI run alongside the parity gate so reviewers can track status changes commit-by-commit.

Forge expansion rule

Any future need to write PDF bytes goes into a separate Forge service (rewrite, marks, impose, trap), never into a consumer. Codex stays read-only; consumers stay byte-level-clean.