codex-pdf contract

codex-pdf is the read-only PDF facts + render service for the Think Neverland tooling family (lint-pdf, lens-pdf, the marketing demos, and the upcoming Forge producers). This document is the canonical pointer for every contract surface codex exposes, the versioning policy that governs each section, and the read-only invariants the producer-surface audit enforces.

The service mounts at the configured base URL (Railway service domain or custom apex). Auth modes are documented in docs/deploy.md; the table below uses the bearer mode for examples.

| Endpoint | Section | Owner | Notes |
| --- | --- | --- | --- |
| GET /healthz, GET /v1/healthz | meta | render | unauthed liveness; carries version and cache_backend |
| GET /v1/version | meta | render | bare {version} |
| GET /v1/contract | meta | render | endpoint inventory + section_schema_versions |
| GET /v1/schema/{name} | document | extract | JSON schemas served from schemas/v1/<name>.schema.json |
| POST /v1/extract, POST /extract | document | extract | multipart PDF or JSON {url, pdf_sha256} → CodexDocument; supports sparse projection via X-Codex-Fields header (§ below) |
| POST /v1/assets | document | extract | Standard data request (1.20.0+): ingest bytes or {sha256}; cache hit → complete inline, miss → processing + background speculator pull. See data-requests.md |
| GET /v1/assets/{pdf_hash} | document | extract | poll an ingested asset → complete (+ document) / processing / unknown |
| GET /v1/assets/{pdf_hash}/signals/{kind} | document | extract | cached AI signal by hash (alias of /v1/documents/{pdf_hash}/signals/{kind}) |
| POST /v1/probe | document | extract | two-event SSE stream: probe-min (instant) + probe-std (after secondary parse) |
| POST /v1/extract/stream | document | extract | SSE stream of phase-1 + phase-2 extract events; ?granular=1 adds per-section progress |
| POST /v1/render/page | document | render | PNG raster |
| POST /v1/render/separations | document | render | tiffsep channel manifest |
| POST /v1/render/heatmap | document | render | TAC heatmap PNG + per-run header |
| POST /v1/render/layer | document | render | OCG-toggled layer raster |
| POST /v1/sample/color | document | render | per-pixel sRGB sample |
| POST /v1/sample/density | document | render | per-channel density sample |
| POST /v1/walk/content-stream | document | extract | content-stream signals JSON |
| POST /v1/walk/type4 | document | eval | Type-4 PostScript evaluator |
| POST /v1/color/resolve | color | color | host → codex → pantone → curated → hash resolver |
| POST /v1/color/match-pantone | color | color | nearest-Pantone search via ΔE2000 |
| GET /v1/color/inkbook | color | color | curated + Pantone catalogue manifest |
| POST /v1/geom/tile | geom | geom | imposition tile-grid layout |
| POST /v1/geom/intersect | geom | geom | polygon Boolean intersection |
| POST /v1/geom/union | geom | geom | polygon Boolean union |
| POST /v1/geom/difference | geom | geom | polygon Boolean difference |
| POST /v1/geom/offset | geom | geom | polygon inset / outset by signed distance |
| POST /v1/color/neutral-density | color | color | per-channel neutral density sample |
| POST /v1/retention/delete | retention | extract | erase persisted PDF + extract + meta for an sha256 from R2 (only meaningful when retention is configured; CLAUDE.md deployed surface §4) |
| GET /metrics | meta | render | Prometheus metrics (when prometheus-client installed) |
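
The POST /v1/assets and GET /v1/assets/{pdf_hash} rows describe an ingest-then-poll flow with three states: complete, processing, and unknown. A minimal client-side polling sketch follows. It assumes the state arrives in a JSON key named "status", which this document does not confirm, and it takes the HTTP call as a caller-supplied function (fetch_status is a hypothetical name) so the loop is independent of any particular HTTP library.

```python
import time
from typing import Any, Callable, Dict


def wait_for_asset(
    pdf_hash: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    interval_s: float = 0.5,
    max_attempts: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll an ingested asset until it reports complete.

    fetch_status is expected to perform GET /v1/assets/{pdf_hash} and
    return the decoded JSON body. The "status" key is an assumed field
    name, not something the contract text confirms.
    """
    for _ in range(max_attempts):
        body = fetch_status(pdf_hash)
        status = body.get("status")
        if status == "complete":
            return body  # per the table, complete responses carry the document
        if status == "unknown":
            raise LookupError(f"asset {pdf_hash} is unknown to the service")
        if status != "processing":
            raise ValueError(f"unexpected asset status: {status!r}")
        sleep(interval_s)  # still processing: wait before the next poll
    raise TimeoutError(
        f"asset {pdf_hash} still processing after {max_attempts} polls"
    )
```

Injecting sleep keeps the loop easy to exercise in tests; a production caller would likely add backoff and a wall-clock deadline.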

Each codex section is versioned independently of the top-level codex-document schema. The contract endpoint exposes the per-section versions in a single map so SDK consumers can pin against exactly the surface they validate.

| Section | Version constant | Current value | Bump policy |
| --- | --- | --- | --- |
| document (codex-document) | embedded in /v1/contract.schema_version | 1.3.0 | additive bumps remain 1.x; breaking changes go to 2.0.0 |
| color | codex_pdf.color.COLOR_SCHEMA_VERSION | 1.0.0 | bump on any change to /v1/color/* request/response shapes |
| geom | codex_pdf.geom.GEOM_SCHEMA_VERSION | 1.0.0 | bump on any change to /v1/geom/* request/response shapes |
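
One way a consumer might apply the bump policy when pinning is sketched below. The rule used here is an interpretation rather than a documented algorithm: a served document schema is treated as compatible when it shares the major version the consumer validated against and is not older, while color and geom are pinned exactly because their policy bumps on any shape change. The function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def document_schema_compatible(validated: str, served: str) -> bool:
    """Assumed rule: same major, and served is at least the validated version.

    Additive 1.x bumps only add fields, so a newer 1.x server should still
    satisfy a consumer that validated against an older 1.x schema.
    """
    v = parse_version(validated)
    s = parse_version(served)
    return s[0] == v[0] and s[1:] >= v[1:]


def section_schema_pinned(validated: str, served: str) -> bool:
    """Assumed rule for color and geom: any version difference is a mismatch."""
    return parse_version(validated) == parse_version(served)
```

For example, a consumer validated against 1.3.0 would accept a server at 1.4.0 and reject one at 2.0.0 or 1.2.0.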

Sample contract response:

{
  "contract_name": "codex-document",
  "schema_version": "1.3.0",
  "package_version": "1.15.0",
  "schema_id": "https://schemas.thinkneverland.com/codex-pdf/v1/codex-document.schema.json",
  "endpoints": [
    "POST /v1/extract",
    "POST /v1/probe",
    "POST /v1/assets",
    "GET /v1/assets/{pdf_hash}",
    "GET /v1/assets/{pdf_hash}/signals/{kind}",
    "GET /v1/documents/{pdf_hash}/signals/{kind}",
    "..."
  ],
  "section_schema_versions": {
    "color": "1.0.0",
    "geom": "1.0.0"
  },
  "ai_model_versions": {
    "language": { "model": "claude-haiku-4-5", "prompt": "lang-1", "schema": "1.0.0" },
    "logos": { "model": "claude-sonnet-4-6", "prompt": "logos-1", "schema": "1.0.0" },
    "symbols": { "model": "claude-sonnet-4-6", "prompt": "symbols-1", "schema": "1.0.0" },
    "barcodes": { "model": "pyzbar+pylibdmtx", "prompt": "n/a", "schema": "1.0.0" },
    "classification": { "model": "claude-haiku-4-5", "prompt": "class-1", "schema": "1.0.0" },
    "spell": { "model": "claude-haiku-4-5", "prompt": "spell-1", "schema": "1.0.0" }
  }
}

ai_model_versions was added in 1.13.0 (AI Signal Phase 4). It mirrors codex_pdf.ai.versions.AI_MODEL_VERSIONS so SDK consumers can pin against the exact extractor that produced a signal. Bump the per-kind prompt constant whenever the system prompt changes so consumers can invalidate stale caches deliberately.
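
To illustrate why bumping the prompt constant lets consumers invalidate caches deliberately, here is a sketch of a consumer-side cache key that folds the per-kind model, prompt, and schema values into the key. The function name and key layout are assumptions for illustration; codex's own cache keys are not described in this document.

```python
import hashlib
from typing import Dict


def signal_cache_key(
    pdf_sha256: str,
    kind: str,
    ai_model_versions: Dict[str, Dict[str, str]],
) -> str:
    """Derive a cache key that changes when the extractor for `kind` changes.

    ai_model_versions is the map of the same name from GET /v1/contract.
    """
    entry = ai_model_versions[kind]
    raw = "|".join(
        [pdf_sha256, kind, entry["model"], entry["prompt"], entry["schema"]]
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

With this layout, moving the language prompt from lang-1 to a later value yields a different key for the same PDF hash, so a stale cached signal is simply never looked up again.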

Every per-section response also carries schema_version inline (e.g. ColorResolveResponse.schema_version) so a consumer that hits the surface without first calling /v1/contract still has the information it needs to pick a validator.
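
A consumer can use the inline schema_version to choose a validator without first calling /v1/contract. The sketch below assumes the consumer keeps its own registry of validators keyed by version string; the registry and the function name are hypothetical, not part of any codex SDK.

```python
from typing import Any, Callable, Dict

Validator = Callable[[Dict[str, Any]], None]


def pick_validator(
    response: Dict[str, Any],
    validators: Dict[str, Validator],
) -> Validator:
    """Return the validator registered for the response's inline schema_version."""
    version = response.get("schema_version")
    if not isinstance(version, str):
        raise ValueError("response carries no inline schema_version")
    if version not in validators:
        # Fail loudly rather than validating against the wrong shape.
        raise LookupError(f"no validator registered for schema_version {version}")
    return validators[version]
```

Raising on an unregistered version, rather than falling back to the newest validator, keeps a consumer from silently accepting a shape it never tested against.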

POST /v1/extract accepts an optional X-Codex-Fields request header containing a comma-separated list of CodexDocument field names. When the header is present, the server runs only the extractors required for those fields and returns only those fields in the response, reducing both latency and payload size.

POST /v1/extract
Content-Type: multipart/form-data
X-Codex-Fields: detected_barcodes, color_spaces

| Field name | Maps to | Notes |
| --- | --- | --- |
| pdf_version, is_encrypted, is_linearized, conformance, info, xmp, trapped_flag, trap_evidence | fitz structure pass | Always fast; included for completeness |
| pages | fitz structure pass | Core page geometry; AI sub-fields filtered per request |
| fonts | fitz fonts pass | |
| images | fitz images pass | |
| annotations | fitz annotations pass | |
| output_intents | pikepdf color world | |
| color_spaces / spot_colors (alias) | pikepdf color world | spot_colors resolves to color_spaces |
| icc_profiles | pikepdf color world | |
| ocgs | pikepdf OCG pass | |
| form_xobjects | pikepdf forms pass | |
| analysis | pikepdf content-stream signals | Expensive; skipped when not requested |
| document_classification | AI classification | |
| detected_language | AI language signal | Per-page; fitz pages always included |
| detected_barcodes | AI barcodes signal | CPU-only (pyzbar + pylibdmtx); no Claude calls |
| detected_logos | AI logos signal | |
| detected_symbols | AI symbols signal | |
| spell_candidates | AI spell signal | |
| trap_zone_candidates | AI trap-zones signal | |
| summary | derived | Built from whatever was collected; no extra extractor |
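
The field-to-extractor mapping can be read as a small planning step: parse the header, resolve the spot_colors alias, and collect the passes that need to run. The sketch below shows that idea using the table's own labels as pass names. It is not the server's implementation, and rejecting unknown field names with an error is an assumption, since this document does not say how the server treats them.

```python
from typing import Dict, List, Set

STRUCTURE_FIELDS = (
    "pdf_version", "is_encrypted", "is_linearized", "conformance",
    "info", "xmp", "trapped_flag", "trap_evidence", "pages",
)

FIELD_TO_PASS: Dict[str, str] = {
    **{name: "fitz structure pass" for name in STRUCTURE_FIELDS},
    "fonts": "fitz fonts pass",
    "images": "fitz images pass",
    "annotations": "fitz annotations pass",
    "output_intents": "pikepdf color world",
    "color_spaces": "pikepdf color world",
    "icc_profiles": "pikepdf color world",
    "ocgs": "pikepdf OCG pass",
    "form_xobjects": "pikepdf forms pass",
    "analysis": "pikepdf content-stream signals",
    "document_classification": "AI classification",
    "detected_language": "AI language signal",
    "detected_barcodes": "AI barcodes signal",
    "detected_logos": "AI logos signal",
    "detected_symbols": "AI symbols signal",
    "spell_candidates": "AI spell signal",
    "trap_zone_candidates": "AI trap-zones signal",
    "summary": "derived",
}

ALIASES = {"spot_colors": "color_spaces"}


def parse_fields_header(value: str) -> List[str]:
    """Split an X-Codex-Fields value into canonical, de-duplicated field names."""
    fields: List[str] = []
    for raw in value.split(","):
        name = raw.strip()
        if not name:
            continue
        name = ALIASES.get(name, name)
        if name not in FIELD_TO_PASS:
            # Assumption: unknown names are rejected rather than ignored.
            raise ValueError(f"unknown CodexDocument field: {name}")
        if name not in fields:
            fields.append(name)
    return fields


def passes_for(fields: List[str]) -> Set[str]:
    """Collect the extractor passes needed; 'derived' needs no extractor."""
    return {FIELD_TO_PASS[f] for f in fields} - {"derived"}
```

Under these assumptions, the header value "detected_barcodes, spot_colors" resolves to the fields detected_barcodes and color_spaces and requires only the AI barcodes signal and the pikepdf color world pass.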

The response is a filtered CodexDocument containing only the requested fields. The following metadata keys are always included regardless of the field filter:

  • schema_version, codex_version, document_id, source
  • pdf_sha256, extraction_warnings, stage_durations_ms
  • preflight_reports, conformance_verdicts, ai_status

Page objects (pages[*]) always contain core geometry keys; page sub-fields (detected_barcodes, etc.) are stripped unless explicitly requested.
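
The filtering rules above can be sketched as a projection over a plain dictionary. Several details here are assumptions made for illustration: the exact set of per-page sub-fields (only detected_barcodes is named in this document, and detected_language is described as per-page), the rule that pages appear when either pages or a page sub-field is requested, and the page keys used in the usage note.

```python
from typing import Any, Dict, Iterable

ALWAYS_INCLUDED = frozenset({
    "schema_version", "codex_version", "document_id", "source",
    "pdf_sha256", "extraction_warnings", "stage_durations_ms",
    "preflight_reports", "conformance_verdicts", "ai_status",
})

# Assumption: an illustrative subset of per-page sub-fields.
PAGE_SUBFIELDS = frozenset({"detected_barcodes", "detected_language"})


def project(document: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Keep always-included metadata plus the requested fields."""
    requested = set(fields)
    out = {
        key: value
        for key, value in document.items()
        if key != "pages" and (key in ALWAYS_INCLUDED or key in requested)
    }
    # Assumption: pages are returned when requested directly or when a
    # per-page sub-field is requested.
    keep_pages = "pages" in requested or bool(requested & PAGE_SUBFIELDS)
    if keep_pages and "pages" in document:
        out["pages"] = [
            {
                key: value
                for key, value in page.items()
                # Core geometry keys stay; sub-fields stay only if requested.
                if key not in PAGE_SUBFIELDS or key in requested
            }
            for page in document["pages"]
        ]
    return out
```

For a page shaped like {"width": 612, "height": 792, "detected_barcodes": [], "detected_language": "en"} (hypothetical keys), requesting detected_barcodes keeps width, height, and detected_barcodes and drops detected_language.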

Omitting X-Codex-Fields returns the full document, identical to pre-1.18.0 behaviour, so the header introduces no breaking change.

// Full extract (unchanged behaviour)
const doc = await client.extract(pdfBuffer);

// Sparse: only barcode + colour data
const sparse = await client.extract(pdfBuffer, {
  fields: ["detected_barcodes", "color_spaces"],
});

Codex never produces new PDF bytes. The invariant is enforced by scripts/produce_surface_audit.py, which fails CI when:

  • pikepdf.new() is invoked anywhere.
  • Any Pdf.save(...) call appears outside the documented apply_ocg_overrides allowlist (a transient in-memory PDF fed straight to Ghostscript with the requested OCG override applied; bytes are never returned to a caller).
  • A producer package is imported (pypdf, pdfrw, reportlab, fpdf, fpdf2, pdfkit, borb).
  • A Ghostscript invocation passes a PDF-writer device (-sDEVICE=pdfwrite, pdfimage8, pdfimage24, pdfimage32).
  • mutool {clean,create,merge}, qpdf write modes, or cpdf is invoked.
  • A b"%PDF-" literal is concatenated into output (read-only sniffs like raw[:5] == b"%PDF-" are explicitly allowed).
  • pikepdf / pymupdf / fitz is imported outside the allowlist (codex_pdf.extract.*, codex_pdf.render.*, codex_pdf.preflight_ingest.adapters, codex_pdf.eval.ps_type4, codex_pdf.api.{main,url_ingest}, codex_pdf.parity, codex_pdf.cli).
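
As an illustration of how one of these checks could work, the sketch below flags imports of the listed producer packages by walking a module's syntax tree with the standard-library ast module. It is not the code in scripts/produce_surface_audit.py, whose implementation this document does not show; the function name is hypothetical and the other rules (Pdf.save allowlist, Ghostscript devices, CLI tools) are out of scope here.

```python
import ast
from typing import List, Tuple

PRODUCER_PACKAGES = frozenset(
    {"pypdf", "pdfrw", "reportlab", "fpdf", "fpdf2", "pdfkit", "borb"}
)


def find_producer_imports(source: str) -> List[Tuple[int, str]]:
    """Return (line number, package) for each import of a producer package."""
    hits: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which cannot name a
            # third-party producer package.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "reportlab.pdfgen" -> "reportlab"
            if root in PRODUCER_PACKAGES:
                hits.append((node.lineno, root))
    return sorted(hits)
```

Parsing the syntax tree instead of searching text avoids false positives from comments and string literals that merely mention a package name.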

The audit emits a JSON report (reports/audit/produce_surface.json) on every CI run alongside the parity gate so reviewers can track status changes commit-by-commit.

Any future need to write PDF bytes goes into a separate Forge service (rewrite, marks, impose, trap), never into a consumer. Codex stays read-only; consumers stay byte-level-clean.