Unified Extraction — integration guide

This guide is for consumer services (preflight engines, viewers, batch import pipelines) wiring against the codex-pdf unified extraction API. It covers the surface, the cache-key contract, tenancy, rate-limiting, error shapes, and the per-stage telemetry that ships with every response.

Consumer-agnostic: nothing in this surface presumes a specific caller. The same endpoints serve lint-pdf, lens-pdf, compile-pdf, and any future consumer.

| Verb / Path | Purpose | Cache key |
| --- | --- | --- |
| POST /v1/assets | Standard data request (1.20.0+). Ingest bytes or {sha256}; cache hit returns the document inline, miss enqueues a background pull. See data-requests.md. | (tenant, pdf_hash) |
| GET /v1/assets/{pdf_hash} | Poll an ingested asset. | (tenant, pdf_hash) |
| GET /v1/assets/{pdf_hash}/signals/{kind} | Cached AI signal by hash (alias of the documents signals endpoint). | (tenant, pdf_hash, kind) |
| POST /v1/extract | First-stop. Returns the full CodexDocument. Add X-Codex-Fields header for sparse projection (1.18.0+). | (tenant, pdf_hash); full only, sparse bypasses cache |
| GET /v1/documents/{pdf_hash}/text-regions?page_index=N&dpi=N | Second-stop. One page’s detected regions, in PDF user-space points. | (tenant, pdf_hash, page_index, dpi) |
| POST /v1/documents/{document_id}/conformance/{profile} | Compute (or fetch from cache) a conformance verdict. | (tenant, pdf_hash, profile) |
| GET /v1/documents/{pdf_hash}/renders | List (page_index, dpi, color_space) tuples already in the render cache. | n/a (it’s the index) |

The first-stop / second-stop split is intentional. /v1/extract returns everything codex knows; consumers cherry-pick. Per-resource endpoints let consumers that already have the codex doc ask for exactly the slice they need without an extract-then-discard round trip.

New consumers should start with /v1/assets (requestAsset in both clients) rather than calling /v1/extract directly: it adds the cache-hit-inline / miss-pulls-in-background contract on top of the same idempotent extract. /v1/extract remains the lower-level primitive. The canonical pattern, response shapes, and the viewer backfill flow are documented in data-requests.md.

Cache keys are part of the public contract — stable across releases:

  • text-regions: (pdf_hash, page_index, dpi)
  • conformance: (pdf_hash, profile)
  • render: (pdf_hash, page_index, dpi, color_space)

The codex implementation also scopes by tenant (see below) but the tenant component is transparent to most consumers and isn’t part of the contract the caller cares about — it’s a server-side isolation knob.

Every request can carry an X-Codex-Tenant header. The server:

  1. Normalises the value ([a-z0-9][a-z0-9-]{0,62}; falls back to "default" for missing/invalid).
  2. Scopes the cache lookup, the blob store, and the renders index by tenant.
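
The normalisation step can be sketched as follows. This is a minimal illustration, assuming the server applies the pattern as a full match and does no trimming or case folding first (the guide does not say either way); `normalise_tenant` and `scoped_key` are hypothetical names, not part of the codex-pdf API.

```python
import re
from typing import Optional, Tuple

# Pattern quoted in the guide; everything else in this sketch is an assumption.
_TENANT_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")
DEFAULT_TENANT = "default"


def normalise_tenant(raw: Optional[str]) -> str:
    """Return the header value if it matches the pattern, else "default".

    Assumption: no trimming or lowercasing happens before matching.
    """
    if raw is not None and _TENANT_RE.fullmatch(raw):
        return raw
    return DEFAULT_TENANT


def scoped_key(tenant_header: Optional[str], pdf_hash: str) -> Tuple[str, str]:
    """Hypothetical helper: the (tenant, pdf_hash) pair a lookup is scoped by."""
    return (normalise_tenant(tenant_header), pdf_hash)
```

Under these assumptions a missing header, an empty string, or a value such as `Acme` all land in the `default` tenant, while the same hash under two valid tenants yields two distinct keys.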

A hash uploaded by Tenant A is invisible to Tenant B even if B learns the hash. The 412 message on a blob miss is intentionally identical for “wrong tenant” and “expired” — probing isn’t informative.

# Python client
from codex_pdf.client import HttpClient

client = HttpClient(
    base_url="https://codex.example.com",
    bearer_token="",
    tenant="acme-corp",  # surfaces as X-Codex-Tenant on every request
)

// TypeScript client
import { HttpClient } from "@printwithsynergy/codex-client";

const client = new HttpClient({
  baseUrl: "https://codex.example.com",
  bearerToken: "",
  tenant: "acme-corp", // surfaces as X-Codex-Tenant on every request
});

Both clients also read the tenant from the CODEX_TENANT environment variable when the option is omitted.

Compute-and-cache POSTs (/v1/extract, render, sample, walk, conformance) consult an in-process token bucket per (tenant, endpoint). Bucket exhausted → 429 Too Many Requests with a Retry-After header in seconds.

Both bundled clients honour Retry-After and back off automatically; consumers using raw HTTP should do the same.
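
For raw-HTTP consumers, honouring Retry-After might look like the sketch below. It is transport-agnostic: `send` is a hypothetical callable that performs one request and returns `(status, headers, body)`, and the attempt cap and fallback wait are assumptions chosen for illustration, not values taken from the bundled clients.

```python
import time
from typing import Callable, Mapping, Tuple

# A response is modelled as (status_code, headers, body) so the sketch does
# not depend on any particular HTTP library.
Response = Tuple[int, Mapping[str, str], bytes]


def send_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    fallback_wait_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-429 status or attempts run out."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 429:
            return response
        try:
            # The guide states Retry-After is expressed in seconds.
            # Assumes the mapping is keyed by the canonical header name.
            wait_s = float(headers.get("Retry-After", ""))
        except ValueError:
            wait_s = fallback_wait_s  # header missing or not numeric
        sleep(max(wait_s, 0.0))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the loop testable without real delays; the last response is returned unchanged when every attempt is rate-limited, so the caller still sees the 429.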

Operator knobs (env, codex-pdf service):

| Variable | Default | Purpose |
| --- | --- | --- |
| CODEX_RATE_LIMIT_RPM | 120 | Refills per minute |
| CODEX_RATE_LIMIT_BURST | 30 | Bucket size |
| CODEX_RATE_LIMIT_DISABLED | false | Off-switch |

The limiter is in-process and per-replica, so buckets are not shared between replicas. A fleet of N replicas therefore allows an effective limit of up to N × rpm per (tenant, endpoint).
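
To make the knobs concrete, here is a minimal token-bucket sketch using the default values (120 refills per minute, bucket size 30). `TokenBucket`, `RateLimiter`, the injectable clock, and the way the wait is rounded up to whole seconds are all illustrative assumptions; the actual codex-pdf limiter may differ.

```python
import math
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBucket:
    """Bucket of `burst` tokens refilled at `rpm` tokens per minute."""

    def __init__(self, rpm: float = 120, burst: int = 30,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rpm / 60.0
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start full
        self.updated = clock()

    def try_acquire(self) -> Optional[int]:
        """Return None if admitted, else whole seconds to wait."""
        now = self.clock()
        refill = (now - self.updated) * self.rate_per_s
        self.tokens = min(float(self.burst), self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return None
        # Assumption: the wait is rounded up to at least one second.
        return max(1, math.ceil((1.0 - self.tokens) / self.rate_per_s))


class RateLimiter:
    """One lazily created bucket per (tenant, endpoint)."""

    def __init__(self, rpm: float = 120, burst: int = 30,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rpm, self.burst, self.clock = rpm, burst, clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def check(self, tenant: str, endpoint: str) -> Optional[int]:
        key = (tenant, endpoint)
        if key not in self._buckets:
            self._buckets[key] = TokenBucket(self.rpm, self.burst, self.clock)
        return self._buckets[key].try_acquire()
```

With the defaults, a tenant can issue 30 requests to one endpoint back to back, after which requests are admitted at two per second. Because each replica would hold its own `RateLimiter`, the per-replica caveat above follows directly.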

Every 4xx/5xx response uses the shared envelope:

{ "detail": "human-readable message" }

The new endpoints document their per-status shapes in OpenAPI under responses=:

  • 400 Bad Request — invalid pdf_hash, page_index, dpi, or unknown conformance profile.
  • 404 Not Found — no PDF cached for (tenant, document_id). Upload via /v1/extract first.
  • 429 Too Many Requests — rate limit exceeded. Retry-After header carries the wait in seconds.
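
A consumer on raw HTTP could turn the shared envelope into a typed error along these lines. `CodexApiError` and `error_from_response` are hypothetical names introduced for this sketch; only the `detail` field and the Retry-After header come from the guide.

```python
import json
from typing import Mapping, Optional


class CodexApiError(Exception):
    """Hypothetical exception type carrying the shared error envelope."""

    def __init__(self, status: int, detail: str,
                 retry_after_s: Optional[float] = None) -> None:
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail
        self.retry_after_s = retry_after_s


def error_from_response(status: int, headers: Mapping[str, str],
                        body_text: str) -> CodexApiError:
    """Build an error from a 4xx/5xx response; tolerate non-envelope bodies."""
    try:
        detail = str(json.loads(body_text)["detail"])
    except (ValueError, KeyError, TypeError):
        detail = body_text  # not the envelope, e.g. a proxy error page
    retry_after_s = None
    if status == 429:
        # Header names are case-insensitive, so compare in lower case.
        lowered = {k.lower(): v for k, v in headers.items()}
        try:
            retry_after_s = float(lowered.get("retry-after", ""))
        except ValueError:
            retry_after_s = None
    return CodexApiError(status, detail, retry_after_s)
```

Falling back to the raw body keeps the error useful when an intermediary (load balancer, gateway) answers instead of codex-pdf.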

Every response carries per-stage wall-clock timing in two places:

  1. Response envelope: stage_durations_ms: { stage: int_ms }.
  2. Response header: X-Codex-Stage-Durations-Ms (same dict serialised as JSON).

The header is there for transports that strip envelope bodies (in-process clients, mocks). Both clients back-fill the envelope from the header when present.
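
The back-fill behaviour can be approximated as below. This is a sketch of the idea rather than the bundled clients' code: `backfill_stage_durations` is a hypothetical name, and ignoring a malformed header instead of raising is an assumption.

```python
import json
from typing import Any, Dict, Mapping

STAGE_HEADER = "x-codex-stage-durations-ms"


def backfill_stage_durations(body: Dict[str, Any],
                             headers: Mapping[str, str]) -> Dict[str, Any]:
    """Return body with stage_durations_ms filled from the header if absent."""
    if "stage_durations_ms" in body:
        return body  # the envelope value wins when both are present
    # Header names are case-insensitive, so compare in lower case.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get(STAGE_HEADER)
    if raw is None:
        return body
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return body  # assumption: a malformed header is ignored, not fatal
    if not isinstance(parsed, dict):
        return body
    # Unknown stage names are copied through untouched.
    return {**body, "stage_durations_ms": parsed}
```

The function never filters stage names, so a stage added by a newer server simply appears as an extra key.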

Initial stage names:

  • extract — full CodexDocument parse.
  • render — page render.
  • text_regions — detected text regions per page.
  • conformance — verdict compute for one profile.

New stage names are non-breaking: consumers must treat unknown keys as opaque.

Prometheus metrics on the codex-pdf service (/metrics):

| Metric | Type | Labels |
| --- | --- | --- |
| codex_api_requests_total | Counter | endpoint, status |
| codex_api_request_seconds | Histogram | endpoint |
| codex_api_cache_lookups_total | Counter | endpoint, outcome (hit/miss) |
| codex_api_stage_seconds | Histogram | stage |

The stage histogram observes the same numbers consumers see in stage_durations_ms. Cache hit rate per endpoint = rate(codex_api_cache_lookups_total{outcome="hit"}[5m]) / rate(codex_api_cache_lookups_total[5m]).

| Profile | Notes |
| --- | --- |
| pdfx4 | OutputIntent + Trapped + PDF ≥1.4 + XMP pdfxid |
| pdfx1a | OutputIntent + Trapped + PDF=1.3 |
| pdfx3 | OutputIntent + Trapped + PDF ≥1.3 |
| pdfa1b / pdfa2b / pdfa3b | XMP present + not encrypted + correct pdfaid:part |
| pdfua1 | XMP present + pdfuaid + non-empty Title |

The profile enum is forward-compatible. Consumers must treat unknown profile strings (e.g. a future pdfx6, pdfa4) as opaque so an older client doesn’t break against a newer server.

Clause coverage is the minimum-viable set in the rc.x series. Full ISO coverage lands in later phases; the framework is registry-driven, so new clauses are additive only.

Codex 1.11.0 lit up the AI Signal contract frozen in 1.10.0; subsequent 1.x releases iterated on it:

| Release | Change |
| --- | --- |
| 1.11.0 | Six extractors wired behind CODEX_AI_ENABLED. |
| 1.12.0 | codex-vision-sidecar (CODEX_VISION_URL): optional CPU CV lane. |
| 1.13.0 | ai_model_versions on /v1/contract + codex_ai_signal_calls_total Prometheus metric. |
| 1.14.0 | Per-tenant entitlements (CODEX_AI_TENANTS_ALLOWLIST / DENYLIST) + ai_tenant_excluded warning. |
| 1.15.0 | Dieline-candidate / dieline-size reconciliation: bbox-based geometry detection now synthesises a candidate so dieline.count agrees with dieline.size. |

The extracted CodexDocument carries six AI signal surfaces:

| Field | Scope | Backend | Purpose |
| --- | --- | --- | --- |
| detected_language | per page | Claude Haiku (text) | BCP-47 tag + confidence. |
| detected_logos | per page | Claude Sonnet (vision) | Brand identity + bbox in PDF user-space points. |
| detected_symbols | per page | Claude Sonnet (vision) | Regulatory / safety / sustainability symbols (GHS, recycling, FDA, CE, ™, ©, etc.). |
| detected_barcodes | per page | pyzbar + pylibdmtx (CPU) | Decoded value + format + bbox. No Claude cost. |
| spell_candidates | per page | Claude Haiku (text) | Suspect words for lint-pdf’s tenant spell rule. |
| document_classification | document | Claude Haiku (text) | Probability map ({"label": 0.7, "folding_carton": 0.2}). |

The dedicated endpoint GET /v1/documents/{pdf_hash}/signals/{kind} returns the same shapes scoped to one signal kind, so consumers can re-fetch a single signal without re-running the full extract. Pass ?page_index=N for page-scoped kinds (language, logos, symbols, barcodes, spell); classification is document-scoped so the parameter is ignored.
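
Building the request URL for a single signal might look like this. The sketch assumes the kind names listed above (language, logos, symbols, barcodes, spell, classification) are the literal `{kind}` path segments, which the guide implies but does not spell out; `signal_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: these are the literal {kind} path segments.
PAGE_SCOPED_KINDS = {"language", "logos", "symbols", "barcodes", "spell"}


def signal_url(base_url: str, pdf_hash: str, kind: str,
               page_index: Optional[int] = None) -> str:
    """Build the signals URL, adding page_index only for page-scoped kinds."""
    path = "{}/v1/documents/{}/signals/{}".format(
        base_url.rstrip("/"), quote(pdf_hash, safe=""), quote(kind, safe="")
    )
    if kind in PAGE_SCOPED_KINDS and page_index is not None:
        return path + "?" + urlencode({"page_index": page_index})
    return path
```

Dropping `page_index` for document-scoped kinds is optional, since the server ignores it there; doing so just keeps URLs canonical for client-side caching.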

Codex emits a structured CodexWarning on every /v1/extract response describing the AI lane’s state:

| Warning code | When |
| --- | --- |
| ai_disabled | Operator gate (CODEX_AI_ENABLED) is off. |
| ai_skipped | Caller sent X-Codex-Skip-AI: true. |
| ai_tenant_excluded | Operator opted in but the requesting tenant is gated out by CODEX_AI_TENANTS_ALLOWLIST / DENYLIST (1.14.0+). |
| ai_missing_credentials | Operator opted in but anthropic SDK isn’t importable or ANTHROPIC_API_KEY is unset. |
| ai_tier | Advisory: AI ran. message carries cpu+claude or gpu plus the realised dollar spend. |
| ai_budget_exceeded | Per-request cost cap (CODEX_AI_COST_CAP_USD_PER_REQUEST, default $0.10) was hit mid-extract. |

See policies.md for the full warning catalogue, cache-key contract, and the two-backend (CPU + Claude default vs optional GPU) policy.
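
A consumer that needs a single answer to "did the AI lane run?" could fold the warning codes as sketched below. The warning shape (a list of dicts with a `code` key) and the four state names are assumptions made for this example; treating `ai_budget_exceeded` as a partial result is an interpretation of "hit mid-extract", not documented behaviour.

```python
from typing import Any, Iterable, Mapping

# Codes from the table above that mean the AI lane did not run at all.
AI_NOT_RUN_CODES = {
    "ai_disabled",
    "ai_skipped",
    "ai_tenant_excluded",
    "ai_missing_credentials",
}


def ai_lane_state(warnings: Iterable[Mapping[str, Any]]) -> str:
    """Fold warning codes into "not_run", "partial", "ran" or "unknown".

    Assumption: each warning is a mapping with a "code" key.
    """
    codes = {w.get("code") for w in warnings}
    if codes & AI_NOT_RUN_CODES:
        return "not_run"
    if "ai_budget_exceeded" in codes:
        return "partial"  # interpretation: some signals may be missing
    if "ai_tier" in codes:
        return "ran"
    return "unknown"  # unrecognised or absent codes are left opaque
```

Returning "unknown" rather than raising keeps an older consumer working if a later release adds warning codes.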

Pass X-Codex-Fields: <comma-separated fields> on POST /v1/extract to run only the extractors needed for the requested fields and receive only those fields in the response. Both latency and payload size shrink proportionally to the number of extractors skipped.

The fitz structure pass (page count, boxes, fonts summary) always runs; only the heavier pikepdf passes and the AI signal lane are gated.

| Requested field | Extractors skipped when absent |
| --- | --- |
| color_spaces / spot_colors | pikepdf colour-world pass |
| detected_barcodes | pyzbar + pylibdmtx AI lane |
| detected_language | Claude Haiku language AI lane |
| detected_logos | Claude Sonnet vision AI lane |
| detected_symbols | Claude Sonnet vision AI lane |
| document_classification | Claude Haiku classification AI lane |
| spell_candidates | Claude Haiku spell AI lane |
| ocgs | pikepdf OCG pass |
| form_xobjects | pikepdf forms pass |
| analysis | pikepdf content-stream signals pass |
| fonts | PyMuPDF fonts sub-pass |
| images | PyMuPDF images sub-pass |
| annotations | PyMuPDF annotations sub-pass |

Omitting X-Codex-Fields returns the full document (unchanged default behaviour — no breaking change).

Sparse responses are not cached — field sets vary too much for content-addressed cache keys to be useful. Full-extract responses remain cached as before.

POST /v1/extract HTTP/1.1
Content-Type: application/pdf
Authorization: Bearer <token>
X-Codex-Fields: detected_barcodes, color_spaces

<pdf bytes>

// TypeScript client (1.17.0+)
import { HttpClient } from "@printwithsynergy/codex-client";

const client = new HttpClient({ baseUrl, bearerToken });
const doc = await client.extract(pdfBytes, {
  fields: ["detected_barcodes", "color_spaces"],
});
// doc contains only color_spaces + detected_barcodes + core metadata

# Python, raw header
import httpx

r = httpx.post(
    "https://codex.example.com/v1/extract",
    content=pdf_bytes,
    headers={
        "Content-Type": "application/pdf",
        "Authorization": f"Bearer {token}",
        "X-Codex-Fields": "detected_barcodes,color_spaces",
    },
)
doc = r.json()

# Python client
from codex_pdf.client import HttpClient

client = HttpClient(
    base_url="https://codex.example.com",
    bearer_token="",
    tenant="acme-corp",
)

# First stop: full payload, includes detected text regions per page.
doc = client.extract(pdf_bytes)
sha = doc["pdf_sha256"]

# Second-stop re-fetch: one page only, cache-hit on second call.
regions_page_0 = client.text_regions(sha, page_index=0, dpi=150)
print(len(regions_page_0["regions"]))

# Compute a verdict; cached on the server.
verdict = client.conformance(sha, "pdfx4")
print(verdict["passed"], verdict["clauses"])

# What renders already exist in the cache?
print(client.list_renders(sha)["renders"])

// TypeScript client
import { HttpClient } from "@printwithsynergy/codex-client";

const client = new HttpClient({
  baseUrl: "https://codex.example.com",
  bearerToken: "",
  tenant: "acme-corp",
});

const doc = await client.extract(pdfBytes);
const sha = doc.pdf_sha256;
const regions = await client.getTextRegions(sha, { pageIndex: 0, dpi: 150 });
const verdict = await client.computeConformance(sha, "pdfx4");
const renders = await client.listRenders(sha);

Schema version (the codex-document contract) and package version move on different cadences:

  • Schema version (schema_version in the payload) — only bumped when the CodexDocument contract changes.
  • Package version (pyproject.toml / package.json) — bumped on every release. Pre-release tags (rcN) signal in-flight phases.

The cache-key version segment ({VERSION} in codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha}) tracks the package version so a deploy that bumps either dimension invalidates the cache atomically.
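
To illustrate why a version bump invalidates every entry, the sketch below assembles a key in the documented layout. How `args_sha` is derived is not stated in this guide, so hashing the canonical JSON of the arguments with SHA-256 is purely an assumption, as is the `cache_key` helper itself.

```python
import hashlib
import json
from typing import Any, Mapping


def cache_key(version: str, kind: str, tenant: str, pdf_sha: str,
              args: Mapping[str, Any]) -> str:
    """Assemble codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha}.

    Assumption: args_sha is the SHA-256 of the arguments serialised as
    canonical JSON (sorted keys, no whitespace).
    """
    canonical = json.dumps(dict(args), sort_keys=True, separators=(",", ":"))
    args_sha = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"codex:{version}:{kind}:{tenant}:{pdf_sha}:{args_sha}"
```

Because the version is the second segment, keys written by one release never collide with keys read by the next, so old entries are simply never looked up again.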