Data requests — the standard "ask codex for data" pattern
This is the canonical way every Print-with-Synergy service asks codex for a PDF’s facts. One contract, one client call, one cache.
TL;DR: call `requestAsset(pdf)`. If codex has the document cached, you get it back inline in one round-trip. If it doesn’t, codex pulls it in the background and the same call polls until the data lands. Repeat requests for the same PDF are served from cache; codex does not re-extract a document it already holds.
Before this contract, consumers diverged: some raw-fetched
`/v1/extract` synchronously, some hand-rolled an ingest+poll client
against an endpoint that didn’t exist, and some double-cached results in
their own Redis on top of codex’s cache. `/v1/assets` collapses all of
that into one pattern built on the cache and speculator codex already
runs.
The contract
| Verb / Path | Purpose |
|---|---|
| `POST /v1/assets` | Ingest. Body is either a multipart `pdf` field or JSON `{ "sha256": "…" }` referencing a PDF codex has already seen. |
| `GET /v1/assets/{pdf_hash}` | Poll. Serves the cached document when ready. |
| `GET /v1/assets/{pdf_hash}/signals/{kind}` | Cached AI signal (language, logos, symbols, barcodes, spell, classification) by hash. Thin alias over `GET /v1/documents/{pdf_hash}/signals/{kind}`. |
`pdf_hash` is the lowercase hex SHA-256 of the raw PDF bytes, the same
content address codex keys its cache on. It is also the `asset_id`.
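Because `pdf_hash` is plain content addressing, a caller can compute it locally before talking to codex. A minimal sketch using only the Python standard library; the helper name `pdf_hash` and the sample bytes are illustrative, not part of the codex client:

```python
import hashlib


def pdf_hash(pdf_bytes: bytes) -> str:
    """Return the lowercase hex SHA-256 of the raw PDF bytes."""
    # hexdigest() already yields lowercase hex, which matches the contract.
    return hashlib.sha256(pdf_bytes).hexdigest()


# Placeholder bytes for illustration only; not a valid PDF.
sample = b"%PDF-1.7\n% placeholder bytes\n"
digest = pdf_hash(sample)
print(digest)
```

The same bytes always produce the same 64-character digest, which is why a by-hash request can stand in for a re-upload.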
POST /v1/assets
Resolves the PDF bytes (multipart upload, or `{sha256}` looked up in the
blob store), computes the sha, then:

- Cache hit → `200`

  ```json
  {
    "asset_id": "<sha256>",
    "pdf_hash": "<sha256>",
    "status": "complete",
    "document": { "pdf_sha256": "<sha256>", "...": "full CodexDocument" },
    "summary": { "...": "codex summary or null" }
  }
  ```

  This is the sync-fast path. A warm cache answers in one round-trip, satisfying callers that expect a `<3s` synchronous extract.

- Miss → `202`

  ```json
  { "asset_id": "<sha256>", "pdf_hash": "<sha256>", "status": "processing" }
  ```

  The sha is handed to the speculator (`codex:speculate` Redis stream) for a background pull. The caller polls `GET /v1/assets/{pdf_hash}`.

- `{sha256}` for a hash codex has never seen → `404`

  ```json
  {
    "asset_id": "<sha256>",
    "pdf_hash": "<sha256>",
    "status": "unknown",
    "error": "No PDF cached for this sha256. Upload the bytes via multipart 'pdf' once to seed it."
  }
  ```

  Upload the bytes once (multipart) to seed the blob store; subsequent by-sha requests then resolve.
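The three ingest outcomes above map to three caller actions. The sketch below shows one way a hand-written caller might branch on them. The function name `next_step` and its return labels are hypothetical; the bundled `requestAsset` helper presumably does the equivalent internally, but its implementation is not shown here.

```python
from typing import Any, Mapping


def next_step(status_code: int, body: Mapping[str, Any]) -> str:
    """Map a POST /v1/assets response to what the caller should do next.

    The returned labels are illustrative names, not codex API values.
    """
    status = body.get("status")
    if status_code == 200 and status == "complete":
        return "use_document"   # cache hit: the document is inline
    if status_code == 202 and status == "processing":
        return "poll"           # miss: poll GET /v1/assets/{pdf_hash}
    if status_code == 404 and status == "unknown":
        return "upload_bytes"   # by-sha request for a PDF codex never saw
    raise ValueError(f"unexpected ingest response: {status_code} {status!r}")


print(next_step(202, {"asset_id": "abc", "pdf_hash": "abc", "status": "processing"}))
```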
GET /v1/assets/{pdf_hash}
- cached → `200 { status: "complete", document }`.
- blob present but not yet extracted (speculator in flight) → the poll extracts on demand from the blob store and returns `complete`, so a poll always makes forward progress even if the speculator stalls.
- unknown sha → `404 { status: "unknown" }`.
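A polling loop over this endpoint can be sketched as follows. This is an assumption-laden illustration, not the bundled client's code: `fetch` is a hypothetical callable that performs `GET /v1/assets/{pdf_hash}` and returns `(status_code, json_body)`, and any response other than `200 complete` or `404` is treated as "not ready yet". The parameter names mirror the Python client's `poll_seconds` and `timeout_seconds`.

```python
import time
from typing import Any, Callable, Dict, Tuple

Response = Tuple[int, Dict[str, Any]]


def poll_asset(
    fetch: Callable[[str], Response],
    pdf_hash: str,
    poll_seconds: float = 1.5,
    timeout_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until codex reports the document complete, or give up."""
    deadline = clock() + timeout_seconds
    while True:
        code, body = fetch(pdf_hash)
        if code == 200 and body.get("status") == "complete":
            return body
        if code == 404:
            # Unknown sha: polling cannot help, the bytes must be uploaded.
            raise LookupError(f"codex has never seen {pdf_hash}")
        if clock() >= deadline:
            raise TimeoutError(f"{pdf_hash} not ready after {timeout_seconds}s")
        sleep(poll_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.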
GET /v1/assets/{pdf_hash}/signals/{kind}
- ingested → delegates to the dedicated `run_signal` path (per-kind cache + AI gating) and returns `{ status: "complete", kind, ai_status, signal }`.
- never ingested → `404 { status: "processing" }`. Drive `POST /v1/assets` (or `GET /v1/assets/{pdf_hash}`) first.
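A consumer reading a signal has to tell apart three situations: the signal is usable, the asset was never ingested, or AI is gated off (`ai_status` of `"disabled"` or `"skipped"`, with an empty `signal`). A minimal sketch of that triage; the function name `read_signal` and the state labels are hypothetical, and treating any other `ai_status` value as usable is an assumption:

```python
from typing import Any, Mapping, Optional, Tuple


def read_signal(status_code: int, body: Mapping[str, Any]) -> Tuple[str, Optional[Any]]:
    """Classify a signals response as (state, signal).

    States are illustrative labels: "not_ingested", "ai_off", "ready".
    """
    if status_code == 404:
        # Asset never ingested: drive POST /v1/assets first.
        return "not_ingested", None
    if body.get("ai_status") in ("disabled", "skipped"):
        # The call succeeded, but AI is gated off, so there is no signal.
        return "ai_off", None
    return "ready", body.get("signal")


state, signal = read_signal(
    200,
    {"status": "complete", "kind": "language", "ai_status": "disabled", "signal": {}},
)
print(state)
```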
AI signals are opt-in (operator sets `CODEX_AI_ENABLED=true`). When AI
is gated off, `signal` is empty and `ai_status` is `"disabled"` /
`"skipped"`; the call still succeeds.
The one client call: requestAsset
Both bundled clients ship a single helper that is the pattern. Use it.

```ts
// TypeScript: @printwithsynergy/codex-client (1.20.0+)
import { HttpClient } from "@printwithsynergy/codex-client";

const codex = new HttpClient({
  baseUrl: process.env.CODEX_API_BASE,
  bearerToken: process.env.CODEX_BEARER_TOKEN,
});

// Cache hit returns inline; miss pulls in the background and this polls
// until codex has the document (or `timeoutMs` elapses).
const asset = await codex.requestAsset(pdfBytes, { pollMs: 1500, timeoutMs: 60_000 });
if (asset.status === "complete") {
  useDocument(asset.document);
}

// By hash: when you already uploaded the bytes once and only have the sha.
const again = await codex.requestAsset({ sha256 });
```

```python
# Python: codex_pdf.client (1.20.0+)
from codex_pdf.client import HttpClient

codex = HttpClient(base_url="https://codex.example.com", bearer_token="…")

asset = codex.request_asset(pdf_bytes, poll_seconds=1.5, timeout_seconds=60.0)
if asset["status"] == "complete":
    use_document(asset["document"])

# By hash:
asset = codex.request_asset(sha256=sha)
```

Lower-level methods are available when you need them:
`ingestAsset` / `ingest_asset`, `getAsset` / `get_asset`,
`getAssetSignals` / `get_asset_signals`.
The backfill pattern (viewers)
The intended UX for viewers (lens-pdf and the demo sites): paint the
PDF immediately from a local renderer (pdf.js), then call `requestAsset` in
the background and backfill richer data (findings, dieline/spot
overlays, signals) when codex’s document lands. Nothing blocks the
first paint; better data arrives asynchronously. Repeat visits hit the
cache, so the backfill is instant.
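The ordering is the point of the pattern: paint first, backfill later. The sketch below shows that shape with `asyncio`, written in Python like the client example above; real viewers run in a browser, where the same structure applies. All three callables (`paint_locally`, `request_asset`, `apply_overlays`) are hypothetical stand-ins, and the demo uses a fake request instead of a real codex call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def open_viewer(
    paint_locally: Callable[[], None],
    request_asset: Callable[[], Awaitable[Dict[str, Any]]],
    apply_overlays: Callable[[Dict[str, Any]], None],
) -> "asyncio.Task[None]":
    """Paint immediately, then backfill from codex in a background task."""
    paint_locally()  # first paint never waits on codex

    async def backfill() -> None:
        asset = await request_asset()
        if asset.get("status") == "complete":
            apply_overlays(asset["document"])

    return asyncio.create_task(backfill())


async def demo() -> List[str]:
    events: List[str] = []

    async def fake_request_asset() -> Dict[str, Any]:
        await asyncio.sleep(0)  # stand-in for the network round-trip
        return {"status": "complete", "document": {"findings": []}}

    task = await open_viewer(
        lambda: events.append("painted"),
        fake_request_asset,
        lambda doc: events.append("backfilled"),
    )
    events.append("interactive")  # the viewer is usable before codex answers
    await task
    return events


events = asyncio.run(demo())
print(events)
```

If the request fails or times out, the viewer simply keeps its locally rendered state, since nothing in the first paint depended on the backfill.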
Why this is not a new engine
`/v1/assets` is glue over machinery codex already runs:

- Idempotent extract: `_run_extract` is cache-check → extract → cache, keyed by `cache_key(pdf_bytes, {}, kind="extract", tenant)`.
- Content-addressed cache: `codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha}` (`api/cache.py`), Redis (24h TTL) or in-memory LRU. The poll path peeks this cache by sha via `cache_key_for_sha`; no bytes, no re-extract.
- Speculator: `_publish_speculate(sha)` enqueues a background pull on the `codex:speculate` Redis stream; the consumer extracts + caches.
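To see why the poll path needs no bytes, it helps to build the key both ways. The sketch below follows the key layout quoted above, but everything else is an assumption: the `VERSION` placeholder, the way `args_sha` is derived (SHA-256 of canonical JSON), and the `sketch_*` function names, which are deliberately not the real `cache_key` / `cache_key_for_sha` in `api/cache.py`.

```python
import hashlib
import json
from typing import Any, Mapping

VERSION = "v0"  # placeholder; the real value is defined inside codex


def sketch_args_sha(args: Mapping[str, Any]) -> str:
    """Hash the extract arguments (assumed scheme: canonical JSON)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sketch_cache_key_for_sha(pdf_sha: str, args: Mapping[str, Any], kind: str, tenant: str) -> str:
    """Build the key from a known sha, with no PDF bytes in hand."""
    return f"codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{sketch_args_sha(args)}"


def sketch_cache_key(pdf_bytes: bytes, args: Mapping[str, Any], kind: str, tenant: str) -> str:
    """Build the key from raw bytes by hashing them first."""
    pdf_sha = hashlib.sha256(pdf_bytes).hexdigest()
    return sketch_cache_key_for_sha(pdf_sha, args, kind, tenant)


data = b"%PDF-1.7\n% placeholder bytes\n"
from_bytes = sketch_cache_key(data, {}, kind="extract", tenant="acme")
from_sha = sketch_cache_key_for_sha(hashlib.sha256(data).hexdigest(), {}, kind="extract", tenant="acme")
print(from_bytes == from_sha)
```

Both routes land on the same key, so an upload and a later by-sha poll address the same cache entry.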
No new cache layer. Consumers must not add their own cache on top:
codex’s cache is the single source of truth for “how long a derived
artifact lives” (`CODEX_CACHE_TTL_SECONDS`, default 24h).
Tenancy, rate limits, telemetry
The asset endpoints honour the same `X-Codex-Tenant` scoping, the same
(tenant, endpoint) rate-limit token bucket (`Retry-After` on `429`,
which both clients honour automatically), and the same
`X-Codex-Stage-Durations-Ms` telemetry as the rest of the unified
extraction surface. See unified-extraction.md.
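The bundled clients already handle `429` for you; the sketch below only illustrates what honouring `Retry-After` means for anyone calling the endpoints without them. It is not the clients' implementation. `send` is a hypothetical callable returning `(status_code, headers, body)`, and only the delta-seconds form of `Retry-After` is parsed; an HTTP-date value falls back to a default delay.

```python
import time
from typing import Any, Callable, Dict, Mapping, Tuple

Reply = Tuple[int, Mapping[str, str], Dict[str, Any]]


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds; fall back to `default` if absent or not numeric."""
    for name, value in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            try:
                return max(0.0, float(value))
            except ValueError:
                return default
    return default


def send_with_retry_after(
    send: Callable[[], Reply],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Reply:
    """Call `send`, waiting out 429 responses up to `max_attempts` tries."""
    reply = send()
    for _ in range(max_attempts - 1):
        if reply[0] != 429:
            break
        sleep(retry_after_seconds(reply[1]))
        reply = send()
    return reply
```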
Consumers
Every PWS consumer talks to codex through this contract:
| Repo | Integration path |
|---|---|
| lint-pdf | preflight reads the cache-backed document; `codex_signals_*` analyzers read AI signals by hash |
| lens-pdf | viewer overlays backfill from `requestAsset` after first paint |
| lint-pdf-marketing / lens-pdf-marketing | `/api/demo/codex/[id]` proxies `requestAsset` (no demo-side cache) |
| synergy | `codex.probe` / `codex.signals` flow nodes |
| platform | `/ai` + `/extraction` routes |
Each consumer’s docs link here rather than re-describing its own flow.