Data requests — the standard "ask codex for data" pattern

This is the canonical way every Print-with-Synergy service asks codex for a PDF’s facts. One contract, one client call, one cache.

TL;DR — call requestAsset(pdf). If codex has the document cached you get it back inline in one round-trip. If it doesn’t, codex pulls it in the background and the same call polls until the data lands. Repeat requests for the same PDF are served from cache — codex never re-extracts.

Before this contract, consumers diverged: some raw-fetched /v1/extract synchronously, some hand-rolled an ingest+poll client against an endpoint that didn’t exist, some double-cached results in their own Redis on top of codex’s cache. /v1/assets collapses all of that into one pattern built on the cache + speculator codex already runs.

The contract

| Verb / Path | Purpose |
| --- | --- |
| `POST /v1/assets` | Ingest. Body is either a multipart `pdf` field or JSON `{ "sha256": "…" }` referencing a PDF codex has already seen. |
| `GET /v1/assets/{pdf_hash}` | Poll. Serves the cached document when ready. |
| `GET /v1/assets/{pdf_hash}/signals/{kind}` | Cached AI signal (`language`, `logos`, `symbols`, `barcodes`, `spell`, `classification`) by hash. Thin alias over `GET /v1/documents/{pdf_hash}/signals/{kind}`. |

pdf_hash is the lower-case hex sha256 of the raw PDF bytes — the same content-address codex keys its cache on. It is the asset_id.
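
As a minimal sketch of that definition, the hash can be computed locally before any request is made. The function name `pdf_hash` is chosen here for illustration and is not part of either bundled client.

```python
import hashlib


def pdf_hash(pdf_bytes: bytes) -> str:
    """Return the lower-case hex SHA-256 of the raw PDF bytes.

    Illustrative helper only: hash the bytes exactly as stored or uploaded,
    with no normalisation, so the result matches the content address.
    """
    return hashlib.sha256(pdf_bytes).hexdigest()
```

Computing the hash client-side lets a caller decide between a by-sha request and a multipart upload without a round-trip.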

`POST /v1/assets`

Resolves the PDF bytes (multipart upload, or {sha256} looked up in the blob store), computes the sha, then:

Cache hit → 200

```
{
  "asset_id": "<sha256>",
  "pdf_hash": "<sha256>",
  "status": "complete",
  "document": { "pdf_sha256": "<sha256>", "...": "full CodexDocument" },
  "summary": { "...": "codex summary or null" }
}
```

This is the sync-fast path. A warm cache answers in one round-trip — satisfying callers that expect a <3s synchronous extract.

Miss → 202
```
{ "asset_id": "<sha256>", "pdf_hash": "<sha256>", "status": "processing" }
```
The sha is handed to the speculator (codex:speculate Redis stream) for a background pull. The caller polls GET /v1/assets/{pdf_hash}.

{sha256} for a hash codex has never seen → 404

```
{ "asset_id": "<sha256>", "pdf_hash": "<sha256>", "status": "unknown",
  "error": "No PDF cached for this sha256. Upload the bytes via multipart 'pdf' once to seed it." }
```

Upload the bytes once (multipart) to seed the blob store; subsequent by-sha requests then resolve.
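
The three ingest outcomes above can be sketched as a small dispatch function for callers that use the raw endpoint rather than `requestAsset`. The function name and the returned action labels are hypothetical, introduced only for this illustration.

```python
from typing import Any, Mapping


def next_step(http_status: int, body: Mapping[str, Any]) -> str:
    """Map a POST /v1/assets response to the caller's next action.

    The action labels ("use_document", "poll", "upload_bytes") are
    illustrative names, not values returned by codex.
    """
    status = body.get("status")
    if http_status == 200 and status == "complete":
        return "use_document"   # cache hit: document is inline
    if http_status == 202 and status == "processing":
        return "poll"           # miss: poll GET /v1/assets/{pdf_hash}
    if http_status == 404 and status == "unknown":
        return "upload_bytes"   # seed the blob store via multipart 'pdf'
    raise ValueError(f"unexpected response: {http_status} {status!r}")
```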

`GET /v1/assets/{pdf_hash}`

- cached → 200 { status: "complete", document }.
- blob present but not yet extracted (speculator in flight) → the poll extracts on demand from the blob store and returns complete, so a poll always makes forward progress even if the speculator stalls.
- unknown sha → 404 { status: "unknown" }.
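
A hand-rolled poll loop over these outcomes might look like the sketch below. It is not the bundled client's implementation: `fetch` is a hypothetical callable the caller supplies (returning the HTTP status and the decoded JSON body), and treating any response other than 200 or 404 as "not ready yet" is an assumption.

```python
import time
from typing import Any, Callable, Dict, Tuple

Fetch = Callable[[str], Tuple[int, Dict[str, Any]]]


def poll_asset(
    fetch: Fetch,
    pdf_hash: str,
    poll_seconds: float = 1.5,
    timeout_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the document is complete, the sha is unknown, or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        http_status, body = fetch(pdf_hash)
        if http_status == 200 and body.get("status") == "complete":
            return body
        if http_status == 404:
            # Unknown sha: polling cannot succeed until the bytes are uploaded.
            raise LookupError(f"codex has no PDF for {pdf_hash}")
        if clock() >= deadline:
            raise TimeoutError(f"asset {pdf_hash} not ready after {timeout_seconds}s")
        sleep(poll_seconds)  # assumed: anything else means "not ready yet"
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. In normal use `requestAsset` already does this polling.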

`GET /v1/assets/{pdf_hash}/signals/{kind}`

- ingested → delegates to the dedicated run_signal path (per-kind cache + AI gating) and returns { status: "complete", kind, ai_status, signal }.
- never ingested → 404 { status: "processing" }. Drive POST /v1/assets (or GET /v1/assets/{pdf_hash}) first.

AI signals are opt-in (operator sets CODEX_AI_ENABLED=true). When AI is gated off, signal is empty and ai_status is "disabled" / "skipped" — the call still succeeds.
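
Because a gated-off call still succeeds, consumers need to check `ai_status` before trusting `signal`. A minimal sketch, assuming the response has been decoded into a dict; the helper name `usable_signal` is hypothetical.

```python
from typing import Any, Mapping, Optional


def usable_signal(body: Mapping[str, Any]) -> Optional[Any]:
    """Return the signal payload, or None when there is nothing to use.

    None covers three cases: the response is not complete, AI is gated off
    ("disabled" / "skipped"), or the signal itself is empty.
    """
    if body.get("status") != "complete":
        return None
    if body.get("ai_status") in ("disabled", "skipped"):
        return None
    return body.get("signal") or None
```

Returning `None` for all three cases lets an analyzer skip the signal without treating a disabled AI backend as an error.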

The one client call: `requestAsset`

Both bundled clients ship a single helper that is the pattern. Use it.

```
// TypeScript — @printwithsynergy/codex-client (1.20.0+)
import { HttpClient } from "@printwithsynergy/codex-client";

const codex = new HttpClient({
  baseUrl: process.env.CODEX_API_BASE,
  bearerToken: process.env.CODEX_BEARER_TOKEN,
});

// Cache hit returns inline; miss pulls in the background and this polls
// until codex has the document (or `timeoutMs` elapses).
const asset = await codex.requestAsset(pdfBytes, { pollMs: 1500, timeoutMs: 60_000 });
if (asset.status === "complete") {
  useDocument(asset.document);
}

// By hash — when you already uploaded the bytes once and only have the sha:
const again = await codex.requestAsset({ sha256 });
```

```
# Python — codex_pdf.client (1.20.0+)
from codex_pdf.client import HttpClient

codex = HttpClient(base_url="https://codex.example.com", bearer_token="…")

asset = codex.request_asset(pdf_bytes, poll_seconds=1.5, timeout_seconds=60.0)
if asset["status"] == "complete":
    use_document(asset["document"])

# By hash:
asset = codex.request_asset(sha256=sha)
```

Lower-level methods are available when you need them: ingestAsset / ingest_asset, getAsset / get_asset, getAssetSignals / get_asset_signals.

The backfill pattern (viewers)

The intended UX for viewers (lens-pdf and the demo sites): paint the PDF immediately from a local renderer (pdf.js), then requestAsset in the background and backfill richer data — findings, dieline/spot overlays, signals — when codex’s document lands. Nothing blocks the first paint; better data arrives asynchronously. Repeat visits hit the cache, so the backfill is instant.
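
The ordering this pattern relies on can be sketched in a few lines. Real viewers are browser code, so this Python version only illustrates the control flow; all three callables are hypothetical stand-ins for the local renderer, the `requestAsset` call, and the overlay update.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def paint_then_backfill(
    paint_locally: Callable[[], None],
    request_asset: Callable[[], Awaitable[Dict[str, Any]]],
    apply_backfill: Callable[[Dict[str, Any]], None],
) -> bool:
    """Paint first, then enrich the view once codex's document lands."""
    paint_locally()                     # first paint never waits on codex
    asset = await request_asset()       # cache hit or background pull + poll
    if asset.get("status") != "complete":
        return False                    # keep the plain local render
    apply_backfill(asset["document"])   # findings, overlays, signals
    return True
```

Scheduling this coroutine as a background task keeps the rest of the viewer responsive while the backfill is pending.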

Why this is not a new engine

/v1/assets is glue over machinery codex already runs:

- Idempotent extract: _run_extract is cache-check → extract → cache, keyed by cache_key(pdf_bytes, {}, kind="extract", tenant).
- Content-addressed cache: codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha} (api/cache.py), backed by Redis (24h TTL) or an in-memory LRU. The poll path peeks this cache by sha via cache_key_for_sha, so it needs no bytes and never re-extracts.
- Speculator: _publish_speculate(sha) enqueues a background pull on the codex:speculate Redis stream; the consumer extracts and caches.
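
To make the key layout concrete, here is a sketch that assembles a key of the documented shape. It is not the code in api/cache.py: the function name is hypothetical, and hashing the arguments as canonical JSON is an assumption about how `args_sha` might be derived.

```python
import hashlib
import json
from typing import Any, Mapping


def sketch_cache_key(
    version: str, kind: str, tenant: str, pdf_bytes: bytes, args: Mapping[str, Any]
) -> str:
    """Build codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha} (illustrative)."""
    pdf_sha = hashlib.sha256(pdf_bytes).hexdigest()
    # Assumption: args are hashed as sorted, compact JSON so that key order
    # in the mapping does not change the resulting cache key.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    args_sha = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"codex:{version}:{kind}:{tenant}:{pdf_sha}:{args_sha}"
```

Because every component is derived from the request itself, two identical requests land on the same key, which is what makes the extract idempotent.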

No new cache layer. Consumers must not add their own cache on top — codex’s cache is the single source of truth for “how long a derived artifact lives” (CODEX_CACHE_TTL_SECONDS, default 24h).

Tenancy, rate limits, telemetry

The asset endpoints honour the same X-Codex-Tenant scoping, the same (tenant, endpoint) rate-limit token bucket (Retry-After on 429, which both clients honour automatically), and the same X-Codex-Stage-Durations-Ms telemetry as the rest of the unified extraction surface. See unified-extraction.md.
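
Both clients handle 429 responses for you. For callers hitting the endpoints directly, honouring `Retry-After` could be sketched as below; the helper name and the one-second fallback are assumptions, and only the delay-in-seconds form of the header is parsed.

```python
from typing import Mapping


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return how long to wait before retrying a 429 response.

    Header names are matched case-insensitively. The HTTP-date form of
    Retry-After is not handled in this sketch and falls back to `default`.
    """
    raw = None
    for name, value in headers.items():
        if name.lower() == "retry-after":
            raw = value.strip()
            break
    if raw is None:
        return default
    try:
        return max(0.0, float(raw))  # never return a negative delay
    except ValueError:
        return default
```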

Consumers

Every PWS consumer talks to codex through this contract:

| Repo | Path |
| --- | --- |
| lint-pdf | preflight reads the cache-backed document; `codex_signals_*` analyzers read AI signals by hash |
| lens-pdf | viewer overlays backfill from `requestAsset` after first paint |
| lint-pdf-marketing / lens-pdf-marketing | `/api/demo/codex/[id]` proxies `requestAsset` (no demo-side cache) |
| synergy | `codex.probe` / `codex.signals` flow nodes |
| platform | `/ai` + `/extraction` routes |

Each consumer’s docs link here rather than re-describing its own flow.