Data requests — the standard "ask codex for data" pattern

This is the canonical way every Print-with-Synergy service asks codex for a PDF’s facts. One contract, one client call, one cache.

TL;DR — call requestAsset(pdf). If codex has the document cached you get it back inline in one round-trip. If it doesn’t, codex pulls it in the background and the same call polls until the data lands. Repeat requests for the same PDF are served from cache — codex never re-extracts.

Before this contract, consumers diverged: some raw-fetched /v1/extract synchronously, some hand-rolled an ingest+poll client against an endpoint that didn’t exist, some double-cached results in their own Redis on top of codex’s cache. /v1/assets collapses all of that into one pattern built on the cache + speculator codex already runs.

| Verb / Path | Purpose |
| --- | --- |
| POST /v1/assets | Ingest. Body is either a multipart pdf field or JSON { "sha256": "…" } referencing a PDF codex has already seen. |
| GET /v1/assets/{pdf_hash} | Poll. Serves the cached document when ready. |
| GET /v1/assets/{pdf_hash}/signals/{kind} | Cached AI signal (language, logos, symbols, barcodes, spell, classification) by hash. Thin alias over GET /v1/documents/{pdf_hash}/signals/{kind}. |

pdf_hash is the lower-case hex sha256 of the raw PDF bytes — the same content-address codex keys its cache on. It is the asset_id.
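
As a minimal sketch of the content address described above, the hash can be computed locally with the standard library before talking to codex. The helper names `pdf_hash_of` and `asset_paths` are illustrative only and are not part of either bundled client.

```python
import hashlib


def pdf_hash_of(pdf_bytes: bytes) -> str:
    """Lower-case hex sha256 of the raw PDF bytes (hexdigest() is already lower-case)."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def asset_paths(pdf_hash: str, kind: str = "language") -> dict:
    """Paths from the endpoint table for one asset (hypothetical helper)."""
    return {
        "ingest": "/v1/assets",
        "poll": f"/v1/assets/{pdf_hash}",
        "signal": f"/v1/assets/{pdf_hash}/signals/{kind}",
    }
```

Because the hash is derived from the bytes alone, two services holding the same PDF arrive at the same asset_id without coordinating.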

POST /v1/assets resolves the PDF bytes (from the multipart upload, or by looking up {sha256} in the blob store), computes the sha256, then returns one of:

  • Cache hit → 200

    {
    "asset_id": "<sha256>",
    "pdf_hash": "<sha256>",
    "status": "complete",
    "document": { "pdf_sha256": "<sha256>", "...": "full CodexDocument" },
    "summary": { "...": "codex summary or null" }
    }

    This is the sync-fast path. A warm cache answers in one round-trip — satisfying callers that expect a <3s synchronous extract.

  • Miss → 202

    { "asset_id": "<sha256>", "pdf_hash": "<sha256>", "status": "processing" }

    The sha is handed to the speculator (codex:speculate Redis stream) for a background pull. The caller polls GET /v1/assets/{pdf_hash}.

  • {sha256} for a hash codex has never seen → 404

    { "asset_id": "<sha256>", "pdf_hash": "<sha256>", "status": "unknown",
    "error": "No PDF cached for this sha256. Upload the bytes via multipart 'pdf' once to seed it." }

    Upload the bytes once (multipart) to seed the blob store; subsequent by-sha requests then resolve.

GET /v1/assets/{pdf_hash} (poll):

  • cached → 200 { status: "complete", document }.
  • blob present but not yet extracted (speculator in flight) → the poll extracts on demand from the blob store and returns complete, so a poll always makes forward progress even if the speculator stalls.
  • unknown sha → 404 { status: "unknown" }.

GET /v1/assets/{pdf_hash}/signals/{kind}:

  • ingested → delegates to the dedicated run_signal path (per-kind cache + AI gating) and returns { status: "complete", kind, ai_status, signal }.
  • never ingested → 404 { status: "processing" }; drive POST /v1/assets (or GET /v1/assets/{pdf_hash}) first.
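
For callers that cannot use the bundled helper, the poll outcomes listed above suggest a loop along the following lines. This is a sketch under stated assumptions, not the clients' actual implementation: `fetch` is a hypothetical callable that performs the GET and returns the HTTP status with the decoded JSON body, and any response other than 200 complete or 404 is assumed to mean "keep polling".

```python
import time
from typing import Callable, Tuple

# Hypothetical transport: path -> (HTTP status, decoded JSON body).
Fetch = Callable[[str], Tuple[int, dict]]


def poll_asset(fetch: Fetch, pdf_hash: str,
               poll_seconds: float = 1.5,
               timeout_seconds: float = 60.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll GET /v1/assets/{pdf_hash} until complete, unknown, or timed out."""
    deadline = clock() + timeout_seconds
    while True:
        status_code, body = fetch(f"/v1/assets/{pdf_hash}")
        if status_code == 200 and body.get("status") == "complete":
            return body
        if status_code == 404:
            # Unknown sha: polling cannot help, the bytes must be uploaded once.
            raise LookupError(body.get("error", "unknown asset"))
        # Assumption: anything else means the document is still on its way.
        if clock() + poll_seconds > deadline:
            raise TimeoutError(f"asset {pdf_hash} not ready after {timeout_seconds}s")
        sleep(poll_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; in production the defaults apply.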

AI signals are opt-in (operator sets CODEX_AI_ENABLED=true). When AI is gated off, signal is empty and ai_status is "disabled" / "skipped" — the call still succeeds.
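
Because a gated-off signal still returns successfully, consumers need to check ai_status rather than the HTTP status alone. A minimal sketch of that check follows; `usable_signal` is a hypothetical helper, and it assumes an "empty" signal may arrive as either null or an empty object.

```python
from typing import Optional


def usable_signal(response: dict) -> Optional[dict]:
    """Return the signal payload, or None when AI was gated off or produced nothing."""
    if response.get("status") != "complete":
        return None
    if response.get("ai_status") in ("disabled", "skipped"):
        return None
    # Treat both a missing signal and an empty one as "nothing to use".
    return response.get("signal") or None
```

Callers can then treat `None` as "no AI data for this kind" and carry on with the non-AI document facts.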

Both bundled clients ship a single helper that is the pattern. Use it.

```ts
// TypeScript: @printwithsynergy/codex-client (1.20.0+)
import { HttpClient } from "@printwithsynergy/codex-client";

const codex = new HttpClient({
  baseUrl: process.env.CODEX_API_BASE,
  bearerToken: process.env.CODEX_BEARER_TOKEN,
});

// Cache hit returns inline; miss pulls in the background and this polls
// until codex has the document (or `timeoutMs` elapses).
const asset = await codex.requestAsset(pdfBytes, { pollMs: 1500, timeoutMs: 60_000 });
if (asset.status === "complete") {
  useDocument(asset.document);
}

// By hash, when you already uploaded the bytes once and only have the sha:
const again = await codex.requestAsset({ sha256 });
```

```python
# Python: codex_pdf.client (1.20.0+)
from codex_pdf.client import HttpClient

codex = HttpClient(base_url="https://codex.example.com", bearer_token="")

asset = codex.request_asset(pdf_bytes, poll_seconds=1.5, timeout_seconds=60.0)
if asset["status"] == "complete":
    use_document(asset["document"])

# By hash:
asset = codex.request_asset(sha256=sha)
```

Lower-level methods are available when you need them: ingestAsset / ingest_asset, getAsset / get_asset, getAssetSignals / get_asset_signals.

The intended UX for viewers (lens-pdf and the demo sites): paint the PDF immediately from a local renderer (pdf.js), then requestAsset in the background and backfill richer data — findings, dieline/spot overlays, signals — when codex’s document lands. Nothing blocks the first paint; better data arrives asynchronously. Repeat visits hit the cache, so the backfill is instant.
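
The ordering matters more than the language, so here is a language-neutral sketch of the paint-then-backfill flow, written in Python with asyncio. The callables `paint`, `request_asset` and `backfill` are placeholders for the local renderer, the client helper and the overlay code; none of these names come from lens-pdf itself.

```python
import asyncio
from typing import Callable


async def open_pdf(pdf_bytes: bytes,
                   paint: Callable[[bytes], None],
                   request_asset: Callable[[bytes], dict],
                   backfill: Callable[[dict], None]) -> None:
    # First paint comes from the local renderer and never waits on codex.
    paint(pdf_bytes)
    # request_asset is assumed to be a blocking call, so run it on a worker
    # thread and let the event loop stay responsive while it polls.
    asset = await asyncio.to_thread(request_asset, pdf_bytes)
    if asset.get("status") == "complete":
        # Richer data (findings, overlays, signals) lands after the fact.
        backfill(asset["document"])
```

If codex is slow or unavailable, the user still sees the page; only the overlays are delayed.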

/v1/assets is glue over machinery codex already runs:

  • Idempotent extract: _run_extract is cache-check → extract → cache, keyed by cache_key(pdf_bytes, {}, kind="extract", tenant).
  • Content-addressed cache: codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha} (api/cache.py), Redis (24h TTL) or in-memory LRU. The poll path peeks this cache by sha via cache_key_for_sha, with no bytes and no re-extract.
  • Speculator: _publish_speculate(sha) enqueues a background pull on the codex:speculate Redis stream; the consumer extracts + caches.
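
To make the key layout above concrete, the following sketch assembles a key of the same shape. It is not the code in api/cache.py: `cache_key_sketch` and `args_sha` are hypothetical names, and hashing the arguments as canonical JSON is an assumption about how args_sha might be derived.

```python
import hashlib
import json


def args_sha(args: dict) -> str:
    # Assumption: args are hashed as canonical JSON; codex's real scheme may differ.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key_sketch(version: str, kind: str, tenant: str,
                     pdf_sha: str, args: dict) -> str:
    """Assemble codex:{VERSION}:{kind}:{tenant}:{pdf_sha}:{args_sha}."""
    return ":".join(["codex", version, kind, tenant, pdf_sha, args_sha(args)])
```

Every component is derivable from the sha and the request parameters, which is why the poll path can look up a cached document without ever seeing the PDF bytes again.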

No new cache layer. Consumers must not add their own cache on top — codex’s cache is the single source of truth for “how long a derived artifact lives” (CODEX_CACHE_TTL_SECONDS, default 24h).

The asset endpoints honour the same X-Codex-Tenant scoping, the same (tenant, endpoint) rate-limit token bucket (Retry-After on 429, which both clients honour automatically), and the same X-Codex-Stage-Durations-Ms telemetry as the rest of the unified extraction surface. See unified-extraction.md.
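
Both bundled clients already handle 429 responses, so the following is only relevant to code that speaks raw HTTP. It sketches one way to honour Retry-After while sending the tenant header. `send` is a hypothetical transport callable, and the sketch assumes Retry-After carries a number of seconds rather than an HTTP-date.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical transport result: (HTTP status, response headers, JSON body).
Response = Tuple[int, Dict[str, str], dict]


def send_with_retry(send: Callable[[Dict[str, str]], Response],
                    tenant: str,
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() with the tenant header, backing off on 429 as instructed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    headers = {"X-Codex-Tenant": tenant}
    for attempt in range(max_attempts):
        status, resp_headers, body = send(headers)
        if status != 429:
            return status, resp_headers, body
        # Assumption: Retry-After is delay-seconds; fall back to 1s if absent.
        delay = float(resp_headers.get("Retry-After", "1"))
        if attempt < max_attempts - 1:
            sleep(delay)
    # Still rate-limited after the final attempt: hand the 429 back to the caller.
    return status, resp_headers, body
```

The rate limit is a (tenant, endpoint) bucket, so backing off on one endpoint does not imply the others are throttled.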

Every PWS consumer talks to codex through this contract:

| Repo | Path |
| --- | --- |
| lint-pdf | preflight reads the cache-backed document; codex_signals_* analyzers read AI signals by hash |
| lens-pdf | viewer overlays backfill from requestAsset after first paint |
| lint-pdf-marketing / lens-pdf-marketing | /api/demo/codex/[id] proxies requestAsset (no demo-side cache) |
| synergy | codex.probe / codex.signals flow nodes |
| platform | /ai + /extraction routes |

Each consumer’s docs link here rather than re-describing its own flow.