trove
The content-addressed digital-asset plane for the Print With Synergy tools. Assets live either in cloud storage or as cloud stubs whose real bytes stay on an on-site store and are recalled on demand by the outbound trove-agent.
A Tier-1 / platform-class primitive (a format-agnostic asset plane, not
PDF-specific) — hence no pdf suffix.
- `blob`: content-addressed bytes (dedupe + integrity unit, keyed by sha256).
- `asset`: a logical, named, versioned, tenant-scoped handle over blobs. Write-back creates a new version; blobs are never mutated.
- `placement`: where a blob physically lives (cloud / on-prem), one row per copy.
- `derivation`: cached deterministic producer output (codex first) keyed by `(blob, producer, version)`.
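To make the content-addressing idea concrete, here is a minimal sketch that derives a blob key from bytes with SHA-256 and shows identical bytes deduplicating to one stored copy. The `blobKey` helper and the in-memory `BlobStore` are illustrative assumptions, not trove's storage layer.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: a blob's identity is the hex SHA-256 of its bytes.
function blobKey(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Illustrative in-memory store (an assumption made for this sketch).
class BlobStore {
  private blobs = new Map<string, Uint8Array>();

  // Storing the same bytes twice yields the same key and a single stored copy.
  put(bytes: Uint8Array): string {
    const key = blobKey(bytes);
    if (!this.blobs.has(key)) this.blobs.set(key, bytes);
    return key;
  }

  // Integrity check: re-hash on read and compare against the key.
  get(key: string): Uint8Array | undefined {
    const bytes = this.blobs.get(key);
    if (bytes && blobKey(bytes) !== key) {
      throw new Error("blob integrity check failed");
    }
    return bytes;
  }

  get size(): number {
    return this.blobs.size;
  }
}
```

Because the key is a function of the bytes alone, a changed file always produces a new key, which is why blobs never need to be mutated in place.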
Placement policy (all four supported, default `onprem-origin+cloud-cache`):

| Policy | Bytes | Cloud copy |
|---|---|---|
| `cloud-only` | cloud is origin | durable |
| `onprem-origin+cloud-cache` | on-prem origin | TTL cache after first materialize |
| `stub-only` | on-prem only | metadata/derived-facts only |
| `mirror` | both | durable both sides |
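The four policies listed above can be modelled as a closed union with a small lookup. This is a sketch only: the policy names come from this README, while `PlacementPlan`, its field names, and `planFor` are hypothetical and do not describe the resolver in `packages/core`.

```typescript
// Policy names are taken from the README; everything else is assumed.
type PlacementPolicy =
  | "cloud-only"
  | "onprem-origin+cloud-cache"
  | "stub-only"
  | "mirror";

// Hypothetical plan shape: where the origin bytes live and what the cloud holds.
interface PlacementPlan {
  origin: "cloud" | "onprem" | "both";
  cloudCopy: "durable" | "ttl-cache" | "metadata-only";
}

const DEFAULT_POLICY: PlacementPolicy = "onprem-origin+cloud-cache";

// Map a policy to its plan; the switch is exhaustive over the union.
function planFor(policy: PlacementPolicy = DEFAULT_POLICY): PlacementPlan {
  switch (policy) {
    case "cloud-only":
      return { origin: "cloud", cloudCopy: "durable" };
    case "onprem-origin+cloud-cache":
      return { origin: "onprem", cloudCopy: "ttl-cache" };
    case "stub-only":
      return { origin: "onprem", cloudCopy: "metadata-only" };
    case "mirror":
      return { origin: "both", cloudCopy: "durable" };
  }
}
```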
Derived-facts cache (codex-on-ingest)
Because the model is content-addressed and codex is deterministic,
codex(blobHash, codexVersion) is a pure function of the bytes. Every PDF blob’s
codex facts are extracted once (lazily, on first materialize) and cached —
the full facts as their own content-addressed blob, plus a small queryable
summary. A changed file is a new blob = cache miss; a codex bump changes the
producer version = cache miss. The summary is answerable even for stub-only
assets without recalling the bytes.
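The cache behaviour described above amounts to memoizing on the blob hash, producer, and producer version. The sketch below illustrates that idea with a hypothetical in-memory `DerivationCache`; it is not the `deriveCodex` implementation, and the key format is an assumption.

```typescript
// Hypothetical cache key: (blob hash, producer name, producer version).
type DerivationKey = { blobHash: string; producer: string; version: string };

const keyOf = (k: DerivationKey): string =>
  `${k.blobHash}:${k.producer}:${k.version}`;

class DerivationCache<T> {
  private entries = new Map<string, T>();
  hits = 0;
  misses = 0;

  // Cache-first: run the producer only when no entry exists for the key.
  derive(key: DerivationKey, produce: () => T): T {
    const id = keyOf(key);
    if (this.entries.has(id)) {
      this.hits++;
      return this.entries.get(id) as T;
    }
    this.misses++;
    const value = produce();
    this.entries.set(id, value);
    return value;
  }
}
```

Deriving twice with the same key runs the producer once. Changing either the blob hash (a changed file) or the version (a codex bump) produces a different key and therefore a miss.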
HTTP API
Operational (open): GET /healthz, GET /readyz, GET /v1/contract, and
GET /metrics (Prometheus). The
self-hosted data plane GET|PUT /v1/data is also open — it is authorized by a
short-lived HMAC signature on the URL itself (see TROVE_DATA_SIGNING_KEY), the
localfs equivalent of an S3 presigned URL.
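A short-lived HMAC-signed URL can be sketched as follows. The query parameter names (`key`, `exp`, `sig`), the signed string layout, and both function names are assumptions made for illustration; trove's actual URL format is not specified here. Passing a list of keys mirrors the rotation behaviour described for TROVE_DATA_SIGNING_KEY (first signs, all verify).

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed signed payload: method, object key, and expiry joined by newlines.
function mac(key: Buffer, method: string, objectKey: string, exp: number): Buffer {
  return createHmac("sha256", key).update(`${method}\n${objectKey}\n${exp}`).digest();
}

// Sign with the first key in the list.
function signDataUrl(
  keys: Buffer[],
  method: string,
  objectKey: string,
  ttlSeconds: number,
  nowSeconds: number,
): string {
  const exp = nowSeconds + ttlSeconds;
  const params = new URLSearchParams({
    key: objectKey,
    exp: String(exp),
    sig: mac(keys[0], method, objectKey, exp).toString("hex"),
  });
  return `/v1/data?${params.toString()}`;
}

// Accept the URL if it has not expired and any configured key verifies it.
function verifyDataUrl(
  keys: Buffer[],
  method: string,
  url: string,
  nowSeconds: number,
): boolean {
  const q = new URL(url, "http://localhost").searchParams;
  const objectKey = q.get("key") ?? "";
  const exp = Number(q.get("exp"));
  const sig = Buffer.from(q.get("sig") ?? "", "hex");
  if (!Number.isFinite(exp) || exp < nowSeconds) return false; // expired or malformed
  return keys.some((k) => {
    const expected = mac(k, method, objectKey, exp);
    // Constant-time comparison; lengths must match before calling it.
    return sig.length === expected.length && timingSafeEqual(sig, expected);
  });
}
```

Binding the HTTP method into the signature means a URL issued for GET cannot be replayed as a PUT.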
Tenant-scoped routes resolve the calling tenant in one of three modes (see
Auth below). Errors are RFC 7807 application/problem+json.
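For readers unfamiliar with RFC 7807, the sketch below builds a problem document using the standard members (`type`, `title`, `status`, `detail`, `instance`). The `problem` helper is hypothetical, and the exact fields trove emits are not documented here; `about:blank` is the type the RFC defines for problems with no semantics beyond the HTTP status.

```typescript
// Standard RFC 7807 members.
interface ProblemDetails {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
}

// Hypothetical helper returning the body plus the media type to send it with.
function problem(
  status: number,
  title: string,
  detail?: string,
): { contentType: string; body: ProblemDetails } {
  return {
    contentType: "application/problem+json",
    body: { type: "about:blank", title, status, ...(detail ? { detail } : {}) },
  };
}
```

For example, a duplicate asset key could be reported as `problem(409, "Conflict", "asset key already exists")`.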
| Method + path | Purpose |
|---|---|
| `POST /v1/assets` | Create an asset (`{key, policy?}`); 409 on duplicate key |
| `GET /v1/assets` | List the tenant’s assets |
| `GET /v1/assets/by-key?key=` | Fetch one asset |
| `POST /v1/agents` | Register an on-prem agent (`{name, fingerprint}`) |
| `GET /v1/agents` | List registered agents |
| `POST /v1/ingest?key=` | Upload bytes (raw body) → content-addressed blob + new asset version; enqueues `trove.codex` for PDFs |
| `POST /v1/ingest/stub` | Register an on-prem-origin asset without uploading bytes (`{key, sha256, size, mediaType, agentId, path, policy?}`) |
| `POST /v1/materialize?key=` | Materialize: presigned download URL + placement policy/plan + cached codex summary |
| `POST /v1/materialize/recall?key=` | Recall: presign the cloud cache, else dispatch an on-prem pull to a connected agent into presigned staging |
| `POST /v1/materialize/issue-presigned?sha256=` | Presign a blob the tenant owns |
| `POST /v1/retention/delete` | GDPR erase a blob (`{sha256}`): storage object + derivations; blob only if unreferenced |
| `GET\|PUT /v1/agent` | Agent control-channel WebSocket (signed-token handshake) |
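As a rough illustration of calling these routes in shared-bearer mode, the sketch below prepares requests for creating an asset, ingesting bytes, and materializing. All names (`TroveClientConfig`, `PreparedRequest`, the helper functions) are hypothetical, and the `Content-Type` sent on ingest is an assumption, since how trove determines the media type is not stated here.

```typescript
// Illustrative client configuration; not part of trove.
interface TroveClientConfig {
  baseUrl: string; // e.g. http://localhost:8080
  token: string; // the value configured as TROVE_API_TOKEN
  tenantId: string; // tenant UUID sent as X-Trove-Tenant
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body?: string | Uint8Array };
}

function authHeaders(cfg: TroveClientConfig): Record<string, string> {
  return { Authorization: `Bearer ${cfg.token}`, "X-Trove-Tenant": cfg.tenantId };
}

// Build a URL with a properly encoded ?key= parameter.
function withKey(cfg: TroveClientConfig, path: string, key: string): string {
  const url = new URL(path, cfg.baseUrl);
  url.searchParams.set("key", key);
  return url.toString();
}

// POST /v1/assets with a JSON body of {key, policy?}.
function createAsset(cfg: TroveClientConfig, key: string, policy?: string): PreparedRequest {
  return {
    url: new URL("/v1/assets", cfg.baseUrl).toString(),
    init: {
      method: "POST",
      headers: { ...authHeaders(cfg), "Content-Type": "application/json" },
      body: JSON.stringify(policy ? { key, policy } : { key }),
    },
  };
}

// POST /v1/ingest?key= with the raw bytes as the body.
function ingest(
  cfg: TroveClientConfig,
  key: string,
  bytes: Uint8Array,
  mediaType = "application/octet-stream", // assumed header, see lead-in
): PreparedRequest {
  return {
    url: withKey(cfg, "/v1/ingest", key),
    init: { method: "POST", headers: { ...authHeaders(cfg), "Content-Type": mediaType }, body: bytes },
  };
}

// POST /v1/materialize?key= with no body.
function materialize(cfg: TroveClientConfig, key: string): PreparedRequest {
  return { url: withKey(cfg, "/v1/materialize", key), init: { method: "POST", headers: authHeaders(cfg) } };
}
```

Each prepared request can be sent with `fetch(req.url, req.init)`; keeping construction separate from sending makes the requests easy to inspect in tests.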
Every tenant-scoped query runs inside withTenant (Postgres RLS) and filters
tenant_id explicitly (defense in depth).
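One common way to combine Postgres RLS with an explicit filter is to set a transaction-local setting that the RLS policy reads, then still pass the tenant ID in the query. The sketch below illustrates that pattern only. The `Tx` interface, the `runAsTenant` name, and the `app.tenant_id` setting are assumptions and do not describe how trove's `withTenant` is implemented.

```typescript
// Minimal client shape assumed for the sketch (any Postgres driver could back it).
interface Tx {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Run `fn` inside a transaction whose RLS context is pinned to one tenant.
// A policy such as
//   USING (tenant_id = current_setting('app.tenant_id')::uuid)
// would then restrict every statement in the transaction (hypothetical policy).
async function runAsTenant<T>(
  tx: Tx,
  tenantId: string,
  fn: (tx: Tx) => Promise<T>,
): Promise<T> {
  await tx.query("BEGIN");
  try {
    // The third argument `true` makes the setting local to this transaction.
    await tx.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await fn(tx);
    await tx.query("COMMIT");
    return result;
  } catch (err) {
    await tx.query("ROLLBACK");
    throw err;
  }
}
```

Inside `fn`, a query would still include `WHERE tenant_id = $1`, so a misconfigured policy or a role that bypasses RLS does not silently expose other tenants' rows.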
Auth

Tenant-scoped routes pick a mode by environment, in precedence order:

- Per-tenant JWT (production): set `TROVE_JWKS_URL` (verify against a remote JWKS) or `TROVE_JWT_PUBLIC_KEY` (a static PEM public key). trove verifies the `Authorization: Bearer <jwt>` signature plus `iss`/`aud`/`exp` and takes the tenant from the token’s tenant claim (`TROVE_JWT_TENANT_CLAIM`, default `tenant_id`, falling back to `sub`). A present `X-Trove-Tenant` header must match the claim (confused-deputy guard); it is never trusted on its own.
- Shared bearer token: set `TROVE_API_TOKEN`; the `Authorization: Bearer` must match, then `X-Trove-Tenant: <uuid>` names the tenant. Single-tenant / internal use.
- Open (neither set): tenant from `X-Trove-Tenant: <uuid>` only. Dev.
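The precedence rules above can be sketched as a pure function over the environment, together with the header-versus-claim check for JWT mode. Signature and `iss`/`aud`/`exp` verification are deliberately left out. The function names and the `Env` type are hypothetical; the environment variable names and defaults come from this README.

```typescript
type AuthMode = "jwt" | "shared-bearer" | "open";
type Env = Record<string, string | undefined>;

// JWT config wins over the shared token; with neither set the API is open.
function resolveAuthMode(env: Env): AuthMode {
  if (env.TROVE_JWKS_URL || env.TROVE_JWT_PUBLIC_KEY) return "jwt";
  if (env.TROVE_API_TOKEN) return "shared-bearer";
  return "open";
}

// Given already-verified claims, pick the tenant and apply the
// confused-deputy guard: a present header must match the claim.
function tenantFromClaims(
  claims: Record<string, unknown>,
  env: Env,
  headerTenant?: string,
): string {
  const claimName = env.TROVE_JWT_TENANT_CLAIM ?? "tenant_id";
  const tenant = claims[claimName] ?? claims.sub;
  if (typeof tenant !== "string" || tenant === "") {
    throw new Error("token carries no tenant");
  }
  if (headerTenant !== undefined && headerTenant !== tenant) {
    throw new Error("X-Trove-Tenant does not match the token's tenant");
  }
  return tenant;
}
```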
Layout

    apps/api                 Hono HTTP API (asset/agent/ingest/materialize + health/contract)
    apps/worker              pg-boss worker: trove.codex derivation (pull/return, staging GC: WIP)
    packages/core            shared types, placement vocabulary + resolver, RFC 7807 errors
    packages/db              Drizzle schema + RLS helpers + migrations (advisory-locked runner)
    packages/storage         R2/S3 + localfs storage drivers (presigned data plane)
    packages/derivation      codex extractor seam + cache-first deriveCodex
    packages/agent-protocol  control-channel message catalog (TS source of truth)

Configuration
| Env | Purpose |
|---|---|
| `DATABASE_URL` | Postgres (RLS-enforced; app role should be `NOBYPASSRLS`) |
| `PORT` | API port (default 8080) |
| `LOG_LEVEL` | pino level for structured JSON logs (default `info`; `silent` in tests) |
| `TROVE_RATE_LIMIT_RPS` / `_BURST` | Per-tenant token-bucket rate limit on tenant routes (defaults 50 rps / 100 burst; in-memory, per-replica). `TROVE_RATE_LIMIT_DISABLED=true` turns it off |
| `TROVE_JWKS_URL` | Per-tenant JWT mode: remote JWKS to verify bearer tokens against |
| `TROVE_JWT_PUBLIC_KEY` | Per-tenant JWT mode: static PEM public key (alt. to JWKS) |
| `TROVE_JWT_ISSUER` / `TROVE_JWT_AUDIENCE` | Expected `iss` / `aud` (enforced when set) |
| `TROVE_JWT_TENANT_CLAIM` / `TROVE_JWT_ALG` | Tenant claim name (default `tenant_id`) / key alg (default `RS256`) |
| `TROVE_API_TOKEN` | Shared-bearer mode: a single static token (`X-Trove-Tenant` names the tenant) |
| `TROVE_STORAGE_DRIVER` | `s3` or `localfs` (default) |
| `TROVE_S3_ENDPOINT` / `_BUCKET` / `_REGION` / `_ACCESS_KEY_ID` / `_SECRET_ACCESS_KEY` / `_FORCE_PATH_STYLE` | R2/S3 config |
| `TROVE_LOCAL_STORAGE_ROOT` | localfs root (default `./data/storage`) |
| `TROVE_PUBLIC_URL` | Public origin the localfs signed data plane points back at (default `http://localhost:$PORT`) |
| `TROVE_DATA_SIGNING_KEY` | HMAC key(s) (hex/base64, 32 bytes) for `/v1/data` signed URLs; required across replicas (else a per-process ephemeral key is used). Comma-separated for rotation: first signs, all verify |
| `CODEX_API_BASE_URL` / `CODEX_API_TOKEN` / `CODEX_VERSION` | codex for the derivation worker (worker self-skips if unset) |
| `TROVE_PRESIGN_TTL` / `TROVE_STAGING_TTL` | presigned URL / staging cache TTL seconds (default 3600) |
| `TROVE_MAX_UPLOAD_BYTES` | max in-process upload size for ingest / `/v1/data` PUT (default 100 MiB; oversized → 413) |
| `TROVE_MULTIPART_PART_SIZE` | recall files larger than this in resumable multipart parts of this size (default 16 MiB; S3 requires ≥5 MiB; needs an S3/R2 driver + a multipart-capable agent) |
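The per-tenant rate limit is described as an in-memory token bucket. A minimal sketch of that technique, using the documented defaults of 50 rps and 100 burst, might look like the following. The `TokenBucket` class and `allow` function are illustrative and are not trove's middleware.

```typescript
// Classic token bucket: refills at `rps` tokens per second up to `burst`.
class TokenBucket {
  private tokens: number;
  private lastMs: number;
  private readonly rps: number;
  private readonly burst: number;

  constructor(rps: number, burst: number, nowMs: number) {
    this.rps = rps;
    this.burst = burst;
    this.tokens = burst; // start full so an initial burst is allowed
    this.lastMs = nowMs;
  }

  // Refill for the elapsed time, then spend one token if available.
  tryTake(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.rps);
    this.lastMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per tenant, held in process memory (so limits are per replica).
const buckets = new Map<string, TokenBucket>();

function allow(tenantId: string, nowMs: number, rps = 50, burst = 100): boolean {
  let bucket = buckets.get(tenantId);
  if (!bucket) {
    bucket = new TokenBucket(rps, burst, nowMs);
    buckets.set(tenantId, bucket);
  }
  return bucket.tryTake(nowMs);
}
```

Because the map lives in one process, running N replicas effectively multiplies a tenant's allowance by up to N, which matches the "in-memory, per-replica" caveat in the table.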
Deploy
Railway: the API (apps/api, pnpm --filter @trove/api start) and worker
(apps/worker) services, a Postgres add-on, and an R2 bucket. Run migrations at
boot (pnpm --filter @trove/db db:migrate); the runner is advisory-locked so
multiple replicas migrate safely.
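An advisory-locked migration runner typically wraps the migration loop in a Postgres session-level advisory lock, so replicas that boot at the same time apply migrations one after another. The sketch below shows that pattern under stated assumptions: the `SqlClient` interface, the lock ID, and `migrateWithLock` are hypothetical, and tracking of already-applied migrations is omitted.

```typescript
// Minimal client shape assumed for the sketch.
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Arbitrary application-chosen lock ID (an assumption for this example).
const MIGRATION_LOCK_ID = 727683;

async function migrateWithLock(db: SqlClient, migrations: string[]): Promise<void> {
  // pg_advisory_lock blocks until the lock is free, serializing replicas.
  await db.query("SELECT pg_advisory_lock($1)", [MIGRATION_LOCK_ID]);
  try {
    for (const sql of migrations) {
      await db.query(sql);
    }
  } finally {
    // Always release, even if a migration throws.
    await db.query("SELECT pg_advisory_unlock($1)", [MIGRATION_LOCK_ID]);
  }
}
```

A replica that loses the race simply waits on the lock. A real runner would then consult its migrations table and find nothing left to apply.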
Operating it in production — secret rotation, backup/restore, alerting, and
incident playbooks — is documented in docs/RUNBOOK.md.
Develop

    pnpm install
    pnpm lint && pnpm build && pnpm typecheck && pnpm test

DB-backed tests skip without DATABASE_URL and run against it when set (CI
provides a Postgres service). Test tasks run serially (--concurrency=1) so the
integration tests don’t contend on the shared database. The S3Driver is
covered against MinIO when TROVE_S3_TEST_ENDPOINT is set (CI starts one).
Live agent recall e2e
scripts/e2e-agent-recall.sh runs a full binary-level on-prem recall: a real
trove server (S3/MinIO byte plane) + the real Go trove-agent
binary, moving bytes over a real presigned URL. It registers an agent,
stub-ingests an on-prem file, recalls it, and asserts the bytes match.

    pnpm build   # apps/api/dist must exist
    # needs: a Postgres in DATABASE_URL, a MinIO at TROVE_S3_ENDPOINT, the Go
    # toolchain, and the trove-agent source (TROVE_AGENT_DIR, default ../trove-agent)
    DATABASE_URL=postgres://… TROVE_AGENT_DIR=../trove-agent bash scripts/e2e-agent-recall.sh

In CI it runs as the `e2e — live Go agent recall` job. That job checks out the
private trove-agent repo, so a maintainer must add a TROVE_AGENT_PAT Actions
secret (a fine-grained PAT with read access to printwithsynergy/trove-agent);
until then the job cleanly self-skips rather than failing.
License: AGPL-3.0-or-later.