trove

The content-addressed digital-asset plane for the Print With Synergy tools: assets are held either fully in cloud storage or as cloud stubs whose real bytes stay on an on-site store and are recalled on demand by the outbound trove-agent.

A Tier-1 / platform-class primitive (a format-agnostic asset plane, not PDF-specific) — hence no pdf suffix.

Model

blob — content-addressed bytes (dedupe + integrity unit, keyed by sha256).
asset — a logical, named, versioned, tenant-scoped handle over blobs. Write-back creates a new version; blobs are never mutated.
placement — where a blob physically lives (cloud / on-prem), one row per copy.
derivation — cached deterministic producer output (codex first) keyed by (blob, producer, version).
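
As an illustration of the blob/asset split, here is a minimal sketch in TypeScript. The type and field names (`BlobRecord`, `AssetVersion`, `writeBack`) are assumptions made for this example, not trove's actual schema; only the sha256 content address and the "new version, never mutate" rule come from the model above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only; not trove's real schema.
type Sha256Hex = string;

interface BlobRecord {
  sha256: Sha256Hex; // content address: dedupe + integrity unit
  size: number;
}

interface AssetVersion {
  version: number;
  blob: Sha256Hex;
}

interface Asset {
  tenantId: string;
  key: string;
  versions: AssetVersion[];
}

// The content address is the lowercase hex SHA-256 of the bytes.
function contentAddress(bytes: Uint8Array): Sha256Hex {
  return createHash("sha256").update(bytes).digest("hex");
}

function toBlobRecord(bytes: Uint8Array): BlobRecord {
  return { sha256: contentAddress(bytes), size: bytes.byteLength };
}

// Write-back appends a version pointing at a (possibly new) blob.
// Existing versions and blobs are left untouched.
function writeBack(asset: Asset, bytes: Uint8Array): Asset {
  const next: AssetVersion = {
    version: asset.versions.length + 1,
    blob: contentAddress(bytes),
  };
  return { ...asset, versions: [...asset.versions, next] };
}
```

Identical bytes always hash to the same address, so two assets that carry the same file share one blob.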

Placement policy (all four supported, default onprem-origin+cloud-cache):

Policy	Bytes	Cloud copy
`cloud-only`	cloud is origin	durable
`onprem-origin+cloud-cache`	on-prem origin	TTL cache after first materialize
`stub-only`	on-prem only	metadata/derived-facts only
`mirror`	both	durable both sides
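
The table can be read as a lookup from policy to plan. The sketch below encodes it that way; the `PlacementPlan` shape and the `resolvePlacement` name are hypothetical and only mirror the three columns above, they are not the resolver shipped in packages/core.

```typescript
type PlacementPolicy =
  | "cloud-only"
  | "onprem-origin+cloud-cache"
  | "stub-only"
  | "mirror";

// Hypothetical plan shape; field names are assumptions for illustration.
interface PlacementPlan {
  origin: "cloud" | "onprem" | "both";
  cloudCopy: "durable" | "ttl-cache" | "metadata-only";
}

// The documented default when an asset is created without a policy.
const DEFAULT_POLICY: PlacementPolicy = "onprem-origin+cloud-cache";

// One entry per row of the policy table.
const PLANS: Record<PlacementPolicy, PlacementPlan> = {
  "cloud-only": { origin: "cloud", cloudCopy: "durable" },
  "onprem-origin+cloud-cache": { origin: "onprem", cloudCopy: "ttl-cache" },
  "stub-only": { origin: "onprem", cloudCopy: "metadata-only" },
  mirror: { origin: "both", cloudCopy: "durable" },
};

function resolvePlacement(policy?: PlacementPolicy): PlacementPlan {
  return PLANS[policy ?? DEFAULT_POLICY];
}
```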

Derived-facts cache (codex-on-ingest)

Because the model is content-addressed and codex is deterministic, codex(blobHash, codexVersion) is a pure function of the bytes. Every PDF blob’s codex facts are extracted once (lazily, on first materialize) and cached — the full facts as their own content-addressed blob, plus a small queryable summary. A changed file is a new blob = cache miss; a codex bump changes the producer version = cache miss. The summary is answerable even for stub-only assets without recalling the bytes.
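
The cache-first behaviour described above can be sketched as follows. This is a simplified, in-memory model under stated assumptions: the `CodexFacts` shape and the `makeDeriveCodex` factory are hypothetical, and the extractor is synchronous here to keep the example short, whereas a real one would call the codex API asynchronously and persist its results.

```typescript
// Hypothetical facts shape; real codex output is richer.
interface CodexFacts {
  pageCount: number;
}

type Extractor = (blobHash: string) => CodexFacts;

// Returns a cache-first derive function keyed by (blob, producer, version).
function makeDeriveCodex(extract: Extractor) {
  const cache = new Map<string, CodexFacts>();

  return function deriveCodex(blobHash: string, codexVersion: string): CodexFacts {
    const cacheKey = `${blobHash}:codex:${codexVersion}`;
    const hit = cache.get(cacheKey);
    if (hit !== undefined) return hit; // same bytes + same version: reuse

    // New blob hash or bumped codex version: extract once, then cache.
    const facts = extract(blobHash);
    cache.set(cacheKey, facts);
    return facts;
  };
}
```

Because the key contains both the blob hash and the producer version, neither a changed file nor a codex upgrade can return stale facts.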

HTTP API

Operational (open): GET /healthz, GET /readyz, GET /v1/contract, and GET /metrics (Prometheus). The self-hosted data plane GET|PUT /v1/data is also open — it is authorized by a short-lived HMAC signature on the URL itself (see TROVE_DATA_SIGNING_KEY), the localfs equivalent of an S3 presigned URL.
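
To make the signed-URL idea concrete, here is a minimal sketch of HMAC signing and verification with Node's crypto module. The query parameter names (`key`, `exp`, `sig`), the canonical string layout, and the function names are assumptions for illustration; trove's actual wire format may differ. The key list follows the rotation rule from the configuration table: the first key signs, every key verifies.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";
import { URL } from "node:url";

// Assumed canonical string: method, path, object key, expiry (newline-joined).
function mac(key: Buffer, method: string, path: string, objectKey: string, exp: number): Buffer {
  return createHmac("sha256", key)
    .update([method, path, objectKey, String(exp)].join("\n"))
    .digest();
}

// The first configured key signs.
function signDataUrl(
  keys: Buffer[],
  publicUrl: string,
  method: "GET" | "PUT",
  objectKey: string,
  ttlSeconds: number,
  nowSeconds: number,
): string {
  const signingKey = keys[0];
  if (signingKey === undefined) throw new Error("no signing key configured");
  const url = new URL("/v1/data", publicUrl);
  const exp = nowSeconds + ttlSeconds;
  url.searchParams.set("key", objectKey);
  url.searchParams.set("exp", String(exp));
  url.searchParams.set("sig", mac(signingKey, method, url.pathname, objectKey, exp).toString("hex"));
  return url.toString();
}

// Any configured key may verify, so URLs signed before a rotation stay valid.
function verifyDataUrl(keys: Buffer[], method: string, signedUrl: string, nowSeconds: number): boolean {
  const url = new URL(signedUrl);
  const objectKey = url.searchParams.get("key") ?? "";
  const exp = Number(url.searchParams.get("exp") ?? "NaN");
  const given = Buffer.from(url.searchParams.get("sig") ?? "", "hex");
  if (!Number.isInteger(exp) || exp < nowSeconds) return false; // missing or expired
  return keys.some((key) => {
    const expected = mac(key, method, url.pathname, objectKey, exp);
    // timingSafeEqual requires equal lengths, so check that first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

Binding the method and expiry into the signature means a GET link cannot be replayed as a PUT or reused after its TTL.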

Tenant-scoped routes resolve the calling tenant in one of three modes (see Auth below). Errors are RFC 7807 application/problem+json.

Method + path	Purpose
`POST /v1/assets`	Create an asset (`{key, policy?}`); `409` on duplicate key
`GET /v1/assets`	List the tenant’s assets
`GET /v1/assets/by-key?key=`	Fetch one asset
`POST /v1/agents`	Register an on-prem agent (`{name, fingerprint}`)
`GET /v1/agents`	List registered agents
`POST /v1/ingest?key=`	Upload bytes (raw body) → content-addressed blob + new asset version; enqueues `trove.codex` for PDFs
`POST /v1/ingest/stub`	Register an on-prem-origin asset without uploading bytes (`{key, sha256, size, mediaType, agentId, path, policy?}`)
`POST /v1/materialize?key=`	Materialize: presigned download URL + placement policy/plan + cached codex `summary`
`POST /v1/materialize/recall?key=`	Recall: presign the cloud cache, else dispatch an on-prem `pull` to a connected agent into presigned staging
`POST /v1/materialize/issue-presigned?sha256=`	Presign a blob the tenant owns
`POST /v1/retention/delete`	GDPR erase a blob (`{sha256}`): storage object + derivations; blob only if unreferenced
`GET|PUT /v1/agent`	Agent control-channel WebSocket (signed-token handshake)
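
For reference, an RFC 7807 error for the duplicate-key case of `POST /v1/assets` might look like the sketch below. The member names come from RFC 7807; the helper names and the exact `type`, `detail`, and `instance` values are assumptions, not trove's actual payload.

```typescript
// Standard RFC 7807 members.
interface ProblemDetails {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
}

interface HttpResponseParts {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: the 409 returned when an asset key already exists.
function duplicateAssetKey(key: string): ProblemDetails {
  return {
    type: "about:blank", // with about:blank, title is the HTTP status phrase
    title: "Conflict",
    status: 409,
    detail: `An asset with key "${key}" already exists.`,
    instance: "/v1/assets",
  };
}

// Serialize a problem with the media type RFC 7807 defines.
function problemResponse(problem: ProblemDetails): HttpResponseParts {
  return {
    status: problem.status,
    headers: { "content-type": "application/problem+json" },
    body: JSON.stringify(problem),
  };
}
```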

Every tenant-scoped query runs inside withTenant (Postgres RLS) and filters tenant_id explicitly (defense in depth).
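
A rough sketch of that pattern, assuming a generic SQL client rather than trove's Drizzle helpers. The `SqlClient` interface, the `app.tenant_id` setting name, and the `assets` table and column names are all hypothetical; `set_config(name, value, is_local)` is the standard Postgres function, and passing `true` scopes the setting to the current transaction.

```typescript
// Minimal client shape so the sketch stays driver-agnostic (an assumption).
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Run fn inside a transaction whose RLS context is pinned to one tenant.
async function withTenant<T>(
  db: SqlClient,
  tenantId: string,
  fn: (tx: SqlClient) => Promise<T>,
): Promise<T> {
  await db.query("BEGIN");
  try {
    // Transaction-local setting that RLS policies can read via current_setting().
    await db.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await fn(db);
    await db.query("COMMIT");
    return result;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}

// Defense in depth: RLS scopes the rows, and the query filters tenant_id too.
async function listAssetKeys(db: SqlClient, tenantId: string): Promise<unknown[]> {
  const result = await withTenant(db, tenantId, (tx) =>
    tx.query("SELECT key FROM assets WHERE tenant_id = $1", [tenantId]),
  );
  return result.rows;
}
```

If either layer is misconfigured (a missing policy, or a forgotten WHERE clause), the other still keeps tenants apart.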

Auth

Tenant-scoped routes pick a mode by environment, in precedence order:

Per-tenant JWT (production) — set TROVE_JWKS_URL (verify against a remote JWKS) or TROVE_JWT_PUBLIC_KEY (a static PEM public key). trove verifies the Authorization: Bearer <jwt> signature plus iss/aud/exp and takes the tenant from the token’s tenant claim (TROVE_JWT_TENANT_CLAIM, default tenant_id, falling back to sub). A present X-Trove-Tenant header must match the claim (confused-deputy guard); it is never trusted on its own.
Shared bearer token — set TROVE_API_TOKEN; the Authorization: Bearer must match, then X-Trove-Tenant: <uuid> names the tenant. Single-tenant / internal use.
Open (neither set) — tenant from X-Trove-Tenant: <uuid> only. Dev.
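
The precedence order and the confused-deputy guard can be sketched as below. Function names are hypothetical, and JWT signature and `iss`/`aud`/`exp` verification is deliberately left out: the second function assumes it receives claims from a token that has already been verified.

```typescript
type AuthMode = "jwt" | "shared-bearer" | "open";
type Env = Record<string, string | undefined>;

// Mode selection by environment, in the documented precedence order.
function resolveAuthMode(env: Env): AuthMode {
  if (env["TROVE_JWKS_URL"] || env["TROVE_JWT_PUBLIC_KEY"]) return "jwt";
  if (env["TROVE_API_TOKEN"]) return "shared-bearer";
  return "open";
}

// Tenant resolution for JWT mode, given already-verified claims.
function tenantFromClaims(
  claims: Record<string, unknown>,
  headerTenant: string | undefined,
  claimName: string = "tenant_id",
): string {
  // Configured claim first, falling back to `sub`.
  const raw = claims[claimName] ?? claims["sub"];
  if (typeof raw !== "string" || raw === "") {
    throw new Error("token carries no tenant");
  }
  // Confused-deputy guard: a present header must agree with the token.
  if (headerTenant !== undefined && headerTenant !== raw) {
    throw new Error("X-Trove-Tenant does not match the token's tenant");
  }
  return raw;
}
```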

Layout

apps/api        Hono HTTP API (asset/agent/ingest/materialize + health/contract)
apps/worker     pg-boss worker: trove.codex derivation (pull/return, staging GC: WIP)
packages/core   shared types, placement vocabulary + resolver, RFC 7807 errors
packages/db     Drizzle schema + RLS helpers + migrations (advisory-locked runner)
packages/storage         R2/S3 + localfs storage drivers (presigned data plane)
packages/derivation      codex extractor seam + cache-first deriveCodex
packages/agent-protocol  control-channel message catalog (TS source of truth)

Configuration

Env	Purpose
`DATABASE_URL`	Postgres (RLS-enforced; app role should be `NOBYPASSRLS`)
`PORT`	API port (default 8080)
`LOG_LEVEL`	pino level for structured JSON logs (default `info`; `silent` in tests)
`TROVE_RATE_LIMIT_RPS` / `_BURST`	Per-tenant token-bucket rate limit on tenant routes (defaults 50 rps / 100 burst; in-memory, per-replica). `TROVE_RATE_LIMIT_DISABLED=true` turns it off
`TROVE_JWKS_URL`	Per-tenant JWT mode: remote JWKS to verify bearer tokens against
`TROVE_JWT_PUBLIC_KEY`	Per-tenant JWT mode: static PEM public key (alt. to JWKS)
`TROVE_JWT_ISSUER` / `TROVE_JWT_AUDIENCE`	Expected `iss` / `aud` (enforced when set)
`TROVE_JWT_TENANT_CLAIM` / `TROVE_JWT_ALG`	Tenant claim name (default `tenant_id`) / key alg (default `RS256`)
`TROVE_API_TOKEN`	Shared-bearer mode: a single static token (`X-Trove-Tenant` names the tenant)
`TROVE_STORAGE_DRIVER`	`s3` or `localfs` (default)
`TROVE_S3_ENDPOINT` / `_BUCKET` / `_REGION` / `_ACCESS_KEY_ID` / `_SECRET_ACCESS_KEY` / `_FORCE_PATH_STYLE`	R2/S3 config
`TROVE_LOCAL_STORAGE_ROOT`	localfs root (default `./data/storage`)
`TROVE_PUBLIC_URL`	Public origin the localfs signed data plane points back at (default `http://localhost:$PORT`)
`TROVE_DATA_SIGNING_KEY`	HMAC key(s) (hex/base64, 32 bytes) for `/v1/data` signed URLs; required across replicas (else a per-process ephemeral key is used). Comma-separated for rotation: first signs, all verify
`CODEX_API_BASE_URL` / `CODEX_API_TOKEN` / `CODEX_VERSION`	codex for the derivation worker (worker self-skips if unset)
`TROVE_PRESIGN_TTL` / `TROVE_STAGING_TTL`	presigned URL / staging cache TTL seconds (default 3600)
`TROVE_MAX_UPLOAD_BYTES`	max in-process upload size for ingest / `/v1/data` PUT (default 100 MiB; oversized → 413)
`TROVE_MULTIPART_PART_SIZE`	files larger than this are recalled in resumable multipart parts of this size (default 16 MiB; S3 requires ≥5 MiB; needs an S3/R2 driver + a multipart-capable agent)
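
As an illustration of the per-tenant token bucket behind `TROVE_RATE_LIMIT_RPS` / `_BURST`, here is a minimal in-memory sketch. The class and function names are hypothetical and the real middleware may differ; the defaults (50 rps, 100 burst) are the documented ones. Time is passed in explicitly so the behaviour is easy to reason about.

```typescript
// One bucket per tenant: refills at `rps` tokens/second, capped at `burst`.
class TokenBucket {
  private tokens: number;
  private lastMs: number;
  private readonly rps: number;
  private readonly burst: number;

  constructor(rps: number, burst: number, nowMs: number) {
    this.rps = rps;
    this.burst = burst;
    this.tokens = burst; // start full
    this.lastMs = nowMs;
  }

  tryTake(nowMs: number): boolean {
    // Refill for the elapsed time, never exceeding the burst capacity.
    const elapsedSeconds = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.rps);
    this.lastMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// In-memory and per-process, so each replica enforces its own limit.
const buckets = new Map<string, TokenBucket>();

function allowRequest(tenantId: string, nowMs: number, rps = 50, burst = 100): boolean {
  let bucket = buckets.get(tenantId);
  if (bucket === undefined) {
    bucket = new TokenBucket(rps, burst, nowMs);
    buckets.set(tenantId, bucket);
  }
  return bucket.tryTake(nowMs);
}
```

Because the state lives in each process, the effective ceiling for a tenant scales with the number of API replicas.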

Deploy

Railway: the API (apps/api, pnpm --filter @trove/api start) and worker (apps/worker) services, a Postgres add-on, and an R2 bucket. Run migrations at boot (pnpm --filter @trove/db db:migrate); the runner is advisory-locked so multiple replicas migrate safely.

Operating it in production — secret rotation, backup/restore, alerting, and incident playbooks — is documented in docs/RUNBOOK.md.

Develop

corepack enable && corepack prepare pnpm@<version> --activate
pnpm install
pnpm lint && pnpm build && pnpm typecheck && pnpm test

DB-backed tests skip without DATABASE_URL and run against it when set (CI provides a Postgres service). Test tasks run serially (--concurrency=1) so the integration tests don’t contend on the shared database. The S3Driver is covered against MinIO when TROVE_S3_TEST_ENDPOINT is set (CI starts one).

Live agent recall e2e

scripts/e2e-agent-recall.sh runs a full binary-level on-prem recall: a real trove server (S3/MinIO byte plane) + the real Go trove-agent binary, moving bytes over a real presigned URL. It registers an agent, stub-ingests an on-prem file, recalls it, and asserts the bytes match.

pnpm build                 # apps/api/dist must exist
# needs: a Postgres in DATABASE_URL, a MinIO at TROVE_S3_ENDPOINT, the Go
# toolchain, and the trove-agent source (TROVE_AGENT_DIR, default ../trove-agent)
DATABASE_URL=postgres://… TROVE_AGENT_DIR=../trove-agent bash scripts/e2e-agent-recall.sh

In CI it runs as the e2e — live Go agent recall job. That job checks out the private trove-agent repo, so a maintainer must add a TROVE_AGENT_PAT Actions secret (a fine-grained PAT with read access to printwithsynergy/trove-agent); until then the job cleanly self-skips rather than failing.

License: AGPL-3.0-or-later.