trove

The content-addressed digital-asset plane for the Print With Synergy tools. An asset's bytes live either in cloud storage or on an on-site store behind a cloud stub, from which the outbound trove-agent recalls them on demand.

A Tier-1 / platform-class primitive (a format-agnostic asset plane, not PDF-specific) — hence no pdf suffix.

  • blob — content-addressed bytes (dedupe + integrity unit, keyed by sha256).

  • asset — a logical, named, versioned, tenant-scoped handle over blobs. Write-back creates a new version; blobs are never mutated.

  • placement — where a blob physically lives (cloud / on-prem), one row per copy.

  • derivation — cached deterministic producer output (codex first) keyed by (blob, producer, version).

  • Placement policy (all four supported, default onprem-origin+cloud-cache):

    | Policy | Bytes | Cloud copy |
    | --- | --- | --- |
    | cloud-only | cloud is origin | durable |
    | onprem-origin+cloud-cache | on-prem origin | TTL cache after first materialize |
    | stub-only | on-prem only | metadata/derived-facts only |
    | mirror | both | durable both sides |
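
The policy table can be read as a small lookup. The following is a minimal sketch, not trove's actual types: the names `PlacementPolicy`, `PolicyTraits`, `POLICY_TRAITS`, and `needsRecall` are hypothetical, and the real placement vocabulary and resolver in packages/core may be shaped differently.

```typescript
// Hypothetical names for illustration; not trove's actual API.
type PlacementPolicy =
  | "cloud-only"
  | "onprem-origin+cloud-cache"
  | "stub-only"
  | "mirror";

interface PolicyTraits {
  origin: "cloud" | "on-prem" | "both"; // where the bytes of record live
  cloudCopy: "durable" | "ttl-cache" | "none"; // what the cloud side holds
}

// One entry per row of the policy table.
const POLICY_TRAITS: Record<PlacementPolicy, PolicyTraits> = {
  "cloud-only": { origin: "cloud", cloudCopy: "durable" },
  "onprem-origin+cloud-cache": { origin: "on-prem", cloudCopy: "ttl-cache" },
  "stub-only": { origin: "on-prem", cloudCopy: "none" },
  mirror: { origin: "both", cloudCopy: "durable" },
};

const DEFAULT_POLICY: PlacementPolicy = "onprem-origin+cloud-cache";

// Assumption: reading bytes needs an on-prem recall whenever the cloud
// holds no usable copy (no copy at all, or a TTL cache that is cold).
function needsRecall(policy: PlacementPolicy, cloudCacheWarm: boolean): boolean {
  const { cloudCopy } = POLICY_TRAITS[policy];
  if (cloudCopy === "durable") return false;
  if (cloudCopy === "ttl-cache") return !cloudCacheWarm;
  return true;
}
```

Under these assumptions, `needsRecall(DEFAULT_POLICY, false)` is true on the first materialize and false once the TTL cache is warm.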

Because the model is content-addressed and codex is deterministic, codex(blobHash, codexVersion) is a pure function of the bytes. Every PDF blob’s codex facts are extracted once (lazily, on first materialize) and cached — the full facts as their own content-addressed blob, plus a small queryable summary. A changed file is a new blob = cache miss; a codex bump changes the producer version = cache miss. The summary is answerable even for stub-only assets without recalling the bytes.
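
The cache behaviour described above can be sketched as a lookup keyed by (blob hash, producer, version). This is an illustrative sketch only: `deriveCached`, the in-memory `Map`, and the `extractions` counter are hypothetical stand-ins for the real derivation table and the `deriveCodex` function in packages/derivation, whose signature is not shown in this document.

```typescript
import { createHash } from "node:crypto";

type Facts = Record<string, unknown>;
type Extractor = (bytes: Uint8Array) => Facts;

const sha256Hex = (bytes: Uint8Array): string =>
  createHash("sha256").update(bytes).digest("hex");

// In-memory stand-in for the derivation cache (hypothetical).
const derivations = new Map<string, Facts>();

// Counts real extractor runs so cache hits and misses are observable.
let extractions = 0;

function deriveCached(
  bytes: Uint8Array,
  producer: string,
  version: string,
  extract: Extractor,
): Facts {
  // The key depends only on the bytes, the producer, and its version.
  const key = `${sha256Hex(bytes)}:${producer}:${version}`;
  const hit = derivations.get(key);
  if (hit !== undefined) return hit; // cache hit: no extraction
  extractions += 1;
  const facts = extract(bytes);
  derivations.set(key, facts);
  return facts;
}
```

Calling `deriveCached` twice with the same bytes and version runs the extractor once; changing either the bytes or the version produces a different key and therefore a miss.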

Operational (open): GET /healthz, GET /readyz, GET /v1/contract, and GET /metrics (Prometheus). The self-hosted data plane GET|PUT /v1/data is also open — it is authorized by a short-lived HMAC signature on the URL itself (see TROVE_DATA_SIGNING_KEY), the localfs equivalent of an S3 presigned URL.
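
To illustrate how a short-lived HMAC signature on the URL can authorize the data plane, here is a minimal sketch using Node's `node:crypto`. The signed string layout, the `exp` and `sig` query parameters, and the function names are assumptions made for illustration; this document does not specify trove's actual signing format. The key list mirrors the rotation rule for TROVE_DATA_SIGNING_KEY: the first key signs, all keys verify.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed canonical string: method, path, and expiry joined by newlines.
function signature(key: Buffer, method: string, path: string, exp: number): string {
  return createHmac("sha256", key).update(`${method}\n${path}\n${exp}`).digest("hex");
}

function signUrl(
  keys: Buffer[],
  origin: string,
  method: string,
  path: string,
  ttlSeconds: number,
  nowSeconds: number,
): string {
  const [signingKey] = keys; // rotation: the first key signs
  if (signingKey === undefined) throw new Error("no signing key configured");
  const exp = nowSeconds + ttlSeconds;
  const url = new URL(path, origin);
  url.searchParams.set("exp", String(exp));
  url.searchParams.set("sig", signature(signingKey, method, path, exp));
  return url.toString();
}

function verifyUrl(keys: Buffer[], method: string, rawUrl: string, nowSeconds: number): boolean {
  const url = new URL(rawUrl);
  const exp = Number(url.searchParams.get("exp"));
  if (!Number.isFinite(exp) || exp < nowSeconds) return false; // missing or expired
  const given = Buffer.from(url.searchParams.get("sig") ?? "", "hex");
  // Rotation: any configured key may verify. Compare in constant time.
  return keys.some((key) => {
    const expected = Buffer.from(signature(key, method, url.pathname, exp), "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

Because the method is part of the signed string, a URL signed for GET does not verify for PUT. The sketch assumes plain ASCII paths, so that the path passed to `signUrl` equals `url.pathname` at verification time.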

Tenant-scoped routes resolve the calling tenant in one of three modes (see Auth below). Errors are RFC 7807 application/problem+json.

| Method + path | Purpose |
| --- | --- |
| POST /v1/assets | Create an asset ({key, policy?}); 409 on duplicate key |
| GET /v1/assets | List the tenant’s assets |
| GET /v1/assets/by-key?key= | Fetch one asset |
| POST /v1/agents | Register an on-prem agent ({name, fingerprint}) |
| GET /v1/agents | List registered agents |
| POST /v1/ingest?key= | Upload bytes (raw body) → content-addressed blob + new asset version; enqueues trove.codex for PDFs |
| POST /v1/ingest/stub | Register an on-prem-origin asset without uploading bytes ({key, sha256, size, mediaType, agentId, path, policy?}) |
| POST /v1/materialize?key= | Materialize: presigned download URL + placement policy/plan + cached codex summary |
| POST /v1/materialize/recall?key= | Recall: presign the cloud cache, else dispatch an on-prem pull to a connected agent into presigned staging |
| POST /v1/materialize/issue-presigned?sha256= | Presign a blob the tenant owns |
| POST /v1/retention/delete | GDPR erase a blob ({sha256}): storage object + derivations; blob only if unreferenced |
| GET\|PUT /v1/agent | Agent control-channel WebSocket (signed-token handshake) |

Every tenant-scoped query runs inside withTenant (Postgres RLS) and filters tenant_id explicitly (defense in depth).

Tenant-scoped routes pick a mode by environment, in precedence order:

  1. Per-tenant JWT (production) — set TROVE_JWKS_URL (verify against a remote JWKS) or TROVE_JWT_PUBLIC_KEY (a static PEM public key). trove verifies the Authorization: Bearer <jwt> signature plus iss/aud/exp and takes the tenant from the token’s tenant claim (TROVE_JWT_TENANT_CLAIM, default tenant_id, falling back to sub). A present X-Trove-Tenant header must match the claim (confused-deputy guard); it is never trusted on its own.
  2. Shared bearer token — set TROVE_API_TOKEN; the Authorization: Bearer must match, then X-Trove-Tenant: <uuid> names the tenant. Single-tenant / internal use.
  3. Open (neither set) — tenant from X-Trove-Tenant: <uuid> only. Dev.
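
The precedence order and the confused-deputy guard can be sketched as two small pure functions. The names `resolveAuthMode` and `tenantFromClaims` are hypothetical, and JWT signature and iss/aud/exp verification are deliberately left out; the sketch only shows how the mode is chosen from the environment and how the tenant is taken from already-verified claims.

```typescript
type AuthMode = "jwt" | "shared-bearer" | "open";

interface AuthEnv {
  TROVE_JWKS_URL?: string;
  TROVE_JWT_PUBLIC_KEY?: string;
  TROVE_API_TOKEN?: string;
}

// Precedence: per-tenant JWT, then shared bearer token, then open (dev).
function resolveAuthMode(env: AuthEnv): AuthMode {
  if (env.TROVE_JWKS_URL || env.TROVE_JWT_PUBLIC_KEY) return "jwt";
  if (env.TROVE_API_TOKEN) return "shared-bearer";
  return "open";
}

// JWT mode: the tenant comes from the token; the header may only confirm it.
// `claims` is assumed to come from a token whose signature was already verified.
function tenantFromClaims(
  claims: Record<string, unknown>,
  claimName: string, // TROVE_JWT_TENANT_CLAIM, default "tenant_id"
  headerTenant?: string, // X-Trove-Tenant, if present
): string {
  const raw = claims[claimName] ?? claims["sub"]; // fall back to sub
  if (typeof raw !== "string" || raw === "") {
    throw new Error("token carries no tenant");
  }
  if (headerTenant !== undefined && headerTenant !== raw) {
    throw new Error("X-Trove-Tenant does not match the token's tenant");
  }
  return raw;
}
```

With both TROVE_JWKS_URL and TROVE_API_TOKEN set, `resolveAuthMode` returns `"jwt"`, matching the documented precedence.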

apps/api                 Hono HTTP API (asset/agent/ingest/materialize + health/contract)
apps/worker              pg-boss worker: trove.codex derivation (pull/return, staging GC: WIP)
packages/core            shared types, placement vocabulary + resolver, RFC 7807 errors
packages/db              Drizzle schema + RLS helpers + migrations (advisory-locked runner)
packages/storage         R2/S3 + localfs storage drivers (presigned data plane)
packages/derivation      codex extractor seam + cache-first deriveCodex
packages/agent-protocol  control-channel message catalog (TS source of truth)

| Env | Purpose |
| --- | --- |
| DATABASE_URL | Postgres (RLS-enforced; app role should be NOBYPASSRLS) |
| PORT | API port (default 8080) |
| LOG_LEVEL | pino level for structured JSON logs (default info; silent in tests) |
| TROVE_RATE_LIMIT_RPS / _BURST | Per-tenant token-bucket rate limit on tenant routes (defaults 50 rps / 100 burst; in-memory, per-replica). TROVE_RATE_LIMIT_DISABLED=true turns it off |
| TROVE_JWKS_URL | Per-tenant JWT mode: remote JWKS to verify bearer tokens against |
| TROVE_JWT_PUBLIC_KEY | Per-tenant JWT mode: static PEM public key (alt. to JWKS) |
| TROVE_JWT_ISSUER / TROVE_JWT_AUDIENCE | Expected iss / aud (enforced when set) |
| TROVE_JWT_TENANT_CLAIM / TROVE_JWT_ALG | Tenant claim name (default tenant_id) / key alg (default RS256) |
| TROVE_API_TOKEN | Shared-bearer mode: a single static token (X-Trove-Tenant names the tenant) |
| TROVE_STORAGE_DRIVER | s3 or localfs (default) |
| TROVE_S3_ENDPOINT / _BUCKET / _REGION / _ACCESS_KEY_ID / _SECRET_ACCESS_KEY / _FORCE_PATH_STYLE | R2/S3 config |
| TROVE_LOCAL_STORAGE_ROOT | localfs root (default ./data/storage) |
| TROVE_PUBLIC_URL | Public origin the localfs signed data plane points back at (default http://localhost:$PORT) |
| TROVE_DATA_SIGNING_KEY | HMAC key(s) (hex/base64, 32 bytes) for /v1/data signed URLs; required across replicas (else a per-process ephemeral key is used). Comma-separated for rotation: first signs, all verify |
| CODEX_API_BASE_URL / CODEX_API_TOKEN / CODEX_VERSION | codex for the derivation worker (worker self-skips if unset) |
| TROVE_PRESIGN_TTL / TROVE_STAGING_TTL | presigned URL / staging cache TTL seconds (default 3600) |
| TROVE_MAX_UPLOAD_BYTES | max in-process upload size for ingest / /v1/data PUT (default 100 MiB; oversized → 413) |
| TROVE_MULTIPART_PART_SIZE | recall files larger than this in resumable multipart parts of this size (default 16 MiB; S3 requires ≥5 MiB; needs an S3/R2 driver + a multipart-capable agent) |
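
As an illustration of the per-tenant token-bucket limit configured by TROVE_RATE_LIMIT_RPS and TROVE_RATE_LIMIT_BURST, the sketch below keeps one in-memory bucket per tenant. The class and function names are hypothetical and the clock is passed in explicitly for clarity; trove's actual limiter may differ in structure and edge-case handling.

```typescript
// A bucket holds up to `burst` tokens and refills at `rps` tokens per second.
class TokenBucket {
  private tokens: number;
  private lastMs: number;
  private readonly rps: number;
  private readonly burst: number;

  constructor(rps: number, burst: number, nowMs: number) {
    this.rps = rps;
    this.burst = burst;
    this.tokens = burst; // start full, so an initial burst is allowed
    this.lastMs = nowMs;
  }

  tryTake(nowMs: number): boolean {
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSeconds = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.rps);
    this.lastMs = nowMs;
    if (this.tokens < 1) return false; // over the limit
    this.tokens -= 1;
    return true;
  }
}

// In-memory and per-process, so each replica enforces its own limit.
const buckets = new Map<string, TokenBucket>();

function allow(tenantId: string, nowMs: number, rps = 50, burst = 100): boolean {
  let bucket = buckets.get(tenantId);
  if (bucket === undefined) {
    bucket = new TokenBucket(rps, burst, nowMs);
    buckets.set(tenantId, bucket);
  }
  return bucket.tryTake(nowMs);
}
```

With the documented defaults, a tenant can make 100 requests at once, after which further requests are admitted at 50 per second. Because state is per-replica, the effective global limit scales with the number of replicas.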

Deploys on Railway as two services, the API (apps/api, pnpm --filter @trove/api start) and the worker (apps/worker), plus a Postgres add-on and an R2 bucket. Run migrations at boot (pnpm --filter @trove/db db:migrate); the runner is advisory-locked so multiple replicas migrate safely.

Operating it in production — secret rotation, backup/restore, alerting, and incident playbooks — is documented in docs/RUNBOOK.md.

corepack enable && corepack prepare pnpm@<version> --activate
pnpm install
pnpm lint && pnpm build && pnpm typecheck && pnpm test

DB-backed tests skip without DATABASE_URL and run against it when set (CI provides a Postgres service). Test tasks run serially (--concurrency=1) so the integration tests don’t contend on the shared database. The S3Driver is covered against MinIO when TROVE_S3_TEST_ENDPOINT is set (CI starts one).

scripts/e2e-agent-recall.sh runs a full binary-level on-prem recall: a real trove server (S3/MinIO byte plane) + the real Go trove-agent binary, moving bytes over a real presigned URL. It registers an agent, stub-ingests an on-prem file, recalls it, and asserts the bytes match.

pnpm build # apps/api/dist must exist
# needs: a Postgres in DATABASE_URL, a MinIO at TROVE_S3_ENDPOINT, the Go
# toolchain, and the trove-agent source (TROVE_AGENT_DIR, default ../trove-agent)
DATABASE_URL=postgres://… TROVE_AGENT_DIR=../trove-agent bash scripts/e2e-agent-recall.sh

In CI it runs as the e2e — live Go agent recall job. That job checks out the private trove-agent repo, so a maintainer must add a TROVE_AGENT_PAT Actions secret (a fine-grained PAT with read access to printwithsynergy/trove-agent); until then the job cleanly self-skips rather than failing.

License: AGPL-3.0-or-later.