Rewrite producer
compile_pdf.rewrite applies object-tree mutations to a single PDF
input. The output is a single PDF with the requested mutations
applied and everything else byte-identical to the input — that
“nothing else touched” guarantee is mechanically verified.
What it does
15 in-scope mutations grouped by category:
Structural
- OCG flips: toggle Optional Content Group visibility per layer.
- Page lifecycle ops: insert / delete / reorder / rotate pages.
- Box adjustments: trim, bleed, art, crop, media boxes.
- Page-label surgery: fix 1, 2, i, ii, … numbering.
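To make the page-rotate and box-adjustment categories concrete, here is a minimal sketch of two pre-checks such ops would plausibly need. Both helper names are hypothetical and are not part of compile_pdf. The only facts relied on come from the PDF specification: a page's /Rotate value must be a multiple of 90, and page boxes are rectangles of the form [llx, lly, urx, ury] in points.

```python
def compose_rotation(current: int, delta: int) -> int:
    """Return the /Rotate value after turning a page by `delta` degrees.

    Hypothetical helper: rejects anything that is not a multiple of 90 and
    normalizes the result into the range 0..359.
    """
    if delta % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {delta}")
    return (current + delta) % 360


def box_within(inner: list, outer: list) -> bool:
    """True if rectangle `inner` lies entirely inside rectangle `outer`.

    Both rectangles are [llx, lly, urx, ury] in points, e.g. a TrimBox
    checked against the page's MediaBox.
    """
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return ox0 <= ix0 <= ix1 <= ox1 and oy0 <= iy0 <= iy1 <= oy1
```

Whether the real engine rejects an out-of-bounds box or clips it is not stated here; the sketch only reports the condition.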
Hygiene
- Metadata patches: Info dict + XMP. Strip / set / replace.
- Color-space swaps: DeviceRGB → DeviceCMYK pin (or inverse).
- Strip JS: remove /JavaScript actions and /JS keys.
- Strip embedded files: remove /EmbeddedFiles.
- Normalize page-tree fan-out: flatten deep trees.
Lifecycle
- PDF/X version pin: declare conformance level.
- Producer / Creator stamping: operator provenance.
What it doesn’t do
Out of scope, gated by a hard STOP-gate in the audit:
- Content-stream surgery
- Font subsetting / embedding changes
- Image recompression
- Color reflow
These belong to a future producer (or never — see the design spec for the rationale).
Plan schema
A rewrite plan is a JSON document. Top-level fields:

```json
{
  "schema_version": "1.0.0",
  "ops": [
    { "op": "ocg_flip", "layer": "Bleed", "visible": false },
    { "op": "metadata_set", "key": "Title", "value": "Job 12345" },
    { "op": "metadata_strip", "keys": ["JS", "JavaScript"] },
    { "op": "page_rotate", "page": 1, "degrees": 90 },
    { "op": "box_set", "page": 2, "box": "TrimBox", "rect_pt": [0, 0, 612, 792] }
  ]
}
```

Schema documented at compile-pdf schema rewrite. Validation runs
client-side (CLI) and server-side (POST /v1/rewrite/apply).
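A rough idea of what a structural pre-check on a plan could look like is sketched below. This is not the project's validator: the function name check_plan is hypothetical, and the set of op names is taken only from the example plan above, so the real schema almost certainly accepts more.

```python
import json

# Assumption: op names copied from the example plan; the real schema may define more.
KNOWN_OPS = {"ocg_flip", "metadata_set", "metadata_strip", "page_rotate", "box_set"}


def check_plan(plan: dict) -> list:
    """Return a list of problems found in `plan`; an empty list means none."""
    problems = []
    if not isinstance(plan.get("schema_version"), str):
        problems.append("schema_version must be a string")
    ops = plan.get("ops")
    if not isinstance(ops, list):
        problems.append("ops must be a list")
        return problems
    for i, op in enumerate(ops):
        # Each entry must be an object carrying a recognized "op" name.
        name = op.get("op") if isinstance(op, dict) else None
        if name not in KNOWN_OPS:
            problems.append(f"ops[{i}]: unknown op {name!r}")
    return problems


plan = json.loads(
    '{"schema_version": "1.0.0",'
    ' "ops": [{"op": "page_rotate", "page": 1, "degrees": 90}]}'
)
print(check_plan(plan))  # []
```

Returning a list of problems instead of raising on the first one lets a CLI report every issue in a single pass.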
Determinism guarantee
Same input + same plan produces byte-identical output (verified by
SHA-256). The cache key composer (src/compile_pdf/cache.py)
includes the canonical plan hash, so re-running an identical request
short-circuits to the cached output.
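One way a canonical plan hash and cache key could be composed is sketched below. The contents of src/compile_pdf/cache.py are not shown here, so the function names and the exact key layout are assumptions; the sketch only demonstrates the general technique of hashing a key-order-independent JSON serialization.

```python
import hashlib
import json


def canonical_plan_hash(plan: dict) -> str:
    """SHA-256 of the plan serialized with sorted keys and no extra whitespace."""
    canonical = json.dumps(
        plan, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(input_pdf: bytes, plan: dict) -> str:
    """Combine the input digest and the canonical plan hash into one key.

    Assumption: the real composer may mix in more fields (for example a
    tool version); this sketch uses only the two named in the text.
    """
    input_digest = hashlib.sha256(input_pdf).hexdigest()
    combined = f"{input_digest}:{canonical_plan_hash(plan)}"
    return hashlib.sha256(combined.encode("ascii")).hexdigest()
```

Sorting keys makes two plans that differ only in object key order hash identically, while the order of entries in ops still affects the hash, which matters because ops are applied in sequence.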
Codex surface consumed
- codex_pdf.CodexDocument: Compile reads the document to validate page-index references in the plan (“you can’t delete page 12 of a 10-page PDF”).
No re-implementation; the audit script enforces this.
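The page-index check described above can be sketched as follows. The CodexDocument API is not shown here, so the sketch takes a plain page count. It also assumes that ops reference pages through a "page" key, as in the example plan, and that page numbers are 1-based. Both are assumptions, and check_page_refs is a hypothetical name.

```python
def check_page_refs(ops: list, page_count: int) -> list:
    """Return one error string per op whose page reference is out of range."""
    errors = []
    for i, op in enumerate(ops):
        page = op.get("page")
        if page is None:
            continue  # this op does not target a specific page
        # bool is a subclass of int in Python, so exclude it explicitly.
        valid = isinstance(page, int) and not isinstance(page, bool)
        if not valid or not 1 <= page <= page_count:
            errors.append(
                f"ops[{i}] ({op.get('op')}): page {page!r} is outside 1..{page_count}"
            )
    return errors


print(check_page_refs([{"op": "page_rotate", "page": 12, "degrees": 90}], 10))
```

The sketch checks every op against the original page count. A plan that inserts or deletes pages changes the count for later ops, which a real check would have to track.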
Retention-for-training
POST /v1/rewrite/apply honours the X-Compile-Retain-For-Training
header. When truthy and COMPILE_RETAIN_BUCKET is configured, the
call’s input/output/result triplet is persisted to S3-compatible
storage with a TTL tag. The decision is reflected on the lineage
record. See operations/retention.md.
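The retention decision described above reduces to two conditions, sketched below. Which header values count as truthy is not specified here, so the TRUTHY set is an assumption, as is the helper name should_retain. The persistence step itself (S3 upload, TTL tag, lineage record) is left out.

```python
import os

# Assumption: the accepted truthy spellings are not documented in this section.
TRUTHY = {"1", "true", "yes", "on"}

HEADER = "x-compile-retain-for-training"


def should_retain(headers: dict, env=None) -> bool:
    """True when the header is truthy AND COMPILE_RETAIN_BUCKET is set."""
    env = os.environ if env is None else env
    # HTTP header names are case-insensitive, so compare in lower case.
    raw = next((v for k, v in headers.items() if k.lower() == HEADER), "")
    header_truthy = raw.strip().lower() in TRUTHY
    bucket_configured = bool(env.get("COMPILE_RETAIN_BUCKET"))
    return header_truthy and bucket_configured
```

Passing env explicitly keeps the decision testable without touching the process environment.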
Status
Shipped. The mutation engine (pikepdf) + three-layer verify +
POST /v1/rewrite/apply are live; determinism is enforced in CI.