Accuracy methodology — the deterministic lanes
Every commercial artwork-AI tool publishes a confident capability list and zero accuracy numbers. Codex publishes the method, the corpus, and the score — and gates regressions in CI. This document is that published methodology.
It is the companion to determinism.md (the determinism +
transparency contract) and is backed mechanically by
tests/test_accuracy_methodology.py:
the numbers below are asserted there, so the doc and the harness move
together — a parser change that shifts a number fails CI until both update.
Scope — deterministic lanes only
We start with, and currently publish only, the lanes where ground truth is unambiguous: the answer is a fact, not a matter of opinion or a model’s judgement.
| Lane | Why ground truth is unambiguous |
|---|---|
| GS1 Digital Link — is_digital_link classification | A URI is a structurally-recognisable GS1 Digital Link or it isn’t (per the GS1 Digital Link URI Syntax). |
| GS1 Digital Link — GTIN mod-10 check digit | Pure arithmetic — the terminal digit is the correct check digit or it isn’t. |
| ISO/IEC 15415/16 — grade-letter banding | The standard fixes the A..F letter for a measured numeric grade. |
The model lanes (logos, symbols, classification, spell,
substrate, regulatory, spec, language, order_intake) are
deliberately out of scope for a published accuracy number. Their “right
answer” is model-/policy-dependent — codex emits them as signals, not
verdicts (“expose detection signals, not policy verdicts”), and policy is
lint’s job. Publishing an accuracy percentage for an opinion-shaped lane
would be the exact fabrication every incumbent commits; we don’t.
This scope is itself behavior-locked: a test asserts the accuracy methodology covers only lanes classified
deterministic in the determinism contract block, so an opinion-side corpus can’t be slipped in without re-thinking the methodology.
Method
- Labeled corpus = ground truth. Each fixture case carries its own expected label. The corpus is the ground truth: there is no human adjudication step, so the score is reproducible by anyone who runs the harness.
- Confusion-matrix scoring. Binary lanes are scored with true/false positive/negative counts → accuracy, precision, recall.
- Hard cases are explicit. The corpora are built around the documented failure modes: the false-positive guards (a marketing URL that looks like a Digital Link) and the false-negative guards (a defective-but-still-recognisable Digital Link that must be flagged, not dismissed).
- Pinned + CI-gated. The published numbers are asserted in the harness. Growing a corpus requires bumping the cited size in the same change.
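
The scoring and pinning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness itself: `Case`, `score`, and `TOY_CORPUS` are hypothetical names, the toy predictor is `str.isdigit` rather than any codex lane, and reporting precision or recall as 1.0 when its denominator is zero is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One labeled fixture: an input plus the label the corpus asserts for it."""
    value: str
    expected: bool


def score(cases, predict):
    """Count TP/FP/TN/FN for a binary lane and derive accuracy, precision, recall."""
    tp = fp = tn = fn = 0
    for case in cases:
        got = bool(predict(case.value))
        if got and case.expected:
            tp += 1
        elif got and not case.expected:
            fp += 1
        elif not got and case.expected:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "size": total,
        "accuracy": (tp + tn) / total if total else 0.0,
        # Assumed convention: an empty denominator is reported as 1.0.
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
    }


# Toy corpus (hypothetical): "is this string all digits?"
TOY_CORPUS = [
    Case("0123", True),
    Case("abc", False),
    Case("12a", False),
    Case("", False),
]

result = score(TOY_CORPUS, str.isdigit)

# Pinned claims: growing the corpus or changing the predictor fails these
# assertions until the cited size and score are updated in the same change.
assert result["size"] == 4
assert result["accuracy"] == 1.0
assert result["precision"] == 1.0
assert result["recall"] == 1.0
```

Because the expected label travels with each case, the score depends only on the corpus and the predictor, which is what makes it reproducible without an adjudication step.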
Published results
Measured on the corpora in tests/test_accuracy_methodology.py at the current VERSION.
GS1 Digital Link — is_digital_link classification
- Corpus: 10 cases, made up of 7 recognisable Digital Links (including defective-but-valid: an un-padded GTIN, a bad check digit, a deprecated convenience-alpha path, and a compressed form) + 3 non-Digital-Links (marketing URLs).
- Accuracy: 100% · Precision: 100% (no marketing URL mis-classified as a DL) · Recall: 100% (no defective-but-valid DL dismissed as non-DL).
The two hard guarantees this lane makes:
- A plain marketing URL (https://brand.com/promo) is reported as a neutral non-DL fact, never a false error.
- A Digital Link with a defect (un-padded GTIN, bad check digit, …) is still recognised as a Digital Link, with the defect surfaced in the structured errors, never silently dismissed.
GS1 GTIN mod-10 check digit
- Corpus: 8 cases, made up of 4 valid (GTIN-13/14, zero-padded) + 4 invalid (wrong terminal digit, off-by-one, non-digit, empty).
- Accuracy: 100%. Pure arithmetic; no errors.
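
For reference, the mod-10 arithmetic this lane checks can be sketched as below. The weighting (3, 1, 3, 1, … starting from the digit immediately left of the check digit) is the standard GS1 check-digit calculation; the function names `gtin_check_digit` and `has_valid_check_digit`, the accepted lengths, and the choice to return False rather than raise on malformed input are assumptions for illustration, not codex's actual API.

```python
def gtin_check_digit(payload: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits preceding it."""
    total = 0
    # Walk right to left: the digit next to the check digit has weight 3,
    # then weights alternate 1, 3, 1, ...
    for i, ch in enumerate(reversed(payload)):
        weight = 3 if i % 2 == 0 else 1
        total += int(ch) * weight
    return (10 - total % 10) % 10


def has_valid_check_digit(gtin: str) -> bool:
    """True when the terminal digit equals the computed check digit."""
    # Reject empty and non-digit input up front (isascii guards against
    # non-ASCII digit characters that int() would not parse).
    if not gtin or not (gtin.isascii() and gtin.isdigit()):
        return False
    # Assumed accepted lengths: GTIN-8, GTIN-12, GTIN-13, GTIN-14.
    if len(gtin) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(gtin[:-1]) == int(gtin[-1])
```

Zero-padding a GTIN-13 to 14 digits leaves the check digit unchanged, because the added leading zero contributes nothing to the weighted sum.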
ISO/IEC 15415/16 — grade-letter banding
- Corpus: 10 cases spanning the published band edges (A: ≥3.5, B: ≥2.5, C: ≥1.5, D: ≥0.5, F: <0.5).
- Accuracy: 100%. The numeric → letter mapping matches the ISO band table exactly.
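
The band mapping under test is small enough to sketch directly from the edges listed above. The function name `grade_letter` is hypothetical, and rejecting values outside the 0.0 to 4.0 grade range with a ValueError is an assumption about how out-of-range input might be handled, not a statement of codex's behavior.

```python
# Band floors taken from the corpus description: A >= 3.5, B >= 2.5,
# C >= 1.5, D >= 0.5, anything lower is F.
BANDS = ((3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D"))


def grade_letter(grade: float) -> str:
    """Map a numeric grade on the 0.0 to 4.0 scale to its letter band."""
    # Assumed policy: out-of-range (or NaN) input is an error, not an F.
    if not 0.0 <= grade <= 4.0:
        raise ValueError(f"grade out of range: {grade!r}")
    for floor, letter in BANDS:
        if grade >= floor:
            return letter
    return "F"
```

Testing at and just below each floor (for example 3.5 and 3.4) is what pins the inclusive lower bound of every band.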
Note on what is and isn’t scored for grading. Codex grades only the ISO metrics it can derive from the rendered region (symbol contrast, modulation, defects, …) and lists the metrics needing a reference decode or module grid in omitted_metrics; it never fabricates a metric it can’t measure. The accuracy claim above is about the band mapping, the deterministic part; the per-metric measurement quality is bounded by what’s measurable from pixels and is reported transparently, not scored as a single number.
Reproducing the numbers
    uv run pytest tests/test_accuracy_methodology.py -v

Each test name states the pinned claim; a failure means a parser change moved a published number, so update the corpus/doc and the harness together.
Roadmap
The methodology expands only along the deterministic perimeter. Natural next corpora (all objective): barcode-format detection accuracy, GTIN padding detection, key-qualifier ordering legality. The model lanes stay out of the published-accuracy surface by design.