Accuracy methodology — the deterministic lanes

Every commercial artwork-AI tool publishes a confident capability list and zero accuracy numbers. Codex publishes the method, the corpus, and the score — and gates regressions in CI. This document is that published methodology.

It is the companion to determinism.md (the determinism + transparency contract) and is backed mechanically by tests/test_accuracy_methodology.py: the numbers below are asserted there, so the doc and the harness move together — a parser change that shifts a number fails CI until both update.

Scope — deterministic lanes only

We start with — and currently publish only — the lanes where ground truth is unambiguous: the answer is a fact, not a matter of opinion or a model’s judgement.

Lane	Why ground truth is unambiguous
GS1 Digital Link — `is_digital_link` classification	A URI is a structurally-recognisable GS1 Digital Link or it isn’t (per the GS1 Digital Link URI Syntax).
GS1 Digital Link — GTIN mod-10 check digit	Pure arithmetic — the terminal digit is the correct check digit or it isn’t.
ISO/IEC 15415/16 — grade-letter banding	The standard fixes the A..F letter for a measured numeric grade.

The model lanes (logos, symbols, classification, spell, substrate, regulatory, spec, language, order_intake) are deliberately out of scope for a published accuracy number. Their “right answer” is model- or policy-dependent: Codex emits them as signals, not verdicts (“expose detection signals, not policy verdicts”), and policy is lint’s job. Publishing an accuracy percentage for an opinion-shaped lane would be the exact fabrication every incumbent commits; we don’t.

This scope is itself behavior-locked: a test asserts the accuracy methodology covers only lanes classified deterministic in the determinism contract block, so an opinion-side corpus can’t be slipped in without re-thinking the methodology.
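A scope lock of this kind can be sketched as a set-difference check. This is only an illustration: the lane keys, the constant names, and the helper `lanes_out_of_scope` are hypothetical, not the identifiers used in the real harness or the determinism contract block.

```python
# Hypothetical lane registry; the real classification lives in the
# determinism contract block, whose identifiers are not shown in this document.
DETERMINISTIC_LANES = frozenset(
    {"gs1_digital_link", "gtin_check_digit", "iso_grade_banding"}
)
MODEL_LANES = frozenset(
    {"logos", "symbols", "classification", "spell", "substrate",
     "regulatory", "spec", "language", "order_intake"}
)


def lanes_out_of_scope(corpus_lanes, deterministic_lanes=DETERMINISTIC_LANES):
    """Return the corpus lanes that are not classified as deterministic."""
    return sorted(set(corpus_lanes) - set(deterministic_lanes))


def test_accuracy_scope_is_deterministic_only():
    # Assumed: one accuracy corpus per published lane.
    corpus_lanes = {"gs1_digital_link", "gtin_check_digit", "iso_grade_banding"}
    assert lanes_out_of_scope(corpus_lanes) == []
```

Adding a corpus for any model lane (say `spell`) would make the helper return a non-empty list and fail the test, which is the behaviour the scope lock is meant to guarantee.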

Method

Labeled corpus = ground truth. Each fixture case carries its own expected label, and there is no human adjudication step, so the score is reproducible by anyone who runs the harness.
Confusion-matrix scoring. Binary lanes are scored with true/false positive/negative counts → accuracy, precision, recall.
Hard cases are explicit. The corpora are built around the documented failure modes: the false-positive guards (a marketing URL that looks like a Digital Link) and the false-negative guards (a defective-but-still-recognisable Digital Link that must be flagged, not dismissed).
Pinned + CI-gated. The published numbers are asserted in the harness. Growing a corpus requires bumping the cited size in the same change.
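The scoring step above can be sketched as follows. This is a minimal illustration, assuming a corpus of `(input, expected_label)` pairs; `Score`, `score_binary`, and the toy corpus are hypothetical names and data, not the harness's own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    tp: int  # expected positive, predicted positive
    fp: int  # expected negative, predicted positive
    tn: int  # expected negative, predicted negative
    fn: int  # expected positive, predicted negative

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0


def score_binary(corpus, classify) -> Score:
    """Score `classify` against (input, expected_label) pairs."""
    tp = fp = tn = fn = 0
    for value, expected in corpus:
        predicted = bool(classify(value))
        if predicted and expected:
            tp += 1
        elif predicted:
            fp += 1
        elif expected:
            fn += 1
        else:
            tn += 1
    return Score(tp, fp, tn, fn)


# Toy data only: a deliberately imperfect classifier on a four-case corpus.
toy_corpus = [("xa", True), ("xb", False), ("c", False), ("d", True)]
toy_score = score_binary(toy_corpus, lambda s: s.startswith("x"))
```

In a harness of this shape, pinning means asserting both the corpus size and each metric, so growing the corpus or changing the parser fails the test until the cited numbers are updated in the same change.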

Published results

Measured on the corpora in tests/test_accuracy_methodology.py at the current VERSION.

GS1 Digital Link — `is_digital_link` classification

Corpus: 10 cases, made up of 7 recognisable Digital Links (including defective-but-recognisable ones: an un-padded GTIN, a bad check digit, a deprecated convenience-alpha path, and a compressed form) + 3 non-Digital-Links (marketing URLs).
Accuracy: 100% · Precision: 100% (no marketing URL mis-classified as a DL) · Recall: 100% (no defective-but-recognisable DL dismissed as non-DL).

The two hard guarantees this lane makes:

A plain marketing URL (https://brand.com/promo) is reported as a neutral non-DL fact — never a false error.
A Digital Link with a defect (un-padded GTIN, bad check digit, …) is still recognised as a Digital Link, with the defect surfaced in the structured errors — never silently dismissed.
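The shape of these two guarantees can be illustrated with a deliberately simplified recogniser. It is a sketch under strong assumptions, not Codex's parser: it only looks for a numeric `01` (GTIN) path segment, ignores other primary keys, convenience-alpha paths, and compressed forms, and the names `DigitalLinkReport`, `inspect_uri`, and the error codes are hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class DigitalLinkReport:
    is_digital_link: bool
    errors: list[str] = field(default_factory=list)


def _check_digit_ok(digits: str) -> bool:
    # GS1 mod-10: weights 3, 1, 3, ... from the right, excluding the check digit.
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def inspect_uri(uri: str) -> DigitalLinkReport:
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    for key, value in zip(segments, segments[1:]):
        if key == "01" and value.isascii() and value.isdigit():
            # Recognised as a Digital Link; defects are reported, not fatal.
            errors = []
            if len(value) != 14:
                errors.append("gtin_not_14_digits")
            if not _check_digit_ok(value):
                errors.append("bad_check_digit")
            return DigitalLinkReport(True, errors)
    # No GTIN key found: a neutral non-DL fact with no errors attached.
    return DigitalLinkReport(False)
```

The design point is that recognition and validation are separate outcomes: a defect adds an entry to `errors` but never flips `is_digital_link` to false, and a non-DL URL never produces an error.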

GS1 GTIN mod-10 check digit

Corpus: 8 cases — 4 valid (GTIN-13/14, zero-padded) + 4 invalid (wrong terminal digit, off-by-one, non-digit, empty).
Accuracy: 100%. Pure arithmetic; no case is misclassified.
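For reference, the mod-10 arithmetic can be sketched as below. The weighting (3, 1, 3, ... from the rightmost data digit) is the standard GS1 check-digit rule; the function name `gtin_check_digit_ok` and the accepted lengths are assumptions for illustration, not the harness's API.

```python
def gtin_check_digit_ok(gtin: str) -> bool:
    """Return True if the terminal digit is the correct GS1 mod-10 check digit."""
    # Reject empty and non-digit input before doing any arithmetic.
    if not gtin or not gtin.isascii() or not gtin.isdigit():
        return False
    # Assumed accepted lengths: GTIN-8, GTIN-12, GTIN-13, GTIN-14.
    if len(gtin) not in (8, 12, 13, 14):
        return False
    body, check = gtin[:-1], int(gtin[-1])
    # Weights alternate 3, 1, 3, ... starting from the rightmost data digit.
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

Because the weights are anchored at the right-hand end, zero-padding a GTIN-13 to 14 digits leaves the check digit unchanged, which is why padded and un-padded forms of the same GTIN validate identically.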

ISO/IEC 15415/16 — grade-letter banding

Corpus: 10 cases spanning the published band edges (A: ≥3.5, B: ≥2.5, C: ≥1.5, D: ≥0.5, F: <0.5).
Accuracy: 100%. The numeric → letter mapping matches the ISO band table exactly.
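The band table above reduces to a descending threshold lookup. A minimal sketch, using the band edges quoted in the corpus description; the function name `grade_letter` is hypothetical:

```python
# Band floors as listed above; anything below the lowest floor is F.
_BANDS = ((3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D"))


def grade_letter(grade: float) -> str:
    """Map a numeric grade to its letter band."""
    for floor, letter in _BANDS:
        if grade >= floor:
            return letter
    return "F"
```

A corpus for this lane is most useful at the edges: each floor value itself and a value just below it, since an off-by-one comparison (`>` instead of `>=`) only shows up there.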

Note on what is and isn’t scored for grading. Codex grades only the ISO metrics it can derive from the rendered region (symbol contrast, modulation, defects, …) and lists the metrics needing a reference decode or module grid in omitted_metrics — it never fabricates a metric it can’t measure. The accuracy claim above is about the band mapping, the deterministic part; the per-metric measurement quality is bounded by what’s measurable from pixels and is reported transparently, not scored as a single number.

Reproducing the numbers

uv run pytest tests/test_accuracy_methodology.py -v

Each test name states the pinned claim; a failure means a parser change moved a published number — update the corpus/doc and the harness together.

Roadmap

The methodology expands only along the deterministic perimeter. Natural next corpora (all objective): barcode-format detection accuracy, GTIN padding detection, key-qualifier ordering legality. The model lanes stay out of the published-accuracy surface by design.