Generation Harness

Best for: fast iteration on prompt-following, layout, and text-rendering quality for text-to-image workflows.

This harness runs two example cases:

- ui_checkout_mockup
- coffee_flyer_generation

Run

python generation_harness/run_imagegen_evals.py

Quick smoke test for a single case (use --cases):

python generation_harness/run_imagegen_evals.py --cases ui_checkout_mockup

Run multiple cases (comma-separated, no spaces):

python generation_harness/run_imagegen_evals.py --cases ui_checkout_mockup,coffee_flyer_generation
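The no-spaces rule matters because the shell splits arguments on whitespace, and a stray space inside a quoted value would end up as part of a case id. Below is a minimal sketch of how a comma-separated --cases value could be split and validated. The names parse_cases and VALID_CASES are hypothetical and are not taken from run_imagegen_evals.py; the script's actual parsing may differ.

```python
# Hypothetical helper; the case ids are the two listed in this README.
VALID_CASES = ("ui_checkout_mockup", "coffee_flyer_generation")


def parse_cases(raw: str) -> list[str]:
    """Split a comma-separated --cases value and reject unknown ids."""
    cases = [c for c in raw.split(",") if c]
    unknown = [c for c in cases if c not in VALID_CASES]
    if unknown:
        raise ValueError(f"unknown case id(s): {', '.join(unknown)}")
    return cases


print(parse_cases("ui_checkout_mockup,coffee_flyer_generation"))
```

In this sketch a value such as "ui_checkout_mockup, coffee_flyer_generation" is rejected, since the second id keeps its leading space and no longer matches a valid case.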

Enable the optional OCR-style text check (applies only to the coffee flyer case):

python generation_harness/run_imagegen_evals.py --run-ocr
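As an illustration of what an exact-text check can look like once text has been extracted from an image, here is a small sketch that reports which required phrases are missing. It assumes the OCR output is already available as a string; normalize, missing_phrases, and the sample phrases are all hypothetical, and the harness's own check may work differently.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def missing_phrases(ocr_text: str, required: list[str]) -> list[str]:
    """Return the required phrases that do not appear verbatim in the OCR text."""
    haystack = normalize(ocr_text)
    return [p for p in required if normalize(p) not in haystack]


# Made-up OCR output and phrases, for illustration only.
ocr_text = "Morning  Brew\nFresh coffee daily"
print(missing_phrases(ocr_text, ["Morning Brew", "Open 7am"]))
```

The comparison is case-sensitive on purpose: a text-rendering check usually wants to catch a flyer that renders "morning brew" when "Morning Brew" was requested.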

- --cases: limit runs to specific case ids. Valid values:
  - ui_checkout_mockup
  - coffee_flyer_generation
- --model: image model under test (defaults to gpt-image-1.5).
- --judge-model: LLM used to grade outputs (defaults to gpt-5.2).
- --num-images: number of images to generate per case (defaults to 1).
- --image-size: size for generated images (defaults to 1024x1024).
- --run-ocr: only relevant for coffee_flyer_generation (checks exact text).
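
The flags and defaults listed above can be mirrored in a short argparse sketch. This is an assumption about how such a command line could be declared, not a copy of run_imagegen_evals.py; build_parser is a hypothetical name, and only the flag names and default values come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with the documented defaults."""
    parser = argparse.ArgumentParser(description="Run image generation evals.")
    # None is assumed here to mean "run every case".
    parser.add_argument("--cases", default=None,
                        help="comma-separated case ids, no spaces")
    parser.add_argument("--model", default="gpt-image-1.5",
                        help="image model under test")
    parser.add_argument("--judge-model", default="gpt-5.2",
                        help="LLM used to grade outputs")
    parser.add_argument("--num-images", type=int, default=1,
                        help="images to generate per case")
    parser.add_argument("--image-size", default="1024x1024",
                        help="size for generated images")
    parser.add_argument("--run-ocr", action="store_true",
                        help="enable the OCR-style text check")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--cases", "ui_checkout_mockup", "--num-images", "2"])
print(args.cases, args.model, args.judge_model, args.num_images, args.image_size, args.run_ocr)
```

argparse converts dashes in flag names to underscores, so --judge-model is read back as args.judge_model.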

Results are written under generation_harness/results/<run_id>/: