Generation Harness
Best for: fast iteration on prompt-following, layout, and text-rendering quality for text-to-image workflows.
This harness runs two example cases:
ui_checkout_mockup: mobile checkout UI mockupcoffee_flyer_generation: marketing flyer with strict copy constraints
Run
python generation_harness/run_imagegen_evals.py
Quick smoke test for a single case (use --cases):
python generation_harness/run_imagegen_evals.py --cases ui_checkout_mockup
Run multiple cases (comma-separated, no spaces):
python generation_harness/run_imagegen_evals.py --cases ui_checkout_mockup,coffee_flyer_generation
Optional OCR-style text check for the coffee flyer:
python generation_harness/run_imagegen_evals.py --run-ocr
Flags (when to use them)
--cases: limit runs to specific case ids. Valid values:ui_checkout_mockupcoffee_flyer_generation
--model: image model under test (defaults togpt-image-1.5).--judge-model: LLM used to grade outputs (defaults togpt-5.2).--num-images: number of images to generate per case (defaults to 1).--image-size: size for generated images (defaults to1024x1024).--run-ocr: only relevant forcoffee_flyer_generation(checks exact text).
Outputs
Results are written under generation_harness/results/<run_id>/:
results.jsonandresults.csvwith per-example scoressummary.jsonwith aggregated metricsartifacts/for generated images
Adapting the harness
- Update the prompts/criteria in
run_imagegen_evals.py. - Tune JSON schemas + parsers to match your evaluation rubric.
- Add more cases by creating new
TestCaseentries and graders.