Image Generation + Editing Evals
This folder contains a lightweight vision eval harness plus example runners for
image generation and image editing. The code mirrors the structure of
examples/evals/realtime_evals/ so you can adapt it quickly.
Directory layout
- vision_harness/: minimal shared library (types, runners, graders, evaluate loop)
- generation_harness/: text-to-image evals (UI mockups + marketing flyer)
- editing_harness/: image-edit evals (virtual try-on + logo edit)
- shared/: reporting and optional rendering helpers
Quickstart
Python 3.9+ required.
pip install -r requirements.txt
export OPENAI_API_KEY="your_api_key"
Run a harness:
- Generation:
python generation_harness/run_imagegen_evals.py
- Editing:
python editing_harness/run_imagegen_evals.py
What the harness does
- Builds a small set of TestCase objects (prompt + criteria).
- Runs the image model for each case.
- Grades each output with an LLM judge using a strict JSON schema.
- Writes results and artifacts to results/<run_id>/.
The harness is intentionally small so you can copy/paste parts into your own production eval setup.
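The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in vision_harness/: the TestCase fields, the evaluate signature, and the generate and grade callables are all assumptions made for the example. In a real run, generate would call the image model and grade would call the LLM judge.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class TestCase:
    # Hypothetical shape; the real class in vision_harness/ may differ.
    id: str
    prompt: str
    criteria: str


def evaluate(
    cases: List[TestCase],
    generate: Callable[[TestCase], bytes],
    grade: Callable[[TestCase, bytes], Dict],
    out_dir: Path,
) -> List[Dict]:
    """Run each case, save the image, grade it, and write results.json."""
    artifacts = out_dir / "artifacts"
    artifacts.mkdir(parents=True, exist_ok=True)

    rows = []
    for case in cases:
        image_bytes = generate(case)          # assumed: returns PNG bytes
        image_path = artifacts / f"{case.id}.png"
        image_path.write_bytes(image_bytes)

        verdict = grade(case, image_bytes)    # assumed: returns a JSON-serializable dict
        rows.append({**asdict(case), "artifact": str(image_path), "grade": verdict})

    (out_dir / "results.json").write_text(json.dumps(rows, indent=2))
    return rows
```

Passing generate and grade in as plain callables keeps the loop testable with stubs, so it can be exercised without an API key.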
Example cases
Generation cases:
- ui_checkout_mockup: mobile checkout screen with strict text + layout rules
- coffee_flyer_generation: marketing flyer with exact copy constraints
Editing cases:
- vto_jacket_tryon: virtual try-on with reference person + garment
- logo_year_edit: precision logo text edit
Required assets (editing harness)
The editing harness expects these files in images/:
- images/base_woman.png
- images/jacket.png
- images/logo_generation_1.png
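If you want to fail fast before spending API calls, a small pre-flight check along these lines may help. It is a sketch only: the missing_assets helper is hypothetical and not part of the harness, and it assumes the images/ directory path is passed in by the caller.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
REQUIRED_ASSETS = ["base_woman.png", "jacket.png", "logo_generation_1.png"]


def missing_assets(images_dir) -> List[str]:
    """Return the required file names that are not present in images_dir."""
    root = Path(images_dir)
    return [name for name in REQUIRED_ASSETS if not (root / name).is_file()]
```

For example, calling missing_assets("images") returns an empty list when all three files are in place, and the names of any absent files otherwise.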
Results layout
Each harness writes into its own results/ folder:
- results/<run_id>/results.json: per-example outputs and grades
- results/<run_id>/results.csv: tabular results
- results/<run_id>/summary.json: aggregated metrics
- results/<run_id>/artifacts/*.png: generated or edited images
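To inspect a finished run, something like the following sketch could work. The latest_run and load_run helpers are hypothetical, not part of shared/. The sketch relies only on the file names listed above, treats summary.json as opaque JSON because its keys are not documented here, and picks the "latest" run by directory modification time, which is an assumption about how run IDs are created.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def latest_run(results_dir) -> Optional[Path]:
    """Return the most recently modified run directory, or None if there is none."""
    root = Path(results_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def load_run(run_dir) -> Dict:
    """Load summary.json as-is and list the PNG artifacts for one run."""
    run_dir = Path(run_dir)
    summary = json.loads((run_dir / "summary.json").read_text())
    artifacts = sorted(p.name for p in (run_dir / "artifacts").glob("*.png"))
    return {"summary": summary, "artifacts": artifacts}
```

A typical call would be load_run(latest_run("results")), after checking that latest_run did not return None.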
Tip: keep results/ out of