Image Generation + Editing Evals

This folder contains a lightweight vision eval harness plus example runners for image generation and image editing. The code mirrors the structure of examples/evals/realtime_evals/ so you can adapt it quickly.

Directory layout

vision_harness/: minimal shared library (types, runners, graders, evaluate loop)
generation_harness/: text-to-image evals (UI mockups + marketing flyer)
editing_harness/: image-edit evals (virtual try-on + logo edit)
shared/: reporting and optional rendering helpers

Quickstart

Python 3.9+ required.

pip install -r requirements.txt
export OPENAI_API_KEY="your_api_key"

Run a harness:

Generation: python generation_harness/run_imagegen_evals.py
Editing: python editing_harness/run_imagegen_evals.py

What the harness does

Builds a small set of TestCase objects (prompt + criteria).
Runs the image model for each case.
Grades each output with an LLM judge using a strict JSON schema.
Writes results and artifacts to results/<run_id>/.

The harness is intentionally small so you can copy parts of it into your own production eval setup.
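The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in vision_harness/: the TestCase fields, the evaluate and parse_verdict names, the run_model and judge callables, and the two-key verdict format (passed, reason) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class TestCase:
    """Hypothetical shape; the README only says a case is prompt + criteria."""
    id: str
    prompt: str
    criteria: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Dict[str, object]:
    """Accept only a JSON object with exactly 'passed' (bool) and 'reason' (str).

    This stands in for the strict JSON schema; the real schema may differ.
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"passed", "reason"}:
        raise ValueError("verdict must be an object with keys 'passed' and 'reason'")
    if not isinstance(data["passed"], bool) or not isinstance(data["reason"], str):
        raise ValueError("'passed' must be a bool and 'reason' a string")
    return data


def evaluate(
    cases: List[TestCase],
    run_model: Callable[[TestCase], bytes],    # assumed: returns PNG bytes
    judge: Callable[[TestCase, bytes], str],   # assumed: returns raw JSON text
    run_dir: str,
) -> List[Dict[str, object]]:
    """Run each case, save the image, grade it, and collect one row per case."""
    artifacts = Path(run_dir) / "artifacts"
    artifacts.mkdir(parents=True, exist_ok=True)
    rows = []
    for case in cases:
        image_bytes = run_model(case)
        image_path = artifacts / f"{case.id}.png"
        image_path.write_bytes(image_bytes)
        verdict = parse_verdict(judge(case, image_bytes))
        rows.append({"id": case.id, "artifact": str(image_path), **verdict})
    return rows
```

Passing the model call and the judge in as plain callables keeps the loop testable with stubs, so the control flow can be exercised without an API key.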

Example cases

Generation cases:

ui_checkout_mockup: mobile checkout screen with strict text + layout rules
coffee_flyer_generation: marketing flyer with exact copy constraints

Editing cases:

vto_jacket_tryon: virtual try-on with reference person + garment
logo_year_edit: precision logo text edit

Required assets (editing harness)

The editing harness expects these files in images/:

images/base_woman.png
images/jacket.png
images/logo_generation_1.png
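A quick preflight check can catch a missing asset before any API call is made. The sketch below is an optional helper, not part of the harness; the function name missing_assets is hypothetical, and only the three file names come from the list above.

```python
from pathlib import Path
from typing import List

# File names taken from the required-assets list above.
REQUIRED_ASSETS = ["base_woman.png", "jacket.png", "logo_generation_1.png"]


def missing_assets(images_dir: str = "images") -> List[str]:
    """Return the required file names that are not present in images_dir."""
    root = Path(images_dir)
    return [name for name in REQUIRED_ASSETS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        raise SystemExit(f"Missing editing assets in images/: {', '.join(missing)}")
    print("All editing assets found.")
```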

Results layout

Each harness writes into its own results/ folder:

results/<run_id>/results.json: per-example outputs and grades
results/<run_id>/results.csv: tabular results
results/<run_id>/summary.json: aggregated metrics
results/<run_id>/artifacts/*.png: generated or edited images
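To make the layout concrete, here is one way the three report files could be produced from a list of per-example rows. It is a sketch under assumptions: the write_reports name, the "passed" column, and the summary fields (total, passed, pass_rate) are illustrative and may not match what shared/ actually writes.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def write_reports(rows: List[Dict[str, object]], run_dir: str) -> Dict[str, object]:
    """Write results.json, results.csv, and summary.json into run_dir."""
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Per-example outputs and grades.
    (out / "results.json").write_text(json.dumps(rows, indent=2))

    # Tabular view; the columns are the union of keys across all rows.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out / "results.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Aggregated metrics; here only a pass rate over an assumed "passed" flag.
    passed = sum(1 for row in rows if row.get("passed"))
    summary = {
        "total": len(rows),
        "passed": passed,
        "pass_rate": passed / len(rows) if rows else 0.0,
    }
    (out / "summary.json").write_text(json.dumps(summary, indent=2))
    return summary
```

For example, write_reports(rows, "results/my_run") would create the three files under results/my_run/, where my_run is a placeholder run id.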

Tip: keep results/ out of
