referenceFree

editing_harness

**Best for:** high-precision checks where edits must be exact and non-target

Editing Harness

Best for: high-precision checks where edits must be exact and non-target content should remain unchanged.

This harness runs two example cases:

vto_jacket_tryon: virtual try-on using reference person + garment
logo_year_edit: precision logo text edit

Run

python editing_harness/run_imagegen_evals.py

Run a single case (use --cases):

python editing_harness/run_imagegen_evals.py --cases vto_jacket_tryon

Run multiple cases (comma-separated, no spaces):

python editing_harness/run_imagegen_evals.py --cases vto_jacket_tryon,logo_year_edit

Flags (when to use them)

--cases: limit runs to specific case ids. Valid values:
- vto_jacket_tryon
- logo_year_edit
--model: image model under test (defaults to gpt-image-1.5).
--judge-model: LLM used to grade outputs (defaults to gpt-5.2).
--num-images: number of edited images to generate per case (defaults to 1).

Inputs

This harness expects the following reference images in images/:

images/base_woman.png
images/jacket.png
images/logo_generation_1.png

Outputs

Results are written under editing_harness/results/<run_id>/:

results.json and results.csv with per-example scores
summary.json with aggregated metrics
artifacts/ for edited images

Adapting the harness

Replace the input images and edit prompts in run_imagegen_evals.py.
Update the schemas and verdict rules to fit your workflow.

Related skills.

AmazonBedrock

This course is intended to provide you with a comprehensive step-by-step understanding of how to engineer optimal prompts within Claude, using Bedrock.

anthropic_api_fundamentals

A series of notebook tutorials that cover the essentials of working with Claude models and the Anthropic SDK including:

bootstrap-realtime-eval

Bootstrap a new realtime eval folder inside this cookbook repo by choosing the right harness from examples/evals/realtime_evals, scaffolding prompt/tools/data files, generating a useful README, and validating it with smoke, full eval, and test runs. Use when a user wants to start a new crawl, walk, or run realtime eval in this repository.