Run Harness

Model-simulated multi-turn eval harness for the Realtime API. This harness uses one realtime model as the user simulator (audio-only) and another realtime model as the assistant under test. It replays fixed chunking (VAD off), mocks tools from the simulation JSON, and records full traces.

What it does

Loads run_harness/data/simulations.csv with pandas.
Reads per-simulation JSON files defining scenario, simulator identity, tool mocks, and LLM-as-judge grading criteria.
Generates user audio turns via a realtime simulator model (audio-only).
Streams user audio to the assistant in fixed-size chunks and commits manually.
Captures assistant audio/text, tool calls, tool outputs, and latencies.
Grades turn-level and trace-level criteria with an LLM-as-judge.
Writes results.csv, summary.json, and full trace logs under run_harness/results/.
Renders styled PNG plots under run_harness/results/<run_id>/plots/ by default.

Files

run_harness/run_realtime_evals.py: Run harness script.
run_harness/data/simulations.csv: Index of simulation files.
run_harness/data/sim_*.json: Simulation definitions (this repo currently ships 3 examples).
run_harness/results/<run_id>/events/*.jsonl: Full event trace per simulation.
run_harness/results/<run_id>/conversations/*.txt: Human-readable transcript with tool calls.

Simulation definitions (high level)

Each sim_*.json file defines:

the scenario (what the user is trying to do)
tool mocks (what each tool should return when called)
judge rubric (what should be graded per turn / overall)

This lets you run repeatable multi-turn evals without needing live backend integrations.

How to run

From repo root:

python run_harness/run_realtime_evals.py --max-examples 1

Common options:

--data-csv: Simulation index CSV.
--model: Alias for the assistant model under test.
--assistant-model: Realtime model under test (overrides --model

run_harness

Run Harness

What it does

Files

Simulation definitions (high level)

How to run

Related skills.

AmazonBedrock

anthropic_api_fundamentals

bootstrap-realtime-eval