Run Harness

Model-simulated multi-turn eval harness for the Realtime API. This harness uses one realtime model as the user simulator (audio-only) and another realtime model as the assistant under test. It streams the simulated user audio to the assistant in fixed-size chunks with voice activity detection (VAD) off, mocks tool outputs from each simulation's JSON, and records full traces.

What it does

  • Loads run_harness/data/simulations.csv with pandas.
  • Reads per-simulation JSON files defining scenario, simulator identity, tool mocks, and LLM-as-judge grading criteria.
  • Generates user audio turns via a realtime simulator model (audio-only).
  • Streams user audio to the assistant in fixed-size chunks and commits manually.
  • Captures assistant audio/text, tool calls, tool outputs, and latencies.
  • Grades turn-level and trace-level criteria with an LLM-as-judge.
  • Writes results.csv, summary.json, and full trace logs under run_harness/results/.
  • Renders styled PNG plots under run_harness/results/<run_id>/plots/ by default.
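The fixed-size chunking and manual commit step can be sketched roughly as follows. This is an illustration, not the harness's actual code: the function name build_audio_events and the 4800-byte default (100 ms of 24 kHz mono PCM16) are assumptions. The event types are the Realtime API client events input_audio_buffer.append, input_audio_buffer.commit, and response.create. The sketch only builds the event payloads; sending them over a connection is left out.

```python
import base64


def build_audio_events(pcm: bytes, chunk_bytes: int = 4800) -> list[dict]:
    """Split raw PCM16 audio into fixed-size chunks and build client events.

    Hypothetical helper: the name and the default chunk size are assumptions.
    4800 bytes is 100 ms of 24 kHz mono 16-bit audio.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    if not pcm:
        raise ValueError("pcm must contain audio; an empty buffer cannot be committed")

    events = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]  # the last chunk may be shorter
        events.append({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })

    # With VAD off the server does not detect the end of the turn,
    # so the client commits the buffer and requests a response itself.
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create"})
    return events
```

Keeping the chunk size fixed means every run feeds the assistant the same audio boundaries, which is what makes the replay repeatable.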

Files

  • run_harness/run_realtime_evals.py: Run harness script.
  • run_harness/data/simulations.csv: Index of simulation files.
  • run_harness/data/sim_*.json: Simulation definitions (this repo currently ships 3 examples).
  • run_harness/results/<run_id>/events/*.jsonl: Full event trace per simulation.
  • run_harness/results/<run_id>/conversations/*.txt: Human-readable transcript with tool calls.

Simulation definitions (high level)

Each sim_*.json file defines:

  • the scenario (what the user is trying to do)
  • tool mocks (what each tool should return when called)
  • judge rubric (what should be graded per turn / overall)

This lets you run repeatable multi-turn evals without needing live backend integrations.
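To make the idea of tool mocks concrete, here is a minimal sketch of how a mocked tool output could be looked up from a simulation definition. The JSON layout is hypothetical: the keys "scenario", "tool_mocks", and "rubric", the tool name lookup_order, and the helper mock_tool_output are all assumptions for illustration. Check the shipped sim_*.json files for the real schema.

```python
import json

# Hypothetical simulation definition; field names are assumptions,
# not the schema used by the shipped sim_*.json files.
EXAMPLE_SIM = json.loads("""
{
  "scenario": "User wants to check the status of an order.",
  "tool_mocks": {
    "lookup_order": {"status": "in_transit"}
  },
  "rubric": {
    "turn": ["Assistant asks for the order number before calling a tool."],
    "trace": ["Assistant reports the order status correctly."]
  }
}
""")


def mock_tool_output(sim: dict, tool_name: str) -> str:
    """Return the canned output for a tool call as a JSON string.

    Unknown tools get an error payload instead of raising, so the
    conversation can continue and the miss shows up in the trace.
    """
    mocks = sim.get("tool_mocks", {})
    if tool_name not in mocks:
        return json.dumps({"error": f"no mock defined for tool '{tool_name}'"})
    return json.dumps(mocks[tool_name])
```

Returning the same canned payload on every run is what removes the dependency on a live backend: the assistant's behavior can vary, but the tool results it sees do not.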

How to run

From repo root:

python run_harness/run_realtime_evals.py --max-examples 1

Common options:

  • --data-csv: Simulation index CSV.
  • --model: Alias for the assistant model under test.
  • --assistant-model: Realtime model under test (overrides --model