Crawl Harness
Deterministic single-turn replay for Realtime evals. This harness feeds fixed audio inputs to a Realtime session so you can compare runs with minimal variance. It is best for quick iteration and controlled comparisons.
What it does
- Loads a CSV of eval prompts from crawl_harness/data/.
- Uses TTS to synthesize a fixed input audio file per row.
- Streams audio into the Realtime API in fixed-size chunks.
- Captures the first assistant turn (text/audio/tool calls).
- Grades tool-call correctness and tool-call-arg correctness only.
- Writes artifacts into crawl_harness/results/<run_id>/.
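The fixed-size chunking step can be sketched as follows. This is a minimal illustration, not the harness's own code: the 24 kHz 16-bit mono format, the 100 ms chunk duration, and the helper names iter_chunks and to_append_event are assumptions, as is the use of base64-encoded input_audio_buffer.append events to carry each chunk.

```python
import base64
from typing import Iterator


def iter_chunks(pcm: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Yield consecutive fixed-size slices of pcm; the last slice may be shorter."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def to_append_event(chunk: bytes) -> dict:
    """Wrap one chunk in an append event (assumed shape: base64 audio payload)."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    }


# Assumed audio format: 24 kHz, 16-bit (2 bytes per sample), mono.
SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 2
CHUNK_MS = 100  # assumed chunk duration
chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

pcm = bytes(10_000)  # placeholder silence standing in for TTS output
events = [to_append_event(c) for c in iter_chunks(pcm, chunk_bytes)]
```

Because the chunk size is derived from constants rather than from timing, two runs over the same input file send identical byte boundaries, which is what keeps the replay deterministic on the input side.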
Inputs
CSV columns (required):
- example_id: Unique ID for the datapoint.
- user_text: Text prompt to synthesize into audio.
- gt_tool_call: Expected tool name, or empty if no tool call is expected.
- gt_tool_call_arg: Expected tool arguments as a JSON string, or empty.
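For illustration, a CSV with these columns could be loaded and validated roughly like this. The load_rows helper, the duplicate-ID check, and the get_weather tool in the sample data are hypothetical; the harness may parse its input differently.

```python
import csv
import io
import json

REQUIRED_COLUMNS = ("example_id", "user_text", "gt_tool_call", "gt_tool_call_arg")


def load_rows(fileobj):
    """Read eval rows, check required columns, and decode the JSON args column."""
    reader = csv.DictReader(fileobj)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows, seen = [], set()
    for row in reader:
        example_id = row["example_id"]
        if example_id in seen:  # assumed rule: IDs must be unique
            raise ValueError(f"duplicate example_id: {example_id}")
        seen.add(example_id)

        raw_args = (row["gt_tool_call_arg"] or "").strip()
        rows.append({
            "example_id": example_id,
            "user_text": row["user_text"],
            # Empty string means "no tool call expected".
            "gt_tool_call": (row["gt_tool_call"] or "").strip() or None,
            "gt_tool_call_arg": json.loads(raw_args) if raw_args else None,
        })
    return rows


# Hypothetical sample data; get_weather is not a tool named by the harness.
sample = io.StringIO(
    'example_id,user_text,gt_tool_call,gt_tool_call_arg\n'
    'ex1,What is the weather in Paris?,get_weather,"{""city"": ""Paris""}"\n'
    'ex2,Tell me a joke.,,\n'
)
rows = load_rows(sample)
```

Note that the JSON in gt_tool_call_arg must be quoted for CSV, with inner double quotes doubled, as in the ex1 row.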
Outputs
Each run writes:
- results.csv: Per-example outputs and grades.
- summary.json: Aggregate metrics (latency + correctness).
- plots/*.png: Styled score, latency, token, and status charts.
- audio/<example_id>/input.wav: The TTS input audio.
- audio/<example_id>/output.wav: The model output audio (if any).
- events/<example_id>.jsonl: Realtime event stream for the datapoint.
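When debugging a single datapoint, a quick tally of the event stream is often enough to see where a run diverged. The sketch below assumes each line of the .jsonl file is one JSON object with a "type" field; that layout is not spelled out here, and the event names in the sample are illustrative only.

```python
import io
import json
from collections import Counter


def count_event_types(lines):
    """Count events by their "type" field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[event.get("type", "<unknown>")] += 1
    return counts


# Illustrative stand-in for events/<example_id>.jsonl.
sample = io.StringIO(
    '{"type": "response.created"}\n'
    '{"type": "response.audio.delta"}\n'
    '{"type": "response.audio.delta"}\n'
    '\n'
    '{"type": "response.done"}\n'
)
counts = count_event_types(sample)
```

To use it on a real run, pass an open file handle for the datapoint's events file instead of the in-memory sample.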
How it works
- TTS generates PCM audio for each row.
- PCM is wrapped into a WAV file for easy playback.
- Audio is streamed into a Realtime session with fixed chunk sizes.
- The harness listens for response.done and records tool calls + output.
- Grades are computed:
  - tool_call_correctness: correct tool chosen (or no tool call when none expected).
  - tool_call_arg_correctness: expected args present in the tool call.
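Two of the steps above, the WAV wrapping and the grading, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the harness's implementation: the 24 kHz 16-bit mono format, the function names, the subset interpretation of "expected args present", and the choice to score args as correct when none are expected are all assumptions.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24_000, channels=1, sample_width=2):
    """Wrap raw PCM in a WAV container (assumed format: 24 kHz, 16-bit, mono)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def grade(expected_tool, expected_args, actual_tool, actual_args):
    """Return the two grades as 0/1 integers.

    expected_tool / actual_tool: tool name, or None when no call is expected/made.
    expected_args / actual_args: dict of arguments, or None.
    """
    # Correct when the names match, including "no call expected, none made".
    tool_ok = expected_tool == actual_tool

    if not expected_args:
        # Assumption: with no expected args, the arg grade follows the tool grade.
        args_ok = tool_ok
    else:
        actual_args = actual_args or {}
        # Subset check: every expected key must be present with an equal value;
        # extra keys in the actual call are tolerated.
        args_ok = tool_ok and all(
            key in actual_args and actual_args[key] == value
            for key, value in expected_args.items()
        )

    return {
        "tool_call_correctness": int(tool_ok),
        "tool_call_arg_correctness": int(args_ok),
    }


wav_bytes = pcm_to_wav_bytes(bytes(4800))
# get_weather is a hypothetical tool name used only for this example.
result = grade("get_weather", {"city": "Paris"},
               "get_weather", {"city": "Paris", "units": "c"})
```

A stricter grader could require exact equality of the argument dicts; the subset check is used here only because it matches the "expected args present" wording most directly.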
Run
From repo root:
python crawl_harness/run_realtime_evals.py
Common options:
- --data-csv: Path to a CSV file.
- --results-dir: Output folder (defaults to crawl_harness/results).
- --run-name: Optional run name (defaults to ti