Crawl Harness

Deterministic single-turn replay for Realtime evals. The harness feeds the same fixed audio inputs to a Realtime session on every run, so you can compare runs with minimal input variance. It is best suited to quick iteration and controlled comparisons.

What it does

Loads a CSV of eval prompts from crawl_harness/data/.
Uses TTS to synthesize a fixed input audio file per row.
Streams audio into the Realtime API in fixed-size chunks.
Captures the first assistant turn (text/audio/tool calls).
Grades tool-call correctness and tool-call-arg correctness only.
Writes artifacts into crawl_harness/results/<run_id>/.
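
As a rough illustration of the fixed-size chunk streaming step, the sketch below splits raw PCM into equal chunks and wraps each one in a Realtime `input_audio_buffer.append` event. The sample rate (24 kHz mono PCM16), the 100 ms chunk length, and the helper names `iter_pcm_chunks` and `to_append_event` are assumptions for illustration, not values taken from the harness.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # assumption: 24 kHz mono PCM16 input
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHUNK_MS = 100           # assumption: hypothetical chunk length


def iter_pcm_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield fixed-size slices of PCM; the last slice may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def to_append_event(chunk: bytes) -> str:
    """Serialize one chunk as an input_audio_buffer.append event (base64 audio)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


# One second of silence stands in for a synthesized TTS input.
one_second_of_silence = b"\x00" * (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = [to_append_event(c) for c in iter_pcm_chunks(one_second_of_silence)]
```

Because the chunk size depends only on constants, the same input file always produces the same sequence of events, which is what keeps replays comparable.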

Inputs

CSV columns (required):

example_id: Unique ID for the datapoint.
user_text: Text prompt to synthesize into audio.
gt_tool_call: Expected tool name, or empty if no tool call is expected.
gt_tool_call_arg: Expected tool arguments as a JSON string, or empty if none are expected.
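
To make the column contract concrete, here is a minimal sketch of loading and validating such a CSV. The `load_rows` helper, the duplicate-ID check, and the sample rows (including the `get_weather` tool) are hypothetical and not taken from the harness; only the four column names come from the list above.

```python
import csv
import io
import json

REQUIRED_COLUMNS = ("example_id", "user_text", "gt_tool_call", "gt_tool_call_arg")


def load_rows(handle):
    """Read eval rows, checking required columns and parsing the JSON args."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows, seen = [], set()
    for row in reader:
        example_id = row["example_id"]
        if example_id in seen:
            raise ValueError(f"duplicate example_id: {example_id}")
        seen.add(example_id)

        raw_args = (row["gt_tool_call_arg"] or "").strip()
        rows.append({
            "example_id": example_id,
            "user_text": row["user_text"],
            # An empty cell means no tool call (or no arguments) is expected.
            "gt_tool_call": (row["gt_tool_call"] or "").strip() or None,
            "gt_tool_call_arg": json.loads(raw_args) if raw_args else None,
        })
    return rows


# Hypothetical sample data; note the doubled quotes CSV needs around JSON.
sample = io.StringIO(
    'example_id,user_text,gt_tool_call,gt_tool_call_arg\n'
    'ex1,What is the weather in Paris?,get_weather,"{""city"": ""Paris""}"\n'
    'ex2,Tell me a joke.,,\n'
)
rows = load_rows(sample)
```

Parsing the JSON up front surfaces malformed `gt_tool_call_arg` cells before any audio is synthesized.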

Outputs

Each run writes:

results.csv: Per-example outputs and grades.
summary.json: Aggregate metrics (latency + correctness).
plots/*.png: Styled score, latency, token, and status charts.
audio/<example_id>/input.wav: The TTS input audio.
audio/<example_id>/output.wav: The model output audio (if any).
events/<example_id>.jsonl: Realtime event stream for the datapoint.

How it works

TTS generates PCM audio for each row.
PCM is wrapped into a WAV file for easy playback.
Audio is streamed into a Realtime session with fixed chunk sizes.
The harness listens for response.done and records tool calls + output.
Grades are computed:
- tool_call_correctness: correct tool chosen (or no tool call when none expected).
- tool_call_arg_correctness: expected args present in the tool call.
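
The WAV-wrapping and grading steps above can be sketched as follows. This is an illustration under stated assumptions, not the harness's actual code: the function names are hypothetical, the audio is assumed to be 24 kHz mono PCM16, argument grading is assumed to be a subset match (every expected key present with an equal value), and an example with no expected arguments is assumed to score 1.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24_000) -> bytes:
    """Wrap raw mono PCM16 in a WAV container so it can be played back."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def grade_tool_call(expected_name, actual_name) -> int:
    """1 if the right tool was called, or no tool was called when none was expected."""
    return int((expected_name or None) == (actual_name or None))


def grade_tool_call_args(expected_args, actual_args) -> int:
    """1 if every expected argument appears in the call with an equal value."""
    if not expected_args:
        return 1  # assumption: nothing expected means nothing to miss
    if not isinstance(actual_args, dict):
        return 0
    return int(all(
        key in actual_args and actual_args[key] == value
        for key, value in expected_args.items()
    ))


wav_bytes = pcm_to_wav_bytes(b"\x00\x00\x01\x00")  # two 16-bit samples
```

A subset match tolerates extra arguments the model adds; a stricter grader would compare the two dictionaries for equality instead.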

Run

From repo root:

python crawl_harness/run_realtime_evals.py

Common options:

--data-csv: Path to a CSV file.
--results-dir: Output folder (defaults to crawl_harness/results).
--run-name: Optional run name (defaults to ti
