Harness Selection
Use this table when the user is not sure which realtime eval harness to start with.
| Harness | Use when | Required starting material | Best first outcome |
|---|---|---|---|
| `crawl` | The user wants fast iteration on single-turn behavior, tool choice, or tool args, including basic synthetic audio generated from text. | Text prompts. Expected tool names and args are optional, needed only if the user wants tool-call grading. | A small CSV plus smoke/full commands that run immediately. |
| `walk` | The user cares about saved phone audio, wants more realistic audio replay, or needs synthetic audio with replay-specific characteristics such as noise, telephony artifacts, or speaker traits. | Either a CSV with `audio_path`, or text rows that can be turned into audio. | A walk dataset plus an optional audio-generation step and replay commands. |
| `run` | The user needs multi-turn behavior, tool mocks, or conversation-level grading. | A scenario, starter user utterance, simulator prompt, tool mocks, and grader criteria. | `simulations.csv` plus one or more `sim_*.json` files. |
Recommendation Rules
- Recommend `crawl` by default when the user only has text rows or a rough task description.
- Recommend `crawl` when the user asks for synthetic audio but does not mention any replay-specific audio characteristics.
- Recommend `walk` when the user already has WAV files, cares about telephony realism, wants to validate the audio replay path, or wants synthetic audio with specific noise or speaker characteristics.
- Recommend `run` only when the evaluation target depends on multiple turns, tool outputs, or conversation-level completion.
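
The rules above can be sketched as a small decision function. This is an illustrative sketch only: the `UserNeeds` class, its field names, and `recommend_harness` are hypothetical and not part of any harness. The check order (run, then walk, then crawl) is also an assumption, chosen so that the most specific need wins and crawl stays the default.

```python
from dataclasses import dataclass


@dataclass
class UserNeeds:
    """Hypothetical summary of the user's request (field names are illustrative)."""
    needs_multi_turn: bool = False               # multiple turns, tool outputs, or conversation-level completion
    has_wav_files: bool = False                  # saved audio already exists
    cares_about_telephony_realism: bool = False
    wants_replay_path_validation: bool = False
    wants_noise_or_speaker_traits: bool = False  # synthetic audio with replay-specific characteristics


def recommend_harness(needs: UserNeeds) -> str:
    """Return "crawl", "walk", or "run" following the recommendation rules."""
    # Assumption: multi-turn needs take priority over audio concerns.
    if needs.needs_multi_turn:
        return "run"
    if (
        needs.has_wav_files
        or needs.cares_about_telephony_realism
        or needs.wants_replay_path_validation
        or needs.wants_noise_or_speaker_traits
    ):
        return "walk"
    # Default: text rows, a rough task description, or plain synthetic audio.
    return "crawl"
```

A request for synthetic audio with no replay-specific characteristics sets none of the flags, so it falls through to `crawl`, matching the second rule.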
Data Contracts
Crawl
Required columns:
- `example_id`
- `user_text`
Optional columns for tool-call grading:
- `gt_tool_call`
- `gt_tool_call_arg`
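
A quick header check can catch a malformed crawl CSV before any eval runs. The sketch below is an assumption-laden illustration: `check_crawl_header` is a hypothetical helper, and the sample row values (including the JSON-string form of `gt_tool_call_arg`) are invented for demonstration. Only the column names come from the contract above.

```python
import csv
import io

REQUIRED_CRAWL_COLUMNS = {"example_id", "user_text"}
OPTIONAL_CRAWL_COLUMNS = {"gt_tool_call", "gt_tool_call_arg"}


def check_crawl_header(csv_text: str) -> dict:
    """Report missing required columns and which optional grading columns are present."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return {
        "missing_required": sorted(REQUIRED_CRAWL_COLUMNS - header),
        "optional_present": sorted(OPTIONAL_CRAWL_COLUMNS & header),
    }


# Hypothetical sample data; the argument encoding shown here is an assumption.
sample = (
    "example_id,user_text,gt_tool_call,gt_tool_call_arg\n"
    'ex_001,"What is the weather in Paris?",get_weather,"{""city"": ""Paris""}"\n'
)
result = check_crawl_header(sample)
```

An empty `missing_required` list means the file satisfies the required part of the contract; `optional_present` shows whether tool-call grading columns were supplied.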
Walk
If audio is already available, required columns are:
- `example_id`
- `user_text`
- `audio_path`
Optional columns for tool-call grading:
- `gt_tool_cal