Walk Harness

The walk harness replays saved audio to make realtime eval runs comparable. It streams G.711 mu-law WAV files in fixed-size chunks, commits the user turn manually (VAD off), and records responses, tool calls, and latency metrics.

What it does

  • Loads a CSV with the crawl columns (excluding expected_keywords) plus audio_path.
  • Streams saved audio in fixed chunk sizes (default 20 ms) at a deterministic cadence.
  • Commits the input buffer explicitly to avoid VAD variability.
  • Captures transcript deltas, audio deltas, tool calls, and completion events.
  • Writes results.csv and summary.json with accuracy and latency stats.
  • Renders styled PNG plots under results/<run>/plots/ by default.
  • Stores a JSONL stream of all realtime events per example under results/<run>/events/.
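The chunking and manual commit described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the helper names (chunk_size_bytes, iter_chunks, build_client_events) are hypothetical, and the event names input_audio_buffer.append and input_audio_buffer.commit assume the harness targets the OpenAI Realtime API's client events. G.711 mu-law at 8 kHz uses one byte per sample, so a 20 ms chunk is 160 bytes.

```python
import base64
from typing import Iterator

SAMPLE_RATE_HZ = 8000   # G.711 mu-law telephony rate
BYTES_PER_SAMPLE = 1    # mu-law encodes each sample in one byte


def chunk_size_bytes(chunk_ms: int = 20) -> int:
    """Number of mu-law bytes that cover chunk_ms milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def iter_chunks(audio: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-size chunks; the last chunk may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


def build_client_events(audio: bytes, chunk_ms: int = 20) -> list[dict]:
    """Build one append event per chunk, then a single explicit commit.

    Event names are an assumption (OpenAI Realtime API style).
    """
    events = [
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
        for chunk in iter_chunks(audio, chunk_ms)
    ]
    # With VAD off, the user turn ends only when the client commits it.
    events.append({"type": "input_audio_buffer.commit"})
    return events
```

A sender would typically pause for chunk_ms between append events to keep the cadence deterministic; that pacing is left out here so the sketch stays independent of any particular client library.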

Files

  • walk_harness/generate_audio.py: Generates G.711 mu-law WAV files from the crawl CSV using TTS + ffmpeg.
  • walk_harness/data/customer_service_synthetic.csv: Walk dataset with audio_path pointing to WAV files.
  • walk_harness/run_realtime_evals.py: Runs the walk evals.
  • walk_harness/results/<run>/events/*.jsonl: Event logs per datapoint.
  • walk_harness/results/<run>/audio/<example_id>/output.wav: Assistant output audio per datapoint.
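To inspect an event log after a run, something like the following may help. It is a sketch under two assumptions that the source does not confirm: each line of the JSONL file is a single JSON object, and each object carries a "type" field. The function name count_event_types is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_event_types(path: str) -> Counter:
    """Count events per type in one JSONL event log.

    Assumes one JSON object per line with a "type" key (an assumption).
    """
    counts: Counter = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get("type", "<missing>")] += 1
    return counts
```

Calling count_event_types on a file under walk_harness/results/<run>/events/ gives a quick view of which event types were recorded for that datapoint.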

How to run

  1. Install ffmpeg (required for mu-law WAV encoding). On macOS with Homebrew:
     brew install ffmpeg
  2. Generate audio assets:
     python walk_harness/generate_audio.py
  3. Run the eval harness:
     python walk_harness/run_realtime_evals.py
  4. Quick smoke test:
     python walk_harness/run_realtime_evals.py --max-examples 2

Inputs and audio format

  • The dataset CSV must include an audio_path column pointing to WAV files.
  • This harness currently expects G.711 mu-law audio at 8 kHz (g711_ulaw), a common telephony format that keeps runs closer to phone-call conditions than uncompressed PCM.
  • Output audio is saved as PCM16 WAV (typically 24 kHz) for easy playback.
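For the output side, saving PCM16 audio as a WAV file can be done with Python's standard wave module. This is a minimal sketch, not the harness's own writer: the function name write_pcm16_wav is hypothetical, and mono output is an assumption. The 24 kHz default mirrors the "typically 24 kHz" note above.

```python
import wave


def write_pcm16_wav(path: str, pcm_bytes: bytes, sample_rate: int = 24000) -> None:
    """Write raw little-endian 16-bit PCM bytes to a WAV file.

    Mono output is an assumption; adjust setnchannels if needed.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono (assumed)
        wf.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```

Note that the wave module writes and reads uncompressed PCM only, so it fits the PCM16 output files but is not suitable for reading the mu-law input files directly.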

Adapting the harness

  • C