Realtime Evals
This folder contains three evaluation harnesses for the Realtime API that increase in complexity: crawl → walk → run.
- crawl: synthetic single-turn (TTS → stream → grade)
- walk: deterministic replay of saved phone-audio (more realistic, still comparable)
- run: model-simulated multi-turn conversations (finds multi-turn/tooling failures)
Pick the folder that matches your realtime eval maturity and point Codex (or your preferred coding assistant) at it so it can adapt the harness and bootstrap sample data. You can also have it run the included tests to confirm everything works.
Quickstart
Python 3.12+ required.
make install
source .venv/bin/activate
export OPENAI_API_KEY="your_api_key"
make install creates the local .venv and installs both runtime and dev dependencies. It uses uv when available and otherwise falls back to python -m venv plus pip install -r requirements.txt -r requirements-dev.txt.
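The fallback described above can be sketched in Python as follows. This is an illustration, not the Makefile's actual recipe: the function name install_commands is hypothetical, and the uv invocations (uv venv, uv pip install) are an assumption about how the uv path might look. Only the python -m venv plus pip install fallback is taken from the text.

```python
import shutil
import sys


def install_commands(uv_available: bool) -> list[list[str]]:
    """Return the commands a `make install`-style bootstrap might run."""
    requirements = ["-r", "requirements.txt", "-r", "requirements-dev.txt"]
    if uv_available:
        # Assumed uv path: the real Makefile may invoke uv differently.
        return [
            ["uv", "venv", ".venv"],
            ["uv", "pip", "install", *requirements],
        ]
    # Fallback named in the text: stdlib venv plus pip (POSIX venv layout).
    return [
        [sys.executable, "-m", "venv", ".venv"],
        [".venv/bin/pip", "install", *requirements],
    ]


if __name__ == "__main__":
    # Print the plan instead of executing it, so the sketch has no side effects.
    for command in install_commands(shutil.which("uv") is not None):
        print(" ".join(command))
```

Printing the plan rather than running it keeps the sketch safe to try; passing each list to subprocess.run(..., check=True) would execute it.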
Run a first command per harness. If uv is not installed, drop the uv run prefix and run these scripts with your .venv activated:
- Crawl:
uv run python crawl_harness/run_realtime_evals.py
- Walk: install ffmpeg (brew install ffmpeg), then:
uv run python walk_harness/generate_audio.py
uv run python walk_harness/run_realtime_evals.py
- Run:
uv run python run_harness/run_realtime_evals.py --max-examples 1
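As a rough sketch, the first commands above can be assembled programmatically, with or without uv, along with a check for the ffmpeg prerequisite of the walk harness. The script paths and the --max-examples flag come from the list above; the helper names (first_commands, missing_prerequisites) and the dictionary layout are hypothetical and not part of the repository.

```python
import shutil

# Script arguments per harness, in the order they should run.
FIRST_COMMANDS: dict[str, list[list[str]]] = {
    "crawl": [["crawl_harness/run_realtime_evals.py"]],
    "walk": [
        ["walk_harness/generate_audio.py"],
        ["walk_harness/run_realtime_evals.py"],
    ],
    "run": [["run_harness/run_realtime_evals.py", "--max-examples", "1"]],
}


def first_commands(harness: str, use_uv: bool) -> list[list[str]]:
    """Build the first command(s) for a harness, with or without uv."""
    prefix = ["uv", "run", "python"] if use_uv else ["python"]
    return [prefix + args for args in FIRST_COMMANDS[harness]]


def missing_prerequisites(harness: str) -> list[str]:
    """Report external tools that are required but not found on PATH."""
    if harness == "walk" and shutil.which("ffmpeg") is None:
        return ["ffmpeg"]
    return []


if __name__ == "__main__":
    use_uv = shutil.which("uv") is not None
    for name in FIRST_COMMANDS:
        for tool in missing_prerequisites(name):
            print(f"# {name}: install {tool} first")
        for command in first_commands(name, use_uv):
            print(" ".join(command))
```

Without uv, the plain python prefix assumes the .venv is already activated, as noted above.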
Dev commands
Use the root Makefile for common checks. Run make install first to create .venv. These targets work with or without uv: when uv is installed they run through uv run, and otherwise they use the matching tool binaries from the local .venv.
- make install
- make streamlit
- make format
- make lint
- make lint-fix
- make typecheck
- make test
Crawl (synthetic single-turn)
Best for: Fast iteration and controlled comparisons.
- Uses text prompts from a CSV and synthesizes TTS audio per row.
- Streams