Realtime Evals

This folder contains three evaluation harnesses for the Realtime API that increase in complexity: crawl → walk → run.

crawl: synthetic single-turn (TTS → stream → grade)
walk: deterministic replay of saved phone-audio (more realistic, still comparable)
run: model-simulated multi-turn conversations (finds multi-turn/tooling failures)

Pick the folder that matches your realtime eval maturity and point Codex (or your preferred coding assistant) at it so it can adapt the harness and bootstrap sample data. You can also have it run the included tests to confirm everything works.

Quickstart

Python 3.12+ required.

make install
source .venv/bin/activate
export OPENAI_API_KEY="your_api_key"

make install creates the local .venv and installs both runtime and dev dependencies. It uses uv when available and otherwise falls back to python -m venv plus pip install -r requirements.txt -r requirements-dev.txt.
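The setup requirements above (Python 3.12+, an exported OPENAI_API_KEY, and optional uv) can be checked before running anything. The following is a minimal sketch, not part of the repository: the preflight function name and its return shape are hypothetical, and it only inspects the local environment without calling the API.

```python
import os
import shutil
import sys


def preflight(version=None, env=None, which=shutil.which):
    """Return (problems, installer) for the Quickstart prerequisites.

    Hypothetical helper: the arguments exist only so the check can be
    exercised without touching the real interpreter or environment.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    env = os.environ if env is None else env

    problems = []
    if version < (3, 12):
        problems.append(f"Python 3.12+ required, found {version[0]}.{version[1]}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # uv is optional: make install falls back to python -m venv plus pip.
    installer = "uv" if which("uv") else "python -m venv + pip"
    return problems, installer
```

Calling preflight() with no arguments checks the current interpreter and environment; an empty problems list means the Quickstart prerequisites described above are met.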

Run a first command per harness. If uv is not installed, drop the uv run prefix (so that uv run python <script> becomes python <script>) and run the scripts with your .venv activated:

Crawl: uv run python crawl_harness/run_realtime_evals.py
Walk: install ffmpeg (for example, brew install ffmpeg on macOS), then:
- uv run python walk_harness/generate_audio.py
- uv run python walk_harness/run_realtime_evals.py
Run: uv run python run_harness/run_realtime_evals.py --max-examples 1
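The first commands listed above differ only in whether they are launched through uv. As a rough sketch, the choice could be scripted as follows; the script paths come from the list above, while HARNESS_SCRIPTS and first_commands are hypothetical names introduced here for illustration.

```python
import shutil

# Script paths as listed in the Quickstart; the mapping itself is illustrative.
HARNESS_SCRIPTS = {
    "crawl": ["crawl_harness/run_realtime_evals.py"],
    "walk": [
        "walk_harness/generate_audio.py",
        "walk_harness/run_realtime_evals.py",
    ],
    "run": ["run_harness/run_realtime_evals.py"],
}


def first_commands(harness, extra_args=(), uv_available=None):
    """Build the argv lists for a harness, with or without uv."""
    if uv_available is None:
        uv_available = shutil.which("uv") is not None

    # Without uv, the scripts run directly with python from the activated .venv.
    prefix = ["uv", "run", "python"] if uv_available else ["python"]
    commands = [prefix + [script] for script in HARNESS_SCRIPTS[harness]]

    # Extra flags apply to the final eval script only.
    commands[-1] += list(extra_args)
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True). For example, first_commands("run", ["--max-examples", "1"]) builds the Run command shown above.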

Dev commands

Use the root Makefile for common checks. Run make install first to create .venv. These targets work with or without uv: when uv is installed they run through uv run, and otherwise they use the matching tool binaries from the local .venv.

make install
make streamlit
make format
make lint
make lint-fix
make typecheck
make test

Crawl (synthetic single-turn)

Best for: Fast iteration and controlled comparisons.

Uses text prompts from a CSV and synthesizes TTS audio per row.
Streams
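The CSV-to-audio step described above can be pictured with a small sketch. This is not the harness's code: the prompt column name is an assumption about the CSV layout, and synthesize is a hypothetical stand-in for whatever TTS call the harness makes, so no API request is issued here.

```python
import csv
import io
from typing import Callable, Iterator


def synthesize_rows(
    csv_text: str,
    synthesize: Callable[[str], bytes],
    prompt_column: str = "prompt",  # assumed column name, not from the repo
) -> Iterator[tuple[int, bytes]]:
    """Yield (row_index, audio_bytes) for each text prompt in the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        # One synthesized clip per row keeps inputs and results aligned.
        yield index, synthesize(row[prompt_column])


def fake_tts(text: str) -> bytes:
    """Placeholder synthesizer: returns the UTF-8 bytes of the prompt."""
    return text.encode("utf-8")


sample_csv = "prompt\nWhat is my balance?\nCancel my order.\n"
clips = list(synthesize_rows(sample_csv, fake_tts))
```

Swapping fake_tts for a real synthesizer would leave the per-row loop unchanged, which is what keeps comparisons across runs controlled.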
