Code-defined LLM evals with a real comparison dashboard.
Write test cases in TypeScript. Run them across Claude, GPT, and Gemini. Score with deterministic checkers or an LLM judge. Track regressions on every PR. Free to self-host on Vercel + Neon.
- Runs completed: 12
- Cases scored: 336
- Models compared: 6
- Suites: 2
What's shipped
Every feature below is live in this build. Click any card to jump to it.
Code-first test suites
Define cases in *.eval.ts files. Type-safe, reviewable in PRs, no YAML drift.
Real comparison dashboard
Side-by-side run diff, leaderboard bars, per-suite timelines, cost breakdowns. Recharts + RSC.
Cost dashboard
Daily spend, by-model, by-suite, top runs. Backfill demo costs in one SQL statement.
Outgoing webhooks
HMAC-signed run.completed + regression.detected payloads. Test deliveries from the UI.
Public API + Swagger
Five GET endpoints + four webhook config endpoints. Interactive try-it-out at /api-docs.
Sheets + OneDrive export
OAuth to Google or Microsoft. CSV download works without any account at all.
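For the HMAC-signed webhook payloads mentioned above, a receiver would typically recompute the signature over the raw request body and compare it in constant time. The sketch below assumes a hex-encoded HMAC-SHA256 signature; the actual algorithm, encoding, and header name used by this project are not stated here, and signPayload and verifySignature are hypothetical helper names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: the signature is a hex-encoded HMAC-SHA256 of the raw body.
function signPayload(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Returns true only if receivedHex matches the signature we compute ourselves.
function verifySignature(
  secret: string,
  rawBody: string,
  receivedHex: string,
): boolean {
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const received = Buffer.from(receivedHex, "hex");
  // timingSafeEqual throws when lengths differ, so reject those up front.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

Verify against the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the signature.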
How it fits together
Three entry points run the same eval pipeline; the dashboard reads the same Postgres tables they write to.
Define
Write a suite at tests/foo.eval.ts: name, models, prompt, cases, scorers.
Run
Run pnpm eval locally, hit POST /api/seed in the browser, or open a PR. All three use the same runner and produce the same results.
Compare
Browse runs, drill into a leaderboard, diff two runs side by side, export to Sheets or OneDrive, or wire a webhook to Slack.
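To make the Define step more concrete, here is a rough sketch of what a suite with a name, models, prompt, cases, and scorers might look like, plus a deterministic checker. The EvalSuite type, the exactMatch scorer, the scoreSuite helper, and the "mock:echo" model id are all hypothetical, written for illustration only; the project's real exports and signatures may differ.

```typescript
// Hypothetical shapes: the project's real types are not shown in this page.
type Scorer = (output: string, expected: string) => number; // 1 = pass, 0 = fail

interface EvalCase {
  input: string;
  expected: string;
}

interface EvalSuite {
  name: string;
  models: string[];
  prompt: (input: string) => string;
  cases: EvalCase[];
  scorers: Scorer[];
}

// A deterministic checker: case-insensitive equality after trimming.
const exactMatch: Scorer = (output, expected) =>
  output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;

const suite: EvalSuite = {
  name: "capital-cities",
  models: ["mock:echo"], // illustrative id, not a real provider name
  prompt: (input) => `Answer with one word. ${input}`,
  cases: [
    { input: "What is the capital of France?", expected: "Paris" },
    { input: "What is the capital of Japan?", expected: "Tokyo" },
  ],
  scorers: [exactMatch],
};

// Mean score across every (case, scorer) pair for one completion function.
function scoreSuite(s: EvalSuite, complete: (prompt: string) => string): number {
  let total = 0;
  let count = 0;
  for (const c of s.cases) {
    const output = complete(s.prompt(c.input));
    for (const scorer of s.scorers) {
      total += scorer(output, c.expected);
      count += 1;
    }
  }
  return count === 0 ? 0 : total / count;
}
```

Keeping scorers as plain functions is one way to let deterministic checkers and an LLM judge share the same slot in a suite; a judge-based scorer would most likely need to be asynchronous, which this sketch leaves out.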
Try the demo
Everything in this build is live. The data is synthetic, generated by a deterministic mock:* provider; no real model calls were made. Open the leaderboard or browse the runs.
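One way a deterministic mock provider could work is to hash the model id and prompt, then use the hash to pick a canned response, so the same inputs always produce the same output. This is a sketch of that general idea only, not the project's actual mock:* implementation; fnv1a and mockComplete are hypothetical names, and the hash is an FNV-1a-style variant over UTF-16 code units.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (not raw bytes).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Math.imul keeps the multiply in 32 bits; >>> 0 makes it unsigned.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Picks one canned response based only on (model, prompt): no randomness,
// no clock, no network, so repeated runs give identical results.
function mockComplete(
  model: string,
  prompt: string,
  candidates: string[],
): string {
  if (candidates.length === 0) {
    throw new Error("mockComplete needs at least one candidate response");
  }
  const index = fnv1a(`${model}\n${prompt}`) % candidates.length;
  return candidates[index] as string;
}
```

Because the output depends only on its inputs, seeded demo data stays stable across reseeds, which keeps run diffs and regression views reproducible.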