evalbench

Code-defined LLM evals with a real comparison dashboard.

Write test cases in TypeScript. Run them across Claude, GPT, and Gemini. Score with deterministic checkers or an LLM judge. Track regressions on every PR. Free to self-host on Vercel + Neon.

Runs completed: 12
Cases scored: 336
Models compared: 6
Suites: 2

What's shipped

Every feature below is live in this build.

Code-first test suites

Define cases in *.eval.ts files. Type-safe, reviewable in PRs, no YAML drift.

Real comparison dashboard

Side-by-side run diff, leaderboard bars, per-suite timelines, cost breakdowns. Built with Recharts and React Server Components.

Cost dashboard

Daily spend, spend by model and by suite, and top runs. Backfill demo costs in one SQL statement.

Outgoing webhooks

HMAC-signed run.completed + regression.detected payloads. Test deliveries from the UI.
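
A receiver would normally verify the signature before trusting a payload. The sketch below assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body; this page does not state the actual algorithm, encoding, or header name, so signPayload and verifySignature are hypothetical helpers rather than evalbench's API.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: hex-encoded HMAC-SHA256 over the raw body. Check the real
// webhook docs for the actual scheme and header name.
function signPayload(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Returns true only if the received signature matches the expected one.
function verifySignature(
  secret: string,
  rawBody: string,
  signatureHex: string,
): boolean {
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

Verify against the raw body bytes, not a re-serialized object, because re-serializing can reorder keys or change whitespace and so change the digest.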

Public API + Swagger

Five GET endpoints plus four webhook configuration endpoints. Interactive try-it-out at /api-docs.

Sheets + OneDrive export

OAuth to Google or Microsoft. CSV download works without any account at all.

How it fits together

Three entry points run the same eval pipeline; the dashboard reads the same Postgres tables they write to.

Define

Write a suite at tests/foo.eval.ts: name, models, prompt, cases, scorers.
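
As a rough illustration of the five fields named above, a suite file might look like the sketch below. The Suite, EvalCase, and Scorer types, the mock:echo model ID, and the containsExpected checker are all assumptions made for this example; the page does not show evalbench's real exports, and a real suite file would export the suite object.

```typescript
// Hypothetical shapes, declared locally so the sketch is self-contained.
type EvalCase = { id: string; input: string; expected: string };
type Scorer = (output: string, c: EvalCase) => number; // score in [0, 1]

interface Suite {
  name: string;
  models: string[];
  prompt: (input: string) => string;
  cases: EvalCase[];
  scorers: Record<string, Scorer>;
}

// Deterministic checker: 1 if the output mentions the expected phrase.
const containsExpected: Scorer = (output, c) =>
  output.toLowerCase().includes(c.expected.toLowerCase()) ? 1 : 0;

const suite: Suite = {
  name: "foo",
  models: ["mock:echo"], // assumed model ID
  prompt: (input) => `Review this snippet and name the bug:\n${input}`,
  cases: [
    {
      id: "loop-bound",
      input: "for (let i = 0; i <= arr.length; i++) use(arr[i]);",
      expected: "off-by-one",
    },
  ],
  scorers: { containsExpected },
};
```

Keeping scorers as plain functions lets a deterministic checker run in unit tests without any model call.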

Run

Run pnpm eval locally, send POST /api/seed from the browser, or open a PR. All three use the same runner and produce the same results.

Compare

Browse runs, drill into a leaderboard, diff two runs side by side, export to Sheets or OneDrive, or wire a webhook to Slack.
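
One way to think about a two-run diff is as a per-case score comparison. The following is a minimal sketch under assumed data shapes: CaseScore, Regression, findRegressions, and the 0.05 tolerance are hypothetical and are not taken from evalbench's schema or its regression rule.

```typescript
// Hypothetical row shape: one score per case per run.
type CaseScore = { caseId: string; score: number };
type Regression = { caseId: string; before: number; after: number };

// Flags cases whose score dropped by more than `tolerance` between runs.
// Cases present in only one of the two runs are ignored.
function findRegressions(
  baseline: CaseScore[],
  candidate: CaseScore[],
  tolerance = 0.05,
): Regression[] {
  const before = new Map<string, number>();
  for (const s of baseline) before.set(s.caseId, s.score);

  const regressions: Regression[] = [];
  for (const { caseId, score } of candidate) {
    const prev = before.get(caseId);
    if (prev !== undefined && prev - score > tolerance) {
      regressions.push({ caseId, before: prev, after: score });
    }
  }
  return regressions;
}
```

A small tolerance keeps noisy LLM-judge scores from flagging every run; for deterministic checkers it could be set to 0.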

Try the demo

Everything in this build is live. The data is synthetic: a deterministic mock:* provider generated it, and no real model calls were made. Open the leaderboard or browse the runs.
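
To illustrate what "deterministic" could mean here, the sketch below derives a canned response from a hash of the model ID and prompt, so the same inputs always give the same output. It is not the actual mock:* implementation; mockComplete, the canned strings, and the latency formula are invented for this example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical mock provider: output depends only on (model, prompt).
function mockComplete(
  model: string,
  prompt: string,
): { output: string; latencyMs: number } {
  const digest = createHash("sha256").update(`${model}\n${prompt}`).digest();
  const canned = ["looks good", "possible off-by-one", "missing null check"];
  return {
    // First digest byte selects a canned answer.
    output: canned[digest[0] % canned.length],
    // Second digest byte gives a stable fake latency of 100 to 590 ms.
    latencyMs: 100 + (digest[1] % 50) * 10,
  };
}
```

Because nothing here reads the clock or a random source, reseeding the demo would reproduce the same runs.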
