evalbench

Code-defined LLM evals with a real comparison dashboard.

Write test cases in TypeScript. Run them across Claude, GPT, and Gemini. Score with deterministic checkers or an LLM judge. Track regressions on every PR. Free to self-host on Vercel + Neon.

Runs completed: 12
Cases scored: 336
Models compared: 6
Suites: 2

How it fits together

Three entry points run the same eval pipeline; the dashboard reads the same Postgres tables they write to.

1. Define

Write a suite at tests/foo.eval.ts: name, models, prompt, cases, scorers.

2. Run

Run pnpm eval locally, trigger POST /api/seed from the browser, or open a PR. All three use the same runner and produce the same results.

3. Compare

Browse runs, drill into a leaderboard, diff two runs side by side, export to Sheets or OneDrive, or wire a webhook to Slack.
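
To make the Define step concrete, here is a minimal sketch of what a suite with the five listed parts (name, models, prompt, cases, scorers) might look like, together with two deterministic checkers. This page does not show evalbench's actual suite API, so the `Suite`, `EvalCase`, and `Scorer` types, the `scoreOutput` helper, and the `mock:alpha` / `mock:beta` model names are all assumptions made for illustration.

```typescript
// Hypothetical shapes: evalbench's real suite API is not shown on this page.
type Scorer = (output: string, expected: string) => number; // 0 = fail, 1 = pass

interface EvalCase {
  input: string;
  expected: string;
}

interface Suite {
  name: string;
  models: string[];
  prompt: (input: string) => string;
  cases: EvalCase[];
  scorers: Record<string, Scorer>;
}

// Deterministic checker: passes only when the whole answer matches,
// ignoring case and surrounding whitespace.
const exactMatch: Scorer = (output, expected) =>
  output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;

// Deterministic checker: passes when the expected text appears anywhere.
const contains: Scorer = (output, expected) =>
  output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;

const suite: Suite = {
  name: "capital-cities",
  models: ["mock:alpha", "mock:beta"], // assumed names under the mock:* provider
  prompt: (input) => `Answer with a single word. ${input}`,
  cases: [
    { input: "What is the capital of France?", expected: "Paris" },
    { input: "What is the capital of Japan?", expected: "Tokyo" },
  ],
  scorers: { exactMatch, contains },
};

// Apply every scorer in the suite to one model output for one case.
function scoreOutput(
  suite: Suite,
  testCase: EvalCase,
  output: string,
): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const [scorerName, scorer] of Object.entries(suite.scorers)) {
    scores[scorerName] = scorer(output, testCase.expected);
  }
  return scores;
}
```

Because both checkers are pure functions of the output and the expected value, scoring the same output twice gives the same result, which is what lets two runs be compared directly. An LLM judge would presumably fit the same `Scorer` shape but return a model-assigned score instead.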

Try the demo

Everything in this build is live. The data is synthetic, generated by a deterministic mock:* provider; no real model calls were made. Open the leaderboard or browse the runs.
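
One way a deterministic mock provider could work is to hash the model name and prompt, then use the hash to pick a canned response, so the same inputs always yield the same output without any network call. The sketch below is an illustration of that idea only: `fnv1a` and `mockComplete` are hypothetical names, and this page does not describe how the real mock:* provider generates its data.

```typescript
// Hypothetical sketch: not evalbench's actual mock:* implementation.

// FNV-1a-style 32-bit hash computed over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Pick one of the candidate responses based only on (model, prompt).
// No randomness and no clock are involved, so repeated calls agree.
function mockComplete(
  model: string,
  prompt: string,
  candidates: string[],
): string {
  if (candidates.length === 0) {
    throw new Error("candidates must not be empty");
  }
  // The NUL separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
  const index = fnv1a(`${model}\u0000${prompt}`) % candidates.length;
  return candidates[index];
}
```

Calling `mockComplete("mock:alpha", "What is the capital of France?", ["Paris", "Lyon"])` returns the same candidate on every run, so synthetic leaderboards built this way are reproducible.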