Suites
Test suites discovered from tests/**/*.eval.ts, persisted on first run, or created manually.
code-review
1 run
Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (a real, actionable issue) or incorrect (misses the bug, nitpicks something wrong, or is plain wrong).
43%
Last run · May 2, 2026, 10:46 PM
toy
1 run
Smoke test: 5 trivially-answerable cases for the exact scorer
20%
Last run · May 2, 2026, 10:46 PM
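As a rough illustration of what an exact scorer over a handful of cases might look like, here is a minimal sketch. The names EvalCase, exactScore, and passRate are hypothetical and not taken from the suite files, and it is only an assumption that the percentage shown for a suite is the share of cases that matched exactly.

```typescript
// Hypothetical shape of one eval case; the real suite files may differ.
interface EvalCase {
  expected: string;
  actual: string;
}

// Exact scorer: 1 when the output equals the expected string exactly, else 0.
// No trimming or case folding is applied, so " 4" does not match "4".
function exactScore(expected: string, actual: string): number {
  return expected === actual ? 1 : 0;
}

// Suite score as a whole-number percentage of exactly-matching cases.
// An empty suite is reported as 0 (an assumption made for this sketch).
function passRate(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  const passed = cases.reduce(
    (sum, c) => sum + exactScore(c.expected, c.actual),
    0,
  );
  return Math.round((passed * 100) / cases.length);
}
```

Under these assumptions, a 5-case suite in which exactly one output matches its expected string would score 20.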