code-review leaderboard
Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (it identifies a real, actionable issue) or incorrect (it misses the bug, nitpicks a non-issue, or is simply wrong).
Each row is the latest complete run of code-review for that model. Updated hourly.
| # | Model | Pass rate | Avg cost / case | Avg latency | Runs | Latest |
|---|---|---|---|---|---|---|
| 1 | anthropic/claude-haiku-4-5 | 71% | $0.0500 | 497ms | 1 | May 2, 2026, 10:46 PM |
| 2 | google/gemini-2.5-pro | 71% | $0.0400 | 497ms | 1 | May 2, 2026, 10:46 PM |
| 3 | anthropic/claude-opus-4-7 | 65% | $0.3100 | 708ms | 1 | May 2, 2026, 10:46 PM |
| 4 | openai/gpt-5 | 65% | $0.1400 | 708ms | 1 | May 2, 2026, 10:46 PM |
| 5 | google/gemini-1.5-flash | 43% | $0.0100 | 306ms | 1 | May 2, 2026, 10:46 PM |
| 6 | openai/gpt-4o-mini | 43% | $0.0200 | 306ms | 1 | May 2, 2026, 10:46 PM |
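
As a rough illustration of how rows like these could be derived, the sketch below keeps the latest complete run per model and ranks models by pass rate. The `Run` type, its fields, and the tie-break by model name are assumptions made for this example; the page does not document the actual data model or ordering rule.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Run:
    # Hypothetical record of one suite run; field names are illustrative only.
    model: str
    finished_at: datetime
    complete: bool          # assumed flag: every case in the suite was judged
    verdicts: list[bool]    # True where the model's judgment matched the label


def latest_complete_runs(runs: list[Run]) -> dict[str, Run]:
    """Keep only the most recent complete run for each model."""
    latest: dict[str, Run] = {}
    for run in runs:
        if not run.complete:
            continue  # incomplete runs never appear on the board
        current = latest.get(run.model)
        if current is None or run.finished_at > current.finished_at:
            latest[run.model] = run
    return latest


def pass_rate(run: Run) -> float:
    """Fraction of cases judged correctly, in [0, 1]."""
    return sum(run.verdicts) / len(run.verdicts) if run.verdicts else 0.0


def leaderboard(runs: list[Run]) -> list[tuple[str, float]]:
    """Rank by pass rate, highest first; ties fall back to model name (an assumption)."""
    rows = [(model, pass_rate(run)) for model, run in latest_complete_runs(runs).items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Calling `leaderboard(runs)` on a list of such records returns `(model, pass_rate)` pairs in display order; cost and latency columns could be averaged per run in the same way.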
Embed this leaderboard
Paste this snippet into a blog post or Markdown file that supports HTML.
<iframe src="<your-host>/embed/leaderboard/code-review" width="100%" height="500" frameborder="0" loading="lazy"></iframe>