evalbench

code-review leaderboard

Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (it identifies a real, actionable issue) or incorrect (it misses the bug, nitpicks something that is not a problem, or is simply wrong).

Each row is the latest complete run of code-review for that model. Updated hourly.
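
As a rough illustration of how a pass rate such as "71% (36/51)" can be derived, the sketch below scores a list of model verdicts against expected labels. The `Case` fields, the label strings, and the rounding to a whole percent are assumptions made for this example, not evalbench's actual schema or scoring code.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One review case. Field names are hypothetical, chosen for illustration."""
    diff: str
    comment: str
    expected: str  # assumed labels: "correct" or "incorrect"


def pass_rate(cases, verdicts):
    """Return (passed, total, percent) for verdicts aligned with cases."""
    if len(cases) != len(verdicts):
        raise ValueError("need exactly one verdict per case")
    # A case passes when the model's verdict matches the expected label.
    passed = sum(1 for case, verdict in zip(cases, verdicts)
                 if verdict == case.expected)
    total = len(cases)
    # Assumption: the percentage is rounded to the nearest whole number.
    percent = round(100 * passed / total) if total else 0
    return passed, total, percent


def format_pass_rate(passed, total):
    """Format a pass count as a string such as '71% (36/51)'."""
    percent = round(100 * passed / total) if total else 0
    return f"{percent}% ({passed}/{total})"
```

Under these assumptions, `format_pass_rate(36, 51)`, `format_pass_rate(33, 51)`, and `format_pass_rate(22, 51)` agree with the 71%, 65%, and 43% figures shown in the table.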

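| # | Model | Pass rate | Avg cost / case | Avg latency | Runs | Latest |
|---|-------|-----------|-----------------|-------------|------|--------|
| 1 | anthropic claude-haiku-4-5 | 71% (36/51) | $0.0500 | 497ms | 1 | May 2, 2026, 10:46 PM |
| 2 | google gemini-2.5-pro | 71% (36/51) | $0.0400 | 497ms | 1 | May 2, 2026, 10:46 PM |
| 3 | anthropic claude-opus-4-7 | 65% (33/51) | $0.3100 | 708ms | 1 | May 2, 2026, 10:46 PM |
| 4 | openai gpt-5 | 65% (33/51) | $0.1400 | 708ms | 1 | May 2, 2026, 10:46 PM |
| 5 | google gemini-1.5-flash | 43% (22/51) | $0.0100 | 306ms | 1 | May 2, 2026, 10:46 PM |
| 6 | openai gpt-4o-mini | 43% (22/51) | $0.0200 | 306ms | 1 | May 2, 2026, 10:46 PM |
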
Embed this leaderboard

Paste this snippet into a blog post or Markdown file that supports HTML, replacing <your-host> with the address of your host.

<iframe src="<your-host>/embed/leaderboard/code-review" width="100%" height="500" frameborder="0" loading="lazy"></iframe>
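
If the snippet needs to be generated for several suites or hosts, a small helper can fill in the placeholder. This is a minimal sketch: the function name `embed_snippet` and the example host `https://evalbench.example.com` are hypothetical, and only the `/embed/leaderboard/code-review` path and the iframe attributes are taken from the snippet above.

```python
from html import escape


def embed_snippet(host, suite="code-review", height=500):
    """Build the iframe embed markup for a leaderboard.

    `host` stands in for the <your-host> placeholder, for example
    "https://evalbench.example.com" (a made-up address).
    """
    # Drop any trailing slash so the path is not doubled up.
    src = f"{host.rstrip('/')}/embed/leaderboard/{suite}"
    # Escape the URL so it is safe inside a double-quoted HTML attribute.
    return (
        f'<iframe src="{escape(src, quote=True)}" width="100%" '
        f'height="{height}" frameborder="0" loading="lazy"></iframe>'
    )


if __name__ == "__main__":
    print(embed_snippet("https://evalbench.example.com"))
```

The suite name and height are parameters only for convenience; the defaults reproduce the attributes of the original snippet.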