evalbench
Suites/code-review

Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (it identifies a real, actionable issue) or incorrect (it misses the bug, nitpicks a non-issue, or is simply wrong).
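
A minimal sketch of how one case in this suite might be represented and graded. The names `ReviewCase` and `grade`, the field names, and the "correct"/"incorrect" label strings are assumptions for illustration; the page does not show evalbench's actual schema or grader.

```python
from dataclasses import dataclass


# Hypothetical case record; not evalbench's actual schema.
@dataclass(frozen=True)
class ReviewCase:
    diff: str       # the small diff under review
    comment: str    # the one-line review comment to judge
    expected: str   # gold label: "correct" or "incorrect"


VALID_LABELS = {"correct", "incorrect"}


def grade(case: ReviewCase, verdict: str) -> bool:
    """Return True when the model's verdict matches the gold label."""
    normalized = verdict.strip().lower()
    if normalized not in VALID_LABELS:
        # In this sketch, an unparseable answer counts as a failed case.
        return False
    return normalized == case.expected


case = ReviewCase(
    diff="-    for i in range(len(xs)):\n+    for i in range(len(xs) - 1):",
    comment="This skips the last element of xs.",
    expected="correct",
)
print(grade(case, "Correct"))    # True
print(grade(case, "incorrect"))  # False
```

Treating an unparseable verdict as a failure is one possible design choice; a real harness might instead retry the request or report such cases separately.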

Pass rate over time

Latest by model

Model                        Pass rate     Cost / case   Latency   Runs
anthropic claude-haiku-4-5   71% (36/51)   $0.0500       497ms     1
google gemini-2.5-pro        71% (36/51)   $0.0400       497ms     1
anthropic claude-opus-4-7    65% (33/51)   $0.3100       708ms     1
openai gpt-5                 65% (33/51)   $0.1400       708ms     1
google gemini-1.5-flash      43% (22/51)   $0.0100       306ms     1
openai gpt-4o-mini           43% (22/51)   $0.0200       306ms     1
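
The percentages above appear to be the passed/total counts rounded to the nearest whole percent. That rounding rule is inferred from the figures shown, not documented on the page. A short sketch, with `pass_rate` as a hypothetical helper:

```python
def pass_rate(passed: int, total: int) -> str:
    """Format passed/total as a whole-number percentage string."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer round-half-up; avoids the round-half-to-even behaviour of round().
    percent = (200 * passed + total) // (2 * total)
    return f"{percent}% ({passed}/{total})"


for passed in (36, 33, 22):
    print(pass_rate(passed, 51))
# 71% (36/51)
# 65% (33/51)
# 43% (22/51)
```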

Recent runs

#    Model                        Pass rate     Started                 Status
12   google gemini-1.5-flash      43% (22/51)   May 2, 2026, 10:46 PM   complete
11   openai gpt-4o-mini           43% (22/51)   May 2, 2026, 10:46 PM   complete
10   google gemini-2.5-pro        71% (36/51)   May 2, 2026, 10:46 PM   complete
9    anthropic claude-haiku-4-5   71% (36/51)   May 2, 2026, 10:46 PM   complete
8    openai gpt-5                 65% (33/51)   May 2, 2026, 10:46 PM   complete
7    anthropic claude-opus-4-7    65% (33/51)   May 2, 2026, 10:46 PM   complete