code-review leaderboard

Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (it identifies a real, actionable issue) or incorrect (it misses the bug, nitpicks something that is not a problem, or is simply wrong).
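The pass criterion described above can be sketched as a comparison between the model's verdict and a case's expected label. This is only an illustration: the field names (`diff`, `comment`, `expected`), the `ReviewCase` type, and the `case_passes` helper are hypothetical, since the suite's actual schema is not shown on this page.

```python
from dataclasses import dataclass
from typing import Literal

Verdict = Literal["correct", "incorrect"]


@dataclass(frozen=True)
class ReviewCase:
    """One suite case (hypothetical schema, for illustration only)."""
    diff: str          # the small diff under review
    comment: str       # the one-line review comment to judge
    expected: Verdict  # ground-truth label for the comment


def case_passes(case: ReviewCase, model_verdict: str) -> bool:
    """A case passes when the model's verdict matches the expected label."""
    return model_verdict.strip().lower() == case.expected


# Example case: the diff introduces an off-by-one, and the comment catches it.
example = ReviewCase(
    diff="-    for i in range(len(xs)):\n+    for i in range(len(xs) + 1):",
    comment="This loop now indexes one past the end of xs.",
    expected="correct",
)
```

Under these assumptions, `case_passes(example, "correct")` is true and `case_passes(example, "incorrect")` is false.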

Each row shows the latest complete run of the code-review suite for that model. The leaderboard is updated hourly.
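One way to read "latest complete run" is sketched below. The `Run` record, its fields, and the rule that a run is complete when every case has finished are all assumptions made for illustration; the page does not document how completeness is determined.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Run:
    """One suite run (hypothetical record shape)."""
    model: str
    finished_at: datetime
    cases_done: int
    cases_total: int

    @property
    def complete(self) -> bool:
        # Assumed definition: every case in the suite was evaluated.
        return self.cases_done == self.cases_total


def latest_complete_runs(runs: list[Run]) -> dict[str, Run]:
    """Return, per model, the most recently finished complete run."""
    latest: dict[str, Run] = {}
    for run in runs:
        if not run.complete:
            continue  # skip partial runs
        current = latest.get(run.model)
        if current is None or run.finished_at > current.finished_at:
            latest[run.model] = run
    return latest
```

With this rule, a newer partial run would not displace an older complete run for the same model.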

#	Provider	Model	Pass rate	Avg cost / case	Avg latency	Runs	Latest
1	anthropic	claude-haiku-4-5	71% (36/51)	$0.0500	497ms	1	May 2, 2026, 10:46 PM
2	google	gemini-2.5-pro	71% (36/51)	$0.0400	497ms	1	May 2, 2026, 10:46 PM
3	anthropic	claude-opus-4-7	65% (33/51)	$0.3100	708ms	1	May 2, 2026, 10:46 PM
4	openai	gpt-5	65% (33/51)	$0.1400	708ms	1	May 2, 2026, 10:46 PM
5	google	gemini-1.5-flash	43% (22/51)	$0.0100	306ms	1	May 2, 2026, 10:46 PM
6	openai	gpt-4o-mini	43% (22/51)	$0.0200	306ms	1	May 2, 2026, 10:46 PM
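The Pass rate column is consistent with passed cases divided by total cases, rounded to the nearest whole percent. The sketch below checks that arithmetic against the counts in the table above. The function name `pass_rate_percent` is hypothetical, and the rounding rule is inferred from the displayed values rather than documented.

```python
def pass_rate_percent(passed: int, total: int) -> int:
    """Pass rate as a whole percent (rounding rule assumed from the table)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


# (passed, total) -> percentage shown in the table above
table_counts = {(36, 51): 71, (33, 51): 65, (22, 51): 43}

for (passed, total), shown in table_counts.items():
    assert pass_rate_percent(passed, total) == shown
```

Because every row uses the same 51 cases, a one-case difference moves the pass rate by roughly two percentage points.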

Embed this leaderboard

Paste this snippet into a blog post or Markdown file that supports HTML, replacing <your-host> with the host that serves this leaderboard.

<iframe src="<your-host>/embed/leaderboard/code-review" width="100%" height="500" frameborder="0" loading="lazy"></iframe>