code-review

Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (it raises a real, actionable issue) or incorrect (it misses the bug, nitpicks the wrong thing, or is plain wrong).
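
As a rough illustration of the task described above, one case could be represented and graded as follows. This is only a sketch: the names `ReviewCase` and `is_pass`, the field names, and the "correct"/"incorrect" label strings are assumptions for illustration, not the suite's actual schema or grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewCase:
    """One hypothetical case: a small diff, a one-line comment, a gold label."""
    diff: str
    comment: str
    label: str  # assumed to be "correct" or "incorrect"


def is_pass(case: ReviewCase, verdict: str) -> bool:
    """A case passes when the model's verdict matches the gold label.

    Whitespace and letter case are ignored; anything other than the two
    expected labels counts as a fail.
    """
    normalized = verdict.strip().lower()
    if normalized not in ("correct", "incorrect"):
        return False
    return normalized == case.label


# Hypothetical example: the comment points at a real off-by-one bug.
example = ReviewCase(
    diff="-    for i in range(len(xs)):\n+    for i in range(len(xs) + 1):",
    comment="This loop now reads one element past the end of xs.",
    label="correct",
)
```

Under these assumptions, a model answering "Correct" for `example` would pass, while "incorrect" or any free-form answer would fail.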

Pass rate over time

Latest by model

Model	Pass rate (passed/total)	Cost / case	Latency	Runs
anthropic/claude-haiku-4-5	71% (36/51)	$0.0500	497ms	1
google/gemini-2.5-pro	71% (36/51)	$0.0400	497ms	1
anthropic/claude-opus-4-7	65% (33/51)	$0.3100	708ms	1
openai/gpt-5	65% (33/51)	$0.1400	708ms	1
google/gemini-1.5-flash	43% (22/51)	$0.0100	306ms	1
openai/gpt-4o-mini	43% (22/51)	$0.0200	306ms	1
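
The percentages in the table are consistent with dividing passed cases by total cases and rounding to the nearest whole percent. A minimal sketch of that arithmetic, where `pass_percent` is a hypothetical helper name and the rounding rule is inferred from the figures rather than confirmed by the suite:

```python
def pass_percent(passed: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage.

    Assumes rounding to the nearest integer, which matches the
    counts shown in the table (for example 36 of 51 cases).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100 * passed / total)


# Counts taken from the table above: passed cases out of 51.
for passed in (36, 33, 22):
    print(f"{passed}/51 -> {pass_percent(passed, 51)}%")
```

Truncating instead of rounding would give 70% and 64% for the first two rows, so rounding is the reading that fits the displayed values.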

Recent runs

#	Model	Pass rate (passed/total)	Started	Status
12	google/gemini-1.5-flash	43% (22/51)	May 2, 2026, 10:46 PM	complete
11	openai/gpt-4o-mini	43% (22/51)	May 2, 2026, 10:46 PM	complete
10	google/gemini-2.5-pro	71% (36/51)	May 2, 2026, 10:46 PM	complete
9	anthropic/claude-haiku-4-5	71% (36/51)	May 2, 2026, 10:46 PM	complete
8	openai/gpt-5	65% (33/51)	May 2, 2026, 10:46 PM	complete
7	anthropic/claude-opus-4-7	65% (33/51)	May 2, 2026, 10:46 PM	complete