code-review
Code review correctness: given a small diff and a one-line review comment, judge whether the comment is correct (a real, actionable issue) or incorrect (misses the bug, nitpicks something wrong, or is plain wrong).
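As a rough illustration of the task shape, the sketch below models one case as a diff, a one-line comment, and a gold label, and computes a pass rate as the fraction of cases where the model's verdict matches that label. The `ReviewCase` type, the `pass_rate` function, the label strings, and the two sample cases are all hypothetical; the suite's actual case format and scoring rule are not shown on this page, so treat this as one plausible reading rather than the real implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReviewCase:
    """Hypothetical shape of one case: a diff, a review comment, a gold label."""
    diff: str
    comment: str
    label: str  # assumed to be either "correct" or "incorrect"


def pass_rate(cases: Sequence[ReviewCase], verdicts: Sequence[str]) -> float:
    """Fraction of cases where the model's verdict equals the gold label."""
    if len(cases) != len(verdicts):
        raise ValueError("need exactly one verdict per case")
    if not cases:
        return 0.0
    passed = sum(1 for case, verdict in zip(cases, verdicts) if case.label == verdict)
    return passed / len(cases)


# Made-up sample cases, for illustration only.
cases = [
    ReviewCase(
        diff="-    for i in range(len(xs)):\n+    for i in range(len(xs) - 1):",
        comment="This now skips the last element of xs.",
        label="correct",
    ),
    ReviewCase(
        diff="-    total = a + b\n+    total = a - b",
        comment="Rename total to sum for readability.",
        label="incorrect",
    ),
]

model_verdicts = ["correct", "correct"]  # hypothetical model output
print(f"{pass_rate(cases, model_verdicts):.0%}")
```

Under this reading, a model passes a case only when it reproduces the gold label exactly, so a model that calls every comment "correct" scores no better than the share of correct comments in the suite.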
Pass rate over time
Latest by model
| Model | Pass rate | Cost / case | Latency | Runs |
|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | 71% | $0.0500 | 497ms | 1 |
| google/gemini-2.5-pro | 71% | $0.0400 | 497ms | 1 |
| anthropic/claude-opus-4-7 | 65% | $0.3100 | 708ms | 1 |
| openai/gpt-5 | 65% | $0.1400 | 708ms | 1 |
| google/gemini-1.5-flash | 43% | $0.0100 | 306ms | 1 |
| openai/gpt-4o-mini | 43% | $0.0200 | 306ms | 1 |
Recent runs
| # | Model | Pass rate | Started | Status |
|---|---|---|---|---|
| 12 | google/gemini-1.5-flash | 43% | May 2, 2026, 10:46 PM | complete |
| 11 | openai/gpt-4o-mini | 43% | May 2, 2026, 10:46 PM | complete |
| 10 | google/gemini-2.5-pro | 71% | May 2, 2026, 10:46 PM | complete |
| 9 | anthropic/claude-haiku-4-5 | 71% | May 2, 2026, 10:46 PM | complete |
| 8 | openai/gpt-5 | 65% | May 2, 2026, 10:46 PM | complete |
| 7 | anthropic/claude-opus-4-7 | 65% | May 2, 2026, 10:46 PM | complete |