evalbench

Runs

Most recent first. Click a status badge to drill into a run.

| # | Suite | Model | Pass rate | Cost | Latency | Branch | Started | Status |
|---|-------|-------|-----------|------|---------|--------|---------|--------|
| 12 | code-review | google gemini-1.5-flash | 43% (22/51) | $0.5100 | 306ms | | May 2, 2026, 10:46 PM | complete |
| 11 | code-review | openai gpt-4o-mini | 43% (22/51) | $1.0200 | 306ms | | May 2, 2026, 10:46 PM | complete |
| 10 | code-review | google gemini-2.5-pro | 71% (36/51) | $2.0200 | 497ms | | May 2, 2026, 10:46 PM | complete |
| 9 | code-review | anthropic claude-haiku-4-5 | 71% (36/51) | $2.5200 | 497ms | | May 2, 2026, 10:46 PM | complete |
| 8 | code-review | openai gpt-5 | 65% (33/51) | $7.0200 | 708ms | | May 2, 2026, 10:46 PM | complete |
| 7 | code-review | anthropic claude-opus-4-7 | 65% (33/51) | $15.9900 | 708ms | | May 2, 2026, 10:46 PM | complete |
| 6 | toy | google gemini-1.5-flash | 20% (1/5) | $0.0500 | 401ms | | May 2, 2026, 10:46 PM | complete |
| 5 | toy | openai gpt-4o-mini | 20% (1/5) | $0.1000 | 401ms | | May 2, 2026, 10:46 PM | complete |
| 4 | toy | google gemini-2.5-pro | 20% (1/5) | $0.2200 | 584ms | | May 2, 2026, 10:46 PM | complete |
| 3 | toy | anthropic claude-haiku-4-5 | 20% (1/5) | $0.2500 | 584ms | | May 2, 2026, 10:46 PM | complete |
| 2 | toy | openai gpt-5 | 80% (4/5) | $0.7400 | 837ms | | May 2, 2026, 10:46 PM | complete |
| 1 | toy | anthropic claude-opus-4-7 | 80% (4/5) | $1.4500 | 837ms | | May 2, 2026, 10:46 PM | complete |