evalbench suite: toy

Smoke test: 5 trivially answerable cases for the exact scorer

Pass rate over time

Latest by model

Model                         Pass rate    Cost / case   Latency   Runs
anthropic claude-opus-4-7     80% (4/5)    $0.2900       837 ms    1
openai gpt-5                  80% (4/5)    $0.1500       837 ms    1
anthropic claude-haiku-4-5    20% (1/5)    $0.0500       584 ms    1
google gemini-1.5-flash       20% (1/5)    $0.0100       401 ms    1
google gemini-2.5-pro         20% (1/5)    $0.0400       584 ms    1
openai gpt-4o-mini            20% (1/5)    $0.0200       401 ms    1

Recent runs

#   Model                         Pass rate    Started                  Status
6   google gemini-1.5-flash       20% (1/5)    May 2, 2026, 10:46 PM    complete
5   openai gpt-4o-mini            20% (1/5)    May 2, 2026, 10:46 PM    complete
4   google gemini-2.5-pro         20% (1/5)    May 2, 2026, 10:46 PM    complete
3   anthropic claude-haiku-4-5    20% (1/5)    May 2, 2026, 10:46 PM    complete
2   openai gpt-5                  80% (4/5)    May 2, 2026, 10:46 PM    complete
1   anthropic claude-opus-4-7     80% (4/5)    May 2, 2026, 10:46 PM    complete