evalbench

toy leaderboard

Smoke test: 5 trivially answerable cases for the exact scorer

Each row is the latest complete run of toy for that model. Updated hourly.
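As a rough illustration of the "latest complete run per model" rule and of how a pass rate such as 4/5 becomes 80%, here is a minimal Python sketch. The `Run` record and its field names (`model`, `finished_at`, `passed`, `total`, `complete`) are hypothetical and are not taken from evalbench itself; the way the site actually stores and filters runs may differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Run:
    # Hypothetical run record; all field names are assumptions for illustration.
    model: str
    finished_at: datetime
    passed: int
    total: int
    complete: bool


def latest_complete_runs(runs):
    """Return {model: Run}, keeping only the most recent complete run per model."""
    latest = {}
    for run in runs:
        if not run.complete:
            continue  # assumed behavior: incomplete runs are ignored
        current = latest.get(run.model)
        if current is None or run.finished_at > current.finished_at:
            latest[run.model] = run
    return latest


def pass_rate(run):
    """Fraction of cases passed, e.g. 4 of 5 -> 0.8; 0.0 for an empty run."""
    return run.passed / run.total if run.total else 0.0
```

With this sketch, a model that has an older complete run, a newer complete run, and an even newer incomplete run would be represented by the newer complete run only.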

| # | Model | Pass rate | Avg cost / case | Avg latency | Runs | Latest |
|---|---|---|---|---|---|---|
| 1 | anthropic / claude-opus-4-7 | 80% (4/5) | $0.2900 | 837ms | 1 | May 2, 2026, 10:46 PM |
| 2 | openai / gpt-5 | 80% (4/5) | $0.1500 | 837ms | 1 | May 2, 2026, 10:46 PM |
| 3 | anthropic / claude-haiku-4-5 | 20% (1/5) | $0.0500 | 584ms | 1 | May 2, 2026, 10:46 PM |
| 4 | google / gemini-1.5-flash | 20% (1/5) | $0.0100 | 401ms | 1 | May 2, 2026, 10:46 PM |
| 5 | google / gemini-2.5-pro | 20% (1/5) | $0.0400 | 584ms | 1 | May 2, 2026, 10:46 PM |
| 6 | openai / gpt-4o-mini | 20% (1/5) | $0.0200 | 401ms | 1 | May 2, 2026, 10:46 PM |

Embed this leaderboard

Paste this snippet into a blog post or Markdown file that supports HTML, replacing <your-host> with the base URL of your deployment.

<iframe src="<your-host>/embed/leaderboard/toy" width="100%" height="500" frameborder="0" loading="lazy"></iframe>
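
If you generate the embed snippet programmatically, something like the following Python sketch could fill in the host and suite name. The function name `leaderboard_embed` and the example host are hypothetical; only the `/embed/leaderboard/<suite>` path and the iframe attributes come from the snippet above.

```python
from html import escape
from urllib.parse import quote


def leaderboard_embed(host, suite="toy", height=500):
    """Build the iframe embed snippet for a suite's leaderboard.

    `host` is the base URL of your deployment (an assumption: scheme included,
    e.g. "https://evals.example.com").
    """
    # Percent-encode the suite name so it is safe as a single path segment.
    src = f"{host.rstrip('/')}/embed/leaderboard/{quote(suite, safe='')}"
    # Escape the URL for use inside a double-quoted HTML attribute.
    return (
        f'<iframe src="{escape(src, quote=True)}" width="100%" '
        f'height="{int(height)}" frameborder="0" loading="lazy"></iframe>'
    )
```

For example, `leaderboard_embed("https://evals.example.com")` yields the snippet above with `<your-host>` replaced by that placeholder host.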