toy leaderboard
Smoke test: 5 trivially answerable cases for the exact scorer.
Each row is the latest complete run of toy for that model. Updated hourly.
| # | Model | Pass rate | Avg cost / case | Avg latency | Runs | Latest |
|---|---|---|---|---|---|---|
| 1 | anthropic/claude-opus-4-7 | 80% | $0.2900 | 837ms | 1 | May 2, 2026, 10:46 PM |
| 2 | openai/gpt-5 | 80% | $0.1500 | 837ms | 1 | May 2, 2026, 10:46 PM |
| 3 | anthropic/claude-haiku-4-5 | 20% | $0.0500 | 584ms | 1 | May 2, 2026, 10:46 PM |
| 4 | google/gemini-1.5-flash | 20% | $0.0100 | 401ms | 1 | May 2, 2026, 10:46 PM |
| 5 | google/gemini-2.5-pro | 20% | $0.0400 | 584ms | 1 | May 2, 2026, 10:46 PM |
| 6 | openai/gpt-4o-mini | 20% | $0.0200 | 401ms | 1 | May 2, 2026, 10:46 PM |
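With only 5 cases in the suite, each case is worth 20 percentage points, so 80% means 4 of 5 passed and 20% means 1 of 5. The sketch below shows how a pass rate like this could be derived with an exact-match scorer. The function names, the whitespace trimming, and the sample cases are assumptions for illustration; the suite's real scorer and cases are not shown on this page.

```python
def exact_score(expected: str, actual: str) -> bool:
    """Pass only when the output equals the expected answer.

    Trimming surrounding whitespace is an assumption; the real
    scorer may compare the raw strings.
    """
    return expected.strip() == actual.strip()


def pass_rate(cases: list[tuple[str, str]]) -> float:
    """Fraction of (expected, actual) pairs the exact scorer accepts."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(exact_score(expected, actual) for expected, actual in cases)
    return passed / len(cases)


# Hypothetical five-case run: four exact matches and one near miss.
run = [
    ("4", "4"),
    ("Paris", "Paris"),
    ("blue", "blue"),
    ("7", "7"),
    ("yes", "Yes."),  # differs in case and punctuation, so it fails
]
print(f"{pass_rate(run):.0%}")  # prints 80%
```

An exact scorer gives no partial credit, which is why a near miss such as "Yes." counts as a full failure and why scores on a 5-case suite move in 20-point steps.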
Embed this leaderboard
Paste this snippet into a blog post or Markdown file that supports HTML.
<iframe src="<your-host>/embed/leaderboard/toy" width="100%" height="500" frameborder="0" loading="lazy"></iframe>
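The `<your-host>` placeholder has to be replaced with the base URL of your own deployment before the snippet will load. As a minimal sketch, the helper below fills it in and reproduces the same attributes as the snippet above. The function name `embed_snippet`, the example host `https://evals.example.com`, and the idea that other suites follow the same `/embed/leaderboard/<suite>` path are assumptions, not something this page confirms.

```python
from html import escape
from urllib.parse import quote


def embed_snippet(host: str, suite: str = "toy", height: int = 500) -> str:
    """Build the iframe embed for a suite leaderboard.

    `host` is the base URL of the deployment (hypothetical here).
    The path layout is copied from the snippet shown on this page.
    """
    # Drop a trailing slash so the URL does not contain "//embed".
    src = f"{host.rstrip('/')}/embed/leaderboard/{quote(suite, safe='')}"
    return (
        f'<iframe src="{escape(src, quote=True)}" width="100%" '
        f'height="{height}" frameborder="0" loading="lazy"></iframe>'
    )


print(embed_snippet("https://evals.example.com"))
```

Escaping the URL before placing it in the `src` attribute keeps the generated HTML well-formed even if the host or suite name contains characters such as `&` or `"`.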