evalbench
Runs / #4

toy

google · gemini-2.5-pro · complete
Pass rate: 1/5 (20%)
Cost: $0.2200
Avg latency: 584ms
Started: May 2, 2026, 10:46 PM
Triggered: api-seed
Prompt template
Answer with a single lowercase word and nothing else. No punctuation, no quotes, no explanation.

Question: __SAMPLE__
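
The template above contains a `__SAMPLE__` placeholder that is filled in once per test input. A minimal sketch of that substitution follows, assuming plain string replacement; `render_prompt` is a hypothetical helper named for illustration, and how evalbench performs the substitution internally is not shown on this page.

```python
# Template text copied from the run page; __SAMPLE__ marks where the input goes.
PROMPT_TEMPLATE = (
    "Answer with a single lowercase word and nothing else. "
    "No punctuation, no quotes, no explanation.\n"
    "\n"
    "Question: __SAMPLE__"
)


def render_prompt(template: str, sample: str) -> str:
    """Hypothetical helper: replace the __SAMPLE__ placeholder with one input."""
    if "__SAMPLE__" not in template:
        raise ValueError("template has no __SAMPLE__ placeholder")
    return template.replace("__SAMPLE__", sample)


prompt = render_prompt(PROMPT_TEMPLATE, "Capital of France?")
print(prompt)
```

The last line of the rendered prompt reads `Question: Capital of France?`.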

Results

Pass | Src | Input | Expected | Output | Score | Cost | Latency
fail | A | What is the opposite of hot? | cold | cold. | 0% (exact: expected "cold", got "cold.") | $0.0400 | 204ms
pass | A | What color is the sky on a clear day? | blue | blue | 100% | $0.0500 | 534ms
fail | A | What animal says "meow"? | cat | dog | 0% (exact: expected "cat", got "dog") | $0.0400 | 689ms
fail | A | How many legs does a spider have? Answer as a written-out number. | eight | six | 0% (exact: expected "eight", got "six") | $0.0400 | 694ms
fail | A | Capital of France? | paris | london | 0% (exact: expected "paris", got "london") | $0.0500 | 800ms
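
The `exact:` messages suggest that each output is scored by strict string equality against the expected value, which is why "cold." fails against "cold". The sketch below reproduces the per-row verdicts and the summary figures from the five rows on this page. The `Result` class and `exact_match` function are hypothetical names chosen for illustration, not evalbench's own API.

```python
from dataclasses import dataclass


@dataclass
class Result:
    expected: str
    output: str
    cost_usd: float
    latency_ms: int


def exact_match(expected: str, output: str) -> bool:
    """Strict comparison: any extra character (e.g. a trailing period) fails."""
    return expected == output


# The five rows from this run.
RESULTS = [
    Result("cold", "cold.", 0.04, 204),
    Result("blue", "blue", 0.05, 534),
    Result("cat", "dog", 0.04, 689),
    Result("eight", "six", 0.04, 694),
    Result("paris", "london", 0.05, 800),
]

passed = sum(exact_match(r.expected, r.output) for r in RESULTS)
total_cost = sum(r.cost_usd for r in RESULTS)
avg_latency = sum(r.latency_ms for r in RESULTS) / len(RESULTS)

print(f"Pass rate: {passed}/{len(RESULTS)} ({passed / len(RESULTS):.0%})")  # Pass rate: 1/5 (20%)
print(f"Cost: ${total_cost:.4f}")                                            # Cost: $0.2200
print(f"Avg latency: {avg_latency:.0f}ms")                                   # Avg latency: 584ms
```

These totals agree with the header of the run: 1 of 5 passed, $0.2200 total cost, and a mean latency of 584ms (2921ms over 5 samples).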