How WebBench tests small language models, with inference running entirely in your browser on your own device. The goal isn't to make the model look good. It's to show you what it actually does.
One number: multiple-choice accuracy on MMLU-style questions. Fraction of questions answered correctly, reported with a 95% Wilson confidence interval so you can see how noisy a small run really is.
No calibration score. No behavior score. No archetype classification. Earlier versions of WebBench had all of those — they were elaborate plumbing on top of numbers that didn't help readers understand what they were looking at. They dressed up weak models. Accuracy and the raw output are what matter.
Every word the model wrote, unedited. No grammar constraints, no token-by-token steering, no template forcing. We send the question, the model streams text back, we display it.
Small browser-runnable models (0.5B–8B parameters) often hallucinate, loop, hedge, and copy boilerplate from the few-shot examples. That's genuinely interesting and the whole point of running these locally. The site is built so you can see it — every report shows the full raw output of every question.
Each question is preceded by five solved examples from the same MMLU subject — the canonical 5-shot evaluation protocol. The model is told:
Temperature is 0.1 so the same question on the same model gives roughly the same answer between runs. Max 400 tokens per question — enough room to reason without letting runaway loops eat the budget.
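As a rough illustration of the 5-shot setup and sampling settings described above, the sketch below assembles a prompt from five solved examples plus the question under test. The layout (lettered choices, an `ANSWER: X` line closing each solved example) and the names `Question`, `format_question`, and `build_prompt` are assumptions made for this example; they are not WebBench's actual prompt template, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import List

# Sampling settings as stated in the text.
TEMPERATURE = 0.1
MAX_TOKENS = 400
NUM_SHOTS = 5


@dataclass
class Question:
    text: str
    choices: List[str]   # four options, in A-D order
    answer: str = ""     # gold letter; left empty for the question under test


def format_question(q: Question, include_answer: bool) -> str:
    """Render one question; solved examples also carry their gold letter."""
    lines = [q.text]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", q.choices)]
    if include_answer:
        # Hypothetical layout: each solved example ends with "ANSWER: X".
        lines.append(f"ANSWER: {q.answer}")
    return "\n".join(lines)


def build_prompt(shots: List[Question], target: Question) -> str:
    """Concatenate five solved examples followed by the unsolved target."""
    if len(shots) != NUM_SHOTS:
        raise ValueError(f"expected {NUM_SHOTS} solved examples, got {len(shots)}")
    blocks = [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)
```

The target question is left open-ended so the model has room to reason before committing to a letter, which is what the 400-token budget is for.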
The model is asked to end with ANSWER: X, but small models often don't — they trail off, hedge, paraphrase, or just write “the answer is C” in prose. Rather than penalize bad format-following (which would be measuring instruction-following, not knowledge), the extractor walks five tiers in order; the first hit wins:
Each question in the report shows which tier produced its answer, so the parsing is fully transparent. A run where most answers come from tier 4 is meaningfully different from one where they all come from tier 1.
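The "first hit wins" tier walk can be sketched as follows. Only two tiers are shown, built from the two patterns the text itself mentions (an explicit `ANSWER: X` line and a prose "the answer is C"). They are illustrative assumptions, not WebBench's actual five tiers, and the regular expressions and the name `extract_answer` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative tiers only, ordered from strictest to loosest.
# These are NOT WebBench's real tier definitions.
TIERS = [
    ("explicit ANSWER: X",
     re.compile(r"ANSWER:\s*\(?([A-D])\)?(?![A-Za-z])")),
    ("prose 'the answer is X'",
     re.compile(r"(?i:the answer is)\s*\(?([A-D])\)?(?![A-Za-z])")),
]


def extract_answer(output: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (letter, tier_number), or (None, None) if no tier matches."""
    for tier_number, (_label, pattern) in enumerate(TIERS, start=1):
        matches = pattern.findall(output)
        if matches:
            # Use the last match in case the model restates its answer.
            return matches[-1], tier_number
    return None, None
```

Returning the tier number alongside the letter is what lets a report show which rule produced each answer.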
Accuracy is the fraction of questions answered correctly. Because run sizes are small (3–10 questions), raw accuracy has high variance. WebBench reports a 95% Wilson score confidence interval, which is better-behaved at extreme values (0% or 100% correct) than the naive normal-approximation interval, which collapses to zero width there.
A run of 3 questions can produce an interval more than 70 percentage points wide: 1 of 3 correct gives roughly 6% to 79%. Ten questions narrows it, but 5 of 10 correct still spans roughly 24% to 76%. Take a single small run as a signal, not a measurement.
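For readers who want to check the interval widths themselves, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the default `z = 1.96` (the usual 95% value) are choices made for this example; WebBench's own implementation may differ in detail.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for `correct` successes out of `total` trials."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    # The interval is centered on a value pulled toward 0.5, not on p itself.
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for correct, total in [(0, 3), (1, 3), (5, 10), (10, 10)]:
        low, high = wilson_interval(correct, total)
        print(f"{correct}/{total}: {low:.1%} to {high:.1%}")
```

Unlike the naive interval, a run of 0 of 3 correct still gets a non-trivial upper bound (roughly 56%), which is the behavior at the extremes that motivates using Wilson here.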
Inference runs entirely in your browser using WebLLM and WebGPU. The model weights are downloaded directly to your device and cached by the browser — the model itself never runs on a server, and all computation happens locally on your GPU.
When a run finishes, its results are anonymously shared to the public results page so the community leaderboard reflects real runs. That includes the score, the raw model output for each question, and basic hardware info (GPU, browser, OS). There's no account, no login, and no personal identifier attached — a run can't be traced back to you. A copy is also kept in your browser's localStorage for your own history.
Questions are sourced from the MMLU benchmark (Hendrycks et al., 2021). In-browser inference runs via WebLLM (MLC AI). Published baselines come from the official papers and are not directly comparable because the question sets differ.