methodology

How WebBench tests small language models — entirely in your browser, with nothing leaving your device. The goal isn't to make the model look good. It's to show you what it actually does.

what we measure

One number: multiple-choice accuracy on MMLU-style questions. Fraction of questions answered correctly, reported with a 95% Wilson confidence interval so you can see how noisy a small run really is.

No calibration score. No behavior score. No archetype classification. Earlier versions of WebBench had all of those — they were elaborate plumbing on top of numbers that didn't help readers understand what they were looking at. They dressed up weak models. Accuracy and the raw output are what matter.

what we show

Every word the model wrote, unedited. No grammar constraints, no token-by-token steering, no template forcing. We send the question, the model streams text back, we display it.

Small browser-runnable models (0.5B–8B parameters) often hallucinate, loop, hedge, and copy boilerplate from the few-shot examples. That's genuinely interesting and the whole point of running these locally. The site is built so you can see it — every report shows the full raw output of every question.

how we ask

Each question is preceded by five solved examples from the same MMLU subject — the canonical 5-shot evaluation protocol. The model is told:

You are taking a multiple-choice exam. Each question has four options labeled A, B, C, and D. Exactly one is correct. Reason briefly, then end your response with a line in the form ANSWER: X where X is the chosen letter.

Temperature is 0.1 so the same question on the same model gives roughly the same answer between runs. Max 400 tokens per question — enough room to reason without letting runaway loops eat the budget.

subjectscs · engineering · math · science

difficultyeasy · medium · hard (balanced)

format4-choice multiple choice

protocol5-shot in-context examples per question

temperature0.1

max tokens400 per question

how we parse the answer

The model is asked to end with ANSWER: X, but small models often don't — they trail off, hedge, paraphrase, or just write “the answer is C” in prose. Rather than penalize bad format-following (which would be measuring instruction-following, not knowledge), the extractor walks five tiers in order; the first hit wins:

1. explicit formatmatches ANSWER: X, **X**, \boxed{X}

2. phrased commitment"the answer is X", "I choose X", "therefore X"

3. quoted the choicemodel restated one choice's text distinctly more than the others

4. weighted letter votelate mentions of A/B/C/D weighted heavier; needs a clear margin

5. no answer foundcounts as incorrect — honest about model failure

Each question in the report shows which tier produced its answer, so the parsing is fully transparent. A run where most answers come from tier 4 is meaningfully different from one where they all come from tier 1.

accuracy and confidence intervals

Accuracy is the fraction of questions answered correctly. Because run sizes are small (3–10 questions), raw accuracy has high variance. WebBench reports a 95% Wilson score confidence interval, which is better-behaved at extreme values (0% or 100% correct) than the naive proportion.

A run of 3 questions can produce a CI as wide as ±50%. Ten questions narrows it considerably. Take a single small run as a signal, not a measurement.

privacy and execution

Inference runs entirely in your browser using WebLLM and WebGPU. The model weights are downloaded directly to your device and cached by the browser — the model itself never runs on a server, and all computation happens locally on your GPU.

When a run finishes, its results are anonymously shared to the public results page so the community leaderboard reflects real runs. That includes the score, the raw model output for each question, and basic hardware info (GPU, browser, OS). There's no account, no login, and no personal identifier attached — a run can't be traced back to you. A copy is also kept in your browser's localStorage for your own history.

Questions sourced from the MMLU benchmark (Hendrycks et al., 2021). In-browser inference via WebLLM (MLC AI). Published baselines from official papers — not directly comparable due to question set differences.