WifeBench lets the author's wife rate LLMs
WifeBench is an informal LLM evaluation dashboard where the author's wife asks each new model 10 questions and scores answers on a 1-100 scale, according to the site. For practitioners, the useful signal is not leaderboard rigor; it is a reminder that small, user-specific evaluations can reveal preferences that generic benchmarks miss. The site should be treated as a lightweight UX and model-selection prompt, not evidence that one model is broadly better than another. Its value is in making subjective human evaluation visible and repeatable enough for teams to discuss, while the methodology remains intentionally narrow, unblinded, and personal.
Personal evals are useful when they expose a real user's judgment, but they become risky when readers treat them like universal model rankings. WifeBench is best read as a visible example of qualitative LLM evaluation: narrow, subjective, and useful for UX thinking, not a replacement for controlled benchmark suites.
What happened
WifeBench presents a public dashboard where the author's wife rates LLM answers. The site's methodology says each new model is asked 10 private questions and scored on a 1-100 scale based on how close its answers are to hers. The page is intentionally informal and does not claim a committee, rubric, or peer-reviewed process.
Technical context
The useful distinction is between benchmark theater and task-grounded evaluation. A personal evaluator can catch tone, usefulness, and preference failures that broad leaderboards often miss, but the result is not statistically general. It is a qualitative signal from one evaluator, one question set, and one scoring style.
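To make the "not statistically general" point concrete, here is a minimal sketch of how wide the uncertainty on a 10-question average can be. The scores are made up for illustration and are not taken from WifeBench. The function name `mean_with_interval` and the choice of a Student-t interval are assumptions of this sketch, not anything the site describes.

```python
import math
import statistics

# Hypothetical 1-100 scores for two models on the same 10 questions.
# These numbers are invented for illustration only.
scores_a = [72, 85, 60, 90, 78, 66, 88, 70, 81, 75]
scores_b = [68, 80, 74, 83, 71, 77, 65, 86, 79, 73]

# Two-sided 95% Student-t critical value for 9 degrees of freedom (n = 10).
T_CRIT_DF9 = 2.262


def mean_with_interval(scores):
    """Return (mean, low, high): the mean and an approximate 95% interval.

    Assumes exactly 10 scores, because the critical value above is
    specific to 9 degrees of freedom.
    """
    if len(scores) != 10:
        raise ValueError("this sketch only supports exactly 10 scores")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_DF9 * sem
    return mean, mean - half_width, mean + half_width


mean_a, low_a, high_a = mean_with_interval(scores_a)
mean_b, low_b, high_b = mean_with_interval(scores_b)

# The intervals overlap when each one starts before the other ends.
overlap = low_a <= high_b and low_b <= high_a
print(f"A: {mean_a:.1f} ({low_a:.1f} to {high_a:.1f})")
print(f"B: {mean_b:.1f} ({low_b:.1f} to {high_b:.1f})")
print("intervals overlap:", overlap)
```

With this invented data the two intervals overlap, so a gap of about one point between the averages would not justify calling one model better. The interval also only reflects question-to-question variation. It says nothing about how a different evaluator or a different question set would score the same models.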
For practitioners
Teams can borrow the pattern safely by defining their own recurring prompts, preserving outputs, recording who judged them, and separating subjective preference from objective correctness. That makes the evaluation useful for product taste and UX fit without overstating it as a model capability benchmark.
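One way a team might capture those four practices is a small judgment log. The sketch below is hypothetical: the `Judgment` fields, the `record_judgment` helper, and the JSON Lines output are illustrative choices, not WifeBench's format. It keeps the subjective 1-100 preference score and the objective correctness check in separate fields so they are never blended into one number.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Judgment:
    prompt_id: str               # stable ID so the same prompt can be rerun
    model_version: str           # exact model identifier that was evaluated
    output: str                  # preserved model answer
    judge: str                   # who scored it
    preference_score: int        # subjective rating on a 1-100 scale
    is_correct: Optional[bool]   # objective check; None if no ground truth
    judged_at: str               # ISO 8601 timestamp in UTC


def record_judgment(prompt_id: str, model_version: str, output: str,
                    judge: str, preference_score: int,
                    is_correct: Optional[bool] = None) -> Judgment:
    """Validate the score range and stamp the judgment with the current time."""
    if not 1 <= preference_score <= 100:
        raise ValueError("preference_score must be between 1 and 100")
    return Judgment(
        prompt_id=prompt_id,
        model_version=model_version,
        output=output,
        judge=judge,
        preference_score=preference_score,
        is_correct=is_correct,
        judged_at=datetime.now(timezone.utc).isoformat(),
    )


def to_jsonl(judgments: List[Judgment]) -> str:
    """Serialize judgments as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(j)) for j in judgments)


# Example with placeholder values.
log = [
    record_judgment("q01", "example-model-v1", "Paris.", "judge-1", 85, True),
    record_judgment("q02", "example-model-v1", "It depends.", "judge-1", 40),
]
print(to_jsonl(log))
```

Appending each run to a file in this shape gives the team preserved outputs, named judges, and timestamps, and lets later analysis report preference and correctness separately.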
What to watch
The site would become more useful if it published prompt categories, example scoring decisions, model versions, and timestamps for each run. Until then, treat it as a lightweight reminder to build your own user-facing evals, not as a canonical ranking of frontier models.
Key Points
- WifeBench scores LLM answers to ten private questions on a 1-100 scale, according to the site's methodology.
- The page is useful as a subjective UX signal, not as a controlled or general-purpose benchmark.
- Teams evaluating models should pair similar human ratings with reproducible prompts, rubrics, and task-specific failure analysis.
Scoring Rationale
This is a minor but relevant evaluation artifact: it demonstrates a simple human preference workflow, not a robust benchmark. A score near 4.0 keeps it below substantial product or research news while recognizing its usefulness for practitioners thinking about lightweight model evaluation.
