What happened
The GitHub repository Andyyyy64/whichllm publishes a command-line tool that, according to its README, auto-detects a machine's GPU, CPU, and RAM and returns ranked local LLM recommendations from HuggingFace. The README's example output shows the command whichllm --gpu "RTX 4090" returning a ranked list with Qwen/Qwen3.6-27B as the top pick, at a reported score of 92.8 and a throughput of 27 t/s. According to the README, rankings merge live benchmark data from LiveBench, Artificial Analysis, Aider, multimodal/vision benchmarks, Chatbot Arena ELO, and the Open LLM Leaderboard, and each score is labelled by evidence type (direct, variant, base, interpolated, or self-reported) and discounted by confidence. The README also highlights a recency-aware adjustment that demotes older leaderboard snapshots when they are compared against newer-generation models.
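The confidence and recency discounting described above can be illustrated with a minimal Python sketch. It is not whichllm's actual implementation: the label names come from the README, but the numeric multipliers, the half-life decay rule, the plain averaging step, and every identifier (CONFIDENCE, Evidence, recency_factor, discounted, merged_score) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Illustrative confidence multipliers per evidence label. The label names come
# from the README; the numeric values are assumptions made for this sketch.
CONFIDENCE = {
    "direct": 1.0,
    "variant": 0.9,
    "base": 0.8,
    "interpolated": 0.6,
    "self-reported": 0.5,
}


@dataclass
class Evidence:
    source: str        # benchmark name, e.g. "LiveBench"
    score: float       # benchmark score normalized to 0-100
    label: str         # one of the CONFIDENCE keys
    age_months: float  # age of the leaderboard snapshot in months


def recency_factor(age_months: float, half_life_months: float = 12.0) -> float:
    """Halve a snapshot's weight every half_life_months (an assumed rule)."""
    return 0.5 ** (age_months / half_life_months)


def discounted(e: Evidence) -> float:
    """Scale one score by its label confidence and by its recency factor."""
    return e.score * CONFIDENCE[e.label] * recency_factor(e.age_months)


def merged_score(evidence: list[Evidence]) -> float:
    """Average the discounted scores; an empty evidence list scores 0."""
    if not evidence:
        return 0.0
    return sum(discounted(e) for e in evidence) / len(evidence)


if __name__ == "__main__":
    fresh_direct = Evidence("LiveBench", 80.0, "direct", 0.0)
    stale_claim = Evidence("model card", 80.0, "self-reported", 12.0)
    print(discounted(fresh_direct))
    print(discounted(stale_claim))
    print(merged_score([fresh_direct, stale_claim]))
```

Under these assumed weights, a fresh direct measurement keeps its full score, while a year-old self-reported claim with the same raw number contributes only a quarter of it. That is the qualitative behaviour the README describes, whatever the project's real coefficients are.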
Editorial analysis - technical context
Tools that combine hardware capacity checks with multi-source benchmarking address a common practitioner problem: a model that "fits" in VRAM may not provide the best latency-quality tradeoff. Aggregating several benchmarks accurately requires careful handling of model lineage, dataset overlap, and evaluator drift; the score tagging and recency discounting documented in whichllm's README are commonly used methods for mitigating those issues. The README also notes that throughput is measured on "active" parameters while quality metrics use total parameters, which matters for model classes such as mixture-of-experts (MoE), where active and total parameter counts diverge.
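To make the active-versus-total distinction concrete, here is a rough Python sketch using a standard back-of-envelope approximation: all weights (total parameters) must fit in memory, while decode speed is bounded by how fast the active weights can be read for each token. This is not whichllm's documented formula. The function names, the 2 GiB overhead allowance, and the example parameter counts and bandwidth figure are all hypothetical.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB at a given quantization."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def fits(total_params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """Fit check uses TOTAL parameters: every expert must be resident.

    overhead_gib is an assumed allowance for KV cache and runtime buffers.
    """
    return weights_gib(total_params_billion, bits_per_weight) + overhead_gib <= vram_gib


def est_tokens_per_s(active_params_billion: float, bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    """Bandwidth-bound upper estimate: each token reads the ACTIVE weights once."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical 4-bit models on a 24 GiB card with 1000 GB/s of bandwidth.
    dense = est_tokens_per_s(30, 4, 1000)  # dense model: 30B params, all active
    moe = est_tokens_per_s(3, 4, 1000)     # MoE: 30B total, 3B active per token
    print(fits(30, 4, 24), round(dense, 1), round(moe, 1))
```

Both hypothetical models occupy the same memory, yet the bandwidth-bound estimate for the MoE is ten times higher. A ranking based only on whether a model fits would miss that difference. Real throughput will be lower than this ceiling because of compute, KV-cache reads, and kernel overheads.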
Industry context
For ML engineers experimenting with local inference, the practical value is twofold: faster iteration when choosing a model that balances latency and quality for a given device, and less time wasted testing large models that nominally fit but underperform. Observed patterns in similar tools suggest that adoption depends on keeping benchmark feeds current and on being transparent about confidence and metric provenance.
What to watch
Track additions of benchmark sources or integration with continuous feeds from HuggingFace, changes to the confidence-discounting rules, and community reports of real-world throughput on diverse hardware. Also watch how the project handles forks and model variants in its lineage logic, since model forks with self-reported claims can distort aggregated scores.
Key Points
- whichllm auto-detects GPU/CPU/RAM and ranks HuggingFace models using merged live benchmark data, saving manual trial-and-error.
- Recency-aware scoring and evidence-tagging aim to reduce stale leaderboard effects and inflated self-reported claims.
- For practitioners, device-aware ranking plus throughput metrics reduce wasted cycles testing models that merely "fit" but underperform.
Scoring Rationale
A practical developer tool that streamlines local model selection and benchmarking is useful for ML engineers and hobbyists. It is not a frontier-research release, but its evidence-aggregation and recency handling make it materially valuable for practitioners.
