whichllm finds the best local LLMs for your hardware
The GitHub repository Andyyyy64/whichllm provides a CLI that auto-detects GPU/CPU/RAM and ranks HuggingFace models that fit the user's system, according to the project README. A single command such as whichllm --gpu "RTX 4090" simulates results for that hardware and prints ranked picks with scores and throughput (examples in the README show Qwen/Qwen3.6-27B ranked first on an RTX 4090 with a reported score of 92.8 and 27 t/s, per the repository). The README states the ranking merges live benchmark sources including LiveBench, Artificial Analysis, Aider, multimodal/vision benchmarks, Chatbot Arena ELO, and the Open LLM Leaderboard, and that scores are tagged by provenance and discounted by confidence. The project emphasizes recency-aware ranking, so older models are demoted relative to newer-generation models, per the README. Editorial analysis: this is a practical utility for practitioners who need evidence-based, hardware-aware model selection rather than size-only heuristics.
What happened
The GitHub repository Andyyyy64/whichllm publishes a command-line tool that auto-detects a machine's GPU/CPU/RAM and returns ranked local LLM recommendations from HuggingFace, according to the repository README. The README includes example output: running whichllm --gpu "RTX 4090" returns a ranked list in which Qwen/Qwen3.6-27B appears as the top pick with a reported score of 92.8 and a throughput of 27 t/s. The README documents that rankings merge live benchmark data from LiveBench, Artificial Analysis, Aider, multimodal/vision benchmarks, Chatbot Arena ELO, and the Open LLM Leaderboard, and that each score is labeled (direct, variant, base, interpolated, self-reported) and discounted by confidence. The README also highlights a recency-aware adjustment so older leaderboard snapshots are demoted when compared to newer-generation models.
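To make the hardware-aware selection idea concrete, here is a minimal Python sketch of the general pattern: estimate whether a model fits a VRAM budget from its parameter count and quantization bit-width, then rank the surviving candidates by score. This is an illustration only; the model names, the bytes-per-parameter heuristic, and the overhead factor are assumptions, not whichllm's actual detection or fit logic.

```python
# Illustrative sketch only: whichllm's actual hardware detection and fit
# logic is not described in this article, so the heuristics below are assumed.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str               # HuggingFace model id (hypothetical examples below)
    total_params_b: float   # total parameters, in billions
    score: float            # aggregated benchmark score (0-100)


def fits_vram(params_b: float, vram_gb: float, bits: int = 4, overhead: float = 1.2) -> bool:
    """Rough memory check: bytes-per-parameter from the quantization bit-width,
    plus an assumed overhead factor for KV cache and activations."""
    est_gb = params_b * (bits / 8) * overhead
    return est_gb <= vram_gb


def rank_for_gpu(candidates: list[Candidate], vram_gb: float) -> list[Candidate]:
    """Keep only models that fit the VRAM budget, then sort by score."""
    fitting = [c for c in candidates if fits_vram(c.total_params_b, vram_gb)]
    return sorted(fitting, key=lambda c: c.score, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("example/large-70b", 70.0, 95.0),
        Candidate("example/mid-27b", 27.0, 92.8),
        Candidate("example/small-8b", 8.0, 85.0),
    ]
    # 24 GB roughly corresponds to an RTX 4090
    for c in rank_for_gpu(pool, vram_gb=24.0):
        print(f"{c.name}: score {c.score}")
```

Under these assumed numbers, the 70B model is excluded (it would not fit 24 GB at 4-bit), and the 27B model ranks first among the models that fit, which mirrors the shape of the README's example output.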
Editorial analysis - technical context
Tools that combine hardware capacity checks with multi-source benchmarking address a common practitioner problem: a model that "fits" in VRAM may not provide the best latency-quality tradeoff. Industry-pattern observations show that accurate multi-benchmark aggregation requires careful handling of lineage, dataset overlap, and evaluator drift; whichllm's documented score tagging and recency discounting are methods commonly used to mitigate those issues. The README also notes that throughput is measured on "active" parameters while quality metrics use total parameters, which matters for mixture-of-experts (MoE) models, where active and total parameter counts diverge. A sketch of this aggregation pattern follows below.
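The sketch below shows one way confidence-discounted, recency-aware aggregation can work: each benchmark score carries a provenance tag and a snapshot date, provenance sets a discount, and age applies an exponential demotion. The tag names come from the README, but the specific weights, the half-life, and the weighted-mean formula are assumptions for illustration, not whichllm's documented algorithm; throughput on active parameters is a separate measurement not covered here.

```python
# Illustrative sketch only: the README describes provenance tags, confidence
# discounting, and recency demotion, but the weights below are assumed values.
from datetime import date

# Assumed discount per provenance tag; the README lists the tags, not the weights.
PROVENANCE_WEIGHT = {
    "direct": 1.00,
    "variant": 0.90,
    "base": 0.85,
    "interpolated": 0.75,
    "self-reported": 0.60,
}


def recency_factor(snapshot: date, today: date, half_life_days: float = 365.0) -> float:
    """Exponentially demote older leaderboard snapshots (assumed half-life)."""
    age_days = (today - snapshot).days
    return 0.5 ** (age_days / half_life_days)


def aggregate(scores: list[tuple[float, str, date]], today: date) -> float:
    """Confidence- and recency-weighted mean over (score, provenance, snapshot_date)."""
    weights, weighted = [], []
    for score, tag, snapshot in scores:
        w = PROVENANCE_WEIGHT[tag] * recency_factor(snapshot, today)
        weights.append(w)
        weighted.append(w * score)
    return sum(weighted) / sum(weights)


if __name__ == "__main__":
    today = date(2025, 6, 1)
    merged = aggregate(
        [
            (93.0, "direct", date(2025, 4, 1)),         # recent, directly benchmarked
            (88.0, "self-reported", date(2024, 6, 1)),  # older, self-reported claim
        ],
        today,
    )
    print(f"aggregated score: {merged:.1f}")
```

In this toy example the older, self-reported figure contributes far less than the recent, directly benchmarked one, which is the qualitative behavior the README claims for its tagging and recency rules.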
Industry context
For ML engineers experimenting with local inference, the practical value is twofold: faster iteration when choosing a model that balances latency and quality for a given device, and reduced time wasted testing large models that nominally fit but underperform. Observed patterns in similar tools suggest adoption depends on keeping benchmark feeds current and transparent about confidence and metric provenance.
What to watch
Track additions of benchmark sources or integration with continuous feeds from HuggingFace, changes to the confidence-discounting rules, and community reports of real-world throughput on diverse hardware. Also watch how the project handles forks and model variants in its lineage logic, since model forks with self-reported claims can distort aggregated scores.
Scoring Rationale
A practical developer tool that streamlines local model selection and benchmarking is useful for ML engineers and hobbyists. It is not a frontier-research release, but its evidence-aggregation and recency handling make it materially valuable for practitioners.